AlexLabNotebook/ChrIIIRebuild/9/24/05-9/30/05
9/24/05 - 9/30/05
Last week: 9/19/05-9/23/05
Next week: 10/1/05-10/7/05
Goals this week:
- Figure out 3' ends of all genes, based on 3'-end prediction website -- sent gene sequences [generated by extractgeneseqs.py] to Joel Graber.
- Figure out flanking 5' and 3' sequences required for Ty transposon function -- done.
- Figure out flanking 5' and 3' sequences required for ncRNA function
Reannotation
- There are discrepancies between the positions of the start and stop codons of chr III genes as supplied by SGD and the positions based on the UCSC genome browser and nibFrag. Had to write some code to update the start/stop positions in my MySQL instance -- reannotategenes.py
- Also had to manually update start/stop positions for three genes:
List of overlapping regions
Generated by findoverlaps.py
With 0 bp flanking sequence [ie only true overlaps]:
- Bases 1-4322, length: 4321 bases
- Bases 13282-14119, length: 837 bases
- Bases 15214-16880, length: 1666 bases
- Bases 23523-23981, length: 458 bases
- Bases 44623-46963, length: 2340 bases
- Bases 48653-52340, length: 3687 bases
- Bases 78948-82274, length: 3326 bases
- Bases 84810-90768, length: 5958 bases
- Bases 106970-107413, length: 443 bases
- Bases 108017-110666, length: 2649 bases
- Bases 114318-114936, length: 618 bases
- Bases 131982-133118, length: 1136 bases
- Bases 137740-139043, length: 1303 bases
- Bases 142698-143077, length: 379 bases
- Bases 151518-151856, length: 338 bases
- Bases 193289-200170, length: 6881 bases
- Bases 200434-205389, length: 4955 bases
- Bases 208127-209602, length: 1475 bases
- Bases 210710-211537, length: 827 bases
- Bases 211863-213764, length: 1901 bases
- Bases 228087-228783, length: 696 bases
- Bases 254364-258647, length: 4283 bases
- Bases 263969-264484, length: 515 bases
- Bases 272308-274080, length: 1772 bases
- Bases 294400-295326, length: 926 bases
- Bases 300825-302214, length: 1389 bases
Total overlap length: 55079 bases.
With 500 bp upstream of start codon and 500 bp downstream of stop codon of a gene and all other feature types bounded by their SGD start/end annotations:
- Bases 1-4322, length: 4321 bases
- Bases 9206-14849, length: 5643 bases
- Bases 15214-29436, length: 14222 bases
- Bases 30949-41224, length: 10275 bases
- Bases 41665-78418, length: 36753 bases
- Bases 78448-82774, length: 4326 bases
- Bases 83054-84625, length: 1571 bases
- Bases 84810-90768, length: 5958 bases
- Bases 90823-114936, length: 24113 bases
- Bases 116874-123500, length: 6626 bases
- Bases 127964-142664, length: 14700 bases
- Bases 142698-143077, length: 379 bases
- Bases 143128-149397, length: 6269 bases
- Bases 151102-168491, length: 17389 bases
- Bases 170378-176930, length: 6552 bases
- Bases 176992-178793, length: 1801 bases
- Bases 179012-290290, length: 111278 bases
- Bases 292384-295326, length: 2942 bases
- Bases 300325-303523, length: 3198 bases
Total overlap length: 278316 bases.
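A minimal sketch of how regions like the ones above could be computed. This is not the actual findoverlaps.py, just one plausible reading of it. Assumptions: features come in as hypothetical (start, end, is_gene) tuples with start <= end, only genes get padded by the flank [other feature types keep their annotated bounds], a region is reported only when at least two features overlap, and length is end minus start, which matches the listing [eg 4322 - 1 = 4321].

```python
def find_overlap_regions(features, flank=0):
    """Merge overlapping features into regions; keep regions with >= 2 members.

    features: iterable of (start, end, is_gene) tuples, 1-based coordinates.
    flank: bp added on both sides of genes only (assumption, see lead-in).
    """
    padded = sorted(
        (max(1, start - flank), end + flank) if is_gene else (start, end)
        for start, end, is_gene in features
    )
    regions = []
    cluster_start, cluster_end, members = None, None, 0
    for start, end in padded:
        if members and start <= cluster_end:
            # Feature overlaps the current cluster: extend it.
            cluster_end = max(cluster_end, end)
            members += 1
        else:
            # Close the previous cluster; report it only if it was a real overlap.
            if members > 1:
                regions.append((cluster_start, cluster_end))
            cluster_start, cluster_end, members = start, end, 1
    if members > 1:
        regions.append((cluster_start, cluster_end))
    return regions


def total_length(regions):
    """Sum of end - start over all regions, as in the listing."""
    return sum(end - start for start, end in regions)


# Made-up coordinates, not real chr III features.
example = [(100, 200, True), (150, 300, True), (1000, 1100, True), (1500, 1600, False)]
true_overlaps = find_overlap_regions(example)
padded_overlaps = find_overlap_regions(example, flank=500)
```

With the made-up features, padding pulls the isolated gene and the non-gene feature into one large region, which is the same effect seen in the 500 bp listing.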
Ty elements
- No functional Ty5 elements have been found in S. cerevisiae [per Lesage & Todeschini].
- No additional 5' or 3' flanking sequence outside the full element [ie 5' LTR + coding region + 3' LTR] is required for transcription of the genes in Ty2. Transcription starts 240 bp into the 5' LTR and ends 285 bp into the 3' LTR [per Farabaugh et al, '89].
- Insertion of either full Ty elements or just their LTRs can impact gene expression, both increasing and decreasing it [Lesage & Todeschini]. Will need to look at each of the Ty insertions on chr III and make an educated guess about whether and how they could affect gene expression of neighboring genes.
Figuring out [5'] flanking sequence
Via conservation
Look at conservation across evolutionary distance and use that to determine how much of the flanking sequence is "important".
Pseudo-code:
initial block = alignment block that 5' end of gene falls into
if (initial block is high-scoring)
    label_1:
    if (block extends all the way to nearest upstream gene)
        flanking sequence = all of intergenic sequence
        exit
    else
        take up to end of high-scoring block
        if (another high-scoring block nearby)
            goto label_1
        else
            exit
else
    if (high-scoring block nearby)
        goto label_1
    else
        take 500 bp upstream of start codon
        exit
Need to determine what constitutes a "high-scoring" block and what being "nearby" means.
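A rough Python version of the pseudo-code, with the two open questions turned into parameters. Everything here is an assumption for illustration: blocks are hypothetical (near, far, score) tuples giving distances in bp upstream of the start codon, sorted by near, with the first block containing the 5' end of the gene; min_score stands in for "high-scoring" and max_gap for "nearby". The goto is replaced by a loop.

```python
def flanking_length(blocks, intergenic_len, min_score, max_gap, default=500):
    """Return how many bp upstream of the start codon to keep.

    blocks: (near, far, score) tuples, sorted by near (see lead-in).
    intergenic_len: distance in bp to the nearest upstream gene.
    """
    reach = 0       # upstream edge of the last high-scoring block accepted
    found = False
    for near, far, score in blocks:
        if score < min_score:
            continue            # low-scoring blocks are skipped
        if near - reach > max_gap:
            break               # next high-scoring block is not "nearby"
        reach = max(reach, far)
        found = True
        if reach >= intergenic_len:
            return intergenic_len   # conserved all the way to the upstream gene
    # No high-scoring block at or near the 5' end: fall back to the default.
    return reach if found else default


# Made-up alignment blocks and thresholds.
blocks = [(0, 120, 0.9), (120, 200, 0.2), (230, 400, 0.8), (900, 1000, 0.95)]
strict = flanking_length(blocks, 1200, min_score=0.7, max_gap=50)
loose = flanking_length(blocks, 1200, min_score=0.7, max_gap=150)
```

Trying a few values of min_score and max_gap on genes with well-characterized promoters might be one way to pick sensible thresholds.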
Via "canonical" promoters
Identify "canonical" promoters [eg cell-cycle regulated, GCN4-regulated etc], assign one to each gene according to what's known about the gene, and then synthesize the canonical promoter in front of each gene rather than the WT sequence. Using canonical promoters would go further towards the goal of building a custom, understood chromosome than using the WT sequence, so it might be the better "engineering" choice. However, it would make meaningful comparisons to WT yeast harder, so it'd be the worse "science" choice.
- Starting point: Segal E et al, Nat Genetics '03: "Module networks: Identifying regulatory modules and their condition-specific regulators from gene expression data."
- Data is available from here.
- Their list of genes assigned to modules places 55 genes on chromosome III in a module, leaving 127 unassigned [per getgenemodules.py]. This could be because they only looked at ~2300 genes to begin with; could possibly re-run their analysis with expression data sets that include the rest of the genes on chr III to map more chr III genes to modules.
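A sketch of the kind of bookkeeping behind the assigned/unassigned counts above. This is not getgenemodules.py itself; the function name, the gene-to-module dictionary and the module numbers are all made up for illustration.

```python
def split_by_module(chr_genes, gene_to_module):
    """Split a chromosome's genes into module-assigned and unassigned.

    chr_genes: iterable of systematic gene names on the chromosome.
    gene_to_module: dict mapping gene name -> module id (hypothetical format).
    """
    assigned = {g: gene_to_module[g] for g in chr_genes if g in gene_to_module}
    unassigned = sorted(g for g in chr_genes if g not in gene_to_module)
    return assigned, unassigned


# Illustrative input only: the module ids are invented.
chr3_genes = ["YCL001W", "YCL002C", "YCR001W", "YCR002C"]
modules = {"YCL001W": 7, "YCR002C": 12, "YAL001C": 7}
assigned, unassigned = split_by_module(chr3_genes, modules)
```

Keeping the unassigned names [not just the count] would make it easy to check later whether a re-run with more expression data picks them up.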
Visualization
Need to be able to visualize both existing chromosome and new chromosome. Some possibilities:
- See whether Ben Fry has anything I could use
- Add custom tracks to the UCSC Genome Browser, as described here
- Use VectorNTI
- See whether I can [re]use the visualization stuff developed by David Gifford's group.
- BioBricks Registry
- Write my own visualization tool
- This paper lists some existing genome visualization tools, but none of them seems to have the functionality to show control elements ie promoters etc.
- This site also lists some visualization tools.
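For the UCSC custom-track option, a minimal sketch of writing features out as a BED track that could be pasted into the browser's custom-track form. Assumptions: features are hypothetical (start, end, name, strand) tuples in 1-based inclusive coordinates, and the chromosome name passed in matches whatever the chosen yeast assembly uses [this varies between assemblies, so check before uploading]. BED is 0-based, half-open, hence the start - 1.

```python
def bed_custom_track(features, chrom, track_name, description):
    """Return a BED custom track as a string.

    features: iterable of (start, end, name, strand), 1-based inclusive.
    chrom: chromosome name as used by the target assembly (assumption).
    """
    lines = ['track name="%s" description="%s"' % (track_name, description)]
    for start, end, name, strand in features:
        # BED columns: chrom, chromStart (0-based), chromEnd, name, score, strand
        fields = [chrom, str(start - 1), str(end), name, "0", strand]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


# Made-up control element, not a real annotation.
track = bed_custom_track(
    [(101, 250, "promoter_A", "+")], "chrIII", "rebuild", "Control elements"
)
```

One track per element type [promoters, terminators etc] would be one way to show control elements alongside the existing gene annotations.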