Harvard:Biophysics 101/Notebook:ZS/2007-2-6
Assignment 1, due 2/6/07
Original
Output of original code (for Homo sapiens topoisomerase (DNA) II binding protein 1, mRNA link here):
GenBank id: BC126209.1 total sequence length: 5148 method 1 A 1669 AA 427 AAA 150 AAAA 48 AAAAA 16 AAAAAA 8 AAAAAAA 4 AAAAAAAA 1 AAAAAAAAA 0 method 2 A 1669 AA 589 AAA 218 AAAA 76 AAAAA 29 AAAAAA 13 AAAAAAA 5 AAAAAAAA 1 AAAAAAAAA 0
Different GI
For modified version to display subsection of cyanobacteria PCC7942 genome (GI: 23957787) : File:2607 kaiC 1.txt
Output:
GenBank id: AY120853.2 total sequence length: 74355 method 1 A 16631 AA 3436 AAA 845 AAAA 210 AAAAA 57 AAAAAA 16 AAAAAAA 5 AAAAAAAA 0 AAAAAAAAA 0 method 2 A 16631 AA 4327 AAA 1117 AAAA 288 AAAAA 78 AAAAAA 21 AAAAAAA 5 AAAAAAAA 0 AAAAAAAAA 0
Poly-T instead of Poly-A
For code: File:Code1 forT.txt
Output:
GenBank id: BC126209.1 total sequence length: 5148 method 1 T 1387 TT 309 TTT 102 TTTT 31 TTTTT 10 TTTTTT 4 TTTTTTT 1 TTTTTTTT 0 TTTTTTTTT 0 method 2 T 1387 TT 418 TTT 144 TTTT 46 TTTTT 15 TTTTTT 5 TTTTTTT 1 TTTTTTTT 0 TTTTTTTTT 0
Translating protein sequence
The tricky part with translating the sequence is that the coding sequence is buried in the data from NCBI; there needs to be a second parsing layer to find the start codon, extract the CDS, and find the end codon at which to terminate the signal. For this, an extra parsing layer needs to be made; the code is generalizable for any sequence with 1 CDS in it.
For code:File:Code fortranslating.txt
Output (sorry for formatting, but it matches NCBI):
MSRNDKEPFFVKFLKSSDNSKCFFKALESIKEFQSEEYLQIITEEEALKIKENDRSLYICDPFSGVVFDHLKKLGCRIVGPQVVIFCMHHQRCVPRAEHPVYNMVMSDVTISCTSLEKEKREEVHKYVQMMGGRVYRDLNVSVTHLIAGEVGSKKYLVAANLKKPILLPSWIKTLWEKSQEKKITRYTDINMEDFKCPIFLGCIICVTGLCGLDRKEVQQLTVKHGGQYMGQLKMNECTHLIVQEPKGQKYECAKRWNVHCVTTQWFFDSIEKGFCQDESIYKTEPRPEAKTMPNSSTPTSQINTIDSRTLSDVSNISNINASCVSESICNSLNSKLEPTLENLENLDVSAFQAPEDLLDGCRIYLCGFSGRKLDKLRRLINSGGGVRFNQLNEDVTHVIVGDYDDELKQFWNKSAHRPHVVGAKWLLECFSKGYMLSEEPYIHANYQPVEIPVSHKPESKAALLKKKNSSFSKKDFAPSEKHEQADEDLLSQYENGSSTVVEAKTSEARPFNDSTHAEPLNDSTHISLQEENQSSVSHCVPDVSTITEEGLFSQKSFLVLGFSNENESNIANIIKENAGKIMSLLSRTVADYAVVPLLGCEVEATVGEVVTNTWLVTCIDYQTLFDPKSNPLFTPVPVMTGMTPLEDCVISFSQCAGAEKESLTFLANLLGASVQEYFVRKSNAKKGMFASTHLILKERGGSKYEAAKKWNLPAVTIAWLLETARTGKRADESHFLIENSTKEERSLETEITNGINLNSDTAEHPGTRLQTHRKTVVTPLDMNRFQSKAFRAVVSQHARQVAASPAVGQPLQKEPSLHLDTPSKFLSKDKLFKPSFDVKDALAALETPGRPSQQKRKPSTPLSEVIVKNLQLALANSSRNAVALSASPQLKEAQSEKEEAPKPLHKVVVCVSKKLSKKQSELNGIAASLGADYRWSFDETVTHFIYQGRPNDTNREYKSVKERGVHIVSEHWLLDCAQECKHLPESLYPHTYNPKMSLDISAVQDGRLCNSRLLSAVSSTKDDEPDPLILEENDVDNMATNNKESAPSNGSGKNDSKGVLTQTLEMRENFQKQLQEIMSATSIVKPQGQRTSLSRSGCNSASSTPDSTRSARSGRSRVLEALRQSRQTVPDVNTEPSQNEQIIWDDPTAREERARLASNLQWPSCPTQYSELQVDIQNLEDSPFQKPLHDSEIAKQAVCDPGNIRVTEAPKHPISEELETPIKDSHLIPTPQAPSIAFPLANPPVAPHPREKIITIEETHEELKKQYIFQLSSLNPQERIDYCHLIEKLGGLVIEKQCFDPTCTHIVVGHPLRNEKYLASVAAGKWVLHRSYLEACRTAGHFVQEEDYEWGSSSILDVLTGINVQQRRLALAAMRWRKKIQQRQESGIVEGAFSGWKVILHVDQSREAGFKRLLQSGGAKVLPGHSVPLFKEATHLFSDLNKLKPDDSGVNIAEAAAQNVYCLRTEYIADYLMQESPPHVENYCLPEAISFIQNNKELGTGLSQKRKAPTEKNKIKRPRVH