User talk:Alexander J. Ratner: Difference between revisions

From OpenWetWare
Jump to navigationJump to search
Line 23: Line 23:
[[Image:tom_screenshot2.png|600px|Screenshot 2]]
[[Image:tom_screenshot2.png|600px|Screenshot 2]]


Fil and I split up the work and did very well working as a team; I made the basic upload-put in mysql structure, then he made it look nicer, and made the regexpressions that scrape the info from the python files and into mysql; then I made the download feature.  I made the part on the results display page, that delivered up the data for the python execution, and then Fil did that part.  We have other things on our list to do that we will probably work on in the next few weeks (see "Looking Forwards" section).
Fil and I split up the work on this last bit and did very well working as a team; I made the basic upload-put in mysql structure, then he made it look nicer, and made the regexpressions that scrape the info from the python files and into mysql; then I made the download feature.  I made the part on the results display page, that delivered up the data for the python execution, then Fil implemented the last bit of executing the python script, then I went back over the aesthetics of the data display.  We have other things on our list to do that we will probably work on in the next few weeks (see "Looking Forwards" section).


===Looking Forwards===
===Looking Forwards===

Revision as of 23:38, 17 December 2009

Final Contribution Summary & Notes

My Role in the Project

I worked on the trait-o-matic infrastructure, mainly with Fil or on my own. Basically, by the time we had finished debating and discussing what the project would be, I was itching to do something that would show solid results right away; in addition, I had come into the course from the very beginning especially interested about learning more computer programming, in the context of genetics. Finally, I had always been a little bit pessimistic about the ability (of any group of people, at this stage) to make a black box which would accurately predict how exactly the myriad of genetic and physiological factors combined to make up an epistatic result. So my aim and philosophy was this: make a structure that is completely modular, expandable and open. This is what you guys (Sasha especially) often stressed, and I wanted to run under the same banner.

My Contributions

A good deal of my time outside the course was spent teaching myself more php, python, learning CodeIgniter (the website framework), and figuring out exactly how Trait-o-Matic worked. Also I had a lot of catching up to do in terms of biological knowledge. Here are some of the more concrete things that I have done:

  • Actively contributed to the class discussion leading up to deciding on a project
  • Set up a new VM instance of trait-o-matic, contributed to the documentation of said setup process after having dealt with a few glitches
  • Built a tool on T-o-M that allows you to view an SNP, by chromosomal location, across all available genomes, after the Biology group told me this would be helpful
  • Restructured the data on the T-o-M front end (slightly) so that results from all the different sources (OMIM, snpedia, etc.) are together and able to be sorted as one group (currently they are sorted by trait name)
  • Infrastructure group logistics: I added to the Infrastructure group page, posted notes and a diagram on the area of T-o-M we were dealing with, maintained a list of possible tasks and tried to help others stay involved
  • Built an upload portal for models which then shows up on the results page (see below)

The Final Product

I will ignore the other modifications to T-o-M that I made, and focus on what Fil and I have just finished working on. Basically, this is the final product of something that I worked hard towards the end to push very strongly for- namely, that our end contribution be modular, open, and easily expanded onto. This came from two things at first: one, Sasha's very encouraging insistence, one class, that we could make a valuable contribution to T-o-M by opening it up to modular and individual efforts of manual curation, and two, from my pessimism about our ability to solve the whole entire problem with one do-it-all black box, especially when there were so many publications already out there in the literature. Then Joe suggested that he might make individual models, each based off a paper in the literature; this excited me and spurred me on, and so I worked to hook his effort up with ours, and to insist that the modeling group also hooked into this end standardization.

So, in the end, we have this: a "Multi-Gene Models" tab on the T-o-M website takes you to a page where you can view and upload models of our standardized format, download their code, and eventually rate and comment (see my "Looking Forwards" section further down) them. Then, on our "Results 2.0" page, the models are automatically run on the genome being looked at; a final result is displayed, but you can also click to see exactly what inputs (by rsid) were considered, so the process is totally open.

Here are screenshots:

Screenshot 1 Screenshot 2

Fil and I split up the work on this last bit and did very well working as a team; I made the basic upload-put in mysql structure, then he made it look nicer, and made the regexpressions that scrape the info from the python files and into mysql; then I made the download feature. I made the part on the results display page, that delivered up the data for the python execution, then Fil implemented the last bit of executing the python script, then I went back over the aesthetics of the data display. We have other things on our list to do that we will probably work on in the next few weeks (see "Looking Forwards" section).

Looking Forwards

Current state of our project
Future (wished for) state

Currently, the standardized model-upload format brings our project together: models can either be made using the Modeling group's model-making program (which I will be posting for download on T-o-M as soon as they send me the final version), or by 'curating' the literature, i.e. directly transcribing from a paper into a precise python model. Then, in the standardized format, they are uploaded, and then anyone looking at their genome can see, in a completely open format, what these models imply about their specific genome.


Currently, the modeling team's model generates abstractly, based upon the seed of a few test data sets provided by the biology group; in the future, it will hopefully employ machine-learning off of a growing body of phenotypic data collected by T-o-M and the personal genome project.


As for the infrastructure's future, we have a list of things we'd like to see; some of them are far off, some of them are very near (We were told during the last class that we'd have until the 21st, and this threw us off just a little bit...). So, our goals- at times quite wishfully thinking, and both minor and major- are to:

  • Have the model repository be attractive enough so that scientists would consider spending the extra 15 minutes to upload a model based on their paper when they get published, rather than having others go back and do it independently. This could be encouraged by adding the following features, which would all be relatively simple to implement:
    • Install Restricted Python so that we can remove the password protection- and thus let anyone upload models- without risking security breaches
    • Create an administrative panel for deleting, modifying and updating the library of uploaded models, in a user-friendly fashion
    • Allow people to comment on and rate the uploaded models
  • Incorporate Chromosomal Location compatibility: Right now, the vast majority of snps on trait-o-matic have rsid locations (especially considering the overlap between the set of those that don't and the set of those that do). However, we should work on ways to making available those without rsids as well, by converting from chromosomal location to rsid. The challenge of doing the reverse is even more exciting: say a paper only gives a vague chromosomal location, like '17q12'; we can create a script to search trait-o-matic using both chromosomal location and trait name to narrow it down to, potentially, one specific snp. One thought that I had (which will probably be embarrassing and injurious to share, given how little I know about biology): when I had just finished implementing the results page display of the model output, I was looking at 'Anonymous African' and saw that he didn't seem to have any of the snps looked for by the three available models. So I looked through the papers and models again to check. I found out that the paper on prostate cancer identified one significant snp as rs4430796, which it referred to for the most part as simply 'an snp at 17q12'; and I noticed that our Anonymous African had, according to T-o-M, a Prostate Cancer related snp also in location 17:12, but with rsid rs4792311. So, my thought/question was this: is it possible that both of these, since they were in such a similar location, could possibly have a similar effect? Would it be useful to allow scientists to only specify a general region like '17:12' and a trait name like 'prostate cancer' for their models, and then let T-o-M do the rest?

Over the break Fil and I will finish up a few of these little things, and then hopefully we can find a way to continue working, or at least help out, in the future.

Biophysics 101

Below are assignment and project notes for the Fall 2009-10 Biophysics 101 class taught by Prof. George Church

Project: Update December 7 12:00 am

Sorry, haven't updated the wiki in a bit since mainly its just been me coding and nothing too interesting about that.

General Update: So, I see the project as having two main loci- (1) making a model which will take snps x_1 ... x_n and output a numerical prediction about a resultant trait, and (2) making a categorization scheme which will identify trait A as being dependent on snps x_1 ... x_m. Fil, Anna, Joe and I talked for a while a little bit before now, and this is what we want to do: we want to pursue solving problem (2)- Fil is especially working on that with the online thesaurus. However, this categorization scheme will certainly be far from perfect or universally useful or universally accepted; so, we think it is important to focus on building into trait-o-matic the ability to accept modular models (1), i.e. via manual curation.

So what we will be working on in the immediate future is to make an easy way to upload python files, based on the literature (Joe and others will begin to make these) to the server; the python files will be automatically logged into a mySQL database which will keep track of the rsid numbers of the snps that they accept (and these will override any existing categorization schema; for example, if our system on trait-o-matic groups snps a...d as composing height, while a modular model that Joe uploads uses snps a...f to determine height, then a...f will override).

I'll send out an email but if anyone is looking for something to do talk to Joe about this manual curation project

Project: Personal Update November 23 7:00 pm

Fil and I just re-implemented the sort function to group all the phenotypes from the different databases, this time alphabetically by phenotype, and this time two-steps more front end. Just a small step but should help later. Our next goal is to make a program that accesses the JSON files outputted by the core and modify them right there (rather than changing the way they are written in the first place- this would require us reprocessing all the data each time we made a change in the way we organize it).

Project: Personal Update November 23 5:00 pm

In response to Ridhi & company's emailed request for a tool to view one SNP across all subjects, last night and today I implemented this on [1]. Fil and I are now working on the /results2/ results display section, trying to reorganize data, so this function might be temporarily down over the next couple of hours.

Project: Personal Update November 23

Just before this post I wasted a little bit of time making a dumb diagram of the Trait-o-Matic front end:

T-o-M Front End VERY SIMPLE JSON retrieval only

Obviously this ignores everything except the simple JSON retrieval process. Also I include my sloppy and almost as simple notes here. The main point that I was thinking about was where to enter the process to change the sorting/clustering of data. Fil has been playing around with actually adding to the base database (in 'Caliban') and in this case we would have to modify the core files of trait-o-matic (I am working on this now... just for the OCD sake of symmetry, will probably upload a similar diagram/notes soon for the core process), and then we would have to reprocess the job each time to output new JSON files. Of course, once the reprocessing is done, due especially to the easy and adaptable built in Active Record database functions in the Code Igniter framework, modifying the front end will be easy.

However, I think that we should also consider- perhaps only temporarily- modifying the JSON data right on the front end. For example, I think my next small goal will be to take numbers, i.e. for odds ratios, out of the snpedia names (so that these numbers can then be directly fed into a numeric model). This should probably be done somewhere right on the front end, at least for now.

Project: Personal Update November 16

Not much to report. Waiting for my freelogy VM access, very excited; CodeIgniter was actually very well documented and user friendly so I'm looking forward to messing around with some of the more superficial elements of the trait-o-matic application once i get access. In the meantime I'm looking through all the source code files and taking notes (which I will eventually post, though I doubt they'll be of much use; hopefully when I'm done with them I can at least make a comprehensive schema which would be useful). It's interesting, both as a way to learn about python programming and as a motivation to brush up on the different genetics concepts that each individual function/source code file in the trait-o-matic 'core' directory deals with.

Project: Personal Update November 10

Unfortunately for my talk page, there isn't much to say about my personal progress, since most of it has just been learning Trait-o-Matic and CodeIgniter (the framework that supports Trait-o-Matic); thank god for mySQL, at least, being so user friendly... Basically right now the aim is simply to output phenotypic data grouped in dynamic sets based on a specific user request; then, after this step, we can team up with the modeling people to compress and interpret these groups of data.

I talked with Fil last night about several topics. One of his suggestions was to use the MeSH categorization system (see NIH's introductory web page) for the retrieval/grouping process.

On the broader scale, we talked about the need to focus on the actual processing structure of trait-o-matic as spread over multiple computers, for when we might want to use GWAS over a large set of genes- or at least speed up access of individual genome data. In the near future, looking up data for an individual genome will be a trivial and individual procedure; the real computational/informational challenge- that Trait-o-Matic should anticipate- will be cross-referencing these individual genomes.

To this extent we discussed the importance of modularity and simplicity of internal language in programming Trait-o-Matic. Then our conversation devolved into an increasingly broad, whimsical and tangential discussion of the implications of increased modularity and interfaceablitiy in science.

Once again I also questioned the worth of making such a simplified model as is being proposed by the modeling group to take on such a complex process as the interaction of multiple genes. We should really ask ourselves whether we are indeed going to come up with a self-learning black box that handles a wide array of multiple gene interactions, or just a toy which approximates a few select examples. We should at least consider the hybrid option of incorporating GWAS capability.

My Plea in Response: The Hybrid Project

I have been talking to a lot of people who are both in favor of a GWAS project as outlined in 'Brett's Plea', and those who advocated the initial plan of a predictive model with a machine-learning feedback loop. I strongly think that a hybrid model is best. On one end, if we ignore GWAS- which will begin to deliver immensely powerful results in the next few years as more genomes become available- then we doom the project to be obsolete relatively soon. However, doing a simple GWAS project seems both less interesting and educational- considering the issue heuristically- and also too unoriginal to truly be something that we can contribute to in a worthwhile way.

A hybrid model would incorporate a GWAS approach into the machine-learning feedback loop. This approach would:

  • Hopefully be an improvement over a simple attempt at GWAS, since we would be starting with some predictive model which would be informed of biological knowledge, and which would in turn inform our GWAS
  • Create, through the machine-learning GWAS informed 'hybrid' feedback loop, an improved set of models which would then, hopefully, apply beyond just the specific instance of a single GWAS study
  • Give us the ability to start now, in the absence of 100,000 genomes, but the capacity to grow in response to the future

More to come in support of this plan as we get further into actionable specifics

Other things I support for the project

  • I definitely agree that the matrix of SNPs should be multidimensional (including gene expression location, pathway information, etc.); and with the possibility of compression as Fil mentioned

Notes: October 29

My desired role (to be further refined as we go on): First of all, I would like to be involved in the programming aspect of the project; beyond that, I would like to be involved with making the program/model that describes gene interactions. Since this is the foundational element of the project, I assume almost everyone will want to be directly involved in its inception, in some way; so, beyond being involved with this, I would be more than happy to help build some of the framework, perhaps doing some of the more menial (but perhaps very educational, given my particular background) tasks of data scraping / transfer, i.e. taking data from SNPedia, OMIM, or just directly interfacing with Trait-o-Matic, and then parsing this data into a readable format.

Starting to think about Gene Interaction Examples and Epistasis:

  • The Wikipedia page on Quantitative Trait Loci provides some example of multifactorial diseases and traits.
  • This article cites a source [2] saying that polygenic inheritance follows a bell curve- should look into existing methods of modeling polygenic inheritance, including this one, in more depth.
  • My understanding of the subject is clearly pretty shallow, but in general it seems that much of the effort to associate epistasis with phenotypes is done statistically, by QTL mapping. So, our project looking into gene interactions would certainly be forward looking if we attempt to build into our tool away to statistically analyze large genome samples for QTL mapping.
  • I suppose that it would be best to start, if possible, with a (real or contrived) example of the simplest epistatic example possible, i.e. suppression or enhancement between two SNPs. We could then build some simple computational model for handling this base case before moving on to more complex cases.

Notes: October 27

Notes during class discussion

  • What makes me most likely to support the pharmacogenomics project idea is, actually, the exciting possibility of expanding into nutrigenomics. Prof. Church mentioned the Atkins diet (with tolerance of fats) and the vegan diet (with tolerance of Ketones).
  • Combining phenotypic information with direct genetic information, i.e. Prof. Church's example of heart condition prediction, combining genetic information with already known phenotypic factors (i.e. your gene counselor telling you about high cholesterol risk)
  • Gene interactions! Haplotypes and epistasis. Build an add-on for Trait-o-Matic. Geneinteractions for: reproduction, transplantation, single gene interaction
  • Important to look for specific, predictable, actionable examples

Assignment 6: October 13 - 20

Two new databases:

To start with, I checked out the new databases suggested (clearly my 'checking them out' means next to nothing, but within the local scope of this project...). I looked at GeneCards for the P450 alleles; I thought that much of the focus was laid upon commercial references; it was almost like a hybrid of GeneTests and OMIM or SNPedia. It seemed to have good and easily-formatted information about allele frequency and population correspondence (we should remember this for other projects/general interest use); however, thinking in the scope of the drug metabolism tool project, it didn't have any data directly shown about direct consequences of the SNPs in the P450 alleles. HGMD I did not have access to; but from the indicated general format of entries, it didn't seem like there would be much information relevant to the drug metabolism project that could not be found in SNPedia or OMIM.

Brief discussion on class project:

Ben, Alex and I briefly talked at lunch about general aims and guidelines in choosing a project. Ben talked about SNP cupid. The question here, as I see it, is whether to specifically aim to create a tool that is desirable and practical (like, say, not a eugenics dating tool), or one that would be immensely interesting and quite educational to plan and execute (like, say, a eugenics dating tool). Alex talked about Bioweather map, and about making a computer visualization tool for this project. The two questions/objections were here immediately obvious: (i) is this not already being done by some member of that project?, and (ii) is the genetic mapping of microorganisms too tenuously connected to the broad topic 'Human 2.0'? Either way, at this point, the three of us at least agreed on a desire for several main things in the project chosen: (I) an educational potential: a project that would be able to teach us about genomics; (II) a significant programming element: because this is not only an educational opportunity for some of us, but because it is a skill accessible to everyone in the class, so that we can all contribute. (III) not too broad and philosophical, but not too narrow and unambitious. Clearly we were following this last advice while working on the project of thinking about a project...

Thoughts on others' ideas:

I thought Joe's idea of genetically engineering the human-associated microbes; at very least it would be cool to talk about this as an independent point of interest. However, is this not basically the same thing as genetically engineering a drug using biological material, given that these microbes are, nonetheless, still external bodies? Perhaps this is merely a minor point of categorization preference. I guess I am saying that this seems more like Medicine 2.0 than directly Human 2.0.

Assignment 5: October 8

I talked with Fil, Zach, and Alex about turning Fil's idea for an online drug dosage tool. We talked about how this basic framework could then be expanded to look at the metabolism and 'recommended dosages' of things other than drugs; however, for now, we thought it would be best to have a simple input: a person's DNA sequence, and the name of a drug. The program would use Drug Bank to find out the metabolizing enzymes and target genes of the drug; it would then search the person's DNA for these specific genes, and then check for any SNPs. These SNPs would then be checked against SNPedia (or, for P450, perhaps against this very simplified page, [3]) to see if there is any effect on the metabolizing or reception of the drug. Any such positive/negative effects would then be logged into a simple array, from which a simple raiting would be delivered as output.

For example, consider Tylenol: According to Drug Bank, Cytochrome P450 2E1 (CYP2E1) and Cytochrome P450 1A2 (CYP1A2) (where here the format is Name (Gene Name)) are the first phase metabolizing enzymes, and Prostaglandin G/H synthase 1 (PTGS1) and Prostaglandin G/H synthase 2 (PTGS2) are the primary targets. The aforementioned simplified page, www.cypalleles.ki.se, can be referenced to see, for example, that a change in the 1132 nucleotide from G to A in the CYP2E1 allele of P450 will result in reduced metabolism. This would factor in to make a recommendation for a smaller dosage.SNPedia and OMIM would also be searched. This would be the most difficult part of the project- parsing the databases of SNPs to look for statements or indications that the particular SNP would cause increased or decreased metabolism. In addition, sometimes the charts have more complex interpretations; and some common increase/decrease effects are caused by multiple NPs.

Finally, one feature of our tool would be that it would use the GeneTest database to recommend double checks of the pertinent genes.

Assignment 4

Looking around at the various tools already created for analyzing DNA (for example Promethease, the interpreting program of SNPedia), it seemed to me that a fruitful project would be one that would anticipate the growth of the pool of genomes available to interpret.

If one of our goals is to more effectively interpret the human genome- i.e., to look at whole groups of genes in a person’s genome and predict their more macro-level effect- then we need to work in whole sets of data rather than in just single DNA sequences.

Perhaps making a program that would accept an entire family’s DNA sequences + their medical/personal histories, and then search for strong correlations using statistical analysis? Using the unique relationships and similarities of DNA from a single family might yield more useful statistical information on the importance/effect of different genes, than simply comparing one gene to a random non-family gene pool.

In addition, from a practical perspective, family members might be more willing to have their DNA read and also to give out personal information (health history for predicting diseases, intelligence and personality test scores, etc.) if they knew that it was going to be utilized ‘within the family’.

There are already ‘family history’ tools on the web: http://www.genome.gov/11510372 However they do not integrate actual genomics. Perhaps making a tool that did would be a fun (and, someday, possibly practically interesting) project

Assignment 3

  • When using the seek(a,b) function in python, i noticed problems with large values of a being off by 1. I think that when a >256 a new byte has to be allocated, and somehow there is a disjoint between the entered number and the actual action of the seek function due to the count starting from 0 again...? Anyway, for numbers n > 256 i had to use n+1. Wondering if anyone has a better function to use for seeking large values
  • As my first real programming project using python, it was fun! I liked how easy it was to add characters to strings, and how simple the file I/O was. The random functions combined with the time functions also made for easy random number generation, which was great.
  • Not much else to say, but below are my mutated gene protein sequences (original plus 5 mutations):

RSSSLFTR*EGRRERENVL*AGSSYLAEGGCYSPPAFLFLDYLVMAFAKAGVFVLMQTSIPPLL*MVCPTPRVACNLGGRYHGVDREGKKCAEGKPGGTFKNEHISSSRRKKKKNGTSENEILKECNDGSFDNLSGKTIYLLSSFGLGHSSSRRRLNVVKRKGRARRRPCGPPCSPRPEPVLPGRRRNSNSFLPLPHLLARGCFIPQFLPMHLPRTGHFVPYLRHLFPKSRWHLHTAPVHTASAQEDEFWPLTAP*CLPSHRPFSSSQKRFITSLPSRFPTPPLF*SP*PFCLLENFI*NGIRMGAVAHACTLAHACTLGGRGGRIT*G*EFQTSVANV

RSSSLFTR*EGRRERENVL*AGPSYLAEGGCYSPPAFLFLDYLDMAFAKAGVFVLMQTSIPPLL*MVCPTPRVACNLGGRYHGVDREGKKWAEGKPGGTFKNEHISSSRRKKKKNGTSENEILKECNDGSFYNLSGKTIYLLSSFGLGHSSSRRRLNVVRRKGRARRRPCGPPCSPRPEPVLPGKRRNSNSFLPLPHLLARGCFIPQFLPMHLPRTGHFVPYIRHLFPKSRWHLHTAPVHTASAQEDEFWPLTAP*CRPSHRPFSSSQKRFITSLPSRFPTPPLF*SP*PFCLSENFI*NGISMGAVAHACTLAHACTLGGRGGRIT*G*EFQTSVANV

RTSSLFTR*EGRRERENVL*AGPSYLAEGGCYSPPAFLFLDYLDMAFAKAGVFVLMQTSIPPLL*MVCPTPRVACNLGARYHGVDREGKKWAEGKPGGTFKNEHISSSRRKKKKNGTSENEILKECNDGSFYNLSGKTIYLLSSFGLGHSSSRRRLNVVRRKGRARRRPCGPPCSPRPEPVLPGKRRNSNSFLPPPHLLARGCFIPQFLPMHLPRTGHFVPYIRHLFPKSRWHLHTTPVHTASAQEDEFWPLTAP*CRPSHRPFSSSQKRFITSLPRRFPTPPLF*SP*PFCLSENFI*NGISMGAVAQACTLAHACTLGGRGGRIT*G*EFQTSVANV

RTSSLFTR*EGRRERENVL*AGPSYLAEGGCYSPRAFLFLDYLDMAFAKAGVFVLMQTSIPPLL*MVCPTPRVACNLGARYHGVDREGK*WAEGKPGGTFKNEHISSSRRKKKKNGTSENEVLKECNDGSFYNLSGKTIYLLSSFGLGHSSSRRRLNVVRRKGRARRRPCGPPCGPRPEPVLPGKRRNSNSFLPPPHLLARGCFIPQFLPMHLPRTGHFLPYIRHLFPKSRWHLHTTPVHTASAQEDEFWPPTAP*CRPSHRPFSSSQQRFITSLPRRFPTPPLF*SP*PFCLSENFI*NGISMAAVAQACTLAHACTLGGRGGRIT*G*EFQTSVANV

RTSSLFTR*ERRRERENVL*AGPSYLAEGGCYSPRAFLFLDY*DMAFAKAGVFVLMQTSIPPLL*MVCPSPRVACNLGARYHGVDREGK*WAEGKPGGTFKNEHISSSRRKKKKNGTSENEVLKECNDGSFYNLSGKTIYLLSSFGLGHSSSRRRLNVVRRKGRARRRPCGPPSGPRPEPVLPGKRRNSNSFLPPPHLLARGCFIPQFIPMHLPRTGHFLPYIRHLFPKSRWHLHTTPVHTASAQEDEFWPPTAP*CQPSHRPFSSSQQRFITSLPRRFPTPPLFSSP*PFCLSENFI*NGSSMAAVAQACTLAHACTLGGRGGRIT*G*EFQTSVANV

RTSSLFTR*ERRRERENVL*AGPSYLAEGGSYSPRAFLFLDY*DMALAKAGVFVLMQTSIPPLL*MVCPSPRVACNLGARYHGVDREGK*WAEGKHGGTFKNEHISSSRRKKKKNGTSENEVLKECNDGSSYNLSGKTIYVLSSFGLGHSSSRRRLNVVRRKGRARRRPCGPPSGTRPEPVLPGKRRNSNSFLPPPHLLARGCFIPQFIPMHLPRTGHFLPYTRHLFPKSRWHLHTTPVHTASAQEDEFWPPTAPRCQPSHRPFSSSQQRFITSLPRRFPTPPLFSSP*PFCLSENFI*NGSSMAAVARACTLAHACTLGGRGGRIT*G*EFQTSVANV

There are two added premature terminations (occurring in the 3rd and 4th mutations) and two removed terminations (4th and 5th)

My Code

I made five python programs for this assignment, one for formatting and then one for each of the four problems:

gene_format.py is to strip a txt file of everything but nucleotide characters (i.e. of spaces and indents):

       import os
       //
       // File input, read file as a string
       file = raw_input( "Enter file to format (strip of everything but a,t,g,c): ")
       gene_in = open(file, "r+")
       str = gene_in.read()
       //
       // Set some variables- counter and output string
       n=0
       string2 = ""
       //
       // Add elements of the old string to the new string only if they are nucleotide letters
       while n in range(len(str)):
               if (str[n]=='a' or str[n]=='A'):
                       string2 += 'a'
               if (str[n]=='c' or str[n]=='C'):
                       string2 += 'c'
               if (str[n]=='g' or str[n]=='G'):
                       string2 += 'g'
               if (str[n]=='t' or str[n]=='T'):
                       string2 += 't'
               n +=1
       //
       // Close file, delete, remake, write in new string, end comments
       gene_in.close()
       os.remove(file)
       gene_out = open(file, "w+")
       gene_out.write(string2)
       gene_out.close()
       length = len(string2)
       print "File formatted, number of nucleotides: ", length


gene1.py calculates the GC content; since I was using my formatting program, I could assume that the inputted string would be formatted, and thus this program was fairly simple

       // Get file of gene and read as input string
       file = raw_input( "Enter the filepath/filename of a txt format nucleotide sequence: " )
       fo = open(file, "r+")
       str = fo.read()
       //
       // Set counter variable
       count=0
       //
       // Search through length of string
       for n in range(len(str)):
               if (str[n]=='g' or str[n] == 'c'):
                       count+=1
       //
       // Calculate and report GC content
       total=len(str)
       gccontent=((float(count)/float(total)) * 100)
       print "\nThe GC Content is ", gccontent, "%\n"
       fo.close()

gene2.py creates the reverse complement of a nucleotide sequence and then outputs it as “file_name_rc.txt”; this was actually made without the formatting program in mind (I made the formatting program after this one) and so it keeps a running total of actual nucleotide characters as an error check; I just left it in because it doesn’t hurt to have

       // Open txt file, convert to string
       file = raw_input( "Enter the filepath/filename of the .txt format sequence: " )
       gene_in = open(file, "r+")
       str = gene_in.read()
       //
       // Set some variables
       length=len(str)
       string2 = ""
       n=1
       //
       // Variable for keeping track of how many actual nucleotides in file
       // to be used for error checking
       total=length
       //
       //create new string of complements, in reverse order
       while n in range(length+1):
               if str[-n]=='a':
                       string2 = string2 + 't'
               elif str[-n]=='t':
                       string2 = string2 + 'a'
               elif str[-n]=='g':
                       string2 = string2 + 'c'
               elif str[-n]=='c':
                       string2 = string2 + 'g'
               elif str[-n]=='A':
                       string2 = string2 + 'T'
               elif str[-n]=='T':
                       string2 = string2 + 'A'
               elif str[-n]=='G':
                       string2 = string2 + 'C'
               elif str[-n]=='C':
                       string2 = string2 + 'G'
               else:
                       total -= 1
               n=n+1
       //
       gene_in.close()
       //
       // make a modified filename for the output file
       l=(len(file)-4)
       file_new = file[0:l] + "_rc.txt"
       //
       // write the output file
       gene_out = open(file_new, "w+")
       gene_out.write(string2);
       gene_out.close()
       print "Modified sequence of ", total, " nucleotides saved in ", file_new

gene3.py parses a standard codon text file to look up protein sequences (this was good file I/O practice for me!) and then outputs as “file_name_proteins.txt”

       // Open txt file, convert to string
       file = raw_input( "Enter the filepath/filename of the .txt format sequence: " )
       gene_in = open(file, "r+")
       str = gene_in.read()
       //
       // Open codon index
       codon = open("std_codon.txt", "r+")
       //
       // Create new file name
       l = len(file)
       file_new = file[0:(l-4)] + "_proteins.txt"
       //
       // Create ouput file
       protein_out = open(file_new, "w+")
       //
       // Set some variables
       length = len(str)
       string2 = ""
       //
       // Loop through the three possible frames (+1, +2, +3)
       c=0
       while (c<3):
               string2 = string2 + "\n\n"
               n=c
       //
       // Search through the gene string
       while n in range(length-3-c):
       //
       // Go to correct paragraph of codon (paragraphs sorted by first letter)
               if str[n]=='t':
                       position = codon.seek(0, 0)
               if str[n]=='c':
                       position = codon.seek(193, 0)
               if str[n]=='a':
                       position = codon.seek(387, 0)
               if str[n]=='g':
                       position = codon.seek(580, 0)
               position = codon.seek(1,1)
       //
       // Search through paragraph
               k=0
               while(k==0):
                       test = codon.read(3)
                       if(test==str[n:(n+3)]):
                               position = codon.seek(4,1)
                               protein = codon.read(1)
                               string2 = string2 + protein
                               k=1
         else:        
                               position = codon.seek(9,1)
               n+=3
       c+=1
       //
       protein_out.write(string2)
       //
       // Close and end comment
       // gene_in.close()
       codon.close()
       protein_out.close()
       print "Protein sequence saved as ", file_new


gene4.py picks a random nucleotide in each 100 character segment of the sequence and switches it to a random other nucleotide, then outputs to a file “file_name_m.txt”. It uses the CPU clock time to set the random number seed

       import random
       import time
       //
       // Open txt file, convert to string
       file = raw_input( "Enter the filepath/filename of the .txt format sequence: " )
       gene_in = open(file, "r+")
       str = gene_in.read()
       //
       // Set random number seed based on CPU time
       random.seed(int(time.time()))
       //
       // Set some variables
       string2 = ""
       //
       // Change random nucleotides
       n=0;
       while(n<(len(str)-100)):
               i=int(random.uniform(0+n,100+n))
               if(str[i]=='a'):
                       j=random.choice(['t','g','c'])
               if(str[i]=='g'):
                       j=random.choice(['t','a','c'])
               if(str[i]=='t'):
                       j=random.choice(['a','g','c'])
               if(str[i]=='c'):
                       j=random.choice(['t','g','a'])
               string2=string2 + str[n:i]+j+str[(i+1):(n+100)]
               print i, str[i], " changed to ", j
               n+=100
       //
       // Include the 'left over' part of the string
       string2=string2+str[n:len(str)]
       //
       // Create new file name
       l = len(file)
       file_new = file[0:(l-4)] + "_m.txt"
       //
       // Create ouput file
       mutated = open(file_new, "w+")
       mutated.write(string2)
       print "Mutated sequence of ", len(string2), " nucleotides saved as ", file_new
       gene_in.close()
       mutated.close()


Assignment 2

Alexander J. Ratner 11:34, 14 September 2009 (EDT)

The assignment documented below was to graph an exponential and several logistic curves using a plotting plugin for Python

Fig. 1: The Exponential Plot


  • This was my first plot using Python, as well as my first use of Python in general; good practice!
  • The first logistic curve, [math]\displaystyle{ \frac{dx}{dt}=k*x }[/math], is of course simply the exponential curve (Fig. 1)





  • To plot the 2nd logistic curve was [math]\displaystyle{ \frac{dx}{dt}=k*x*(1-x) }[/math] which gives the formula [math]\displaystyle{ x(t)=\frac{e^{kt}}{1+e^{kt}} }[/math], which is what I then plotted; shown below is this logistic curve with increasing values of k (0.9, 1.5, 3.67):

Fig. 2: The logistic curve, k=0.9 Fig. 3: The logistic curve, k=1.5 Fig. 4: The logistic curve, k=3.67


  • On the excel graph, one can see that the long term behavior is highly chaotic- notice the difference between k=3.67 and k=3.7