User talk:Alexander J. Ratner: Difference between revisions

From OpenWetWare

Revision as of 23:12, 22 November 2009

Biophysics 101

Below are assignment and project notes for the Fall 2009-10 Biophysics 101 class taught by Prof. George Church

Project: Personal Update November 23

Just before this post I wasted a little bit of time making a dumb diagram of the Trait-o-Matic front end:

T-o-M Front End VERY SIMPLE JSON retrieval only

Obviously this ignores everything except the simple JSON retrieval process. Also I include my sloppy and almost as simple notes here.

Project: Personal Update November 16

Not much to report. Waiting for my freelogy VM access, very excited; CodeIgniter was actually very well documented and user friendly, so I'm looking forward to messing around with some of the more superficial elements of the Trait-o-Matic application once I get access. In the meantime I'm looking through all the source code files and taking notes (which I will eventually post, though I doubt they'll be of much use; hopefully when I'm done with them I can at least make a comprehensive schema, which would be useful). It's interesting, both as a way to learn about Python programming and as a motivation to brush up on the different genetics concepts that each individual function/source code file in the Trait-o-Matic 'core' directory deals with.

Project: Personal Update November 10

Unfortunately for my talk page, there isn't much to say about my personal progress, since most of it has just been learning Trait-o-Matic and CodeIgniter (the framework that supports Trait-o-Matic); thank god for MySQL, at least, being so user friendly... Basically, right now the aim is simply to output phenotypic data grouped in dynamic sets based on a specific user request; then, after this step, we can team up with the modeling people to compress and interpret these groups of data.

I talked with Fil last night about several topics. One of his suggestions was to use the MeSH categorization system (see NIH's introductory web page) for the retrieval/grouping process.
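As a rough illustration of the retrieval/grouping step, here is a minimal sketch that buckets phenotype records by category and returns only the categories a user asks for. The records, field names, and category labels are all made up for illustration; they are not Trait-o-Matic's actual schema or real MeSH headings.

```python
from collections import defaultdict

# Hypothetical phenotype records; a real version would be read from the
# Trait-o-Matic database, with categories assigned from something like MeSH.
records = [
    {"variant": "variant_1", "phenotype": "phenotype_a", "category": "category_x"},
    {"variant": "variant_2", "phenotype": "phenotype_b", "category": "category_y"},
    {"variant": "variant_3", "phenotype": "phenotype_c", "category": "category_x"},
]

def group_by_category(records, requested):
    """Group phenotype names by category, keeping only the requested categories."""
    groups = defaultdict(list)
    for rec in records:
        if rec["category"] in requested:
            groups[rec["category"]].append(rec["phenotype"])
    return dict(groups)

print(group_by_category(records, {"category_x"}))
```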

On the broader scale, we talked about the need to focus on the actual processing structure of Trait-o-Matic as spread over multiple computers, for when we might want to use GWAS over a large set of genes, or at least speed up access to individual genome data. In the near future, looking up data for an individual genome will be a trivial and individual procedure; the real computational/informational challenge, which Trait-o-Matic should anticipate, will be cross-referencing these individual genomes.

To this end we discussed the importance of modularity and simplicity of internal language in programming Trait-o-Matic. Then our conversation devolved into an increasingly broad, whimsical and tangential discussion of the implications of increased modularity and interfaceability in science.

Once again I also questioned the worth of making such a simplified model as is being proposed by the modeling group to take on such a complex process as the interaction of multiple genes. We should really ask ourselves whether we are indeed going to come up with a self-learning black box that handles a wide array of multiple gene interactions, or just a toy which approximates a few select examples. We should at least consider the hybrid option of incorporating GWAS capability.

My Plea in Response: The Hybrid Project

I have been talking to a lot of people, both those in favor of a GWAS project as outlined in 'Brett's Plea' and those who advocated the initial plan of a predictive model with a machine-learning feedback loop. I strongly think that a hybrid model is best. On one end, if we ignore GWAS, which will begin to deliver immensely powerful results in the next few years as more genomes become available, then we doom the project to be obsolete relatively soon. However, doing a simple GWAS project seems both less interesting and less educational (considering the issue heuristically) and also too unoriginal to truly be something that we can contribute to in a worthwhile way.

A hybrid model would incorporate a GWAS approach into the machine-learning feedback loop. This approach would:

  • Hopefully be an improvement over a simple attempt at GWAS, since we would be starting with some predictive model which would be informed by biological knowledge, and which would in turn inform our GWAS
  • Create, through the machine-learning, GWAS-informed 'hybrid' feedback loop, an improved set of models which would then, hopefully, apply beyond just the specific instance of a single GWAS study
  • Give us the ability to start now, in the absence of 100,000 genomes, and the capacity to grow in response to the future

More to come in support of this plan as we get further into actionable specifics

Other things I support for the project

  • I definitely agree that the matrix of SNPs should be multidimensional (including gene expression location, pathway information, etc.); and with the possibility of compression as Fil mentioned

Notes: October 29

My desired role (to be further refined as we go on): First of all, I would like to be involved in the programming aspect of the project; beyond that, I would like to be involved with making the program/model that describes gene interactions. Since this is the foundational element of the project, I assume almost everyone will want to be directly involved in its inception, in some way; so, beyond being involved with this, I would be more than happy to help build some of the framework, perhaps doing some of the more menial (but perhaps very educational, given my particular background) tasks of data scraping / transfer, i.e. taking data from SNPedia, OMIM, or just directly interfacing with Trait-o-Matic, and then parsing this data into a readable format.

Starting to think about Gene Interaction Examples and Epistasis:

  • The Wikipedia page on Quantitative Trait Loci provides some examples of multifactorial diseases and traits.
  • This article cites a source [1] saying that polygenic inheritance follows a bell curve- should look into existing methods of modeling polygenic inheritance, including this one, in more depth.
  • My understanding of the subject is clearly pretty shallow, but in general it seems that much of the effort to associate epistasis with phenotypes is done statistically, by QTL mapping. So, our project looking into gene interactions would certainly be forward looking if we attempt to build into our tool a way to statistically analyze large genome samples for QTL mapping.
  • I suppose that it would be best to start, if possible, with a (real or contrived) example of the simplest epistatic example possible, i.e. suppression or enhancement between two SNPs. We could then build some simple computational model for handling this base case before moving on to more complex cases.
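For the two-SNP base case in the last bullet, one possible starting point is an additive model with a single pairwise interaction term. This is only a sketch under assumed conventions: each SNP is coded 0 (variant absent) or 1 (variant present), and the effect sizes are arbitrary illustrative numbers, not measured values.

```python
def phenotype(snp1, snp2, effect1, effect2, interaction):
    """Additive effects of two SNPs plus one pairwise interaction term.

    snp1 and snp2 are 0 or 1 (variant absent or present). A negative
    interaction models suppression, a positive one models enhancement.
    """
    return effect1 * snp1 + effect2 * snp2 + interaction * snp1 * snp2

# Illustrative cases: SNP 1 has an effect of 1.0 on its own, SNP 2 has none.
cases = {"no interaction": 0.0, "suppression": -1.0, "enhancement": 1.0}
for name, interaction in cases.items():
    row = [phenotype(a, b, 1.0, 0.0, interaction) for a in (0, 1) for b in (0, 1)]
    print(name, row)  # order of genotypes: (0,0), (0,1), (1,0), (1,1)
```

With suppression the second SNP cancels the first SNP's effect only when both are present, which is the kind of pattern a purely additive model cannot represent.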

Notes: October 27

Notes during class discussion

  • What makes me most likely to support the pharmacogenomics project idea is, actually, the exciting possibility of expanding into nutrigenomics. Prof. Church mentioned the Atkins diet (with tolerance of fats) and the vegan diet (with tolerance of ketones).
  • Combining phenotypic information with direct genetic information, e.g. Prof. Church's example of heart condition prediction, combining genetic information with already known phenotypic factors (e.g. your gene counselor telling you about high cholesterol risk)
  • Gene interactions! Haplotypes and epistasis. Build an add-on for Trait-o-Matic. Gene interactions for: reproduction, transplantation, single gene interaction
  • Important to look for specific, predictable, actionable examples

Assignment 6: October 13 - 20

Two new databases:

To start with, I checked out the new databases suggested (clearly my 'checking them out' means next to nothing, but within the local scope of this project...). I looked at GeneCards for the P450 alleles; I thought that much of the focus was laid upon commercial references; it was almost like a hybrid of GeneTests and OMIM or SNPedia. It seemed to have good and easily-formatted information about allele frequency and population correspondence (we should remember this for other projects/general interest use); however, thinking in the scope of the drug metabolism tool project, it didn't have any data directly shown about direct consequences of the SNPs in the P450 alleles. HGMD I did not have access to; but from the indicated general format of entries, it didn't seem like there would be much information relevant to the drug metabolism project that could not be found in SNPedia or OMIM.

Brief discussion on class project:

Ben, Alex and I briefly talked at lunch about general aims and guidelines in choosing a project. Ben talked about SNP cupid. The question here, as I see it, is whether to specifically aim to create a tool that is desirable and practical (like, say, not a eugenics dating tool), or one that would be immensely interesting and quite educational to plan and execute (like, say, a eugenics dating tool). Alex talked about Bioweather map, and about making a computer visualization tool for this project. The two questions/objections were here immediately obvious: (i) is this not already being done by some member of that project?, and (ii) is the genetic mapping of microorganisms too tenuously connected to the broad topic 'Human 2.0'? Either way, at this point, the three of us at least agreed on a desire for several main things in the project chosen: (I) an educational potential: a project that would be able to teach us about genomics; (II) a significant programming element: because this is not only an educational opportunity for some of us, but because it is a skill accessible to everyone in the class, so that we can all contribute. (III) not too broad and philosophical, but not too narrow and unambitious. Clearly we were following this last advice while working on the project of thinking about a project...

Thoughts on others' ideas:

I thought Joe's idea of genetically engineering the human-associated microbes was interesting; at the very least it would be cool to talk about this as an independent point of interest. However, is this not basically the same thing as genetically engineering a drug using biological material, given that these microbes are, nonetheless, still external bodies? Perhaps this is merely a minor point of categorization preference. I guess I am saying that this seems more like Medicine 2.0 than directly Human 2.0.

Assignment 5: October 8

I talked with Fil, Zach, and Alex about turning Fil's idea for an online drug dosage tool into a project. We talked about how this basic framework could then be expanded to look at the metabolism and 'recommended dosages' of things other than drugs; however, for now, we thought it would be best to have a simple input: a person's DNA sequence, and the name of a drug. The program would use Drug Bank to find out the metabolizing enzymes and target genes of the drug; it would then search the person's DNA for these specific genes, and then check for any SNPs. These SNPs would then be checked against SNPedia (or, for P450, perhaps against this very simplified page, [2]) to see if there is any effect on the metabolizing or reception of the drug. Any such positive/negative effects would then be logged into a simple array, from which a simple rating would be delivered as output.

For example, consider Tylenol: According to Drug Bank, Cytochrome P450 2E1 (CYP2E1) and Cytochrome P450 1A2 (CYP1A2) (where here the format is Name (Gene Name)) are the first phase metabolizing enzymes, and Prostaglandin G/H synthase 1 (PTGS1) and Prostaglandin G/H synthase 2 (PTGS2) are the primary targets. The aforementioned simplified page, www.cypalleles.ki.se, can be referenced to see, for example, that a change in the 1132 nucleotide from G to A in the CYP2E1 allele of P450 will result in reduced metabolism. This would factor in to make a recommendation for a smaller dosage. SNPedia and OMIM would also be searched. This would be the most difficult part of the project: parsing the databases of SNPs to look for statements or indications that the particular SNP would cause increased or decreased metabolism. In addition, sometimes the charts have more complex interpretations, and some common increase/decrease effects are caused by multiple SNPs.

Finally, one feature of our tool would be that it would use the GeneTest database to recommend double checks of the pertinent genes.
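The lookup-and-rate pipeline described above could be sketched roughly as follows. Everything here is an in-memory stand-in: the two dictionaries replace the Drug Bank and SNPedia lookups, the variant tuple format is hypothetical, and the -1/+1 scoring scale is an assumption. Only the Tylenol genes and the CYP2E1 G-to-A example come from the notes above.

```python
# Stand-in for the Drug Bank lookup: genes relevant to each drug.
DRUG_GENES = {
    "tylenol": {
        "metabolizing": ["CYP2E1", "CYP1A2"],
        "targets": ["PTGS1", "PTGS2"],
    },
}

# Stand-in for the SNP database lookup, keyed by the hypothetical tuple
# (gene, position, reference base, variant base).
# Assumed scale: -1 = reduced metabolism, +1 = increased metabolism.
SNP_EFFECTS = {
    ("CYP2E1", 1132, "G", "A"): -1,
}

def rate_dosage(drug, person_variants):
    """Return (rating, effects) for a drug given a person's list of variants."""
    info = DRUG_GENES[drug.lower()]
    relevant = set(info["metabolizing"]) | set(info["targets"])
    effects = []
    for variant in person_variants:
        gene = variant[0]
        if gene in relevant and variant in SNP_EFFECTS:
            effects.append(SNP_EFFECTS[variant])
    return sum(effects), effects

rating, effects = rate_dosage("Tylenol", [("CYP2E1", 1132, "G", "A")])
print(rating, effects)
```

Under this assumed scale, a negative rating would point toward recommending a smaller dosage; how the rating maps to an actual recommendation is left open.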

Assignment 4

Looking around at the various tools already created for analyzing DNA (for example Promethease, the interpreting program of SNPedia), it seemed to me that a fruitful project would be one that would anticipate the growth of the pool of genomes available to interpret.

If one of our goals is to more effectively interpret the human genome- i.e., to look at whole groups of genes in a person’s genome and predict their more macro-level effect- then we need to work in whole sets of data rather than in just single DNA sequences.

Perhaps making a program that would accept an entire family’s DNA sequences + their medical/personal histories, and then search for strong correlations using statistical analysis? Using the unique relationships and similarities of DNA from a single family might yield more useful statistical information on the importance/effect of different genes, than simply comparing one gene to a random non-family gene pool.
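As a very small sketch of what "searching for correlations" could mean here, the snippet below computes a Pearson correlation between the number of copies of one variant carried by each family member and a trait score. The family data are invented purely for illustration, and a real tool would need far more careful statistics than a single correlation coefficient.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical family of five: copies of a variant (0, 1 or 2) and a made-up trait score
allele_counts = [0, 1, 1, 2, 2]
trait_scores = [1.0, 2.0, 2.0, 3.0, 3.0]
print(pearson(allele_counts, trait_scores))
```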

In addition, from a practical perspective, family members might be more willing to have their DNA read and also to give out personal information (health history for predicting diseases, intelligence and personality test scores, etc.) if they knew that it was going to be utilized ‘within the family’.

There are already ‘family history’ tools on the web (http://www.genome.gov/11510372). However, they do not integrate actual genomics. Perhaps making a tool that did would be a fun (and, someday, possibly practically interesting) project.

Assignment 3

  • When using the seek(a,b) function in Python, I noticed problems with large values of a being off by 1. I think that when a > 256 a new byte has to be allocated, and somehow there is a disconnect between the entered number and the actual action of the seek function due to the count starting from 0 again...? Anyway, for numbers n > 256 I had to use n+1. Wondering if anyone has a better function to use for seeking large values.
  • As my first real programming project using Python, it was fun! I liked how easy it was to add characters to strings, and how simple the file I/O was. The random functions combined with the time functions also made for easy random number generation, which was great.
  • Not much else to say, but below are my mutated gene protein sequences (original plus 5 mutations):

RSSSLFTR*EGRRERENVL*AGSSYLAEGGCYSPPAFLFLDYLVMAFAKAGVFVLMQTSIPPLL*MVCPTPRVACNLGGRYHGVDREGKKCAEGKPGGTFKNEHISSSRRKKKKNGTSENEILKECNDGSFDNLSGKTIYLLSSFGLGHSSSRRRLNVVKRKGRARRRPCGPPCSPRPEPVLPGRRRNSNSFLPLPHLLARGCFIPQFLPMHLPRTGHFVPYLRHLFPKSRWHLHTAPVHTASAQEDEFWPLTAP*CLPSHRPFSSSQKRFITSLPSRFPTPPLF*SP*PFCLLENFI*NGIRMGAVAHACTLAHACTLGGRGGRIT*G*EFQTSVANV

RSSSLFTR*EGRRERENVL*AGPSYLAEGGCYSPPAFLFLDYLDMAFAKAGVFVLMQTSIPPLL*MVCPTPRVACNLGGRYHGVDREGKKWAEGKPGGTFKNEHISSSRRKKKKNGTSENEILKECNDGSFYNLSGKTIYLLSSFGLGHSSSRRRLNVVRRKGRARRRPCGPPCSPRPEPVLPGKRRNSNSFLPLPHLLARGCFIPQFLPMHLPRTGHFVPYIRHLFPKSRWHLHTAPVHTASAQEDEFWPLTAP*CRPSHRPFSSSQKRFITSLPSRFPTPPLF*SP*PFCLSENFI*NGISMGAVAHACTLAHACTLGGRGGRIT*G*EFQTSVANV

RTSSLFTR*EGRRERENVL*AGPSYLAEGGCYSPPAFLFLDYLDMAFAKAGVFVLMQTSIPPLL*MVCPTPRVACNLGARYHGVDREGKKWAEGKPGGTFKNEHISSSRRKKKKNGTSENEILKECNDGSFYNLSGKTIYLLSSFGLGHSSSRRRLNVVRRKGRARRRPCGPPCSPRPEPVLPGKRRNSNSFLPPPHLLARGCFIPQFLPMHLPRTGHFVPYIRHLFPKSRWHLHTTPVHTASAQEDEFWPLTAP*CRPSHRPFSSSQKRFITSLPRRFPTPPLF*SP*PFCLSENFI*NGISMGAVAQACTLAHACTLGGRGGRIT*G*EFQTSVANV

RTSSLFTR*EGRRERENVL*AGPSYLAEGGCYSPRAFLFLDYLDMAFAKAGVFVLMQTSIPPLL*MVCPTPRVACNLGARYHGVDREGK*WAEGKPGGTFKNEHISSSRRKKKKNGTSENEVLKECNDGSFYNLSGKTIYLLSSFGLGHSSSRRRLNVVRRKGRARRRPCGPPCGPRPEPVLPGKRRNSNSFLPPPHLLARGCFIPQFLPMHLPRTGHFLPYIRHLFPKSRWHLHTTPVHTASAQEDEFWPPTAP*CRPSHRPFSSSQQRFITSLPRRFPTPPLF*SP*PFCLSENFI*NGISMAAVAQACTLAHACTLGGRGGRIT*G*EFQTSVANV

RTSSLFTR*ERRRERENVL*AGPSYLAEGGCYSPRAFLFLDY*DMAFAKAGVFVLMQTSIPPLL*MVCPSPRVACNLGARYHGVDREGK*WAEGKPGGTFKNEHISSSRRKKKKNGTSENEVLKECNDGSFYNLSGKTIYLLSSFGLGHSSSRRRLNVVRRKGRARRRPCGPPSGPRPEPVLPGKRRNSNSFLPPPHLLARGCFIPQFIPMHLPRTGHFLPYIRHLFPKSRWHLHTTPVHTASAQEDEFWPPTAP*CQPSHRPFSSSQQRFITSLPRRFPTPPLFSSP*PFCLSENFI*NGSSMAAVAQACTLAHACTLGGRGGRIT*G*EFQTSVANV

RTSSLFTR*ERRRERENVL*AGPSYLAEGGSYSPRAFLFLDY*DMALAKAGVFVLMQTSIPPLL*MVCPSPRVACNLGARYHGVDREGK*WAEGKHGGTFKNEHISSSRRKKKKNGTSENEVLKECNDGSSYNLSGKTIYVLSSFGLGHSSSRRRLNVVRRKGRARRRPCGPPSGTRPEPVLPGKRRNSNSFLPPPHLLARGCFIPQFIPMHLPRTGHFLPYTRHLFPKSRWHLHTTPVHTASAQEDEFWPPTAPRCQPSHRPFSSSQQRFITSLPRRFPTPPLFSSP*PFCLSENFI*NGSSMAAVARACTLAHACTLGGRGGRIT*G*EFQTSVANV

There are two added premature terminations (occurring in the 3rd and 4th mutations) and two removed terminations (4th and 5th)
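A comparison like this can also be done programmatically rather than by eye. The sketch below assumes the two protein strings have equal length (true for substitution-only mutations) and uses short toy strings rather than the sequences above; the function name is made up for illustration.

```python
def stop_changes(original, mutated):
    """Return (added, removed): positions where a stop '*' appears in only one sequence.

    Assumes both sequences have the same length, i.e. substitutions only.
    """
    added, removed = [], []
    for pos, (a, b) in enumerate(zip(original, mutated)):
        if b == "*" and a != "*":
            added.append(pos)      # new premature termination
        elif a == "*" and b != "*":
            removed.append(pos)    # termination lost
    return added, removed

# Toy example, not real data
print(stop_changes("MK*LQ*", "M**LQR"))
```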

My Code

I made five Python programs for this assignment, one for formatting and then one for each of the four problems:

gene_format.py is to strip a txt file of everything but nucleotide characters (i.e. of spaces and indents):

       # File input, read file as a string
       filename = input("Enter file to format (strip of everything but a,t,g,c): ")
       with open(filename, "r") as gene_in:
               text = gene_in.read()
       #
       # Add elements of the old string to the new string only if they are
       # nucleotide letters, converting them to lowercase
       string2 = ""
       for ch in text:
               if ch.lower() in "acgt":
                       string2 += ch.lower()
       #
       # Overwrite the file with the new string and report the count
       with open(filename, "w") as gene_out:
               gene_out.write(string2)
       print("File formatted, number of nucleotides:", len(string2))


gene1.py calculates the GC content; since I was using my formatting program, I could assume that the inputted string would be formatted, and thus this program was fairly simple

       # Get file of gene and read as input string
       filename = input("Enter the filepath/filename of a txt format nucleotide sequence: ")
       with open(filename, "r") as fo:
               seq = fo.read()
       #
       # Count the g and c characters (input is assumed to be formatted by gene_format.py)
       count = 0
       for ch in seq:
               if ch == 'g' or ch == 'c':
                       count += 1
       #
       # Calculate and report GC content
       total = len(seq)
       gccontent = (count / total) * 100
       print("\nThe GC Content is", gccontent, "%\n")

gene2.py creates the reverse complement of a nucleotide sequence and then outputs it as “file_name_rc.txt”; this was actually made without the formatting program in mind (I made the formatting program after this one) and so it keeps a running total of actual nucleotide characters as an error check; I just left it in because it doesn’t hurt to have

       # Open txt file, convert to string
       filename = input("Enter the filepath/filename of the .txt format sequence: ")
       with open(filename, "r") as gene_in:
               seq = gene_in.read()
       #
       # Complement of each nucleotide, preserving case
       complement = {'a': 't', 't': 'a', 'g': 'c', 'c': 'g',
                     'A': 'T', 'T': 'A', 'G': 'C', 'C': 'G'}
       #
       # Variable for keeping track of how many actual nucleotides are in the file,
       # to be used for error checking
       total = len(seq)
       #
       # Create new string of complements, in reverse order
       string2 = ""
       for ch in reversed(seq):
               if ch in complement:
                       string2 += complement[ch]
               else:
                       total -= 1
       #
       # Make a modified filename for the output file (drop the ".txt" extension)
       file_new = filename[:-4] + "_rc.txt"
       #
       # Write the output file
       with open(file_new, "w") as gene_out:
               gene_out.write(string2)
       print("Modified sequence of", total, "nucleotides saved in", file_new)

gene3.py parses a standard codon text file to look up protein sequences (this was good file I/O practice for me!) and then outputs as “file_name_proteins.txt”

       # Open txt file, convert to string (input is assumed to be formatted by gene_format.py)
       filename = input("Enter the filepath/filename of the .txt format sequence: ")
       with open(filename, "r") as gene_in:
               seq = gene_in.read()
       #
       # Open codon index in binary mode, which allows seeking relative to the current position
       codon = open("std_codon.txt", "rb")
       #
       # Create new file name
       file_new = filename[:-4] + "_proteins.txt"
       #
       # Byte offset of each paragraph of the codon file (paragraphs sorted by first letter);
       # these offsets are specific to the layout of my std_codon.txt
       paragraph = {'t': 0, 'c': 193, 'a': 387, 'g': 580}
       #
       string2 = ""
       #
       # Loop through the three possible frames (+1, +2, +3)
       for c in range(3):
               string2 += "\n\n"
               #
               # Search through the gene string, one codon at a time
               n = c
               while n < len(seq) - 3 - c:
                       #
                       # Go to correct paragraph of codon file
                       codon.seek(paragraph[seq[n]], 0)
                       codon.seek(1, 1)
                       #
                       # Search through paragraph
                       while True:
                               test = codon.read(3).decode("ascii")
                               if test == seq[n:n+3]:
                                       codon.seek(4, 1)
                                       string2 += codon.read(1).decode("ascii")
                                       break
                               if len(test) < 3:
                                       break   # reached end of codon file without a match
                               codon.seek(9, 1)
                       n += 3
       #
       # Write output, close and end comment
       with open(file_new, "w") as protein_out:
               protein_out.write(string2)
       codon.close()
       print("Protein sequence saved as", file_new)
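An alternative to seeking through std_codon.txt is to build the standard codon table in memory, which sidesteps file offsets entirely. This is a sketch of a different technique from the file-based lookup above, not a description of it; the names BASES, AMINO_ACIDS, CODON_TABLE and translate are my own, and the amino acid string is the standard genetic code written in t, c, a, g order.

```python
from itertools import product

BASES = "tcag"
# Standard genetic code, one letter per codon, with codons ordered ttt, ttc, tta, ttg, tct, ...
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(seq, frame=0):
    """Translate a nucleotide string (a, c, g, t only) starting at the given frame offset."""
    seq = seq.lower()
    return "".join(
        CODON_TABLE[seq[i:i + 3]] for i in range(frame, len(seq) - 2, 3)
    )

print(translate("atggcctaa"))
```

Passing frame=1 or frame=2 gives the other two reading frames of the same strand.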


gene4.py picks a random nucleotide in each 100 character segment of the sequence and switches it to a random other nucleotide, then outputs to a file “file_name_m.txt”. It uses the system clock time to set the random number seed

       import random
       import time
       #
       # Open txt file, convert to string (input is assumed to be formatted by gene_format.py)
       filename = input("Enter the filepath/filename of the .txt format sequence: ")
       with open(filename, "r") as gene_in:
               seq = gene_in.read()
       #
       # Set random number seed based on the system clock time
       random.seed(int(time.time()))
       #
       # Set some variables
       string2 = ""
       #
       # Change one random nucleotide in each 100 character segment
       n = 0
       while n < len(seq) - 100:
               i = random.randrange(n, n + 100)
               # Pick a random nucleotide other than the current one
               j = random.choice([b for b in "tgca" if b != seq[i]])
               string2 += seq[n:i] + j + seq[i+1:n+100]
               print(i, seq[i], "changed to", j)
               n += 100
       #
       # Include the 'left over' part of the string
       string2 += seq[n:]
       #
       # Create new file name
       file_new = filename[:-4] + "_m.txt"
       #
       # Create output file
       with open(file_new, "w") as mutated:
               mutated.write(string2)
       print("Mutated sequence of", len(string2), "nucleotides saved as", file_new)


Assignment 2

Alexander J. Ratner 11:34, 14 September 2009 (EDT)

The assignment documented below was to graph an exponential and several logistic curves using a plotting plugin for Python

Fig. 1: The Exponential Plot


  • This was my first plot using Python, as well as my first use of Python in general; good practice!
  • The first logistic curve, [math]\displaystyle{ \frac{dx}{dt}=k*x }[/math], is of course simply the exponential curve (Fig. 1)





  • The 2nd logistic curve was [math]\displaystyle{ \frac{dx}{dt}=k*x*(1-x) }[/math], which gives the formula [math]\displaystyle{ x(t)=\frac{e^{kt}}{1+e^{kt}} }[/math] (taking [math]\displaystyle{ x(0)=\frac{1}{2} }[/math]), which is what I then plotted; shown below is this logistic curve with increasing values of k (0.9, 1.5, 3.67):

Fig. 2: The logistic curve, k=0.9 Fig. 3: The logistic curve, k=1.5 Fig. 4: The logistic curve, k=3.67
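As a sanity check on the closed-form logistic curve, the sketch below compares it against a simple forward Euler integration of dx/dt = k*x*(1-x) starting from x(0) = 1/2. The step count and the end time t = 2 are arbitrary choices, and no plotting library is used so that the snippet stays self-contained.

```python
import math

def logistic(t, k):
    """Closed-form solution x(t) = e^{kt} / (1 + e^{kt}), which has x(0) = 1/2."""
    return math.exp(k * t) / (1.0 + math.exp(k * t))

def euler(k, t_end, steps=100000, x0=0.5):
    """Integrate dx/dt = k*x*(1-x) from t=0 to t_end with forward Euler."""
    dt = t_end / steps
    x = x0
    for _ in range(steps):
        x += dt * k * x * (1.0 - x)
    return x

for k in (0.9, 1.5, 3.67):
    print(k, logistic(2.0, k), euler(k, 2.0))
```

The two columns should agree closely for each k, which is what one would expect if the formula really solves the differential equation.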


  • On the Excel graph, one can see that the long term behavior is highly chaotic; notice the difference between k=3.67 and k=3.7
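Chaotic long-term behavior of this kind is usually associated with the discrete logistic map, x_{n+1} = k*x_n*(1-x_n), rather than with the continuous curve. Assuming that is what the Excel graph iterates (the notes do not say so explicitly), a minimal sketch for comparing the two k values follows; the starting value 0.5 and the number of iterations are arbitrary choices.

```python
def logistic_map_orbit(k, x0=0.5, n=200):
    """Iterate x -> k*x*(1-x) n times and return all n+1 values, starting with x0."""
    orbit = [x0]
    for _ in range(n):
        orbit.append(k * orbit[-1] * (1.0 - orbit[-1]))
    return orbit

# Print the last few iterates for the two values of k mentioned above
for k in (3.67, 3.7):
    tail = logistic_map_orbit(k)[-5:]
    print(k, [round(x, 4) for x in tail])
```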