# User talk:Zachary Frankel

## Documentation of my contribution/work

### Overview

Throughout the semester I worked in a variety of capacities on the class project. I have consistently dedicated significant time to this project and the work necessary to contribute to it, and have kept a relatively thorough blog type documentation of my work and thoughts throughout the semester that you can find in the second section of this talk page. There are also several posts from me on project work on the project page and some of my thoughts/work are recorded in the modeling team page which was worked on by myself and Alex L. Much of my work and time in the class was involved in positioning myself to contribute in a meaningful way - in my case that meant 1) reading a lot of biology and statistics 2) learning some python and mathematica (for prototyping) 3) installing and fixing a lot of software issues (a nice side benefit of this class is I now have a much cleaner computer running an updated OS). However,I have tried to formalize a documentation of my work in the below section.

### Implementation/Code

I wrote several pieces of code, the last of which was the starting point for the final product from the modeling team. Code I wrote earlier in the semester includes early prototyping of various modeling techniques(mostly non-linear regression in earlier stages) in mathematica, in addition to a mapthreading method to generate datasets according to some rule one specifies - this can be found in this file. Similarly I prototyped and explored logistic regression analysis in mathematica in this brief code. To see early versions of the code I wrote for the groundwork of the team's final output please see the initial code, the datareading methods and the logistic regressions tool files (these use the ols package). I am also currently writing an additional model code for other types of mathematical models that I will post before the deadline on the 21st. Also please find updated versions of the python tools (ones worked on collaboratively with Daniel, Alex L and Ben) for logregoutput and dataread. Other coding initiatives include exploration of Fuzzy C-means Clustering via the python-clustering package which you can find here. I had planned to get a working model of the neural network before the end of class but I had expected an extra couple days before Friday - I will still continue working on this and will post this implementation some time in December (though not before the technical end of class).

### Research

I spent significant time both reading papers, a couple textbooks, and various online sources to develop my background in both biology and modeling - this research was behind all of my contributions. Below are links to all of the resources I have used throughout the course in addition to my takeaway thoughts/lessons from them

• Statistics Resources
1. I read several chapters of an introduction to generalized linear models for guidance on implementing the models we were looking at - note I used a print edition from Cabot science library
2. Hossein sent me this chapter and this chapter from course notes from MIT's STOCHASTIC PROCESSES, DETECTION AND ESTIMATION class - they were an extremely useful introduction to the subjects
3. In researching neural networks further I referred to this website and the textbook Neural Networks for Pattern Recognition
• Papers:
1. Myocardial Infarction GWAS paper - saw casestudy of how to take genome data and find interaction of SNPs on risk of a trait
2. The paper from the Biology team on eyecolor and the supplemental information - strong walkthrough of genetic interaction of SNPs on a non-risk trait - in particular this examines the type of nominal classification which inspired the prototype of the modeling team work
3. Background on epistatis - provided a mathematical framework from which to approach epistasis and think about genetic interaction in general
4. Review of SNP analysis in the context of complex disease prediction. Provided more justification for logistic regression, and introduced me to notion of neural network which I wish to implement
5. Overview of pharmacogenomics - useful in early stage of class project when considering pharmacogenetics. Background was still relevant as it discussed the way to associate genes with traits
6. Background on flux analysis - followup from Prof. Church's lecture
7. Update on sequencing - though not directly related, it motivated a class discussion and helped me keep in mind the speed at which we are becoming able to sequence. This impacted my focus on preparing for the future because it's closer than it seems.
8. Reconstruction of Human Metabolic Network - a terrific paper provided by Harris - inspired a lot of the early stage thinking as well as the desire to shy away from the complete project of reconstructing drug metabolism.
9. Paper on obesity which I presented as a possible case study

### Idea Generation/Modeling Work

From the earliest days of the project I have worked in collaboration with the class on generating ideas for project direction and ultimately put forth (the admittedly obvious/underlying the entire conversation) idea to simply work on Trait-O-Matic rather than pursue an explicitly pharmacogenomics or SNPCupid project. Also presented framework of multiple tiers of models, worked through computational problems with continuous modeling and moved group into python programming from matlab. Additionally, within the group I pursued logistic regression and explored alternative paths for future work (presented findings inside the group) - to see more on these refer to the sections I put together on the Modeling Team Documentation. Finally, I suggested various avenues for improvement detailed throughout my talk section as well in the future ideas section I posted on the Modeling documentation.

### Documentation

I began both the modeling team page as well as the documentation within it. Alex L and I wrote the documentation itself together. He did an excellent job dcoumenting the programming/final implementation aspect of the project, whereas I wrote and posted the initial skeleton of what the documentation would entail and completed documentation on data types, statistical methods, lessons learned, and future plans.

### Cross-group Coordination

I have worked to coordinate the modeling group with both the infrastructure and biology groups. Over the semester I had multiple meetings with the biology group as the sole contact form the modeling group - similarly, I also worked to bring in guidance from the biology group to the modeling group meetings. Additionally, I had several meetings with the infrastructure group to understand the direction of their work and determine how to integrate the modeling work into trait-o-matic. This resulted in

1. The decision to use csv as a way to read in data given that it is easy for computers to handles but intuitive enough for nonprogrammers to input data
2. Decide what kind of dataset to seek out from people in the literature (although at this time we are yet to receive one) - however, in this vein, we also determined how to test our code and what type of data to create
3. Conclusion with infrastructure group on how to control our output. Ie. the coordination of the output file from the modeling group was a result of these meetings and discussions.
4. Consolidated flow of information

### Logistics

1. Organized and coordinated five meetings for the modeling team in the second half of the semester on November 9th, November 15th, November 30th, December 7th, and December 17th(upcoming)
2. Coordinated two meetings and one video conference with Biology team, as well as two sessions with Infrastructure team
3. Facilitated class work functions including note taking and room reseration

### Note On the Class // 12-18

This is my last post in the context of Biophysics 101. I plan to continue to use this space to keep track of my work and thoughts and look forward to continued work on the project. However as this is my last official post I want to say what a pleasure it has been to be in this class. I have certainly enjoyed the opportunity to learn a great deal of biology, mathematics, and programming in the context of a cool project, but more importantly I have enjoyed getting to know everyone in the class. It was really an honor to work with so many great, talented, nice, and brilliant people and I enjoyed learning from everyone. Also thank you very much to Prof. Church and Harris - I know you both are extremely busy but the amazing amount of time and effort you put into this class made it a terrific experience. Thanks again to everyone and I really believe this is only the end of the beginning. Cheers!
Zach

### Importance of Parental Origin // 12-17

There is a terrific paper in this week's issue of Nature entitled Parental origin of sequence variants associated with complex diseases - this also has serious implications for the accuracy of our model. In researching the influence on Type II diabetes of a certain SNP the researchers found that the allele that confers risk when paternally inherited is protective when maternally transmitted. As the authors point out " genome-wide association studies have so far yielded sequence variants that explain only a small fraction of the estimated heritability of most of the human traits studied. " - this certainly makes the current model less useful as we cannot account for such issues as of yet. this is very interesting to me though and I would like to read more about the topic. More on this soon.

### Old Paper New Idea // 12-16

I am posting a paper I read in October but never discussed. Earlier I did not have the experience modeling to appreciate the relevance of this paper but essentially it is a counterpoint to our "blackbox" methodology for determining gene-gene interaction. In particular, it asks how we can find biological rather than just statistical relevance in the interactions we determine. In the words of the paper

For example, if we know two SNPs have significant statistical interaction, how do they interact in biology or genetics? Does each allele of SNPs act? Do dominant or recessive effects exist?


The method of doing this follows three steps

1. Compute entropy of initial data set (H0)
2. Compute gain ratio of single SNP
3. Compute gain ratio of SNPs’ interactions

I had some very ugly non-working mathematica code from a couple months ago attempting to implement this but I might give it a whirl in python.

diagram of process from paper

### Documentation notes // 12-14

I have been exceptionally busy with exams but over the next week I am working to consolidate all documentation both for my own contribution and that of the modeling team. I put up a skeleton of the modeling team's documentation and hope this will facilitate easy collaboration on this issue. Also above I have outlined my contributions and will try to detail them in such a way that each links directly with relevant pages.

### Work To Do Update // 12-11

Class met for the final time yesterday and it was great to see the Biology Group's presentation and discuss both the progress our class has made and other possible directions (such as Brett's idea of an educational platform). Ultimately, I was amazed not only by the brilliance of my peers and their amazing contributions, but also by the ability of the class to come together and build a cohesive project. I think the flow from Biology to Modeling to Infrastructure is very elegant, and I am excited to keep working on the project after the semester. However, in terms of things I will be producing over the next week: I have talked to Hossein about building some more modeling scripts to provide a diversity of choices to upload to trait-o-matic. I also am working on both documenting the entirety of the work we produced as well as documenting my own personal contribution - for complete documentation please see the modeling team page - I just posted a skeleton for all documentation which hopefully the team will start filling out- also on my page I will write some more detailed documentation of certain areas I worked on for anyone interested. Additionally, I will try to go through all of my previous entries and make them more accessible, ie. make sure all things which should be linked are linked, and fixing any obvious errors.

### Penultimate Class Recap + Remaining Goals // 12-8

We met as a class for the second to last time today and the modeling group presented in detail the work we have been doing. As an overview of the current state of the project please see the diagram on the right

diagram of project

After talking briefly with Harris I'd like to set a few remaining goals for the modeling group and myself. First, document everything we have done and make sure others can pick up easily where we left off. I will make sure to go through all of my wiki edits and draw together a comprehensive explanation of what we have done - but also link to this through the larger group documentation. Second, I would like to build more models based on the literature - this way we can upload multiple ones and take advantage of the trait-o-matic infrastructure.

### Work with Modeling Group (and Biology) // 12-7

Today I am working with the modeling my modeling buddies (Alex L., Daniel Jordan, and Ben Leibowicz) - we are starting from the code I wrote over the past week which you can access in the dataread and logreg files. However, we are going through the code and streamlining it, and building upon it. We will later post updated versions containing our combined collaborative work. We have a few main goals for the day

1. Optimize code, make sure it is pythonic, and make sure that it avoids errors - or that if there are inevitable errors we understand the origin of them and how to deal with them
2. Extend code so that it interfaces with trait-o-matic. We met with Fil earlier in the day and now are aiming to create an output that interfaces by producing a python script itself. In particular, after training on the data provided in the csv file, the output will be a script which contains the model and can be run elsewhere with inputs on the genotype of our new prediction
3. Find issues and limitations of the current model, and outline or begin work on alternative models - it is just as important to us to understand how to use the model we have as it is certainly not appropriate for all situations

Update: We worked for several hours and have completed both the first and second goals - we will post the code to the project code and walk through it tomorrow. Using python's flexibility with pattern recognition and printing to strings we produce output which is itself a python script. Ie. as a walkthrough of the process we read in a csv file with genotypic and phenotypic data to create a model - this csv file internally results in certain arrays of probabilities - we can then take the logit of these probabilities to create a curve on which we perform a regression for coefficients corresponding to each snp. The model for this is written into an array of coefficients - then the code for taking this array and making predictions is written to a string which is outputted as another python file. This file takes in arguments for new genotypes and gives the probability of various phenotypes based on these.

At 8:30 I had a conference call with the members of the biology group - we discussed one issue worth noting here: how to deal with heterozygosity. That is, let us assume the baseline for a SNP is A, and the polymorphism in question is T - then one can either be A/A, A/T, or T/T - in our previous interpretation A/A would be zero, T/T would be one, but it is unclear how we would classify the intermediate heterozygous A/T. I suggested a framework wherein each SNP had two terms in the regression, one term for the homozygous case and one term for the heterozygous case. Accordingly 10 would represent homozygous, 01 would represent heterozygous, and 00 would represent no SNP at all. I need to do some reading and thinking about how this would theoretically impact our regression.

### Meeting with Infrastructure Team // 12-6

This evening I met with the infrastructure team including Fil, Alex R., Anna, and Joe. I just wanted to highlight a couple key points of our discussion. The modeling team's work is written in such a way as to be somewhat agnostic about its training. That is it can receive an input file from anyone, whether or not it is produced by Trait-o-matic, and still have a regression useful for predicting phenotypes. Moreover, once initialized it can provide predictions regardless of the training data. This has the clear weakness of not providing confidence levels for the regression in any way, but it has a clear strength in its flexiblity. Accordingly we are going to work with the biology team independently to get training data.

This was somewhat of a digression but the point was to get to where infrastructure and modeling interact - in particular we wish to have it pull up the model for specific trait lookups and calculate probabilities from whether or not SNPs are there compared to the standard.

### Notes on Stats decisions (and details on logistic regression) // 12-6

In an effort to better document the decision making process in this project I am posting an explanation of the work. We are working on a logistic regression, which is excellent for binar responses - that is if we consider traits which one either has or does not have. While the method is flexible enough to consider ordinal data, ie. 3 types of thresholded data. , it is particularly good in the binary case. A couple justifications for this approach

Flexibility for considering continuous traits. In particular, all traits have some granularity and we can establish thresholds to make this data discrete. For instance, when considering height we could create three classifications of short, medium-height, and tall. In other words, continuous data can be expressed as ordinal data (in statistical theory ordinal data is between nominal and continuous data, where there is some natural hierarchy - for instance if you are young, middle-aged, or old, there is a clear order as this results from an underlying continuous age). There is also flexibility in the way we plan to implement code to allow us to use other models with the OLS regression

Binary Data Much of the data we deal with is binary. That is to say, we have many phenotypes where you either have or do not have something - accordingly, we can treat not having the trait as 0 and having it as 1 - which works perfectly. Moreover, we are performing somewhat of a proof of principle here. In that respect we wish to have the simplest possible model (we cannot have fewer than two outcomes), and once we decide to work with binary processes, the logistic regression seems natural.

Also as a bit of background on the logistic regression - starting with the notion of a generalize linear model - we in general model probabilities as $g(\pi_i) = x_i^T \beta$ using the xs as a vector of explanatory variables (in this case genotypes). In our case we take the simplest possible g so that we see π = xTβ, we have to make sure then that we stay in the interval of [0,1] so probabilities are proper. It is for this reason that we use the logistic model, where logit(π) = log(π) − log(1 − π)

### More Work // 12-5/12-6

First a note for anyone else trying to use scipy - make sure to use numpy 1.3 if you are on linux or mac - in particualr the stats package has an issue with cython c code if one uses the new version of numpy. This was rather frustrating but you can uninstall numpy 1.4, revert to numpy 1.3 and generally be fine. Also for everyone's use I am using this file for performing the regressions.
Note from later in the morning (6:00 AM) - I am about to sleep for a bit, but I wanted to post the code in its current version. There are a lot of issues with compatibility of arrays, lists, and ols with various versions of Numpy but I am starting to realize some tricks to work around them. Also for a brief review of my goals before Tuesday it is to finish the code I am currently working on, which at this point seems to be > 90% fixing compatibility issues. Second, to document the work I have done.

Update (6:38 AM) IT WORKS - finally... For people looking to tweak the code please note the importance of array alignment. These arrays are specific to numpy ie. numpy.array and are not the same as the array class in python ( a source of some confusion ) - also for ols you need one array of arrays and one array, in particular the x vector is an array of arrays and the y vector is an array - however you cannot convert from lists to arrays, you must construct arrays and append piece by piece

### Programming update // 12-5

Now that I finally have python with the proper packages on my computer, I figured it was time to update the wiki on the code which I've been writing. There are a few steps which I aim to have fully implemented for the modeling by the end of the semester. All work I am doing is written in python modules so hopefully it will be of use to others beyond our project if they wish to implement something similar.

The first step ( which is implemented already - code is posted at the end of this post ) is reading in the data. Data comes in the form of a CSV file. To explain the form of the CSV file, we have one column for each SNP and a final column for genotype - for an example of a CSV one could read in please see this document. One loads the data via a dict_load method and the module then initializes two objects. The method which does this requires only a filepath and the number of phenotypes one is considering. The first object initialized is a list of dictionaries, one dictionary for each phenotype. The dictionary takes as keys genotypes concatenated - for instance if you have 3 SNPS in consideration, and someone has the first SNP but has neither the second nor third, their genotype would be recorded as '100' and the string '100' would be the key in the dictionary. Each time the reader finds such a datapoint it increments the mapped value. The second object is a dictionary to count the total number of datapoints for a given genotype. Using these two dictionaries it is relatively straightforward to find conditional probabilities of phenotypes given genotypes which is also implemented in the module. The code for this is posted below:  import csv import sys \# stores for each genotype the total number of datapoint for it totaldic = {} \# a list of dictionaries where each phenotype has a dictionary for which genotypes are mapped to how many datapoints they have datalist = [] \# the conditional probability of a phenotype given a genotype def cond_prob(phenotype, genotype): 

 if datalist[phenotype].has_key(genotype): return 1.0 * datalist[phenotype][genotype]/totaldic[genotype] else: return 0 num load data from a csv file with a specified number of phenotypes store totaldic and datalist so probabilities can be calculated also deletes any old stored info def load_data(filename, numphenotypes): # clear totaldic and datalist datalist = [] totaldic = {} # create a reader object to go throug the csv dataread = csv.reader(open(filename, "rb")) # store the rownumber rownum = 0; counter = 0 # fill datalist with a dictionary for each phenotype while counter < numphenotypes: datalist.append({}) counter += 1 # read in data from rows for row in dataread: # store the header as the first row if rownum === 0: header = row # for all other rows go through columns and store data else: column = 0 stringkey = ""; # define the key by the genotypes concatenated for col in row: stringkey += col; column += 1 # adds to the genotype keys in the phenotype's dict if datalist[int(col)].has_key(stringkey[:-1]): datalist[int(col)][stringkey[:-1]]+=1 else: datalist[int(col)][stringkey[:-1]]=1; # adds to the phenotype key mapping in total data if totaldic.has_key(stringkey[:-1]): totaldic[stringkey[:-1]] += 1 else: totaldic[stringkey[:-1]] = 1 rownum += 1 if len(sys.argv) < 1: print "please specify a file name" sys.exit(1) read in filename from command line and open it to read 

filename = sys.argv[1] load_data(filename, 2) print cond_prob(1, '00') 

Second step will take this data and take the logits, then perform a logistic regression - so one can specify a data file, load it, and any time afterwards put in a genotype and find a result.

### Meeting with biogroup // 12-3

Today I met with the folks from the biology group (Ridhi, Jacqueline, and Anu) - I went through with them the progress the modeling team had made, and the way we could interface data. In particular, we talked about how to construct a dataset and send it through to the python code I was writing. We decided that using CSV files which can easily be manipulated in excel would be a nice interface even if temporary.
We then talked a lot about how we could go about making a dataset and more importantly ways we can acquire one. Ridhi posted detailed notes on this matter. However, from my perspective a question which was important was how to treat SNP data - the method we agreed upon was looking at an SNP as a piece of binary data, ie. either there is the SNP or there isn't. That is to say there is no hierarchy of polymorphisms on a single nucleotide but instead we identify particular polymorphisms and if they are there we record the data as a 1 otherwise we record it as a 0

### Getting programming tools // 12-1

External hard-drive and Mac OS X upgrade arrived today - spent a couple hours backing up my files, and getting a clean install. Now I have to get scipy and numpy on this but it should finally work! Update (and guide for anyone in the future trying to install scipy) - after far too much work I have scipy and numpy fully running on Mac OS X 10.6 - if anyone in the future wants to do this I suggest

1. Make sure you have the most recent version of the OS - it will make dependencies (the real problem) a lot easier
2. Uninstall all previous compilers etc... anything on which something can depend - in my case this meant uninstalling old (and recent) versions of Xcode, and reinstalling
4. Make sure to use svn for checking out numpy and scipy, and when you face the inevitable issues in compiling read the errors. Many of these will be dependencies which you can manually change in the code. Some of these can be as silly as changing a 10.5 to a 10.6 or double checking the path of some file to which it refers

In any case, these tools are awesome.

### Overview of Tonight's meeting // 11-30

Tonight the modeling group met for a couple hours to go over the work we have all done and finalize goals we wish to present to class tomorrow. Essentially our main three goals are

1. Complete code for the logistic model involving reading in arbitrary data, plotting logistic curve of probabilities, and regression
2. Run test with either actual or simulated data through bio stream group
3. Coordinate with infrastructure group to link up with trait-o-matic

### Logistic Regression // 11-28

Today I developed a mathematica code that perform the logistic regression as outlined in the paper on eyecolor. I need to do some refinement but the steps are essentially

1. Determine how many SNPs we have in question and prepare datastructures accordingly - the first such structure is the model equation itself - the second is an array in which to store probabilities
2. Count through the dataset to establish probabilities of a trait(in this code we use binary traits) depending on some combination of SNPs - we can consider here an associative array which has for instance "0110" as its key, indicating that the first and fourth SNPs are not present, but the second and third are. The entry in the probability array for this key is the number of observed of the trait for which we have the array divided by the total number of such SNP combinations
3. Perform a regression with the probability as the determined outcome
4. Determine probabilities for a given genotype by exponentiating appropriately

This is very nice for traits with only a few possibilities. However, it has one area about which I am concerned and would like to improve. In particular, we will likely be working with relatively small datasets (for now) with possibly many SNPs. One can imagine if there are 10 SNPs in question there are 210 = 1024 possible genotypes, and it is likely we will have some genotypes with no data, and others with sometimes only one datapoint. This will lead to probabilities which are not accurate - ie. if we only have one person with genotype "0011101001" and he is positive for the trait we are looking for, we will be performing the regression with "0011101001" mapping to 100% probability, when in reality it could be anything greater than 0%. I would like to resolve this by finding a way to weight the certainty of a probability in the regression. Ie. if we have 1000 datapoints on a different genotype and only 1 for another we should not weight these probabilities equally in the regression. I am not sure if there is a smart way of dealing with this, but I am looking through the literature.

After using mathematica a bit more on the models from Ridhi's paper - in particular, ordinal and multinomial regression - I think we might want to implement this for traits we do not explicitly quantify (ie. with a genetic disease, you have it or you don't, or in the case of the paper there are only 3 options for eyecolor). However, I would like to make a modification to the regression given our limited data. That is we use 'probabilities' of various traits to perform the regression, but given our limited data we might need weigh the certainty of these probabilities. In particular for diseases where we might only have 2 or 3 positives, we should find a way to weight these less strongly. The other suggestion I have for a paradigm is multiple models built into traitomatic and depending on the classification of the trait we use a different model. A possible classification would be

binary traits - this would include genetic diseases (though there may be ways to quantify some of these and there is some sort of threshhold) - however, surely here we can consider probabilities of risks depending on an array of SNPs and whether or not they are there, as it does not make sense to consider for instance 6 different associations and assume the observed risk is somehow independent of the other actions. This is an area where I think the logistic regression model makes a great deal of sense

continuous traits Here we would have completely quantifiable things such as body mass index or heigh. In this area we coudl consider the model we have been working on and refining, but with constrained parameters for epistatic interactions.

### More on Continuous trait modeling // 11-26

I continued to work with the type of model discussed in my post below, working on the generalization to 3 gene interaction. I should note, that anything we get working needs to easily generalize beyond three genes, but at the same time if it cannot do three it is unlikely to generalize. The major modification I made to the model is the removal of the delta's in the multiplicative factor. Instead there is merely some variable (ax − 1) in from of each multiplicative term with the implication that if ax is 1, then the factor does not impact the multiplicative term, and otherwise (ax − 1) is zero and the multiplicative term is impacted. Note, there is an assumption here (which we should discuss) that there is only one multiplicative term - the justification for this is that nonmultiplicative terms arise from different pathways, at least in questions of risk - so while possible, multiplicative effects should be captured in a larger term to a reasonable degree of accuracy. Moreover, we can assume any correction arising from misinterpreting that can come from some unconsidered multiplicative effects would be captured by the linear terms. The new term for the model in the mathematica code is  

 model = a (((ax - 1) x/baseline + 1) ((ay - 1) y/baseline + 1) ((az - 1) z/baseline + 1) - 1) 

 baseline + (bx + bxy KroneckerDelta[0, y ] + bxz KroneckerDelta[0, z]) x + ( by + byx KroneckerDelta[0, x ] + byz KroneckerDelta[0, z]) y + (bz + bzx KroneckerDelta[0, x ] + bzy KroneckerDelta[0, y]) z + c; 

### Generalized Modeling - Python issues // 11-25

First, happy thanksgiving! I am thankful for biophysics 101. My biggest frustration is still python, and I've somewhat resigned myself to needing a new OS. Anyhwo, I decided to work on the generalization as Ben and I discussed the logarithmic model a bit and it is unclear if we will be able to generate sufficient data. $a((1 - d_x \delta(x))(\frac x {baseline} + 1)(1 - d_y \delta(y))(\frac y {baseline} + 1)(1- d_z \delta(z))(\frac z {baseline} + 1)-1)baseline$
+ (bx + bx,yδ(y) + bx,zδ(z))x
+ (by + by,xδ(x) + by,zδ(z))y
+ (bz + bz,xδ(x) + bz,yδ(y))z

Given n parameters our multiplicative term has n + 1 coefficients, and each additive term has $n$, so we get a total of n2 + n + 1 coefficients. This isn't an unreasonable growth in coefficients, and it is about as close to Ω(n2) as we can get for an O(n2) growth. I wrote up the 3 gene version of this on Mathematica baseline = 150;
model = a ((ax1 + ax2 KroneckerDelta[0, x]) (x/baseline + 1) (ay1 + 

 
        ay2 KroneckerDelta[0, y]) (y/baseline + 1) (az1 +
az2 KroneckerDelta[0, z]) (z/baseline + 1) - 1)
baseline + (bx + bxy KroneckerDelta[0, y ] +
bxz KroneckerDelta[0, z]) x + (by +
byx KroneckerDelta[0, x ] +
byz KroneckerDelta[0, z]) y + (bz +
bzx KroneckerDelta[0, x ] +  bzy KroneckerDelta[0, y]) z + c; pracmodel[i_, j_, k_] := 3.139 ((i/baseline + 1) (j/baseline + 1) - 1)
baseline +  (1.453 - 1.453 KroneckerDelta[0, j]) i +
2.52 j + .05 + k +  2*RandomReal[];rando1 = RandomInteger[{-10, 10}, 250000];rando2 = RandomInteger[{-8, 8}, 250000];rando3 = RandomInteger[{-8, 8}, 250000];data2 = MapThread[
List, { rando1, rando2, rando3,

 fit = FindFit[data2, model,{{a, 1}, {ax1, 0}, {ax2, 0}, {ay1, 0}, {ay2, 0}, {az1, 
   0}, {az2, 0},  {bx, 1}, {bxy, 1}, {bxz, 1}, {by, 1}, {byx,
1}, {byz, 1}, {bz, 1}, {bzx, 1}, {bzy, 1},  {c, 1}}, {x, y, z}]


- however when trying to simulate data given a relatively small noise factor, I could not get accurate coefficients. The reason being, by the time I scaled up to an order of magnitude at which we would get reasonable data I got the following error message
No more memory available.
 Mathematica kernel has shut down.
 See http://support.wolfram.com/mathematica/systems/macintosh/osx/64bit.html
 for information on 64-bit memory operation on Macintosh. I will investigate if it is possible to run these simulations on the cloud where I assume we have a lot more memory than on my computer.

### Update // 11-23

Spent a while today trying to figure out what tools to use for the model at this point. I feel like prototyping isn't so useful, so I might as well write in python from the start. However, I faced great frustration in trying to get the scipy package working on my computer due to some issue with a fortran compiler - no idea what is causing this and it is taking a while to figure out. However, once I have this working it shouldn't be too hard to translate my mathematica code into python. Once this is done it should be possible for my program to 'talk to' whatever is in traitomatic

### Goals // 11-22

I just wanted to take a minute on my talk page to consolidate some of what I have been working on and lay out some goals of mine before the term is over. First, I went through the papers Ridhi sent out - two things on my mind with this

1. Quantifying a trait better - I think what they do might work or us (they say Blue = 1, Brown = 3, Intermediate = 2) but that quantification seems somewhat arbitrary - if we decided on eye color I'm cool with it, but the argument for something like body mass index is that it is a continuously distributed variable
1. Getting actual data -- this should be possible through the PGP but we should make sure - we would have to classify eye color as they do, but I think it could work

In terms of goals for the week, I'd like to get something in python working that replicated what I've done in mathematica - I think this is wiser than prototyping because much of the work is in figuring out an interface. Second I'd like to write up a couple of the logistical models presented in the paper, and we can compare which we think will be more effective.

### Genetics of Obesity // 11-17

Most of my efforts are focused on modeling right now but I anted to take a minute to post this link to an article from this week's nature about obesity and diabetes. For more on diabetes, see the paper posted by Prof. Church a few weeks ago. I think lookin into obesity could be a great test case if we can find more information/datasets, which I ham doing right now.

First, there are a lot of pathways which could influence obesity, including POMC deficiency, MC4R deficiency, leptin deficiency, neurobehavioral pathways that make us want to eat, etc... Finding a single predictive metric could be a great diagnostic tool to understand the influence of various levels of each of these.

Second, I think there would be a great deal of interest in getting a single prediction of something like body mass index. This is because, unlike height, it is something over which patients would have some influence. In that respect, I think there would be a huge market for people to learn what their body/genes 'want their weight to be' -- it could be used for guidelines for what types of drugs to take, how much to exercise, and what kind of diet is necessary.

Third, the possible measurement of body mass index provides an easy way to quantify obesity, and it does not require larger population studies as we'd need to do for odds ratios/disease risks. Even if this data doesn't exist yet, suggesting that data on obesity be collected in a quantified way (ie. Body Mass Index, and average BMI for subpopulation) rather than a binary way (obese or not obese) could be a useful contribution if we create a compelling enough case for the possibility of quantification. Moreover, finding the possible epistasis of these various pathways could be very interesting (although very hard). Ie. if there is one gene which makes you gain a great deal of weight when you consume a lot of simple sugars, but another gene which affects your neurobiology and makes you dislike sugary foods would probably be epistatic to the former. This case is clearly an exaggeration, but given the host of interactions in nutrition and metabolism I imagine there could be some very intriguing findings.

### 11-16

First, I thought this article was pretty interesting - it reminded me of one of our first lectures when we were discussing computational power and exponential growths. Things like this are exactly why it is important to think ahead, and assume computing/resources will be exponentially more powerful than we are today, and make sure what we develop and think about will be relevant to that. On a note more relevant to the project, I have been thinking more about the question of determining what kind of sample size we need for our learning to be worthwhile. That is, how many genome/phenotypes do we need in order for our nonlinear regression to be accurate. This raises two questions 1) how much noise do we expect - ie. how much interaction is there we cannot capture and how much measurement error will there be b) what do we consider significant. The latter question is interesting in that we can approach it two different ways. Either we wish to view the coefficients as a means to gain some sort of holistic understanding of the interaction or we wish to provide something highly quantified. I imagine for something like height we are more interested in the former (as very few people would pay money to learn what their 'predicted' height would be if they can simply measure themselves), whereas with something like disease risk people could be very interested in something quantified. To start working on this question I want to make a table of data using the mathematica notebook I posted where we vary noise levels and vary what error we are willing to consider non-significant, and find corresponding sample sizes. This should define a surface in three dimensions with noise and significance as the x-y axes and sample size on the z axis. We would want to make sure all studies we did lay above this surface given specifications. Also, in a brief survey of the literature, I found this which addresses similar issues. Not all of it is directly relevant (as he assumes linear regressions and makes claims to what type of study to conduct) but I think the approach is still interesting. A few parts I think are salient

"In designing an epidemiologic study of genetic risk factors for a complex disease (e.g., cancer, diabetes), an investigator is likely to focus on hypotheses regarding the main effects of specific candidate genes. However, the proposal of candidate genes in the first place is often due to an underlying hypothesis regarding some pathway that leads to disease, a pathway that probably involves more than one gene. For this reason, it may also be important to develop and test hypotheses related to gene-gene (G x G) interaction."

I think this point addresses some of the issues we discussed in class. That is how do incorporate knowledge of pathways? Though somewhat simple, I like the answer above: the very fact that we are looking at interactions is often because our hypothesis is that a certain pathway, where we know some genes are located, is in play.
However, this approach does not assume genes to be quantifiable, but rather looks at prevalence within a population, which we are not as interested in. However, I think we can use the approach to determine the same type of function as I outlined above.

### Thoughts on Today/Work // 11-15

The modeling group met today. There was a general consensus that we needed to get started on putting the code itself together after our various discussions. Again, based on the discussion from Thursday's class there is an agreement we wish to be able to notice

1. multiplicative effects
3. dominance

however, part of our discuss today focuses on how we wish to quantify traits. The argument I put forward is that we should consider quantifications of deviations from the mean for their group. For example, it is more meaningful when looking at alleles in height to consider how much taller you are than your peer group. This provides a loose control for many other genetic factors we cannot anticipate. Ie. a male from Norway is on average many inches taller than a female from Japan, so if we wish to understand the effect of some SNP on height, we should compare how the presence of the SNP impacts their height relative to what we would otherwise expect. Moreover, this provides a way to look for multiplicative effects. That is if SNP A is associated with being 1.2 times as tall, and SNP B is associated with being 1.1 times as tall, then the multiplicative effect would be expecting to be 1.32 times as tall. However, the additive effect is clearly better considered in terms of absolute units(or absolute risk) - that is for example A is associated with being 5 cm taller, and B is associated with being 3 cm taller, then the additive effect is being 8 cm taller. There is a similarity here, and an inevitable ambiguity in terms of when to consider baseline for a population (that is there could be some multiplicative effect from other genes which impact the baseline). Finally, we wish to be able to measure dominance (ie. a term modifying anything which is additive). I put forward a non-linear regression on the function

$A\left[\left(\frac{x}{baseline} + 1\right)\left( \frac{y}{baseline} + 1\right) -1\right]baseline+\bigg(b_1 + b_2 \delta(y)\bigg)x +\bigg(c_1 + c_2\delta(x)\bigg)y + d$

After some research, I think Mathematica might be a good language in which to do nonlinear regression, as the new 7.0 has a fantastic FindFit function. Tonight after struggling a bit I wrote a relatively simple mathematica code to generate some 'fake' data with a bit of noise, and then run a nonlinear regression to see how close we can get given certain amounts of noise. It is attached below here - More to come soon!

### Thoughts // 11-11

Today I finished my HETHR certification - only took a few hours - and I am looking forward to doing some infrastructure work. Also just wanted to post this paper - Prof. Church's words seem prophetic. Post-class errata: I feel pretty dumb for not reading the AUTHOR section more carefully. However, the article is still very interesting and I appreciated Prof. Church's explanation in lecture. In particular, I am interested in this non-single molecule approach, and it's also quite cool that Trait-o-matic is used.

### Notes on meetings // 11-9

Today I attended meetings for both the biology and the math group. First, as a brief overview of what we discussed in the biology group:

1. We considered limitations to the project, and things which we will want to document. The use of documenting such issues is that a) we can provide anyone who chooses to pick up the project in the future an idea of areas which might require work, and b) from our own perspective, better understand the areas we need to work on so as to define the scope of the project for our own sake. The main areas we discussed were the importance of environmental data, the need to look at largely multifactorial(ie. the order of hundreds of genes) in a pathways, and the ultimate need to integrate pathway data into the computer learning algorithm. As a takeaway, I think it is worth writing a document describing areas to work on in the future, so that the project is easily accessible for anyone who wishes to continue with it.
2. We talked through a couple papers brought up. Jackie’s paper on “functional organization” – this collaboration of labs was looking at yeast growing under normal conditions, so no amino-acid starvation, and what they’ve done is double mutant knockouts and looked at a bunch of different two gene crosses and how these combined knockouts affect growth, with one mutant alone you might expect growth increases two fold another one might expect decrease of .8 – sometimes you see it is increased and other times you see there is no growth at all. So in this case they knock out a gene that is responsible for cell cycle control and a gene that is in some signaling cascade – then they look at growth. The possible application of this is considering their data and seeing if our program can recreate interactions determined for yeast. We also discusse'd
3. Finally we worked through some of the model systems which we want the code to be able to replicate - these are further detailed on the project page but the underlying ideas were that we want the code to be able to recognize. These include dominance, additive, and multiplicative effects. These motivated a lot of the conversation we had later at the math group

For an overview of what we discussed at the math group: In dialogue with the bio folk we talked about the models we wanted to recognize. Essentially for now we wish to ignore the early steps of mapping genes via trait-o-matic, and using biological maps to quantify traits. Rather we want to make sure the decision engine works by providing it reverse engineered data based on the bio group models. Int his sense we ended up envisioning the decision box having a few possible methods of optimization including looking for dominiance, additivity, multipicative relationships, exponentials, logarithms, and a few probabilistic distributions. Hossein explained notion of a Kalman filter

### Modular Work For Project

I will post this on the project page once I refine this idea more. However, I'd like to outline an idea for a system by which we could develop the computer learning we wish to. In particular I think it would be worthwhile to develop 'toy gene interactions' where we determine the rules for epistasis/genetic interactions beforehand and test how well the computer learning can build this rule. I envision a few possilbe systems

1. Complete epistasis: Gene A controls eyecolor on a scale of 0 to 100, Gene B controls whether or not you have eyes(if you have no eyes, color is measured as 0)
2. Additive interaction: Gene A and B both control eyecolor pigment in an additive pigment
3. Multiplicative interaction: Gene A and B both control eyecolor pigment in a multiplicative manner, with B on a scale of 0 to 5 and A on a scale of 0 to 20
4. Three gene interaction - yet to come up with a good model for this, but I imagine it is a more robust test

### Ideas For Project Work

1. High level work on finding ways to organize data on polygenic inheritance
2. Organizing Database of multifactorial inheritance
3. Describing analytic method to evaluate inhe
4. Learning trait-o-matic code and structure
5. Integrating code to include analysis of polygenic traits

1. SNPCupid type analytic tool add-on

The role I would like to take is on the interface of designing an algorithm to understand polygenic inheritance etc... and integrating this into the program. However, I would like to see how others imagine working and work in a cooperative framework.

### October 23rd

Random thought on possible other project. We never really investigated one of the ideas mentioned on the webpage (I assume due to either Harris or Prof. Church): "Diagnostics: Disease Risk Calculator web-tool that incorporates family history, lifestyle risk, and individual genotype. " - I think one of the things which might be appealing about this idea is that it allows a very clear delineation of work. Some could do the programming linking genotype analysis, we could have mathematical analysis of the various interactions, and we could have different people getting into the literature on family history etc... It might be a bit late for this but it is worth considering in my opinion. I will look for some of the literature on this and post something better organized on the project talk page.

### October 21st

I read the following paper today. A few key points that I think are salient to our project

2. We might need to consider variability of an entire metabolic pathway

Anyhow, read the paper - if we commit to metabolism, it's definitely worth your time. Also a concern I want to voice about the notion of metabolism as we've thought of it. I am not sure we are being ambitious enough; in particular, I am worried we are doing things that are already being done. Looking at SNPs and how they affect drug metabolism is a richer literature than I thought and we are going to need a more innovative take on this problem for it to be worth it - I will comment more on this on the project page.

### October 20th

Class just ended - wanted to post two quick things before I forget.

1. Followup from yesterday's thought - after discussing briefly with Prof. Church, I think it is probably best to integrate another tool
2. Interesting review of Flux balance analysis + metabolic pathways: this paper

### October 19th Thought

I have been thinking about implementing a modular side project with a simple goal. Given some context, ie. what gene, who did the sequencing, what is the probability there is a sequencing error. A few things to consider

1. Wiki style collection of data - this could be used to generate data about areas 'hard to sequence'
2. Bayesian analysis - if we notice something really weird it is more likely to be an error because it comes from a subset in a probability space of very unlikely alleles
3. How to incorporate this into other analysis

More on this soon

### Interesting comments from David Gurwetz

This year marks the 50th anniversary for the field of pharmacogenetics, the study of the heritable determinants of drug response, both in terms of safety and efficacy. In 1957 Arno Motulsky published the first review in this field, which has exploded during the last decade (the current PubMed count for 'pharmacogenetics' is 4,184 items) and which is by now the subject of dedicated conferences and books. Yet, very little of this vast knowledge has found its way to the clinic. Recent reports examining the lack of uptake of pharmacogenetics into the clinical setting have emphasized high costs and lack of reimbursement among the key issues. A $1,000 genome, or merely a$1,000 'exome' as George Church rightfully notes, would allow the rapid uptake of pharmacogenetic testing to the bedside. Although ethical and legal concerns would remain substantial barriers for its implementation, the affordable genotyping costs would mean that policy makers would no longer be able to cite high costs as a reason for keeping pharmacogenetics out of the clinic. The Centers for Disease Control recently published a report stating that 6.7% of all US hospitalizations during 2004-2005 were the direct result of adverse drug reactions, highlighting their huge societal burden and the urgent need for improving drug safety. A \$1,000 genome would mean large savings, given the pharmacogenetic knowledge that is already available but so far beyond the reach of all but the very affluent.

### Interesting Article on Personalized Medicine/Pharmacogenomics(October 7, 2009)

I read this in Nature today and it got a lot of ideas flowing about human 2.0 and the metabolism project Fil, Alex^2,and I have been thinking about - in particular this excerpt recommendation for personalized genomics companies interested me

"Test pharmacogenomic markers. An estimated 100,000 people die annually in the United States from adverse drug reactions14, 15. Although few drugs are labelled[sic] to require or recommend genetic testing, consumers could find specific variants useful. Variants in drug metabolism genes or recommended for testing by drug labels are informative and could greatly affect an individual's treatment. Examples include variants affecting the efficacy of clopidogrel (used to reduce the risk of stroke or heart attack) or tamoxifen (used to treat breast cancer)16. Most of the DTC companies are testing for some pharmacogenomic markers17, 18; we encourage inclusion of as many of these markers as possible."

A great reference link to review drug metabolism is provided in this article here - it is extremely relevant to our project plans. In particular it covers the website http://www.pharmgkb.org/ which we are thinking about quite a bit in relation to our work.

### Sequencing Update(October 5, 2009)

Seems like sequencing really is getting cheaper - at least according to IBM - it's interesting and I hope to hear a bit more about methods of sequencing in discussion even if it is not direc

## Assignments

### Assignment 5

Somewhat like assignment 4 redux - below is an explicit documentation of the use of each page OMIM, GeneTests, and SNPedia as well as HGMD and GeneCards We wished to avoid the 'rush to code' so below are descriptions of how the project will integratet he various tools

OMIM: WE can comb the OMIM database for metabolic information. Unfortunately,t his data will likely be very hard to parse as it is extremely text heavy without many repeated features. However, it is possible there is information in OMIM we cannot find elsewhere. My thoughts are that this could be useful in finding places to focus on for diagnosing disorders which might be related to dosing - ie. we can find on OMIM that Debrisoquine sensitivity Codeine sensitivity are related to CYP2D6, CPD6, P450DB1. Moreover, due to the encyclopedic nature of the information we could use it to find direction on particular targets and explore relevant literature.

GeneTests: I imagine integrating GeneTests as a prescriptive measure. In other words, after identifying possible areas to be tested further (ie. to verify there was not a sequencing error) we could use the genetests database to automatically inform you where this testing can be performed.

SNPedia: This is by far the most useful of the websites for the project we have been imagining. I imagine the project having a coproductive relationship with SNPedia

1. Using SNPedia s a database through which to comb possible metabolic targets - this would be important for mapping how a certain cite could be related to a SNP which would be particularly easy to map. In fact the project's scope might simply be SNPS
2. The project could provide a great deal of prescriptive data on SNPs to test for further linkage, as in metabolism gene interaction is expected. As such we could use the wiki structure to make this data accessible to others

HGMD: The obvious use is as another database to comb for possible mutations as it is unlikely any of the databases is a superset of the others. Additionally, tracking loci where there seem to be many mutations could be used to forecast where metabolic issues could arise. Moreover we can cross search with OMIM to find mutations on OMIM genes of interest. Ie. given our previous example of the Codeine sensitivity gene CYP2D7P1 at 22q13 we find on HGMD the list of mutations ( in this case only one) - in particular there is a deletion which is labeled under "Codeine to morphine demethylation, association" - this might be more intricate than simply looking for SNPs but would also provide a more complete picture of metabolic mutations and character.

Genecards: This seems more of a supermarket approach, with a focus on purchasing related products. I anticipate if we wish to move into the lab we would use GeneCards to identify the proper products etc... however, right now this does not seem to offer much we don't have elsewhere

Going over wiki pages:

On Epistasis I added the following to the article: "However, in general epistasis is used to denote the departure from 'independence' of the effects of different genetic loci. Confusion often arises due to the varied interpretation of 'independence' between different branches of biology. For further discussion see [1]." - the reason being I found their definition implied ambiguity but failed to explain from where this ambiguity arises. I referred to the following article in my edit, which I found very helpful for those without a deep genetics background.

Some thoughts on QTL: I am interested in finding some mathematical resources to analyze QTLs based on phenotype and genome - perhaps this a tool we could integrate into the project?

### Assignment 4

Please see the flow diagram under Fil's entry. However, in order to avoid group-think I include an idea I'd like to integrate into the project: namely pharmgkb already has a somewhat extensive database of drug/SNP interactions with annotations. In this regard, I think there are two ideas to add to the outline

1. Include the annotated SNP/drug interactions in the database we create, and use its information as a rough guide
2. View our project as a way to suggest new interactions to test and recommend future research
3. Use the pathway database/SNP database on the website to make connections - possibly crawl over it and extract relevant info

tly relevant to the project.

### Assignment 3

In the process of brainstorming ideas for Human 2.0 I have met with various members of the class, including but not limited to Filip Zembowicz, Alex Lupsasca, and Benjamin Leibowicz. Perusing OMIM, GeneTests, and SNPedia I was struck by two things.

1. How much data we already have about the human genome, and some phenotypic results
2. The potential of open source

In light of this, my vision of Human 2.0 is at the nexus of technology and humanity, with a focus on applying our ability to rapidly collect and analyze genetic data to the issue of selective engineering. More specifically, I would like to come up with a way of providing individuals a choice of certain traits to activate. After a certain therapy they would be able to activate certain genes and express something when they chose. This idea would be based on the mechanism Harris showed us in which taking a supplement activates certain promoters. Thus in the genetic material with which people were treated there would be many genes with promoters that you could selectively activate by taking the pills with the corresponding supplement. Ie. if you wanted to produce a protein which blocked myostatin while you were training for a sport you could take one supplement, whereas if you wanted to release increased levels serotonin, dopamine, and norepinephrine to increase focus you could take another. A small step towards achieving this would be to find a few traits which we can map very well to expression of a certain protein, and sequence these with the corresponding promoters.

However, I recognize this idea is extremely difficult (if possible at all) to implement, and likely beyond the scope of the class. So, to offer another idea, largely inspired by my conversation with Fil, I could see a human 2.0 project based on advising individuals on daily choices in light of their genome. This advice could range from drug dosage(see Fil's idea), to diet/exercise habits, to how to handle decision making. For example, after sequencing your genome there might be information on what type of diet would best induce your body to stay lean. I see one way of implementing this project to be a wiki where sequenced individuals keep track of how they respond to certain diets, drug dosings, etc... and individuals could compare their genome (after having it sequenced) to the database on the wiki and find personalized advice. A small step towards implementing this would be to establish several profiles which we could sort sequenced individuals into (ie. dietary) - then find statistical associations between these profiles and responses to different choices in that profile.

### Assignment 2

I thought this assignment was both enjoyable and instructive. First, it definitely convinced me of Python's strength. In particular, I found a number of somewhat complicated parts of the assignment relatively easy to implement. Second, I was very interested in problem four - I would in the future like to implement some statistical tools to analyze what happens over large datasets. However, I would also like to develop a better biological and chemical understanding of what is happening with DNA. Code is posted below:

### Assignment 1

Overall I found the assignment at times frustrating but ultimately quite useful. The frustration was largely a result of getting python to work properly. I have some programming background from high school(which is now 3 years in the past), and have not used python for almost 5 years, so readjusting to the language and to programming in general were fun, if somewhat involved.

On python installation(I only write this in case anyone has a similar issue): I am running Mac OS 10.4 Tiger - this is probably not a great idea, and the only reason I have not upgraded is laziness. Mac comes with pre-installed python; however, my preinstalled python was verison 2.3 which is too out of date to be used with many of our packages, including the current version of numpy. I quickly installed a more recent version (2.6.2) but unfortunately found my default interpreter was still the old version, and I could only open the recent python's interpreter indirectly. After some snooping I found one could change the path used and thus which version of python by running a command(which comes in a customized download of recent python) called update default shell. Overall this was good practice in getting used to a UNIX environment.

On python programming: Wow, this is amazing. I am used to writing in C and forgot how compact and easy to write python is.

On spreadsheets: Useful, intuitive, and straightforward. However, clearly more limited in power than something like python

On the graphs: 1) A bit of mathematical analysis shows that this is just An = kn * A0 - we see the behavior follows this as it is all scaled exponentials. Ie. if k=1 we simply have a constant function at A0, if 0 < k < 1 then the exponential is decaying and if 1 < k we have exponential growth

2/3) [we include both of these in one discussion as the latter is simply the former without being allowed to go negative] The behavior of this is quite funky. For low values of k such as 1.1 we seem to have a well behaved decaying function

however, it is clear that behavior becomes hard to predict and seemingly random. Bold text