Harvard:Biophysics 101/2009:Modeling


Note on Documentation // 12-11

As I posted on my talk page, I am starting to use this page to store documentation of the work the modeling group has done. I would like to encourage all members of the modeling team to collaboratively edit the documentation I started below. (Also, after we finish the documentation, someone please delete this post from the page.) I have started a skeleton with the sections I believe we should include, so let's all fill them in; if there are sections to add, by all means do so. Also, if you have any comments on the documentation but do not want to make explicit changes to the entire documentation below, please post thoughts/ideas with your name within this section. Cheers! -- Zach

Please note that I have made a synthesis of the modeling group's work, which amounts to a documentation of our software and model. It can be found on my [[User_talk:Alex_Lupsasca | talk page]] or directly here:

[http://openwetware.org/images/1/18/Documentation.pdf Documentation & Final Report] -- Alex L

Documentation

Motivation / Project Overview

We proceed to explain the underlying motivation of the modeling project, what our goals were, and how we implemented them.

Classification/Types of Traits

Overview

We proceed to provide an overview of the types of traits we wish to model. In the course of the project, we encountered various types of phenotypic data, each of which is best modeled in a certain way. Below we detail each of these types, provide examples of them, and explain the demands they place on modeling relative to each other. The methods described here were researched in various sources, including An Introduction to Generalized Linear Models.

Continuous (observable)

Continuous data, in general, is any data or information that is measurable on a continuous scale. In the context of something observable, we could consider height a continuous trait, as we could measure the height of all subjects to high precision. However, in practice, most supposedly continuous data is expressed in discrete terms: for instance, we round heights to the nearest inch or weights to the nearest pound. Thus we end up with discrete data, even if it contains many values. Continuous data has different demands in terms of modeling. In particular, it does not lend itself well to logistic regression and instead may require non-linear methods. There is some evidence that artificial neural networks may be useful in dealing with such data.


Discrete/Nominal

Nominal classifications of data are discrete separations of data points based on non-quantitative categories - for instance, the measurement of eye color as blue, brown, or hazel. Often there is more quantitative data underlying such information (to continue with the eye-color example, one could consider pigment counts in the eyes). There are methods for approaching this kind of data, particularly when it is ordinal, i.e. when the categories have a natural order (such as young, middle-aged, old).

Binary

A binary or dichotomous variable is a nominal categorization with only two categories: for instance, contracting a disease or not, alive or dead, male or female. This too intersects with ordinal data, as there can be an inherent order to such classifications.


Appropriate methods for each type

Based on our research, we have methods to approach intersections of these data types. In particular, we consider combinations of explanatory and response variables and provide the theoretically viable methods for each. The first entry corresponds to the explanatory variable and the second to the response variable (taken from Ch. 1 of An Introduction to Generalized Linear Models).

  1. Binary-Binary: logistic regression or log-linear models
  2. Binary-Nominal/Discrete: Contingency tables and log-linear models
  3. Binary-Continuous: t-tests
  4. Nominal-Binary: Generalized logistic regression and log-linear models
  5. Nominal-Nominal: Contingency tables and log-linear models
  6. Nominal-Continuous: Analysis of variance
  7. Continuous-Binary: Dose-response models including logistic regression
  8. Continuous-Continuous: Multiple regression

Types of Models (and evaluations)

We provide brief descriptions of some of the models we both considered and implemented. In each section you will find both a description of the model as well as an evaluation of its strengths and weaknesses.

Linear Regression

Linear regression is the most basic modeling method, in which one considers a model of the form [math]\displaystyle{ y = X\beta + e }[/math], i.e. it models the relationship between a response and a vector of explanatory variables by fitting a linear equation for the response to observed explanatory data. For a more detailed review, please see this explanation. Unfortunately, while simple, this model does not provide a strong framework from which to approach questions of genetic interaction. First, the relationship between genetic factors and phenotype is clearly nonlinear (particularly in cases of dichotomous outcomes such as disease risk), and second, it has no means of measuring interactions.
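As a concrete illustration (not the project's actual code), here is a minimal NumPy sketch of fitting y = Xβ + e by ordinary least squares; the SNP-dosage data and coefficients below are invented for the example.

 import numpy as np
 rng = np.random.default_rng(0)
 X = rng.integers(0, 3, size=(100, 3)).astype(float)     # invented SNP dosages (0/1/2)
 X = np.column_stack([np.ones(len(X)), X])                # add an intercept column
 true_beta = np.array([1.0, 0.5, -0.3, 0.2])              # made-up coefficients
 y = X @ true_beta + rng.normal(scale=0.1, size=len(X))   # y = X*beta + e
 beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)         # ordinary least-squares estimate
 print(beta_hat)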

Logistic Regression

Logistic regression is a parametric statistical method that is often used in the genetic study of diseases. One of its advantages is that it can test for both genetic and environmental predictors, which it relates to the response via the logit function. It is also particularly well suited for binary or dichotomous responses, i.e. whether or not one has a genetic disease. Moreover, it has proven effective in case-control studies. However, it is not without its disadvantages. First, it performs very poorly when faced with dimensionality problems (an issue we will certainly face as we consider more and more factors), and it is also prone to false positives. Finally, it has shown weaknesses in detecting interactions of various effects. For further detail, please refer to this paper on its use in genetic epidemiology and these notes for a rigorous mathematical description. We implemented this regression method and found some limited success; however, at the time of this documentation we had not yet received an appropriate training dataset. Note that if you are dealing with non-dichotomous response variables, you can consider ordinal regression, where an order is imposed on the nominal categories so that you can take the probability of a response falling below a certain threshold (summing probabilities accordingly). Alternatively, you can consider the multinomial method, where you consider relative probabilities of one response versus another but no absolutes (though starting with one absolute probability the others can be calculated).
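To make the logit relationship concrete, here is a minimal sketch of fitting a logistic regression by gradient ascent on the log-likelihood. It is illustrative only (it is not the project's implementation), and the single-SNP data and all names are invented.

 import numpy as np
 def sigmoid(z):
     return 1.0 / (1.0 + np.exp(-z))
 def fit_logistic(X, y, lr=0.1, n_iter=5000):
     """Fit P(y=1) = sigmoid(X beta) by gradient ascent on the log-likelihood."""
     X = np.column_stack([np.ones(len(X)), X])      # intercept term
     beta = np.zeros(X.shape[1])
     for _ in range(n_iter):
         p = sigmoid(X @ beta)
         beta += lr * X.T @ (y - p) / len(y)        # log-likelihood gradient step
     return beta
 # Invented example: one SNP (coded 0/1/2) predicting a binary trait
 rng = np.random.default_rng(1)
 snp = rng.integers(0, 3, size=500).astype(float)
 trait = (rng.random(500) < sigmoid(-1.0 + 0.8 * snp)).astype(float)
 print(fit_logistic(snp.reshape(-1, 1), trait))     # roughly recovers (-1.0, 0.8)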

Non-linear Regression

This method is completely analogous to linear regression, except we no longer constrain the explanatory relationship to be linear and can instead consider any functional form. This has the clear advantage of allowing us to specify possible interactions on which to regress. For the original non-linear model we produced, see User_talk:Zachary_Frankel#Thoughts_on_Today.2FWork_.2F.2F_11-15. However, the unfortunate drawback of this approach is that it is computationally very intensive and requires enormous datasets. Moreover, note that logistic regression models are implicitly non-linear regressions: we move to a linear domain by a suitable transformation and thus allow for non-linear interactions even though we solve the model in a linear fashion.
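The sketch below illustrates the idea rather than reproducing the model from the talk page: a regression whose functional form includes an explicit gene-gene interaction (product) term, fitted with scipy.optimize.curve_fit; the data and coefficients are made up.

 import numpy as np
 from scipy.optimize import curve_fit
 # Hypothetical model with an explicit gene-gene interaction term:
 # y = b0 + b1*x1 + b2*x2 + b3*x1*x2
 def model(X, b0, b1, b2, b3):
     x1, x2 = X
     return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2
 rng = np.random.default_rng(2)
 x1 = rng.integers(0, 3, size=200).astype(float)    # SNP 1 dosage (invented)
 x2 = rng.integers(0, 3, size=200).astype(float)    # SNP 2 dosage (invented)
 y = model((x1, x2), 0.5, 0.2, -0.1, 0.4) + rng.normal(scale=0.1, size=200)
 params, _ = curve_fit(model, (x1, x2), y)           # fit the specified functional form
 print(params)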

Computer Learning and Neural Networks

First, note that all of our models are a form of computer learning: we take dynamic datasets and use them to train the model. However, a more explicit form of this is the use of artificial neural networks to recognize patterns in observed genomic data. These models use an input layer (usually just one), hidden layers, and an output layer. A layer is composed of nodes, and each node connects to the subsequent layer. These connections have assigned weights from which subsequent values are calculated. This is called a feed-forward structure because information flows from the input to the output layer through the nodes of the hidden layers. In this process we use explanatory variable data for the input layer. In training, the network adjusts its weights to reduce error as much as possible. A clear strength of this approach is that it is more general than any particular model, as we can use it to construct equivalent models. For instance, a neural network with a logistic transfer function on one hidden layer can reproduce our logistic regression. Additionally, it does not suffer from the same dimensionality problems as logistic regression. This approach is also more flexible and likely more accurate in general. Moreover, it is better able to detect gene-gene interactions. However, there is not extensive literature on this method; we plan to investigate it further in the future.
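The following toy sketch shows such a feed-forward network with one hidden layer of logistic nodes trained by gradient descent (backpropagation). It is purely illustrative, with an invented non-additive target, and is not the network we would ultimately build.

 import numpy as np
 def sigmoid(z):
     return 1.0 / (1.0 + np.exp(-z))
 rng = np.random.default_rng(3)
 X = rng.integers(0, 3, size=(300, 4)).astype(float)            # 4 invented SNP dosages
 y = ((X[:, 0] * X[:, 1] > 1) ^ (X[:, 2] > 1)).astype(float)    # non-additive toy target
 n_hidden, lr = 8, 0.5
 W1 = rng.normal(scale=0.5, size=(X.shape[1], n_hidden)); b1 = np.zeros(n_hidden)
 W2 = rng.normal(scale=0.5, size=n_hidden); b2 = 0.0
 for _ in range(2000):
     h = sigmoid(X @ W1 + b1)                 # hidden-layer activations
     out = sigmoid(h @ W2 + b2)               # output node
     d_out = out - y                          # cross-entropy gradient at the output
     d_h = np.outer(d_out, W2) * h * (1 - h)  # backpropagated hidden-layer gradient
     W2 -= lr * h.T @ d_out / len(y); b2 -= lr * d_out.mean()
     W1 -= lr * X.T @ d_h / len(y);   b1 -= lr * d_h.mean(axis=0)
 print("training accuracy:", ((out > 0.5) == y).mean())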

Project Code / Implementation

In this section you will find complete documentation, as well as links to source code, for all of the programming we implemented in the context of the project.

Overview of Capabilities / Walkthrough

Data Read In

Model Production

Output to Trait-o-Matic

Tests of Our Model

Future Direction / Other Ideas

Here we discuss various ideas we generated throughout the course which we did not have the time or resources to pursue, as well as directions in which we would like to take the project in the future.

Methods of Training

We see the issue of training data as one of the major hurdles for the project, and just as important as the models themselves. In the long term, we would like to integrate the use of PGP-type data to train these models automatically, rather than requiring uploaded CSV files from biologists. While there is currently not enough data available to allow for meaningful training in this way, we can imagine a process in which:

  1. Many people have their genomes sequenced and also have a phenotypic profile taken of themselves
  2. This information is stored in a publicly (or researcher) accessible database where genomes match up with traits
  3. Whenever a model is being trained for predictive purposes, it identifies the relevant SNPs in the genomes of the database, as well as the corresponding phenotypes, and uses these for training
  4. A model is outputted based on this training data

Such a process seems far more powerful in the long term than manually curated datasets on individual studies.
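The sketch below shows roughly what such an automated training step might look like. Everything in it is hypothetical: fetch_participants(), the data layout, the SNP identifiers, and the trait name are invented placeholders for a shared database interface that does not yet exist.

 import numpy as np
 def build_training_set(participants, relevant_snps, trait_name):
     """Assemble a (genotype matrix, phenotype vector) pair from a shared database."""
     X, y = [], []
     for person in participants:
         if trait_name not in person["phenotypes"]:
             continue                          # skip people without this trait recorded
         X.append([person["genotype"].get(snp, 0) for snp in relevant_snps])
         y.append(person["phenotypes"][trait_name])
     return np.array(X, dtype=float), np.array(y, dtype=float)
 # Hypothetical usage, re-using the logistic fit sketched earlier:
 # X, y = build_training_set(fetch_participants(), ["rs1234", "rs5678"], "disease_status")
 # beta = fit_logistic(X, y)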


Future Models to Include

We plan to include variants of the logistic regression. In particular, we need to provide options for predicting continuous traits by automating the process of thresholding. These thresholded logistic regressions would be flexible enough for continuous and multi-valued nominal traits. In this regard, we would implement a multinomial logistic regression in addition to an ordinal one. We would also like to include a fuzzy c-means clustering model, a strong neural network model, and a classification tree model.
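As a rough sketch of the automated thresholding step (the cutoff rule shown, the sample median, is only an assumption for illustration, as is the made-up height data):

 import numpy as np
 def threshold_trait(values, cutoff=None):
     """Turn a continuous trait into a binary one so the logistic machinery applies."""
     values = np.asarray(values, dtype=float)
     if cutoff is None:
         cutoff = np.median(values)            # assumed default: split at the median
     return (values > cutoff).astype(float), cutoff
 heights = np.array([61.0, 64.5, 67.0, 70.2, 72.5, 69.1])   # made-up heights in inches
 binary_trait, cutoff = threshold_trait(heights)
 print(cutoff, binary_trait)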


Parental Origin of SNPs

A recent paper in Nature, entitled "Parental origin of sequence variants associated with complex diseases," presents a compelling case for the importance of the parental origin of SNPs in determining their effect on complex traits such as disease risk. The examples presented include SNPs that are protective when inherited from your mother but harmful when inherited from your father. For more on this, see Zach's post on the subject. In this regard, we would like to build capabilities into the project in the future to find such associations. This first requires datasets that include parental origin; finding such data would be facilitated by genotyping entire families and examining the children. We could then generalize our methods to consider maternal and paternal SNPs as separate variables, and all of our infrastructure would still work for them (a sketch of such a design matrix follows the list below). Thus we would have:

  1. New datasets with parental information to facilitate training
  2. Prediction would require submitting familial information (parental sequences, perhaps) but would not require a strong departure from our current model in terms of mathematical infrastructure.
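Below is a small sketch of what treating maternally and paternally inherited alleles as separate explanatory variables could look like in the design matrix; the data layout and SNP identifiers are invented.

 import numpy as np
 # Invented per-person records with phased (maternal/paternal) alleles
 people = [
     {"maternal": {"rs100": 1, "rs200": 0}, "paternal": {"rs100": 0, "rs200": 1}},
     {"maternal": {"rs100": 1, "rs200": 1}, "paternal": {"rs100": 1, "rs200": 0}},
 ]
 snps = ["rs100", "rs200"]
 # Each SNP contributes two columns: one for the maternal allele, one for the paternal
 X = np.array([
     [p["maternal"][s] for s in snps] + [p["paternal"][s] for s in snps]
     for p in people
 ], dtype=float)
 print(X)   # columns: rs100_m, rs200_m, rs100_p, rs200_p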

Explanatory Methods

The focus of the project so far has been on predicting phenotype based on multiple genes, but not on understanding the interactions themselves. That is to say, we treat interaction as a black box in our models and attempt to determine what will come out. However, from a biological perspective (in particular if considering treatments), we are interested in how genes interact and what the nature of their epistasis is. This question is addressed in this paper, which is further explored in Zach's post. In the future, we would like to develop methods as in the paper to determine what the interactions actually are (which is what we attempted to do in our first, failed non-linear approach). We could then display information on the Trait-o-Matic page about how various genes interact, and perhaps even visualize it. This would be a useful tool for researchers attempting to understand various pathways, or for people seeking to understand what is actually happening biologically when they see their traits.

Lessons Learned

In this project we learned a great deal not only about statistics and biology, but also about the way they interact. Some key points we would like to pass on to anyone continuing this project or looking into something similar:

  1. Explicit non-linear regressions, while ideal in terms of explanatory power, are not computationally feasible and require datasets which are too large. In particular, the explanatory power we wish to find from such a model exceeds the information going into it.
  2. Logistic regression and neural networks are the two most promising means of predicting phenotypes based on genotypes. Neural networks are likely preferable in the long term, given their flexibility and superior ability to deal with interactions of genes and to avoid false positives (a problem we found in test datasets for the logistic regression).
  3. It is important to approach these problems after considering biological nuances, and different types of data will demand different sorts of models. An appreciation of the biological intricacies is necessary for setting up a framework to distinguish between such cases.
  4. Large and realistic data sets are essential. Without them, more complicated methods of model inference can fail to find even associations that we consider obvious. This includes the presence of noise in simulated data sets: logistic regression actually failed to infer several simple models because of the absence of noise in our data set.

Update // 12-6 (Zach)

Just finished implementing the basics of the logistic regression in Python. A nice feature is that this is all modular, so we can reuse the code quite easily even if our model changes a bit or the way we wish to interface with it is different. We will make sure to provide a rigorous discussion of why we chose the model we did. Clearly this regression is best for binary traits, but it is nonetheless a nice start, I believe. So as of now we have:

  1. A model for continuous traits with explanatory power - this is not implemented but we will provide a discussion of its merits, notes on its implementation, and possible future plans in our writeup
  2. A logistic regression model for binary traits - I will post this code and explicate a bit further in my writeup
  3. A clear extension of the logistic regression into thresholded traits


Hello World // 12-4

Hello all - this is Zach. I have been keeping track of my work on my talk page/the project page but to better organize our efforts leading up to the end of hte course I wanted to make sure we have a page for modeling! I'll post some stuff on here soon, but I wanted a centralized hub.