Harvard:Biophysics 101/2009: Modeling


Note on Documentation // 12-11

As I posted on my talk page, I am starting to use this page to store documentation of the work the modeling group has done. I would like to encourage all members of the modeling team to collaboratively edit the documentation I started below. (Also, after we finish the documentation, someone please delete this post from the page.) I have started a skeleton with the sections I believe we should include, so let's all fill them in, and if there are sections to add, by all means do so. If you have any comments on the documentation but do not want to make explicit changes to the entire documentation below, please post thoughts/ideas with your name within this section. Cheers! -- Zach

Documentation

Motivation / Project Overview

We proceed to explain the underlying motivation of the modeling project, what our goals were, and how we implemented them.

Classification/Types of Traits

Overview

We proceed to provide an overview of the types of traits we wish to model. In the course of the project, we encountered various types of phenotypic data, each of which is best modeled in a certain way. Below we detail each of these types, provide examples of them, and explain the demands they place on modeling relative to each other. The methods described here were researched in various sources, including An Introduction to Generalized Linear Models.

Continuous (observable)

Continuous data, in general, is any data or information that is measurable on a continuous scale. In the context of something observable, we could treat height as a continuous trait, since we could measure the height of all subjects to high precision. In practice, however, most supposedly continuous data is recorded in discrete terms: heights are rounded to the nearest inch, weights to the nearest pound. Thus we end up with discrete data, even if it takes many values. Continuous data places different demands on modeling. In particular, it does not lend itself well to logistic regression and may instead require non-linear methods. There is some evidence that artificial neural networks may be useful in dealing with such data.


Discrete/Nominal

Nominal classifications of data are discrete separations of data points based on non-quantitative categories, for instance classifying eye color as blue, brown, or hazel. Often there is more quantitative data underlying such information (to continue with the eye-color example, one could consider pigment counts in the eyes). There are methods for approaching this kind of data, particularly when it is ordinal, i.e. when the classifications have a natural order (such as young, middle-aged, old).

Binary

A binary or dichotomous variable is a nominal categorization with only two categories: for instance, contracting a disease or not, alive or dead, male or female. This too intersects with ordinal data, as there can be an inherent order to such classifications.


Appropriate methods for each type

Based on our research, we have methods to approach intersections of these data types. In particular, we consider combinations of explanatory and response variables and provide the theoretically viable methods for each. The first entry corresponds to the explanatory variable and the second to the response variable (taken from Ch. 1 of An Introduction to Generalized Linear Models); a small illustrative sketch of this mapping follows the list.

  1. Binary-Binary: logistic regression or log-linear models
  2. Binary-Nominal/Discrete: Contingency tables and log-linear models
  3. Binary-Continuous: t-tests
  4. Nominal-Binary: Generalized logistic regression and log-linear models
  5. Nominal-Nominal: Contingency tables and log-linear models
  6. Nominal-Continuous: Analysis of variance
  7. Continuous-Binary: Dose-response models including logistic regression
  8. Continuous-Continuous: Multiple regression
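
As a small illustrative sketch (in Python; the table and function names here are our own invention, not part of the project code), the pairings above can be encoded as a lookup from variable-type pairs to suggested methods:

  # Hypothetical lookup of (explanatory type, response type) -> suggested methods,
  # following the pairings listed above.
  METHODS = {
      ("binary", "binary"): ["logistic regression", "log-linear models"],
      ("binary", "nominal"): ["contingency tables", "log-linear models"],
      ("binary", "continuous"): ["t-tests"],
      ("nominal", "binary"): ["generalized logistic regression", "log-linear models"],
      ("nominal", "nominal"): ["contingency tables", "log-linear models"],
      ("nominal", "continuous"): ["analysis of variance"],
      ("continuous", "binary"): ["dose-response models", "logistic regression"],
      ("continuous", "continuous"): ["multiple regression"],
  }

  def suggest_methods(explanatory, response):
      """Return the theoretically viable methods for a variable pairing."""
      return METHODS.get((explanatory, response), ["no standard method catalogued"])

  print(suggest_methods("continuous", "binary"))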

Types of Models (and evaluations)

We provide brief descriptions of some of the models we considered and, in part, implemented. In each section you will find both a description of the model and an evaluation of its strengths and weaknesses.

Linear Regression

Linear regression is the most basic modeling method, in which one considers a model of the form [math]\displaystyle{ y = X\beta + e }[/math], i.e. it models the relationship between a response and a vector of explanatory variables by fitting a linear equation for the response to observed explanatory data. For a more detailed review please see this explanation. Unfortunately, this model, while simple, does not provide a strong framework from which to approach questions of genetic interaction. First, the relationship between genetic factors and phenotype is clearly nonlinear (particularly in cases of dichotomous results such as disease risk), and second, it has no means of measuring interactions.
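
For concreteness, here is a minimal sketch of an ordinary least-squares fit of this model in Python using numpy; the data are simulated for illustration and this is not the project's code:

  import numpy as np

  # Toy data: 100 observations, 3 explanatory variables plus an intercept column.
  rng = np.random.default_rng(0)
  X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
  true_beta = np.array([1.0, 0.5, -2.0, 0.0])
  y = X @ true_beta + rng.normal(scale=0.1, size=100)

  # Least-squares estimate of beta: minimizes ||y - X beta||^2.
  beta_hat, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
  print(beta_hat)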

Logistic Regression

Logistic regression is a parametric statistical method that is often used in the genetic study of diseases. One of its advantages is that it can test for both genetic and environmental predictors, which it relates to the response via the logit function. It is also particularly well suited to binary or dichotomous responses, i.e. whether or not one has a genetic disease. Moreover, it has proven effective in case-control studies. However, it is not without its disadvantages. First, it performs very poorly when faced with dimensionality problems (an issue we will certainly face as we consider more and more factors), and it is also prone to false positives. Finally, it has shown weaknesses in detecting interactions of various effects. For further detail please refer to this paper on its use in genetic epidemiology and these notes for a rigorous mathematical description. We implemented this regression method and found some limited success; however, at the time of this documentation we had not yet received an appropriate training data-set. If you are dealing with non-dichotomous response variables, one can consider ordinal regression, where the nominal categories are given an order and one models the probability of a response falling below a certain threshold (summing probabilities accordingly), or the multinomial method, where one considers relative probabilities of one response versus another rather than absolutes (though starting from one absolute probability the others can be calculated).
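
To make the fitting procedure concrete, here is a minimal sketch of a logistic regression fit by Newton-Raphson in numpy, on simulated data; this is an illustration only, not the implementation referred to above:

  import numpy as np

  def fit_logistic(X, y, iterations=25):
      """Fit P(y=1) = 1 / (1 + exp(-X beta)) by Newton-Raphson."""
      beta = np.zeros(X.shape[1])
      for _ in range(iterations):
          p = 1.0 / (1.0 + np.exp(-X @ beta))        # predicted probabilities
          W = p * (1.0 - p)                          # per-observation weights
          gradient = X.T @ (y - p)
          hessian = X.T @ (X * W[:, None])
          beta = beta + np.linalg.solve(hessian, gradient)
      return beta

  # Simulated binary response driven by two explanatory variables.
  rng = np.random.default_rng(1)
  X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
  true_beta = np.array([-0.5, 1.0, 2.0])
  y = (rng.random(500) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

  print(fit_logistic(X, y))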

Non-linear Regression

This method is completely analogous to linear regression, except that we no longer constrain the explanatory relationship to be linear and can instead consider anything. This has the clear advantage of allowing us to specify possible interactions on which to regress; for an example, see the original non-linear model we produced at User_talk:Zachary_Frankel#Thoughts_on_Today.2FWork_.2F.2F_11-15. The unfortunate drawback of this approach is that it is computationally very intensive and requires enormous datasets. Moreover, note that the logistic regression models are implicitly non-linear regressions: we move to a linear domain by a suitable transformation and thus allow for non-linear interactions even though we solve the model in a linear fashion.
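
As an illustration only, a hypothetical non-linear model with a multiplicative interaction between two explanatory variables could be fit with scipy's curve_fit; the functional form below is invented for the example and is not the model linked above:

  import numpy as np
  from scipy.optimize import curve_fit

  def model(X, a, b, c, d):
      """Hypothetical non-linear form: response depends on x1, x2, and their interaction."""
      x1, x2 = X
      return a + b * x1 + c * x2 + d * x1 * x2

  # Simulated data containing a genuine x1*x2 interaction.
  rng = np.random.default_rng(2)
  x1 = rng.normal(size=200)
  x2 = rng.normal(size=200)
  y = 0.5 + 1.0 * x1 - 0.3 * x2 + 2.0 * x1 * x2 + rng.normal(scale=0.1, size=200)

  params, covariance = curve_fit(model, (x1, x2), y)
  print(params)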

Computer Learning and Neural Networks

First, note that all of our models are a form of computer learning: we take dynamic datasets and use them to train the model, and hence this is a form of learning. A more explicit form, however, is the use of artificial neural networks to recognize patterns in observed genomic data. In these models we use an input layer (usually just one), hidden layers, and an output layer. A layer is composed of nodes, and each node connects to the subsequent layer. These connections have assigned weights from which subsequent values are calculated. This is called a feed-forward structure because information flows from the input to the output layer through the nodes of the hidden layer. In this process we use explanatory variable data for the input layer. In training, the network adjusts its weights to reduce error as much as possible. A clear strength of this approach is that it is more general than any particular model, as we can use it to construct equivalent models; for instance, a neural network with a logistic transfer function on one hidden layer is the same as our logistic regression. Additionally, it does not suffer from the same dimensionality problems as the logistic regression. This approach is also more flexible and likely more accurate in general, and it is better able to detect gene-gene interactions. However, there is not extensive literature on this method; we plan to investigate it more in the future.
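
Below is a minimal sketch of the structure just described: one hidden layer with logistic (sigmoid) transfer functions, trained by gradient descent on simulated data with an XOR-style interaction (which a single logistic regression cannot represent). This is illustrative only, not code the project has adopted:

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  # Simulated binary response with an XOR-like interaction between two inputs.
  rng = np.random.default_rng(3)
  X = rng.integers(0, 2, size=(400, 2)).astype(float)
  y = np.logical_xor(X[:, 0] > 0.5, X[:, 1] > 0.5).astype(float).reshape(-1, 1)

  # One hidden layer of 4 nodes, one output node; weights initialised at random.
  W1 = rng.normal(scale=0.5, size=(2, 4)); b1 = np.zeros(4)
  W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)
  rate = 0.5

  for _ in range(5000):
      # Feed-forward pass: input layer -> hidden layer -> output layer.
      hidden = sigmoid(X @ W1 + b1)
      output = sigmoid(hidden @ W2 + b2)
      # Back-propagate the squared error and adjust weights to reduce it.
      d_output = (output - y) * output * (1 - output)
      d_hidden = (d_output @ W2.T) * hidden * (1 - hidden)
      W2 -= rate * hidden.T @ d_output / len(X); b2 -= rate * d_output.mean(axis=0)
      W1 -= rate * X.T @ d_hidden / len(X);      b1 -= rate * d_hidden.mean(axis=0)

  print(np.round(output[:4].ravel(), 2))  # predictions for the first few subjects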

Project Code / Implementation

In this section you will find complete documentation, as well as links to source code, for all of the programming we implemented in the context of the project.

Overview of Capabilities / Walkthrough

Data Read In

Model Production

Output to Trait-o-Matic

Future Direction / Other Ideas

Here we discuss various ideas we generated throughout the course which we did not have the time or resources to pursue, as well as directions in which we would like to take the project in the future.

Methods of Training

Future Models to Include

Update // 12-6 (Zach)

Just finished implementing the basics of the logistic regression in Python. A nice feature is that this is all modular, so we can reuse the code quite easily even if our model changes a bit or the way we wish to interface with it is different (a hypothetical sketch of such an interface follows the list below). We will make sure to provide a rigorous discussion of why we chose the model we did. Clearly this regression is best for binary traits, but it is nonetheless a nice start, I believe. So as of now we have:

  1. A model for continuous traits with explanatory power - this is not implemented but we will provide a discussion of its merits, notes on its implementation, and possible future plans in our writeup
  2. A logistic regression model for binary traits - I will post this code and explicate a bit further in my writeup
  3. A clear extension of the logistic regression into thresholded traits
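
For reference, here is a hypothetical sketch (names and structure invented for illustration, not the actual project code) of what a modular trait-model interface of the sort described above could look like:

  import numpy as np

  class LogisticTraitModel:
      """Hypothetical sketch of a swappable model object: train on explanatory
      data and binary responses, then score new subjects."""

      def __init__(self, iterations=25):
          self.iterations = iterations
          self.beta = None

      def train(self, X, y):
          X = np.column_stack([np.ones(len(X)), X])      # add intercept column
          self.beta = np.zeros(X.shape[1])
          for _ in range(self.iterations):               # Newton-Raphson updates
              p = 1.0 / (1.0 + np.exp(-X @ self.beta))
              gradient = X.T @ (y - p)
              hessian = X.T @ (X * (p * (1 - p))[:, None])
              self.beta += np.linalg.solve(hessian, gradient)
          return self

      def predict(self, X):
          X = np.column_stack([np.ones(len(X)), X])
          return 1.0 / (1.0 + np.exp(-X @ self.beta))    # probability of the trait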


Hello World // 12-4

Hello all - this is Zach. I have been keeping track of my work on my talk page/the project page, but to better organize our efforts leading up to the end of the course I wanted to make sure we have a page for modeling! I'll post some stuff on here soon, but I wanted a centralized hub.