Harvard:Biophysics 101/2009:Modeling: Difference between revisions
Revision as of 04:04, 18 December 2009
Note on Documentation // 12-11
As I posted on my talk page, I am starting to use this page to store documentation of the work the modeling group has done. I would like to encourage all members of the modeling team to collaboratively edit the documentation I started below. (Also, after we finish the documentation, someone please delete this post from the page.) I have started a skeleton with the sections I believe we should include, so let's all fill them in, and if there are sections to add, by all means do so. Also, if you have any comments on the documentation but do not want to make explicit changes to the documentation below, please post your thoughts and ideas, with your name, within this section. Cheers! -- Zach
Documentation
Motivation / Project Overview
We proceed to explain the underlying motivation of the project of modeling, what our goals were and how we implemented them.
Classification/Types of Traits
Overview
This section gives an overview of the types of traits we wish to model. In the course of the project, we encountered various types of phenotypic data, each of which is best modeled in a certain way. Below we detail each of these types, provide examples of them, and explain the demands they put on modeling relative to each other. Methods described here were researched in various sources, including an introduction to generalized linear models.
Continuous(observable)
Continuous data in general is any data or information that is measurable on a continuous scale. In the context of something observable, we could consider height as a continuous trait, as we could measure the height of all subjects to high precision. However, in practice, most supposedly continuous data is expressed in discrete terms. For instance, we round heights to the nearest inch or weights to the nearest pound. Thus we end up with discrete data, even if it contains many values. Continuous data places different demands on modeling. In particular, a continuous response does not lend itself well to logistic regression and may instead require non-linear methods. There is some evidence that artificial neural networks may be useful in dealing with such data.
Discrete/Nominal
Nominal classifications of data are discrete separations of data points based on non-quantitative categories, for instance the measurement of eye color as blue, brown, or hazel. Often there is more quantitative data underlying such information (to continue with the eye-color example, one could consider pigment counts in the eyes). However, there are methods for approaching this kind of data directly, particularly when it is ordinal, i.e. when the classifications have a natural order (such as young, middle-aged, old).
Binary
A binary or dichotomous variable is a nominal categorization where there are only two categories. For instance, contracting a disease or not contracting a disease, alive or dead, male or female. This too intersects with ordinal data as there can be an inherent order to such classifications.
Appropriate methods for each type
Based on our research, we have methods to approach intersections of these data types. In particular, we consider combinations of explanatory and response variables and list the theoretically viable methods for each. The first entry corresponds to the explanatory variable, and the second corresponds to the response variable (taken from Ch. 1 of an introduction to generalized linear models).
- Binary-Binary: logistic regression or log-linear models
- Binary-Nominal/Discrete: Contingency tables and log-linear models
- Binary-Continuous: t-tests
- Nominal-Binary: Generalized logistic regression and log-linear models
- Nominal-Nominal: Contingency tables and log-linear models
- Nominal-Continuous: Analysis of variance
- Continuous-Binary: Dose-response models including logistic regression
- Continuous-Continuous: Multiple regression
Types of Models (and evaluations)
We provide brief descriptions of some of the models we both considered and implemented. In each section you will find both a description of the model as well as an evaluation of its strengths and weaknesses.
Linear Regression
Linear regression is the most basic modeling method, in which one considers a model of the form [math]\displaystyle{ y = X\beta + e }[/math], i.e. it models the relationship between a response y and a vector of explanatory variables by fitting a linear equation for the response to the observed explanatory data, with e an error term. For a more detailed review please see this explanation. Unfortunately, this model, while simple, does not provide a strong framework from which to approach questions of genetic interaction. First, the relationship between genetic factors and phenotype is clearly nonlinear (particularly in cases of dichotomous results such as disease risk). Second, in this basic additive form it has no means of measuring interactions between factors.
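As a rough illustration of the model form above, here is a minimal sketch of ordinary least squares for a single explanatory variable. It is not the project's code; the function name `fit_simple_linear` and the data (risk-allele counts against a continuous trait) are made up for the example.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares fit of y = b0 + b1 * x + e for one explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: number of risk alleles (0, 1, 2) and a continuous trait value.
allele_counts = [0, 0, 1, 1, 2, 2]
trait_values = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]

b0, b1 = fit_simple_linear(allele_counts, trait_values)
print("intercept:", round(b0, 3), "slope:", round(b1, 3))
```

The fit contains only an intercept and one additive slope, which is why a model of this basic shape says nothing about how two factors act together.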
Logistic Regression
Logistic regression is a parametric statistical method that is often used in the genetic study of diseases. One of its advantages is that it can test for both genetic and environmental predictors, which it relates to the probability of the response via the logit function. It is also particularly well suited to binary or dichotomous responses, i.e. whether or not one has a genetic disease. Moreover, it has been proven effective in case-control studies. However, it is not without its disadvantages. First, it performs very poorly when faced with dimensionality problems (an issue we will certainly face as we consider more and more factors), and it is also prone to false positives. Finally, it has shown weaknesses in detecting interactions of various effects. For further detail please refer to this paper on its use in genetic epidemiology and these notes for a rigorous mathematical description. We implemented this regression method and found some limited success; however, at the time of this documentation we had not yet received an appropriate training data-set. Note that for non-dichotomous response variables there are two options. One is ordinal regression, where the nominal categories are given an order, so that one models the probability of a response falling below a certain threshold (and accordingly sums probabilities). The other is the multinomial method, where one models only the relative probabilities of one response versus another (though, starting from one absolute probability, the others can be calculated).
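The following is a minimal sketch of a logistic regression for a binary response, fitted by plain gradient ascent on the log-likelihood. It is not the group's implementation, which is not reproduced on this page; the function names, the learning rate, the iteration count, and the case/control data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Inverse of the logit function: maps a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, learning_rate=0.1, n_iter=5000):
    """Fit weights by gradient ascent on the mean log-likelihood.

    rows   -- list of feature lists (one list per subject)
    labels -- list of 0/1 responses
    Returns a list of weights; weights[0] is the intercept.
    """
    n_features = len(rows[0])
    weights = [0.0] * (n_features + 1)
    for _ in range(n_iter):
        gradient = [0.0] * (n_features + 1)
        for row, label in zip(rows, labels):
            x = [1.0] + list(row)  # prepend 1.0 for the intercept
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            for j, xi in enumerate(x):
                gradient[j] += (label - p) * xi
        weights = [w + learning_rate * g / len(rows)
                   for w, g in zip(weights, gradient)]
    return weights


def predict_probability(weights, row):
    """Predicted probability that the response is 1 for one subject."""
    x = [1.0] + list(row)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


# Hypothetical case/control data: one predictor (risk-allele count), 0/1 disease status.
rows = [[0], [0], [0], [0], [1], [1], [1], [1], [2], [2], [2], [2]]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0]

weights = fit_logistic(rows, labels)
for count in (0, 1, 2):
    print(count, round(predict_probability(weights, [count]), 3))
```

Because each subject is just a list of features, environmental predictors could be appended as extra columns without changing the fitting code. A production fit would more likely rely on an established statistics library than on hand-written gradient steps.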
Non-linear Regression
This method is completely analogous to linear regression, except that we no longer constrain the explanatory relationship to be linear and can instead consider any functional form. This has the clear advantage of allowing us to specify possible interactions on which to regress. For an example, see the original non-linear model we produced at User_talk:Zachary_Frankel#Thoughts_on_Today.2FWork_.2F.2F_11-15. However, the unfortunate drawback of this approach is that it is computationally very intensive and requires enormous datasets. Moreover, note that logistic regression models are implicitly non-linear regressions. In particular, a suitable transformation (the logit) moves the response into a linear domain, so the model captures a non-linear relationship between predictors and response probability even though we solve it in a linear fashion.
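The transformation into a linear domain can be sketched in a few lines. The coefficients below are hypothetical and chosen only to show that equal steps on the logit scale correspond to unequal steps in probability.

```python
import math


def logit(p):
    """Log-odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(z):
    """Map a linear score back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients for a single predictor x taking values 0..4.
intercept, slope = -2.0, 1.0
linear_scores = [intercept + slope * x for x in range(5)]

# The response probability is a non-linear (S-shaped) function of x ...
probabilities = [inverse_logit(z) for z in linear_scores]
# ... but applying the logit recovers the linear scores.
recovered_scores = [logit(p) for p in probabilities]

score_steps = [b - a for a, b in zip(linear_scores, linear_scores[1:])]
probability_steps = [b - a for a, b in zip(probabilities, probabilities[1:])]

print("steps on the logit scale:", score_steps)
print("steps in probability:", [round(s, 3) for s in probability_steps])
```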
Computer Learning
Project Code / Implementation
In this section you will find complete documentation, as well as links to source code on all of the programming we implemented in the context of the project.
Overview of Capabilities / Walkthrough
Data Read In
Model Production
Output to Trait-o-Matic
Future Direction / Other Ideas
Here we discuss various ideas we generated throughout the course which we did not have either the time or resources to pursue as well as a discussion of various directions which we would like to take the project in the future.
Methods of Training
Future Models to Include
Update // 12-6 (Zach)
Just finished implementing the basics of the logistic regression in Python. A nice feature is that this is all modular, so we can reuse the code quite easily even if our model changes a bit or the way we wish to interface is different. We will make sure to provide a rigorous discussion of why we chose to use the model we did. Clearly this regression is best for binary traits, but it is nonetheless a nice start, I believe. So as of now we have:
- A model for continuous traits with explanatory power - this is not implemented but we will provide a discussion of its merits, notes on its implementation, and possible future plans in our writeup
- A logistic regression model for binary traits - I will post this code and explicate a bit further in my writeup
- A clear extension of the logistic regression into thresholded traits
Hello World // 12-4
Hello all - this is Zach. I have been keeping track of my work on my talk page/the project page, but to better organize our efforts leading up to the end of the course I wanted to make sure we have a page for modeling! I'll post some stuff on here soon, but I wanted a centralized hub.