User talk:Brett Thomas: Difference between revisions

From OpenWetWare



Revision as of 13:54, 5 November 2009

Brett's Impassioned Plea

I think: That helping improve the mechanism for GWAS within the PGP will be more valuable toward discovering gene interactions than the proposed synthetic approach.

Case Study: I think a VERY common mistake with software design is not considering the end use at every step. So, I want to start with a case study of a problem that would be great to solve. Consider it's 2012 and we have 100,000 genomes. One of you has a kid (i.e. not me) and we want to know what the probability is that they develop late-onset Alzheimer's and, more importantly, what you can do to prevent it. Read up on genetics and Alzheimer's here (http://www.nia.nih.gov/Alzheimers/Publications/geneticsfs.htm).

October 13th: My Thoughts On The Project

High Level: I keep returning to the question of how we can use artificial intelligence and user community input to make a gene expression tool self-learning. I think this would add another layer of analysis that traditional gene databases are missing.

The Problem(s): The central mechanism that each of these tools is trying to improve is the following: 1) identify a person's gene; 2) identify what non-genomic information can affect how this gene is expressed; 3) give the person information about future probabilities of outcome. It seems that the resources we've seen try to map step 1) to step 3) and on the whole do a poor job accounting for step 2).

This is actually a combination of two problems. The first is that information is not personalized enough. Companies like 23andMe and Navigenics provide the above diagnostic tool, but it doesn't seem that they ask a person about their lifestyle. They give a bucket of information of the form: "You have gene 1 out of 3. If you have lifestyle A, you'll be susceptible to outcome X; and if you have lifestyle B, you'll be susceptible to outcome Y." It would be better if they provided targeted information of the form: "[First ask user their lifestyle] -> Since you have lifestyle B, you'll be susceptible to outcome Y. If you switch to lifestyle A, you'll transfer to outcome [somewhere between X and Y]." This is a subtle difference, and in these simple examples it doesn't seem important. But as research improves and environmental information becomes more targeted, I propose that users will begin to demand more and more targeted information.
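A minimal sketch of the difference between the two report styles, in Python. The risk table, the genotype label, and the lifestyle labels are all made-up placeholders for illustration; none of this reflects how 23andMe or Navigenics actually work.

```python
# Hypothetical risk table: (genotype, lifestyle) -> probability of the outcome.
# All numbers are invented for illustration only.
RISK = {
    ("variant_1", "A"): 0.10,
    ("variant_1", "B"): 0.30,
}


def bucket_report(genotype):
    """Untargeted report: list every lifestyle and leave the user to pick."""
    lines = []
    for (g, lifestyle), risk in sorted(RISK.items()):
        if g == genotype:
            lines.append(f"If you have lifestyle {lifestyle}, your risk is {risk:.0%}.")
    return lines


def targeted_report(genotype, lifestyle):
    """Targeted report: start from the lifestyle the user reported."""
    current = RISK[(genotype, lifestyle)]
    lines = [f"Since you have lifestyle {lifestyle}, your risk is {current:.0%}."]
    # Only suggest changes that would lower the risk.
    for (g, other), risk in sorted(RISK.items()):
        if g == genotype and other != lifestyle and risk < current:
            lines.append(
                f"Switching to lifestyle {other} could move your risk toward {risk:.0%}."
            )
    return lines
```

The only structural change is that the targeted version takes the user's lifestyle as an input, which is exactly the piece of data the bucket-style report never asks for.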

How can research improve to make this information more targeted? That reveals the second problem: analyzing observational data is crucial. It seems there are two ways to figure out a genetic/environmental determinism: 1) do lab research to figure out the chemical mechanism that's happening in the cell; or 2) do a population study to figure out a correlation and try to determine cause and effect. It seems to me that (2) is much more promising in the near term. But the academic research method is insufficient - it'd take decades to acquire all the information we want through academic papers of the sort referenced in SNPedia.

The Solution(s): So the two problems are: 1) Need more targeted information; and 2) Need a way to expedite gene expression research. I propose that these two problems can be solved together by making a gene expression engine that is self-learning on the environmental observations of users.

The mechanism would work as follows:

  • User's genes are stored in a database along with their observations
  • User is given a secure web portal to record observations when they are triggered
  • Researchers can query this database for anonymized data through an API. This already improves the value of research.
  • In addition to researchers, an engine is built on top of the database that mines data for potentially statistically significant relationships. The standard model is to map two observational data pieces, and see if any genes seem to affect the outcome. For example, the engine could map an obesity question to a diet question, and scan the gene database to see if any genes cause diet to affect obesity differently. Statistically, this engine can follow the logarithmic relationship between sample size and the number of potential outcomes to test that we saw in the last lecture.
  • What data is collected? A research expert is assigned to manage what information is requested from what users. Over time, researchers (and maybe users?) can request new information when they think it could be relevant. It is key that this information be targeted. Going on the previous example, a targeted diet question can be asked of all obese people that identifies diets high in salt/protein/etc. Then, researchers and the engine can go to town trying to identify more causal relationships.
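A rough sketch of the mining step described in the list above, in Python. It assumes each record is a dict with a `genes` mapping (gene name -> carrier or not) plus boolean observation fields; those field names, and the scoring by a simple difference in risk differences, are my own assumptions for illustration. It ranks genes only and does no significance testing, which a real engine would need.

```python
def rate(rows, outcome):
    """Fraction of rows where the outcome is true (None if there are no rows)."""
    if not rows:
        return None
    return sum(1 for r in rows if r[outcome]) / len(rows)


def exposure_effect(rows, exposure, outcome):
    """Risk difference: P(outcome | exposed) - P(outcome | not exposed)."""
    exposed = rate([r for r in rows if r[exposure]], outcome)
    unexposed = rate([r for r in rows if not r[exposure]], outcome)
    if exposed is None or unexposed is None:
        return None
    return exposed - unexposed


def scan_genes(records, exposure, outcome):
    """For each gene, compare the exposure effect in carriers vs. non-carriers."""
    genes = sorted({g for r in records for g in r["genes"]})
    scores = {}
    for gene in genes:
        carriers = [r for r in records if r["genes"].get(gene)]
        others = [r for r in records if not r["genes"].get(gene)]
        a = exposure_effect(carriers, exposure, outcome)
        b = exposure_effect(others, exposure, outcome)
        if a is not None and b is not None:
            scores[gene] = a - b
    # Genes where the exposure matters most differently come first.
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
```

For the obesity/diet example, a call would look like `scan_genes(records, "high_salt_diet", "obese")`, and a gene with a score far from zero is a candidate for the "diet affects obesity differently" relationship.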

Our Project: Obviously this is way beyond the scope of a final project; it is instead a vision of how these things will work in the long run. How could we make something useful and/or insightful from it? I think we can start small and build one of these components. More to come...

Relationships to Current Resources:

Asst. 3: Project Ideas

The concept: One common theme in the resources we've looked at is linking DNA data to personal data. One of the lessons I took from the last discussion was that dynamic data collection in the PGP would allow an important new layer of data analysis. I returned to this idea after looking at the gene identifier sites assigned this week - they were trying to link traits and genes by working around what I see as the most basic way to do this: looking at people's genes and then asking them if they have a certain trait.

Within the Personal Genome Project, I think such a mechanism would work as follows: researchers propose that a certain gene is associated with a certain trait. Researchers pose the question so it can be mapped to a discrete data set, and then send the questions to a targeted set of PGP-ers to get responses.

Implementation: I think this could be implemented as an extension to the PGP site. I think it'd take three infrastructure pieces:

  • Researcher facing engine: a platform that allows researchers to create questionnaires and specify which users they'll go to. Will also email users to say "we want to ask you another question."
  • User facing: a secure site for users to log in to quickly answer questions. Could be an app on the PGP site or standalone, depending on which aligns with the current rules.
  • Data: an extension to the current PGP data storage system to store data that is collected. Also could be directly integrated or a separate relational database with linked tables.
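To make the "separate relational database with linked tables" option concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names are hypothetical and are not taken from the actual PGP storage system; the point is only that each question maps to a discrete set of choices, so responses can be tallied directly.

```python
import sqlite3

# Hypothetical schema: questions have a fixed set of choices, and each
# participant gives at most one response per question.
SCHEMA = """
CREATE TABLE participant (
    id INTEGER PRIMARY KEY
);
CREATE TABLE question (
    id INTEGER PRIMARY KEY,
    text TEXT NOT NULL
);
CREATE TABLE choice (
    id INTEGER PRIMARY KEY,
    question_id INTEGER NOT NULL REFERENCES question(id),
    label TEXT NOT NULL
);
CREATE TABLE response (
    participant_id INTEGER NOT NULL REFERENCES participant(id),
    question_id INTEGER NOT NULL REFERENCES question(id),
    choice_id INTEGER NOT NULL REFERENCES choice(id),
    PRIMARY KEY (participant_id, question_id)
);
"""


def open_store(path=":memory:"):
    """Create the tables in a new database (in memory by default)."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def tally(conn, question_id):
    """Count responses per choice label for one question."""
    rows = conn.execute(
        """
        SELECT choice.label, COUNT(response.participant_id)
        FROM choice
        LEFT JOIN response ON response.choice_id = choice.id
        WHERE choice.question_id = ?
        GROUP BY choice.id
        ORDER BY choice.id
        """,
        (question_id,),
    ).fetchall()
    return dict(rows)
```

The researcher-facing piece would write to `question` and `choice`, the user-facing piece would write to `response`, and linking `participant.id` to the existing genome records is left open here.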

Notes:

  • Accounting for privacy: I think privacy would definitely be the biggest obstacle, particularly if we allow data to be cross tabulated, as many PGP-ers would be easy to uniquely identify.
  • What data to collect: I think the most important initial research would be to identify exactly what data researchers would want to collect. That work
  • API: I think a natural extension is to provide an API for the public to use. This would allow other gene sites to submit a (user, gene, trait) triplet. This would be an extensive undertaking, but may be worthwhile if such a service doesn't already exist.
  • Third party platforms: Another thought is that we could take advantage of a third party platform to create a quick app, like Google Health, HealthVault, or (the company that I worked for this summer) Keas.
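On the privacy note above, one simple way to see the cross-tabulation risk is to count how many participants share each combination of answers. The sketch below is a k-anonymity-style check that I am adding for illustration, not something the PGP is known to use; the field names and the threshold `k` are assumptions.

```python
from collections import Counter


def risky_combinations(rows, fields, k=5):
    """Return answer combinations shared by fewer than k participants."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return {combo: n for combo, n in counts.items() if n < k}
```

Any combination this returns describes so few people that releasing the cross-tabulated cell could identify them, so those cells would need to be suppressed or merged before researchers see them.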

Asst. 2: Modelling Gene Mutations

Here is a link to my code. It's pretty long (I did way too much) so I didn't want to clutter this page.

Asst. 1: Modeling Exponential Growth

I have some experience with Python and Excel, so the programming part of this asst wasn't very time consuming for me. I'm just going to throw out a few random notes:

  • The Model: I was actually pretty confused by this model. These functions are some variant of Current Pop * constant factor. Seems like a more appropriate general model is Current pop to the power of constant factor. I just realized this a couple minutes ago, I'm sure I'll reconcile the difference before class.
  • Slide: Without thinking, I copy/pasted Dr. Church's functions into Excel as written. Then when I did the coding, I took it to mean linear growth with A2 representing the independent variable and A3 representing the output. This was dumb...and made the Python coding like 10X harder too :)
  • Practicality: When I actually understood what we were doing, I was able to analyze the biological component. In short: I really don't think exponential growth is a very practical model on either a population or evolutionary scale.
  • Population: It seems that there have to be thousands of feedback loops when analyzing growth in a population. In the rabbit example, the true growth was probably only exponential for a short time before food enforced a negative feedback. On the other hand, if the first rabbits crowded out competitors, that would have caused a positive feedback. The more I think about such examples, the more I think that exponential growth is more a corner case than a model.
  • Evolution: Exponential growth makes even less sense to me when discussing evolutionary progress, because it seems the evolution of a species would "conquer" the lowest hanging fruit first. What I mean by this is that increases in brain size that were most effective probably came first, and then brain evolution would become subject to diminishing returns. If brain size is an indicator of progress, this would contradict the hypothesis from Slide 10: one has to be wrong.
  • Evolution vs. Technology: Since I'm skeptical of the exponential model of evolution, the analogy to Moore's Law becomes more interesting. Why should evolutionary vs. technological innovation be different? I wish I had the time to give this more thought, and hope we can in class today. One idea is that the pressures are different: transistor technology is measured absolutely, whereas in evolution a relative advantage is probably more important than an absolute advantage. Another is that it is more difficult for evolution to adjust the fundamental building blocks of a species, while Intel can easily switch from silicon to graphite transistors if they can be abstracted to the same old x86 standards.
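To illustrate the Population note above, here is a small Python sketch that compares plain exponential growth (current population times a constant factor) with a logistic variant, where a carrying capacity stands in for the food-driven negative feedback. This is not Dr. Church's function from the slides; the function names, the growth rate, and the capacity are my own assumptions.

```python
def exponential_step(pop, rate):
    """Next population = current population * constant factor."""
    return pop * (1 + rate)


def logistic_step(pop, rate, capacity):
    """Same growth, damped as the population nears the carrying capacity."""
    return pop + rate * pop * (1 - pop / capacity)


def simulate(step, start, generations, **params):
    """Apply a step function repeatedly and return the whole trajectory."""
    pops = [start]
    for _ in range(generations):
        pops.append(step(pops[-1], **params))
    return pops
```

Running `simulate(exponential_step, 2, 50, rate=0.5)` against `simulate(logistic_step, 2, 50, rate=0.5, capacity=100)` shows the two curves tracking each other for the first few generations and then diverging, with the logistic one levelling off near the capacity.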