Harvard:Biophysics 101/2007/Project: Difference between revisions

From OpenWetWare
(→‎Disease-Host Coevolution: added ideas for data input and analysis)
Revision as of 07:52, 6 March 2007

Biophysics 101: Genomics, Computing, and Economics


Project Ideas

Project ideas that came up in class on February 22 are posted [here]

Application1 - ApoE

Alzheimer's disease

Input Data

ApoE sequences

Data Characterization and analysis

Identify variation in the ApoE sequences, then search OMIM for similar variation and its relationship to disease
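As a minimal sketch of the first step, assuming the ApoE sequences have already been aligned to a reference of equal length, the variation could be listed position by position. The function name `find_variants` and the toy sequences are made up for illustration (they are not real ApoE data), and the OMIM search is left out because it depends on access we have not set up.

```python
def find_variants(reference, sample):
    """Return (position, ref_base, sample_base) for each mismatch.

    Assumes both sequences are pre-aligned and of equal length;
    positions are 1-based.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]


# Toy sequences for illustration only, not real ApoE data.
reference = "ATGCGTACGT"
sample = "ATGTGTACGC"
variants = find_variants(reference, sample)
```

Each tuple in `variants` could then serve as a query term when looking for matching variation in OMIM.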

Action

Suggest clinical testing actions and lifestyle changes

Identifying Common Genetic Motifs in Disease

We can write a script to interface all input genotypes with phenotypes for disease (note: we don't specifically have to look for motifs common to disease, but that seems pretty practical to me. Any phenotype will do, though).

Input Data

Since this script would theoretically cross-reference genotype and phenotype, we would need:

  • Genotypic Inputs (presumably in the form of personalized genome sequences)
  • Phenotypic Inputs (presumably this would take the form of a medical history for the corresponding genome sequence)

Data Characterization and analysis

I think we could design an algorithm to go through and scan for varying numbers of motifs of varying lengths found in specific population subsets, but absent in others. Are there any significant patterns found in a diseased group of people? Significant motifs present in sick populations? Significant motifs absent?

We will certainly have to perform quality control, and perhaps we can model cystic fibrosis, color blindness, sickle cell, etc. to optimize our detection methods.
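A rough sketch of the scan described above, assuming the simplest possible definition of "found in one subset but absent in others": a fixed-length motif (k-mer) that appears in every sequence of the diseased group and in none of the control group. The function names and the toy sequences are hypothetical, and a real version would need statistics rather than an all-or-nothing rule.

```python
from collections import Counter


def motif_carrier_counts(sequences, k):
    """Count, for each k-mer, how many sequences contain it at least once."""
    counts = Counter()
    for seq in sequences:
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        counts.update(kmers)
    return counts


def group_specific_motifs(cases, controls, k):
    """Return k-mers present in every case sequence and in no control."""
    case_counts = motif_carrier_counts(cases, k)
    control_counts = motif_carrier_counts(controls, k)
    return sorted(
        motif
        for motif, n in case_counts.items()
        if n == len(cases) and control_counts[motif] == 0
    )


# Toy data for illustration only, not real genotypes.
cases = ["ACGTTGCA", "TTACGTTG"]
controls = ["ACGATGCA", "TTACGATG"]
specific = group_specific_motifs(cases, controls, k=4)
```

Running the same scan over a range of `k` values would cover the "varying lengths" part of the idea, and swapping the two groups would find motifs that are absent in the sick population.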

Action

How can we use these data to help people? Any identified motifs could certainly direct our research efforts, implicating new sites and players in the molecular mechanisms of disease. I'm a little confused by the recommendation on the project page of 'medical/dietary action'. Certainly we could use our data to inform someone of their risk for disease (note: this information could also be abused). Perhaps that would better inform their lifestyle choices? Prevention is an ideal solution to disease, but for the inevitable genetic ones, we would direct research towards therapy and subversion of the identified molecular mechanisms.

Mapping the Natural Genetic Library

Input Data

I am imagining a future where a billion-dollar project can sequence a billion genomes. As of now, this kind of data is limited.

  • Whole genomic data of organisms, spanning the whole tree of life
  • Localization data
    • What phage goes with which cell/organism, and the genomes for both of them
    • Geographic localization: the source of the genome (extra-terrestrial?)
    • Information about the genome source

Data Characterization and analysis

Think of the whole genomic space. It can be viewed as an infinite-dimensional space of natural numbers (mod 4), with one coordinate per base. In nature, however, very few of these points actually occur, and furthermore, they all had to 'evolve' from a common ancestor. Thus, provided we pick the right metric, the space is somewhat continuous - the natural library was not designed, but evolved. Most of this 'continuity' has been provided by the classical evolutionary mechanisms we've known about, but some similarity and convergence is due to similar environmental pressures 'directing' evolution in certain ways, and some is a result of the newly emerging concept of horizontal transfer of genomic information. We would like to know which subset of the whole genomic space actually occurs in nature, and what forces constrained nature to explore that subset and not others. Are there repeating patterns, limited by chemistry, physics, or biological constraints? Are there particular bottlenecks, or design constraints?

  • For example, a design constraint for a brain is coming up with changes one mutation at a time while keeping the brain operational. I don't remember where I've heard the analogy, but it is similar to changing wheels on a moving car. Similar arguments go for the eye, etc.
  • Data analysis would involve a range of tools for visualization, statistics, clustering, and comparison
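To make the "numbers (mod 4)" picture concrete, here is a small sketch that encodes each base as an integer 0 to 3 and builds a pairwise distance matrix. Hamming distance on equal-length sequences is only a stand-in here; choosing the right metric is exactly the open question above. The mapping `BASE_TO_INT` and the toy sequences are assumptions made for illustration.

```python
# Arbitrary assignment of bases to the integers mod 4 (an assumption).
BASE_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode(seq):
    """Map a nucleotide string to a list of integers in 0..3."""
    return [BASE_TO_INT[base] for base in seq]


def hamming(u, v):
    """Number of coordinates at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(1 for a, b in zip(u, v) if a != b)


def distance_matrix(seqs):
    """Pairwise Hamming distances between encoded sequences."""
    vectors = [encode(s) for s in seqs]
    return [[hamming(u, v) for v in vectors] for u in vectors]


# Toy "genomes" for illustration only.
genomes = ["ACGT", "ACGA", "TCGA"]
matrix = distance_matrix(genomes)
```

A matrix like this is the usual input to the clustering and visualization tools mentioned in the last bullet.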

Action

We could use this for some remarkable applications

  • Synthetic biology: Having an idea of what the 'natural library' looks like, we can make connections and information transfer between nodes of the space that otherwise would not have communicated. In other words, we can short-cut nature because we do not have to obey the design constraints mentioned above. The natural library would give us a better understanding of what we are doing (we can literally map our new 'synthetic' contribution), and also be a tremendous inspiration (we can copy tricks from nature)
  • It would clarify relationships between many organisms, why we evolved the way we did, and so on. It is the ultimate 'comparative genomics' platform -- mapping the whole genomic space of life.


BioWeather(ish): Influenza

Wouldn't it be cool if we could track mutations in influenza viruses and determine when and where they occurred, what virulence changes resulted, and what mutations (and properties of spread of these mutated viruses) are likely to occur in the future, as our algorithm is updated with real-time epidemiological information?

Input Data

Sequences from Influenza A pages, and possibly real-time WHO data on (roughly characterized, if not genomically so) strains and spread rates.

Data Characterization and analysis

  • These are just some qualities we could analyze, but...
    • What are the differences in sequence, as tracked over:
      • Time
      • Location
    • What are the physical meanings of those differences (i.e., protein changes)?
    • What changes, if any, are predictable in certain regions?
    • (Note: I think this paper is a significant contribution to the influenza field and could inform this project -- CSN)
      • I'm currently unsure, but I've heard that extra-virulent strains may occur when there's a mix of:
        • Different human strains, or
        • Human strains and animal (bird, pig, etc.?) strains
        • Different animal strains
      • So perhaps we may spot where spatial distances between strains are getting smaller and make hypotheses about new hybrid strain creation (and virulence) based on that
    • Real-time inputs will allow prediction of spread characteristics as well as, possibly, predictions of virulence
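The "differences in sequence, as tracked over time and location" question could start from something like the sketch below. It assumes aligned sequences of equal length, uses the earliest isolate as the reference, and reports substitutions in the usual `A7G` style. The record layout, the function names, and the isolates themselves are hypothetical; none of this is real surveillance data.

```python
def substitutions(reference, seq):
    """List substitutions as strings like 'A7G' (1-based positions)."""
    return [
        f"{r}{i + 1}{s}"
        for i, (r, s) in enumerate(zip(reference, seq))
        if r != s
    ]


def track_changes(isolates):
    """Compare every isolate with the earliest one, in order of year.

    Each isolate is a dict with 'year', 'location' and 'sequence' keys;
    sequences are assumed to be aligned and of equal length.
    """
    ordered = sorted(isolates, key=lambda rec: rec["year"])
    reference = ordered[0]["sequence"]
    return [
        (rec["year"], rec["location"],
         substitutions(reference, rec["sequence"]))
        for rec in ordered
    ]


# Made-up isolates for illustration only.
isolates = [
    {"year": 2005, "location": "site B", "sequence": "ATGAAGGCA"},
    {"year": 2003, "location": "site A", "sequence": "ATGAAGACA"},
    {"year": 2006, "location": "site C", "sequence": "ATGCAGGCA"},
]
history = track_changes(isolates)
```

Translating each substitution into its protein-level effect would be the natural next step for the "physical meaning" question.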

Action

If we can predict where some particularly virulent strain will hit, and what its genomic characteristics will be, perhaps we can avert it with vaccines or quarantine measures.

Graphics-based visualization of polymorphism data

Inspired by the brilliant work at gapminder.org, which is looking at data of a different sort.

Input Data

HapMap data, genomic data, etc.

Data Characterization and analysis

Visualize data as data points in two- or three-dimensional space, and then, using a combination of graphics and genomics algorithms, process this data to find points of interest. For example, haplotypes could be plotted with loci along one axis, individuals on another, and some other factor on a third. Recombination frequency data could be gathered for various SNPs, for example, and any that stand out as compared with a theoretical model would be points of interest. Individual genomes could be binned by some sort of graphic algorithm that orders people along the axis in such a way as to minimize chaos.
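One possible reading of "orders people along the axis in such a way as to minimize chaos" is that neighbouring rows of the plot should carry similar haplotypes. The sketch below uses a greedy nearest-neighbour ordering under that assumption; it is a heuristic, not an optimal ordering, and the function names and the 0/1 allele strings are invented for illustration.

```python
def hamming(a, b):
    """Number of loci at which two haplotype strings differ."""
    return sum(x != y for x, y in zip(a, b))


def order_individuals(haplotypes):
    """Greedy ordering that places similar individuals next to each other.

    haplotypes maps an individual's name to a string with one allele
    character per locus. The ordering starts from the alphabetically
    first name and repeatedly appends the closest remaining individual.
    """
    remaining = sorted(haplotypes)
    order = [remaining.pop(0)]
    while remaining:
        last = haplotypes[order[-1]]
        nearest = min(
            remaining, key=lambda name: hamming(last, haplotypes[name])
        )
        remaining.remove(nearest)
        order.append(nearest)
    return order


# Toy haplotypes for illustration only (not HapMap data).
haplotypes = {
    "ind1": "00110",
    "ind2": "11001",
    "ind3": "00111",
    "ind4": "11000",
}
ordering = order_individuals(haplotypes)
```

The resulting `ordering` would define the row order of the individuals axis before the points are drawn.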

Action

Through this analysis, we should be able to gain an understanding of which alleles 'work together'; not only would this help elucidate certain protein-protein interactions, we would also be able to locate in each personal genome potentially hazardous combinations of alleles, etc. and suggest therapeutic methods to address the phenotypes that result.

Agroplanning

The idea of this project is to determine the optimal usage of land and resources to grow grasses that can be used as a source of ethanol for alternative energy. Although the technology is not there yet, eventually it should be possible to convert cellulose to ethanol efficiently; when that occurs, prairie grasses have high potential to be a much more viable ethanol source than conventional sources such as corn. However, different grasses require different growing conditions and have different theoretical ethanol yields; a challenge will be to optimize growth areas and conditions.

Bkg: [1]

Input Data

  • Grass types
  • Land information, such as population density / soil pH / weather / etc.

Data Characterization and analysis

  • Figure out the amount of ethanol theoretically possible to be produced per grass
  • Create a model which maximizes ethanol production from grass while minimizing land usage
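A very simplified sketch of such a model: plant each parcel with whichever grass yields the most there, then use the highest-yielding parcels first until a production target is met, so that as little land as possible is used. The parcel names, grass names, and yield numbers are all made up for illustration, and the real model would need the soil, weather, and population inputs listed above.

```python
def plan_planting(parcels, target_litres):
    """Greedy plan that meets a production target with minimal land.

    parcels maps a parcel name to (area_ha, {grass: litres_per_ha}).
    Returns a list of (parcel, grass, hectares_used).
    """
    best = []
    for name, (area, yields) in parcels.items():
        grass = max(yields, key=yields.get)  # best grass for this parcel
        best.append((yields[grass], name, grass, area))
    best.sort(reverse=True)  # most productive land first

    plan, remaining = [], target_litres
    for per_ha, name, grass, area in best:
        if remaining <= 0 or per_ha <= 0:
            break
        used = min(area, remaining / per_ha)
        plan.append((name, grass, used))
        remaining -= used * per_ha
    if remaining > 1e-9:
        raise ValueError("target exceeds what the available land can produce")
    return plan


# Hypothetical parcels and yields, for illustration only.
parcels = {
    "north": (10.0, {"grass_a": 4000.0, "grass_b": 5000.0}),
    "south": (20.0, {"grass_a": 3000.0, "grass_b": 2000.0}),
}
plan = plan_planting(parcels, target_litres=80000.0)
```

Because a parcel may be used partially, taking land in order of yield per hectare gives the smallest total area for a given target under these assumptions.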

Action

  • Output is the above model, which can be used for agroplanning

Disease-Host Coevolution

Our pathogens live in a genetically determined environment - us. Their genetic polymorphisms will be shaped as a response to that environment. By correlating the genomes of pathogen and host, it may be possible to identify new loci involved in pathogenesis and disease resistance, and potential new drug targets.

Input Data

  • Unfortunately, this is the limiting factor - anybody have a good source?
  • This could be an excellent tool for the coming flood of genomes, given we could correlate pathogen genomes with personal genomes
  • For now, we could just use HapMap and the geographic localization of diseases?

Data Characterization and analysis

  • For a simple case, we could imagine taking a known resistance trait - like CCR5 for HIV - and identifying the adaptations that allow HIV to infect resistant hosts.
  • With more data from host genomes, we could find new resistance traits.
  • A graphics-based visualization of polymorphism data (idea 5) could be helpful. We could plot mutant strain on one axis, individual on another, etc.
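As a first pass at correlating pathogen and host genomes, one could tabulate how often a given pathogen variant is found in hosts with a given genotype and compute an odds ratio. The sketch below assumes the data arrive as (host genotype, pathogen variant) pairs, one per infected individual; the labels and counts are invented for illustration, and 0.5 is added to every cell as a common continuity correction so that empty cells do not cause a division by zero.

```python
from collections import Counter


def cooccurrence_table(pairs):
    """Count how often each (host_genotype, pathogen_variant) pair occurs."""
    return Counter(pairs)


def odds_ratio(table, host, variant):
    """Odds of `variant` in hosts with genotype `host` vs. all other hosts.

    Adds 0.5 to each cell of the 2x2 table before taking the ratio.
    """
    a = b = c = d = 0
    for (h, v), n in table.items():
        if h == host and v == variant:
            a += n  # this host genotype, this variant
        elif h == host:
            b += n  # this host genotype, other variants
        elif v == variant:
            c += n  # other hosts, this variant
        else:
            d += n  # other hosts, other variants
    a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return (a * d) / (b * c)


# Invented observations for illustration only.
pairs = (
    [("resistant", "escape")] * 3
    + [("resistant", "wildtype")] * 1
    + [("susceptible", "escape")] * 2
    + [("susceptible", "wildtype")] * 6
)
table = cooccurrence_table(pairs)
ratio = odds_ratio(table, "resistant", "escape")
```

A ratio well above 1 would flag the variant as a candidate adaptation to that host genotype, to be checked with a proper significance test once real data are available.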

Action

  • Suggest strongest candidates for empirical testing.
  • If targets are confirmed, investigate for potential as drug target, ease of evasion by pathogen, level in population, etc.