Harvard:Biophysics 101/2009/Infrastructure

==Infrastructure Background==
===Infrastructure Talk===
Update: Trait-O-Matic now has an icon!


This is a listing of background information for those interested in Trait-O-Matic.

==The Model Paradigm==

===Tasks===
After realizing the difficulty of creating new models purely from data mining and analysis, and after talking with the modeling group, we decided on a more standardized approach. Joe Torella had suggested that we manually curate models from the literature and incorporate them into Trait-o-matic's output. This idea coalesced into a concrete vision of how it could actually be implemented:
 
# Users would be able to upload models that they have written in Python. This is similar to the approach of Google App Engine [http://code.google.com/appengine/], which allows end users to upload small web services, written in Python or Java, to be run on Google's servers. To keep their servers secure, Google disables many modules of the Python interpreter that runs those files. Similarly, I plan to install a separate Python 2.6 installation with the same modules disabled. ''RestrictedPython'' will be used to implement this: it appears to be a reasonably secure Python environment for executing untrusted code [http://pypi.python.org/pypi/RestrictedPython/].
# These models will present a unified programming environment to anyone wishing to code up a literature model. For example (as described below), the particular SNPs used in the calculation will be listed in comments at the top of the file and read into a database when the model .py file is uploaded to the server. The server then knows in what order to pass SNPs into the model. We plan to pass the SNP values as command-line arguments, in the order they are listed in the header of the .py file, so that the values are available to the user's code (a sketch of such a model script appears after this list).
## There is some ambiguity about how we will end up passing values. There are two options, binary and standard: binary just indicates whether the person has a particular allele, while standard passes the actual homozygous/heterozygous genotype of the SNP.
# The "hypothesis" tab will present a way to create a "barebones" model, or just exporting an excel file containing particular values. This barebones model will be a way for more novice programmers to hit the ground running, by presenting to them the variables they need as already declared and commented!
# This model paradigm can also be used with Zach's parametrized model. Although that model will have to be parametrized on another dataset for now, since our dataset is so small, it can be loaded in as a regular model. Eventually, the "learning" and "modeling" components will connect back to each other, as the model's logistic regression takes data from the T-o-M dataset itself and then becomes available as a usable, continuously updating model for unknown traits.
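Below is a minimal sketch of what such an uploaded model script might look like. The header fields, the per-allele weights, and the additive "genetic score" are illustrative assumptions rather than any published model; the only behavior taken from the plan above is that the server passes one value per header rsid, in order, as command-line arguments.

<code>
#!/usr/bin/env python
# Hypothetical uploaded model script (sketch only).
# rsid:1799884
# rsid:1260326
# mode: binary
# output: toy additive genetic score

import sys

# Illustrative per-allele weights, one per rsid listed in the header (made up).
WEIGHTS = [0.17, 0.06]

def main(argv):
    # In "binary" mode each argument is 1 (risk allele present) or 0 (absent),
    # passed in the same order as the rsid lines in the header.
    values = [int(v) for v in argv[1:1 + len(WEIGHTS)]]
    score = sum(w * v for w, v in zip(WEIGHTS, values))
    sys.stdout.write("genetic score: %.3f\n" % score)

if __name__ == "__main__":
    main(sys.argv)
</code>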
 
==Model Standard==
 
This is the standard we are looking to use for user-uploaded custom models. These values should all appear as comments at the top of the Python file. We may add fields in the future, so do not take this as a final format yet.
<code>
rsid:1799884
rsid:1260326
rsid:560887
rsid:10830963
trait: Diabetes Mellitus, Non-Insulin-Dependent
source: Reiling et al. 2009, Dutch New Hoorn Study
pmid: 10841471, 98242742        <--- these are any PubMed articles used to make the file. You can also index these on separate lines
model: Genetic Score
model author: Your Name, Your Partner's Name
mode: binary OR standard        <--- what is passed into the model. Binary = 1 0 1 0, standard = AA , TA , GT, etc..
output: Relative risk of disease by age 53, compared to the general population
notes: Genotype cannot on its own predict diabetes risk, which varies substantially with age, race, BMI and family history.
</code>
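As a rough illustration of how the server might read this header when a model file is uploaded, here is a small sketch. It assumes the fields above appear as "#"-prefixed comment lines and that the header ends at the first non-comment line; the function name and those details are assumptions, not existing Trait-o-matic code.

<code>
import re

def parse_model_header(path):
    """Collect header fields (rsid, trait, mode, ...) from leading comment lines."""
    fields = {"rsid": []}
    for line in open(path):
        line = line.strip()
        if not line.startswith("#"):
            break  # assume the header ends at the first non-comment line
        m = re.match(r"#\s*([\w ]+?)\s*:\s*(.+)", line)
        if not m:
            continue
        key, value = m.group(1).lower(), m.group(2).strip()
        if key == "rsid":
            fields["rsid"].append(value)
        else:
            fields[key] = value
    return fields
</code>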
 
There are many categories of models that could be created: for example, models for the pharmacogenomics of particular drugs, for disease traits, for descriptive traits, and so on.
 
==Possible Tasks==
Right now, we envision a Trait-o-Matic output that
 
(I) has all the data from the various databases together (no OMIM/snpedia/Other tabs)
 
(II) has all the phenotypes associated with a certain more general condition (for example, height) grouped under one tab with the appropriate label (and perhaps, eventually, with a model-generated quantification)
 
(III) under the tab, has the same output as is currently available, plus extra: perhaps (A) a paragraph automatically extracted from appropriate literature, (B) information on the numerical output of other possible models, (C) an interface for manual curation/statistical evaluation of the current models
 
To accomplish these goals, we suggest the following short-term tasks:
 
 
1. Think of new tasks!
 
2. Look at the databases (OMIM, SNPedia, Morbid Map, etc.) that are stored in the MySQL databases for Trait-o-Matic, and try to think of ways (probably building on/incorporating Fil's UMLS method) of organizing them further
 
3. Continue to document Trait-o-Matic core (Alex should have a basic schema map done by the end of this weekend, but the core needs more documenting too)
 
4. Building on (2), begin to look at modifications that will need to be made in the core database-to-JSON python files, in order to integrate the new classification data we will have organized
 
5. Build in a quick and easy way to dump the final JSON output before it goes to the front-end display (just to play around with, reroute to a model, etc.; see the sketch after this list)
 
6. Continue general familiarization with Trait-o-Matic
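For task 5, the kind of hook we have in mind might look like the following minimal sketch: intercept the final results structure just before it is handed to the front end and write it to disk for experimentation. The function name and dump path are placeholders, not existing Trait-o-Matic code.

<code>
import json

def dump_results_json(results, path="/tmp/tom_results_dump.json"):
    """Write the final results structure to disk before it reaches the front end."""
    with open(path, "w") as f:
        json.dump(results, f, indent=2)
    return path
</code>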
 
==Notes==
*Front end structure
Just before this post I wasted a little bit of time making a dumb diagram of the Trait-o-Matic front end:
 
[[Image:tomfrontendmap.jpg|600px|T-o-M Front End VERY SIMPLE JSON retrieval only]]
 
Obviously this ignores everything except the simple JSON retrieval process.  Also I include my sloppy and almost as simple notes [[Media:tom_notes.txt|here.]]  See my talk page for more thoughts "[[User:Alexander J. Ratner|Alexander J. Ratner]] 01:25, 23 November 2009 (EST)"


*Text Classification
*NOTE 11/16 FZ: I found a really useful NIH tool called the Unified Medical Language System (UMLS) [http://www.nlm.nih.gov/research/umls/ here]. I am quickly learning how to use its interface, as it already contains all of the data stored in MeSH and ICD, among many others, and it has a way of hierarchically organizing search terms. My work for the next few days will therefore be exploring how to store these categorizations in the phenotype data that t-o-m outputs, and how to display results according to them. I already understand how results are displayed (see below for results.php), but I still have to see how the data structures (JSON files) generating those results are produced.
--I already have MeSH parsed into a MySQL database, and I am currently working on ICD-10 as a supplemental hierarchy of diseases [http://apps.who.int/classifications/apps/icd/icd10online/]. As soon as this is ready, the data will be applied to the classification.



*Reference Extraction

http://incubator.apache.org/pdfbox/ is a Java-based PDF text extractor that we can use to extract paragraphs from references so that we can then display them together with traits. I will show a demonstration of text extraction in class today!

==Overview==

*Formats

1. GFF format: "General Feature Format" is a file format for encoding feature information (starts, splice sites, stops, motifs, exons, introns, protein domains, etc.). It claims to "aim for a low common denominator" in terms of the amount of genomic data represented. The format is a list of tab-separated fields, one record per line, each describing a gene feature; the fields are, in order (a minimal parsing sketch appears after this list):


a. Name of the sequence being referred to (allows multiple genes/sequences to be referenced in one file)

b. Source of the sequence

c. Feature; standard set defined at http://www.ebi.ac.uk/embl/WebFeat/

d. Start (integer)

e. End (integer)

f. Score (floating-point value)

g. Strand (+, -, or .)

h. Frame (0, 1, 2, or .)

i. Attributes (separated by semicolons), again based on the above standard set

j. # is for comments; ## is for certain types of meta-data.


2. affx or "affy" format: from Affymetrix genotyping arrays

3. FASTQ format: text-based format for storing both sequence and quality scores

4. FASTA format: text-based format for storing genetic sequence with minimal comments

5. JSON format: simple structured text format used for printing/transferring result tables

'''[[User:Alexander J. Ratner|Alexander J. Ratner]] 14:04, 17 November 2009 (EST)'''
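As a quick illustration of the GFF record layout described above, a minimal parser for one tab-separated GFF line might look like the following sketch (field names follow the list above; this is not code from the Trait-o-Matic core):

<code>
# Sketch: split one GFF line into its nine standard fields.
GFF_FIELDS = ["seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "attribute"]

def parse_gff_line(line):
    """Return a dict for a single GFF record, or None for comment/meta-data lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):  # '#' comments, '##' meta-data
        return None
    record = dict(zip(GFF_FIELDS, line.split("\t")))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    if record["score"] != ".":
        record["score"] = float(record["score"])
    return record
</code>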

== /home/trait/core ==

* '''/home/trait/core/affx_500k_to_gff.py'''

- outputs [http://www.sanger.ac.uk/Software/formats/GFF/ GFF] records for each SNP in an Affymetrix 500k GeneChip file

* '''/home/trait/core/cgi_to_gff.py'''

- outputs a GFF record for each entry in the Complete Genomics csv file

* '''/home/trait/core/codon.py'''

- codon_123(input) -- returns a three-letter amino acid abbreviation given a single-letter code input

- codon_321(input) -- returns a single-letter code given a three-letter amino acid abbreviation
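These conversions are simple table lookups; a sketch of what such helpers might look like (not the actual codon.py source) is:

<code>
# Sketch: single-letter <-> three-letter amino acid conversion tables.
AA_1_TO_3 = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "E": "Glu", "Q": "Gln", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
AA_3_TO_1 = dict((three, one) for one, three in AA_1_TO_3.items())

def codon_123(one_letter):
    return AA_1_TO_3[one_letter.upper()]

def codon_321(three_letter):
    return AA_3_TO_1[three_letter.capitalize()]
</code>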

* '''/home/trait/core/config.py'''

- contains the configuration for t-o-m, such as passwords for databases and the like

* '''/home/trait/core/fastq_to_fasta.py'''

- strips quality data from a [http://en.wikipedia.org/wiki/FASTQ_format FASTQ] format file and returns just the [http://en.wikipedia.org/wiki/FASTA_format FASTA] format
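As an illustration of that conversion, a minimal sketch (assuming simple four-line FASTQ records; this is not the actual script) might be:

<code>
# Sketch: drop quality information from a four-line-per-record FASTQ file.
def fastq_to_fasta(fastq_path, fasta_path):
    out = open(fasta_path, "w")
    for i, line in enumerate(open(fastq_path)):
        if i % 4 == 0:        # '@name' header becomes '>name'
            out.write(">" + line[1:])
        elif i % 4 == 1:      # the sequence line is kept
            out.write(line)
        # the '+' separator and quality lines are dropped
    out.close()
</code>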

* '''/home/trait/core/gff_concordancy.py'''

- inputs two lists of GFF-containing files and outputs the concordance between the two in a tabular file

* '''/home/trait/core/gff_dbsnp_query.py'''

- appends dbSNP information to db_xref (or, for GFF3, Dbxref) attributes

* '''/home/trait/core/gff_hgmd_map.py'''
* '''/home/trait/core/gff_intersect.py'''

- outputs the intersection of two GFF files, with attributes taken from the first

* '''/home/trait/core/gff_morbid_map.py'''
* '''/home/trait/core/gff_nonsynonymous_filter.py'''
* '''/home/trait/core/gff_omim_map.py'''
* '''/home/trait/core/gff_pharmgkb_map.py'''
* '''/home/trait/core/gff_snpedia_map.py'''
* '''/home/trait/core/gff_sort.pl'''

- sorts a GFF file (by feature length, minScore, maxScore, or custom expression)

* '''/home/trait/core/gff_subtract.py'''
* '''/home/trait/core/gff_twobit_map.py'''
* '''/home/trait/core/gff_twobit_query.py'''
* '''/home/trait/core/hapmap_load_database.py'''
* '''/home/trait/core/json_allele_frequency_query.py'''
* '''/home/trait/core/json_to_job_database.py'''
* '''/home/trait/core/maq_snp_to_gff.py'''

- outputs a GFF record for each entry in a Maq SNP file

* '''/home/trait/core/omim_print_variants.py'''
* '''/home/trait/core/setup.py'''
* '''/home/trait/core/snpedia.py'''

- outputs tab-separated variant information (into data/snpedia.txt) for each entry in SNPedia

* '''/home/trait/core/snpedia_print_genotypes.py'''

- goes through snpedia.txt and prints out the associated genotypes (found in the snp19 database)

* '''/home/trait/core/snpinduse_to_gff.py'''
* '''/home/trait/core/trait-o-matic-server.py'''
* '''/home/trait/core/venter_gff_snp_to_gff.py'''
* '''/home/trait/core/warehouse.py'''
* '''/home/trait/core/watson_gff_to_gff.py'''
* '''/home/trait/core/yh_gff_snp_to_gff.py'''

== /home/trait/data ==

This folder stores all of the raw data used by the application.


== /home/trait/www/system/application/controllers/ ==

* results.php -- this controls the loading of results into a data array

== /home/trait/www/system/application/models/ ==

== /home/trait/www/system/application/views/ ==

* results.php

- this controls how the results are printed on the page; the table here is sorted by Sortable.js

== /home/trait/www/scripts ==

* contains the JavaScript scripts that control the behavior of the webpage (such as dropdowns, etc.)

* Sortable.js -- an implementation of [http://tetlaw.id.au/view/blog/table-sorting-with-prototype/ this] table sorting script

==Database Contents==

===Ariel===

* ''jobs'' contains each processed genome's information

* ''files'' contains the locations of each processed genome's temporary files (in /tmp/...)

- id / path / kind (genotype, phenotype, omim, hgmd, morbid, snpedia, pharmgkb) / job

* ''users'' contains the usernames, password hashes, and emails of users who have submitted jobs

===Caliban===

==Revision Control==

==Joining==

* Acquire a Harvard Ethics Training in Human Research (HETHR) certificate by completing the training at [4]. This should take around 2 hours; you have to read 6 of the required sections and 4 of the "electives". Email your certificate to Sasha (awaitz@post.harvard)
* Sign up for access to the control panel at [5]
* Create an RSA ssh public key, which will allow the server to authenticate you. This is done by using the command

ssh-keygen -t rsa

Ensure that your private and public keys are in your .ssh or .openssh directories. Otherwise, ssh will not know where to look for them.

* Upload your public key to the control panel
* Follow the directions on the front page of the control panel. They will tell you to edit your .ssh/config file by adding:

 Host *.oxf
     ProxyCommand ssh -p2222 turnout@switchyard.oxf.freelogy.org -x -a -o "Compression no" $SSH_PROXY_FLAGS %h
     User <YOUR_USERNAME>

* You can now ssh to your node by following the directions on the front page of the control panel
* To set up Trait-o-matic, follow the directions at [6]