Harvard:Biophysics 101/2009/Infrastructure: Difference between revisions

Revision as of 12:06, 17 November 2009

Infrastructure Background

This is a listing of background information for those interested in Trait-O-Matic

Tasks

Text Classification

NOTE 11/16 FZ: I found a really cool NIH tool called the Unified Medical Language System (UMLS) here. I am quickly learning how to use its interface, as it already contains all of the data stored in MeSH and ICD, among many others, and it has a way of hierarchically organizing search terms. My work for the next few days will therefore be exploring how to store these categorizations in the phenotype data that t-o-m outputs and how to display the results according to them. I already understand how results are displayed (see results.php below), but I still have to see how the data structures (JSON files) behind those results are generated.

--I already have MeSH parsed into a MySQL database, and I am currently working on ICD-10 as a supplemental hierarchy of diseases [[1]]. As soon as this is ready, the data will be applied to the classification.

Reference Extraction

http://incubator.apache.org/pdfbox/ is a Java-based PDF text extractor that we can use to pull paragraphs out of references so that we can display them together with traits. I will show a demonstration of text extraction in class today!

Overview

Formats

1. GFF format: “General Feature Format” is a file format for encoding feature information (starts, splice sites, stops, motifs, exons, introns, protein domains, etc.). It claims to ‘aim for a low common denominator’ in terms of the amount of genomic data represented. Each line describes one feature using nine tab-separated fields, in the following order:

     a. Name of the sequence being referred to (allows for multiple genes/sequences to be referenced in one file)

     b. Source of the sequence

     c. Feature; standard set defined at http://www.ebi.ac.uk/embl/WebFeat/

     d. Start (integer number)

     e. End (integer number)

     f. Score (floating point value)

     g. Strand (+, - or .)

     h. Frame (0, 1, 2, or .)

     i. Attributes (semicolon-separated), again based on the standard set above

     Lines beginning with # are comments; lines beginning with ## carry certain types of meta-data.

2. affx format or ‘affy’ format: from Affymetrix genotyping (SNP) arrays

3. FASTQ format: text-based format for storing both sequence and quality scores

4. FASTA format: text-based format for storing genetic sequence with minimal comments

5. JSON format: JavaScript Object Notation, a lightweight text format for printing/transferring structured data such as tables

Alexander J. Ratner 14:04, 17 November 2009 (EST)
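The GFF field order listed under Formats above can be sketched as a small Python parser. This is only an illustration of the format, not code from Trait-O-Matic; the names GffRecord and parse_gff_line are hypothetical, and the sketch assumes that a "." in the score column means "no score".

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    """One GFF line, with fields in the order the format defines them."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the file has "." (assumption)
    strand: str             # "+", "-" or "."
    frame: str              # "0", "1", "2" or "."
    attributes: str         # raw attribute text, left unparsed here


def parse_gff_line(line: str) -> Optional[GffRecord]:
    """Parse one line; return None for blank lines and # / ## lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError("expected at least 8 tab-separated fields")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    # The attribute column is treated as optional in this sketch.
    attributes = fields[8] if len(fields) > 8 else ""
    return GffRecord(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),
        strand, frame, attributes,
    )


# Made-up example line, for illustration only.
record = parse_gff_line("chr1\tdbSNP\tSNP\t1000\t1000\t.\t+\t.\talleles A/G")
```

Keeping the attribute column as raw text avoids guessing at how a particular GFF version quotes or separates its attribute values.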

/home/trait/core

/home/trait/core/affx_500k_to_gff.py

- outputs GFF records for each SNP in an Affymetrix 500k Genechip file

/home/trait/core/cgi_to_gff.py

- outputs a GFF record for each entry in a Complete Genomics CSV file

/home/trait/core/codon.py

- codon_123(input) -- returns a three letter amino acid abbreviation given a single letter code input

- codon_321(input) -- returns a single letter code given a three letter amino acid abbreviation
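A minimal sketch of what these two helpers could look like is given below. It is not the contents of codon.py; it only assumes the behavior described above, covers the 20 standard amino acids, and chooses (as an assumption) to accept input in any letter case.

```python
# Single-letter code -> three-letter abbreviation for the 20 standard amino acids.
_ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}
# Reverse lookup built from the table above.
_THREE_TO_ONE = {three: one for one, three in _ONE_TO_THREE.items()}


def codon_123(code: str) -> str:
    """Return the three-letter abbreviation for a single-letter code."""
    return _ONE_TO_THREE[code.upper()]


def codon_321(abbreviation: str) -> str:
    """Return the single-letter code for a three-letter abbreviation."""
    return _THREE_TO_ONE[abbreviation.capitalize()]
```

In this sketch an unknown code raises KeyError; the real module may handle stop codons or unknown residues differently.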

/home/trait/core/config.py

- contains the configuration for t-o-m, such as passwords for databases and the like

/home/trait/core/fastq_to_fasta.py

- strips quality data from a FASTQ format file and returns just the FASTA format
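The conversion can be sketched as follows. This is not the script itself; it is a simplified illustration that assumes every FASTQ record occupies exactly four lines (header, sequence, "+" separator, quality), which is not guaranteed for all FASTQ files. The function name matches the script only for readability.

```python
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines (header, sequence) for each four-line FASTQ record."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("FASTQ header must start with '@': %r" % header)
        sequence = next(it, None)
        separator = next(it, None)   # the '+' line, discarded
        quality = next(it, None)     # the quality string, discarded
        if sequence is None or separator is None or quality is None:
            raise ValueError("truncated FASTQ record: %r" % header)
        yield ">" + header[1:]
        yield sequence.rstrip("\n")


# Made-up record, for illustration only.
fasta_lines = list(fastq_to_fasta(["@read1\n", "ACGT\n", "+\n", "IIII\n"]))
```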

/home/trait/core/gff_concordancy.py

- inputs two lists of GFF-containing files and outputs the concordance between the two in a tabular file

/home/trait/core/gff_dbsnp_query.py

- appends dbSNP information to db_xref (or, in GFF3, Dbxref) attributes

/home/trait/core/gff_hgmd_map.py
/home/trait/core/gff_intersect.py

- outputs the intersection of two GFF files, with attributes taken from the first
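One way to picture the intersection is sketched below. The matching rule is an assumption, since the script's actual rule is not described here: two records are taken to intersect when they share a sequence name and their start/end ranges overlap. Records are plain (seqname, start, end, attributes) tuples, and the function name intersect is hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, int, int, str]  # (seqname, start, end, attributes)


def intersect(first: Iterable[Record], second: Iterable[Record]) -> List[Record]:
    """Return records of `first` that overlap any record of `second`.

    Output records come unchanged from `first`, so their attributes do too.
    """
    # Group the second file's intervals by sequence name.
    ranges: Dict[str, List[Tuple[int, int]]] = {}
    for seqname, start, end, _attributes in second:
        ranges.setdefault(seqname, []).append((start, end))

    result = []
    for record in first:
        seqname, start, end, _attributes = record
        # Inclusive coordinates: ranges overlap if each starts before the other ends.
        if any(start <= other_end and other_start <= end
               for other_start, other_end in ranges.get(seqname, [])):
            result.append(record)
    return result
```

The linear scan per record keeps the sketch short; sorted input or an interval index would be the usual choice for genome-sized files.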

/home/trait/core/gff_morbid_map.py
/home/trait/core/gff_nonsynonymous_filter.py
/home/trait/core/gff_omim_map.py
/home/trait/core/gff_pharmgkb_map.py
/home/trait/core/gff_snpedia_map.py
/home/trait/core/gff_sort.pl

- sorts a GFF file (by feature length, minScore, maxScore, or custom expression)

/home/trait/core/gff_subtract.py
/home/trait/core/gff_twobit_map.py
/home/trait/core/gff_twobit_query.py
/home/trait/core/hapmap_load_database.py
/home/trait/core/json_allele_frequency_query.py
/home/trait/core/json_to_job_database.py
/home/trait/core/maq_snp_to_gff.py

- outputs a GFF record for each entry in a Maq SNP file

/home/trait/core/omim_print_variants.py
/home/trait/core/setup.py
/home/trait/core/snpedia.py

- Outputs tab-separated variant information (into data/snpedia.txt) for each entry in SNPedia

/home/trait/core/snpedia_print_genotypes.py

- Goes through snpedia.txt and prints out the associated genotypes (found in the snp19 database)

/home/trait/core/snpinduse_to_gff.py
/home/trait/core/trait-o-matic-server.py
/home/trait/core/venter_gff_snp_to_gff.py
/home/trait/core/warehouse.py
/home/trait/core/watson_gff_to_gff.py
/home/trait/core/yh_gff_snp_to_gff.py

/home/trait/data

This folder stores all of the raw data used by the application

/home/trait/www/system/application/controllers/

results.php -- this controls the loading of results into a data array!

/home/trait/www/system/application/models/

/home/trait/www/system/application/views/

results.php

- this controls how the results are printed on the page. The table here is sorted by Sortable.js

/home/trait/www/scripts

Contains the JavaScript scripts that control the behavior of the webpage (such as dropdowns, etc)

Sortable.js -- an implementation of this table sorting script

Database Contents

Ariel

jobs contains each processed genome's information

files contains the locations of each processed genome's temporary files (in /tmp/...)

- id / path / kind (genotype, phenotype, omim, hgmd, morbid, snpedia, pharmgkb) / job

users contains the usernames, password hashes, and emails of users who have submitted jobs

Caliban

Revision Control

Joining

Acquire a Harvard Ethics Training in Human Research (HETHR) certificate by completing the training at [2]. This should take around 2 hours; you have to read 6 of the required sections and 4 of the "electives". Email your certificate to Sasha (awaitz@post.harvard).

Sign up for access to the control panel at [3]

Create an RSA ssh public key. This will allow the server to authenticate you. This is done by using the command

ssh-keygen -t rsa

Ensure that your private and public keys are in your .ssh or .openssh directories. Otherwise, ssh will not know where to look for them.

Upload your public key to the control panel

Follow the directions on the front page of the control panel. They will tell you to edit your .ssh/config file by adding:

Host *.oxf
  ProxyCommand ssh -p2222 turnout@switchyard.oxf.freelogy.org -x -a -o "Compression no" $SSH_PROXY_FLAGS %h
  User <YOUR_USERNAME>

You can now ssh to your node by following the directions on the front page of the control panel

To set up trait-o-matic, follow the directions at [4]
