User:Morgan G. I. Langille/Notebook/Project management: Difference between revisions
From OpenWetWare
Revision as of 11:48, 26 January 2011
Halophiles
Erin and Andrew are getting ortholog sets for all halophile genomes. Need to make a list of things to be done for the Roche genome paper.
- Re-do CRISPR analysis for new NCBI genomes.
- Look at homologs of genes identified in new Science metabolic paper.
Darpa
Manuscripts
- Association paper showing Miguel and Xingpeng methods. Led by Miguel & Xingpeng
- Microbial geographical paper incorporating environment and location. Is it distance or habitat? Using GOS data. Unknown genes? Xingpeng & Morgan
- Unknown genes?
Erebus
Got 3 samples shared on MG-RAST. Need to run them through the Pfam pipeline.
- Downloaded 3 samples, gzipped them, and started run on genbeo using hmmscan.
- sample 1A 132K reads
- sample 3A 500k reads
- sample XB 580k reads
- quick measurement suggests about 30K sequences are matched per day.
- without splitting, the largest sample (580k reads) would take about 20 days!
- Split the samples and started the runs over again.
- re-wrote hmmscan_to_hmmscan.pl
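The run-time estimate and the sample splitting above can be sketched as follows. This is only an illustration in Python, not the script actually used: the function names are hypothetical, the 30K reads/day rate is the quick measurement noted above, and the estimate assumes every chunk runs in parallel at that same rate.

```python
import math

def estimated_days(n_reads, reads_per_day=30_000, n_chunks=1):
    """Wall-clock days for one sample, assuming (hypothetically) that
    all chunks run in parallel and each is processed at reads_per_day."""
    return math.ceil(n_reads / n_chunks) / reads_per_day

def split_fasta(lines, n_chunks):
    """Distribute FASTA records round-robin into n_chunks lists of lines.
    A record is a '>' header line plus the sequence lines that follow it."""
    chunks = [[] for _ in range(n_chunks)]
    record_index = -1
    for line in lines:
        if line.startswith(">"):
            record_index += 1
        if record_index < 0:
            continue  # ignore anything before the first header
        chunks[record_index % n_chunks].append(line)
    return chunks

if __name__ == "__main__":
    # Read counts taken from the sample list above.
    for name, n_reads in [("1A", 132_000), ("3A", 500_000), ("XB", 580_000)]:
        print(name, round(estimated_days(n_reads), 1), "days unsplit")
```

Round-robin assignment keeps the chunks close to equal in record count without needing to know the total number of reads in advance.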
Protein family stuff with Steve
- trying to find data that I deleted...not looking good.
- Need to check the following locations:
- work computer
- old tarball that is being copied to /share/eisen-d1 (this is from Jan, 2010, so not a good option, but would have the main pfam vs GOS dataset)
- pfam vs camera/gos matrix is on darpa wiki
- try to recover manually from old image?
- Need the following (missing) files:
- hmmscan of pfams vs GOS (or "camera") -> this is on darpa wiki
- perl files to convert hmmscan output to a matrix of pfam counts (also needed for the erebus project; maybe rewrite)
- R scripts to calculate correlations and ecological distances from the matrix (also needed for the erebus project)
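As a rough idea of what the missing conversion step does, here is a minimal Python sketch (not the lost perl scripts). It assumes the hmmscan results were saved in HMMER's `--tblout` tabular format, where column 1 is the target (Pfam family), column 3 is the query (read) and column 5 is the full-sequence E-value. The E-value cutoff, the choice to count only the best hit per read, and the function names are all assumptions made for illustration.

```python
from collections import Counter

def pfam_counts(tblout_lines, max_evalue=1e-5):
    """Count reads per Pfam family, keeping only each read's best hit
    (lowest full-sequence E-value) at or below max_evalue."""
    best = {}  # read name -> (evalue, family)
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and tblout comment lines
        fields = line.split()
        family, read, evalue = fields[0], fields[2], float(fields[4])
        if evalue > max_evalue:
            continue
        if read not in best or evalue < best[read][0]:
            best[read] = (evalue, family)
    return Counter(family for _, family in best.values())

def count_matrix(per_sample_counts):
    """Turn {sample: Counter} into (sorted family list, {sample: row}),
    where each row holds the counts in family-list order."""
    families = sorted({f for c in per_sample_counts.values() for f in c})
    rows = {s: [c.get(f, 0) for f in families]
            for s, c in per_sample_counts.items()}
    return families, rows
```

The resulting samples-by-families matrix is the input that the correlation and distance calculations would then work from.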
Rough Ideas
Starting with PFAM counts across all GOS samples
- Looking at samples
- alpha diversity of GOS samples (measure total protein diversity in each sample)
- provide a listing of the most diverse samples and indicate whether those are environmentally related
- beta diversity of GOS samples (are the samples related...presumably yes)
- show a tree and possibly a network describing the relatedness of the samples
- estimate total number of different protein families for each sample and all samples combined using chao index?
- Looking at families
- alpha diversity: which families are the richest (not that interesting) and which are the most diverse (interesting and informative)
- provide list of most diverse families and maybe suggest why those are so diverse?
- beta diversity -> do the groupings tell us anything (e.g. are they similar function, similar localization, etc.)
- map to GO terms to see if similar function
- chao index
- estimate total number of proteins for each family in the ocean (what is the most prevalent)
Collaboration with Steve would be a comparison between diversity measurements using taxonomic vs phylogenetic vs functional approaches
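The diversity measures listed under Rough Ideas could be computed from one row (a sample) or one column (a family) of the PFAM count matrix roughly as below. This is a sketch under assumptions: the notes do not say which indices would be used, so Shannon entropy stands in for alpha diversity, the bias-corrected Chao1 estimator stands in for "chao index", and Bray-Curtis dissimilarity stands in for beta diversity.

```python
import math

def shannon(counts):
    """Shannon diversity (natural log) of a vector of counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def chao1(counts):
    """Bias-corrected Chao1 richness estimate:
    S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)),
    where F1 and F2 are the numbers of singletons and doubletons."""
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors of equal
    length: 0 means identical composition, 1 means nothing shared."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(a) + sum(b)
    return numerator / denominator if denominator else 0.0
```

Pairwise `bray_curtis` values over all samples give the distance matrix from which a tree or network of sample relatedness could be built.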