User:Morgan G. I. Langille/Notebook/Project management: Difference between revisions

From OpenWetWare

Revision as of 16:10, 25 January 2011

Halophiles

Erin and Andrew are getting ortholog sets for all halophile genomes. Need to make a list of things to be done for the Roche genome paper.

  1. Re-do CRISPR analysis for new NCBI genomes.
  2. Look at homologs of genes identified in new Science metabolic paper.

Darpa

Manuscripts

  1. Association paper showing Miguel and Xingpeng methods. Led by Miguel & Xingpeng
  2. Microbial geographical paper incorporating environment and location. Is it distance or habitat? Using GOS data. Unknown genes? Xingpeng & Morgan
  3. Unknown genes?

Erebus

  • Got 3 samples shared on MG-RAST. Need to run them through the Pfam pipeline.
  • Downloaded the 3 samples, gzipped them, and started a run on genbeo using hmmscan.
  • sample 1A: 132k reads
  • sample 3A: 500k reads
  • sample XB: 580k reads
  • A quick measurement suggests about 30k sequences are being matched per day.
    • Without splitting, a single sample will take about 20 days!
    • Split the samples and started the runs over again.
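
The run-time estimate above can be checked with a little arithmetic, and the same numbers suggest how many pieces each sample should be split into. The sketch below is only illustrative: the read counts and the 30k/day rate come from the notes above, while the 2-day target, the function names, and the round-robin splitting scheme are assumptions, not what was actually run on genbeo.

```python
import math

def days_to_finish(n_reads, reads_per_day):
    """Estimated wall-clock days for one unsplit hmmscan run."""
    return n_reads / reads_per_day

def chunks_needed(n_reads, reads_per_day, target_days):
    """Number of equal chunks so each chunk finishes within target_days."""
    return math.ceil(n_reads / (reads_per_day * target_days))

def split_fasta(lines, n_chunks):
    """Distribute FASTA records round-robin into n_chunks lists of lines."""
    chunks = [[] for _ in range(n_chunks)]
    record = -1
    for line in lines:
        if line.startswith(">"):
            record += 1          # a header line starts a new record
        if record >= 0:          # skip anything before the first header
            chunks[record % n_chunks].append(line)
    return chunks

# Read counts and rate from the notes; TARGET_DAYS is an assumed goal.
samples = {"1A": 132_000, "3A": 500_000, "XB": 580_000}
RATE = 30_000
TARGET_DAYS = 2

for name, n_reads in samples.items():
    print(name,
          round(days_to_finish(n_reads, RATE), 1), "days unsplit;",
          chunks_needed(n_reads, RATE, TARGET_DAYS), "chunks")
```

Each chunk can then be submitted as its own hmmscan job, so the total time is set by the slowest chunk rather than by the whole sample.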

Protein family stuff with Steve

  • trying to find data that I deleted...not looking good.
  • Need to check the following locations:
    • work computer
    • old tarball that is being copied to /share/eisen-d1 (this is from Jan, 2010, so not a good option, but would have the main pfam vs GOS dataset)
    • pfam vs camera/gos matrix is on darpa wiki
    • try to recover manually from old image?
  • Need the following (missing) files:
  1. hmmscan of Pfams vs GOS (or "CAMERA") -> this is on the DARPA wiki
  2. Perl files to convert hmmscan output to a matrix of Pfam counts (needed for the Erebus project as well; maybe rewrite)
  3. R scripts to calculate correlations and ecological distances from the matrix (needed for the Erebus project as well)
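
If the conversion scripts in item 2 do need rewriting, the core step is small. The following is a minimal sketch, not a reconstruction of the lost Perl code. It assumes hmmscan was run with `--tblout`, where the whitespace-delimited columns start with target name, target accession, query name, query accession, and full-sequence E-value. Keeping only the best hit per read and the `1e-5` cutoff are illustrative assumptions, and the function names are hypothetical.

```python
from collections import Counter

def count_pfam_hits(tblout_lines, max_evalue=1e-5):
    """Count Pfam families in one sample, using the best hit per read.

    Assumes hmmscan --tblout lines: column 1 is the Pfam family (target),
    column 3 is the read (query), column 5 is the full-sequence E-value.
    """
    best = {}  # read name -> (evalue, family)
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        family, read, evalue = fields[0], fields[2], float(fields[4])
        if evalue > max_evalue:
            continue  # assumed significance cutoff
        if read not in best or evalue < best[read][0]:
            best[read] = (evalue, family)
    return Counter(family for _, family in best.values())

def build_matrix(counts_by_sample):
    """Turn {sample: Counter} into (families, samples, rows of counts)."""
    families = sorted({f for c in counts_by_sample.values() for f in c})
    samples = sorted(counts_by_sample)
    rows = [[counts_by_sample[s].get(f, 0) for s in samples] for f in families]
    return families, samples, rows
```

The resulting rows (families by samples) can be written out as a tab-delimited file for the R scripts in item 3.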

Rough Ideas

Starting with PFAM counts across all GOS samples

  • Looking at samples
    • alpha diversity of GOS samples (measure total protein diversity in each sample)
      • provide a listing of the most diverse samples and indicate whether those are environmentally related
    • beta diversity of GOS samples (are the samples related...presumably yes)
      • show a tree and possibly a network describing the relatedness of the samples
    • estimate total number of different protein families for each sample and all samples combined using the Chao index?
  • Looking at families
    • alpha diversity. Which families are the richest (not that interesting) and which are the most diverse (interesting and informative)?
      • provide list of most diverse families and maybe suggest why those are so diverse?
    • beta diversity -> do the groupings tell us anything (e.g. are they similar function, similar localization, etc.)
      • map to GO terms to see if similar function
    • Chao index
      • estimate total number of proteins for each family in the ocean (what is the most prevalent)

Collaboration with Steve would be a comparison of diversity measurements based on taxonomic vs. phylogenetic vs. functional data.