User:Morgan G. I. Langille/Notebook/Project management: Difference between revisions

From OpenWetWare

Revision as of 15:42, 31 January 2011

Halophiles

Need to make a list of things to be done for the Roche genome paper.

  1. Organize files for Roche genomes and new NCBI completed halophile genomes
  2. Run Mugsy on all genomes
  3. Re-do CRISPR analysis for new NCBI genomes
  4. Look at homologs of genes identified in new Science metabolic paper

DARPA

  • phone call on Friday

Erebus

  • Need to think about ways to identify Pfams whose counts differ from each other and from those in whole genomes.
    • Take Pfam counts from all completed genomes, get a distribution, then ask whether a single count is normal or not, taking multiple-testing correction into account.
    • Do we see pathways that are over/under-represented that are not expected based on:
      • genomes that are predicted from the metagenomics sample by taxon assignment (e.g. MEGAN, AMPHORA, etc.). This lets us know if something is missing/different from the information provided by looking at only taxonomy assignment.
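
One way the "is a single count normal" question could be sketched: compute an empirical two-sided p-value for each Pfam against its counts in the completed genomes, then adjust across Pfams with Benjamini-Hochberg. Both choices are assumptions (the notes do not name a test or a correction method), and the function names are hypothetical. Raw counts would probably need normalizing by genome or sample size first, which is not shown.

```python
from typing import Dict, List


def empirical_two_sided_p(observed: float, reference: List[float]) -> float:
    """Empirical two-sided p-value of `observed` against reference counts."""
    n = len(reference)
    lower = sum(1 for x in reference if x <= observed)
    upper = sum(1 for x in reference if x >= observed)
    # Add-one smoothing so the p-value is never exactly zero.
    p = 2.0 * (min(lower, upper) + 1) / (n + 1)
    return min(1.0, p)


def benjamini_hochberg(pvalues: Dict[str, float]) -> Dict[str, float]:
    """Benjamini-Hochberg adjusted p-values, keyed like the input."""
    items = sorted(pvalues.items(), key=lambda kv: kv[1])
    m = len(items)
    adjusted: Dict[str, float] = {}
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        name, p = items[rank - 1]
        running_min = min(running_min, p * m / rank)
        adjusted[name] = running_min
    return adjusted
```

Pfams whose adjusted value falls below a chosen threshold would be the candidates for "unusual" counts.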

Pfam Subtraction Pipeline

  1. Obtain taxon assignments for metagenomics sample
  2. Retrieve taxon id from name
  3. Look up Pfam assignments for each taxon (pre-calculated) and multiply by the number of times that taxon was assigned
  4. Somehow scale taxon assignments if they seem too large (this might happen with SEED or MEGAN predictions where each protein is counted as a taxon hit)
  5. Subtract those Pfams from total metagenomic Pfam counts
  6. Look at leftover Pfams and see what is interesting
  7. Possibly search through pre-computed genome Pfam profiles to find a genome with similar Pfam composition
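
Steps 3 to 6 above could be sketched roughly as follows. The dictionary layouts, the function names, and the single global `scale` factor are all assumptions for illustration; the notes leave the scaling method open.

```python
from collections import Counter
from typing import Dict


def expected_pfam_counts(
    taxon_hits: Dict[int, int],
    genome_pfams: Dict[int, Dict[str, int]],
    scale: float = 1.0,
) -> Counter:
    """Pfam counts expected from taxon assignments alone (step 3).

    taxon_hits: taxon id -> number of assignments in the sample.
    genome_pfams: taxon id -> pre-calculated Pfam counts for that genome.
    scale: hypothetical down-weighting factor for inflated hits (step 4).
    """
    expected: Counter = Counter()
    for taxon_id, hits in taxon_hits.items():
        profile = genome_pfams.get(taxon_id)
        if profile is None:
            continue  # no pre-calculated profile for this taxon
        for pfam, count in profile.items():
            expected[pfam] += count * hits * scale
    return expected


def subtract_pfams(
    observed: Dict[str, float], expected: Dict[str, float]
) -> Dict[str, float]:
    """Observed minus expected count for every Pfam seen in either (step 5)."""
    pfams = set(observed) | set(expected)
    return {p: observed.get(p, 0) - expected.get(p, 0) for p in pfams}
```

Large positive leftovers would be Pfams not explained by the taxonomy; negative leftovers would be Pfams the assigned genomes predict but the sample lacks (step 6).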

Protein family stuff with Steve

  • Correlate the different sample similarities obtained using taxon vs Pfam bc
  • Need to re-write R scripts to calculate correlations and ecological distances from a matrix (this is needed for the Erebus project as well).
    • Look at online notebook for some code, as well as the DARPA wiki (for the correlation stuff). I emailed Steve asking for his R scripts.
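
A minimal sketch of the distance and correlation calculation, assuming "bc" means Bray-Curtis dissimilarity and that a plain Pearson correlation between the two sets of pairwise distances is what is wanted. It is written in Python rather than R purely for illustration, and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List


def bray_curtis(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Bray-Curtis dissimilarity between two count profiles."""
    keys = set(a) | set(b)
    num = sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)
    den = sum(a.get(k, 0) + b.get(k, 0) for k in keys)
    return num / den if den else 0.0


def pairwise_distances(samples: Dict[str, Dict[str, float]]) -> List[float]:
    """Distances for every sample pair, in sorted-name order."""
    names = sorted(samples)
    return [bray_curtis(samples[x], samples[y]) for x, y in combinations(names, 2)]


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (sx * sy)
```

Calling `pearson(pairwise_distances(taxon_counts), pairwise_distances(pfam_counts))` on the same set of sample names would give the correlation between the two views. Pairwise distances are not independent, so judging significance would need a permutation approach such as a Mantel test, which is not shown here.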


Rough Ideas

Starting with PFAM counts across all GOS samples

  • Looking at samples
    • alpha diversity of GOS samples (measure total protein diversity in each sample)
      • provide a listing of the most diverse samples and indicate whether those are environmentally related
    • beta diversity of GOS samples (are the samples related...presumably yes)
      • show a tree and possibly a network describing the relatedness of the samples
    • estimate total number of different protein families for each sample and all samples combined using Chao index?
  • Looking at families
    • alpha diversity. Which families are the richest (not that interesting) and which are the most diverse (interesting and informative)?
      • provide list of most diverse families and maybe suggest why those are so diverse?
    • beta diversity -> do the groupings tell us anything (e.g. are they similar function, similar localization, etc.)
      • map to GO terms to see if similar function
    • chao index
      • estimate total number of proteins for each family in the ocean (what is the most prevalent)
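
The alpha-diversity and richness estimates above could be sketched as below, assuming Shannon entropy as the diversity measure and reading "chao index" as the bias-corrected Chao1 estimator. Neither choice is stated in the notes, and the function names are hypothetical. The input is a list of counts, for example one count per Pfam family within a sample.

```python
from math import log
from typing import List


def shannon(counts: List[int]) -> float:
    """Shannon diversity (natural log) of a list of counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * log(c / total) for c in counts if c > 0)


def chao1(counts: List[int]) -> float:
    """Bias-corrected Chao1 richness estimate.

    S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)), where F1 and F2 are the
    numbers of categories observed exactly once and exactly twice.
    """
    s_obs = sum(1 for c in counts if c > 0)
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
```

Ranking samples (or families) by `shannon` gives the "most diverse" listing, while `chao1` gives a lower-bound style estimate of how many categories exist beyond those observed.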

Collaboration with Steve would be a comparison between diversity measurements using taxonomic vs phylogenetic vs functional approaches