User:Morgan G. I. Langille/Notebook/Project management: Difference between revisions
From OpenWetWare
Revision as of 15:42, 31 January 2011
Halophiles
Need to make list of things to be done for roche genome paper.
- Organize files for roche genomes and new NCBI completed halophile genomes
- run mugsy on all genomes
- Re-do crispr analysis for new NCBI genomes.
- Look at homologs of genes identified in new Science metabolic paper.
Darpa
- phone call on Friday
Erebus
- need to think about ways to identify pfams that have different counts to each other and to whole genomes.
- take pfam counts from all completed genomes, get a distribution, then ask whether a single count is normal or not, taking multiple-test correction into account
- Do we see pathways that are over/under-represented that are not expected based on:
- genomes that are predicted from the metagenomics sample by taxon assignment (e.g. megan, amphora, etc.). This lets us know if something is missing/different from the information provided by looking at only taxonomy assignment.
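The distribution idea above could be sketched roughly as follows. This is only a minimal sketch, assuming the per-genome counts for each pfam are close enough to normal for a z-score to be meaningful (pfam counts are often skewed, so an empirical or count-based test may fit better) and that the sample count has already been normalized to be comparable to a single genome. The function names and the Benjamini-Hochberg choice for the multiple-test correction are illustrative assumptions, not something decided in these notes.

```python
import math
from statistics import mean, stdev


def two_sided_p(z):
    """Two-sided p-value for a z-score under a standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def unusual_pfams(sample_counts, genome_counts, alpha=0.05):
    """Flag pfams whose sample count is unusual given the genome distribution.

    sample_counts: {pfam: count in the sample}             (hypothetical input)
    genome_counts: {pfam: [count in each completed genome]} (hypothetical input,
                   at least two genomes per pfam)
    Returns {pfam: adjusted p-value} for pfams passing the alpha cutoff.
    """
    pfams = sorted(sample_counts)
    pvals = []
    for pfam in pfams:
        ref = genome_counts[pfam]
        mu, sd = mean(ref), stdev(ref)
        if sd == 0:
            # No variation across genomes: any deviation is treated as extreme.
            pvals.append(1.0 if sample_counts[pfam] == mu else 0.0)
        else:
            pvals.append(two_sided_p((sample_counts[pfam] - mu) / sd))
    qvals = benjamini_hochberg(pvals)
    return {p: q for p, q in zip(pfams, qvals) if q <= alpha}
```

Calling unusual_pfams with one dictionary of sample counts and one of per-genome count lists returns only the pfams that remain significant after correction.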
Pfam Subtraction Pipeline
- Obtain taxon assignments for metagenomics sample
- Retrieve taxon id from name
- Look up pfam assignments for each taxon (pre-calculated) and multiply by the number of hits assigned to that taxon
- Somehow scale taxon assignments if they seem too large (this might happen with SEED or MEGAN predictions where each protein is counted as a taxon hit)
- subtract those pfams from total metagenomic pfam counts
- Look at leftover pfams and see what is interesting
- Possibly search through pre-computed pfam genomes to find genome with similar pfam composition
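The subtraction steps above might look something like the sketch below. It assumes the taxon hits and the pre-calculated per-taxon pfam profiles are already available as dictionaries; the names taxon_hits, genome_pfams and scale are hypothetical, and the single scale factor is only a placeholder for whatever scaling of over-counted taxon assignments turns out to be appropriate.

```python
from collections import Counter


def expected_pfam_counts(taxon_hits, genome_pfams, scale=1.0):
    """Build the pfam counts expected from taxonomy alone.

    taxon_hits:   {taxon_id: number of hits in the sample}  (hypothetical input)
    genome_pfams: {taxon_id: {pfam: count in that genome}}  (pre-calculated)
    scale:        placeholder factor to shrink hits that are over-counted,
                  e.g. when every protein is counted as a taxon hit
    """
    expected = Counter()
    for taxon, hits in taxon_hits.items():
        profile = genome_pfams.get(taxon)
        if profile is None:
            continue  # no pre-calculated pfam profile for this taxon
        for pfam, count in profile.items():
            expected[pfam] += count * hits * scale
    return expected


def subtract_pfams(metagenome_counts, expected):
    """Observed minus expected count for every pfam seen in either set.

    Negative values are kept on purpose: they mark pfams that taxonomy
    predicts but the sample lacks.
    """
    pfams = set(metagenome_counts) | set(expected)
    return {p: metagenome_counts.get(p, 0) - expected.get(p, 0) for p in pfams}
```

Sorting the returned dictionary by value puts the most over-represented leftover pfams at one end and the most under-represented at the other.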
Protein family stuff with Steve
- correlate different sample similarities using taxon vs pfam bc
- Need to re-write R scripts to calculate correlations and ecological distances from matrix (this needed for erebus project as well).
- Look at online notebook for some code, as well as darpa wiki (for the correlation stuff). I emailed steve asking for his R scripts.
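Until the R scripts are re-written, the comparison could be prototyped along these lines. This is a rough sketch in Python rather than R, and it assumes that "bc" above means Bray-Curtis dissimilarity and that both matrices cover the same samples; the function names are made up for illustration. A plain Pearson correlation of the pairwise distances does not give a valid significance value, since the pairs are not independent (a Mantel-style permutation test would be needed for that).

```python
import math
from itertools import combinations


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def pairwise_distances(matrix):
    """matrix: {sample: [counts]} -> {(sample1, sample2): distance}."""
    return {
        (a, b): bray_curtis(matrix[a], matrix[b])
        for a, b in combinations(sorted(matrix), 2)
    }


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlate_distance_matrices(taxon_matrix, pfam_matrix):
    """Correlate taxon-based and pfam-based sample distances pair by pair."""
    dt = pairwise_distances(taxon_matrix)
    dp = pairwise_distances(pfam_matrix)
    pairs = sorted(dt)  # assumes both matrices contain the same sample names
    return pearson([dt[p] for p in pairs], [dp[p] for p in pairs])
```

A coefficient near 1 would suggest the pfam profiles separate samples in much the same way as the taxon profiles do.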
Rough Ideas
Starting with PFAM counts across all GOS samples
- Looking at samples
- alpha diversity of GOS samples (measure total protein diversity in each sample)
- provide a listing of the most diverse samples and indicate whether those are environmentally related
- beta diversity of GOS samples (are the samples related... presumably yes)
- show a tree and possibly a network describing the relatedness of the samples
- estimate total number of different protein families for each sample and all samples combined using chao index?
- Looking at families
- alpha diversity. what fams are the most rich (not that interesting), diverse (interesting and informative)
- provide list of most diverse families and maybe suggest why those are so diverse?
- beta diversity -> do the groupings tell us anything (e.g. are they similar function, similar localization, etc.)
- map to GO terms to see if similar function
- chao index
- estimate total number of proteins for each family in the ocean (what is the most prevalent)
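The richness, diversity and chao index measurements listed above could be prototyped as below. This is a minimal sketch that assumes Shannon entropy as the diversity measure and the bias-corrected form of Chao1 as the chao index; neither choice is fixed in these notes. The same functions apply to a sample (counts across families) or to a family (counts across samples).

```python
import math


def richness(counts):
    """Number of categories observed at least once."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon diversity (natural log) of a count vector."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def chao1(counts):
    """Bias-corrected Chao1 estimate of total richness.

    Uses singletons (f1) and doubletons (f2):
    S_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
    """
    f1 = sum(1 for c in counts if c == 1)
    f2 = sum(1 for c in counts if c == 2)
    return richness(counts) + f1 * (f1 - 1) / (2 * (f2 + 1))
```

Chao1 relies on raw integer counts, so it should be run before any normalization of the PFAM count matrix.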
Collaboration with Steve would be a comparison between diversity measurements using taxon vs phylogenetic vs functional