User:Morgan G. I. Langille/Notebook/Project management: Difference between revisions

Revision as of 15:42, 31 January 2011

Need to make a list of things to be done for the Roche genome paper.

Need to think about ways to identify pfams whose counts differ from each other and from whole genomes.
- take pfam counts from all completed genomes, get a distribution, then ask whether a single count is normal or not, taking into account multiple test correction
- Do we see pathways that are over/under-represented that are not expected based on:
  - genomes that are predicted from the metagenomics sample by taxon assignment (e.g. MEGAN, AMPHORA, etc.). This lets us know if something is missing/different from the information provided by looking at only taxonomy assignment.
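
A rough sketch of the "is a single count normal or not" idea, to make it concrete. It assumes a normal approximation for the per-genome counts of each pfam (a big assumption; an empirical or count-based test may fit better) and uses Benjamini-Hochberg as the multiple test correction, which is my choice here rather than anything decided above. The pfam accessions and counts are invented for illustration only.

```python
from math import erfc, sqrt
from statistics import mean, stdev


def two_sided_p(count, reference_counts):
    """Two-sided p-value for `count` under a normal fit to the reference counts.

    Assumption: counts across completed genomes are roughly normal.
    """
    mu = mean(reference_counts)
    sd = stdev(reference_counts)
    if sd == 0:
        # Degenerate reference: anything different from the constant is "unusual".
        return 1.0 if count == mu else 0.0
    z = (count - mu) / sd
    return erfc(abs(z) / sqrt(2))


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values, in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative (made-up) data: pfam -> counts in completed genomes.
reference = {
    "PF00005": [30, 35, 28, 40, 33],
    "PF00072": [12, 15, 9, 14, 11],
}
# Illustrative (made-up) counts for the sample being tested.
sample = {"PF00005": 90, "PF00072": 12}

pfams = sorted(sample)
pvals = [two_sided_p(sample[p], reference[p]) for p in pfams]
adjusted = dict(zip(pfams, benjamini_hochberg(pvals)))
unusual = [p for p in pfams if adjusted[p] < 0.05]
```

Metagenomic counts would need to be normalized (e.g. per genome equivalent) before being compared with per-genome counts; that step is not shown.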

Obtain taxon assignments for metagenomics sample
Retrieve taxon id from name
Look up pfam assignments for each taxon (pre-calculated) and multiply by the number of hits to that taxon
Somehow scale taxon assignments if they seem too large (this might happen with SEED or MEGAN predictions where each protein is counted as a taxon hit)
Subtract those pfams from total metagenomic pfam counts
Look at leftover pfams and see what is interesting
Possibly search through pre-computed pfam genomes to find genome with similar pfam composition
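
The steps above could be sketched roughly as follows. This is only a sketch under assumptions: the function names, the single global `scale` factor, and all pfam counts are hypothetical, and the name-to-id table stands in for whatever taxonomy lookup ends up being used.

```python
from collections import Counter

# Hypothetical lookup table: taxon name -> taxon id.
NAME_TO_TAXID = {"Escherichia coli": 562, "Bacillus subtilis": 1423}


def expected_pfam_counts(taxon_hits, genome_pfams, scale=1.0):
    """Sum pre-calculated per-genome pfam counts, weighted by taxon hits.

    taxon_hits:   taxon id -> number of hits in the metagenomic sample
    genome_pfams: taxon id -> {pfam accession -> count in that genome}
    scale:        hypothetical global factor to shrink inflated hit counts
    """
    expected = Counter()
    for taxon_id, hits in taxon_hits.items():
        # Taxa without a pre-calculated genome contribute nothing.
        for pfam, per_genome in genome_pfams.get(taxon_id, {}).items():
            expected[pfam] += per_genome * hits * scale
    return expected


def subtract_expected(observed, expected):
    """Leftover = observed metagenomic count minus taxonomy-predicted count."""
    pfams = set(observed) | set(expected)
    return {p: observed.get(p, 0) - expected.get(p, 0) for p in pfams}


# Made-up example data.
hits_by_name = {"Escherichia coli": 2, "Bacillus subtilis": 1}
taxon_hits = {NAME_TO_TAXID[name]: n for name, n in hits_by_name.items()}
genome_pfams = {
    562: {"PF00005": 3, "PF00072": 1},
    1423: {"PF00005": 2},
}
observed = {"PF00005": 10, "PF00072": 2, "PF07690": 4}

expected = expected_pfam_counts(taxon_hits, genome_pfams)
leftover = subtract_expected(observed, expected)
# Largest positive leftovers first: pfams not explained by taxonomy alone.
interesting = sorted(leftover, key=leftover.get, reverse=True)
```

Negative leftovers are kept rather than clipped to zero, since a pfam that taxonomy predicts but the sample lacks may be as interesting as an excess.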

Correlate different sample similarities using taxon vs pfam bc
Need to re-write R scripts to calculate correlations and ecological distances from a matrix (this is needed for the Erebus project as well).
- Look at online notebook for some code, as well as the DARPA wiki (for the correlation stuff). I emailed Steve asking for his R scripts.
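
Until the R scripts are re-written, here is a small Python sketch of the calculation, assuming "bc" means Bray-Curtis dissimilarity. It computes pairwise Bray-Curtis distances between samples from a taxon matrix and from a pfam matrix, then takes the Pearson correlation of the two distance vectors. The matrices are made up, and a proper significance test (e.g. a permutation-based Mantel test) is not included.

```python
from itertools import combinations
from math import sqrt


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0  # two empty samples: treat as identical
    return sum(abs(a - b) for a, b in zip(u, v)) / total


def pairwise_distances(matrix):
    """Distances for every sample pair (rows = samples), in a fixed order."""
    pairs = combinations(range(len(matrix)), 2)
    return [bray_curtis(matrix[i], matrix[j]) for i, j in pairs]


def pearson(x, y):
    """Pearson correlation; assumes neither vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


# Made-up matrices: one row per sample, same sample order in both.
taxon_matrix = [[10, 0, 5], [8, 2, 5], [0, 10, 1]]
pfam_matrix = [[20, 5], [18, 6], [3, 25]]

taxon_dist = pairwise_distances(taxon_matrix)
pfam_dist = pairwise_distances(pfam_matrix)
r = pearson(taxon_dist, pfam_dist)
```

Because pairwise distances are not independent of each other, the usual p-value for a Pearson correlation does not apply to `r`.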

Starting with PFAM counts across all GOS samples

Collaboration with Steve would be a comparison between diversity measurements using taxon vs phylogenetic vs functional

@@ Line 12: / Line 12: @@
 ==Erebus==
-*got matrix and built cluster using vegdist
 *need to think about ways to identify pfams that have different counts to each other and to whole genomes.
 **take pfam counts from all completed genomes, get a distribution, then ask if a single count is normal or not taking into account multiple test correction
 **Do we see pathways that are over/under-represented that are not expected based on:
 ***genomes that are predicted from the metagenomics sample by taxon assignment (e.g. megan, amphora, etc.). This lets us know if something is missing/different from the information provided by looking at only taxonomy assignment.
+===Pfam Subtraction Pipeline===
+#Obtain taxon assignments for metagenomics sample
+#Retrieve taxon id from name
+#Look up pfam assignments for each taxon (pre calculated) and multiply by the number of taxon
+#Somehow scale taxon assignments if they seem too large (this might happen with SEED or MEGAN predictions where each protein is counted as a taxon hit)
+#subtract those pfams from total metagenomic pfam counts
+#Look at leftover pfams and see what is interesting
+#Possibly search through pre-computed pfam genomes to find genome with similar pfam composition
 ==Protein family stuff with Steve==