Moore Notes 4 21 10

From OpenWetWare

Group Call

  • Update from Sam
    • Sam has been selected to give a talk at the Cold Spring Harbor Biology of Genomes Meeting. If people have time to provide feedback in the next two weeks, she would appreciate it.
      • Might give short sample talk next week
    • Phylogenetics of metagenomic data is the linchpin of many of our tools.
    • Simulation methods -
      • Have many ways of selecting a reference database in addition to many ways of selecting reads
      • HMMER models built by M. Wu are used to generate alignments
    • Needed to test on 16S rRNA data to validate approach to OTU pipeline
      • Simulated a bunch of data from the RDP alignment
        • JAE: RDP has restrictions on the data that they release - very tight, hard to repackage, might need to use different source data
    • Going to look at some results that Sam has generated (trees are based on rpoB simulations, plots are based on analysis of 16S simulations)
      • export1.pdf: 20-sequence reference database, 5 simulated reads. MaxPD used to select references. Built trees using reads (export1) and sources (export1b) and the topology generally looks consistent (RpoB sequences)
        • reads have a median length of ~400, but added pads to data to simulate ultrashort reads near the ends of the locus
        • alignment is generated using the profile HMM and alignment scripts from amphora, not the sampled reference data
        • FastTree used to generate phylogeny; also running RAxML, both with fixed and unfixed reference trees. Also trying pplacer. May also try NJ
      • export2: there are about 50 reads, 60 reference sequences in these trees (RpoB). Reference sequences are different from export1, though some are the same sequences - MaxPD used to select references. Reads from the same source are co-clustering in the tree. Some topological differences here. This is one of the worst trees I've seen.
        • JAE: Do single reads get placed correctly? If you only include one read, does it get placed in the proper location?
        • James: Might be interesting to know how quickly it goes wrong as you add more reads
    • some analyses that we've conducted:
      • compare the source-containing tree to the read-containing tree and calculate a per-read error metric to see if certain reads contribute more error than others.
        • Is extra branch length due to difference in the length of the alignment between the two trees (masking)?
    • Read length may be a problem: the L2^2 norm finds a correlation whereas the median L1 difference doesn't
      • JL: Looks like read length doesn't introduce an accuracy bias, but a problem with precision. If you use short reads, might have lots of noise, but on average might have right answer
        • James: can't really tune sequence length with real data. Can control relative size of reference database. Number of reference reads appears to be more important according to earlier slides.
        • AD: Would be useful to know if covariance exists, because that could be important as you go to shorter read lengths.
        • JAE: We need to know some of this information because we need to know how to get to the correct tree. These metrics might provide guidance regarding what methods to use for data collection.
        • MW: Reiterates RAxML option to run parsimony
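
One way to see why a squared L2 error can correlate with read length while a median L1 difference does not is sketched below. This is only an illustration, not the analysis Sam ran: it assumes each read is summarized by a vector of distances to the reference sequences, once in the source-containing tree and once in the read-containing tree, and all names and values (`READS`, `summarize`, the toy distances) are invented for the example. A single badly placed short read dominates the squared L2 total but leaves the median L1 unchanged, which fits JL's point about precision rather than bias.

```python
from collections import defaultdict
from statistics import median


def squared_l2(a, b):
    """Sum of squared differences between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def l1(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical toy data (assumption, not from the notes):
# (read length, distances to references in the source tree,
#  distances to the same references in the read tree)
READS = [
    (400, [0.10, 0.20, 0.30], [0.11, 0.19, 0.30]),
    (400, [0.20, 0.10, 0.40], [0.20, 0.12, 0.39]),
    (400, [0.30, 0.30, 0.10], [0.28, 0.30, 0.12]),
    (100, [0.10, 0.20, 0.30], [0.11, 0.19, 0.30]),
    (100, [0.20, 0.10, 0.40], [0.20, 0.12, 0.39]),
    (100, [0.30, 0.30, 0.10], [0.80, 0.30, 0.60]),  # one badly placed short read
]


def summarize(reads):
    """Group per-read errors by read length and summarize each group."""
    by_length = defaultdict(list)
    for length, source_dists, read_dists in reads:
        by_length[length].append(
            (squared_l2(source_dists, read_dists), l1(source_dists, read_dists))
        )
    return {
        length: {
            "total_sq_l2": sum(sq for sq, _ in errors),
            "median_l1": median(abs_err for _, abs_err in errors),
        }
        for length, errors in by_length.items()
    }


summary = summarize(READS)
for length in sorted(summary):
    print(length, summary[length])
```

With these toy values the two read-length groups share the same median L1 error, while the squared L2 total for the short reads is far larger because of the single outlier. Whether the real simulations behave this way would need to be checked against the per-read errors themselves.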
  • Update from Tom
    • Not much time - error discussed above looks tolerable from standpoint of estimating OTUs and diversity
    • Want to set up separate meeting to discuss
    • Paper coming soon - will need help from authors in writing it. Who is going to be on this manuscript - please contact Tom