PhyloPipeline

From OpenWetWare

Availability and dependencies

The scripts are available as an archive. The data included with the scripts are the results of running AMPHORA on the HOT/ALOHA metagenomic data set. Download the scripts and associated data here: File:MgTreeBuildingScripts.zip

These scripts require Perl with the Bio::Phylo and BioPerl modules, Muscle, and RAxML-HPC-multithreaded.
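Before running anything, it can help to confirm that the external programs are on the PATH. The following is a minimal sketch in Python. The executable names (perl, muscle, raxmlHPC-PTHREADS) are assumptions about a typical installation and may differ on your system, and the Perl modules are not checked here.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
REQUIRED_TOOLS = ["perl", "muscle", "raxmlHPC-PTHREADS"]


def missing_tools(names=REQUIRED_TOOLS):
    """Return the names that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required programs were found.")
```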

--Steven Kembel

Description

This directory contains scripts and data files to automate the process of inferring phylogenetic relationships among metagenomic reads from different gene families, as identified by AMPHORA.

Many of the scripts are kludgy: they make assumptions about the data and do no error checking. You will need to modify them to work with datasets other than the output generated by running AMPHORA on the HOT/ALOHA data set, and to work optimally on systems other than one with an 8-core multithreaded processor, although this should be easy to do.
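One likely adjustment for machines without 8 cores is the number of threads given to RAxML. The sketch below builds a RAxML command line with the thread count taken from the machine it runs on. It is an illustration only, not the invocation used by the scripts: the binary name raxmlHPC-PTHREADS, the model PROTGAMMAWAG, the seed, and the function name are all assumptions. The options -T, -m, -p, -s, -n, and -g are standard RAxML options for threads, substitution model, random seed, alignment file, run name, and constraint tree.

```python
import os


def build_raxml_command(alignment, run_name, threads=None, constraint_tree=None,
                        model="PROTGAMMAWAG", binary="raxmlHPC-PTHREADS",
                        seed=12345):
    """Build a RAxML argument list (hypothetical helper, not part of the archive)."""
    if threads is None:
        # Use every core on this machine; fall back to 2 if the count is unknown.
        threads = os.cpu_count() or 2
    cmd = [binary, "-T", str(threads), "-m", model, "-p", str(seed),
           "-s", alignment, "-n", run_name]
    if constraint_tree is not None:
        # Assumption: the constrained method passes the reference tree via -g.
        cmd += ["-g", constraint_tree]
    return cmd


if __name__ == "__main__":
    # File names here are placeholders, not files from the archive.
    print(" ".join(build_raxml_command("geneX.mg.aln.phylip", "geneX.mg")))
```

The returned list can be passed to subprocess.run once the file names match your own data.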

Three methods are available to infer phylogenetic relationships among metagenomic reads in each gene family. Tree inference is performed using RAxML.

  • A tree is inferred from the metagenomic sequences alone. (mg)
  • A tree is inferred from the metagenomic sequences, plus the sequences in the AMPHORA reference alignment. The reference sequences are then pruned out of the tree, leaving just the metagenomic sequences as the tips. (mgr)
  • A tree is inferred from the metagenomic sequences, plus the sequences in the AMPHORA reference alignment, and tree topology is constrained by the AMPHORA reference phylogeny. The reference sequences are then pruned out of the tree, leaving just the metagenomic sequences as the tips. (mgrConstrained)
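The pruning step used by the mgr and mgrConstrained methods can be illustrated with a small, self-contained Python sketch. It is not the code used by the utility scripts. The function names and the example tree are made up for illustration, and the parser handles only plain Newick (no quoted labels or comments). Internal nodes left with a single child after pruning are collapsed, and their branch lengths are added together.

```python
def parse_newick(text):
    """Parse a plain Newick string into nested dicts (name, length, children)."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1
                children.append(parse_node())
            pos += 1  # skip the closing ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return parse_node()


def prune(node, keep):
    """Keep only tips whose names are in `keep`; return None if nothing is left."""
    if not node["children"]:
        return node if node["name"] in keep else None
    kids = [k for k in (prune(c, keep) for c in node["children"]) if k is not None]
    if not kids:
        return None
    if len(kids) == 1:
        # Collapse the now-unbranched node and merge the two branch lengths.
        child = kids[0]
        if node["length"] is not None or child["length"] is not None:
            child["length"] = (node["length"] or 0.0) + (child["length"] or 0.0)
        return child
    node["children"] = kids
    return node


def to_newick(node):
    """Serialise a node back to Newick text (without the trailing semicolon)."""
    text = ""
    if node["children"]:
        text = "(" + ",".join(to_newick(c) for c in node["children"]) + ")"
    text += node["name"]
    if node["length"] is not None:
        text += ":" + format(node["length"], "g")
    return text


if __name__ == "__main__":
    tree = parse_newick("((mg1:0.1,ref1:0.2):0.3,(mg2:0.1,ref2:0.1):0.2);")
    pruned = prune(tree, keep={"mg1", "mg2"})
    print(to_newick(pruned) + ";")  # prints (mg1:0.4,mg2:0.3);
```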

Directory contents

genes

  • text file containing list of gene families to process
    • (= 31 families in AMPHORA by default)

/mg

  • directory containing metagenomic reads identified by AMPHORA

mg_treebuild_script.sh

  • script to infer phylogenetic relationships among metagenomic reads

mgr_treebuild_script.sh

  • script to infer phylogenetic relationships among metagenomic reads based on combined ML analysis of reference alignments plus metagenomic reads

mgrConstrained_treebuild_script.sh

  • script to infer phylogenetic relationships among metagenomic reads based on combined ML analysis of reference alignments plus metagenomic reads, with tree topology constrained by the AMPHORA reference tree topology for each gene

/ref

  • AMPHORA reference alignments for each gene family

/reftree

  • AMPHORA reference phylogeny for each gene family

/utility-scripts

  • Perl scripts used by the tree building script

Directories created by the scripts

/working

  • This directory contains the RAxML working files and logs for each gene family

/results

  • This directory contains the results of the analyses. Exact filenames vary depending on whether the tree was built with metagenomic sequences only (mg), metagenomic sequences plus reference alignment (mgr), or metagenomic sequences plus reference alignment constrained by reference tree topology (mgrConstrained).

Results files created by the scripts

Files in /results include (where X is a gene family and Y is one of mg, mgr, mgrConstrained):

X.Y.aln

  • aligned sequences in fasta format

X.Y.aln.phylip

  • aligned sequences in relaxed Phylip format
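Relaxed Phylip differs from strict Phylip mainly in that sequence names are not limited to 10 characters. The file starts with a header giving the number of sequences and the alignment length, and each following row is a name, whitespace, and the aligned sequence. The sketch below shows the conversion from a fasta alignment, assuming names contain no whitespace (only the first word of each fasta header is kept). The function names are illustrative, and this is not the converter shipped in /utility-scripts.

```python
def read_fasta(text):
    """Return a list of (name, sequence) pairs from fasta-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name = line[1:].split()[0]  # keep only the first word of the header
            chunks = []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_relaxed_phylip(records):
    """Format aligned (name, sequence) pairs as sequential relaxed Phylip text."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length (or none given); not an alignment")
    width = max(len(name) for name, _ in records)
    lines = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records:
        # Pad names so the sequences line up; one space separates name and data.
        lines.append(f"{name.ljust(width)} {seq}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    fasta = ">read_0001 sample text\nAC-G\nT\n>read_0002\nACGGT\n"
    print(to_relaxed_phylip(read_fasta(fasta)), end="")
```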

X.mg.sample

  • metagenomic sequence occurrence in environmental samples
  • this is a Phylocom-formatted sample file (http://phylodiversity.net/phylocom)
  • each row contains sampleID<tab>Abundance<tab>sequenceID
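The three-column layout described above is easy to read back into a nested dictionary. The following is a minimal sketch in Python, assuming abundances are whole numbers and that the file has no header row. The function name and the sample and read identifiers in the example are hypothetical.

```python
from collections import defaultdict


def read_sample_file(lines):
    """Parse rows of sampleID<tab>abundance<tab>sequenceID.

    Returns {sampleID: {sequenceID: abundance}}.
    """
    samples = defaultdict(dict)
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(
                f"line {number}: expected 3 tab-separated fields, got {len(fields)}"
            )
        sample_id, abundance, sequence_id = fields
        samples[sample_id][sequence_id] = int(abundance)
    return dict(samples)


if __name__ == "__main__":
    rows = ["sampleA\t2\tread1\n", "sampleA\t1\tread2\n", "sampleB\t5\tread1\n"]
    for sample_id, counts in read_sample_file(rows).items():
        print(sample_id, counts)
```

An open file handle can be passed in place of the list, since the function only iterates over lines.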

X.Y.raxml.newick

  • the phylogenetic tree linking all metagenomic reads in a gene family
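Since the scripts do no error checking, it can be useful to confirm after a run that every expected file was written. The sketch below derives the filenames listed above for each gene family and reports any that are absent. It assumes the genes file holds one gene family name per line and that results are in a directory named results under the current directory. The function names are illustrative and not part of the archive.

```python
from pathlib import Path

METHODS = ("mg", "mgr", "mgrConstrained")


def read_gene_list(path="genes"):
    """Read gene family names, assuming one name per line."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def expected_results(gene, method):
    """Return the result filenames listed above for one gene family and method."""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    return [
        f"{gene}.{method}.aln",
        f"{gene}.{method}.aln.phylip",
        f"{gene}.mg.sample",  # the sample file is named with mg, as listed above
        f"{gene}.{method}.raxml.newick",
    ]


def missing_results(genes, method, results_dir="results"):
    """Return the expected filenames that do not exist in results_dir."""
    results_dir = Path(results_dir)
    return [
        name
        for gene in genes
        for name in expected_results(gene, method)
        if not (results_dir / name).exists()
    ]


if __name__ == "__main__":
    if Path("genes").exists():
        for method in METHODS:
            for name in missing_results(read_gene_list("genes"), method):
                print("missing:", name)
```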