STAP: Difference between revisions

Revision as of 13:31, 25 February 2010

Introduction to ss-rRNA Taxonomy Assigning Pipeline (STAP)

The comparative analysis of ss-rRNA sequences is one of the most powerful approaches for studying phylogenetic relationships among all organisms. Our ss-rRNA Taxonomy Assigning Pipeline (STAP) combines publicly available packages such as, PHYML, BLASTN, and CLUSTALW with our own automatic alignment masking and tree parsing programs. STAP makes automatic taxonomic assignments for ss-rRNAs based on neighbor-joining or maximum likelihood phylogenetic trees rather than on the top BLASTN hits, and thus its results are more reliable and accurate. Most importantly, the automation yields results comparable to those achievable by manual analysis, yet offers speed and capacity that are unattainable by manual efforts.

First, ss-rRNA sequences obtained either by PCR of environmental samples or by metagenomic shotgun sequencing are searched against our ss-rRNA database by BLASTN to select a related data set. STAP then automatically generates, masks, and trims the multiple sequence alignments. Next, it builds a phylogenetic tree by either the maximum likelihood or neighbor-joining method. Automated analysis of the tree yields taxonomic assignments for each query sequence.

A paper describing STAP has been published (7/2/08) in PLoS One.

Wu D, Hartman A, Ward N, Eisen JA (2008) An Automated Phylogenetic Tree-Based Small Subunit rRNA Taxonomy and Alignment Pipeline (STAP). PLoS ONE 3(7): e2566. doi:10.1371/journal.pone.0002566

Download

There are two files needed to be downloaded:

db.tar.gz : the STAP database package, available from http://bobcat.genomecenter.ucdavis.edu/STAP/db.tar.gz [84.6 MB before unzipping, 652.4 MB after opening]

rRNA_scripts.tar.gz: the STAP program file package, available from http://bobcat.genomecenter.ucdavis.edu/STAP/rRNA_scripts.tar.gz [24 kb before unzipping, 100 kb after opening]

Installation

These are the requirements to run the STAP pipeline:

Linux OS (kernel version 2.6 or later)
Perl (version 5.0 or later)
NCBI blast package (download ncbi.tar.gz from ftp://ftp.ncbi.nih.gov/blast/executables/LATEST/)
PHYML (more information from http://atgc.lirmm.fr/phyml/scripts/binaries.php) [Note: You might need to change the permissions of this file, using "chmod 577"to execute it.]
CLUSTALW (download from www.ebi.ac.uk/Tools/clustalw)
Perl Module: BioPerl 1.4 or later (more information from http://www.bioperl.org/wiki/Main_Page)
Perl Module: XML::DOM (more information from http://search.cpan.org/~tjmather/XML-DOM-1.44/lib/XML/DOM.pm)

Note: If you are using MacOS, you might find it is easiest to install both Bioperl and XML-DOM using fink: http://www.finkproject.org/download/index.php. For more information, see: http://www.bioperl.org/wiki/Getting_BioPerl#MacOS_X_using_fink. Both Bioperl and XML-DOM will be installed if you use this. The one possible problem is that fink puts the bioperl libraries in /sw/lib, so you might need to add this to your path (e.g., export PERL5LIB=$PERL5LIB:/sw/lib/perl5/5.8.6)

Installation:

download database db.tar.gz into the desirable directory
```
tar -xzvf db.tar.gz
```
```
cd db_dir
```
Write down the full path of the database directory db_dir
Download rRNA_scripts.tar.gz into the desirable directory
```
tar -xzvf rRNA_scripts.tar.gz
```
```
cd rRNA_pipeline_scripts
```
Launch the perl setup program.
```
perl setup.pl
```
Answer the questions when prompt to set the pipline options.
To run STAP, run this script:
```
rRNA_pipeline_for_one.pl
```
This script can be located anywhere you want, but the directory rRNA_pipeline_scripts and its content should be kept intact.

Run the Program

Prepare the input sequence file in fasta format, one sequence per input file

Example: file test.seq                                  
  >accession                
  TTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAGCGGAAACGAGAAGTAGCTTGCTACTTCGGCGT                 
  CGAGCGGCGGACGGGTGAGTAATGCATAGGAAGTTGCCCAGTAGAGGGGGATAACCATTGGAAACGATGG                 
  CTAATACCGCATAACCTCTTCGGAGCAAAGCGGGGGACCTTCGGGCCTCGCGCTACTGGATACGCCTATG                 
  TGGGATTAGCTAGTTGGTGAGGTAATGGCTCACCAAGGCGACGATCCCTAGCTGGTCTGAGAGGATGATC                 
  AGCCACACTGGGACTGAGACACGGCCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGG                 
  CGAAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGTTGTGA                 
  GGAAGGTTTCGTAGTTAATAACTGCGTTGCTTGACGTTAGCAACAGAAGAAGCACCGGCTAACTCCGTGC                 
  CAGCAGCCGCGGTAATACGGAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGCATGCAGGTGG                 
  TTTGTTAAGTCAGATGTGAAAGCCCGGGGCTCAACCTCGGAAGGTCATTTGAAACTGGCAAACTAGAGTA                 
  CTGTAGAGGGGGGTAGAATTTCAGGTGTAGCGGTGAAATGCGTAGAGATCTGAAGGAATACCAGTGGCGA                 
  AGGCGGCCCCCTGGACAGATACTGACACTCAGATGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCT                 
  GGTAGTCCACGCCGTAAACGATGTCTACTTGGAGGTTGTGGCCTTGAGCCGTGGCTTTCGGAGCTAACGC                 
  GTTAAGTAGACCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAATGAATTGACGGGGGCCCGCACAA                 
  GCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTACTCTTGACATCCAGAGAACTT                 
  TCCAGAGATGGATTGGTGCCTTCGGGAACTCTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGT                 
  GAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTGTTTGCCAGCGAGTAATGTCGGGAAC                 
  TCCAGGGAGACTGCCGGTGATAAACCGGAGGAAGGTGGGGACGACGTCAAGTCATCATGGCCCTTACGAG                 
  TAGGGCTACACACGTGCTACAATGGCGCATACAGAGGGCAGCAAGCTAGCGATAGTGAGCGAATCCCAAA                 
  AAGTGCGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGA                 
  TCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGCTGC                 
  AAAAGAAGTAGGTAGTTTAACCTTCGGGAGGACGCTTACCACTTTGTGGTTCATGACTGGGGTGAAGTCG                 
  TAACAAGGTAGCCCTAGGGGAACC

Run the Program:

rRNA_pipeline_for_one.pl -i [input]

where [input] is replaced by the name of the input sequence file.

Example: rRNA_pipeline_for_one.pl -i test.seq

More parameters:

-i 	(required) Input format (one sequence per file)

-n 	(optional) The number of unclassified database ss-rRNA nearest-neighbors to 
                  ignore when identifying the nearest-neighbor in the first tree
                  Allowed entries:		0-20 
                  Default:			5
-t 	(optional) When building the second tree, the number of taxonomic levels to 
                  step up from the taxonomic assignment made in the first tree.
                  Default:	1
-o 	(optional) Output directory
                  Default:	the current directory
-d 	(optional) Domain of the input sequence
                  Allowed entries:	P (prokaryote)
                                       E (eukaryote)
                  If left blank, the program will assign the domain automatically.

Read the results

The following files will be generated by the STAP program:

  accession.seq:                   sequence file                   
  accession.mltree1:               the first tree                 
  accession.mltree1.with_taxonomy: The first tree with taxonomy infomation attached                 
  accession.mltree1.tab:           the parsing output of the first tree (based upon mid-point rooting)                   
  accession.mltree2:               the second tree                 
  accession.mltree2.with_taxonomy: the second tree with taxonomy infomation attached                 
  accession.mltree1.tab:           the parsing output of the scond tree (rooted by an outgroup defined by the first tree)               
  accession.results:               the summary of the taxonomy assgnments

Format of the tree parsing results:

Each line is tab-delimited with the information illustrated in the following figure: The output file is a tab-delimited text file, i.e one record per line, the data fields separated by tabs. The contents of each column is explained in the following list and figure:

  Column 1: Accession of the input sequences
  Column 2: Accession of a database sequence
  Column 3: Distance between the input and the database entry 
            (highlighted by the blue line)
  Column 4: Distance between the nodes that the input and the database entry are attached to
            (highlighted by the orange line)   
  Column 6: The branching position of the database sequence (marked by the red circle)
            along the path that connects the input and the outgroup 
            (highlighted by the red line)
  Cloumn 7: The taxonomy annotation of the database entry

The parsing results are sorted by column 6, 5, 4 and 3 to identify the nearest neighbor of the input sequence.

Format of the final result summary file:

The results summary file is also a tab-delimited text file. The contents of each column is illustrated in the following list and figure:

Column 1:  Accession of the input sequence    
Column 2:  The top Blast hit and its taxonomy    
Column 3:  The nearest neighbor in the first tree and its taxonomy    
Column 4:  The nearest neighbor in the second tree and its taxonomy.   
           This taxonomy is the final taxonomy assignment to the query.      
Column 5:  The domain of the query sequence if automatic domain identification   
           process was triggered.    
           Domain is determined by a representative neighbor joining tree   
           illustrated by the following figure. The branching position of   
           the query related to the domain braching point (indicated by the   
           red triangle) in terms of number of nodes (indicated by the rec   
           circles) is also reported in this column

Increasing the speed of the analysis

Two approaches can be used to decrease the analysis time.

<1> If the user knows the domain of a query sequence, the -d option should always be defined to skip the timeconsuming domain-identification step.

<2> A linux cluster is required for processing massive datasets. The STAP program can be run in parallel on nodes of a linux cluster (see http://www.rocklinux.org/wiki/Main_Page for more details).

An extra program to align large ss-rRNA sequences to each other

In the package, we include a script to align large ss-rRNA sequences against each other. The script is align_to_profile.pl. It takes a ss-rRNA multifasta format and builds alignments of all the sequences in the input file using the clustalw profile aligning method.

   Usage: align_to_rRNA_profile.pl -i [input] -d [domain] -o [output]
          Input parameters:
          -i: sequence file in multifasta format
          -d: domain (A for archaeal, B for bacteria, E for eukaryotes, P for prokaryotes,
              X for all domain; Default X)
          -o: output file (in aligned multifasta format)

For large amount of sequences, it is recommended to break the multifasta file into smaller sections and run align_to_profile.pl on a linux cluster. As long as the same domains are defined for all the processes, the output files of align_to_profile.pl can be concatenated into a large aligned multifasta file in which all the sequences are aligned to each other. In the package, we include the script trim_all_gap.pl to remove the all gapped columns of the concatenated multiple alignments.

   Usage: trim_all_gap.pl  -i [input] -o [output]

@@ Line 42: / Line 42: @@
 ==Run the Program==
-'''<1> Prepare the input sequence file in fasta format, one sequence per input file  '''
+#'''Prepare the input sequence file in fasta format, one sequence per input file  '''
   Example: file test.seq
@@ Line 69: / Line 69: @@
     TAACAAGGTAGCCCTAGGGGAACC
-'''<2>Run the Program:'''
+#'''Run the Program:'''
 rRNA_pipeline_for_one.pl -i [input]

STAP: Difference between revisions

Revision as of 13:31, 25 February 2010

Contents

Introduction to ss-rRNA Taxonomy Assigning Pipeline (STAP)

Download

Installation

Run the Program

Read the results

Increasing the speed of the analysis

An extra program to align large ss-rRNA sequences to each other

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

research

Tools