Alignreads.py README

From OpenWetWare

Revision as of 20:07, 22 September 2010 by Zachary S. L. Foster (Talk | contribs)

Alignreads.py Readme - Last Modified 9/22/2010 for v2.23

What it is: A pipeline of five scripts used for making reference-guided assemblies from microreads. It is meant to provide a one-step, automatable script that produces full alignments, along with all of the associated files, using a single command. It takes a minimum of 2 arguments, but supports over 40 modifiers from the constituent scripts. Since the first step of the pipeline (YASRA) takes the majority of the computing time, the script can also take the output of a previous YASRA run and continue the pipeline from there. It can also rerun the pipeline on the YASRA data in a finished alignreads output; each analysis will have its own subfolder within the main folder containing the YASRA data. If no arguments are supplied, a help menu is printed with the most common options displayed. The full help menu with all of the options can be reached by adding the -H/--advanced-help option. Alignreads uses the following scripts from the Liston lab: runyasra.py, sumqual.py, and qualtofa.py. It also uses YASRA and its binaries, including lastz. The Python interpreter must be able to find these scripts in order to run alignreads; this can be done by saving copies of the scripts in the Python26 folder or your bin folder, or by modifying your .cshrc file to search where they are saved on your system (necessary if the lastz binaries cannot be found).
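
A quick way to check the search-path requirement described above is sketched here. It assumes Python 3.3 or later (for shutil.which), which is newer than the Python 2.6 install mentioned above, and it only tests whether each name resolves as an executable on PATH; the helper name missing_tools is hypothetical and not part of alignreads.

```python
import shutil

# Names taken from the paragraph above; extend the list to match your install.
REQUIRED_TOOLS = ["runyasra.py", "sumqual.py", "qualtofa.py", "lastz"]


def missing_tools(names, path=None):
    """Return the names that cannot be found as executables on the search path."""
    return [name for name in names if shutil.which(name, path=path) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

A script that is present but not marked executable will be reported as missing by this check, so treat a "not found" result as a prompt to look closer rather than as proof of a broken install.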

How to use it: alignreads.py is an uncompiled Python script, so it must be run through the Python interpreter. It takes either 2 fasta files or 1 directory as its required arguments. Since this is a pipeline of five independent programs, there are many options that can be used; the full list is at the end of this document. The following configuration will run the full pipeline:

Syntax: python alignreads.py [options] <microreads> <reference>

Example: python alignreads.py --single-step --mask-contigs 35 -e -b 200 ../myReads.fa ../myRef.fa

To start the pipeline with data from a previous YASRA or alignreads run, use the following:

Syntax: python alignreads.py [options] <YASRA output directory>

Example: python alignreads.py --mask-contigs 35 -e -b 200 ../myYasraData

The input directory is used to hold the output of the other 4 scripts.
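
The two invocation forms above differ only in their positional arguments: two fasta files for a full run, or one directory to continue from earlier YASRA data. The sketch below shows one way a wrapper script could tell them apart before launching anything. The function name classify_inputs and the labels "full" and "resume" are hypothetical; this is not how alignreads.py itself is known to decide.

```python
import os


def classify_inputs(paths):
    """Return 'full' for <microreads> <reference>, 'resume' for one directory."""
    if len(paths) == 2 and all(os.path.isfile(p) for p in paths):
        return "full"
    if len(paths) == 1 and os.path.isdir(paths[0]):
        return "resume"
    raise ValueError(
        "expected two fasta files or one directory, got: %r" % (list(paths),)
    )
```

Checking this up front avoids starting a long YASRA run with a mistyped path.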

How it works: YASRA, an open-source short read assembler, is the first step in the pipeline. It uses a reference and a set of microreads to produce various files, including a set of contigs and their quality values. These contigs are then aligned to the same reference by NUCmer, which outputs a ".delta" file that contains the locations and indels of the contigs within the alignment. sumqual.py, a script from the Liston lab, takes the output of NUCmer, the reference, and the quality values for the contigs produced by YASRA, and makes a "consensus" quality value file, which contains multiple contigs aligned to the reference. Most of the important aspects of the alignment are represented in this data-rich format. Another script from the Liston lab, qualtofa.py, takes the "consensus" quality file and produces a fasta format alignment, which is easily viewed by programs such as BioEdit.
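
The data flow described above can be written down as an ordered list of stages, each with the inputs it needs and the outputs it produces. This is only a sketch of the description: the artifact labels (for example "contig_quals") are made up for illustration and are not real file names, and the delta-filter step implied by the option list is left out because the paragraph does not say where it sits.

```python
# Each stage: (name, inputs, outputs). Labels are illustrative, not file names.
STAGES = [
    ("YASRA", {"reads", "reference"}, {"contigs", "contig_quals"}),
    ("NUCmer", {"contigs", "reference"}, {"delta"}),
    ("sumqual.py", {"delta", "reference", "contig_quals"}, {"consensus_qual"}),
    ("qualtofa.py", {"consensus_qual"}, {"fasta_alignment"}),
]


def check_order(stages, initial):
    """Verify each stage's inputs exist before it runs; return all artifacts."""
    available = set(initial)
    for name, inputs, outputs in stages:
        missing = inputs - available
        if missing:
            raise ValueError("%s is missing inputs: %s" % (name, sorted(missing)))
        available |= outputs
    return available
```

Calling check_order(STAGES[1:], {"contigs", "contig_quals", "reference"}) mirrors continuing from a previous YASRA run: the first stage is skipped because its outputs already exist.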

Why use it: It is easier to use than running the constituent scripts independently. It is composed entirely of open-source code, so it's free! It is a single command, so multiple assemblies can be run from a batch script, and time-consuming calculations can be done automatically and unsupervised.
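
As a rough illustration of the batch use mentioned above, the sketch below builds one alignreads.py command per reads file and, unless told otherwise, only prints them. The file names in the usage note are hypothetical, and the helper names batch_commands and run_batch are not part of alignreads; only flags listed in this document are used.

```python
import subprocess


def batch_commands(read_files, reference, options=()):
    """Build one alignreads.py argument list per reads file."""
    return [
        ["python", "alignreads.py", *options, reads, reference]
        for reads in read_files
    ]


def run_batch(commands, dry_run=True):
    """Print each command; run them one after another when dry_run is False."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the batch at the first failing assembly.
            subprocess.run(argv, check=True)
```

For example, run_batch(batch_commands(["sampleA.fa", "sampleB.fa"], "myRef.fa", ["--single-step", "-b", "200"])) prints two commands without running them; pass dry_run=False to execute.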

What it can do: Here is the full list of available options. Options represented by lowercase letters are those that we anticipate will be used most frequently, so only these are shown in the standard help menu, which is reached by giving no arguments. To see information on all of the options, type either 'python alignreads.py -H' or 'python alignreads.py --advanced-help'.

Usage: python alignreads.py [options] <Reads in .fa file> <Reference> OR...

      python alignreads.py [options] <YASRA folder>

Options:

 --version             show program's version number and exit
 -h, --help            show this help message and exit
 -H, --advanced-help   Display help information for all supported options
                       (Default: only basic options are shown)
 -z STRING, --import-options=STRING
                       specify the path to a Command_Line_Record.txt file from
                       a previous run, or the folder that contains one. Any
                       other options used with this one are overwritten.
                       (Default: use options supplied)
 -Z, --debug           Save the debug log to the current working directory.
                       (Default: don't save)
 YASRA-Related Modifiers:
   -d, --silent        Nothing is printed to the screen (Default: print the
                       output of yasra to the screen)
   -t 454 or solexa, --read-type=454 or solexa
                       Specify the type of reads. (Default: solexa)
   -o circular or linear, --read-orientation=circular or linear
                       Specify orientation of the sequence. (Default:
                       circular)
   -p same, high, medium, low or very low, --percent-identity=same, high, medium, low or very low
                       The percent identity (PID in yasra). The settings
                       correspond to different percent values depending on
                       the read type (-t). (Default: same)
   -a, --single-step   Activate yasra's single_step option (Default: run
                       yasra normally)
   -E FILEPATH, --external-makefile=FILEPATH
                       Specify path to external makefile used by YASRA.
                       (Default: use the makefile built in to runyasra)
   -Q, --no-dot-replace-reads
                       Do NOT replace N's with dots (.) in the microreads
                       file before running yasra. (Default: replace N's with dots)
   -I, --no-dos2unix-ref
                       Do NOT run dos2unix on the reference before running
                       yasra. (Default: run dos2unix)
 NUCmer-Related Modifiers:
   -f STRING, --prefix=STRING
                       Set the output file prefix (Default: out)
   -b INT, --break-length=INT
                       Distance an alignment extension will attempt to extend
                       poor scoring regions before giving up (Default: 200)
   -j INT, --alternate-ref=INT
                       Specify a new reference to be used in the rest of the
                       alignment after yasra. (Default: use YASRA's
                       reference)
   -A mum, ref, or max, --anchor-uniqueness=mum, ref, or max
                       Specify how NUCmer chooses anchor matches using one of
                       three settings: mum = Use anchor matches that are
                       unique in both the reference and query, ref =  Use
                       anchor matches that are unique in the reference but
                       not necessarily unique in the query, max = Use all
                       anchor matches regardless of their uniqueness.
                       (Default = ref)
   -T INT, --min-cluster=INT
                       Minimum cluster length used in the NUCmer analysis.
                       (Default: 65)
   -D FLOAT, --diag-factor=FLOAT
                       Maximum diagonal difference factor for clustering,
                       i.e. diagonal difference / match separation used by
                       NUCmer. (Default: 0.12)
   -J, --no-extend     Prevent alignment extensions from their anchoring
                       clusters but still align the DNA between clustered
                       matches in NUCmer. (Default: extend)
   -F, --forward-only  Align only the forward strands of each sequence.
                       (Default: forward and reverse)
   -X INT, --max-gap=INT
                       Maximum gap between two adjacent matches in a cluster.
                       (Default: 90)
   -M INT, --min-match=INT
                       Minimum length of a maximal exact match. (Default:
                       20)
   -C, --coords        Automatically generate the <prefix>.coords file using
                       the 'show-coords' program with the -r option.
                       (Default: don't)
   -O, --no-optimize   Toggle alignment score optimization. Setting
                       --no-optimize will prevent alignment score optimization
                       and result in sometimes longer, but lower scoring
                       alignments (default: optimize)
   -S, --no-simplify   Do NOT simplify alignments by removing shadowed
                       clusters. Use this option if aligning a sequence to
                       itself to look for repeats. (Default: simplify)
 Delta-Filter-Related Modifiers:
   -y INT, --min-identity=INT
                       Set the minimum alignment identity [0, 100], (Default:
                       80)
   -l INT, --min-align-length=INT
                       Set the minimum alignment length (Default: 100)
   -K FLOAT, --max-overlap=FLOAT
                       Set the maximum alignment overlap for -r and -q
                       options as a percent of the alignment length [0, 100].
                       (Default: 100)
   -B, --query-alignment
                       Query alignment using length*identity weighted LIS.
                       For each query, leave only the alignments which form
                       the longest consistent set for the query. (Default:
                       global alignment)
   -R, --ref-alignment
                       Reference alignment using length*identity weighted
                       LIS. For each reference, leave only the alignments
                       which form the longest consistent set for the
                       reference. (Default: global alignment)
   -G, --global-alignment
                       Global alignment using length*identity weighted LIS
                       (longest increasing subset). For every reference-query
                       pair, leave only the alignments which form the longest
                       mutually consistent set. (this is the default)
   -U FLOAT, --min-uniqueness=FLOAT
                       Set the minimum alignment uniqueness, i.e. percent of
                       the alignment matching to unique reference AND query
                       sequence [0, 100]. (Default: 0)
 sumqual-Related Modifiers:
   -Y, --save-ref-dels
                       Save the sequence of the reference that corresponds to
                       empty gaps in the consensus in a fasta file. (Default:
                       don't save)
 qualtofa-Related Modifiers:
   -c, --exclude-contigs
                       Don't include each contig on its own line (Default:
                       include contigs)
   -i, --no-match-overlap
                       Add deletions (i.e. -'s) to the reference to
                       accommodate any overlapping matches. (Default:
                       Condense all overlapping regions of the consensus into
                       IUPAC ambiguity codes.)
   -e, --no-overlap    Add deletions (i.e. -'s) to the reference to
                       accommodate any overlapping sequence, including
                       unmatched sequence. (Default: Condense all overlapping
                       regions of the consensus into IUPAC ambiguity codes.)
   -k, --keep-contained
                       Include contained contigs (Default: save sequences of
                       contained contigs to a separate file)
   -q INT, --end-trim-qual=INT
                       Trim all the bases on either end of all contigs that
                       have a quality value less than the specified amount
                       (Default: 0)
   -s, --dont-save-SNPs
                       Don't save SNPs to a .qual file (Default: Save SNP file)
   -W, --dont-align-contigs
                       Do NOT align contigs to the reference using '-'s at
                       the start of each contig; independent of the
                       consensus. (Default: align contigs)
   -N INT, --end-trim-num=INT
                       Trim the ends of the contigs by the specified number
                       of bases. (Default: 0)
   -L INT, --min-match-length=INT
                       Set minimum length of the matching region of the
                       contigs. (Default: 50)
 Coverage and Call Proportion Masking:
   The following options take one integer argument and one decimal
   argument between 0 and 1; if the second is not supplied, it is assumed
   to be 0.
   -m, --mask-contigs  Set minimum coverage depth and call proportion for
                       contig masking; independent of the consensus. Cannot
                       be used with the -c modifier. (Default: 0, 0)
   -n, --mask-contig-SNPs
                       Set minimum coverage depth and call proportion for
                       contig SNP masking; independent of the consensus.
                       Cannot be used with the -c modifier. (Default: 0, 0)
   -w, --mask-consensus
                       Set minimum coverage depth and call proportion for the
                       consensus; a new masked sequence will be added to the
                       output file. (Default: 0, 0)
   -x, --mask-SNPs     Set minimum coverage depth and call proportion for
                       SNPs in the consensus; a new masked sequence will be
                       added to the output file. (Default: 0, 0)
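
The masking options above each take a coverage depth (integer) and an optional call proportion (decimal between 0 and 1, taken as 0 when omitted). The sketch below shows how such a pair could be parsed and validated; it is an illustration of the rule as described here, not the parsing code alignreads.py actually uses. The function name parse_mask_args and the requirement that the depth be non-negative are assumptions.

```python
def parse_mask_args(values):
    """Turn ['35'] or ['35', '0.8'] into (min_depth, min_call_proportion)."""
    if not 1 <= len(values) <= 2:
        raise ValueError("expected 1 or 2 values, got %d" % len(values))
    depth = int(values[0])
    # The second value is optional and is assumed to be 0 when absent.
    proportion = float(values[1]) if len(values) == 2 else 0.0
    # Assumption: a negative depth is never meaningful.
    if depth < 0 or not 0.0 <= proportion <= 1.0:
        raise ValueError("depth must be >= 0 and proportion within [0, 1]")
    return depth, proportion
```

Under this reading, '--mask-contigs 35' from the earlier example corresponds to parse_mask_args(["35"]), a minimum depth of 35 with a call proportion of 0.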