Liston:Computer Scripts: Difference between revisions

From OpenWetWare
No edit summary
 
(32 intermediate revisions by the same user not shown)
Line 1: Line 1:
This page contains the source code for some of the bioinformatics scripts used by the Liston Lab. Most of the scripts are written in Python 2.6.4 and are designed for Unix systems. A few are written as lists of Unix commands designed to be run as executables.
This page contains the source code for some of the bioinformatics scripts used by the Liston Lab. Most of the scripts are written in Python 2.6.4 and are designed for Unix systems, although most of them can also be run on Windows systems. A few are written as lists of Unix commands designed to be run as executables.




Line 15: Line 15:
This would run the script sumqual.py with the modifiers -c and -v, using myQualFile.qual and myMumFile as arguments. All of the scripts save their output in a file in the current working directory, with a name usually composed of some combination of the arguments and the name of the script. However, the output can be saved anywhere, under any name, by redirecting it with the shell's > operator:
This would run the script sumqual.py with the modifiers -c and -v, using myQualFile.qual and myMumFile as arguments. All of the scripts save their output in a file in the current working directory, with a name usually composed of some combination of the arguments and the name of the script. However, the output can be saved anywhere, under any name, by redirecting it with the shell's > operator:


     python sumqual.py -c -v ../myQualFile.qual ../myMumFile > ../myOutput.ext
     python sumqual.py -c -v ../myQualFile.qual ../myMumFile '''> ../myOutput.ext'''


The order in which the modifiers are given is not important; however, the order of the required arguments is. For example, the above modifiers could be entered in the opposite order (-v -c), but the two file paths must be given in a predetermined order. Every script has a description of what it does and how and when to use it in its source code. A list of all the modifiers that the script supports, and what they do, is also included. A similar help menu can be viewed by calling the script with no arguments. For example, typing the following,
The order in which the modifiers are given is not important; however, the order of the required arguments is. For example, the above modifiers could be entered in the opposite order (-v -c), but the two file paths must be given in a predetermined order. Some scripts have modifiers that require arguments of their own. These modifier arguments should be written directly after their respective modifier. For example, if the above modifier, -c, took an argument, one would type,
 
    python sumqual.py -c '''theArgument''' -v ../myQualFile.qual ../myMumFile
 
Every script has a description of what it does and how and when to use it in its source code. A list of all the modifiers that the script supports, and what they do, is also included. A similar help menu can be viewed by calling the script with no arguments. For example, typing the following,


     python sumqual.py
     python sumqual.py
Line 23: Line 27:
would cause a help menu to be printed to the screen.
would cause a help menu to be printed to the screen.


== Python Scripts ==
== Python Scripts ==  
[[baseanno.py]]
 
[[basediff.py]]
 
[[BPstats.py]]


[[gapstrip.py]]
<table border=2>
<tr><th>Script Name</th><th>Description</th><th>Input File Format</th><th>Output File Format</th></tr>
<tr><td>[[alignreads.py]] v2.22</td>
<td>A pipeline for combining the free aligner NUCmer and the free short read assembler YASRA so full alignments can be made with one command. Alignreads uses the following scripts from the Liston lab: runyasra.py, sumqual.py, and qualtofa.py. It also uses YASRA and its binaries, including lastz. The Python interpreter must be able to find these scripts in order to run alignreads; this can be done by saving copies of the scripts in the Python26 folder or your bin folder, or by modifying your .cshrc file to search where they are saved on your system (necessary if the lastz binaries cannot be found). alignreads preserves almost all of the functionality and output of the five programs it uses, so there are many options (over 40). Because there are so many modifiers, the help menu shown when no arguments are provided lists only the options expected to be used most often; to see all of the options, type 'python alignreads.py -H' or 'python alignreads.py --advanced-help'. Since there are many files associated with the alignment, the output is a folder with numerous files organized into subfolders. alignreads.py can also start the pipeline from a folder produced by a previous run of alignreads.py or runyasra.py. In that case, the output is saved in the input folder and separate subfolders are created for each alignment. A more complete description of alignreads can be found here: [[alignreads.py README]]. </td>
<td>A FASTA file containing reads and a FASTA file containing a reference  OR  a folder from a previous run of alignreads or runyasra.</td>
<td>A folder containing all of the output from every part of the pipeline, organized in subfolders based on each program's output.</td></tr>
<tr><td>[[allcomb.py]] v1.2</td>
<td>Finds all possible combinations of elements (or groups of elements) for every line of a tab-delimited text file and outputs every combination, on its own line, to a .txt file.</td>
<td>tab-delimited .txt file</td>
<td>tab-delimited .txt file</td></tr>
<tr><td>[[baseanno.py]] v1.0</td>
<td>Converts a file containing a list of annotations, as well as each of their respective start and stop indices, into a file containing a list of base indices, each followed by any annotations that apply at that specific base. Each line of the input file is expected to be whitespace-delimited; however, if your annotations have spaces in them, the script can be made to enforce tab-delimitation. The output file is always tab-delimited.</td>
<td>[Annotation text] [Start Index] [End Index]</td>
<td>[Base Index] [Annotation1] [Annotation2] ... [Annotation]</td></tr>
<tr><td>[[basediff.py]] v1.2</td>
<td>Finds base differences between multiple aligned sequences in a single FASTA file and outputs a tab-delimited .txt file containing the base values for all the sequences at each index where a difference occurred. The script has many modifiers that change what is considered a difference.</td>
<td>FASTA file containing two or more aligned sequences</td>
<td>.txt file in the following format: [Base index] [Seq 1 value] [Seq 2 value] ... [Seq N value]</td></tr>
<tr><td>[[BPstats.py]] v1.2</td>
<td>Calculates and outputs summary statistics for each contig in a GSS basepile output. It is assumed that the information for each contig is 1000 bases long. The script outputs the following statistics for each contig as a tab-delimited list:
<ul>
<li>Reference Match Length: The index of the last known base (i.e. not 'N')</li>
<li>Target Match Sum: The number of known bases (i.e. not 'N')</li>
<li>Coverage Proportion: The ratio of 'Target Match Sum' to 'Reference Match Length'</li>
<li>Average Density: The average density value for all bases within 'Reference Match Length'</li>
<li>Median Density: The median of the entire range of density values</li>
</ul></td>
<td>GSS Basepile output. 1000 bases per contig</td>
<td>tab-delimited list of the statistics for each contig</td></tr>
<tr><td>[[coveragestats.py]] v1.12</td>
<td>Calculates statistics for each contig in a Tablet output file and outputs them to a tab-delimited '.txt' file, each set following the name of its respective contig. The statistics calculated are: number of coverage values, average coverage depth, percent of bases with a coverage greater than zero, and maximum coverage depth. coveragestats.py requires that contig names have at least one non-numerical character.</td>
<td>Tablet coverage output file (.txt) or one of the same format</td>
<td>Tab-delimited '.txt' file</td></tr>
<tr><td>[[fatomum.py]] v1.0</td>
<td>Creates a MUMmer-style output from a FASTA file of aligned sequences. The first line is always considered the reference and is not included in the output. The start index in terms of the reference, and in terms of the contig, as well as the length are recorded for each matching section of the contig.</td>
<td>FASTA file containing a reference and its aligned contigs</td>
<td>MUMmer-like output</td></tr>
<tr><td>[[gapstrip.py]] v1.0</td>
<td>Removes all of the gaps (i.e. '-') that are common (i.e. at the same index) to all of the sequences in an aligned FASTA file. By default, the first sequence is considered the reference and is excluded from the analysis, but the number of sequences that are treated as such can be changed.</td>
<td>FASTA file containing two or more aligned sequences.</td>
<td>FASTA file with the same number of sequences as the input file, but with the common gaps removed</td></tr>
<tr><td>[[runyasra.py]] v1.02</td>
<td>Runs the free short read aligner YASRA and organizes its output into a folder. Links to the input data are also saved in the output folder.</td>
<td>A FASTA file containing reads and a FASTA file containing a reference  OR  a folder from a previous run of alignreads or runyasra.</td>
<td>A folder containing all of YASRA's output as well as links to the input data.</td></tr>
<tr><td>[[sumqual.py]] v2.12</td>
<td>A script used in conjunction with the free aligner NUCmer and the free short read assembler YASRA. Creates a file with a format similar to that of the .qual file YASRA outputs, except that each base can have multiple sets of quality values. The format preserves all of the alignment, indel, quality, and reference data from the input files. Uses a .qual file containing the quality scores of multiple contigs and the output file of NUCmer to make a "consensus" .qual file, which contains the quality values in terms of the reference. Sumqual's output can be converted to FASTA format using qualtofa.py.</td>
<td><ul><li>FASTA file containing the sequence of multiple contigs</li>
<li>NUCmer output file</li>
<li>Reference file used with Nucmer</li></ul></td>
<td>similar to .qual input file, except each base can hold the information from more than one contig</td></tr>
<tr><td>[[qualtofa.py]] v2.52</td>
<td>Selectively extracts the sequence from a quality file and outputs it to a FASTA file. It is designed to be used in conjunction with sumqual.py and accepts its consensus-style output, which preserves all of the alignment, quality and reference information. Conflicting bases on overlapping matches from a consensus-style output are condensed into IUPAC ambiguity codes; alternatively, if the corresponding option is used, gaps can be added to the reference to accommodate repeated or unmatched sequence. It has several masking and trimming options. </td>
<td>quality file (.qual); accepts sumqual.py's output</td>
<td>FASTA file</td></tr>
</table>

Latest revision as of 17:07, 22 September 2010

This page contains the source code for some of the bioinformatics scripts used by the Liston Lab. Most of the scripts are written in Python 2.6.4 and are designed for Unix systems, although most of them can also be run on Windows systems. A few are written as lists of Unix commands designed to be run as executables.


Python Script Conventions

The scripts are run with the Python interpreter using the following format:

   python theScript.py [modifiers] <Arguments>

For example, in order to run the script sumqual.py one could enter the following into a Unix shell:

   python sumqual.py -c -v ../myQualFile.qual ../myMumFile

This would run the script sumqual.py with the modifiers -c and -v, using myQualFile.qual and myMumFile as arguments. All of the scripts save their output in a file in the current working directory, with a name usually composed of some combination of the arguments and the name of the script. However, the output can be saved anywhere, under any name, by redirecting it with the shell's > operator:

   python sumqual.py -c -v ../myQualFile.qual ../myMumFile > ../myOutput.ext

The order in which the modifiers are given is not important; however, the order of the required arguments is. For example, the above modifiers could be entered in the opposite order (-v -c), but the two file paths must be given in a predetermined order. Some scripts have modifiers that require arguments of their own. These modifier arguments should be written directly after their respective modifier. For example, if the above modifier, -c, took an argument, one would type,

   python sumqual.py -c theArgument -v ../myQualFile.qual ../myMumFile

Every script has a description of what it does and how and when to use it in its source code. A list of all the modifiers that the script supports, and what they do, is also included. A similar help menu can be viewed by calling the script with no arguments. For example, typing the following,

   python sumqual.py

would cause a help menu to be printed to the screen.
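
The calling convention described above (modifiers in any order, a modifier argument written directly after its modifier, required arguments in a fixed order, and a help menu when no arguments are given) can be sketched with Python's standard argparse module. This is only an illustration and is not taken from the source of sumqual.py: it is written for Python 3 rather than 2.6.4, and the script name exampleScript.py, the meanings given to -c and -v, and the names qual_file, mum_file, build_parser and main are all assumptions made for the example.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical script that follows the conventions above."""
    parser = argparse.ArgumentParser(
        prog="exampleScript.py",  # hypothetical script name
        description="Hypothetical script illustrating the calling convention.",
    )
    # Modifiers are optional and may be given in any order.
    parser.add_argument("-v", action="store_true",
                        help="hypothetical on/off modifier")
    # A modifier that takes its own argument, written directly after it.
    parser.add_argument("-c", metavar="VALUE", default=None,
                        help="hypothetical modifier that takes an argument")
    # Required arguments: these are positional, so their order is fixed.
    parser.add_argument("qual_file", help="first required argument")
    parser.add_argument("mum_file", help="second required argument")
    return parser


def main(argv):
    """Print the help menu when called with no arguments, else parse them."""
    parser = build_parser()
    if not argv:
        parser.print_help()
        return 0
    args = parser.parse_args(argv)
    print(args.c, args.v, args.qual_file, args.mum_file)
    return 0
```

A real script would call main(sys.argv[1:]) from its entry point, so that running it with no arguments prints the help menu.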

Python Scripts

<table border=2>
<tr><th>Script Name</th><th>Description</th><th>Input File Format</th><th>Output File Format</th></tr>
<tr><td>alignreads.py v2.22</td>
<td>A pipeline for combining the free aligner NUCmer and the free short read assembler YASRA so full alignments can be made with one command. Alignreads uses the following scripts from the Liston lab: runyasra.py, sumqual.py, and qualtofa.py. It also uses YASRA and its binaries, including lastz. The Python interpreter must be able to find these scripts in order to run alignreads; this can be done by saving copies of the scripts in the Python26 folder or your bin folder, or by modifying your .cshrc file to search where they are saved on your system (necessary if the lastz binaries cannot be found). alignreads preserves almost all of the functionality and output of the five programs it uses, so there are many options (over 40). Because there are so many modifiers, the help menu shown when no arguments are provided lists only the options expected to be used most often; to see all of the options, type 'python alignreads.py -H' or 'python alignreads.py --advanced-help'. Since there are many files associated with the alignment, the output is a folder with numerous files organized into subfolders. alignreads.py can also start the pipeline from a folder produced by a previous run of alignreads.py or runyasra.py. In that case, the output is saved in the input folder and separate subfolders are created for each alignment. A more complete description of alignreads can be found here: alignreads.py README.</td>
<td>A FASTA file containing reads and a FASTA file containing a reference OR a folder from a previous run of alignreads or runyasra.</td>
<td>A folder containing all of the output from every part of the pipeline, organized in subfolders based on each program's output.</td></tr>
<tr><td>allcomb.py v1.2</td>
<td>Finds all possible combinations of elements (or groups of elements) for every line of a tab-delimited text file and outputs every combination, on its own line, to a .txt file.</td>
<td>tab-delimited .txt file</td>
<td>tab-delimited .txt file</td></tr>
<tr><td>baseanno.py v1.0</td>
<td>Converts a file containing a list of annotations, as well as each of their respective start and stop indices, into a file containing a list of base indices, each followed by any annotations that apply at that specific base. Each line of the input file is expected to be whitespace-delimited; however, if your annotations have spaces in them, the script can be made to enforce tab-delimitation. The output file is always tab-delimited.</td>
<td>[Annotation text] [Start Index] [End Index]</td>
<td>[Base Index] [Annotation1] [Annotation2] ... [Annotation]</td></tr>
<tr><td>basediff.py v1.2</td>
<td>Finds base differences between multiple aligned sequences in a single FASTA file and outputs a tab-delimited .txt file containing the base values for all the sequences at each index where a difference occurred. The script has many modifiers that change what is considered a difference.</td>
<td>FASTA file containing two or more aligned sequences</td>
<td>.txt file in the following format: [Base index] [Seq 1 value] [Seq 2 value] ... [Seq N value]</td></tr>
<tr><td>BPstats.py v1.2</td>
<td>Calculates and outputs summary statistics for each contig in a GSS basepile output. It is assumed that the information for each contig is 1000 bases long. The script outputs the following statistics for each contig as a tab-delimited list:
<ul>
<li>Reference Match Length: The index of the last known base (i.e. not 'N')</li>
<li>Target Match Sum: The number of known bases (i.e. not 'N')</li>
<li>Coverage Proportion: The ratio of 'Target Match Sum' to 'Reference Match Length'</li>
<li>Average Density: The average density value for all bases within 'Reference Match Length'</li>
<li>Median Density: The median of the entire range of density values</li>
</ul></td>
<td>GSS Basepile output. 1000 bases per contig</td>
<td>tab-delimited list of the statistics for each contig</td></tr>
<tr><td>coveragestats.py v1.12</td>
<td>Calculates statistics for each contig in a Tablet output file and outputs them to a tab-delimited '.txt' file, each set following the name of its respective contig. The statistics calculated are: number of coverage values, average coverage depth, percent of bases with a coverage greater than zero, and maximum coverage depth. coveragestats.py requires that contig names have at least one non-numerical character.</td>
<td>Tablet coverage output file (.txt) or one of the same format</td>
<td>Tab-delimited '.txt' file</td></tr>
<tr><td>fatomum.py v1.0</td>
<td>Creates a MUMmer-style output from a FASTA file of aligned sequences. The first line is always considered the reference and is not included in the output. The start index in terms of the reference, and in terms of the contig, as well as the length are recorded for each matching section of the contig.</td>
<td>FASTA file containing a reference and its aligned contigs</td>
<td>MUMmer-like output</td></tr>
<tr><td>gapstrip.py v1.0</td>
<td>Removes all of the gaps (i.e. '-') that are common (i.e. at the same index) to all of the sequences in an aligned FASTA file. By default, the first sequence is considered the reference and is excluded from the analysis, but the number of sequences that are treated as such can be changed.</td>
<td>FASTA file containing two or more aligned sequences.</td>
<td>FASTA file with the same number of sequences as the input file, but with the common gaps removed</td></tr>
<tr><td>runyasra.py v1.02</td>
<td>Runs the free short read aligner YASRA and organizes its output into a folder. Links to the input data are also saved in the output folder.</td>
<td>A FASTA file containing reads and a FASTA file containing a reference OR a folder from a previous run of alignreads or runyasra.</td>
<td>A folder containing all of YASRA's output as well as links to the input data.</td></tr>
<tr><td>sumqual.py v2.12</td>
<td>A script used in conjunction with the free aligner NUCmer and the free short read assembler YASRA. Creates a file with a format similar to that of the .qual file YASRA outputs, except that each base can have multiple sets of quality values. The format preserves all of the alignment, indel, quality, and reference data from the input files. Uses a .qual file containing the quality scores of multiple contigs and the output file of NUCmer to make a "consensus" .qual file, which contains the quality values in terms of the reference. Sumqual's output can be converted to FASTA format using qualtofa.py.</td>
<td><ul><li>FASTA file containing the sequence of multiple contigs</li>
<li>NUCmer output file</li>
<li>Reference file used with NUCmer</li></ul></td>
<td>similar to .qual input file, except each base can hold the information from more than one contig</td></tr>
<tr><td>qualtofa.py v2.52</td>
<td>Selectively extracts the sequence from a quality file and outputs it to a FASTA file. It is designed to be used in conjunction with sumqual.py and accepts its consensus-style output, which preserves all of the alignment, quality and reference information. Conflicting bases on overlapping matches from a consensus-style output are condensed into IUPAC ambiguity codes; alternatively, if the corresponding option is used, gaps can be added to the reference to accommodate repeated or unmatched sequence. It has several masking and trimming options.</td>
<td>quality file (.qual); accepts sumqual.py's output</td>
<td>FASTA file</td></tr>
</table>
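
As an illustration of one of the operations listed above, the common-gap removal described for gapstrip.py can be sketched in a few lines of Python. This is not the source of gapstrip.py: the function name strip_common_gaps and the parameter n_reference are hypothetical, the sequences are assumed to be already read from the FASTA file into equal-length strings, and it is assumed that a dropped column is also removed from the reference sequences so that the alignment stays consistent.

```python
def strip_common_gaps(sequences, n_reference=1):
    """Drop alignment columns that are gaps ('-') in every analysed sequence.

    sequences   -- list of aligned sequences as equal-length strings
    n_reference -- number of leading sequences treated as references; they
                   are ignored when deciding which columns to drop
                   (assumption: dropped columns are still removed from them)
    """
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")

    analysed = sequences[n_reference:]
    if not analysed:
        # Nothing to analyse, so no column can be called a common gap.
        return list(sequences)

    # Keep a column if at least one analysed sequence has a base there.
    keep = [i for i in range(length)
            if any(seq[i] != "-" for seq in analysed)]
    return ["".join(seq[i] for i in keep) for seq in sequences]
```

For example, strip_common_gaps(["ACGTAC", "A--T-C", "AC-T-C"]) drops the third and fifth columns, which are gaps in both non-reference sequences.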