BioMicroCenter:Software: Difference between revisions

Latest revision as of 08:21, 14 December 2022

A large amount of bioinformatics software is available at MIT. This page summarizes the packages we are asked about most often. The BioMicro Center collaborates with the Koch Institute Bioinformatics Computing Core and the MIT Libraries to support different packages.

Desktop Software

Desktop software is available from our Download Page. Access may be limited to MIT users only. Below is a list of the software available:

Agilent 2100 Expert This software package is used to control the Agilent 2100 Bioanalyzer and to perform analysis of the output, including microfluidic and electrophoretic assays for RNA, DNA and proteins, as well as two-color flow cytometry. The software can be installed on your desktop to allow users to do additional analyses.

SSH This is the software we recommend for UNIX access to rous and for downloading files from our servers.

Spotfire is a widely used data analysis and visualization tool. It can handle a number of clustering functions and statistical tests and has very robust graphical capabilities. The BioMicro Center operates a Spotfire server that is available to anyone at MIT. Licenses for Spotfire are available through the BioMicro Center on a yearly basis.

MATLAB A mathematical programming language used for mathematical modeling, as well as analyzing and visualizing data.

Wafergen Software Software for analyzing Wafergen data.

Roche LC480 Software for the Roche LightCyclers.

Tecan EvoWare Standard This software is available as part of our robotics service. Identical to the software used on the Tecan EVO 150s, the software contains a simulator that can be used to design your robotics experiments at your bench. Note that this software is on a different server.

MacVector A comprehensive Macintosh application that provides sequence editing, primer design, internet database searching, protein analysis, sequence confirmation, multiple sequence alignment, phylogenetic reconstruction, coding region analysis, and a wide variety of other functions.

Lasergene v8.0 A software package that provides sequence assembly, including next-generation sequence analysis; simplified primer design; and expanded SNP reporting and management.

Galaxy

Galaxy is a bioinformatics platform that is designed to bring complicated informatics tools to bench scientists. Galaxy allows you to do analyses you cannot do anywhere else, without the need to install or download anything. You can analyze multiple alignments, compare genomic annotations, profile metagenomic samples, and much more. For many users, the public Galaxy instance at Penn State can provide a very robust tool.

To make things even easier, we have created a Galaxy server here at MIT. The Galaxy server acts as a separate head node for ROUS. Users are required to have data storage space on Rowley or BMC-PUB and may be required to purchase a queue on ROUS.

Additional Resources

Software from MIT Libraries

BioBASE The BIOBASE Knowledge Library (BKL) contains comprehensive sets of protein databases such as HumanPSD, WormPD, GPGR-PD, PombePD, and MycopathPD in addition to analysis tools such as TRANSFAC, TRANSPATH, and ExPlain. BKL brings together curated data, analysis tools, and gene-centered information. BKL is one of the best ways to quickly assess a vast set of protein properties for a given protein or set of proteins.

GeneGO Metacore GeneGo is a leading provider of data mining & analysis solutions in systems biology. MetaCore, GeneGo's flagship product, is an integrated software suite for functional analysis of experimental data. MetaCore is based on a curated database of human protein-protein, protein-DNA interactions, transcription factors, signaling and metabolic pathways, disease and toxicity, and the effects of bioactive molecules.

INGENUITY PATHWAY ANALYSIS Software that helps researchers model, analyze, and understand complex biological and chemical systems relevant to their experimental data. Researchers can search the scientific literature and find insights most relevant to their experimental data; analyze and build pathway models related to their experimental data; and share and collaborate with colleagues. IPA is currently licensed through June 2012.

UNIX SERVER

A large amount of software is installed on our cluster server. Please look at the ROUS page.

BMC-BCC Pipeline

The pipeline processes flowcell directories as they are generated by the Illumina sequencer software and postprocesses the output for use in downstream biological analyses. It is intended to be used by core facilities who own and/or operate Illumina sequencers for automation and consistency of processing Illumina data. The pipeline is a collection of command line utilities written primarily in the Python programming language. The commands are tied together using the ruffus pipelining package.

Release Notes 1.8 (11/11/2020)

Added support for NovaSeq flowcells
Updated fastq format for NextSeq and NovaSeq flowcells
Generated sample sheets for NextSeq and NovaSeq flowcells for demultiplexing
Added support for cellranger-atac and spaceranger in the 10x pipeline
Added various flags for re-run options
Refactored code
Multiple projects can be loaded on the same lane

Release Notes 1.7 (09/01/2017)

Added support for the 10x pipeline. Cellranger was upgraded from 2.0.2 to 3.0.1 on 1/11/2019

Added precheck options
- Warn when samples have indices that differ by only 1 nt when a 2nd index exists
- Warn when inline indices are used but no start nt and length are given
- Check that the existing project path is writable
- Check that the genome exists
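
The first precheck above can be illustrated with a minimal sketch. It is not the pipeline's code: the function names, the one-index-string-per-sample input, and the threshold parameter are assumptions made for illustration, and the way the pipeline combines a first and second index is not described here.

```python
from __future__ import annotations

from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length index sequences."""
    if len(a) != len(b):
        raise ValueError("index sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def close_index_pairs(samples: dict[str, str], max_distance: int = 1) -> list[tuple[str, str, int]]:
    """Return (sample_a, sample_b, distance) for each pair of samples whose
    indices differ by at most max_distance nucleotides."""
    flagged = []
    for (name_a, idx_a), (name_b, idx_b) in combinations(sorted(samples.items()), 2):
        distance = hamming(idx_a, idx_b)
        if distance <= max_distance:
            flagged.append((name_a, name_b, distance))
    return flagged


if __name__ == "__main__":
    # Hypothetical sample sheet: sample name -> index sequence.
    sheet = {"S1": "ACGTAC", "S2": "ACGTAT", "S3": "TTGCAA"}
    for a, b, d in close_index_pairs(sheet):
        print(f"WARNING: {a} and {b} have indices differing by {d} nt")
```

Indices this close are risky because a single sequencing error in the index read can assign a read to the wrong sample.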

Reliability and performance improvement
- Corrected the unmapped read percentage for paired-end reads
- Corrected the lane barcode percentage (the percentage is now calculated against the pass filter count instead of the total count)
- Adjusted number of threads based on sample numbers
- Updated PPPQC code to handle flowcells whose reverse read is longer than the forward read
- Updated sam_stat code to use "samtools flagstat" instead of an in-house script to compute sam file statistics
- Changed the delivery URL from rowley.mit.edu to bmc-data.mit.edu
- Added project name and URL in pipeline run notification email
- Added project name in delivery email
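
As a small worked example of the corrected lane barcode percentage, the sketch below divides by the pass filter count rather than the total count. The function name and the read counts are hypothetical and chosen only to make the arithmetic visible.

```python
def barcode_percentage(barcode_reads: int, pass_filter_reads: int) -> float:
    """Percentage of a lane's pass-filter reads assigned to one barcode."""
    if pass_filter_reads <= 0:
        raise ValueError("pass-filter read count must be positive")
    return 100.0 * barcode_reads / pass_filter_reads


# Hypothetical lane: 200M clusters in total, 160M pass filter, 40M carry the barcode.
total_reads = 200_000_000
pass_filter_reads = 160_000_000
barcode_reads = 40_000_000

print(barcode_percentage(barcode_reads, pass_filter_reads))  # 25.0 (corrected denominator)
print(100.0 * barcode_reads / total_reads)                   # 20.0 (old denominator)
```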

Third party software package update
- bcl2fastq from 2.15.0 to 2.19.1
- bwa from 0.7.12 to 0.7.16a
- fastqc from 0.11.4 to 0.11.5
- samtools from 1.3 to 1.5
- bowtie2 from 2.2.6 to 2.3.2

Release Notes 1.6 (03/18/2017)

Added support for the SLURM scheduler
Added support for CentOS
Added delivery of BCL files through the web when requested
Performance improvement (NextSeq fastq generation and PPPQC)

Release Notes 1.5.2 (08/12/2016)

Added support for GA/NextSeq 2nd index
Improved the portability of the pipeline through code reorganization

Release Notes 1.5.1 (06/12/2016)

Added the support for HT3DGE (high-throughput digital gene expression) project
Publish infosite to Filemaker database

Release Notes 1.5 (01/26/2016)

Added phiX percent perfect plot to calculate sequencing error rate for HiSeq and MiSeq
The percent perfect plot created by the PPPQC script is designed to estimate the next-generation sequencing error rate. The calculation can be applied to paired-end or single-end sequencing on NextSeq, MiSeq, or HiSeq, depending on the specific sequencing run. The script compares the sequenced spike-in PhiX reads with the PhiX reference genome sequence. To avoid potential alignment issues with poor-quality reads, the script first aligns the first 30 base pairs of each read to identify PhiX reads and to separate forward reads from reverse reads. The full-length PhiX forward and reverse reads are then retrieved and compared to the reference sequence. The percentages of reads with zero, <=1, <=2, <=3, and <=4 mismatches are calculated and plotted at each nucleotide position. For NextSeq sequencing, the reads from each camera are processed separately. For HiSeq sequencing, the reads from each lane are processed separately. For paired-end reads, the reads from each mate are processed separately. Because Illumina sequencing has a very low indel error rate, reads with indels relative to the reference sequence are excluded from the current calculation.
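
The per-position calculation described above might be sketched as follows. This is an illustration, not the PPPQC script: it assumes each PhiX read has already been paired with the reference segment it aligns to (same length, no indels), and it interprets "at each nucleotide position" as the cumulative mismatch count from the start of the read up to that position. The function and variable names are hypothetical.

```python
from __future__ import annotations


def percent_perfect(pairs: list[tuple[str, str]], max_mismatches: int = 4) -> dict[int, list[float]]:
    """For each threshold k in 0..max_mismatches, return a list whose entry i is
    the percentage of reads with at most k mismatches in bases 0..i."""
    if not pairs:
        raise ValueError("no reads given")
    length = len(pairs[0][0])
    counts = {k: [0] * length for k in range(max_mismatches + 1)}
    for read, ref in pairs:
        if len(read) != length or len(ref) != length:
            raise ValueError("all reads and reference segments must share one length")
        mismatches = 0
        for i, (base, ref_base) in enumerate(zip(read, ref)):
            if base != ref_base:
                mismatches += 1
            # The read still counts for every threshold k >= its running mismatch total.
            for k in range(mismatches, max_mismatches + 1):
                counts[k][i] += 1
    n = len(pairs)
    return {k: [100.0 * c / n for c in row] for k, row in counts.items()}


if __name__ == "__main__":
    # Toy (read, reference segment) pairs; real input would be PhiX spike-in reads.
    toy = [("ACGT", "ACGT"), ("ACGA", "ACGT"), ("TCGA", "ACGT"), ("ACGT", "ACGT")]
    for k, row in percent_perfect(toy, max_mismatches=2).items():
        print(f"<= {k} mismatches:", row)
```

Each returned list corresponds to one curve in the plot. In practice the calculation would be run separately per lane, camera, or mate, as described above.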

Added CNV quality control plot for ChIP, ReSeq and CGHSeq sample types
The CNV quality control plot created by the CNVQC script uses downsampled bam files to plot DNA copy numbers along the reference genome. Both mappability and GC% are considered during normalization. Potential gains are marked in red and losses are marked in green. It currently supports the hg19 and mm9 genomes.

Upgraded software tools including fastqc, bwa, samtools, and bedtools
- fastqc upgraded from 0.11.2 to 0.11.4
- bwa upgraded from 0.7.10 to 0.7.12
- samtools upgraded from 0.1.19 to 1.3
- bedtools upgraded from 2.20.1 to 2.25.0

Improved performance and robustness
- Added a precheck flag -c to check the filemaker database to avoid human error
- Allowed the recursive pulling of samples in a subpool when creating sample json file
- Improved the robustness of sample json file when handling mixed barcodes
- Added a second person to receive delivery email if specified
- Enabled creating tarball of the flowcell directory after pipeline run ends
- Simplified the process to create a new release
- Reworked the code on publishing project data to avoid intermittent file system error
- Added flowcell as part of SGE job name to easily identify pipeline runs in the cluster
- Used 32 threads as default instead of 16 after new nodes were added to the rous cluster

Release Notes 1.4 (01/01/2015)

The quality scores of fastq files are now in Sanger format (previously the quality scores were in the Illumina 1.3+ format)
Added support for NextSeq.
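
For readers handling older deliveries, the difference between the two encodings can be shown with a short sketch. Illumina 1.3+ quality strings store each Phred score as the character with ASCII code score + 64, while Sanger format uses score + 33. The function name is hypothetical, and this is not the pipeline's own conversion code.

```python
def illumina13_to_sanger(quality: str) -> str:
    """Convert a Phred+64 (Illumina 1.3+) quality string to Phred+33 (Sanger)."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64
        if score < 0:
            raise ValueError(f"character {ch!r} is not valid in a Phred+64 string")
        converted.append(chr(score + 33))
    return "".join(converted)


if __name__ == "__main__":
    # 'h' encodes Q40 in Phred+64; the Sanger equivalent is 'I'.
    print(illumina13_to_sanger("hhhh"))
```

Tools that assume Sanger encoding would misread old Phred+64 files as having much higher quality, so it is worth checking which release produced a given fastq file.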

Release Notes 1.3 (07/25/2014)

Paired-end quality control is added for samples aligned to genomes other than phiX. It summarizes basic mapping metrics from the BWA alignments to identify properly mapped reads and provides a distribution of insert lengths based on these mappings.
RNAseq quality control is added for RNASeq data for a list of genomes other than phiX. It checks the distribution of the reads, 5' to 3' bias, strand specificity, and ribosomal RNA contamination. It also checks gene expression correlation between samples when applicable.
Software upgrade: BWA is upgraded to 0.7.10 and fastqc is upgraded to 0.11.2
Improved the algorithms for demultiplexing and handling index mismatches
Performance enhancement. The pipeline now uses 16 threads by default instead of 8, which significantly reduces the runtime for a HiSeq run.
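
One way to derive an insert length distribution from paired-end alignments is sketched below. It reads SAM text records, keeps those flagged as properly paired (FLAG bit 0x2), and tallies the template length (TLEN, the ninth column), counting each pair once by using only the mate with a positive TLEN. This is an assumption about a reasonable approach, not a description of the pipeline's QC code, and the function name is hypothetical.

```python
from collections import Counter

PROPER_PAIR = 0x2  # SAM FLAG bit: each segment properly aligned


def insert_length_histogram(sam_lines):
    """Tally TLEN over properly paired records, one count per pair."""
    histogram = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        tlen = int(fields[8])
        # The leftmost mate carries the positive TLEN, so each pair is counted once.
        if flag & PROPER_PAIR and tlen > 0:
            histogram[tlen] += 1
    return histogram


if __name__ == "__main__":
    # Toy records; a real run would iterate over the lines of a SAM file.
    records = [
        "@SQ\tSN:chr1\tLN:1000",
        "r1\t99\tchr1\t100\t60\t4M\t=\t250\t200\tACGT\tIIII",
        "r1\t147\tchr1\t250\t60\t4M\t=\t100\t-200\tACGT\tIIII",
        "r2\t65\tchr1\t10\t60\t4M\tchr2\t500\t0\tACGT\tIIII",
    ]
    print(dict(insert_length_histogram(records)))
```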

Release Notes 1.2 (01/01/2014)

An information site about the pipeline run is delivered to MIT users
Sample data directory includes the flowcell code
Bug fix for pipeline re-run. When the pipeline was re-run, data could be duplicated in the fastq files. This is now fixed.
Performance enhancement. Data is written directly to the published directory for users, and copy is avoided whenever possible. This not only reduces disk storage, but also allows users to get their data faster.

Release Notes 1.0.2 (08/19/2013)

Switched from Bowtie to BWA for default alignments for generating SAM files.
BWA version 0.7.5a is used by default for alignment. For Illumina sequence reads up to 70bp, the alignment is done by aln/samse/sampe (the BWA-backtrack algorithm). For longer reads (> 70bp), the mem subcommand (the BWA-MEM algorithm) is used.
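
The read-length rule above can be sketched as a small planning function that returns the BWA commands to run. The 70bp threshold and the subcommands come from the note above; the function name, the file names, and the idea of returning (argv, stdout file) pairs are assumptions for illustration, and the pipeline's actual options are not shown.

```python
def plan_bwa_alignment(read_length, reference, fastq1, fastq2=None, max_backtrack_length=70):
    """Return a list of (argv, stdout_file) steps for aligning one sample.

    Reads up to max_backtrack_length use aln + samse/sampe (BWA-backtrack);
    longer reads use mem (BWA-MEM). Each command writes its result to stdout,
    which the caller is expected to redirect to stdout_file.
    """
    if read_length > max_backtrack_length:
        reads = [fastq1] + ([fastq2] if fastq2 else [])
        return [(["bwa", "mem", reference, *reads], "aln.sam")]

    steps = [(["bwa", "aln", reference, fastq1], "r1.sai")]
    if fastq2 is None:
        steps.append((["bwa", "samse", reference, "r1.sai", fastq1], "aln.sam"))
    else:
        steps.append((["bwa", "aln", reference, fastq2], "r2.sai"))
        steps.append((["bwa", "sampe", reference, "r1.sai", "r2.sai", fastq1, fastq2], "aln.sam"))
    return steps


if __name__ == "__main__":
    # Hypothetical file names.
    for argv, out in plan_bwa_alignment(50, "phiX.fa", "s_1.fq", "s_2.fq"):
        print(" ".join(argv), ">", out)
```

Each step could be executed with `subprocess.run(argv, stdout=open(out, "w"), check=True)`, which keeps the shell out of the picture.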

Bug fix for large SAM/BAM files
When processing large fastq files to generate a sam file, the sam file could be corrupted at the end of the file under certain circumstances if it was larger than 40GB. As a result, the SAM-BAM conversion could fail with a core dump. This is now fixed.

Release Notes 0.9 (10/18/2011)

Implemented all core functionality:

setting up and converting qseq files
qseq to fastq
fastqc and tag count statistics on flowcell-level sequences
splitting of barcoded samples into individual directories
individual fastqc
genome alignment using bowtie plus statistics
contamination qc checking
tag counts
conversion of alignments from SAM to BAM
production of bigWig files from SAM alignments
publishing user data to web directories
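
As an illustration of the qseq to fastq step listed above, the sketch below converts one qseq record into a fastq record. A qseq line has 11 tab-separated fields (machine, run, lane, tile, x, y, index, read number, sequence, quality, filter), with '.' marking an uncalled base. The read-name layout, the choice to drop reads that fail the filter, and the function name are assumptions, not the pipeline's documented behavior. Quality characters are passed through unchanged.

```python
def qseq_to_fastq(line, keep_failed=False):
    """Convert one qseq line to a fastq record, or return None if it is filtered out."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 qseq fields, found {len(fields)}")
    machine, run, lane, tile, x, y, index, read_no, seq, qual, passed = fields
    if passed != "1" and not keep_failed:
        return None
    # Assumed read-name layout; adjust to whatever downstream tools expect.
    name = f"{machine}_{run}:{lane}:{tile}:{x}:{y}#{index}/{read_no}"
    # qseq marks uncalled bases with '.', fastq conventionally uses 'N'.
    return f"@{name}\n{seq.replace('.', 'N')}\n+\n{qual}\n"


if __name__ == "__main__":
    # Hypothetical record.
    record = "M1\t7\t3\t1101\t1234\t5678\tACGTAC\t1\tAC.T\thhhh\t1"
    print(qseq_to_fastq(record), end="")
```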

PPR Program

Generates Percent Perfect Reads for MiSeq, NextSeq, and HiSeq data with PhiX spike-in

@@ Line 8: / Line 8: @@
 * '''Agilent 2100 Expert''' This software package is used to control the [[BioMicroCenter:2100BioAnalyzer|Agilent 2100 Bioanalyzer]] and to perform analysis of the output, including microfluidic and electrophoretic assays for RNA, DNA and proteins, as well as two-color flow cytometry. The software can be installed on your desktop to allow users to do additional analyses.
-* '''SSH''' This software is what we recommend for UNIX access to rous and for downloading files form our servers
+* '''SSH''' This software is what we recommend for UNIX access to rous and for downloading files from our servers
-* [http://spotfire.tibco.com/ '''Spotfire'''] is a widely used data analysis and visualization tool. It can handle a number of clustering functions and statistical tests and has very robust graphical capabilities. The BioMicro Center operates a Spotfire server that is available to anyone at MIT. Licenses for Spotfire are available through the BioMicro Center on a yearly basis.ew
+* [http://spotfire.tibco.com/ '''Spotfire'''] is a widely used data analysis and visualization tool. It can handle a number of clustering functions and statistical tests and has very robust graphical capabilities. The BioMicro Center operates a Spotfire server that is available to anyone at MIT. Licenses for Spotfire are available through the BioMicro Center on a yearly basis.
-* '''MATLAB''' A mathematical programming language used for mathematical modeling, as well as analyzing and visualizing data. Contact Stephen Goldman for access.
+* '''MATLAB''' A mathematical programming language used for mathematical modeling, as well as analyzing and visualizing data.
-*'''[[BioMicroCenter:Wafergen|Wafergen Software]]''' Software for analyzing your Wafergen data.
+*'''[[BioMicroCenter:Wafergen|Wafergen Software]]''' Software for analyzing Wafergen data.
 *'''[[BioMicroCenter:RTPCR| Roche LC480]]''' Software for the Roche LightCyclers.
 * [http://bmc-tecan.mit.edu/ '''Tecan EvoWare Standard'''] This software is available as part of our robotics service. Identical to the software used on the [[BioMicroCenter:Tecan_Freedom_Evo|Tecan EVO 150s]], the software contains a simulator that can be used to design your robotics experiments at your bench. Note that this software is on a different server.
-* '''COMSOL Multiphysics''' This software package creates a simulation environment that facilitates all steps in the modeling process.
 * '''MacVector''' a comprehensive Macintosh application that provides sequence editing, primer design, internet database searching, protein analysis, sequence confirmation, multiple sequence alignment, phylogenetic reconstruction, coding region analysis, and a wide variety of other functions.
@@ Line 28: / Line 26: @@
 == Galaxy ==
 [[ Image:BioMicroCenter_GalaxyFront.png | thumb | right | 300px | Front Page of the MIT Galaxy Site ]]
-[http://galaxy.psu.edu/ Galaxy] is a bioinformatics platform that is designed to bring complicated informatics tools to bench scientists. Galaxy allows you to do analyses you cannot do anywhere else without the need to install or download anything. You can analyze multiple alignemnts, compare genomic annotations, profile metagenomic samples and much much more. For many users, the public Galaxy instance at Penn State can provide a very robust tool.
+[http://galaxy.psu.edu/ Galaxy] is a bioinformatics platform that is designed to bring complicated informatics tools to bench scientists. Galaxy allows you to do analyses you cannot do anywhere else without the need to install or download anything. You can analyze multiple alignments, compare genomic annotations, profile metagenomic samples and much much more. For many users, the public Galaxy instance at Penn State can provide a very robust tool.
-To make things even easier we have created a galaxy server here at MIT. The Galaxy Server acts as a separate head node for ROUS. Users are required to have data storage space on Rowley or BMC-PUB and may be required to purchase a queue on ROUS.
+To make things even easier, we have created a galaxy server here at MIT. The Galaxy Server acts as a separate head node for ROUS. Users are required to have data storage space on Rowley or BMC-PUB and may be required to purchase a queue on ROUS.
 == Additional Resources ==
@@ Line 39: / Line 37: @@
 * [http://libguides.mit.edu/content.php?pid=14149&sid=223198 '''GeneGO Metacore'''] GeneGo is a leading provider of data mining & analysis solutions in systems biology. MetaCore, GeneGo's flapship product, is an integrated software suite for functional analysis of experimental data. MetaCore is based on a curated database of human protein-protein, protein-DNA interactions, transcription factors, signaling and metabolic pathways, disease and toxicity, and the effects of bioactive molecules.
-* [http://libguides.mit.edu/content.php?pid=14149&sid=843471 '''INGENUITY PATHWAY ANALYSIS'''] software that helps researchers model, analyze, and understand complex biological and chemical systems relevant to their experimental data. Researchers can search the scientific literature and find insights most relevant to their experimental data; analyze and build pathway models related to thier experimental data;and share and collaborate with colleagues. IPA is currently licensed through June 2012.
+* [http://libguides.mit.edu/content.php?pid=14149&sid=843471 '''INGENUITY PATHWAY ANALYSIS'''] software that helps researchers model, analyze, and understand complex biological and chemical systems relevant to their experimental data. Researchers can search the scientific literature and find insights most relevant to their experimental data; analyze and build pathway models related to their experimental data;and share and collaborate with colleagues. IPA is currently licensed through June 2012.
 == UNIX SERVER ==
-A large amount of software is installed on our cluster server.  Please look at the ROUS page.
+A large amount of software is installed on our cluster server.  Please look at the [http://openwetware.org/wiki/BioMicroCenter:Servers#Server_Software_Installed_on_ROUS '''ROUS'''] page .
 == BMC-BCC Pipeline ==
 The pipeline processes flowcell directories as they are generated by the Illumina sequencer software and postprocesses the output for use in downstream biological analyses. It is intended to be used by core facilities who own and/or operate Illumina sequencers for automation and consistency of processing Illumina data. The pipeline is a collection of command line utilities written primarily in the Python programming language. The commands are tied together using the ruffus pipelining package.
+'''Release Notes 1.8''' (11/11/2020)
+* Added the support of NovaSeq flowcells
+* Updated fastq format for NextSeq and NovaSeq flowcells
+* Generated Sample sheets for NextSeq and NovaSeq flowcells for demultiplexing
+* Added the support of cellranger-atac and spaceranger for 10x pipeline
+* Added various flags for re-run options
+* Refactored codes
+* Multiple projects can be loaded on the same lane
+'''Release Notes 1.7''' (09/01/2017)
+* Added the support for the 10x pipeline. Cellranger is upgraded from 2.0.2 to 3.0.1 on 1/11/2019
+* Added precheck options
+** Give warning when samples have indices differ only 1 nt when 2nd index exists
+** Give warning when inline indices are used but not given start nt and length
+** Check existing project path is writable
+** Check genome exists
+* Reliability and performance improvement
+** Corrected the unmapped read percentage for pair-end read
+** Corrected the lane barcode percentage (the percentage is now calculated against the pass filter count instead of the total count)
+** Adjusted number of threads based on sample numbers
+** Updated PPPQC code to handle flowcell that has reverse read length longer than forward read length
+** Updated sam_stat code to use "samtools flagstat" instead of in-house script to compute sam file statistics
+** Changed the delivery URL from rowley.mit.edu to bmc-data.mit.edu
+** Added project name and URL in pipeline run notification email
+** Added project name in delivery email
+* Third party software package update
+** bcl2fastq from 2.15.0 to 2.19.1
+** bwa from 0.7.12 to 0.7.16a
+** fastqc from 0.11.4 to 0.11.5
+** samtools from 1.3 to 1.5
+** bowtie2 from 2.2.6 to 2.3.2
+'''Release Notes 1.6''' (03/18/2017)
+* Added the support for the SLURM scheduler
+* Added the support for CentOS
+* Added the delivery of BCL files through web when requested
+* Performance improvement (nextSeq fastq generation and PPPQC)
+'''Release Notes 1.5.2''' (08/12/2016)
+* Added the support for GA/NextSeq 2nd index
+* Improved the portabilty of the pipeline through code reorg
+'''Release Notes 1.5.1''' (06/12/2016)
+* Added the support for HT3DGE (high-throughput digital gene expression) project
+* Publish infosite to Filemaker database
+'''Release Notes 1.5''' (01/26/2016)
+*''' Added phiX percent perfect plot to calculate sequencing error rate for HiSeq and MiSeq'''<p>The percent perfect plot created by the PPPQC script is designed to calculate next generation sequencing error rate. The calculation can be applied to paired end sequencing or single end sequencing of either Nextseq, Miseq, or Hiseq, depending on the specific sequencing run. The script is based on the comparison between the sequenced spike in PhiX reads with PhiX reference genome sequence. To avoid potential alignment issues of sequencing reads with poor quality, the script first aligns the first 30 base pairs of the sequencing reads to identify PhiX reads as well as forward reads and reverse reads. Then full length PhiX forward and reverse reads were retrieved and compared to the reference sequence. The percentage of sequencing reads with zero mismatches, <=1 mismatches, <=2 mismatches, <=3 mismatches, and <=4 mismatches were calculated and plotted at each nucleotide position. For Nextseq sequencing, the reads from each camera were processed separately. For Hiseq sequencing, the reads from each lane were processed separately. For paired end reads, the reads from each mate pair were processed separately. Due to the nature of very low indel reading errors rate by Illumina sequencing, the reads with indels comparing to the reference sequence are excluded from the current calculation.</p>
+* '''Added CNV quality control plot for ChIP, ReSeq and CGHSeq sample types'''<p>The CNV quality control plot created by the CNVQC script uses downsampled bam files to plot DNA copy numbers along the reference genome. Both mapability and GC% were considered during the normalizing process. Potential gains were marked in red and losses were marked in green. Currently it supports hg19 and mm9 genomes. </p>
+* '''Upgraded software tools including fastqc, bwa, samtools, and bedtools'''
+**fastqc upgraded from 0.11.2 to 0.11.4
+**bwa upgraded from 0.7.10 to 0.7.12
+**samtools upgraded from 0.1.19 to 1.3
+**bedtools upgraded from 2.20.1 to 2.25.0
+* '''Improved performance and robustness'''
+** Added a precheck flag -c to check the filemaker database to avoid human error
+** Allowed the recursive pulling of samples in a subpool when creating sample json file
+** Improved the robustness of sample json file when handling mixed barcodes
+** Added a second person to receive delivery email if specified
+** Enabled creating tarball of the flowcell directory after pipeline run ends
+** Simplified the process to create a new release
+** Reworked the code on publishing project data to avoid intermittent file system error
+** Added flowcell as part of SGE job name to easily identify pipeline runs in the cluster
+** Used 32 threads as default instead of 16 after new nodes were added to the rous cluster
+'''Release Notes 1.4''' (01/01/2015)
+* The quality scores of fastq files are now in Sanger format (previously the quality scores were in the Illumina 1.3+ format)
+* Add the support of NextSeq.
+'''Release Notes 1.3''' (07/25/2014)
+* Paired end quality control is added for samples aligned to genomes other than phiX. It summarizes basic mapping metrics from the BWA alignments to identify proper mapping reads and provides a distribution of insert lengths based on these mappings.
+* RNAseq quality control is added for RNASeq data for a list of genomes other than phiX. It checks distribution of the reads, 5' to 3' bias, strand specificity and ribosome RNA contamination. It also checks gene expression correlation between samples when applicable. .
+* Software upgrade: BWA is upgraded to 0.7.10 and fastqc is upgraded to 0.11.2
+* Improved the algorithms of demultiplexing and handling index mismatch
+* Performance enhancement. It uses 16 threads as default instead of 8 which reduces the pipeline runtime significantly for a HiSeq run.
+'''Release Notes 1.2''' (01/01/2014)
+* An information site about the pipeline run is delivered to MIT users
+* Sample data directory includes the flowcell code
+* Bug fix for pipeline re-run. When the pipeline was re-run, data may be duplicated in the fastq files. This is now fixed.
+* Performance enhancement. Data is written directly to the published directory for users, and copy is avoided whenever possible. This not only reduces disk storage, but also allows users to get their data faster.
 '''Release Notes 1.0.2''' (08/19/2013)
-* Switch from Bowtie to BWA for default alignments for generating SAM and BAM files.<p>The BWA version 0.7.5a is used by default for alignment. For Illumina sequence reads up to 70bp, the alignment is done by aln/samse/sampe (the BWA-backtrack algorithm). For longer sequence read > 70bp, use the mem subcommand (the BWA-MEM algorithm)</p>
+* Switch from Bowtie to BWA for default alignments for generating SAM files.<p>The BWA version 0.7.5a is used by default for alignment. For Illumina sequence reads up to 70bp, the alignment is done by aln/samse/sampe (the BWA-backtrack algorithm). For longer sequence read > 70bp, the mem subcommand (the BWA-MEM algorithm) is used.</p>
-* Bug fix for large SAM/BAM files<p>When processing large fastq files to generate a sam file, the sam file may be corrupted at the end of the file under special circumstance if it is larger than 40GB. As a result, the SAM-BAM conversion may get a core dump. This is now fixed.</p>
+* Bug fix for large SAM/BAM files<p>When processing large fastq files to generate a sam file, the sam file may be corrupted at the end of the file under certain circumstance if it is larger than 40GB. As a result, the SAM-BAM conversion may get a core dump. This is now fixed.</p>
 '''Release Notes 0.9''' (10/18/2011)<p>
@@ Line 67: / Line 153: @@
 * publishing user data to web directories
 </p>
+==PPR Program==
+[[BioMicroCenter:PPR Program|Generates Percent Perfect Reads for Miseq, Nextseq, and Hiseq data with phix spike-in]]
