Revision as of 06:08, 7 November 2013

Winterschool program

Software list

Basics

Linux: Ubuntu 12.04.3 vs Debian 7.1 (decide between the 32-bit and 64-bit versions)
Java http://www.java.com/en/download/linux_manual.jsp?locale=en

Specific tools 1

FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

TagDust: http://genome.gsc.riken.jp/osc/english/software/src/tagdust.tgz
fastareformat from exonerate-2.2.0 [1]
fixing FASTA headers (gff fields) with a small Python script?
GEM [2]
1. CAVEAT: (problem with cores on different laptops...)

http://sourceforge.net/projects/gemlibrary/files/gem-library/Binary%20pre-release%203/

BWA http://sourceforge.net/projects/bio-bwa/files/
Stampy http://www.well.ox.ac.uk/~gerton/software/Stampy/stampy-1.0.22r1848.tgz
last http://last.cbrc.jp/ (version 362 has split- and splice-mapping options)
bowtie http://bowtie-bio.sourceforge.net/bowtie2/index.shtml (bowtie2)
samtools http://sourceforge.net/projects/samtools/files/
picard http://sourceforge.net/projects/picard/files/
IGV/ IGVtools http://www.broadinstitute.org/software/igv/download

bamtools https://github.com/pezmaster31/bamtools
1. requires cmake: http://www.cmake.org/files/v2.8/cmake-2.8.12.tar.gz (or install it with apt-get)
bedtools http://code.google.com/p/bedtools/downloads/list
GATK http://www.broadinstitute.org/gatk/auth?package=GATK (download yourself: license!)
vcftools http://sourceforge.net/projects/vcftools/files/

Specific tools 2/RNA-Seq

tophat http://tophat.cbcb.umd.edu/
cufflinks http://cufflinks.cbcb.umd.edu/ (may require Boost libs!)
GEMtools https://github.com/gemtools/gemtools

Vagrant fixes

For X11 forwarding the Vagrantfile has to contain

config.ssh.forward_x11 = true

Introduction to Linux and the command line

why Linux?
1. runs on everything from cell phones to supercomputers
2. long history of stable tools
3. most of the bioinformatics software was written and intended to run on Linux

File and directories naming

Linux is case sensitive, so MyFile.txt is different from MYFILE.TXT or myfile.txt. Try to use a consistent naming scheme for your project directories, input data and result files. You can use very long, descriptive names, like this result file from the ENCODE project:

wgEncodeUwTfbsNhdfneoCtcfStdAlnRep0.bam_VS_wgEncodeUwTfbsNhdfneoInputStdAlnRep1.bam.regionPeak.gz

Things to keep in mind:

1. never use spaces or tabs in your file or directory names
2. avoid Unix special characters in file names (!?"'%&^~*$|/\{}[]()<>:)
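The two rules above can be checked automatically before files are shared. The following is a minimal Python sketch; the function names `unsafe_characters` and `is_safe_name` are made up for this example, and the character set is copied from the list above.

```python
# Characters discouraged in rule 2 above; whitespace (rule 1) is checked separately.
UNSAFE = set("!?\"'%&^~*$|/\\{}[]()<>:")


def unsafe_characters(name):
    """Return a sorted list of the discouraged characters found in a file name."""
    return sorted({ch for ch in name if ch in UNSAFE or ch.isspace()})


def is_safe_name(name):
    """True if the name contains no whitespace and no Unix special characters."""
    return not unsafe_characters(name)


if __name__ == "__main__":
    for candidate in ["MyFile.txt", "my file.txt", "results(final).txt"]:
        print(candidate, is_safe_name(candidate), unsafe_characters(candidate))
```

Note that the check is meant for a single file or directory name, not for a full path, because `/` is itself on the list.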

logging in, connecting to other servers with ssh / sftp

As with other computers, you need a username and password combination to connect to a specific computer. This combination can be specific to each computer or shared between, e.g., all workstations at a given location.

SSH (Secure Shell) is a protocol for secure, encrypted connections between computers. It consists of two components: an ssh server running on the remote machine and an ssh client on your laptop or workstation. The client is included in the default installations of recent Linux and OS X (Mac) systems, but on Windows you have to install one, for example PuTTY ( http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html ). You do not need it for the course, but it is a good idea to have it on your computer if you want to access remote servers from the command line.

With a properly configured ssh connection (on Linux and Mac) you can run not just command-line programs but also programs with a graphical user interface. This is also possible on Windows, but it is considerably more complicated to set up.

Linux directory structure

As on Windows and Mac systems, one can imagine the system of directories (a synonym for folders) as a giant tree, where each branch has a parent directory and may contain lower-level directories (sub-directories). At the top of this hierarchy there is just one place from which everything starts (there are no C:, D: or X: drives); in Unix speak it is called the root directory. Some examples of directory names:

/home/linus/

/usr/local/bin

/home/linus/bioinf/programs

/home/linus/projects/chicken

Some shortcuts to remember (the text after the # sign is a description):


/ #root directory

~ #your home directory

.. #one directory above

. #current directory
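To see how these shortcuts resolve, here is a small Python sketch. It uses one of the example paths above as a hypothetical working directory, and `posixpath` so that the result follows Unix rules on any machine.

```python
import os.path
import posixpath

# Hypothetical working directory, taken from the examples above.
current = "/home/linus/projects/chicken"

# ".." resolves to the directory one level up, "." to the directory itself.
parent = posixpath.normpath(posixpath.join(current, ".."))
same = posixpath.normpath(posixpath.join(current, "."))

# Going up from the root directory leaves you at the root.
above_root = posixpath.normpath("/..")

# "~" expands to the home directory of the user running the script.
home = os.path.expanduser("~")

print(parent)      # /home/linus/projects
print(same)        # /home/linus/projects/chicken
print(above_root)  # /
print(home)        # depends on the user
```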

copy, rename/move files, create directories, symbolic links
view files (more/less, head, tail), count (wc)
search for strings / replace strings (grep & sed)
compressing / uncompressing files (gzip, bzip2, tar)
pipelines and redirection
awk in 5 minutes
where to go from there (clusters, python)

FASTQ

Illumina file formats (quality encodings)
paired / unpaired reads
quality checking (fastqc)
trimming & filtering (TagDust)
sources of published FASTQ data: Sequence Read Archive (SRA) vs ENA
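As a sketch of what "quality encodings" means in practice: each character of a FASTQ quality line stores one quality score as an ASCII code plus an offset, 33 in Sanger and recent Illumina files and 64 in older Illumina files. The function names below are made up for this example, and `guess_offset` is only a heuristic, since a string that uses high characters only is valid in both encodings.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of integer quality scores."""
    return [ord(ch) - offset for ch in quality_string]


def guess_offset(quality_string):
    """Guess the encoding offset (33 or 64) from the characters seen.

    Heuristic only: characters below ';' (ASCII 59) cannot occur in the
    offset-64 encodings, so their presence implies offset 33. Otherwise
    64 is assumed, which can be wrong for uniformly high-quality reads.
    """
    if any(ord(ch) < 59 for ch in quality_string):
        return 33
    return 64


if __name__ == "__main__":
    quals = "IIII#"  # hypothetical quality line
    offset = guess_offset(quals)
    print(offset, decode_qualities(quals, offset))
```

In real data the guess should be made over many reads, not over a single line.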

Genomic fasta and gtf/gff gene annotation

resources at ENSEMBL
basic checks and reformatting

grepping fasta headers
fasta reformat from exonerate??
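One common reformatting step is to cut each FASTA header down to its first word, so that the sequence names match the sequence name column of the GFF/GTF annotation. The sketch below assumes that this is the fix needed; the function name and the example header are hypothetical.

```python
def simplify_fasta_headers(lines):
    """Yield FASTA lines with every header reduced to its first word."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Keep only the identifier, drop the free-text description.
            fields = line[1:].split()
            yield ">" + (fields[0] if fields else "")
        else:
            yield line


if __name__ == "__main__":
    # Hypothetical input; in practice iterate over an open file instead.
    example = [">chr1 some description text\n", "ACGTACGT\n", "TTGA\n"]
    for out_line in simplify_fasta_headers(example):
        print(out_line)
```

Because the function works line by line, it can be applied to a whole genome file without loading it into memory.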

Mapping genomic reads

overview of mappers
1. GEM
2. bwa +/- stampy
3. last / bowtie
mapping steps (for each mapper)
genome indexing
mapping
+/- postprocessing

SAM and BAM file formats

Analyzing BAM files
sorting / indexing
viewing the mappings in IGV

tools for processing BAM files

samtools
picard
bamtools

getting mapping stats

extracting reads mapping to regions
getting coverage info for selected regions

Detecting SNPs

general procedure
GATK pipeline
other SNP calling programs [tba]

Working with VCF files

VCF file format
viewing VCFs in IGV
filtering SNPs by quality
set operations on VCF files (common SNPs, unique SNPs)
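The set operations can be sketched in plain Python by reducing every variant to a key made of the first VCF columns (CHROM, POS, REF, ALT) and comparing the resulting sets. The function name and the example records are made up; dedicated tools such as vcftools are the better choice for real files.

```python
def variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) for every non-header VCF line."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.add((chrom, int(pos), ref, alt))
    return keys


if __name__ == "__main__":
    # Hypothetical records from two samples (tab-separated VCF columns).
    sample_a = ["#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
                "1\t100\t.\tA\tG\t50\tPASS\t.\n",
                "1\t200\t.\tC\tT\t30\tPASS\t.\n"]
    sample_b = ["1\t200\t.\tC\tT\t45\tPASS\t.\n",
                "2\t300\t.\tG\tA\t60\tPASS\t.\n"]
    a, b = variant_keys(sample_a), variant_keys(sample_b)
    print("common:", sorted(a & b))
    print("unique to A:", sorted(a - b))
    print("unique to B:", sorted(b - a))
```

This sketch treats a multi-allelic ALT field such as `G,T` as a single key and does not normalise indels, so it only illustrates the idea.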

RNA-Seq

caveats (ribosomal RNA contamination)
mapping RNA-Seq
tophat
GRAPE
creating gene models from RNA-Seq (cufflinks)

User talk:Darek Kedra/sandbox 28: Difference between revisions
