User talk:Darek Kedra/sandbox 28: Difference between revisions

From OpenWetWare
Revision as of 06:08, 7 November 2013

Winterschool program

Software list

Basics

  1. Linux: Ubuntu 12.04.3 vs Debian 7.1 (think about 32-bit vs 64-bit versions)
  2. Java http://www.java.com/en/download/linux_manual.jsp?locale=en

Specific tools 1

  1. FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  2. TagDust http://genome.gsc.riken.jp/osc/english/software/src/tagdust.tgz
  3. fastareformat from exonerate-2.2.0 [1]
  4. fixing fasta headers (gff fields) with a small python script?
  5. GEM [2] http://sourceforge.net/projects/gemlibrary/files/gem-library/Binary%20pre-release%203/
    1. CAVEAT: (problem with cores on different laptops...)
  6. BWA http://sourceforge.net/projects/bio-bwa/files/
  7. Stampy http://www.well.ox.ac.uk/~gerton/software/Stampy/stampy-1.0.22r1848.tgz
  8. last http://last.cbrc.jp/ (version 362 has split and splice-mapping options)
  9. bowtie http://bowtie-bio.sourceforge.net/bowtie2/index.shtml (bowtie2)
  10. samtools http://sourceforge.net/projects/samtools/files/
  11. picard http://sourceforge.net/projects/picard/files/
  12. IGV / IGVtools http://www.broadinstitute.org/software/igv/download
  13. bamtools https://github.com/pezmaster31/bamtools
    1. requires cmake: http://www.cmake.org/files/v2.8/cmake-2.8.12.tar.gz (or install it with apt-get)
  14. bedtools http://code.google.com/p/bedtools/downloads/list
  15. GATK http://www.broadinstitute.org/gatk/auth?package=GATK (download yourself: license!)
  16. vcftools http://sourceforge.net/projects/vcftools/files/

Specific tools 2/RNA-Seq

  1. tophat http://tophat.cbcb.umd.edu/
  2. cufflinks http://cufflinks.cbcb.umd.edu/ (may require Boost libs!)
  3. GEMtools https://github.com/gemtools/gemtools

Vagrant fixes

For X11 forwarding, the Vagrantfile has to contain the line:

config.ssh.forward_x11 = true


Introduction to Linux and the command line

  1. why Linux?
    1. runs on everything from cell phones to supercomputers
    2. long history of stable tools
    3. most of the bioinformatics software was written and intended to run on Linux
  2. File and directory naming

Linux is case sensitive, so MyFile.txt is different from MYFILE.TXT or myfile.txt. Try to use a consistent naming scheme for your project directories, input data and result files. You can use very long, descriptive names, like this result file from the ENCODE project:

wgEncodeUwTfbsNhdfneoCtcfStdAlnRep0.bam_VS_wgEncodeUwTfbsNhdfneoInputStdAlnRep1.bam.regionPeak.gz

Things to keep in mind:

    1. never use spaces or tabs in file or directory names
    2. avoid Unix special characters in file names (!?"'%&^~*$|/\{}[]()<>:)
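
A minimal sketch of why spaces in names cause trouble. The file names are made up for illustration, and everything happens in a temporary directory. Without quotes the shell splits a name at the space and treats it as two separate arguments:

```shell
#!/bin/bash
# Work in a throwaway directory so nothing important is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical file name containing a space.
name="my results.txt"
touch "$name"

# Helper that reports how many arguments it received.
count_args() { echo $#; }

unquoted=$(count_args $name)    # split at the space into "my" and "results.txt"
quoted=$(count_args "$name")    # kept together as one argument
echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"

# Safer alternative: use an underscore instead of the space.
mv "$name" my_results.txt

# Count names in the current directory that still contain a space.
bad_names=$(find . -maxdepth 1 -name '* *' | wc -l)
echo "names with spaces left:" $bad_names
```

With an underscore in the name, no quoting is needed and the file behaves the same in every command.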
  3. logging in, connecting to other servers with ssh / sftp

As with other computers, you need a username and password combination to connect to a specific computer. This combination can be specific to each computer or shared between several, e.g. all workstations at a given location.

SSH is the name for secure, encrypted connections between computers. It consists of two components: an ssh server running on the remote machine and an ssh client on your laptop / workstation. The client is included in the default installations of recent Linux and OS X (Mac), but on Windows you have to install one, e.g. PuTTY ( http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html ). You do not need it for the course, but it is a good idea to have it on your computer if you want to access remote servers on the command line.

With a properly configured ssh connection (on Linux and Mac) you can run not just command line programs but also programs with a graphical user interface. This is also possible on Windows, but it is considerably more complicated to set up.
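
As a rough illustration, this is what the typical commands look like on Linux or Mac. The user name "student" and the host "server.example.org" are placeholders, not real course accounts, so the sketch only prints the commands instead of running them:

```shell
#!/bin/bash
# Hypothetical account and host; replace them with the ones you were given.
remote_user="student"
remote_host="server.example.org"
target="${remote_user}@${remote_host}"

# Print the commands rather than executing them, since the host above is made up.
cat <<EOF
ssh ${target}        # open a command line session on the remote machine
ssh -X ${target}     # same, with X11 forwarding so graphical programs can open windows
sftp ${target}       # interactive file transfer over an encrypted connection
EOF
```

Each command asks for the password of the remote account before the connection is opened.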

  4. Linux directory structure

As with Windows and Mac systems, one can imagine the system of directories (a synonym for folders) as a giant tree, where each branch has a parent directory and may contain lower level directories (sub-directories). At the top of this branched hierarchy there is just one place from which everything starts (there are no C:, D: or X: drives), and in Unix speak it is called the root directory. Some examples of directory names:

/home/linus/

/usr/local/bin

/home/linus/bioinf/programs

/home/linus/projects/chicken

Some shortcuts to remember (the text after the # sign is a description):


/ #root directory

~ #your home directory

.. #one directory above

. #current directory  
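
The shortcuts can be tried out safely on a small practice tree. This is only a sketch: the directory names are examples, and the tree is created in a temporary location.

```shell
#!/bin/bash
# Build a small practice tree in a temporary directory.
top=$(mktemp -d)
mkdir -p "$top/projects/chicken"

cd "$top/projects/chicken" || exit 1
pwd                          # print the current directory

cd ..                        # ".." moves one directory above, into projects
parent=$(basename "$PWD")

cd ./chicken                 # "." is the current directory, so this enters projects/chicken
here=$(basename "$PWD")

cd ~                         # "~" is your home directory
pwd

cd /                         # "/" is the root directory
root_dir=$PWD

echo "parent: $parent, here: $here, root: $root_dir"
```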
 
  5. copy, rename/move files, create directories, symbolic links
  6. view files (more/less, head, tail), count (wc)
  7. search for strings / replace strings (grep & sed)
  8. compressing / uncompressing files (gzip, bzip2, tar)
  9. pipelines and redirection
  10. awk in 5 minutes
  11. where to go from there (clusters, python)
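
A short tour of the commands listed above, as a sketch on a made-up three-line table. The file and directory names are invented for this example, and everything runs in a temporary directory.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create a directory and a small tab-separated example file.
mkdir data
printf 'chr1\t100\tgeneA\nchr1\t200\tgeneB\nchr2\t300\tgeneC\n' > data/genes.txt

cp data/genes.txt data/genes_copy.txt        # copy a file
mv data/genes_copy.txt data/genes_old.txt    # rename / move a file
ln -s data/genes.txt genes_link.txt          # symbolic link pointing to the file

head -n 2 data/genes.txt                     # view the first two lines
tail -n 1 data/genes.txt                     # view the last line
n_lines=$(wc -l < data/genes.txt)            # count lines

# grep finds lines matching a pattern; -c counts them instead of printing.
n_chr1=$(grep -c '^chr1' data/genes.txt)

# sed replaces strings; here a leading "chr" becomes "chromosome".
sed 's/^chr/chromosome/' data/genes.txt > data/genes_renamed.txt

# gzip compresses a file in place; gzip -dc writes the uncompressed content to the screen.
gzip data/genes_old.txt
n_unzipped=$(gzip -dc data/genes_old.txt.gz | wc -l)

# tar bundles a whole directory into one (here gzip-compressed) archive.
tar -czf data.tar.gz data

# Pipeline with redirection: take column 1, sort it, count each value, save to a file.
cut -f 1 data/genes.txt | sort | uniq -c > chrom_counts.txt
cat chrom_counts.txt

# awk works on columns: print the name (column 3) where the position (column 2) is above 150.
awk '$2 > 150 {print $3}' data/genes.txt
n_late=$(awk '$2 > 150' data/genes.txt | wc -l)
```

The `|` sign passes the output of one program to the next, and `>` writes the final output to a file instead of the screen.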

FASTQ

  1. Illumina file formats (quality encodings)
  2. paired / unpaired reads
  3. quality checking (fastqc)
  4. trimming & filtering (TagDust)
  5. source of published FASTQ data: Sequence Read Archive (SRA) vs ENA
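
A quick sanity check for paired FASTQ files can be sketched with standard tools. The reads and file names below are made up, and the sketch assumes the common layout of exactly four lines per read (header, sequence, "+", qualities):

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two made-up reads per file; real files come from the sequencer or a public archive.
printf '@read1/1\nACGTACGT\n+\nIIIIIIII\n@read2/1\nTTGGCCAA\n+\nIIIIHHHH\n' > sample_1.fastq
printf '@read1/2\nCGTACGTA\n+\nIIIIIIII\n@read2/2\nAACCGGTT\n+\nHHHHIIII\n' > sample_2.fastq

# Number of reads = number of lines divided by 4 (assumes 4-line records).
count_reads() { echo $(( $(wc -l < "$1") / 4 )); }
reads_1=$(count_reads sample_1.fastq)
reads_2=$(count_reads sample_2.fastq)

# Paired files should contain the same number of reads.
if [ "$reads_1" -eq "$reads_2" ]; then
    echo "OK: both files contain $reads_1 reads"
else
    echo "WARNING: $reads_1 vs $reads_2 reads"
fi

# Distribution of read lengths: the sequence is line 2 of every 4-line record.
awk 'NR % 4 == 2 {print length($0)}' sample_1.fastq | sort -n | uniq -c
```

This only checks the file structure; base quality itself is what FastQC reports.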

Genomic fasta and gtf/gff gene annotation

  1. resources at ENSEMBL
  2. basic checks and reformatting
  • grepping fasta headers
  • fastareformat from exonerate??
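
Checking and simplifying fasta headers could look like the sketch below. The sequences and the file name genome.fa are invented. Cutting the header at the first space is one common fix, so that sequence names match the first column of a gtf/gff file; whether it is the right fix depends on the files at hand.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up two-sequence fasta file with long descriptive headers.
printf '>chr1 dna:chromosome length=8\nACGTACGT\n>chr2 dna:chromosome length=4\nGGCC\n' > genome.fa

# List all headers (lines starting with ">"); the quotes around the pattern are essential,
# because an unquoted > would be treated as output redirection and overwrite the file.
grep '^>' genome.fa

# Count the sequences in the file.
n_seqs=$(grep -c '^>' genome.fa)

# On header lines only, remove everything from the first space onwards.
sed '/^>/ s/ .*//' genome.fa > genome.clean.fa

first_header=$(head -n 1 genome.clean.fa)
echo "sequences: $n_seqs, first header after cleaning: $first_header"
```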

Mapping genomic reads

  1. overview of mappers
    1. GEM
    2. bwa +/- stampy
    3. last / bowtie
  2. mapping steps (for each mapper)
    1. genome indexing
    2. mapping
    3. +/- postprocessing

SAM and BAM file formats

  1. Analyzing BAM files
  2. sorting / indexing
  3. viewing the mappings in IGV

tools for processing BAM files

  1. samtools
  2. picard
  3. bamtools

getting mapping stats

  1. extracting reads mapping to regions
  2. getting coverage info for selected regions

Detecting SNPs

  1. general procedure
  2. GATK pipeline
  3. other SNP calling programs [tba]

Working with VCF files

  1. VCF file format
  2. viewing VCFs in IGV
  3. filtering SNPs by quality
  4. set operations on VCF files (common SNPs, unique SNPs)
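
Quality filtering and set operations can be sketched with plain command line tools on two tiny, made-up VCF files. The sample names and the QUAL threshold of 30 are assumptions for the example; for real data a dedicated tool such as vcftools is the safer choice.

```shell
#!/bin/bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two invented VCF files, tab-separated, with only the eight fixed columns.
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tG\t50\t.\t.\n'
  printf 'chr1\t200\t.\tC\tT\t10\t.\t.\n'
  printf 'chr2\t300\t.\tG\tA\t99\t.\t.\n'
} > sampleA.vcf
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf 'chr1\t100\t.\tA\tG\t60\t.\t.\n'
  printf 'chr2\t400\t.\tT\tC\t80\t.\t.\n'
} > sampleB.vcf

# Filtering by quality: keep header lines plus records with QUAL (column 6) of at least 30.
awk -F '\t' '/^#/ || $6 >= 30' sampleA.vcf > sampleA.q30.vcf
n_pass=$(grep -vc '^#' sampleA.q30.vcf)

# Reduce each record to a sorted key of chromosome:position:ref:alt.
site_keys() { grep -v '^#' "$1" | awk -F '\t' '{print $1":"$2":"$4":"$5}' | sort; }
site_keys sampleA.vcf > A.sites
site_keys sampleB.vcf > B.sites

# comm compares two sorted lists: -12 keeps the common lines, -23 the lines unique to the first.
common=$(comm -12 A.sites B.sites)
only_A=$(comm -23 A.sites B.sites | wc -l)

echo "records passing QUAL >= 30 in A:" $n_pass
echo "SNPs common to A and B: $common"
echo "SNPs found only in A:" $only_A
```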

RNASeq

  1. caveats (ribosomal RNA contamination)
  2. mapping RNASeq
  3. tophat
  4. GRAPE
  5. creating gene models from RNASeq (cufflinks)