User talk:Darek Kedra/sandbox 28: Difference between revisions

From OpenWetWare
Revision as of 06:58, 7 November 2013

Winterschool program

Software list

Basics

  1. Linux: Ubuntu 12.04.3 vs Debian 7.1 (think about 32 vs 64 bit versions)
  2. java http://www.java.com/en/download/linux_manual.jsp?locale=en

Specific tools 1

  1. FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  2. TagDust http://genome.gsc.riken.jp/osc/english/software/src/tagdust.tgz
  3. fastareformat from fastareformat exonerate-2.2.0 [1]
  4. fixing fasta headers (gff fields) with python? small script
  5. GEM [2] http://sourceforge.net/projects/gemlibrary/files/gem-library/Binary%20pre-release%203/
    1. CAVEAT: (problem with cores on different laptops...)
  6. BWA http://sourceforge.net/projects/bio-bwa/files/
  7. Stampy http://www.well.ox.ac.uk/~gerton/software/Stampy/stampy-1.0.22r1848.tgz
  8. last http://last.cbrc.jp/ (version 362 has split and splice-mapping options)
  9. bowtie http://bowtie-bio.sourceforge.net/bowtie2/index.shtml (bowtie2)
  10. samtools http://sourceforge.net/projects/samtools/files/
  11. picard http://sourceforge.net/projects/picard/files/
  12. IGV / IGVtools http://www.broadinstitute.org/software/igv/download
  13. bamtools https://github.com/pezmaster31/bamtools
    1. requires cmake: http://www.cmake.org/files/v2.8/cmake-2.8.12.tar.gz (or install it with apt-get)
  14. bedtools http://code.google.com/p/bedtools/downloads/list
  15. GATK http://www.broadinstitute.org/gatk/auth?package=GATK (download yourself: license!)
  16. vcftools http://sourceforge.net/projects/vcftools/files/

Specific tools 2/RNA-Seq

  1. tophat http://tophat.cbcb.umd.edu/
  2. cufflinks http://cufflinks.cbcb.umd.edu/ (may require Boost libs!)
  3. GEMtools https://github.com/gemtools/gemtools

Vagrant fixes

For X11 forwarding the Vagrantfile has to contain

config.ssh.forward_x11 = true


Introduction to Linux and the command line

  1. why Linux?
    1. runs on everything from cell phones to supercomputers
    2. long history of stable tools
    3. most of the bioinformatics software was written and intended to run on Linux


logging in, connecting to other servers with ssh / sftp

As with other computers, you need a username and password combination to connect to a specific computer. This combination can be specific to each computer or shared between, e.g., all workstations at a given location.

SSH (Secure Shell) is a protocol for secure, encrypted connections between computers. It consists of two components: an ssh server running on the remote machine and an ssh client on your laptop / workstation. The client is included in the default installations of recent Linux and OS X (Mac), but on Windows one has to install it, e.g. PuTTY ( http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html ). You do not need it for the course, but it is a good idea to have it on your computer if you want to access remote servers on the command line.

With a properly configured ssh connection (on Linux and Mac) you can run not just command line programs but also programs with a graphical user interface. This is also possible on Windows, but it is considerably more complicated to set up.
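
A minimal sketch of what connecting looks like from a Linux or Mac terminal. The user name and server name are hypothetical placeholders, and the connection commands are left commented out because they need a real server to talk to:

```shell
# Hypothetical account and server names; replace them with your own.
remote_user="linus"
remote_host="server.example.org"

# Confirm that an ssh client is installed before trying to connect.
if command -v ssh >/dev/null 2>&1; then
    ssh_status="installed"
else
    ssh_status="missing"
fi
echo "ssh client: $ssh_status"

# Typical invocations (uncomment once you have a real account):
# ssh "${remote_user}@${remote_host}"       # open a remote command-line session
# ssh -X "${remote_user}@${remote_host}"    # same, with X11 forwarding for graphical programs
# sftp "${remote_user}@${remote_host}"      # interactive file transfer over ssh
```

Each of the three commands prompts for the password of the remote account unless key-based login has been set up.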

XXX Emilio: start vagrant or Linux from inside your virtual box machine, log in XXX


File and directories naming

Linux is case sensitive, so MyFile.txt is different from MYFILE.TXT or myfile.txt. Try to use a consistent naming scheme for your project directories, input data and result files. You can use very long, descriptive names, like this result file from the ENCODE project:

wgEncodeUwTfbsNhdfneoCtcfStdAlnRep0.bam_VS_wgEncodeUwTfbsNhdfneoInputStdAlnRep1.bam.regionPeak.gz

Things to keep in mind:

    1. never ever use spaces or tabs in your file/directory names
    2. avoid Unix special characters in file names (!?"'%&^~*$|/\{}[]()<>:)
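
The two rules above can be checked automatically. The following is only a sketch: the function name is_safe_name is made up for this example, and the whitelist (letters, digits, dot, underscore, hyphen) is one conservative reading of the rules rather than an official standard:

```shell
# is_safe_name: succeed (exit status 0) only when the name is non-empty and
# uses nothing but letters, digits, dot, underscore and hyphen.
is_safe_name() {
    case "$1" in
        "")                 return 1 ;;  # empty name
        *[!A-Za-z0-9._-]*)  return 1 ;;  # contains a space, tab or special character
        *)                  return 0 ;;
    esac
}

# Try the check on a few example names (all hypothetical).
for name in "genome_v2.fasta" "my file.txt" "results(final).txt"; do
    if is_safe_name "$name"; then
        echo "ok:    $name"
    else
        echo "avoid: $name"
    fi
done
```

Only the first of the three example names passes; the other two contain a space and parentheses respectively.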


Linux directory structure/navigating

As with Windows and Mac systems, one can imagine the system of directories (a synonym for folders) as a giant tree, where each branch has an upper (parent) directory and may contain lower-level directories (sub-directories). At the top of this branched hierarchy there is just one place from which everything starts (there are no C:, D: or X: drives); in Unix speak it is called the root directory. Some examples of directory naming:

/home/linus/

/usr/local/bin

/home/linus/bioinf/programs

/home/linus/projects/chicken

Individual users have separate directories in which they store their data, results etc.

Some shortcuts to remember (what is after # sign is a description):


/ #root directory

~ #your home directory

.. #one directory above

.  #current directory  
 

There are a few commands to move between directories, create new ones, and list what is located there:


pwd #print working directory = show where you are

ls #list what is located in the current directory

ls -l #list what is located in the current directory with details

cd /home/linus/projects/chicken #go to this directory

cd - #special shortcut to go back to the directory you were in before

cd / #go to the root directory

cd ~ #go to your home directory

cd .. #go to the directory one level above the current one

mkdir mynew_directory #create new directory

rmdir myold_directory #remove some directory (it must be empty!)
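
To see how these commands combine, here is a short practice session. It is a sketch that works inside a throw-away directory created with mktemp, so nothing in your home directory is touched; the directory names are just examples, and mkdir -p (create missing parent directories too) is an addition not listed above:

```shell
# Create a scratch area and a small project tree inside it.
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p projects/chicken projects/banana

cd projects/chicken   # go down two levels
pwd                   # show where we are
cd ..                 # one level up, now in projects
ls -l                 # list both project directories with details
cd -                  # jump back to the previous directory (chicken)

# rmdir only works because banana is still empty.
rmdir "$workdir/projects/banana"
ls "$workdir/projects"
```

After the session only the chicken directory is left under projects, and the shell is still located inside it.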

Absolute vs relative directory naming

We can do all the operations on directories or files using two conventions:

# absolute directory path starting with root "/"

ls /home/linus/projects/chicken/genome.fasta

# relative directory path (let's assume we are in /home/linus/projects/banana/)

ls ../chicken/genome.fasta

Relative paths are a useful shortcut: they save typing and help avoid errors when, e.g., working on different systems where the project directory sits in a different place.
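
A small sketch showing that both conventions reach the same file. It builds a copy of the chicken/banana layout in a temporary directory (so the absolute path starts with whatever mktemp returns, not with /home/linus), and the FASTA content is a made-up two-line example:

```shell
# Build the example layout in a scratch directory.
base=$(mktemp -d)
mkdir -p "$base/projects/chicken" "$base/projects/banana"
printf '>chr1\nACGT\n' > "$base/projects/chicken/genome.fasta"

# Work from inside the banana project.
cd "$base/projects/banana"

# Absolute path: starts at the root "/" and works from anywhere.
abs_content=$(cat "$base/projects/chicken/genome.fasta")

# Relative path: starts at the current directory, goes one level up, then down.
rel_content=$(cat ../chicken/genome.fasta)

if [ "$abs_content" = "$rel_content" ]; then
    echo "both paths point to the same file"
fi
```

The relative form keeps working if the whole projects directory is moved or copied to another machine, whereas the absolute form would have to be edited.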

Permissions

In order to restrict the actions allowed on a given file or directory, each entry has 3 flags (r = read, w = write, x = execute) for each of 3 classes of users (the file owner, the group to which the file belongs, e.g. students, and "others", i.e. all the remaining users on that computer). So we have 9 fields describing what each of these classes can do with the file. On top of that (or rather at the front of the string) there is another flag telling us what kind of thing it is ("-" = a regular file, d = directory, l = link, etc.).

In a typical restrictive setup, the owner has read and write permissions, members of the group can read but not modify the files, and the rest of the world has no right to see their content; the actual defaults depend on the umask setting of the system. These permissions are visible when listing the content of a directory with "ls -l":


-rwxr-xr-x 1 vagrant users    49 Nov  7 14:43 my_program.py
-rw-r----- 1 vagrant users  1812 Nov  7 14:39 myfile01.txt
---------- 1 vagrant users  9045 Nov  7 14:41 myfile02.txt
-r--r--r-- 1 vagrant users  2016 Nov  7 14:39 myfile03.txt
-rw------- 1 vagrant users 67863 Nov  7 14:40 myfile04.txt
-rw-rw-rw- 1 vagrant users  8125 Nov  7 14:40 myfile05.txt
-rw-rw---- 1 vagrant users  7233 Nov  7 14:40 myfile06.txt
lrwxrwxrwx 1 vagrant users    12 Nov  7 14:50 myfile07.txt -> myfile06.txt
drwxr-xr-x 2 vagrant users     0 Nov  7 14:50 test_dir1

To change permissions, we specify first whose permissions we want to modify (u = owner, g = group, o = others, a = all three), then the action (+ to add permissions, - to remove them), then the permissions themselves, and finally the name of the file(s):

chmod a+x my_new_script.py #add execute permission for all

chmod a-w myfile05.txt #remove write permission for owner, group and the rest

This is often useful for sharing data / results with other users on the same machine or when writing scripts/executing some programs downloaded from the net.
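
The effect of chmod can be followed step by step on a scratch file. This is a sketch under a few assumptions: the file lives in a temporary directory, show_perms is a helper invented for this example, and the u=rw,g=r,o= form (set permissions exactly instead of adding or removing) is an extra chmod syntax not covered above:

```shell
# Work on a scratch file in a temporary directory.
scratch_dir=$(mktemp -d)
script="$scratch_dir/my_new_script.py"
touch "$script"

# Helper: print only the type flag and the 9 permission fields from "ls -l".
show_perms() {
    ls -l "$1" | cut -c1-10
}

chmod u=rw,g=r,o= "$script"   # owner: read+write, group: read, others: nothing
perms_start=$(show_perms "$script")

chmod a+x "$script"           # add execute permission for all
perms_exec=$(show_perms "$script")

chmod a-w "$script"           # remove write permission for all
perms_readonly=$(show_perms "$script")

echo "$perms_start -> $perms_exec -> $perms_readonly"
```

On a regular Linux file system the three strings should read -rw-r-----, then -rwxr-x--x, then -r-xr-x--x.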

  1. copy, rename/move files, create directories, symbolic links
  2. view files (more/less, head, tail), count (wc)
  3. search for strings / replace strings (grep & sed)
  4. compressing / uncompressing files (gzip, bzip2, tar)
  5. pipelines and redirection
  6. awk in 5 minutes
  7. where to go from there (clusters, python)

FASTQ

  1. Illumina file formats (quality encodings)
  2. paired / unpaired reads
  3. quality checking (fastqc)
  4. trimming & filtering (TagDust)
  5. source of published FASTQ data: Short Read Archive vs ENA

Genomic fasta and gtf/gff gene annotation

  1. resources at ENSEMBL
  2. basic checks and reformatting
  • grepping fasta headers
  • fasta reformat from exonerate??

Mapping genomic reads

  1. overview of mappers
    1. GEM
    2. bwa +/- stampy
    3. last / bowtie
  2. mapping steps (for each mapper)
  3. genome indexing
  4. mapping
  5. +/- postprocessing

SAM and BAM file formats

  1. Analyzing BAM files
  2. sorting / indexing
  3. viewing the mappings in IGV

tools for processing BAM files

  1. samtools
  2. picard
  3. bamtools

getting mapping stats

  1. extracting reads mapping to regions
  2. getting coverage info for selected regions

Detecting SNPs

  1. general procedure
  2. GATK pipeline
  3. other SNP calling programs [tba]

Working with VCF files

  1. VCF file format
  2. viewing VCFs in IGV
  3. filtering SNPs by quality
  4. set operations on VCF files (common SNPs, unique SNPs)

RNASeq

  1. caveats (ribosomal RNA contamination)
  2. mapping RNASeq
  3. tophat
  4. GRAPE
  5. creating gene models from RNASeq (cufflinks)