Wayne:High Throughput Sequencing Resources


Sirius Usage (our lab server)

Sirius is our analytical powerhouse (64 cores, amazing for parallel computing; 512 GB memory; 64-bit file system in the x86_64 configuration), and we have specific locations on the server for specific jobs. It is stored in a lovely server closet, so the way to access it is through a secure shell (ssh). Your username and password are obtained through our IT staff. Once you have logged on, there are a series of commands and "server etiquette" rules you will need to follow.

Login

  • ssh user@sirius.eeb.ucla.edu
  • slogin user@sirius.eeb.ucla.edu
  • to learn about the server:
    • uname -a

Structure and organization

  • Your home (user) directory holds <5 GB of data (be aware!)
    • /home/user
  • For genomes and databases
    • /databases
  • Location of installed programs
    • /usr/local/bin
    • /opt/
  • The location to store your data
    • /data/
    • /data/user
      • You can create your own personal directory if you'd like (see below for commands)
  • The location to place scripts
    • /work/user
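
For example, a minimal first-login setup, assuming the server grants you write permission in /data and /work as described above:

    mkdir /data/$USER    # personal data directory
    mkdir /work/$USER    # personal script directory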

Rules

  • Developing a pipeline (see the sketch after this list):
    • copy a small but representative part of your data to sirius
    • run all the programs you need on them
    • debug and save final version of pipeline e.g. in a text file
    • copy all your data
    • run your pipeline on all data
    • debug and update pipeline
    • mv results wherever you want
    • erase data
  • Never start more jobs than the number of available cores
  • Look at the memory and CPU usage before you start to load sirius
  • Use these commands to view real-time CPU usage:
    • htop
    • top
  • If you don't know something, consult the manual pages or the resources below:
    • man ls
    • Google
    • Ask admins (Jonathan or Ron) or in-lab (Rena or Pedro)
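
To make the pipeline rule concrete, here is a minimal sketch of a pipeline saved as a text file; the aligner (bwa), reference path, and file names are hypothetical placeholders, not a required workflow:

    # my_pipeline.sh -- debugged on a small subset first, then run on all data
    for f in /data/$USER/reads/*.fastq; do
        bwa aln /databases/genome.fa "$f" > "$f.sai"             # align reads to a reference
        bwa samse /databases/genome.fa "$f.sai" "$f" > "$f.sam"  # convert alignments to SAM
    done
    mv /data/$USER/reads/*.sam /data/$USER/results/              # mv results wherever you want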

Installing programs yourself

  • Check if it's already installed
  • Create a dir in your home folder
    • mkdir ~/bin
  • Put it in your path or check to see if it's already there
    • cat ~/.bash_profile
      • PATH=$PATH:$HOME/bin
      • export PATH
  • Install programs to bin
    • compile it with the prefix ~/bin (see the example below)
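
For a program distributed as source with a standard configure script, the steps usually look like this (the archive name is a placeholder; some programs use --prefix=$HOME instead, so that executables land directly in ~/bin):

    tar -xzf program.tar.gz && cd program/   # unpack the source
    ./configure --prefix=$HOME/bin           # install under ~/bin
    make && make install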

Data transfer (network)

  • Command Line Interface (CLI)
    • scp [options] user@host_source:path/to/file1 user@host_dest:/dest/path/to/file2
    • if you copy directories, use the option -r for recursive
  • Graphical User Interface (GUI)
    • FileZilla, Cyberduck, Fugu, etc.
  • First check if there is enough space available for you to move data
    • check disk usage
      • df -h
    • check disk space used by a dir
      • du -hs /path
      • du -h --max-depth=1 /path
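
For example, to copy a directory of reads from your own machine to sirius after checking that the target disk has room (paths are placeholders):

    df -h /data                                         # enough free space on the target disk?
    scp -r reads/ user@sirius.eeb.ucla.edu:/data/user/  # -r because reads/ is a directory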

Data editing

  • Small modifications to a file on the server
    • vim filename
    • vi filename
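
A few standard vim commands, enough for small edits:

    vim filename   # open the file
    i              # enter insert mode and make your edits
    Esc            # leave insert mode
    :wq            # write (save) and quit
    :q!            # quit without saving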



Basic server commands (for Sirius)

Here is a list of commonly used Linux commands:

Command                              Usage
ssh username@sirius.eeb.ucla.edu     Secure shell login to the Sirius server
logout (or control+D)                Log out of the Sirius server
pwd                                  Print working directory (your current location)
ls                                   List all contents of the current location
ls [options]                         ls -a (hidden files), ls -l (long/detailed list), ls -t (sorted by time modified instead of name)
cd /give/path                        Change directories
cd ..                                Go up one directory
mkdir directoryName                  Make a new directory
rmdir directoryName                  Remove a directory (must be empty)... Remember that you cannot undo this!
rm -r directoryName                  Recursively remove a directory and the files it contains... Remember that you cannot undo this!
rm filename                          Remove the specified file... Remember that you cannot undo this!
head filename                        Print the first 10 lines (by default) of the specified file to the screen
tail filename                        Print the last 10 lines (by default) of the specified file to the screen
more filename                        Send file contents or piped output to the screen one page at a time
less filename                        Like more, but also allows scrolling backward through the file
wc filename                          Print byte, word, and line counts
wc [options] filename                -c (bytes); -l (lines); -w (words, delimited by whitespace or newline)
whereis command                      List the locations of a command's binary, source, and manual page files
mv current/path destination/path     Move (akin to cut/paste); removes the file from its current location
cp current/path destination/path     Copy (also used to rename files if you keep them in their current path); keeps a copy in the current path
~/path                               The tilde is a shortcut for the path to your home directory
nohup command &                      Initiate a no-hangup background job
screen                               Initiate a new screen session to start a new background job
tar -xzf filename.tar.gz             Decompress a .tar.gz file
gzip -c filename > filename.gz       Compress a file into filename.gz; the ">" redirects output to the file filename.gz
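
For example, starting a long job that survives logout, with nohup or screen (the script name is a placeholder):

    nohup ./my_pipeline.sh > pipeline.log 2>&1 &   # background job; output goes to pipeline.log
    screen                                         # new session; Ctrl-a d detaches, screen -r reattaches later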



Here is a list of commonly used Linux commands for learning about CPU utilization:

Command          Usage
top              Display top CPU processes/jobs and provide an ongoing look at processor activity in real time. It lists the most CPU-intensive tasks on the system and provides an interactive interface for manipulating processes. It can sort tasks by CPU usage, memory usage, and runtime.
mpstat           Display the utilization of each CPU individually; it reports processor-related statistics.
mpstat -P ALL    Display activities for each available processor, processor 0 being the first one. Global average activities among all processors are also reported.
sar              Display the contents of selected cumulative activity counters in the operating system.
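
For example, a quick load check before launching a parallel job, per the rule above about not exceeding the available cores:

    top             # press 1 for per-core usage, q to quit
    mpstat -P ALL   # one-off snapshot of per-processor utilization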


High throughput (HT) platform and read types

  • ABI-SOLiD
  • Illumina single-end vs. paired-end
  • Ion Torrent
  • MiSeq
  • Roche-454
  • Solexa


CBI Collaboratory

UCLA's Computational Biosciences Institute (CBI) Collaboratory hosts a variety of 3-day workshops that provide both a general introduction to genome/bioinformatic sciences and more advanced, focused workshops (e.g. ChIP-Seq, BS-Seq, exome sequencing). The CBI Collaboratory focuses on a set of publicly available resources, ranging from the web-based bioinformatics tool Galaxy/UCLA (a resource for HT workflows and a central location for a variety of HT tools covering multiple platforms and data types) to tools such as R and MATLAB. The introductory workshops do not require any programming experience, and the Collaboratory Fellows additionally serve as a counseling resource for data analysis.


File formats and conversions

  • bcl
  • qseq
  • fastq
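
For orientation, a FASTQ record spans four lines; the read ID, sequence, and quality string below are made up:

    @SEQ_ID_001
    GATTTGGGGTTCAAAGCAGT
    +
    IIIIIIIIIIIIIIIIIIII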


Demultiplexing (deplexing) using barcoded sequence tags

  • Edit (or Hamming) distance
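
To make the concept concrete, a minimal bash sketch (the function name is ours) that counts the positions at which two equal-length barcodes differ, i.e. their Hamming distance:

    hamming() {
        local a=$1 b=$2 d=0
        for ((i=0; i<${#a}; i++)); do             # compare character by character
            [[ ${a:i:1} != ${b:i:1} ]] && ((d++))
        done
        echo $d
    }
    hamming ACGTA ACCTA   # prints 1; barcodes this close are easily confused by sequencing errors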


Quality control

  • FASTX tools
  • Using mapping as quality control for reads
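
An illustrative FASTX-Toolkit run (option values are examples; add -Q33 for Sanger/Illumina 1.8+ quality encoding, and check each tool's -h output for your version):

    fastx_quality_stats -i reads.fastq -o stats.txt                     # per-position quality summary
    fastq_quality_filter -q 20 -p 80 -i reads.fastq -o filtered.fastq   # keep reads with >=80% of bases at Q>=20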



Trimming and clipping

  • Trim based on low quality scores per nucleotide position within a read
  • Clip sequence artefacts (e.g. adapters, primers)
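
Illustrative FASTX-Toolkit commands for both steps (the quality threshold, minimum length, and adapter sequence are placeholders):

    fastq_quality_trimmer -t 20 -l 30 -i reads.fastq -o trimmed.fastq   # trim low-quality 3' ends; drop reads shorter than 30 nt
    fastx_clipper -a AGATCGGAAGAGC -i trimmed.fastq -o clipped.fastq    # clip the given adapter sequence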



FASTQC and FASTX tools


BED and SAM tools


GATK variant calling


R basics


HT sequence analysis using R (and Bioconductor)


DNA sequence analysis


RNA-seq analysis

Common objectives of transcriptome analysis:


SOLiD software tools