Wayne:High Throughput Sequencing Resources

From OpenWetWare

Revision as of 16:19, 13 March 2013

Wayne Lab Home
Laboratory Protocols


Basic unix and usage of Sirius (our lab server)

Sirius is our analytical powerhouse (64 cores, amazing for parallel computing; 512 GB of memory; 64-bit x86_64 architecture), and we have specific locations on the server for specific jobs. It lives in a lovely server closet, so the way to access it is through a secure shell (ssh). Your username and password are obtained through our IT staff. Once you have logged on, there is a series of commands and "server etiquette" you will need to follow. For the PDF, click here.

You should familiarize yourself with some basic Unix commands by doing a few tutorials.

Login

  • ssh user@sirius.eeb.ucla.edu --- to secure login
  • slogin user@sirius.eeb.ucla.edu --- to secure login
  • uname -a --- to learn about the server
  • passwd --- to change the default password you are given
  • logout (or control+D) --- to logout


Structure and organization

  • Your home (user) directory holds <5 GB of data (be aware!)
    • /home/user
  • For genomes and databases
    • /databases
  • Location of installed programs
    • /usr/local/bin
    • /opt/
  • The location to store your data
    • /data/
    • /data/user
      • You can create your own personal directory if you'd like (see below for commands)
  • The location to place scripts and data ONLY while you are working with it
    • /work/user
  • whoami --- returns your username


Rules

  • Developing a pipeline:
    • copy a small but representative part of your data to sirius
    • run all the programs you need on them
    • debug and save final version of pipeline (e.g. in a text file)
    • copy all your data
    • run your pipeline on all data
    • debug and update pipeline
    • move results wherever you want
    • erase data
  • Never start more jobs than there are free cores (e.g. if 50 jobs are already running, do NOT submit more than 14, which brings the total to 64 jobs)!!
  • Look at the memory and cpu usage before you start to load sirius with commands (cmd)
    • htop --- use to view real-time CPU usage
    • top --- displays the top CPU processes/jobs and provides an ongoing look at processor activity in real time. It displays a listing of the most CPU-intensive tasks on the system, and can provide an interactive interface for manipulating processes. It can sort the tasks by CPU usage, memory usage and runtime.
  • If you don't know something, use the manual
    • man ls --- to look up the functionality of the ls tool; you can also use Google, or ask the admins (Jonathan or Ron) or someone in the lab (Rena or Pedro)
  • mpstat --- reports processor-related statistics (CPU utilization)
  • mpstat -P ALL --- displays activity for each available processor, processor 0 being the first one. Global average activity across all processors is also reported
  • sar --- displays the contents of selected cumulative activity counters in the operating system
  • ps -u yourusername --- lists your processes
  • kill PID --- kills (ends) the process with that process ID
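
The core-count rule above comes down to a subtraction, sketched here in POSIX shell. The function name free_slots is made up for illustration (it is not a tool installed on Sirius), and the number of running jobs is assumed to be something you read off top or htop yourself.

```shell
#!/bin/sh
# Hypothetical helper, not a Sirius tool: how many more jobs can be
# started without exceeding the number of cores?
free_slots() {
    total_cores=$1     # e.g. 64 on Sirius
    running_jobs=$2    # jobs already running, counted from top/htop
    slots=$((total_cores - running_jobs))
    # Never report a negative number of free slots.
    if [ "$slots" -lt 0 ]; then
        slots=0
    fi
    echo "$slots"
}

free_slots 64 50    # the example from the rule above: prints 14
free_slots 64 70    # server already overloaded: prints 0
```

If the result is 0, wait for other jobs to finish before starting anything.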


Installing programs yourself

  • Check if it's already installed
  • mkdir ~/bin --- to create a bin directory in your home folder
  • cat ~/.bash_profile --- check whether ~/bin is already in your PATH; if it is not, add the next two lines to this file
  • PATH=$PATH:$HOME/bin
  • export PATH
  • compile the program with its install prefix set so that the executables end up in ~/bin
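
As a minimal sketch of the PATH step above: the helper below appends a directory to PATH only when it is not already listed, so adding it to a startup file twice does no harm. The name add_to_path is an assumption for illustration, not an existing command.

```shell
#!/bin/sh
# Hypothetical helper: append a directory to PATH unless it is already there.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present: do nothing
        *) PATH="$PATH:$1" ;;     # otherwise append it
    esac
    export PATH
}

add_to_path "$HOME/bin"
add_to_path "$HOME/bin"    # second call is a no-op
echo "$PATH"
```

Wrapping PATH in colons before matching lets the same pattern catch the directory whether it sits at the start, middle, or end of the list.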


Data transfer (network)

  • scp options user@host_source:path/to/file1 user@host_dest:/dest/path/to/file2 --- Command Line Interface (CLI) for moving files
  • scp -r user@host_source:path/to/dir user@host_dest:/dest/path --- Command Line Interface (CLI) for moving directories
  • FileZilla, Cyberduck, Fugu, etc. --- Graphical User Interface (GUI)
  • df -h --- check free and used disk space on each file system
  • du -hs /path --- check the total disk space used by a directory
  • du -h --max-depth=1 /path --- check the disk space used by each subdirectory, one level deep


Files

  • ls --- lists your files
  • ls -l --- lists your files in long format
  • ls -a --- shows hidden files... this is actually a critical command! If you *think* you are using little space but it turns out you have a million hidden files... voila, hidden files can be managed.
  • ls -t --- sorted by time modified instead of name
  • more filename --- shows first part of a file; hit space bar to see more
  • head filename --- prints the first 10 lines of the specified file to the screen (use -n to change the number)
  • tail filename --- prints the last 10 lines of the specified file to the screen (use -n to change the number)
  • emacs filename --- an editor for editing a file
  • cp filename1 filename2 --- copies a file in your current location
  • cp path/to/filename1 path/to/filename2 --- you can specify a file copy at another location
  • rm filename --- permanently remove a file (Caution! This cannot be undone!)
  • diff filename1 filename2 --- compares files and shows where they differ
  • wc filename --- tells you how many lines (newline delimited), words (whitespace delimited), and bytes are in a file
  • wc -l filename --- tells you how many lines are in a file
  • wc -w filename --- tells you how many words are in a file
  • wc -c filename --- tells you how many bytes are in a file (use wc -m for characters)
  • chmod options filename --- change the read, write, and execute permissions for a file (Google this!)
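
Here is a small sketch that strings several of the file commands above together. It works in a throwaway directory made with mktemp, and the file names (reads.txt, copy.txt, hello.sh) are invented for the example.

```shell
#!/bin/sh
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a five-line example file.
printf 'read1\nread2\nread3\nread4\nread5\n' > reads.txt

head -n 2 reads.txt    # first two lines
tail -n 2 reads.txt    # last two lines

# Reading from stdin keeps the file name out of the wc output.
n_lines=$(wc -l < reads.txt)
echo "lines: $n_lines"

# Copy the file, change the copy, and compare the two.
cp reads.txt copy.txt
printf 'read6\n' >> copy.txt
if diff reads.txt copy.txt; then
    echo "files are identical"
else
    echo "files differ"
fi

# Give yourself (u) execute (x) permission on a small script.
printf '#!/bin/sh\necho hello\n' > hello.sh
chmod u+x hello.sh
ls -l hello.sh
```

diff exits with a non-zero status when the files differ, which is why it sits inside an if statement here.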


File compression

  • gzip filename --- compresses a file, replacing it with filename.gz
  • gzip -c filename > filename.gz --- compresses a file but keeps the original; -c writes to standard output and the ">" redirects that output to filename.gz
  • gunzip filename.gz --- uncompresses a gzip file
  • tar -xzf filename.tar.gz --- unpacks and decompresses a tar.gz archive
  • gzcat filename.gz --- lets you look at a gzipped file without having to gunzip it (on most Linux systems the command is zcat)
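
The difference between a plain .gz file and a .tar.gz archive is easiest to see in a round trip. This is a sketch in a temporary directory; the names reads.txt and run1 are made up.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf 'ACGTACGT\n' > reads.txt

# gzip -c writes to standard output, so the original file is kept.
gzip -c reads.txt > reads.txt.gz

# gunzip -c prints the decompressed content and leaves the .gz file alone.
gunzip -c reads.txt.gz > roundtrip.txt
cmp reads.txt roundtrip.txt && echo "round trip OK"

# A tar.gz archive bundles a whole directory first, then compresses it.
mkdir run1
cp reads.txt run1/
tar -czf run1.tar.gz run1
rm -r run1               # remove the directory...
tar -xzf run1.tar.gz     # ...and restore it from the archive
ls run1
```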


Directories

  • pwd --- prints working directory (your current location)
  • cd /path/to/desired/location --- change directories by providing path
  • cd ../ --- go up one directory
  • mkdir directoryName --- make a new directory
  • rmdir directoryName --- remove directory (must be empty)...Remember that you cannot undo this move!
  • rm -r directoryName --- recursively remove directory and the files it contains...Remember that you cannot undo this move!
  • rm filename --- remove specified file...Remember that you cannot undo this move!
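
A short sketch of how rmdir and rm -r behave differently, run in a scratch directory so that nothing important can be deleted. The directory name project is an invented example.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1
pwd                              # shows the scratch location

mkdir project
mkdir project/empty_dir
printf 'some notes\n' > project/notes.txt

rmdir project/empty_dir          # works: the directory is empty

# rmdir refuses to remove a directory that still contains files.
if rmdir project 2>/dev/null; then
    echo "project removed"
else
    echo "rmdir refused: project is not empty"
fi

rm -r project                    # removes the directory and everything in it
```

Because rm -r cannot be undone, it is worth running pwd and ls first to confirm where you are and what you are about to delete.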


Finding things

  • whereis command --- lists the locations of the binary, source, and manual page files for a command
  • ff --- finds files anywhere on the system
  • ff -p --- finds a file by just typing in the beginning of the file name
  • grep string filename(s) --- looks for strings in the files (use man grep for more information)
  • ~/path --- the tilde is a shortcut for the path to your home directory
  • nohup command & --- to start a no-hangup background job (stdout is written to nohup.out)
  • screen --- to start a new screen session for a job that should keep running after you log out (ctrl+a then d to detach; screen -ls to list running screens; screen -r pid to reattach)
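
As a sketch of the grep entry above, the commands below search a tiny made-up FASTA file (example.fa, invented for this example) for header lines and for a sequence string.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '>seq1\nACGT\n>seq2\nGGCC\n' > example.fa

# Print every header line (the lines containing ">").
grep '>' example.fa

# -c counts matching lines instead of printing them.
n_seqs=$(grep -c '>' example.fa)
echo "sequences: $n_seqs"

# -n prefixes each match with its line number.
grep -n 'GGCC' example.fa
```

Quoting the pattern matters here: an unquoted > would be read by the shell as output redirection and would overwrite a file.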


Data editing

  • vim filename --- to edit the file


History

  • ctrl+r --- searching history
  • history --- display history
  • !cmd_num --- reruns the command with that number in the history list
  • Arrow up is a shortcut to scroll through recently used commands




Top
Wayne Lab Home

High throughput (HT) platform and read types

  • ABI-SOLiD
  • Illumina single-end vs. paired-end
  • Ion Torrent
  • MiSeq
  • Roche-454
  • Solexa



CBI Collaboratory

UCLA's Computational Biosciences Institute Collaboratory hosts a variety of 3-day workshops, both general introductions to genome/bioinformatic sciences and more advanced (focus) workshops (e.g. ChIP-Seq; BS-Seq; Exome sequencing). The CBI Collaboratory focuses on a set of publicly available resources: the web-based bioinformatic tool Galaxy/UCLA (a resource for HT workflows and a central location for a variety of HT tools for multiple platforms and data types), as well as tools such as R and Matlab. The introductory workshops do not require any programming experience, and the Collaboratory Fellows also serve as a counseling resource for data analysis.




Getting your HT sequence data

1. Walk a hard drive over (e.g. Freimer Lab)

  • Not deplexed
  • bcl files are binary base call files (not images) that the machine writes during sample sequencing...this is the NEW way of producing results files
  • Convert to qseq using the program CASAVA

2. rsync (e.g. Pellegrini Lab)

  • Retrieve qseq (not deplexed) files

3. ftp site (e.g. Berkeley)

  • Added cost for library preps ($150/sample), run bioanalyzer, qPCR, and quantification
  • Conversion of bcl to qseq
  • Option to retrieve data as fastq
  • Added cost (?) for deplexing

4. MiSeq (e.g. UCLA Human Genetics Core)

  • Retrieve fasta file formats
  • They can deplex and map data



File formats and conversions

  • bcl
  • qseq
  • fastq
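
For orientation, a FASTQ record is four lines: a name line starting with @, the sequence, a + separator line, and a quality string with one character per base. The sketch below counts reads and converts FASTQ to FASTA with awk. It assumes strict four-line records (no wrapped sequences), and the reads are made up.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two made-up reads, four lines each.
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nIIII\n' > example.fastq

# Number of reads = number of lines / 4.
n_reads=$(awk 'END { print NR / 4 }' example.fastq)
echo "reads: $n_reads"

# FASTQ to FASTA: keep the name line (swap @ for >) and the sequence line.
awk 'NR % 4 == 1 { print ">" substr($0, 2) }
     NR % 4 == 2 { print }' example.fastq > example.fasta
cat example.fasta
```

Counting lines is safer than counting @ characters, because @ can also appear inside a quality string.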



Deplexing using barcoded sequence tags

  • Edit (or Hamming) distance
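
The Hamming distance between two barcodes of equal length is the number of positions at which they differ. Below is a minimal sketch using awk from the shell. The function name hamming and the barcode sequences are invented for illustration; this is not the lab's deplexing pipeline.

```shell
#!/bin/sh
# Hamming distance: count the mismatching positions between two
# strings of equal length. Prints NA if the lengths differ.
hamming() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        if (length(a) != length(b)) { print "NA"; exit }
        d = 0
        for (i = 1; i <= length(a); i++)
            if (substr(a, i, 1) != substr(b, i, 1)) d++
        print d
    }'
}

hamming ACGTAC ACGTAC    # identical barcodes: prints 0
hamming ACGTAC ACGAAC    # one mismatch: prints 1
hamming ACGTAC TCGAAG    # three mismatches: prints 3
```

When deplexing, a read's tag is commonly assigned to the sample barcode it matches within some maximum distance, so the barcodes in a set should differ from each other by more than that threshold.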



Quality control

  • Fastx tools
  • Using mapping as the quality control for reads




Trimming and clipping

  • Trim based on low quality scores per nucleotide position within a read
  • Clip sequence artefacts (e.g. adapters, primers)
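
To make the two operations concrete, here is a rough awk sketch that first clips an adapter and then trims low-quality bases from the 3' end of each read. Everything specific in it is an assumption for illustration: the adapter string, the cutoff of 20, Phred+33 quality encoding, and strict four-line FASTQ records. Dedicated tools (for example the Fastx tools) handle many more cases.

```shell
#!/bin/sh
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Made-up reads: read1 ends in low-quality bases (#), read2 ends in the adapter.
printf '@read1\nACGTACGT\n+\nIIIII###\n' > example.fastq
printf '@read2\nGGCCTTAGATCGGA\n+\nIIIIIIIIIIIIII\n' >> example.fastq

adapter=AGATCGGA    # illustrative adapter sequence (assumption)

awk -v adapter="$adapter" -v min_q=20 '
BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }  # char -> ASCII code
NR % 4 == 1 { name = $0 }
NR % 4 == 2 { seq = $0 }
NR % 4 == 0 {
    qual = $0
    # Clip: drop the adapter and everything after it.
    p = index(seq, adapter)
    if (p > 0) { seq = substr(seq, 1, p - 1); qual = substr(qual, 1, p - 1) }
    # Trim: remove 3-prime bases whose Phred score (ASCII code - 33) is below min_q.
    n = length(qual)
    while (n > 0 && ord[substr(qual, n, 1)] - 33 < min_q) n--
    print name; print substr(seq, 1, n); print "+"; print substr(qual, 1, n)
}' example.fastq > trimmed.fastq

cat trimmed.fastq
```

Clipping before trimming keeps the sequence and quality strings the same length at every step, which is what downstream tools expect.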




FASTQC and FASTX tools



BED and SAM tools



GATK variant calling

GATK and GATK Guide



R basics

Here is a file with some helpful R commands for inputting data, making basic plots, statistics, etc. courtesy of Los Lobos.

Also, refer to the following websites for help:



Python basics

Here is a file with helpful commands in Python, BioPython, EggLib, etc., from Los Lobos.

Also, here are several links to help you get going:



HT sequence analysis using R (and Bioconductor)



DNA sequence analysis



RNA-seq analysis

Common objectives of transcriptome analysis:



SOLiD software tools

