Wayne:High Throughput Sequencing Resources
Revision as of 18:23, 21 February 2013
Basic unix and usage of Sirius (our lab server)
Sirius is our analytical powerhouse (64 cores, well suited to parallel computing; 512 GB memory; 64-bit system in the x86_64 configuration), and specific locations on the server are set aside for specific jobs. It lives in a server closet, so the way to access it is through a secure shell (ssh). Your username and password are obtained through our IT staff. Once you have logged on, there is a series of commands and "server etiquette" rules you will need to follow. For the PDF, click here.
You should familiarize yourself with some basic Unix commands by doing a few tutorials.
Login
- ssh user@sirius.eeb.ucla.edu --- to secure login
- slogin user@sirius.eeb.ucla.edu --- to secure login
- uname -a --- to learn about the server
- passwd --- to change the default password you are given
- logout (or control+D) --- to logout
Structure and organization
- Your home (user) directory holds <5 GB of data (be aware!)
- /home/user
- For genomes and databases
- /databases
- Location of installed programs
- /usr/local/bin
- /opt/
- The location to store your data
- /data/
- /data/user
- You can create your own personal directory if you'd like (see below for commands)
- The location to place scripts and data ONLY while you are working with it
- /work/user
- whoami --- returns your username
Rules
- Developing a pipeline:
- copy a small but representative part of your data to sirius
- run all the programs you need on them
- debug and save final version of pipeline (e.g. in a text file)
- copy all your data
- run your pipeline on all data
- debug and update pipeline
- move results wherever you want
- erase data
- Never start more jobs than the number of available cores (e.g. if 50 jobs are already running, do NOT submit more than 14, for a total of 64 jobs)!
- Look at the memory and cpu usage before you start to load sirius with commands (cmd)
- htop --- use to view real-time CPU usage
- top --- displays the top CPU processes/jobs and provides an ongoing look at processor activity in real time. It displays a listing of the most CPU-intensive tasks on the system, and can provide an interactive interface for manipulating processes. It can sort the tasks by CPU usage, memory usage and runtime.
- If you don't know something, use manual
- man ls --- to look up the functionality of the ls tool; if the manual does not help, use Google, or ask the admins (Jonathan or Ron) or someone in the lab (Rena or Pedro)
- mpstat --- to display the utilization of each CPU individually. It reports processor-related statistics
- mpstat -P ALL --- displays activity for each available processor, processor 0 being the first one. Global average activity across all processors is also reported
- sar --- displays the contents of selected cumulative activity counters in the operating system
- ps -u yourusername --- lists your processes
- kill PID --- kills (ends) the process with that process ID
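The core-count rule above is simple arithmetic. A minimal sketch, assuming a hypothetical helper called free_slots (it is not something installed on Sirius) that takes the total number of cores and the number of jobs already running:

```shell
# Hypothetical helper: how many more jobs may be started without
# exceeding the number of cores. Prints 0 if the server is already full.
free_slots() {
    total_cores=$1
    running_jobs=$2
    free=$((total_cores - running_jobs))
    if [ "$free" -lt 0 ]; then
        free=0
    fi
    echo "$free"
}

# Example from the rule above: 64 cores, 50 jobs already running
free_slots 64 50    # prints 14
```

On the server itself, the first argument could come from nproc (GNU coreutils), and the second from what you see in htop or top.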
Installing programs yourself
- Check if it's already installed
- mkdir ~/bin --- to create a bin directory in your home folder
- cat ~/.bash_profile --- check whether ~/bin is already in your path; if it is not, add the following two lines to the file
- PATH=$PATH:$HOME/bin
- export PATH
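- compile it with an install prefix in your home directory --- so that the executables end up in ~/bin (for autotools-style builds, --prefix=$HOME installs executables into ~/bin)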
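The PATH step above can be made safe to run more than once. This is only a sketch: add_to_path is a hypothetical helper name, and it appends a directory only when it is not already listed.

```shell
# Hypothetical helper: append a directory to PATH only if it is not
# already there, so repeated logins do not keep growing PATH.
add_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;              # already present, do nothing
        *) PATH="$PATH:$1" ;;     # otherwise append it
    esac
    export PATH
}

add_to_path "$HOME/bin"
echo "$PATH"
```

Placing the function and the call in ~/.bash_profile would have the same effect as the two PATH lines listed above.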
Data transfer (network)
- scp options user@host_source:path/to/file1 user@host_dest:/dest/path/to/file2 --- Command Line Interface (CLI) for moving files
- scp -r user@host_source:path/to/dir user@host_dest:/dest/path --- Command Line Interface (CLI) for moving directories
- FileZilla, Cyberduck, Fugu, etc. --- Graphical User Interface (GUI)
- df -h --- check disk usage for each mounted file system
- du -hs /path --- check total disk space used by a directory
- du -h --max-depth=1 /path --- check disk space used by a directory, broken down by its immediate subdirectories
Files
- ls --- lists your files
- ls -l --- lists your files in long format
- ls -a --- shows hidden files
- ls -t --- sorted by time modified instead of name
- more filename --- shows first part of a file; hit space bar to see more
- head filename --- print to screen the first 10 lines (the default) of the specified file
- tail filename --- print to screen the last 10 lines (the default) of the specified file
- emacs filename --- an editor for editing a file
- cp filename1 filename2 --- copies a file in your current location
- cp path/to/filename1 path/to/filename2 --- you can specify a file copy at another location
- rm filename --- permanently remove a file (Caution! This cannot be undone!)
- diff filename1 filename2 --- compares files and shows where they differ
- wc filename --- tells you how many lines (newline delimited), words (whitespace delimited), and characters (bytes) are in a file
- wc -l filename --- tells you how many lines are in a file (newline delimited)
- wc -w filename --- tells you how many words are in a file
- wc -c filename --- tells you how many characters (bytes) are in a file
- chmod options filename --- change the read, write, and execute permissions for a file (Google this!)
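A short practice session with the file commands above, run in a throwaway directory so nothing real is touched. The file names and contents are made up for illustration.

```shell
# Work in a temporary directory
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Make a small 3-line example file
printf 'sample1 ACGT\nsample2 TTGA\nsample3 CCGG\n' > reads.txt

wc -l < reads.txt        # number of lines
wc -w < reads.txt        # number of words
head -n 2 reads.txt      # first two lines
tail -n 1 reads.txt      # last line

cp reads.txt copy.txt    # copy the file
diff reads.txt copy.txt  # no output: the files are identical
ls -l                    # long listing of both files
```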
File compression
- gzip filename --- compresses files to make a file with a .gz extension
- gzip -c filename >filename.gz --- compress file into filename.gz while keeping the original; the ">" means print to outfile filename.gz
- gunzip filename --- uncompress a gzip file
- tar -xzf filename.tar.gz --- decompress and unpack a tar.gz archive
- gzcat filename --- lets you look at a gzipped file without having to gunzip it (on many Linux systems the command is zcat)
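The compression commands above can be tried safely on a scratch file. A minimal sketch with made-up file names; tar -czf (create) and tar -tzf (list) are the counterparts of the -xzf (extract) form shown above.

```shell
workdir=$(mktemp -d)                 # throwaway location for the demo
cd "$workdir" || exit 1
printf 'ACGT\nTTGA\n' > reads.txt

gzip -c reads.txt > reads.txt.gz     # compress, keeping the original
gunzip -c reads.txt.gz | head -n 1   # peek without uncompressing on disk

mkdir project
cp reads.txt project/
tar -czf project.tar.gz project      # bundle and compress a directory
tar -tzf project.tar.gz              # list the archive contents
```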
Directories
- pwd --- prints working directory (your current location)
- cd /path/to/desired/location --- change directories by providing path
- cd ../ --- go up one directory
- mkdir directoryName --- make a new directory
- rmdir directoryName --- remove directory (must be empty)...Remember that you cannot undo this move!
- rm -r directoryName --- recursively remove directory and the files it contains...Remember that you cannot undo this move!
- rm filename --- remove specified file...Remember that you cannot undo this move!
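A quick way to get comfortable with the directory commands is to practise in a temporary location. This sketch uses made-up names (results, output.txt); note that rmdir only succeeds on an empty directory, while rm -r removes a directory together with its contents.

```shell
workdir=$(mktemp -d)       # throwaway location for the demo
cd "$workdir" || exit 1

mkdir results              # make a new directory
cd results                 # move into it
pwd                        # print where we are now
cd ../                     # go back up one level

rmdir results              # works: the directory is empty

mkdir results
touch results/output.txt   # create an empty file inside
rm -r results              # remove the directory and its contents
```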
Finding things
- whereis command --- lists the locations of the binary, source, and manual page files for a command
- ff --- finds files anywhere on the system
- ff -p --- finds a file by just typing in the beginning of the file name
- grep string filename(s) --- looks for strings in the files (use man grep for more information)
- ~/path --- the tilde is a shortcut for the path to your home directory
- nohup commands & --- to initiate a no-hangup background job (writes stdout to nohup.out)
- screen --- to initiate a new screen session in which to start a long-running job (ctrl+a then d to detach; screen -ls to list running screens; screen -r pid to reattach)
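A sketch of the nohup pattern above, using a one-second stand-in for a real analysis job. The log name run.log is an assumption; when output is redirected like this, nothing is written to nohup.out.

```shell
workdir=$(mktemp -d)       # throwaway location for the demo
cd "$workdir" || exit 1

# Start a stand-in "long" job in the background; output goes to run.log
nohup sh -c 'sleep 1; echo finished' > run.log 2>&1 &
job_pid=$!                 # process ID of the background job
echo "started job $job_pid"

wait "$job_pid"            # block until the job ends (only for this demo)
cat run.log
```

In real use you would skip the wait, log out, and check the log file later.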
Data editing
- vim filename --- to edit the file
History
- ctrl+r --- searching history
- history --- display history
- !cmd_num --- re-run the command with that number in the history list
- Arrow up is a short cut to scroll through recently used commands
High throughput (HT) platform and read types
- ABI-SOLiD
- Illumina single-end vs. paired-end
- Ion Torrent
- MiSeq
- Roche-454
- Solexa
CBI Collaboratory
UCLA's
Getting your HT sequence data
1. Walk a hard drive over (e.g. Freimer Lab)
- Not deplexed
- bcl files are binary base call files that the machine uses to store read data during sample sequencing...this is the NEW way of producing results files
- Convert to qseq using the program CASAVA
2. rsync (e.g. Pellegrini Lab)
- Retrieve qseq (not deplexed) files
3. ftp site (e.g. Berkeley)
- Added cost for library preps ($150/sample), run bioanalyzer, qPCR, and quantification
- Conversion of bcl to qseq
- Option to retrieve data as fastq
- Added cost (?) for deplexing
4. MiSeq (e.g. UCLA Human Genetics Core)
- Retrieve fasta file formats
- They can deplex and map data
File formats and conversions
- bcl
- qseq
- fastq
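As a quick sanity check after retrieving or converting data, you can count the reads in a fastq file. This is a sketch only: count_reads is a hypothetical helper, and it assumes an uncompressed file in the usual layout of exactly four lines per read (header, sequence, "+", qualities).

```shell
# Hypothetical helper: count reads in an uncompressed fastq file,
# assuming exactly four lines per read.
count_reads() {
    lines=$(wc -l < "$1")
    echo $((lines / 4))
}

# Tiny two-read example file written to a temporary location
demo=$(mktemp)
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n' > "$demo"
count_reads "$demo"    # prints 2
```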
Deplexing using barcoded sequence tags
- Edit (or Hamming) distance
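The Hamming distance between two barcodes of equal length is the number of positions at which they differ (edit distance additionally allows insertions and deletions). A minimal sketch, with hamming as a hypothetical helper name and made-up barcodes:

```shell
# Hypothetical helper: count mismatched positions between two
# equal-length barcodes (the Hamming distance).
hamming() {
    a=$1
    b=$2
    if [ "${#a}" -ne "${#b}" ]; then
        echo "barcodes must have the same length" >&2
        return 1
    fi
    dist=0
    while [ -n "$a" ]; do
        rest_a=${a#?}            # string without its first character
        rest_b=${b#?}
        # compare the first character of each string
        if [ "${a%"$rest_a"}" != "${b%"$rest_b"}" ]; then
            dist=$((dist + 1))
        fi
        a=$rest_a
        b=$rest_b
    done
    echo "$dist"
}

hamming ACGTAC ACGAAC    # prints 1 (one mismatch, at position 4)
```

When deplexing, a read is typically assigned to a sample only if its tag is within a chosen maximum distance of exactly one known barcode; the threshold is a choice you make for your own barcode set.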
Quality control
- Fastx tools
- Using mapping as the quality control for reads
Trimming and clipping
- Trim based on low quality scores per nucleotide position within a read
- Clip sequence artefacts (e.g. adapters, primers)
FASTQC and FASTX tools
BED and SAM tools
GATK variant calling
GATK and GATK Guide
R basics
Here is a file with some helpful R commands for inputting data, making basic plots, statistics, etc. courtesy of Los Lobos.
Also, refer to the following websites for help:
Python basics
Here is a file with helpful commands in Python, BioPython, EggLib, etc., from Los Lobos.
Also, here are several links to help you get going:
HT sequence analysis using R (and Bioconductor)
DNA sequence analysis
RNA-seq analysis
Common objectives of transcriptome analysis:
- Quantifying and annotating aligned reads
- Normalizing RNA-Seq read count data and identifying differentially expressed genes (DEG) (R packages):
- easyRNASeq (simplifies read counting per genome feature)
- DEXSeq (Inference of differential exon usage)
- baySeq (also see: segmentSeq)
- Genominator (Bullard et al. 2010)
- Detection of alternative splice junctions
SOLiD software tools