VonHoldt:High Throughput Sequencing Resources: Difference between revisions

Revision as of 16:45, 5 June 2013

Basic unix and usage of Sirius (our lab server)

Sirius is our analytical powerhouse (64 cores, amazing for parallel computing; 512Gb memory; 64 bit file system in the x86_64 configuration) and we have specific locations on the server to do specific jobs. It is stored in a lovely server closet and so the way to access it is though a secure shell (ssh). Your username and password are obtained through our IT staff. Once you have logged on, there are a series of commands and "server etiquette" you will need to follow. For the PDF, click here.

You should familiarize yourself with some basic Unix commands by doing a few tutorials. Here is also a nice website with a large number of linux commands .

Login

ssh user@sirius.eeb.ucla.edu --- to secure login
slogin user@sirius.eeb.ucla.edu --- to secure login
uname -a --- to learn about the server
passwd --- to change the default password you are given
logout (or control+D) --- to logout

Structure and organization

Your home (user) directory on Sirius holds <5Gb of data (be aware!)
- /home/user
For genomes and databases
- /databases
Location of installed programs
- /usr/local/bin
- /opt/
The location to store your data
- /data/
- /data/user
  - You can create your own personal directory if you'd like (see below for commands)
The location to place scripts and data ONLY while you are working with it
- /work/user
whoami --- returns your username
du -a username --- returns your space usage but make sure to do this from the parent directory of your user directory

Rules

Developing a pipeline:
- copy a small but representative part of your data to sirius
- run all the programs you need on them
- debug and save final version of pipeline (e.g. in a text file)
- copy all your data
- run your pipeline on all data
- debug and update pipeline
- move results wherever you want
- erase data
Never start more jobs than the number of available cores (e.g. If there are 50 jobs running, do NOT submit more than 14 to make a total of 64 jobs)!!
Look at the memory and cpu usage before you start to load sirius with commands (cmd)
- htop --- use to view real-time CPU usage
- top --- displays the top CPU processes/jobs and provides an ongoing look at processor activity in real time. It displays a listing of the most CPU-intensive tasks on the system, and can provide an interactive interface for manipulating processes. It can sort the tasks by CPU usage, memory usage and runtime.
If you don't know something, use manual
- man ls --- to look up the functionality of the ls tool, use Google, or ask admins (Jonathan or Ron) or in-lab (Rena or Pedro)
mpstat --- to display the utilization of each CPU individually. It reports processors related statistics
mpstat -P ALL --- the mpstat command display activities for each available processor, processor 0 being the first one. Global average activities among all processors are also reported
sar --- displays the contents of selected cumulative activity counters in the operating system
ps -u yourusername --- lists your processes
kill PID --- kills (ends) the process with that process ID
ps -u username --- lists all the current jobs for a specified username

Installing programs yourself

Check if it's already installed
mkdir ~/bin --- to creak a directory in your home folder
cat .bash_profile --- put it in your path or check to see if it's already there
PATH=$PATH:$HOME/bin
export PATH
compile it with prefix ~/bin --- install programs to bin

Data transfer (network)

scp options user@host_source:path/to/file1 user@host_dest:/dest/path/to/file2 --- Command Line Interface (CLI) for moving files
scp -r user@host_source:path/to/dir user@host_dest:/dest/path --- Command Line Interface (CLI) for moving directories
FileZilla, Cyberduck, Fugu, etc. --- Graphical User Interface (GUI)
df -h -- check disk usage
du -hs /path --- check disk space used by a directory
du -h -max-depth=1 /path --- check disk space used by a directory

Files

ls --- lists your files
ls -l --- lists your files in long format
ls -a --- shows hidden files... this is actually a critical command! If you *think* you are using little space but it turns out you have a million hidden files... voila, hidden files can be managed.
ls -t --- sorted by time modified instead of name
ls -h --- lists your files in "human" format
ls -hla --- gives you all from combing the three commands; it's beautiful.
more filename --- shows first part of a file; hit space bar to see more
head filename --- print to screen the top 10 lines or so of the specified file
tail filename --- print to screen the last 10 lines or so of the specified file
emacs filename --- an editor for editing a file
cp filename1 filename2 --- copies a file in your current location
cp path/to/filename1 path/to/filename2 --- you can specify a file copy at another location
rm filename --- permanently remove a file (Caution! This cannot be undone!)
diff filename1 filename2 --- compares files and shows where they differ
wc filename --- tells you how many lines (whitespace or newline delimited), words, and characters (bytes) are in a file
wc -l filename --- tells you how many lines are in a file (whitespace or newline delimited)
wc -w filename --- tells you how many words are in a file
wc -c filename --- tells you how many characters (bytes) are in a file
chmod options filename --- change the read, write, and execute permissions for a file (Google this!)

File compression [see also the gzip usage website]

gzip filename --- compresses files to make a file with a .gz extension
gzip -c filename >filename.gz --- compress file into tar.gz; the ">" means print to outfile filename.gz
gunzip filename ---uncompress a gzip file
tar -xzf filename.tar.gz --- decompressing a tar.gz file
gzcat filename --- lets y ou look at a gzipped file without having to gunzip it

Directories

pwd --- prints working directory (your current location)
cd /path/to/desired/location --- change directories by providing path
cd ../ --- go up one directory
mkdir directoryName --- make a new directory
rmdir directoryName --- remove directory (must be empty)...Remember that you cannot undo this move!
rmdir -r directoryName --- recursively remove directory and the files it contains...Remember that you cannot undo this move!
rmdir filename --- remove specified file...Remember that you cannot undo this move!

Finding things

whereis [filename, command] --- lists all occurances of filename or command
ff --- finds files anywhere on the system
ff -p --- finds a file by just typing in the beginning of the file name
grep string filename(s) --- looks for strings in the files (use man grep for more information)
~/path --- tilde designated a shortcut for the path to your home directory
nohup commands & --- to initiate a no-hangup background job (writes stdout to nohup.out)
screen --- to initiate a new screen session to start a new background job (ctrl+a+d if you need to detach; screen -ls to list running screens; reattach screen pid)

Data editing

vim filename --- to edit the file

History

ctrl+r --- searching history
history --- display history
!#cmd_num --- display history
Arrow up is a short cut to scroll through recently used commands

VonHoldt:High Throughput Sequencing Resources: Difference between revisions

Revision as of 16:45, 5 June 2013

Basic unix and usage of Sirius (our lab server)

High throughput (HT) platform and read types

CBI Collaboratory

Getting your HT sequence data

File formats and conversions

Deplexing using barcoded sequence tags

Quality control

Trimming and clipping

FASTQC and FASTX tools

BED and SAM tools

GATK variant calling

R basics

Python basics

HT sequence analysis using R (and Bioconductor)

DNA sequence analysis

RNA-seq analysis

SOLiD software tools

Passing Arguments to Scripts and Programs Using xargs

Navigation menu

Search