VonHoldt:High Throughput Sequencing Resources

Della and Tigress

The DELLA processing and TIGRESS data-storage servers of Princeton's High Performance Computing center are our analytical powerhouses, and we have specific locations on each server for specific jobs. They live in a server closet, so the way to access them is through a secure shell (ssh). Your username and password are obtained through the IT staff. Once you have logged on, there are a series of commands and "server etiquette" rules you will need to follow. The PU website has more information on basic usage and tutorials if you are interested.

You should familiarize yourself with some basic Unix commands by doing a few tutorials. Here is also a nice website with a large number of Linux commands.


Login

  • ssh netid@della.princeton.edu --- to log in securely
    • If you are on wifi, you need to use VPN for secure access!! This makes it possible to ssh remotely from Small World Coffee!
  • slogin netid@della.princeton.edu --- an alternative way to log in securely
  • uname -a --- to learn about the server
  • passwd --- to change the default password you are given
  • logout (or control+D) --- to logout


Rules

  • Della is only to be used to execute code via a formal job submission (the sbatch command; see "Submitting jobs" below)
    • You only have 1GB of space in your Della home directory
    • You also have 500GB of SCRATCH space on Della
  • Tigress is for storage of all data! Write all output to this server, as well
    • ln -s /tigress/path linkname --- soft link your Tigress data files into your home directory (see the example below)
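
For example, a minimal sketch of soft-linking the shared Tigress data directory into your Della home (the link name "my_data" is hypothetical):

$ cd ~
$ ln -s /tigress/VONHOLDT/data my_data    # creates ~/my_data pointing at the shared data on Tigress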


Shared data - the vonHoldt Group

  • All data is available to members of the vonHoldt Group on Tigress
  • To access the data
    • ssh into Della using your own netID and password
    • cd to /tigress/VONHOLDT
      • /tigress/VONHOLDT --- contains directories for scripts, programs, and data
        • Also keep your scripts in your own scripts folder in this directory
      • /tigress/VONHOLDT/data --- contains shared data
  • Make sure you don't move data around - we will use scripts to link to each file's absolute path and moving data around will break our code
  • As a rule: DO NOT MOVE DATA UNLESS YOU DISCUSS WITH THE LAB GROUP FIRST
  • You can create your own directory in /tigress/VONHOLDT if you need to test code, manipulate new files, etc
  • Please make every effort to avoid overwriting any data! It will also be backed up locally, but it will be a hassle to fix (for both you and me)! Thank you!


Submitting jobs using sbatch
The to_run.sh script needs to be kept in your Della home directory.

Here is a good format for your own to_run.sh scripts:

#!/bin/sh
#SBATCH -N 4 # nodes=4
#SBATCH --ntasks-per-node=20 # ppn=20
#SBATCH -J MYPROGRAM # job name
#SBATCH -t 14:00 # 14 minutes walltime
#SBATCH --mail-user=username@princeton.edu
#SBATCH --mail-type=begin
#SBATCH --mail-type=end
module load openmpi
srun ./myprogram

Use sbatch to_run.sh to run jobs on Della. Make sure the active part of your to_run.sh (e.g. the ./batchAwk.sh call) points to the right directories on Tigress that contain either more scripts or the data. Note that the new nodes have 20 cores and 128GB of memory. Again, for any problems/questions, please email cses@princeton.edu.

Usage

  • SLURM --- the job scheduler on Della; the script you submit (e.g. jobs_to_run.sh) can point to perl/python/R/shell scripts on Tigress that do the actual work
  • sbatch --- to submit your script/job
  • Job length: Initially estimate 2x the amount of time you think your job will take to complete. You can refine this value over time.
    • Test queue
      • 1 hour limit
      • 2 job maximum per user and NOT to be used for production mode
    • Short queue
      • 24 hour limit
      • 40 job maximum
    • Medium queue
      • 72 hour limit
      • 16 jobs maximum per user
      • 432 total cores
  • qstat --- to check the job progress on Della
  • You can ssh into any node once you have the node ID from your job submission to check on the job status using traditional commands (see the sketch after this list):
    • htop --- use to view real-time CPU usage
    • top --- displays the top CPU processes/jobs and provides an ongoing look at processor activity in real time. It displays a listing of the most CPU-intensive tasks on the system, and can provide an interactive interface for manipulating processes. It can sort the tasks by CPU usage, memory usage and runtime.
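
A minimal sketch of submitting and monitoring a job from your Della home directory (squeue and scancel are standard SLURM commands; the job ID shown is hypothetical):

$ sbatch to_run.sh      # submits the job and prints a job ID, e.g. "Submitted batch job 123456"
$ squeue -u $USER       # lists your queued and running jobs and the nodes they are assigned to
$ scancel 123456        # cancels the job if something went wrong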


A schematic of how Della and Tigress are set up, and of basic usage.


Basic unix

Just some basics on unix!

  • If you don't know something, use the manual (man) pages
    • man ls --- to look up the functionality of the ls tool; Google also helps
  • mpstat --- to display the utilization of each CPU individually. It reports processors related statistics
  • mpstat -P ALL --- displays activities for each available processor, processor 0 being the first one; global average activity across all processors is also reported
  • sar --- displays the contents of selected cumulative activity counters in the operating system
  • ps -u yourusername --- lists your processes
  • kill PID --- kills (ends) the process with that process ID
  • ps -u username --- lists all the current jobs for a specified username


Installing programs yourself (locally on the lab computers)

  • Check if it's already installed
  • mkdir ~/bin --- to create a directory in your home folder
  • cat ~/.bash_profile --- check whether ~/bin is already in your PATH; if not, add the two lines below to ~/.bash_profile
  • PATH=$PATH:$HOME/bin
  • export PATH
  • compile with --prefix=$HOME so that the program installs its binaries into ~/bin (see the sketch below)
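
A minimal sketch of a typical source install into ~/bin, assuming the program uses a standard configure/make build (the tarball name is hypothetical):

$ tar -xzf someprogram-1.0.tar.gz
$ cd someprogram-1.0
$ ./configure --prefix=$HOME     # binaries will be installed into ~/bin
$ make
$ make install
$ echo 'PATH=$PATH:$HOME/bin; export PATH' >> ~/.bash_profile   # only if ~/bin is not already in your PATH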


Data transfer (network)

  • scp options user@host_source:path/to/file1 user@host_dest:/dest/path/to/file2 --- Command Line Interface (CLI) for moving files (see the example below)
  • scp -r user@host_source:path/to/dir user@host_dest:/dest/path --- Command Line Interface (CLI) for moving directories
  • FileZilla, Cyberduck, Fugu, etc. --- Graphical User Interface (GUI)
  • df -h --- check disk usage of mounted filesystems
  • du -hs /path --- check total disk space used by a directory
  • du -h --max-depth=1 /path --- check disk space used by each subdirectory, one level deep
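
For example, a sketch of pulling one file from Tigress down to your local machine (run from your local terminal; the file path is hypothetical):

$ scp netid@della.princeton.edu:/tigress/VONHOLDT/data/sample_R1.fastq.gz ~/Downloads/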


Files

  • ls --- lists your files
  • ls -l --- lists your files in long format
  • ls -a --- shows hidden files... this is actually a critical command! If you *think* you are using little space but it turns out you have a million hidden files... voila, hidden files can be managed.
  • ls -t --- sorted by time modified instead of name
  • ls -lh --- lists your files in long format with human-readable file sizes
  • ls -hla --- combines the three commands above; it's beautiful.
  • more filename --- shows first part of a file; hit space bar to see more
  • head filename --- print to screen the top 10 lines or so of the specified file
  • tail filename --- print to screen the last 10 lines or so of the specified file
  • emacs filename --- an editor for editing a file
    • Once you are in emacs, you can edit the file
    • To save once you are done editing, press Ctrl+x then Ctrl+s
    • To exit, press Ctrl+x then Ctrl+c
  • cp filename1 filename2 --- copies a file in your current location
  • cp path/to/filename1 path/to/filename2 --- you can specify a file copy at another location
  • mv filename1 destination/filename1 --- moves your file to a new location
    • mv *.bam ../ --- moves all of the bam files from the current directory up one level
  • rm filename --- permanently remove a file (Caution! This cannot be undone!)
  • diff filename1 filename2 --- compares files and shows where they differ
  • wc filename --- tells you how many lines, words, and characters (bytes) are in a file
  • wc -l filename --- tells you how many (newline-delimited) lines are in a file
  • wc -w filename --- tells you how many (whitespace-delimited) words are in a file
  • wc -c filename --- tells you how many characters (bytes) are in a file
  • chmod options filename --- change the read, write, and execute permissions for a file (Google this!)


File compression [see also the gzip usage website]

  • gzip filename --- compresses file to make a file with a .gz extension
    • gzip -d *.gz --- uncompresses all gzip files
  • gzip -c filename > filename.gz --- compress the file into filename.gz while keeping the original; the ">" redirects the output to filename.gz
  • gunzip filename --- uncompress a gzip file
  • tar -xzf filename.tar.gz --- decompressing a tar.gz file
  • gzcat filename (zcat on most Linux systems) --- lets you look at a gzipped file without having to gunzip it


Directories

  • pwd --- prints working directory (your current location)
  • cd /path/to/desired/location --- change directories by providing path
  • cd ../ --- go up one directory
  • mkdir directoryName --- make a new directory
  • rmdir directoryName --- remove directory (must be empty)...Remember that you cannot undo this move!
  • rm -r directoryName --- recursively remove a directory and the files it contains... Remember that you cannot undo this move!
  • rm filename --- remove the specified file... Remember that you cannot undo this move!


File permissions
File permissions may be the nagging factor keeping you from writing to an outfile or from changing the path of an executable. If you need additional help, please see this tutorial on permissions and talk to Dr. vonHoldt about any issues that come up with directory/file access. We want to prevent any raw data from being overwritten while not thwarting your analysis.
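
As a minimal sketch, these are the usual commands for checking and adjusting permissions (file names are hypothetical):

$ ls -l myscript.sh          # shows current permissions, e.g. -rw-r--r--
$ chmod u+x myscript.sh      # adds execute permission for you (the owner)
$ chmod a-w raw_data.fastq   # removes write permission for everyone, protecting raw data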

Finding things

  • whereis [filename, command] --- lists all occurrences of the filename or command
  • ff --- finds files anywhere on the system
  • ff -p --- finds a file by just typing in the beginning of the file name
  • grep string filename(s) --- looks for strings in the files (use man grep for more information)
  • ~/path --- the tilde (~) is a shortcut for the path to your home directory
  • nohup commands & --- to initiate a no-hangup background job (writes stdout to nohup.out)
  • screen --- to start a new screen session for a background job (Ctrl+a then d to detach; screen -ls to list running screens; screen -r pid to reattach)


Data editing

  • vim filename --- to edit the file


History

  • Ctrl+r --- reverse-search your command history
  • history --- display your command history
  • !cmd_num --- rerun the command with that number in the history list
  • The up arrow is a shortcut to scroll through recently used commands


Screen
And when you get good enough, try out the screen functionality!


High throughput (HT) platform and read types

Take a moment to check out this Cornell site describing the specs of a few platforms!

  • ABI-SOLiD
  • Illumina single-end vs. paired-end - this is typically the data type we will work with and collect for any future projects. The read lengths are being extended, but this has proven to be a nice platform for any HTS data collection.
  • Ion Torrent
  • MiSeq
  • Roche-454
  • Solexa


File formats and conversions

  • bcl (Illumina base calls)
  • qseq
  • fastq <--> fasta
  • sam <--> bam
  • bcf <--> vcf

Many conversion scripts exist either on the internet (googling fastq/qseq conversion shows many are readily available) or via SAMtools for post-mapping file management.
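
For instance, a minimal sketch of the SAM/BAM and BCF/VCF conversions with SAMtools/BCFtools (file names are hypothetical; flags follow the samtools-0.1.x-era syntax used elsewhere on this page):

$ samtools view -bS aln.sam > aln.bam         # SAM to BAM
$ samtools view -h aln.bam > aln.sam          # BAM back to SAM (-h keeps the header)
$ bcftools view variants.bcf > variants.vcf   # BCF to VCF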


RSYNC, FTP, and remote login

To obtain data from remote servers using a variety of methods:

  • rsync
rsync -rav [source location]:path [destination path]
rsync {options} {source} {destination}
rsync --help
  • ftp
ftp -i islay.qb3.berkeley.edu
cd path/of/data
mget *
quit

The "-i" at login will turn off the prompt where ftp asks you if you want to copy every file in the specified directory. The command "mget *" is to copy everything in your specified directory to your local directory. That local directory is set from wherever you login to ftp. I suggest locally changing to the directory to where you want the files copied. Then initiate the ftp session and mget.


Deplexing using barcoded sequence tags

    Use this python script from the HTSEQ bioinformatics support to deplex single-end reads: barcode_splitter.py

    You will have to change the extension from .txt to .py to run as follows:
    python barcode_splitter.py [options] fastq_read1 [fastq_read2] [fastq_read3]

    and also make a companion text file that contains barcode information in two columns of information:
    • column 1: sample ID
    • column 2: index sequence

    This script uses a perfect-match criterion to deplex pooled sequence lanes, and to date it works only for single-end sequencing (see the sketch below). Talk to Bridgett if you have a paired-end dual-index sequence tag system.
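
A minimal sketch of what the companion barcode file and the call might look like (sample IDs, index sequences, and file names are hypothetical, and the --bcfile option name is an assumption; check python barcode_splitter.py --help for the exact options):

 # barcodes.txt --- two tab-delimited columns: sample ID, index sequence
 sample01	ATCACG
 sample02	CGATGT
 sample03	TTAGGC

$ python barcode_splitter.py --bcfile barcodes.txt lane1_read1.fastq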


Quality control

  • Fastx tools
  • Using mapping as the quality control for reads
  • For PE data Fastqc is preferable to Fastx



Trimming and clipping

To begin processing your HTS data (i.e. if you have single-end Illumina sequence data and you need to begin processing the FASTQ files), the first thing you should do is run cutadapt to trim low-quality bases and clip remnant adapter sequences.

  • Trim based on low quality scored per nucleotide position within a read
  • Clip sequence artefacts (e.g. adapters, primers)
    • cutadapt for SE reads cutadapt download and run from your personal programs or scripts folder (Note: when running cutadapt on della the cutadapt script in the bin directory must be edited to specify python26, rather than just python.)
    • trimgalore for PE reads trimgalore download and run from your personal programs or scripts folder (also runs fastqc which is installed on sirius)


For example, for TruSeq paired-end clipping/trimming:
$ /home/vonholdt/VONHOLDT/BIN/cutadapt-1.8.1/bin/cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT --minimum-length 20 -q 20 -o 2769_read1_trimmed.fastq -p 2769_read2_trimmed.fastq 2769_read1.fastq 2769_read2.fastq

For single-end TruSeq:
$ /home/vonholdt/VONHOLDT/BIN/cutadapt-1.8.1/bin/cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --minimum-length 20 -q 20 -o 250031_R1_trimmed.fastq 250031_R1.fastq

For RRBS single-end read processing:
$ /home/vonholdt/VONHOLDT/BIN/cutadapt-1.8.1/bin/cutadapt -a NNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC --minimum-length 20 -q 20 -o 038M_trimmed.fastq 038M_read_1.fastq


Mapping to a reference

Next, after trimming, you can map these reads to a reference genome!

First, make sure your genome is built and indexed for mapping of short reads using stampy on Della.

To do this, run the commands below in this order to build both the genome .stidx and .sthash files:
./stampy.py --species=human --assembly=hg18_ncbi36 -G hg18 /data/genomes/hg18/*.fa.gz
./stampy.py -g hg18 -H hg18

Then you are ready to map your trimmed/clipped reads to the indexed genome.
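
A hedged sketch of the mapping call itself, following the stampy README conventions (read file names are hypothetical; -g and -h point at the .stidx and .sthash files built above, and stampy writes SAM to stdout):

$ ./stampy.py -g hg18 -h hg18 -M sample_read1_trimmed.fastq,sample_read2_trimmed.fastq > sample_aln.sam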

Other mapping software:

  • BWA
  • Bowtie2 on Della
  • CLC Genomics Workbench


FASTQC and FASTX tools

A few of the basic QC tools are built into trim_galore for trimming and clipping. However, if you want to run a series of other functions, please visit the website for more details!

  • FASTQC tools
    • Recall that this functionality is built into trim_galore, which provides QC alongside trimming/clipping
  • FASTX toolkit



BED and SAM tools

    • samtools can also be used for making pileups, calling variants, and filtering variants (this has been implemented following mapping with stampy; see the sketch at the end of this section)
    • Make sure to think about filtering thresholds: quality, read coverage, or frequency. Oftentimes 10x is a good minimum requirement in order to keep a variant found in ~20% of the reads. But of course, these settings can vary.

You may also be interested in identifying variants among a group of samples; this is an additional step beyond identifying heterozygous sites within a single genome.

  • Make sure to also check and edit your BAM header if needed. Using samtools:
samtools view -H name.sorted.bam 
  • This will show you if your sorted BAM is actually sorted
@HD	VN:1.0	SO:unsorted
  • If the "SO" column is sorted using samtools, it should say "coordinate" instead of "unsorted". If it lists "unsorted", then use the following to edit your header and re-make it as a BAM file:
samtools view -H 250031_aln.sorted.bam | sed -e 's/SO:unsorted/SO:coordinate/' | samtools reheader - 250031_aln.sorted.bam > 250031_reheadered.bam
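
A minimal sketch of the samtools pileup-and-call route mentioned above, using the samtools-0.1.x/bcftools syntax that matches the mpileup call used later on this page (file names are hypothetical; vcfutils.pl varFilter is one common filtering step):

$ samtools sort 250031_aln.bam 250031_aln.sorted       # old-style sort; writes 250031_aln.sorted.bam
$ samtools index 250031_aln.sorted.bam
$ samtools mpileup -uf reference.fa 250031_aln.sorted.bam | bcftools view -bvcg - > 250031_raw.bcf
$ bcftools view 250031_raw.bcf | vcfutils.pl varFilter -d 10 > 250031_filtered.vcf   # keep sites with >= 10x coverage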


GATK variant calling

GATK and GATK Guide
Then filter variants based on their quality, read coverage, or frequency. Oftentimes 10x is a good minimum requirement in order to keep a variant found in ~20% of the reads. But of course, these settings can vary.

You may also be interested in identifying variants among a group of samples; this is an additional step beyond identifying heterozygous sites within a single genome.
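
A hedged sketch of a basic GATK (v3-era) calling-and-filtering run as described in the GATK Guide (file names are hypothetical; the reference must be indexed and have a sequence dictionary):

$ java -jar GenomeAnalysisTK.jar -T HaplotypeCaller -R reference.fa -I 250031_reheadered.bam -o 250031_gatk.vcf
$ java -jar GenomeAnalysisTK.jar -T VariantFiltration -R reference.fa -V 250031_gatk.vcf --filterExpression "DP < 10" --filterName "LowCoverage" -o 250031_gatk_filtered.vcf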


Variant Call File (VCF)

  • Here is some information on the VCF format, one that is generated by SAMtools and GATK. VCF is a text file format, containing meta-information lines, a header line, and then data lines each containing information about a position in the genome.

##fileformat=VCFv4.1
##fileDate=20140127
##fileEncoding=MacRoman
##source=CLC Genomics Workbench 6.5 build 65094986
##reference=file:/Users/labadmin/CLC_Data/PE_seedcrackers/Small/smallforward%20(paired)%20trimmed%20(paired)%20(Reads)%20(Variants)%20(QualVar%20output,%20MVF).clc
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Total number of filtered reads per sample used by variant caller">
##FORMAT=<ID=CLCAD,Number=.,Type=Integer,Description="Allelic depth, number of filtered reads supporting alleles in the order listed in the GT field">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT PooledSmallBirds
chr1 549 . A C . . . GT:CLCAD:DP 1:11:11
chr1 552 . GA AT . . . GT:CLCAD:DP 1:14:14
chr1 578 . G C . . . GT:CLCAD:DP 1:13:13
chr1 589 . T A . . . GT:CLCAD:DP 1:16:16
chr1 596 . GG AT . . . GT:CLCAD:DP 1:17:17
chr1 636 . C CCAGGT . . . GT:CLCAD:DP 1:8:11
chr1 640 . C A . . . GT:CLCAD:DP 1:11:11
chr1 654 . GT G . . . GT:CLCAD:DP 1:13:13
chr1 657 . G A . . . GT:CLCAD:DP 1:13:13

  • QUAL: typically lists the quality score if you haven't already filtered out low quality variants (which was done at 10x in the above example)
  • GT: genotype, encoded as allele values separated by either "/" (unphased) or "|" (phased). The allele values are 0 for the reference allele (what is in the REF field), 1 for the first allele listed in ALT, 2 for the second allele listed in ALT, and so on. For diploid calls, examples could be 0/1, 1|0, or 1/2, etc. For haploid calls, e.g. on Y, male non-pseudoautosomal X, or the mitochondrion, only one allele value should be given; a triploid call might look like 0/0/1. If a call cannot be made for a sample at a given locus, "." should be specified for each missing allele in the GT field (for example "./." for a diploid genotype and "." for a haploid genotype).
  • AD: allelic depth for the alleles listed in the order they appear in the GT column
  • DP: read depth at this position for this sample (Integer).



R basics

Here is a file with some helpful R commands for inputting data, making basic plots, statistics, etc. courtesy of Los Lobos.

Also, refer to the following websites for help:

Additionally, with high throughput genome sequence data, we often need modules that are implemented in R's Bioconductor. Here is a great website and course material from a short course on using R and Bioconductor


Python basics

Here is a file with helpful commands in Python, BioPython, EggLib, etc., from Los Lobos.

Also, here are several links to help you get going:


HT sequence analysis using R (and Bioconductor)



DNA methylation analysis

Primarily, the vonHoldt lab works on reduced representation bisulfite sequencing (RRBS) of wild populations.

The data will be pre-processed using scripts both to deplex the sequences and to QC them (clip low-quality nucleotides and trim adapter sequences) prior to mapping and calling methylation.

  1. Obtain or know where the paired qseq files are located on /tigress/VONHOLDT
    1. s_1_1_[digits] --- contain target sequence
    2. s_1_2_[digits] --- contain tag sequence
  2. illumina2fastq.pl
    1. 96x2 fastq files produced
  3. Deplex using a barcode splitter script
    1. Then concatenate each barcode tag into a single file per individual that was pooled into the lane
  4. Trim-galore
    1. Clip by quality and trim adapter sequence
    2. Cutadapt readme instructions and install help
    3. FASTQC capabilities for overall sequence descriptive statistics
  5. Mapping using BS-Seeker2, which calls bowtie2 (pysam must be installed locally); see the sketch after this list.
    1. Produces a log file that contains the number of sequences mapped, tagged, etc.
    2. Produces .bam files
    3. Either take individual .bam files to next step or you can merge them to call methylation on pooled samples
    4. Example: samtools merge /data/males.bam /data/002Maligned.bam /data/006Maligned.bam /data/029Maligned.bam /data/113Maligned.bam
  6. Methylation calling using BS-Seeker2
    1. Produces a log file
    2. Produces outfiles: .ATCGmap and .CGmap (this is the one with the useful data)
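
A hedged sketch of the BS-Seeker2 calls for steps 5 and 6, using the names of the scripts shipped with BS-Seeker2 (the genome, file names, and options shown are assumptions; check the BS-Seeker2 documentation for your version):

$ python bs_seeker2-build.py -f canFam3.fa --aligner=bowtie2                                            # one-time genome index build
$ python bs_seeker2-align.py -i 038M_trimmed.fastq -g canFam3.fa --aligner=bowtie2 -o 038M_aligned.bam
$ python bs_seeker2-call_methylation.py -i 038M_aligned.bam -o 038M --db path/to/bs_seeker2_genome_db   # writes .ATCGmap and .CGmap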


BS Seeker2 workaround for processing/calling methylation:

  1. When BSseeker2 is used to map sequence data the resultant BAM file has optional tags used for designating various characteristics.
  2. One of these tags flags reads that show evidence of low bisulfite conversion; it sits in the 13th column, with the low-conversion tag taking the form "XS:i:1".
  3. When calling methylation in BSseeker there is a “-X” flag optional argument that can be used to specify the removal of any such low-bisulfite-conversion reads.
    1. However, this -X flag breaks with some versions of pysam.
  4. The workaround requires converting to a SAM file using samtools, filtering out all rows with the low-conversion tag "XS:i:1", and converting back to BAM (see the sketch after this list).
    1. In the process of this workaround, however, the conversion back to BAM breaks on some other problematic tags in column 19 that include hidden characters or are incomplete (this also seems to be due to a pysam incompatibility).
  5. You can therefore implement another workaround that searches for the regular expression ".*:.*:.*" and replaces any tag not matching this expression with an error tag "YE:i:Error".
    1. This enables conversion back to BAM and this flag is not required for later analysis (but we have retained the information of which reads contained the error).
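
A minimal sketch of the filtering workaround in step 4, assuming the low-conversion tag appears verbatim as "XS:i:1" on the affected reads (file names are hypothetical; the column-19 tag repair in step 5 is not shown):

$ samtools view -h 038M_aligned.bam | grep -v "XS:i:1" | samtools view -bS - > 038M_filtered.bam   # -h keeps the header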


Here are some suggested analytical methods with tutorials.


To use methylKit the data must have the following columns:

  • ID (e.g. Chromosome.position)
  • Chromosome
  • Position
  • Read direction (F/R)
  • Number of reads
  • %Methylated Reads (as a number 0-100)
  • %Unmethylated Reads (as a number 0-100)

Example: chr01.11979 chr01 11979 F 4 100.0 0.0

Then follow the tutorial for analysis steps.

Other notes:

  • Memory-limited when running R on a personal machine.
  • To look at CpG islands, must provide defined islands (either from genomic data or from predefined grouping of the data).
  • Can find differentially methylated regions between two groups only (male/female, treatment/control).



  • QDMR
    • Data must already be grouped into regions of comparison (see sample data).
    • Identification of DMRs by standard deviation threshold.
    • Visualization works best for more than two samples, or for an organism whose genome is already included in QDMR.
  • SAAP-BS
  • swdmr


And for any sort of GO enrichment analysis



RNA-seq analysis

Common objectives of transcriptome analysis:

For a reasonably thorough list of RNA-seq bioinformatic tools, please see this site!


SOLiD software tools


Passing Arguments to Scripts and Programs Using xargs

  • xargs passes commands from the bash shell command line to a shell script and to other scripts or programs called in the script.
    • Although the argument is always simply called $1 inside the script, xargs works iteratively, running the script with the first argument, then the second, and so on.
  • Create this simple script:
 #! /bin/bash
 # check that a base file name argument was supplied
 if [ $# -eq 0 ]  # if no arguments were entered the script will complain and then stop
   then
     echo "Please supply argument .... "
     echo "Usage: echo arg1 arg2 ... argn | xargs -n 1 scriptname.sh"
 else
     echo $1
 fi
  • Call it using:
 echo arg1 arg2 arg3 | xargs -n 1 script.sh 
  • The -n flag to xargs specifies how many arguments at a time to supply to the given command. -n 1 tells xargs to supply 1 argument to the command. The command will be invoked repeatedly until all input is exhausted.
    • This means you can also use xargs for a command that needs two or more arguments.
      • For instance you could use this to supply read group information to the picard AddReadGroups command.
  • Another option -P # will tell xargs to split job into # different cores. -P 4 uses 4 cores.
    • This only works if you have multiple jobs that can be run in PARALLEL, i.e. one command run multiple times, once with each argument or set of arguments (example below)
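
For example, a minimal sketch of running the same script on three chromosomes, one per core (the script and chromosome names are hypothetical):

 echo chr1 chr2 chr3 | xargs -n 1 -P 3 ./script.sh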


  • You can pass arguments to a program like fastqc, tophat, samtools etc.
    • I split up my aligned reads by chromosome to speed up processing.
      • With xargs I can call them all at once and process them on more than one core, something that samtools can't do by itself.
      • The following command would pile up three samples and do so sequentially for however many chromosomes I pass in via xargs.
 #! /bin/bash
 #check that a base file name argument was supplied
 if [ $# -eq 0 ]  # if no arguments were entered
   then
     echo "Please supply argument .... "
     echo "Usage: echo arg1 arg2 ... argn | xargs -n 1 scriptname.sh"
 else
     samtools mpileup -uf referencefilename /path/sample1$1.bam /path/sample2$1.bam /path/sample3$1.bam | bcftools view -bvcg - > /path/$1var.raw.bcf
 fi


  • You can pass the arguments to a python script by using sys.argv to supply arguments to the python script and calling the python script as myscript.py arg1
  • Save this simple script:
 #! /bin/bash
 if [ $# -eq 0 ]  # if no arguments were entered
   then
     echo "Please supply argument .... "
     echo "Usage: echo arg1 arg2 ... argn | xargs -n 1 scriptname.sh"
 else
     ./test.py $1   # test.py must be executable (chmod +x test.py) and in the current directory
 fi
  • Save the following as test.py. It will be called by the last shell script above.
    • This is a very simple example but number could just as easily designate a file to be opened by the python script.
 #! /usr/bin/env python
 import sys                                  # sys.argv holds the command-line arguments
 number = sys.argv[1]                        # the first argument passed in by the shell script
 print "This is argument number ", number    # Python 2 print syntax