FEMR Cluster
Cluster Setup
Original cluster diagram (editable, SVG format): File:Cajal-diagram.zip
Name | RAM (GB) | CPU | Cores | Speed | GPUs
cajal | 256 | E5-2630 v4 | 2 x 10 | 2.2 GHz | -
c01 | 512 | E5-2698 v4 | 2 x 20 | 2.2 GHz | 6 x Titan Xp (12 GB)
c02 | 512 | E5-2698 v4 | 2 x 20 | 2.2 GHz | 6 x Titan Xp (12 GB)
c03 | 256 | E5-2630 v4 | 2 x 10 | 2.2 GHz | 4 x GTX 1080 (8 GB)
c04 | 512 | E5-2698 v4 | 2 x 20 | 2.2 GHz | 5 x Titan Xp (12 GB)
c05 | 512 | E5-2698 v4 | 2 x 20 | 2.2 GHz | 2 x Titan V (12 GB)
zurich | 128 | E5-2690 v3 | 2 x 12 | 2.6 GHz | 1 x GTX 980 (4 GB)
Logging In
The FEMR cluster uses a public & private key system for logging in; no passwords are used.
Generating a private & public key pair
On your personal computer (Mac or Linux), type:
ssh-keygen -f ~/user
This creates two files, user and user.pub. user is your private key, which you must KEEP to yourself, while user.pub is your public key, which will be placed on the server by the FEMR cluster admin.
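You can check that both files exist and, as is standard SSH practice, restrict the permissions on the private key (the file name user is just the example from above):

ls -l ~/user ~/user.pub
chmod 600 ~/user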
Getting approval
Ask your group leader for permission to create your account and send him/her your public key (user.pub) so that the FEMR cluster admin can place it on the cluster. It is preferable for your private key file to have the same name as your username.
Logging in using your key
Once your account is approved, log in by typing:
ssh -i path_to_private_key -Y username@cajal.campus.mcgill.ca
Example:
ssh -i ~/user -Y username@cajal.campus.mcgill.ca
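To avoid retyping the key path and full hostname, you can optionally add an entry to ~/.ssh/config on your own computer. This is a convenience sketch; replace username and the key path with your own:

Host cajal
    HostName cajal.campus.mcgill.ca
    User username
    IdentityFile ~/user
    # -Y on the command line corresponds to trusted X11 forwarding
    ForwardX11 yes
    ForwardX11Trusted yes

After this, typing ssh cajal is enough to log in.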
Storage
Storage location
- cajal: Guarne/Vargas/Ortega Lab (/data)
- zurich: Bui Lab (/mnt/zurich/data0)
Navigate to your storage space:
cd /data/jvargas
or
cd /mnt/zurich/data0/
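Before writing large datasets, it can be worth checking how much space is left on a storage volume with the standard Linux tools (the project path below is hypothetical):

df -h /data
du -sh /data/jvargas/my_project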
Queues
The cluster employs a job queue system (SLURM). Jobs submitted to the queues run on the compute nodes, whereas interactive jobs run directly on cajal; the latter is not recommended because cajal acts as the head node and storage node.
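If you really need to work interactively, a safer option than running on cajal itself is to request an interactive shell on a compute node through SLURM, for example (a sketch; pick any partition listed below and a realistic time limit):

srun --partition=cpusmall --ntasks=1 --time=2:00:00 --pty bash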
There are two main GPU queues (titan and gtx), a smaller GPU queue (titansmall), and two CPU queues (cpu and cpusmall).
titan
Use to run Class2D, Class3D and Refine3D jobs: GPU-intensive work.
Three nodes; each node has 6 GPUs, 40 cores and 80 threads.
When running GPU jobs, these nodes reserve 7 cores to manage the GPUs during the job.
titansmall
Use to run Class2D, Class3D and Refine3D jobs: GPU-intensive work.
One node with 3 GPUs, 10 cores and 20 threads.
When running GPU jobs, this node reserves 4 cores to manage the GPUs during the job.
gtx
One node (Huy's node).
Use to run MotionCor2 and Gctf.
The node has 4 GPUs, 20 cores and 40 threads.
When running GPU jobs, this node reserves 5 cores to manage the GPUs during the job.
cpu
Cores left for CPU-based jobs: 30 cores per node.
Recommended settings: mpi=30, threads=2.
There are four nodes in this queue. In principle you could ask for mpi=60, but this is not recommended: the job will take longer to enter the queue. Around mpi=20-30 is reasonable for any job.
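As an illustration, a job header following the mpi=30, threads=2 recommendation could look like this (a sketch; here --cpus-per-task is assumed to carry the per-task thread count, and time/memory must be adapted to your job):

#SBATCH --partition=cpu
#SBATCH --ntasks=30        # MPI tasks (mpi=30)
#SBATCH --cpus-per-task=2  # threads per MPI task (Thread=2)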
cpusmall
Cores left for CPU-based jobs: 14 cores in this queue.
Software on the cluster
Loading modules
On the cluster, before using any program you have to load its module first. Type:
module load program_name
or
module add program_name
("load" and "add" are equivalent).
For example:
- Relion
module load relion
- Scipion
module load scipion
- Imod
module add imod
- Matlab
module load matlab
- Anaconda
module load anaconda
One of the advantages of modules is that it is easy to select which version of the software you want to use. Check which versions are available (see below), then include the version in the command:
module load relion/2.1.0
When no version is specified, the latest version available on the cluster is loaded.
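For example, to see which Relion versions are installed and pick one explicitly:

module avail relion        # list the available relion versions
module load relion/2.1.0   # load a specific version
module list                # confirm what is loaded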
Loading MotionCor2 is a bit different:
spack load --dependencies motioncor2
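If you want to see which MotionCor2 installations spack knows about before loading, spack's query command can be used (assuming spack is already in your environment):

spack find motioncor2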
Check available module/software
module avail
Check what modules are loaded
module list
Unload modules
To unload a single module:
module unload program_name
To unload all loaded modules at once:
module purge
Running Jobs on the cluster
Check the status of the queue
To see how busy the cluster is, type:
sinfo
You will get something like this:
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
titan      up     7-00:00:00      2  idle   c[01-02]
gtx        up     7-00:00:00      1  idle   c03
cpu        up     7-00:00:00      2  idle   c[01-02]
cpusmall*  up     7-00:00:00      1  idle   c03
The STATE column means:
- idle: no jobs running
- mixed: some resources still available
- allocated: all resources in use
sinfo has lots of options. For example, to get a more detailed view of resource usage:
sinfo -N --format="%.10P %.8N %.16C %G"

 PARTITION  NODELIST  CPUS(A/I/O/T)  GRES
     titan       c01     48/32/0/80  gpu:titan:6
       cpu       c01     48/32/0/80  gpu:titan:6
     titan       c02      8/72/0/80  gpu:titan:6
       cpu       c02      8/72/0/80  gpu:titan:6
       gtx       c03      0/0/40/40  gpu:gtx:4
 cpusmall*       c03      0/0/40/40  gpu:gtx:4
     titan       c04     12/68/0/80  gpu:titan:6
       cpu       c04     12/68/0/80  gpu:titan:6
titansmall       c05     24/56/0/80  gpu:titan:3
       cpu       c05     24/56/0/80  gpu:titan:3
The CPUS column format is Allocated/Idle/Other/Total. GRES reports the number and type of GPUs available on each node.
See jobs running in queues
Type:
squeue
Typical output:
JOBID  PARTITION  NAME    USER   ST  TIME        NODES  NODELIST(REASON)
  729  gtx        1005    sky    PD  0:00            1  (Resources)
  728  gtx        Relion  wasit  R   1-23:05:38      1  c03
  734  cpu        2652    sheny  R   1-03:30:21      1  c01
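To list only your own jobs rather than everyone's, pass your username:

squeue -u $USER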
Killing queued or running jobs
List the jobs in the queue by typing:
squeue
Then find the JOBID of your job and type:
scancel JOBID
For example:
scancel 45
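scancel can also select jobs by owner or by name, which is convenient when cancelling several jobs at once:

scancel -u $USER                 # cancel all of your own jobs
scancel --name=frames_to_stack   # cancel jobs by the name set with --job-name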
Submit a job
sbatch submit.script
The format of submit.script is described below.
Submission Script Format
Example of Submission Script
#!/bin/bash
#SBATCH --time=2-0
#SBATCH --ntasks=1
#SBATCH --partition=cpusmall
#SBATCH --error=error.log
#SBATCH --output=frames_to_stack.log
#SBATCH --job-name=frames_to_stack
#SBATCH --mem 1000

module load imod

inputDir='Frames'
outDir='Movies'

# For each movie, combine its individual frame files into a single .mrcs stack
for i in ${inputDir}/*n0.mrc; do
    foo=${i/_frames_n0.mrc}      # strip the frame suffix
    foo=${foo#${inputDir}/}      # strip the input directory prefix
    newstack ${inputDir}/${foo}_frames_n*.mrc ${outDir}/${foo}_frames.mrcs
done
Explanation of the flags
- --time: estimate of how long the job will take. Format: days-hours or hours:minutes:seconds.
- --ntasks: number of MPI tasks
- --partition: queue name
- --error: file for error output
- --output: file for standard output
- --job-name: job name displayed by squeue
- --mem: memory requested, in megabytes
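For comparison, a GPU job on the titan queue would additionally request GPUs with --gres, using the gres type shown in the sinfo output above. This is only a sketch, not one of the prefilled .sbs scripts listed below; counts, time and the actual Relion command must be adapted to your job:

#!/bin/bash
#SBATCH --time=1-0
#SBATCH --partition=titan
#SBATCH --gres=gpu:titan:4   # request 4 Titan GPUs
#SBATCH --ntasks=5
#SBATCH --job-name=refine3d
#SBATCH --error=error.log
#SBATCH --output=refine3d.log

module load relion
# relion_refine_mpi ... (actual command line omitted)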
To submit jobs to the queue, submission scripts are used (files with a ".sbs" extension in any of the Relion directories).
- Standard Relion submission script for GPU jobs, titan queue (prefilled in the interface):
/usr/local/relion/3.1.2/bin/relion_slurm_titan.sbs
- Standard Relion submission script for GPU jobs, gtx queue (prefilled in the interface):
/usr/local/relion/2.1.0/bin/relion_slurm_gtx.sbs
- Standard Relion submission script for GPU jobs, titansmall queue (prefilled in the interface):
/usr/local/relion/2.1.0/bin/relion_slurm_titansmall.sbs
- Standard Relion submission script for CPU jobs (cpu queue):
/usr/local/relion/2.1.0/bin/relion_slurm_cpu.sbs
- Standard Relion submission script for CPU jobs (cpusmall queue):
/usr/local/relion/2.1.0/bin/relion_slurm_cpusmall.sbs
Running Job on Relion
In Relion, the script is submitted automatically when you fill in the information in the Running tab and press Run. However, you have to monitor the job yourself and kill it yourself.
Running Job on Scipion
In Scipion, the script is submitted automatically, and when you cancel a job from the interface it is also cancelled in the queue.