Wilke:Using lab cluster



Using the lab cluster


The lab cluster is phylocluster.ccbb.utexas.edu. You will need an account on the cluster to be able to use it.

The operating system on the cluster is Linux. If you have no experience with the Linux shell and command line, you will have to learn about that first. This is a good tutorial for beginners: [1]

The cluster uses the Sun Grid Engine (SGE) to distribute computing tasks over the available compute nodes. Every job you want to run must be submitted through SGE. This is different from how you run jobs on your computer at home, so even if you have plenty of experience with Linux on your own computer you may not know your way around SGE.

A brief introduction to using SGE is given here: [2]. There's a lot of material in this document. The most important part is that every job on the cluster should use the provided bare-bones script as a starting point: [3]. If you have many independent runs that can run in parallel and don't need to talk to each other, you should also read this page: [4].
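The bare-bones script linked above is not reproduced here, but a minimal SGE submit script generally looks something like the following sketch (the job name, output file names, and the echo command are placeholder assumptions, not the lab's actual template):

```shell
#!/bin/bash
# Sketch of a minimal SGE submit script. Lines starting with #$ are SGE
# directives; bash treats them as comments, so the file also runs as an
# ordinary shell script.
#$ -S /bin/bash        # interpret the job script with bash
#$ -cwd                # start the job in the submission directory
#$ -N myjob            # job name, as shown by qstat
#$ -o myjob.out        # file for the job's standard output
#$ -e myjob.err        # file for the job's standard error

# The actual work goes below the directives.
msg="job running on $(uname -n)"
echo "$msg"
```

You would submit this with qsub myjob.sh and check on it with qstat.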

Finally, it is good practice to copy your jobs to /state/partition1 on the compute node before running them. /state/partition1 is the local hard drive on the compute node. We will soon provide a tutorial on exactly how to do this.
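Until that tutorial exists, the staging pattern can be sketched roughly as follows. The input and result file names are hypothetical, and the script falls back to a temp directory so it can be tried off-cluster:

```shell
#!/bin/bash
# Sketch: stage a job onto the compute node's local disk, run it there,
# copy results back, and clean up. File names below are hypothetical.

# Use the node-local disk if it exists and is writable; otherwise fall
# back to a temp dir so the script can be tried outside the cluster.
SCRATCH_BASE=/state/partition1
[ -d "$SCRATCH_BASE" ] && [ -w "$SCRATCH_BASE" ] || SCRATCH_BASE=$(mktemp -d)

WORKDIR="$SCRATCH_BASE/$(id -un)/job_$$"   # unique per user and job
mkdir -p "$WORKDIR"

cp input.dat "$WORKDIR/" 2>/dev/null || true   # stage input (hypothetical file)
(
  cd "$WORKDIR"
  echo "running in $WORKDIR"                   # placeholder for the real computation
)
# cp "$WORKDIR/results.dat" .                  # copy results back (hypothetical)
rm -rf "$WORKDIR"                              # always clean up the local disk
```

The cleanup step matters: the local disk is shared with other jobs on that node, so leftover data fills it up for everyone.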


Storage locations

There are several locations for data storage. Which location you should use depends on the nature and size of data you have to store.

  • Your home directory (/home/<your user name>). Your home directory is where you will end up by default after logging in to the cluster. Home directories are backed up, and they are generally a good place to store important files and data. However, home directories should not be used for large amounts of data (>10Gb is large). Also, most data (and code) related to research projects should be stored elsewhere (/share/WilkeLab/work or /share/WilkeLab/archive, see below).
  • /share/WilkeLab/work. This directory is the right place to store large amounts of data that you are currently working with. It is not backed up, but everything is stored on a RAID array, which protects the data against basic hardware malfunction. We currently have approximately 5.5Tb of storage available on /share/WilkeLab/work.

    Remember that /share/WilkeLab/work is shared by all lab members, so we need to keep data well organized and be able to tell who stored what where. Therefore, you should create a subdirectory named after your user name and store all your data in that subdirectory. For example, if your username is ab123, create the directory /share/WilkeLab/work/ab123 and place all your data and code there.

  • /share/WilkeLab/archive. This directory functions just like /share/WilkeLab/work, with the exception that it is backed up on top of being on a RAID. This means that not only is this storage protected against basic hardware malfunction, but there is an extra level of safety against catastrophic failure, such as a fire in the server room. We currently have about 2.8Tb of storage in /share/WilkeLab/archive.

    Any important research work should be stored in /share/WilkeLab/archive. This covers both data and code for completed projects and important data and code for ongoing projects. As a simple rule of thumb, important work products should be stored in /share/WilkeLab/archive, whereas temporary data sets or data that can easily be replicated should be stored in /share/WilkeLab/work. For example, let's assume you want to analyze five mammalian genomes for certain sequence similarities. You should store the genomes in /share/WilkeLab/work, since they can be re-downloaded easily. You should store the analysis code and the resulting data in /share/WilkeLab/archive, since they are important work products.

    An exception to this rule would be large output files from simulations. If you have written (or are using) a simulation that produces 500Gb of data in 24h, we cannot archive all this output in the long run. In this case, make sure you store the configuration files and other important information to reproduce the simulations, as well as any smaller files you obtain from analyzing the simulation output.

    As with /share/WilkeLab/work, you should create a subdirectory named after your user name and store all your data in that subdirectory.

  • /share/WilkeLab/scratch. This directory is similar to /share/WilkeLab/work. We currently have 2Tb of space here. /share/WilkeLab/scratch is low-priority storage. Data you have in this location could be deleted without much warning, even though we will generally try not to delete any data without first talking to you. If you just need some temporary storage for a few days, /share/WilkeLab/scratch would be a good choice.

    As with /share/WilkeLab/work, you should create a subdirectory named after your user name and store all your data in that subdirectory.

  • /share/WilkeLab/archive/published_papers and /share/WilkeLab/archive/submitted_papers. We want to archive all data and code necessary to reproduce any paper the lab publishes. Therefore, once a paper has been submitted or published, please deposit a zip file (or compressed tar archive) of all the required data and code, with proper documentation, in /share/WilkeLab/archive/submitted_papers (before final acceptance) or /share/WilkeLab/archive/published_papers (after final acceptance). See the file /share/WilkeLab/archive/README.txt for best practices.
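Several of the locations above ask you to create a subdirectory named after your user name. That setup step can be sketched as follows; LAB_WORK is a stand-in variable (with a temp-dir fallback so the commands can be tried anywhere), and on the cluster you would use /share/WilkeLab/work or /share/WilkeLab/archive directly:

```shell
# Create a personal subdirectory under a shared storage area.
# On the cluster, LAB_WORK would be e.g. /share/WilkeLab/work;
# here it defaults to a temp dir so the commands run anywhere.
LAB_WORK="${LAB_WORK:-$(mktemp -d)}"
me="${USER:-$(id -un)}"          # your cluster user name
mkdir -p "$LAB_WORK/$me"
ls -ld "$LAB_WORK/$me"           # verify the directory exists
```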

Important: Never store student grades, other FERPA-protected data, social security numbers, or similar information on the cluster. These are Category I data and require special security protocols that the cluster does not satisfy.

Useful unix commands related to storage

It is a good idea to regularly check how much storage you are using and how much is available. To find out the total amount of storage used in a directory, you can use the command du -sh <directory name>. For example:

> du -sh projects
5.2G    projects
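To see which subdirectories take up the most space, you can combine du with sort -h, which understands the K/M/G suffixes (this flag is a GNU coreutils feature, but the cluster runs Linux, so it should be available):

```shell
# Per-item usage in the current directory, largest last.
du -sh ./* 2>/dev/null | sort -h
# Total for the current directory:
du -sh .
```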

To find out how much storage is available and how much is used, use the df command. For example:

> df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1              5952252   4850948    794064  86% /
/dev/sda3             27948600     77912  26450944   1% /state/partition1
                     2857947040 1520440096 1337506944  54% /share/WilkeLab

Each line corresponds to one separate storage location. In this example, the first line corresponds to the root directory of the local disk. This is where the operating system is stored. The second line represents local temporary storage on the compute node. The third line represents the network storage where /share/WilkeLab resides.
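The raw 1K-block counts above are hard to read; df also takes the -h flag for human-readable units, and accepts a path so you only see the filesystem that holds it:

```shell
# -h prints sizes in human-readable units (e.g. 2.7T) instead of 1K blocks.
# On the cluster, "df -h /share/WilkeLab" would show just the lab share;
# "." is used here so the command works anywhere.
df -h .
```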

Compressing files

Compress large data files. bzip2 gives the best compression ratio and should be used for very large files (~1G or bigger uncompressed). gzip is more convenient (better integration with other tools, e.g. zless) and should be used for smaller files. You can also use zip/unzip as a replacement for gzip/bzip2 + tar. Don't use proprietary software to compress files.

As always, you can use the unix command man to find out how to use any of these programs. For example, enter man gzip to learn how to use gzip. A few common use cases follow below.

Compress file data.txt using bzip2:

> bzip2 data.txt

Uncompress the resulting file data.txt.bz2:

> bunzip2 data.txt.bz2

Create compressed tar archive (using bzip2) from directory data. The resulting file will be called data.tbz2:

> tar cvfj data.tbz2 data

Extract data from compressed tar archive (compressed using bzip2):

> tar xvfj data.tbz2

List contents of a tar file (compressed using bzip2) without actually extracting the files:

> tar tvfj data.tbz2

To use gzip instead of bzip2 with tar, replace the j with a z in the above tar commands. For example, to list the contents of a tar file compressed using gzip, you would enter:

> tar tvfz data.tgz