Wilke:Using lab cluster: Difference between revisions

Revision as of 15:32, 2 April 2013

Notice: The Wilke Lab page has moved to http://wilkelab.org.
The page you are looking at is kept for archival purposes and will not be further updated.

Using the lab cluster

General

The lab cluster is phylocluster.ccbb.utexas.edu. You will need an account on the cluster to be able to use it.

The operating system on the cluster is Linux. If you have no experience with the Linux shell and command line, you will have to learn about that first. This is a good tutorial for beginners: [1]

The cluster uses the Sun Grid Engine (SGE) to distribute computing tasks over the various compute nodes that are available. You will have to start all jobs that you want to run using SGE. This is different from how you run jobs on your computer at home, so even if you have plenty of experience with Linux on your own computer you may not know your way around SGE.

A brief introduction to using SGE is given here: [2]. There's a lot of material in this document. The most important part is that every job on the cluster should use the provided bare-bones script as a starting point: [3]. If you have many independent runs that can run in parallel and don't need to talk to each other, you should also read this page: [4].

Finally, it is good practice to copy your jobs to /state/partition1 on the compute node before running them. /state/partition1 refers to the local hard drive on the compute node. We will soon provide a tutorial on how to do this exactly.
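
The staging idea described above can be sketched roughly as follows. This is a minimal sketch and not the lab's official procedure: the function name run_staged, the per-user and per-job directory layout, and the LOCAL_ROOT variable are assumptions made for illustration. Only the path /state/partition1 comes from the text above. JOB_ID is the variable SGE sets inside a running job; outside SGE the sketch falls back to a name based on the process ID.

```shell
#!/bin/bash
# Sketch: stage a job directory onto the node-local disk, run there, copy results back.
# LOCAL_ROOT defaults to the local drive named in the text; the function name and
# directory layout are hypothetical.
LOCAL_ROOT="${LOCAL_ROOT:-/state/partition1}"

run_staged() {
    # usage: run_staged <job_dir> <command> [args...]
    local job_dir="$1"; shift
    local scratch="$LOCAL_ROOT/${USER:-$(id -un)}/${JOB_ID:-manual_$$}"

    mkdir -p "$scratch" || return 1
    cp -r "$job_dir"/. "$scratch"/ || return 1   # copy inputs to the local disk

    # run the command inside the local copy
    ( cd "$scratch" && "$@" )
    local status=$?

    cp -r "$scratch"/. "$job_dir"/               # copy results back to shared storage
    rm -rf "$scratch"                            # free the local disk for other jobs
    return $status
}

# Example, inside an SGE job script (program name is a placeholder):
#   run_staged "$PWD" ./my_simulation config.txt
```

Copying results back before deleting the local directory matters, because files left on /state/partition1 are only visible on that one compute node.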

Storage

Storage locations

There are several locations for data storage. Which location you should use depends on the nature and size of data you have to store.

  • Your home directory (/home/<your user name>). Your home directory is where you land by default after logging in to the cluster. Home directories are backed up, and they are generally a good place to store important files and data. However, home directories should not be used for large amounts of data (anything over 10 GB counts as large). Also, most data (and code) related to research projects should be stored elsewhere (/share/WilkeLab/work or /share/WilkeLab/archive, see below).
  • /share/WilkeLab/work. This directory is the right place to store large amounts of data that you are currently working with. It is not backed up, but everything is stored on a RAID array, which protects the data against basic hardware malfunction. We currently have approximately 5.5 TB of storage available on /share/WilkeLab/work.

    Remember that /share/WilkeLab/work is used by all lab members, so we need to keep data well organized and be able to figure out who stored what where. Therefore, you should create a subdirectory named after your user name and store all your data in that subdirectory. For example, if your user name is ab123, create a directory /share/WilkeLab/work/ab123 and place all your data and code in there.

  • /share/WilkeLab/archive. This directory functions just like /share/WilkeLab/work, except that it is backed up in addition to being on a RAID array. This means that the storage is protected not only against basic hardware malfunction but also against catastrophic failure, such as a fire in the server room. We currently have about 2.8 TB of storage in /share/WilkeLab/archive.

    Any important research work should be stored in /share/WilkeLab/archive. This covers both data and code for completed projects and important data and code for ongoing projects. As a simple rule of thumb, important work products should be stored in /share/WilkeLab/archive, whereas temporary data sets or data that can easily be replicated should be stored in /share/WilkeLab/work. For example, let's assume you want to analyze five mammalian genomes for certain sequence similarities. You should store the genomes in /share/WilkeLab/work, since they can be re-downloaded easily. You should store the analysis code and the resulting data in /share/WilkeLab/archive, since they are important work products.

    An exception to this rule is large output files from simulations. If you have written (or are using) a simulation that produces 500 GB of data in 24 hours, we cannot archive all this output in the long run. In this case, make sure you store the configuration files and any other information needed to reproduce the simulations, as well as any smaller files you obtain from analyzing the simulation output.

    As with /share/WilkeLab/work, you should create a subdirectory named after your user name and store all your data in that subdirectory.

  • /share/WilkeLab/scratch. This directory is similar to /share/WilkeLab/work. We currently have 2 TB of space here. /share/WilkeLab/scratch is low-priority storage. Data in this location could be deleted without much warning, although we will generally try not to delete any data without first talking to you. If you just need some temporary storage for a few days, /share/WilkeLab/scratch is a good choice.

    As with /share/WilkeLab/work, you should create a subdirectory named after your user name and store all your data in that subdirectory.

  • /share/WilkeLab/archive/published_papers and /share/WilkeLab/archive/submitted_papers. We want to archive all data and code necessary to reproduce any paper the lab publishes. Therefore, once a paper has been submitted or published, please deposit a zip file (or compressed tar archive) of all the required data and code, with proper documentation, in /share/WilkeLab/archive/submitted_papers (before final acceptance) or /share/WilkeLab/archive/published_papers (after final acceptance). See the file /share/WilkeLab/archive/README.txt for best practices.
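
The directory conventions in the list above can be sketched as two small shell helpers. This is only an illustration under stated assumptions: the function names make_user_dirs and archive_paper and the SHARE_ROOT variable are hypothetical, and the archive file name is simply taken from the project directory's name. The paths under /share/WilkeLab are the ones given in the text.

```shell
#!/bin/bash
# Sketch: per-user subdirectories and a paper archive, following the layout described above.
# SHARE_ROOT and both function names are assumptions for illustration.
SHARE_ROOT="${SHARE_ROOT:-/share/WilkeLab}"

make_user_dirs() {
    # create <area>/<user name> under work, archive and scratch
    local user="${1:-$(id -un)}"
    local area
    for area in work archive scratch; do
        mkdir -p "$SHARE_ROOT/$area/$user" || return 1
    done
}

archive_paper() {
    # usage: archive_paper <project_dir> <submitted_papers|published_papers>
    local project_dir="$1"
    local stage="$2"
    local name
    name="$(basename "$project_dir")"
    # -C keeps paths inside the archive relative to the project's parent directory
    tar -czf "$SHARE_ROOT/archive/$stage/$name.tar.gz" \
        -C "$(dirname "$project_dir")" "$name"
}

# Example (project path is a placeholder):
#   make_user_dirs ab123
#   archive_paper /share/WilkeLab/archive/ab123/my_paper submitted_papers
```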

Important: Never store student grades, other FERPA-protected data, social security numbers, etc. on the cluster. These are Category I data (see http://www.utexas.edu/its/glossary/iso#GL_letterC) and require special security protocols that the cluster doesn't satisfy.


Useful unix commands related to storage

Compressing files

Compress large data files. bzip2 generally gives a better compression ratio than gzip and should be used for very large files (about 1 GB or more uncompressed). gzip is more convenient (it integrates better with other tools, e.g. zless) and should be used for smaller files. You can also use zip/unzip as a replacement for gzip or for bzip2 combined with tar. Don't use proprietary software to compress files.
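
The rule of thumb above can be sketched as a small helper that picks the compressor by file size. The function name compress_by_size and the exact cutoff of 1 GiB are assumptions for illustration; the text only says "about 1 GB".

```shell
#!/bin/bash
# Sketch: choose bzip2 for very large files and gzip otherwise.
# THRESHOLD_BYTES (1 GiB by default) and the function name are hypothetical.
THRESHOLD_BYTES="${THRESHOLD_BYTES:-1073741824}"

compress_by_size() {
    local file="$1"
    local size
    size=$(( $(wc -c < "$file") ))   # file size in bytes
    if [ "$size" -ge "$THRESHOLD_BYTES" ]; then
        bzip2 "$file"                # replaces file with file.bz2
    else
        gzip "$file"                 # replaces file with file.gz
    fi
}

# Related commands:
#   zless data.txt.gz                 # page through a gzip file without unpacking it
#   tar -cjf results.tar.bz2 results/ # bundle a directory and compress it with bzip2
#   tar -xjf results.tar.bz2          # unpack it again
```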