Butlin:Unix for Bioinformatics - basic tutorial
Revision as of 13:56, 12 August 2013
Before you jump into this tutorial
- This tutorial was given at a next generation sequencing data analysis workshop in March 2013 at the University of Sheffield.
- Your command line prompt will end with a $ sign. So a $ sign in this tutorial tells you to type the stuff that comes after the $ sign into your command line.
- The words "folder" and "directory" mean the same thing. So I use them interchangeably.
- Linux is an open-source re-implementation of Unix, in the same way as R is a re-implementation of S. Here, when I use the term Unix, I refer to all Unix-like computing environments, i.e. the Unix that comes with Macs as well as most Linux flavours.
- The practical part of the workshop will be done on the computer cluster of the University of Sheffield called Iceberg. It has scientific Linux installed. You will log into your own accounts on Iceberg.
- If you have never used a Unix command line, we suggest you concentrate on the text printed in bold in the first two sections of this module. Make sure you get through them in time so that you are fit to do the rest of the workshop.
- There are still small (hopefully not large) bugs lurking in this protocol. Please help improve it by adding comments.
Iceberg - making first contact
Access with the programme PuTTY
If PuTTY can’t be found on your computer, go to section 1.2 Access via an internet browser.
Choose “iceberg” and press “Open”.
Type in the login name that you have been given, hit Enter, then type in the password and hit Enter.
Access via an internet browser
If PuTTY is not installed, you have to use an internet browser to access Iceberg.
Follow the link “Connect to Iceberg now!” on this site.
Insert the username/login name and password that you have been given.
Under Iceberg Applications select Iceberg terminal.
A new window will pop up.
Iceberg access for Mac and Linux users
Open a terminal, then type at the command line prompt:
ssh -X your_username@iceberg.shef.ac.uk
You will be asked for your iceberg password. The first time you connect, ssh usually issues a warning about accessing an untrusted server. Just confirm that you want to add iceberg to your trusted server connections. The -X switch opens a connection with X11 forwarding. If you don’t intend to open a GUI on iceberg, skip that switch.
Some basics
On the head node called “iceberg1” start a new session on one of the worker nodes by typing:
qrsh
No work should ever be done on the head node ‘iceberg1’ or else others and ultimately you will suffer the consequences :-)
If you’ve logged in via the web browser and used qsh (instead of qrsh) to start a new session, then you could open a programme that uses a GUI, e. g. firefox:
firefox
Where am I in the file system?
pwd
Here is a visual representation of a Unix file system:
taken from the Unix Tutorial for Beginners
Every Unix operating system has a root folder simply called /. Let’s see what’s in it:
ls /
List the files in the current directory, i. e. your home directory:
ls
Your home directory is still empty, or is it?
ls -a
The -a switch makes ls show hidden files, which start with a dot in their file name. Let’s create a new directory and a new empty file.
mkdir NGS_workshop
ls
cd NGS_workshop
ls
touch test
ls -l
The list output of ls prints out a lot of information about each file and directory.
drwxr-xr-x  4  cliff  user    1024  Jun 18  09:40  directory_name
-rw-r--r--  1  cliff  user  767392  Jun  6  14:28  file_name
^ ^  ^  ^   ^  ^      ^     ^       ^       ^      ^
| |  |  |   |  |      |     |       |       |      |
| |  |  |   |  owner  group size    date    time   name
| |  |  |   number of links to file or directory contents
| |  |  permissions for world
| |  permissions for members of group
| permissions for owner of file: r = read, w = write, x = execute, - = no permission
type of file: - = normal file, d = directory, l = symbolic link, and others...
$ ls -lF
Note the forward slash at the end of file names, when you use the -F option. This indicates a directory.
$ ls -lh
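If you would like to try the ls switches without cluttering your home directory, here is a minimal sketch that works in a throwaway folder. The names sandbox, data, notes.txt and .hidden_file are made up for this illustration, and -A is a close relative of -a that simply leaves out the entries . and ..

```shell
#!/bin/bash
# Work in a temporary directory so that nothing in your home directory is touched.
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1

mkdir data            # a directory
touch notes.txt       # a normal file
touch .hidden_file    # a hidden file: its name starts with a dot

plain=$(ls)           # hidden files are left out
all=$(ls -A)          # like -a, but without the entries . and ..
flagged=$(ls -F)      # -F appends a / to directory names

# The first character of a long listing is the file type.
type_of_data=$(ls -ld data | cut -c1)
type_of_notes=$(ls -l notes.txt | cut -c1)

echo "ls:";    echo "$plain"
echo "ls -A:"; echo "$all"
echo "ls -F:"; echo "$flagged"
echo "data is of type '$type_of_data', notes.txt is of type '$type_of_notes'"

# Tidy up.
cd / && rm -rf "$sandbox"
```

The last line of output should report the type d for the directory and - for the normal file, matching the first column of the diagram above.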
How can I look up the manual for the ls command and most other Unix commands?
$ man ls
Keyboard | What it does |
---|---|
f or Space | one screen size down |
b | one screen size up |
d | half a screen size down |
u | half a screen size up |
G | jump to end of file |
g | jump to beginning of file |
h | get help |
q | exit help |
q | exit less |
Save yourself some typing:
$ alias ll='ls -lFh'
$ man alias
However, this neat little shortcut is only active in the current terminal window. In order to create this alias each time you log in to iceberg, add the above alias command line to your .bash_profile file:
$ nano .bash_profile
The dot at the beginning is part of the file name, so don’t forget it. Files whose names start with a dot are hidden files. At the bottom of the file add:
alias ll='ls -lFh'
hit Ctrl+O and Enter on your keyboard to save changes, then Ctrl+X to exit nano.
$ source .bash_profile
$ ll
What is my file storage quota on iceberg?
$ quota
Exit iceberg by typing:
$ exit
… to quit the interactive session and get back to the head node and:
$ logout
… to log off the cluster.
Gearing up for work with files and directories
Log back into iceberg and start an interactive session with qsh. Create a new directory in the directory NGS_workshop:
$ cd ~/NGS_workshop
Note this is equivalent to:
$ cd /home/your_username/NGS_workshop
$ mkdir output
$ ll
$ ll output
Change into the new directory:
$ cd output
Note how your command line prompt has changed.
$ ll
Create six new empty files:
$ touch test test1 test12 test123 Test This_is_a_really_long_file_name_isnt_it
$ ll
Note: Spaces are important for Unix to parse the command line (but there is no difference between one and many spaces). So replace them with underscores in your file names. Generally, you can safely use the characters [a-zA-Z0-9._] in your file names.
Bash (short for Bourne Again Shell), the programme that provides the command line interface to Unix that you are currently using, comes with so-called wildcards:
$ ll test*
Two things to note here: 1) The asterisk stands for anything, including nothing and 2) Unix is case sensitive.
$ ll test?
$ ll *_*
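To see exactly what bash turns each wildcard into, you can let echo print the expanded file names. This is only a sketch in a temporary folder (the variable names are made up), and it uses echo instead of ll so that it does not depend on the alias being defined:

```shell
#!/bin/bash
# Recreate the six files from above in a temporary directory.
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
touch test test1 test12 test123 Test This_is_a_really_long_file_name_isnt_it

star=$(echo test*)        # asterisk: any number of characters, including none
question=$(echo test?)    # question mark: exactly one character
underscore=$(echo *_*)    # every name that contains an underscore

echo "test* -> $star"
echo "test? -> $question"
echo "*_*   -> $underscore"

# Tidy up.
cd / && rm -rf "$sandbox"
```

Note that Test never shows up for test*, because Unix is case sensitive, and that test? matches only test1.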
Copy, move and remove files:
$ cp test* ..
$ ll ..
“..” stands for the parent directory.
$ ll ../../../..
$ cd ..
$ ll
$ rm test
Note, the rm command deletes the file (and with the -r switch also directories). It doesn’t move them into a “trash can”, in case you have second thoughts. It also, by default, doesn’t ask you for confirmation.
$ cp output/test* .
Note the dot at the end of the last command line. It’s short for the current working directory or “here”.
$ ll
$ cd output
$ echo haha
echo is bash’s print command.
$ echo haha > test1
$ echo hihi > test2
The redirection operator > redirects the output of the echo command into the file test1. Otherwise, echo prints to STDOUT, which is the terminal screen.
$ cat test1 test2
$ echo hohoho >> test1
$ cat test1
The >> operator appends the output of echo to the end of the file test1.
$ echo "hoohooo" > test1
$ cat test1
Redirecting the output of a command into an existing file overwrites it without notice. Remember this and be careful!
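The difference between > and >> can be replayed safely in a temporary folder, as in the sketch below. The last part goes one step beyond this tutorial: bash has an option called noclobber that makes > refuse to overwrite an existing file. Treat it as an optional safety net, not as something the workshop relies on.

```shell
#!/bin/bash
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1

echo haha > test1          # > creates test1
echo hohoho >> test1       # >> appends a second line
appended=$(cat test1)

echo hoohooo > test1       # > again: the old content is gone without warning
overwritten=$(cat test1)

# Optional addition: with noclobber set, > refuses to overwrite existing files.
set -o noclobber
if { echo oops > test1; } 2>/dev/null; then
    protected=no
else
    protected=yes
fi
after_noclobber=$(cat test1)
set +o noclobber           # back to the default behaviour

echo "after appending:   $appended"
echo "after overwriting: $overwritten"
echo "noclobber protected the file: $protected"

# Tidy up.
cd / && rm -rf "$sandbox"
```

While noclobber is set, you can still overwrite a file on purpose with the >| operator.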
$ cp ../test12 test1
We are copying the file test12 from the parent directory into the current directory and saving it as test1. Let’s see what’s in test1 now.
$ cat test1
?!?!?!?!?!? The cp command has overwritten the test1 file in the current directory with the content of the file test12 from the parent directory. This file was empty.
$ cat test2
$ mv ../test123 test2
$ cat test2
The mv command (which does cut and paste) has just done the same as the cp command. It has silently overwritten the file test2 in the current directory with the empty file test123.
Ok, you should be sufficiently scared by now. Here’s how you make these three commands safer: All three commands have a switch that causes them to prompt the user for confirmation before overwriting an existing file. It’s -i for all three commands. Check with:
$ man rm
$ man cp
$ man mv
We could always type rm -i, cp -i or mv -i , but that’s tedious. Instead we can make this the default behaviour of the three programmes by adding aliases into the .bash_profile file (as before with the alias for the ls command):
$ nano ~/.bash_profile
Then add to the end of the file:
alias rm="rm -i"
alias cp="cp -i"
alias mv="mv -i"
Ctrl+O, Ctrl+X.
$ rm test*
It still removed without asking for confirmation. That is because we have to tell bash about the changes we made to the .bash_profile file for these changes to take effect in the current terminal session:
$ source ~/.bash_profile
$ touch test1 test2 test3
$ rm test?
Each time you start a new terminal session, bash will read your .bash_profile file. From now on, for every file you want to remove, the rm command will ask you for confirmation. Now, if that becomes too tedious, use the -f switch:
$ rm -f *
which in this case will remove all files (but not folders) in the current directory without prompting for confirmation.
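Here is a small sketch of how -i and -f interact, run in a temporary folder. It spells out the switches instead of relying on the aliases, because aliases are normally not expanded inside scripts, and it feeds the answer "n" to rm through a pipe so that it can run without anyone typing.

```shell
#!/bin/bash
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1
touch test1 test2

# With -i, rm asks before removing. Here the answer "n" arrives via the pipe.
echo n | rm -i test1
if [ -e test1 ]; then kept=yes; else kept=no; fi

# A -f that comes after -i wins: no question is asked, the file is removed.
rm -i -f test2
if [ -e test2 ]; then removed=no; else removed=yes; fi

echo
echo "test1 kept after answering n:    $kept"
echo "test2 removed without a prompt:  $removed"

# Tidy up.
cd / && rm -rf "$sandbox"
```

This is why rm -f * still works after you have aliased rm to rm -i: bash expands the command to rm -i -f *, and the later switch takes precedence.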
Simple file renaming and TAB-completion
For the long file name, type “Th + TAB” (TAB - the button left of Q on your keyboard), bash should complete the rest.
$ mv This_is_a_really_long_file_name_isnt_it shorter_file_name.txt
$ ll
$ man mv
You see that the mv command can also be used to rename single files. Later we will see how to rename lots of files with one command line.
Accessing the command history
If you don’t want to keep retyping things you have already entered before:
Up-arrow | scroll through previous commands |
View your command line history:
$ history | less
The command line history can get long and would flood your screen with hundreds of lines of output. So we pipe the output of the history command into the text viewer less. With the | operator you can pipe the output of one command into another command and thus glue many commands together into a pipeline. You’ll later see what a powerful feature that is.
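As a small taste of pipelines, the sketch below glues five commands together to find the most frequently used command in a list. The list of commands is made up and merely stands in for the output of history, which is empty inside a script.

```shell
#!/bin/bash
# A made-up list of commands, one per line, standing in for your history.
commands=$(printf 'ls\ncd\nls\npwd\nls\ncd\n')

# sort      : put identical lines next to each other
# uniq -c   : collapse identical lines and count them
# sort -rn  : sort by that count, largest first
# head -n 1 : keep only the first line
# awk       : print the second column, i.e. the command without its count
most_used=$(echo "$commands" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')

echo "most frequently used command: $most_used"
```

Each programme in the pipeline only does one simple job. The power comes from combining them.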
Command line shortcuts
Keyboard | what it does |
---|---|
Ctrl + a | jump to beginning of command line |
Ctrl + e | jump to end of command line |
Ctrl + w | delete word left of cursor |
Ctrl + u | delete everything left of cursor |
Ctrl + r | search for a command line in your history |
$ sleep 60
Ctrl + c | stop the foreground programme and get the command line prompt back |
Tidying up
$ cd ..
$ rm -R output
$ rm test*
Let’s copy the example data files for the rest of this Unix module into your newly created directory “NGS_workshop”. When you type the following command line, try using TAB-completion to save typing. You don’t have to break the command line onto two lines, but if you do, you need to escape the return character with a backslash, i. e. type a backslash, then hit return:
$ cp -Rv /usr/local/extras/Genomics/Example_data/Unix_module/ \
~/NGS_workshop
$ ll ~/NGS_workshop
$ ll ~/NGS_workshop/Unix_module
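If you are not on Iceberg and the example data folder does not exist on your machine, you can still practise the backslash line break with the following sketch. All paths live in a temporary folder and are made up for this illustration.

```shell
#!/bin/bash
sandbox=$(mktemp -d)
mkdir -p "$sandbox/source/Unix_module" "$sandbox/NGS_workshop"
touch "$sandbox/source/Unix_module/example.txt"

# One command written over two lines. The backslash has to be the very last
# character on the first line, with nothing (not even a space) after it.
cp -R "$sandbox/source/Unix_module" \
      "$sandbox/NGS_workshop"

copied=$(ls "$sandbox/NGS_workshop/Unix_module")
echo "copied: $copied"

# Tidy up.
rm -rf "$sandbox"
```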
Gearing up for work with data and programmes
How do I do a local install of a suite of programmes that come as C++ source code?
A local install of a programme does not require administrator privileges ($ sudo su) and installs programmes, libraries and documentation in your home folder or any other folder you have write privileges for. In your home directory type:
$ ll -a
The first folder at the top stands for the directory you are currently in, "./" . At the left you can see that you have read and write access for this directory. For more on permissions, see here.
$ mkdir prog src
Let’s install samtools. In order to open a GUI on Iceberg, X11 forwarding must be enabled and working (unfortunately it doesn’t work with PuTTY). Log in via the web browser as described above and on the iceberg head node (iceberg1), type:
$ qsh
instead of qrsh. Then firefox will open remotely.
$ firefox
Google for Samtools and download the latest source code into your new “src” directory.
$ cd src
$ tar -jxf samtools-0.1.18.tar.bz2
$ ll
$ cd samtools-0.1.18
$ less INSTALL
$ ll
$ make
Copy these three executables into the folder “prog” in your home directory.
$ cp samtools bcftools/bcftools bcftools/vcfutils.pl ~/prog
Unix automatically expands the tilde to the path of your home directory.
$ mkdir -p ~/man/man1
$ cp samtools.1 ~/man/man1
$ ll ~/prog
$ ll ~/man/man1
$ pwd
$ samtools
Unix can’t find a programme called “samtools”, but where does it actually look for programmes? Here:
$ echo $PATH
The PATH is a so-called environment variable of bash. To see all current environment variables and what they contain type:
$ env
$ ./samtools
aaah, why does this work? Because you have specified the path to the executable. "./" stands for the current directory, remember? You could also have typed:
$ /home/your_username/src/samtools-0.1.18/samtools
If you want to execute samtools from any directory without having to specify the whole path to its location in the file system, then simply add the folder where you store your executables to the beginning of the PATH environment variable:
$ export PATH=~/prog:$PATH
$ echo $PATH
Note, in case you were wondering about the $ that appears here in front of PATH. It’s part of the bash syntax and simply means “give me the contents of that variable”. Now type,
$ samtools
$ which samtools
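The same mechanism can be watched in isolation with a toy programme. In the sketch below, hello_tool is a made-up script that stands in for samtools, and the prog folder lives in a temporary directory rather than in your home directory. It also uses chmod +x to make the script executable and command -v, which is bash’s built-in relative of which.

```shell
#!/bin/bash
sandbox=$(mktemp -d)
mkdir "$sandbox/prog"

# Write a tiny executable script into the prog folder.
cat > "$sandbox/prog/hello_tool" <<'EOF'
#!/bin/bash
echo "hello from hello_tool"
EOF
chmod +x "$sandbox/prog/hello_tool"

# Before: bash searches the folders listed in PATH and finds nothing.
before=$(command -v hello_tool || echo "not found")

# Put the prog folder at the beginning of PATH.
export PATH="$sandbox/prog:$PATH"

# After: bash finds the programme and can run it from any directory.
after=$(command -v hello_tool)
output=$(hello_tool)

echo "before: $before"
echo "after:  $after"
echo "output: $output"

# Tidy up.
rm -rf "$sandbox"
```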
Change to any other directory and call samtools. It should still work. Now, if you log out of the interactive session and back into another with qsh, you’ll see that your changes to the PATH variable have been lost. To make them permanent, add them to your ~/.bash_profile file as you did earlier for the aliases.
$ nano ~/.bash_profile
At the end of the file, enter the same command line you used before at the prompt:
export PATH=~/prog:$PATH
Ctrl + o, Ctrl + x .
$ source ~/.bash_profile
$ echo $PATH
$ samtools
Since your personal .bash_profile file is read every time you open a terminal session, your custom addition to the PATH will be applied each time. Do the analogous thing for the MANPATH variable. Note that MANPATH has to point to the folder that contains the man1 subfolder, not to man1 itself, i.e.:
export MANPATH=~/man:$MANPATH
$ man samtools
My NGS library has finally been sequenced and my sequencing centre has informed me that the sequence data files are on one of their password protected servers ready for download. How do I get those files, each many GB in size, into my iceberg account (or any other Unix system)?
The file storage limit in your home directory on iceberg is only 5 GB, but you have 50 GB available under “/data/your_username”. You can find more about file storage on Iceberg here.
$ quota
$ cd /data/your_username
$ mkdir raw_data
On your local computer follow the link to the server with the sequence data and log in. Export the cookie for this site (in Firefox you’ll need to install the add-on “Cookie-Exporter”). Use a programme like WinSCP for Windows or Cyberduck for Mac to upload that cookie file from your local computer into your Iceberg account (see here for more info on that). On Mac and Linux you can also do this with command line tools like scp or rsync, e. g.:
$ rsync -av -e "ssh -l bop08ck" ~/Downloads/MiSeq_cookie.txt iceberg.shef.ac.uk:/data/bop08ck/raw_data
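In this command line, bop08ck is my username, ~/Downloads/MiSeq_cookie.txt is the cookie file, and iceberg.shef.ac.uk:/data/bop08ck/raw_data is the server followed by the path to the data directory.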
Then from your “/data/username/raw_data” directory on iceberg issue the following:
$ wget --load-cookies MiSeq_cookie.txt link_to_sequences.fastq.gz
Downloading, say, 100 GB of sequence data can take several hours, and during all that time wget prints a progress report to the terminal. So you wouldn’t be able to log off or do other work in the current interactive session. So stop the download with "Ctrl+c". You could then issue:
$ wget -c --load-cookies MiSeq_cookie.txt link_to_sequences.fastq.gz 2>/dev/null &
The -c switch in the wget command line continues the download from where you stopped it. The redirection 2>/dev/null discards the progress report, and the ampersand & at the end sends the process into the background and gives you the command prompt back. But you can’t log off iceberg until the job is finished (this is iceberg-specific; on a normal Linux compute server you could indeed log out without stopping the process).
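The two pieces at the end of that command line, 2>/dev/null and &, can be tried out without downloading anything. In the sketch below, fake_download is a made-up function that stands in for wget: it writes a progress message to the error stream, takes a second to finish and then writes a result file.

```shell
#!/bin/bash
sandbox=$(mktemp -d)

# A stand-in for a long download (hypothetical, for illustration only).
fake_download() {
    echo "progress: 50%" >&2                        # progress goes to the error stream
    sleep 1                                         # pretend this takes a while
    echo "sequence data" > "$sandbox/result.txt"    # the downloaded file
}

# 2>/dev/null throws the progress report away, & sends the job into the background.
fake_download 2>/dev/null &
pid=$!    # $! holds the process ID of the most recent background job

echo "The prompt is free again while job $pid runs in the background."

# wait pauses until the background job has finished and returns its exit status.
wait "$pid"
exit_status=$?
result=$(cat "$sandbox/result.txt")

echo "exit status: $exit_status"
echo "result file contains: $result"

# Tidy up.
rm -rf "$sandbox"
```

At an interactive prompt you can list your background jobs with the jobs command.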
Any long lasting or compute intensive jobs on iceberg have to be submitted with the qsub command, as we do now:
$ nano ~/NGS_workshop/Unix_module/TASK_2/download.sh
Substitute my email address and username with yours of course:
#!/bin/bash
#$ -l h_rt=00:05:00
#$ -l mem=500M
#$ -m be
#$ -M c.kerth@sheffield.ac.uk
# change to the directory where the data should be stored
cd /data/bop08ck/raw_data
# wget command line
wget -c --http-user NGS --http-password regex \
huluvu.shef.ac.uk/NGS_workshop/sequences.fastq.gz
The backslash in the wget command line escapes the (invisible) return character, which would otherwise be interpreted by bash as the end of the command line. Be sure that there is only a return character following the backslash.
Then submit this job submission script to the SGE scheduler with the qsub command:
$ qsub ~/NGS_workshop/Unix_module/TASK_2/download.sh
$ Qstat
You could now log off iceberg. Your job will be run on the next node available. Check your emails. Further details on job submission on iceberg can be found here.
Once downloaded have a look at the file:
$ zless raw_data/sequence_reads_01.fastq.gz
This is a version of less that can take a gzipped file as input. There is also zcat.
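Because zcat prints the unzipped content to STDOUT, it can be piped into other commands without ever creating an unzipped copy on disk. Here is a minimal sketch with a made-up one-read fastq file (on some systems, e.g. Macs, gzip -dc is the safer spelling of zcat):

```shell
#!/bin/bash
sandbox=$(mktemp -d)

# A made-up fastq file with a single read (four lines per read).
printf '@read1\nACGT\n+\nIIII\n' > "$sandbox/reads.fastq"
gzip "$sandbox/reads.fastq"    # replaces reads.fastq with reads.fastq.gz

# Look at the first line without unzipping the file on disk.
first_line=$(zcat "$sandbox/reads.fastq.gz" | head -n 1)

# Count the reads: number of lines divided by four.
n_lines=$(zcat "$sandbox/reads.fastq.gz" | wc -l)
n_reads=$(( n_lines / 4 ))

echo "first line: $first_line"
echo "number of reads: $n_reads"

# Tidy up.
rm -rf "$sandbox"
```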
How can I find files and directories in Unix?
$ cd ~/NGS_workshop $ ll
Let’s search for a file or directory with the word “stacks” in its name:
$ find . -name "*stacks*"
Here, the find command searches from the current directory recursively in all subdirectories for files and folders which contain “stacks” anywhere in their name.
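To see which names find matches and which it doesn’t, here is a sketch with a handful of made-up files and folders in a temporary directory:

```shell
#!/bin/bash
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1

mkdir -p results/stacks_output
touch results/stacks_output/batch_1.txt results/my_stacks.log README

# -name is compared with the name of each file or folder itself, not with its path.
# That is why batch_1.txt is not reported although it sits inside stacks_output.
hits=$(find . -name "*stacks*" | sort)
n_hits=$(echo "$hits" | wc -l)

echo "$hits"
echo "number of hits: $n_hits"

# Tidy up.
cd / && rm -rf "$sandbox"
```

The quotation marks around the pattern matter: they stop bash from expanding the asterisks itself, so that find gets to see the pattern.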
$ find ~ -name "*stacks*"
This would search your whole home directory. You can do much more with the find command. For instance, take a look here. But what if you have no idea what the file you are looking for is called, but you know that it contains a certain word? In other words, how can you search the contents of files instead of their names? Let’s say you remember that the file you are looking for contains the word “consensus”, but you are not sure about upper and lower case.
$ grep -Ri "consensus" *
$ man grep
We are using grep to search within files. The -R switch tells grep to look recursively in all subdirectories. The -i switch tells it to ignore the difference between upper and lower case. The search pattern is provided in quotation marks and the asterisk at the end is a wildcard that is expanded by bash into all files and folders in the current directory. So grep will search in all files in the current directory and also in all files located in all subdirectories (and subsubdirectories for that matter).
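The following sketch replays such a search on three made-up files in a temporary directory. It adds one switch that is not used above: -l makes grep print only the names of the files with a match instead of the matching lines.

```shell
#!/bin/bash
sandbox=$(mktemp -d)
cd "$sandbox" || exit 1

mkdir -p sub/subsub
echo "The Consensus sequence" > sub/subsub/a.txt   # match, two folders down
echo "nothing to see here"    > b.txt              # no match
echo "CONSENSUS"              > c.txt              # match, all upper case

# -R: descend into subdirectories, -i: ignore case, -l: print file names only
matches=$(grep -Ril "consensus" * | sort)

echo "$matches"

# Tidy up.
cd / && rm -rf "$sandbox"
```

Both c.txt and sub/subsub/a.txt should be listed, while b.txt should not.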
Recommended for further self-study
Unix & Perl Primer for Biologists