Butlin:Unix for Bioinformatics - advanced tutorial: Difference between revisions

Revision as of 07:52, 9 August 2013

Overview

This session will be more challenging if you are still new to Unix. However, it’s main aim is to demonstrate that Unix is much more than an environment in which you can execute big programmes like `bwa` or `samtools`. Rather, with it’s suite of text searching, extraction and manipulation programmes, it is itself a powerful tool for the many small bioinformatic tasks that you will encounter during everyday work. This module expects that you have done the steps in the Basic Unix module, i. e. it expects that you have already created certain folders and have copied the example data into your account. There are still small (hopefully not large) bugs lurking in this protocol. Please help improve it by adding comments.

Before you start

Log into your iceberg account and change to the directory NGS_workshop, that you created in the Basic Unix module:

cd NGS_workshop/Unix_module

I have downloaded my illumina read data. Now I want to know how many reads my sequence run has yielded.

zless sequence.fastq.gz
zcat sequence.fastq.gz | wc -l
man wc

gives you the number of lines in your sequence file, which is the number of reads times four for fastq format. Note, you can usually avoid uncompressing data files, which saves disk space.

I have a .fastq file with raw sequences from a RAD library. How can I find out how many of the reads contain the correct sequence of the remainder of the restriction site?

Let's assume you had 5bp long barcode sequences incorporated into the single-end adapters, which should show up at the beginning of each sequence read. Let's also assume that you have used the restriction enzyme SbfI for the creation of the library, which has the following recognition sequence: CCTGCAGG . So the correct remainder of the restriction site that you expect to see after the 5bp barcode is "TGCAGG". First have a look at your fastq file again:

zless -N sequence.fastq.gz

Each sequence record contains four lines. The actual sequences are on the 2nd, 6th, 10th, 14th, 18th line and so on. The following can give you the count:

zcat sequences.fastq.gz | perl -ne 'print unless ($.-2)%4;'  | grep -c "^.....TGCAGG"

This is a pipeline which first uncompresses your sequence read file and pipes it into the perl command, which extracts only the DNA sequence part of each fastq record. “$.” in perl stands for the current line number, “%4” returns modulo 4 of the current line number minus 2. If that results in a floating point number (non-interger), then the line is not printed out. Get it?

zcat sequences.fastq.gz | perl -ne'print unless ($.-2)%4;' | less

Grep searches each line from the output of the perl command for the regular expression given in quotation marks. ^ stands for the beginning of a line. A dot stands for any single character. There are five dots, because the barcodes are all 5 base pairs long. The -c switch makes grep return the number of lines in which it has found the search pattern at least once.

man grep
man perlrun

zcat sequences.fastq.gz | perl -ne'print unless ($.-2)%4;' | grep  "^.....TGCAGG" | less

In less, type /^.....TGCAGG then hit enter.

Notes

Please feel free to post comments, questions, or improvements to this protocol. Happy to have your input!

List troubleshooting tips here.
You can also link to FAQs/tips provided by other sources such as the manufacturer or other websites.
Anecdotal observations that might be of use to others can also be posted here.

Please sign your name to your note by adding '''*~~~~''': to the beginning of your tip.

References

Relevant papers and books

Goldbeter, A and Koshland, DE (1981) - Proc Natl Acad Sci U S A 78(11) 6840-4 PMID 6947258
Jacob, F and Monod, J J (1961) - Mol Biol 3(3) 318-56 PMID 13718526
Ptashne, M (2004) Genetic Switch: Phage Lambda Revisited - Cold Spring Harbor Laboratory Press ISBN 0879697164

Contact

Who has experience with this protocol?

or instead, discuss this protocol.

Butlin:Unix for Bioinformatics - advanced tutorial: Difference between revisions

Revision as of 07:52, 9 August 2013

Contents

Overview

Before you start

I have downloaded my illumina read data. Now I want to know how many reads my sequence run has yielded.

I have a .fastq file with raw sequences from a RAD library. How can I find out how many of the reads contain the correct sequence of the remainder of the restriction site?

Notes

References

Contact

Navigation menu

Page actions

Page actions

Personal tools

Navigation

Search

research

Tools

@@ Line 35: / Line 35: @@
 Grep searches each line from the output of the perl command for the regular expression given in quotation marks. ^ stands for the beginning of a line. A dot stands for any single character. There are five dots, because the barcodes are all 5 base pairs long. The -c switch makes grep return the number of lines in which it has found the search pattern at least once.
+ man grep
+ man perlrun
+ zcat sequences.fastq.gz | perl -ne'print unless ($.-2)%4;' | grep  "^.....TGCAGG" | less
+In less, type /^.....TGCAGG then hit enter.
 ==Notes==