Wikiomics:Repeat finding
Revision as of 13:44, 24 March 2010
To keep things simple, this page assumes that repeats are being searched for in eukaryotic genomic DNA.
Repeat finding can be divided into two tasks, depending on whether a repeat library is available:
A) a library exists for the given (or a closely related) species
or
B) you construct such a library de novo.
Task A is usually a prerequisite step for genome annotation and even for BLAST searches. For newly sequenced genomes one should start with B (constructing a species-specific repeat library).
Detecting known repeats
Most commonly used: RepeatMasker
RepeatMasker
- web site: http://www.repeatmasker.org/
- current version (checked on 2010-03-22): 3.2.8
- documentation: http://www.repeatmasker.org/webrepeatmaskerhelp.html
- Online web server [1]
- command line
You need a FASTA file (a multi-FASTA file also works). Type:
repmask your_sequence_in_fasta_format
You will get a file named your_sequence_in_fasta_format.masked, which contains the masked sequence.
species options (choose only one):
-m(us) masks rodent specific and mammalian wide repeats
-rod(ent) same as -mus
-mam(mal) masks repeats found in non-primate, non-rodent mammals
-ar(abidopsis) masks repeats found in Arabidopsis
-dr(osophila) masks repeats found in Drosophilas
-el(egans) masks repeats found in C. elegans
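A quick sanity check after masking is to see how much of the sequence was actually masked. The sketch below assumes the default behaviour in which masked bases are replaced by N (X is counted as well, for runs that mask with X). The function name masked_fraction and the toy input are illustrative only. Any N already present in the input (e.g. scaffold gaps) is counted too, so treat the result as an upper bound.

```python
from io import StringIO


def masked_fraction(fasta_handle):
    """Return the fraction of sequence characters that are N or X."""
    total = 0
    masked = 0
    for line in fasta_handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        total += len(line)
        masked += sum(1 for base in line.upper() if base in "NX")
    return masked / total if total else 0.0


# Toy stand-in for the contents of a .masked file (hypothetical data).
example = StringIO(">seq1\nACGTNNNNAC\nNNNNNNACGT\n")
print(masked_fraction(example))
```

For a real run, pass an open file handle for your_sequence_in_fasta_format.masked instead of the StringIO object.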
De novo repeat library construction
For program recommendations based on tests, see: Saha et al., Empirical comparison of ab initio repeat finding programs (2008)
For an extensive review listing tens of programs, see: Lerat E., Identifying repeats and transposable elements in sequenced genomes: how to find your way through the dense forest of programs (Nov 2009)
RepeatScout
command line only, requires compilation
Site: http://bix.ucsd.edu/repeatscout/
current version (2010-03): 1.05
Documentation:
- http://bix.ucsd.edu/repeatscout/readme.1.0.5.txt
- PPT presentation of the algorithm: http://bix.ucsd.edu/repeatscout/repeatscout-ismb.ppt
- publication (PDF): De novo identification of repeat families in large genomes (2005)
- prerequisites
- Perl
- Tandem Repeats Finder (trf) (accessed 2010-03-22), last version: 4.04
- nseg
Simplest run:
- build frequency table
build_lmer_table -sequence input_genome_sequence.fas -freq output_lmer.frequency
The output_lmer.frequency file can still be quite large (1.7Gb for a 900Mb fasta file)
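To illustrate what an l-mer frequency table is, and why it grows with genome size, here is a minimal Python sketch that counts every overlapping substring of length l in a toy sequence. It is not the build_lmer_table implementation: the function name lmer_table and the choice l=2 are arbitrary, and details such as strand handling and the on-disk format of the real tool are not reproduced.

```python
from collections import Counter


def lmer_table(sequence, l):
    """Count every overlapping substring of length l in sequence."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + l] for i in range(len(sequence) - l + 1))


# Toy sequence; a real genome yields one entry per distinct l-mer,
# which is why the frequency file can reach gigabytes.
table = lmer_table("ACACACGT", 2)
for lmer, count in table.most_common():
    print(lmer, count)
```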
- create fasta file containing all kinds of repeats
RepeatScout -sequence input_genome_sequence.fas -output output_repeats.fas -freq output_lmer.frequency
Resources:
- RAM usage (RepeatScout): > 17Gb for 800Mb genomic sequence.
- run time: 9.6h on a Xeon E7450 @ 2.40GHz
The output (output_repeats.fas) is a fasta file with headers such as >R=1, >R=232 etc. It also contains trivial simple repeats (CACACA...) and tandem repeats.
- filter out short (<50bp) sequences and remove "anything that is over 50% low-complexity vis a vis TRF or NSEG". This is done with a Perl script.
It requires trf and nseg to be on the PATH, or the environment variables TRF_COMMAND and NSEG_COMMAND to point to their locations.
filter-stage-1.prl output_repeats.fas > output_repeats.fas.filtered_1
This prints a large number of messages.
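The length part of this filter is easy to sketch in Python. The example below only drops sequences shorter than 50 bp. It does not reproduce the low-complexity check, which in the real script relies on TRF and NSEG. The names read_fasta and drop_short and the toy records are illustrative, not part of RepeatScout.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def drop_short(records, min_length=50):
    """Keep only records whose sequence has at least min_length bases."""
    return [(h, s) for h, s in records if len(s) >= min_length]


# Toy library: R=1 is 80 bp long, R=2 is a 6 bp simple repeat.
example = [">R=1", "ACGT" * 20, ">R=2", "CACACA"]
kept = drop_short(read_fasta(example))
print([header for header, _ in kept])
```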
- run RepeatMasker on your genome of interest using filtered RepeatScout library
RepeatMasker input_genome_sequence.fas -lib output_repeats.fas.filtered_1
This is a very long step (36h for an 800Mb draft genome) when run in this default mode. See the discussion page for possible, but so far untested, speedups.
Output used for the next step: input_genome_sequence.fas.out
- filter putative repeats by copy number. By default only sequences occurring > 10 times in the genome are kept
cat output_repeats.fas.filtered_1 | filter-stage-2.prl --cat=input_genome_sequence.fas.out > output_repeats.fas.filtered_2
This step is fast (< 1 min). You can change the threshold with e.g. "--thresh=20" (only repeats occurring 20+ times will be kept)
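The idea behind the copy-number filter can be sketched in Python: count how often each library sequence appears in the RepeatMasker .out table, then keep only the sequences that reach the threshold. This is an illustration, not the filter-stage-2.prl code. It assumes the standard whitespace-separated .out layout with the matching repeat name in the 10th column, and it assumes a "count >= thresh" comparison, which may differ from the real script. The function names and the toy rows are hypothetical.

```python
from collections import Counter


def count_hits(out_lines):
    """Count RepeatMasker annotations per repeat name (10th column)."""
    counts = Counter()
    for line in out_lines:
        fields = line.split()
        # Data rows start with an integer score; header and blank lines do not.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        counts[fields[9]] += 1
    return counts


def keep_frequent(records, counts, thresh=10):
    """Keep (header, sequence) pairs whose repeat name has >= thresh hits."""
    kept = []
    for header, sequence in records:
        # Use the first word of the header as the name, e.g. "R=1".
        name = header.split()[0] if header.split() else ""
        if counts[name] >= thresh:  # assumed comparison, see note above
            kept.append((header, sequence))
    return kept


# Toy .out content: two header lines, a blank line, then four annotations.
out_lines = [
    "   SW  perc perc perc  query      position in query    matching  repeat",
    "score  div. del. ins.  sequence   begin  end  (left)   repeat    class/family",
    "",
    "  463  1.3  0.6  1.7  scaffold1   100   250 (9750) +  R=1  Unspecified    1  150   (0)  1",
    "  455  1.5  0.0  0.0  scaffold1  3000  3150 (6850) +  R=1  Unspecified    1  150   (0)  2",
    "  410  2.0  0.0  0.0  scaffold2   500   650 (1200) C  R=1  Unspecified  (0)  150     1  3",
    "  300  5.0  0.0  0.0  scaffold2   900   980  (870) +  R=2  Unspecified    1   80   (0)  4",
]
records = [("R=1", "ACGTACGT"), ("R=2", "TTGACCA")]
counts = count_hits(out_lines)
print(keep_frequent(records, counts, thresh=2))
```

With thresh=2 only R=1, which has three hits in the toy table, is kept.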
Credits
- Darek Kedra wrote this tutorial
For pages on similar topics visit: Wikiomics@OpenWetWare