User:Lindenb/Notebook/UMR915/20110222
From OpenWetWare
Belgium
re-aligning on hg19 . see Makefile in
/GENOTYPAGE/data/users/lindenb/20101108_belgium/20110222
/usr/local/package/mosaik-aligner/bin/MosaikBuild -fr /GENOTYPAGE/data/pubdb/ucsc/hg19/chromosomes/hg19.fa -oa _ignore.backup/hg19/reference.dat ------------------------------------------------------------------------------ MosaikBuild 1.1.0018 2010-10-29 Michael Stromberg & Wan-Ping Lee Marth Lab, Boston College Biology Department ------------------------------------------------------------------------------ - converting /GENOTYPAGE/data/pubdb/ucsc/hg19/chromosomes/hg19.fa to a reference sequence archive. - parsing reference sequences: ref seqs: 25 (0.2884 ref seqs/s) - writing reference sequences: 100%[===========================================================================================================] 0.8177 ref seqs/s in 30 s - calculating MD5 checksums: 100%[===========================================================================================================] 1.43 ref seqs/s in 17 s - writing reference sequence index: 100%[===========================================================================================================] 25.0 ref seqs/s in 1 s - creating concatenated reference sequence: 100%[===========================================================================================================] 7.13 ref seqs/s in 3 s - writing concatenated reference sequence... finished. - creating concatenated 2-bit reference sequence... finished. - writing concatenated 2-bit reference sequence... finished. - writing masking vector... finished. MosaikBuild CPU time: 187.420 s, wall time: 200.587 s /usr/local/package/mosaik-aligner/bin/MosaikJump -ia _ignore.backup/hg19/reference.dat -hs 15 -out _ignore.backup/hg19/reference_hs15 ------------------------------------------------------------------------------ MosaikJump 1.1.0018 2010-10-29 Michael Stromberg Marth Lab, Boston College Biology Department ------------------------------------------------------------------------------ - retrieving reference sequence... finished. - hashing reference sequence: 100%[=============================================================================================================] 2,301,657 hashes/s in 22:24 - serializing final sorting vector... finished. - writing jump positions database: 100%[=====================================================================================================] 1,423,918 hash positions/s in 33:29 - serializing jump keys database (17 blocks): blocks: 17 (0.9169 blocks/s) MosaikJump CPU time: 3327.950 s, wall time: 3405.920 s
mean length for sample 1 is 385.282
gunzip -c 1.TCA.454Reads.fna.gz | egrep ">" | tr " " "\n" | grep length | cut -d '=' -f 2 | awk '{n+=1.0; total+=int($1)*1.0;} END { print total/n;}'
using
-mmp 0.05 -act 26 -bw 51 -mhp 100
for the Align param ( http://code.google.com/p/mosaik-aligner/wiki/ParameterSettings )
Align
/usr/local/package/mosaik-aligner/bin/MosaikAligner -mmp 0.05 -act 26 -bw 51 -mhp 100 -p 7 -in ../reads1.mkb -ia _ignore.backup/hg19/reference.dat -j _ignore.backup/hg19/reference_hs15 -out align1.hg19.mka ------------------------------------------------------------------------------ MosaikAligner 1.1.0018 2010-10-29 Michael Stromberg & Wan-Ping Lee Marth Lab, Boston College Biology Department ------------------------------------------------------------------------------ - Using the following alignment algorithm: all positions - Using the following alignment mode: aligning reads to all possible locations - Using a maximum mismatch percent threshold of 0.05 - Using 7 processors - Using a Smith-Waterman bandwidth of 51 - Using an alignment candidate threshold of 26bp. - Setting hash position threshold to 100 - Using a jump database for hashing. Storing keys & positions in memory. - Using a homo-polymer gap open penalty of 4 - loading reference sequence... finished. - loading jump key database into memory... finished. - loading jump positions database into memory... finished. Aligning read library (317992): 100%[============================================================================================================================] 83.7 reads/s in 1:03:18 Alignment statistics (mates): =================================== # filtered out: 19943 ( 6.3 %) # unique: 288449 ( 90.7 %) # non-unique: 9600 ( 3.0 %) ----------------------------------- total: 317992 total aligned: 298049 ( 93.7 %) MosaikAligner CPU time: 26744.380 s, wall time: 4015.994 s /usr/local/package/mosaik-aligner/bin/MosaikAligner -mmp 0.05 -act 26 -bw 51 -mhp 100 -p 7 -in ../reads2.mkb -ia _ignore.backup/hg19/reference.dat -j _ignore.backup/hg19/reference_hs15 -out align2.hg19.mka ------------------------------------------------------------------------------ MosaikAligner 1.1.0018 2010-10-29 Michael Stromberg & Wan-Ping Lee Marth Lab, Boston College Biology Department ------------------------------------------------------------------------------ - Using the following alignment algorithm: all positions - Using the following alignment mode: aligning reads to all possible locations - Using a maximum mismatch percent threshold of 0.05 - Using 7 processors - Using a Smith-Waterman bandwidth of 51 - Using an alignment candidate threshold of 26bp. - Setting hash position threshold to 100 - Using a jump database for hashing. Storing keys & positions in memory. - Using a homo-polymer gap open penalty of 4 - loading reference sequence... finished. - loading jump key database into memory... finished. - loading jump positions database into memory... finished. Aligning read library (563204): 100%[============================================================================================================================] 131.1 reads/s in 1:11:35 Alignment statistics (mates): =================================== # filtered out: 28749 ( 5.1 %) # unique: 519486 ( 92.2 %) # non-unique: 14969 ( 2.7 %) ----------------------------------- total: 563204 total aligned: 534455 ( 94.9 %) MosaikAligner CPU time: 30219.450 s, wall time: 4500.977 s
Result
A few reads on the other chromosomes: e.g:
CTCACCAGACCTCCTAGGCGACCCAGACAATTATACCCTAGCCAACCCCTTAAACACCCCTCCCCACATCAAGCCCGAATGATA TTTCCTATTCGCCTACACAATTCTCCGATCCGTCCCTAACAAACTAGGAGGCGTCCTTGCCCTATTACTATCCATCCTCATCCT AGCAATAATCCCCATCCTCCATATATCCAAACAACAAAGCATAATATTTCGCCCACTAAGCCAATCACTTTATTGACTCCTAGC CGCAGACCTCCTCATTCTAACCTGAATCGGAGGACAACCAGTAAGCTACCCTTTTACCATCATTGGACAAGTAGCATCCGTACT ATACTTCACAACAATCCTAATCCTAATACCAACTATCTCCCTAATTGAAAACAAAATACTCAAATGGGCCTGTCCTTGTAGTAT AAACTAATACACCAGTCTTGTAAACCGGAGATGAAAACCTTTTTCCAAGGACAAATCAGAGAAAAAGTCTTTAACTCCACCATT AGCACCCAAAGCTAAGATTCTAATTTAAACTATTCTCTG
e.g. BLAT confirms on chrM (!):
ACTIONS QUERY SCORE START END QSIZE IDENTITY CHRO STRAND START END SPAN --------------------------------------------------------------------------------------------------- browser details YourSeq 541 1 543 543 99.9% M + 15482 16024 543 00001 ctcaccagacctcctaggcgacccagacaattataccctagccaacccct 00050 >>>>> |||||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15482 ctcaccagacctcctaggcgacccagacaattataccctagccaacccct 15531 00051 taaacacccctccccacatcaagcccgaatgatatttcctattcgcctac 00100 >>>>> |||||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15532 taaacacccctccccacatcaagcccgaatgatatttcctattcgcctac 15581 00101 acaattctccgatccgtccctaacaaactaggaggcgtccttgccctatt 00150 >>>>> |||||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15582 acaattctccgatccgtccctaacaaactaggaggcgtccttgccctatt 15631 00151 actatccatcctcatcctagcaataatccccatcctccatatatccaaac 00200 >>>>> |||||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15632 actatccatcctcatcctagcaataatccccatcctccatatatccaaac 15681 00201 aacaaagcataatatttcgcccactaagccaatcactttattgactccta 00250 >>>>> |||||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15682 aacaaagcataatatttcgcccactaagccaatcactttattgactccta 15731 00251 gccgcagacctcctcattctaacctgaatcggaggacaaccagtaagcta 00300 >>>>> |||||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15732 gccgcagacctcctcattctaacctgaatcggaggacaaccagtaagcta 15781 00301 cccttttaccatcattggacaagtagcatccgtactatacttcacaacaa 00350 >>>>> |||||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15782 cccttttaccatcattggacaagtagcatccgtactatacttcacaacaa 15831 00351 tcctaatcctaataccaactatctccctaattgaaaacaaaatactcaaa 00400 >>>>> |||||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15832 tcctaatcctaataccaactatctccctaattgaaaacaaaatactcaaa 15881 00401 tgggcctgtccttgtagtataaactaatacaccagtcttgtaaaccggag 00450 >>>>> |||||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15882 tgggcctgtccttgtagtataaactaatacaccagtcttgtaaaccggag 15931 00451 atgaaaacctttttccaaggacaaatcagagaaaaagtctttaactccac 00500 >>>>> | |||||||||||||||||||||||||||||||||||||||||||||||| >>>>> 15932 acgaaaacctttttccaaggacaaatcagagaaaaagtctttaactccac 15981 00501 cattagcacccaaagctaagattctaatttaaactattctctg 00543 >>>>> ||||||||||||||||||||||||||||||||||||||||||| >>>>> 15982 cattagcacccaaagctaagattctaatttaaactattctctg 16024