Etchevers:Notebook/STRA6 in eye development/2009/06/21

From OpenWetWare
Jump to navigationJump to search
Genetics of human eye development Main project page
Previous entry      Next entry

End of PAX6 mapping

First, when I compare the results from OTX2-1 and PAX6, there are all the expected files in the outdirPAX6 whereas the cns.fq for instance is not present in outdir (the first directory I made, for OTX2-1 [GDZ-1 for Fasteris].

The log for OTX2-1 is:

- maq-0.7.1

[ma_load_reads] loading reads...

[ma_load_reads] set length of the first read as 71.

[ma_load_reads] 780016*2 reads loaded.

[ma_longread2read] encoding reads... 1560032 sequences processed.

[ma_match] set the minimum insert size as 72.

[match_core] Total length of the reference: 3095693983

[match_core] round 1/3...

[match_core] making index...

[match_search] 0% processed in 5.268 sec: 0 / 0 = 0.000 [match_search] 1% processed in 136.225 sec: 252710 / 18804593 = 0.013 [match_search] 2% processed in 255.760 sec: 425682 / 17166538 = 0.025 [match_search] 3% processed in 351.322 sec: 531698 / 13530050 = 0.039

(...) [match_search] 97% processed in 9514.435 sec: 800762 / 14316347 = 0.056 [match_search] 98% processed in 9607.204 sec: 677102 / 13661122 = 0.050 [match_search] 99% processed in 9694.450 sec: 604232 / 13506399 = 0.045 [match_search] 100% processed in 9700.078 sec: 67240 / 621474 = 0.108

[match_core] round 2/3...

[match_core] making index...

[match_search] 0% processed in 9701.598 sec: 216 / 3418 = 0.063 [match_search] 1% processed in 9828.930 sec: 248328 / 17657956 = 0.014 [match_search] 2% processed in 9945.282 sec: 430350 / 16183714 = 0.027 [match_search] 3% processed in 10039.319 sec: 531947 / 12874851 = 0.041

(...) [match_search] 98% processed in 19135.004 sec: 677459 / 13038677 = 0.052 [match_search] 99% processed in 19222.537 sec: 603627 / 12796895 = 0.047 [match_search] 100% processed in 19228.034 sec: 67351 / 586443 = 0.115

[match_core] round 3/3...

[match_core] making index...

[match_search] 0% processed in 19229.582 sec: 217 / 3454 = 0.063 [match_search] 1% processed in 19359.666 sec: 247251 / 17461382 = 0.014 [match_search] 2% processed in 19478.005 sec: 430582 / 16019918 = 0.027 [match_search] 3% processed in 19572.679 sec: 532315 / 12743822 = 0.042

(...) [match_search] 98% processed in 28779.071 sec: 676271 / 12874849 = 0.053 [match_search] 99% processed in 28864.708 sec: 603668 / 12688061 = 0.048 [match_search] 100% processed in 28870.408 sec: 67255 / 582084 = 0.116

[match_core] sorting the hits and dumping the results...

[ma_load_reads] loading reads...

[ma_load_reads] 780016*2 reads loaded.

[mapping_count_single] 9, 16, 26, 57

[maq_indel_pe] the indel detector only works with short-insert mate-pair reads.

-- Dumping unmapped or poorly mapped reads [match_data2mapping] 657880 out of 1560032 raw reads are mapped with 0 in pairs.

-- (total, isPE, mapped, paired) = (780016, 0, 657880, 0)

I used all the defaults for mapping; this might be optimizable.

I only have these files in the “outdir” folder:

  • aln1@4000001.map 43.4 Mb
  • aln1@4000001.map.log
  • read1@1.bfq 107.1 Mb
  • read1@2000001.bfq 106.9 Mb
  • read1@4000001.bfq 42.1Mb
  • ref.bfa 1400 Mb
  • unmap1@4000001.txt 20.5 Mb

The computer thinks that the *.map and *.bfq files are gzip archives and would uncompress them to be a lot bigger.

Meanwhile, sudo apt-get install subversion, which allows me to use the “svn” command that was recommended on the install instructions for PeakFinder. After installation, get the line “Checked out revision 1274”.

Now check out “parameters”: http://vancouvershortr.wiki.sourceforge.net/FP4Parameters

There is a lot to do here. If I just want to look at the text, and I generate it from the aln1@4000001.map file by doing $ maq mapview [-b] aln1@4000001.map > OTX2-1.txt

is it tab-delimited? And if so, can I open it in OpenOffice? Only one way to find out.

When I try, though, I get “Segmentation fault”. But that is because aln1@4000001.map is not in that folder. Hm. It seems to work now, although I launched the command from within the “outdir” folder. I get lines like this:

HWI-EAS038:3:98:854:1277#0/1 chr1 160527014 - 0 099 99 99 0 0 1 0 71

HWI-EAS038:3:88:1470:427#0/1 chr1 160537928 + 0 064 64 64 1 0 0 1 71

HWI-EAS038:3:98:1400:931#0/1 chr1 160538243 - 0 099 99 99 0 0 1 0 71


They are tab-delimited, but it's a shame that there is not a length. I suppose I could manipulate things to add or subtract 76 nt from the first position, in the direction of the read.

Also, changed to maqview folder, and read README file. This suggests to compile as so:

$> ./autogen.sh

$> ./configure

$> make


But at the first step, I get:

running: aclocal

eval: 1: aclocal: not found

error: while running 'aclocal'

and can not do configure either. Indeed, the autogen.sh specifies to look for an “aclocal” as well as other absent files. Perhaps they are part of the mentioned “OpenGL and GLUT” programs?

Meanwhile, putzed around by taking the OTZ2-1.txt file produced by maq mapview above, and opening in the spreadsheet. Need to learn to do this in another way so can keep more than 65K records. Meanwhile, added 76 bp to every coordinate, and removed all other columns to keep a barebones *.bed format and renamed as such. Zipped it to *.gz format and threw up just chromosome 1 onto UCSC.

All the reads look on top of one another when look chr band by band; but as one focuses in, there are usually spaces between them. A couple sites do seem to have a tight superposition of reads. It will be very important to see if these correspond to predicted and higher-than-threshold peaks once I get to generating such peaks.

In particular, as examples,

This one is 12-deep and in a functionally relevant gene:

>hg18_dna range=chr1:91852810-91853120 5'pad=0 3'pad=0 strand=+ repeatMasking=none CCCAGGCTGGAGTACAGTGGCACCAACATAGCTCACTGTAACTTCAAACTCCAGGACTCAAGCGATCCTACCGCTTCAGCTTCCTGAGAAGCTGGGACTACAGGCGCATGCCACCACGTCTGACTAATTTTTGTTTTTATTTTTTGTGGAGATGGGGTCTCAAACTCCTGTCCTCAAGCAATCCTCCCATCTTAGAGTTTGCCTTTTAAAGTCATCATCCTTAGAACTGGAAGAAACTAAAAGAGATGTCCTGTTCTAGCCCACAGATTTCACACAGGAAAGAGCTACCTTCTTCATTTTACAAAATAAAC

This one was 8-deep but very telomeric, and function?:

>hg18_dna range=chr1:10000-10155 5'pad=0 3'pad=0 strand=+ repeatMasking=none CCCAGGGGCCAGCACTGCTCGAAATGTACAGCATTTCTCTTTGTAACAGGATTATTAGCCTGCTGTGCCCGGGGAAAACATGCAGCACAGTGCATCTCGAGTCAGCAGGATTTTGACGGCTTCTAACAAAATCTTGTAGACAAGATGGAGCTATGG

Otx consensus may be more TAATCC like in Ogino 2008 paper, or more Bicoid-like with the GxCTAATT of Binato 2007.

Likewise, should take Christelle's ChIP-seq peaks and see if they correspond to tight superposition of reads.


For PAX6 [GDZ-5] the log is similar, but the folder contains:

all.map assemble.log cns.snp read1@2000001.bfq aln1@1.map cns.filter.snp cns.win ref.bfa aln1@1.map.log cns.final.snp consensus.cns unmap1@1.txt aln1@2000001.map cns.fq mapcheck.txt unmap1@2000001.txt aln1@2000001.map.log cns.indelse read1@1.bfq unmap.indel

And the cns.fq file, when I run "cat", looks like this: nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnatgactctccttggccaacgcaagcccgtccagctgcccaccctccctggcccagtc ctcccacaaaaccannnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnagtaacagacagggtctcagctgagttgtctcctgatccgctggcctcaccacccc taaatggaaaccagtnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn nnnnnnnnnnnnnnnnnn... with lots of "n"s.

Is that the entire consensus sequence, masked everywhere except where my sequences map? If so, the sequences are of variable lengths, but do not go over the 76 bp original.

The "all.map" seems to be the file I will want.