Hebin:TFBS asymmetry
Question
- Are enhancers enriched for binding sites than random sequences?
- Are binding sites evolving any slower than XXX? This involves explaining the excessive conservation in the spacer sequences!
- How many sites are affected by substitutions between mel and sim?
- Confirm the asymmetry! and show if conservation occurs at higher level--Pocc (compensation within homeotypic sites) or even between different factors'sites
- One direction: simulations to investigate properties of compensatory mutations
- how easy is it to create a new site de-novo
- where do compensatory mutations tend to happen?
- confirm that overlapping sites tend to be more conserved. also show by simulation under a simple compensatory scheme.
Cloning
the Linear Templates were cloned into <a class="external-link" href="http://catalog.takara-bio.co.jp/en/product/basic_info.asp?unitid=U100005842">pUC118</a>
cat 3322
bcd his-tag out of frame
the 3 prime end of the 2-step pcr product. I cloned them into the vectors and picked two clones, both show this wrong 3' end that put his tag out of frame.
GCAGTTTGCCTACTGCTTTAAT-ATCATCACCATCACCACTAATAA C C
Now I'm not sure if this is a problem with pcr or the primer. I'm going to do the 2-step pcr again and do bulk sequencing from the 3' end and also sequence the other three clones I picked. Might need to reorder the primer.
General Method
Approaches
Compare evolutionary constraint in TFBS, spacers and other regions
117/704 sites contain fixed substitutions between mel and sim, around 1/7 most desirably: compare to the nearby gene's neutral rate. Calculate π total: 0.01242 tfbs : 0.00729 nonsyn 0.0026 syn 0.0352 intron 0.0166 5'UTR 0.0108 3'UTR 0.0113 inter 0.0204
Divide sites by their information content
Method
- First add pseudo count P[b] * K (k=1) to every cell of the matrix
- ic = sum( f[b,p] * ln ( f[b,p] / P[b] ) )
- Note that this is different from the other method which define ic = 2 + sum( f[b,p] * ln( f[b,p] ) )
- ic range: 0.089 - 1.552. divide into (0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6) bins, count the total number of sites in each bin as well as number of sites div/poly
Results
For all 30 factors <0.2 <0.4 <0.6 <0.8 <1.0 <1.2 <1.4 <1.6 653 1233 1235 852 837 1069 32 207 21 34 31 25 12 26 0 3 7 18 16 14 5 6 0 6 <0.8 >0.8 2.79% =1.91%
<img class="image-inline" src="result/Div-IC.jpg/image_preview" alt="Divergence~Information Content" />
Determine Ancestral Binding Site Energy
Rationale
When the inferred ancestral sequence for a melanogaster binding site is not predicted to be a binding site, the simplest interpretation is that one or more melanogaster-specific substitutions create a novel site. However due to the degeneracy and other properties of the TFBS motif, it is conceivable that there might be a weak site overlapping with the current position of the mel binding site in the ancestral sequence. Maybe what the mel-specific mutations does is to make the binding site stronger while also sliding it a little bit.
<p>anc AATCCCCACTT >>>>>> mel ...T.......</p>
Code
<a title="Count_Variation" class="internal-link" href="code/Count-variations.pl">Count_Variation</a>
How many new gains?
select if(anc_score >=0, 'p',if(anc_score=-1, '-1','-2')) anc, count(*) from tfbs where tfbs_score >= 0 group by if(anc_score >=0, 'p',if(anc_score = -1, '-1', '-2'));
+-----+----------+ | anc | count(*) | +-----+----------+ | -1 | 14 | | -2 | 5 | | p | 695 | +-----+----------+
2% are newly gained
How many polymorphism in binding sites available for analysis?
select count(*), allele_number from tfbs_poly where tf_name is not NULL and tf_name != 'spacer' and polytype = 'nt' and poly_Or_not = 'poly' group by allele_number order by allele_number;
+----------+---------------+ | count(*) | allele_number | +----------+---------------+ | 3 | 2 | | 6 | 3 | | 22 | 4 | | 40 | 5 | | 17 | 6 | +----------+---------------+
Determine The Effect of Fixed/Polymorphic Nucleotide Substitutions
Rationale
#Table structure of the result:
ID unique for each change type mel / sim / poly / amb fpid fpid of the binding site coord genomic coordinate factor transcription factor effect change in PWM score or pvalue? freq allele frequency. only for polymorphic sites
#Notes
For each fixed nucleotide substitution in melanogaster, if the ancestral inference is available, I will calculate the difference in PWM score by this substitution alone on the background of the ancestral sequence. If the ancestral inference is not available, I will mark it as ambiguous. The same procedure will be done for simulans specific fixed changes with the consideration of whether the ancestral sequence is a good binding site. In the mean time I would like to produce a list of candidate targets for MITOMI.
Details: I'll go in to get a b.s. for factor X. First I'll determine if this site is gapped or not. If gapped, this site would be discarded. If not, I'll then go through each substitution. For fixed sites, I'll determine the direction of the change and then calculate the Polymorphic sites are treated the same way as fixed substitutions, except that the frequency of the polymorphism is also collected.
How many sites are un-gapped?
Total 793 sites with PWM score Gapped 89 sites Nngap 704 sites
How many sites are gained in mel and how many lost in sim within the footprints?
Within 704 ungapped sites (anc-mel-sim), using 80% recovery rate cutoff for each factor, the percentage is divided by (535+21+2 = 558) cutoff 80% cutoff 0 1-1-1 535 95.9% 601 96.9% 1-1-0 21 3.8% 19 3.1% 1-0-1 2 0 0-1-0 14 2.5% 12 1.9% 0-0-1 2 0 0-0-0 130 72 117/704 sites contain fixed substitutions between mel and sim, around 1/7 21/704 sites are present in the ancestor but lost in simulans
Complexity in evaluating the effects of changes
Here is an example:
anc .........?......? yak .....T...GA.....a mel tgtaaAATAATTattat sim .........T....--- .........T....---
The A->T change actually makes the complementary sequence a better binding site as shown below
a \AATAATT\ a position= 1 score= 4.38 ln(p-value)= -5.73 sequence= AATAATT a position= 1C score= 5.03 ln(p-value)= -6.44 sequence= AATTATT a \AATATTT\ a position= 1 score= 2.47 ln(p-value)= -4.30 sequence= AATATTT a position= 1C score= 1.66 ln(p-value)= -3.73 sequence= AAATATT
I noticed that the consensus for en is AATTAAA. Like this case, when the consensus has some kind of symmetry so that the sequence can be read from both strands, then a mutation might make the other strand a better site.
The complexity lies in the following:
1. The orientation and position of the binding site is not fixed. A change may actually make the best match to PWM shift by several base pair or appear in the complementary strand.
2. Because of 1, even if accepting the additivity and site independence assumption in PWM, changes in different positions may not be independent just because the binding site itself may have moved. In such conditions, inference of the order of the mutations become necessary and even critical.
Luckily this type of situation is not very common. First of all there are not many binding sites where there are more than 1 change. See below for the distribution of number of fixed/poly changes. Besides, this file records all cases where a change seems to have changed the orientation of the best match.
<a title="Complex cases where a change will change the orientation of a site" class="internal-link" href="result/complex-cases-where-a-change-will-change-the">Complex cases where a change will change the orientation of a site</a>
Another example--potential compensatory mutation
6314 4.52 anc ................. yak ..g..-.G..A...c.. mel tatacTATTATTcggtg sim .....A.....G..... T->A : 0.51->1.99 ...........G..... T->G : 4.52->0.51 ...........G..... .....A.....G.....
What's the distribution of # fixed/poly changes per un-gapped binding site?
no div poly 0 587 644 1 90 53 2 21 3 3 4 3 4 2 1 5 0 0 sum 704 704
Note: here I didn't distinguish which lineage the change is on.
In cases of more than one change in a single bs, do the second change tend to be of complementary?
I checked all cases with more than one changes in a single binding site on one lineage (for mel, >1 fixed diff; for sim, >1 fixed/poly changes)
Expectation
The hypothesis is that an initial change that decreases binding affinity would pose stronger constraint on the fixation of subsequent mutations (assuming a non-linear fitness-energy function because the first mutation might push the energy to some critical window where a further decrease would be very deleterious.
The prediction under this hypothesis is that if there are two changes in a single site, they tend to be of opposing effects. This hypothesis will not hold if the complementary change happens but more frequently in another location of the enhancer, in which case, the original site that get damaged is now deprived of its functional importance and thus become prone to further changes.
Observation
Focusing on 2-3 changes in a single site on either lineage, in mel they tend to be both/all increasing or the net effect is obviously increasing. For sim, of the 12 or so cases I checked, three of them that have two changes show opposing effects. Two out of these three cases the net effect is quite close to zero. However while the estimates of the effects are from PWM, it need to be taken with caution.
I tried to solve some of the ambiguous rooted base pairs. It turned out that VISTA browser and UCSC use different algorithms for alignment. While the VISTA one which I used in this case shows a deletion in yak, the sequence is present in UCSC alignment.
After solving 3-4 cases where this would potentially influence my conclusion, I still hold the impression that the 3/13 cases show evidence of complementary changes. Plus these 3 cases have a pretty small predicted effects. This makes sense in that if the initial change only decreases the binding energy by a little, then there are a lot of ways to compensate that. If the initial change strongly influence the binding affinity, then only a very small number of specific changes can help revive the site, in which case creating another one in a new place may be easier!
MK test and Frequency Spectrum
Code
setwd("/Volumes/hebin-1/Documents/tfbs/Result") data <- read.table("effect_changes.txt", head=T) ########### # inc/dec ########### t <- 1; mk <- c(0,0,0,0,0,0); dim(mk) <- c(3,2) rownames(mk) <- c('mel','sim', 'poly'); colnames(mk) <- c('inc','dec') mk[1,1] <- length(data$effects[data$type == 'mel' & data $ effects > t & data$score > 0]) mk[1,2] <- length(data$effects[data$type == 'mel' & data $ effects < -t & data$score > 0]) mk[2,1] <- length(data$effects[data$type == 'sim' & data $ effects > t & data$score > 0]) mk[2,2] <- length(data$effects[data$type == 'sim' & data $ effects < -t & data$score > 0]) mk[3,1] <- length(data$effects[data$type == 'poly' & data $ effects > t & data$score > 0]) mk[3,2] <- length(data$effects[data$type == 'poly' & data $ effects < -t & data$score > 0]) mk setwd("/Volumes/hebin-1/Documents/tfbs/Result") data <- read.table("effect_changes.txt", head=T) ############## # large/small ############## t2 <- 4; cond <- (data$effects > t2 | data$effects < -t2) mk2 <- c(0,0,0,0,0,0); dim(mk2) <- c(3,2) rownames(mk2) <- c('mel','sim', 'poly'); colnames(mk2) <- c('large','small') mk2[1,1] <- length(data$effects[data$type == 'mel' & cond & data$score > 0]) mk2[1,2] <- length(data$effects[data$type == 'mel' & !cond & data$score > 0]) mk2[2,1] <- length(data$effects[data$type == 'sim' & cond & data$score > 0]) mk2[2,2] <- length(data$effects[data$type == 'sim' & !cond & data$score > 0]) mk2[3,1] <- length(data$effects[data$type == 'poly' & cond & data$score > 0]) mk2[3,2] <- length(data$effects[data$type == 'poly' & !cond & data$score > 0]) mk2 ###################### # frequency spectrum ###################### t3 <- 1 comb <- summary(data$freq[data$type == 'poly' & data$score > 0 & data$alleles == 5])[1:5] inc <- summary(data$freq[data$type == 'poly' & data$score > 0 & data$alleles == 5 & data$effects > t3])[1:5] dec <- summary(data$freq[data$type == 'poly' & data$score > 0 & data$alleles == 5 & data$effects < -t3])[1:5] a <- c(1, 1/2, 1/3, 1/4, 1/5) matrix <- rbind(comb/sum(comb)*sum(a), inc/sum(inc)*sum(a), dec/sum(dec)*sum(a), a) rownames(matrix) <- c('combined', 'increase', 'decrease', 'neutral') barplot(matrix, beside = T, legend.text = rownames(matrix), col=rainbow(10)[c(3,5,7,8)]) title(main = 'frequency spectrum of 5 alleles of different classes') ################################### # To count the number of changes # in spacers # May 8th 08 # hebin ################################### SELECT poly_Or_not, count(*) FROM tfbs_poly WHERE tf_name = 'spacer' and polytype = 'nt' and poly_Or_not in ('poly', 'div') GROUP BY poly_Or_not; +-------------+----------+ | poly_Or_not | count(*) | +-------------+----------+ | div | 2796 | | poly | 1834 | +-------------+----------+ SELECT if(tfbs_poly.anc = mel, 'sim', if(anc = '?', 'amb', 'mel')) type, count(*) FROM tfbs_poly WHERE tf_name = 'spacer' and polytype = 'nt' and poly_Or_not = 'div' group by type; +------+----------+ | type | count(*) | +------+----------+ | amb | 646 | | mel | 1223 | | sim | 927 | +------+----------+
Solving ambiguous ancestry sites
There are 39 changes that cannot be rooted using my original VISTA mel-yak alignment. I extended the alignment to multi-species up to virilis, data retrieved from UCSC genome browser
<a title="Solving ambiguous ancestry using UCSC multi-species alignment" class="internal-link" href="result/ambiguous-sites.txt">Solving ambiguous ancestry using UCSC multi-species alignment</a>
All together I solved 15 / 39 cases with reasonable confidence (based on parsimony principle). 4/5 mel specific changes are predicted to increase binding affinity while 5/6 sim specific changes are expected to decrease binding affinity. So the asymmetry is made even more extreme.
Results
inc: >0; dec: < 0 inc dec mel 34 16 sim 7 42 poly 17 50 excluding mel_specific sites inc dec mel 15 15 sim 4 31 poly 14 42 inc: >1; dec: < -1 inc dec spacer mel 26 8 1223 sim 1 35 927 poly 12 42 1834 excluding the mel_specific sites inc dec mel 11 7 sim 0 25 poly 10 35 using Pi instead of PWM score inc: >0.01; dec: < -0.01 inc dec mel 22 7 sim 4 26 poly 13 33 inc: >0.005; dec: < -0.005 inc dec mel 23 9 sim 4 30 poly 14 40 individual regions pattern the same CE1049 27 changes (including ambiguous sites) inc dec spacer mel 4 2 26 sim 0 5 22 poly 1 5 58 CE1004 11 changes inc dec spacer mel 0 0 3 sim 0 2 1 poly 1 2 3 CE1068 24 changes inc dec spacer mel 2 2 4 sim 0 7 7 poly 0 1 10 large : small cutoff = 3 large small spacer mel 16 34 1223 sim 23 26 927 poly 28 39 1834 large : small cutoff = 4 large small mel 13 37 sim 16 33 poly 16 51
Problems
- because the number is small, the ambiguous sites can have a large effect on the result.
- pwm is derived from melanogaster footprints. They are adapted to whatever equilibrium frequency is in melanogaster. Does this a priori predict more gains than loss in mel?
- because the footprints are defined in melanogaster, imagine the ancestral pool of sites--the sampling is biased towards changes that increase binding affinity because those are the ones that will be detected in the footprint assays
Further questions: 0. what's the expected number of large/small effects? 1. what's the number for non-binding sites? 2. frequency spectrum? 3. if really biased by the sampling, maybe expand to PWM predictions? where are the functional binding site?
Further directions:
- verify with selex PWM?
- unbiased prediction of binding sites in both species. But now how could you distinguish functional sites from site like sequences, which presumably has very different evolutionary pattern.
- if inc/dec dimension is orthogonal to large/small effect dimension, then the later should not be biased.
<img class="image-inline" src="result/Frequency%20Spectrum.pdf/image_preview" alt="Frequency_Spectrum" height="420" width="420" />
Validate the results with Pollard's PWM
Methods
- Start from effect_changes.txt
- for each documented change, retrieve the footprint sequence, make two copies with the only difference in this change, one being ancestral and one being evolved
- use patser with Pollard's PWM to find the score difference
Results
inc > 1; dec < -1; inc dec mel 15 8 sim 8 16 poly 8 21 inc > 0; dec < 0 inc dec mel 22 14 sim 13 22 poly 17 33
Validate the results with SELEX PWM
Impression on the comparison between SELEX PWM and footprint PWM
Casey's PWM are learned from the footprints. Typically they are less information-dense (containing lots of low information site). They recover more footprints, as they are optimized for. But they also predict a lot more additional sites when used to scan the whole enhancer
Methods
- extract matrix from xml file and format as consensus. <a title="XML_PWM2.pl" class="internal-link" href="code/XML-PWM2.pl">XML_PWM2.pl</a>
- use patser to transform the matrix into scoring matrix
- for each change for the factor, use the selex matrix to reexamine the effect and record. <a title="compare_selex.pl" class="internal-link" href="code/compare-selex.pl">compare_selex.pl</a>
Results
footprint PWM inc dec mel 8 2 sim 1 13 poly 4 12 SELEX PWM inc dec mel 8 3 sim 2 11 poly 3 11
<img class="image-inline" src="result/compselex.jpg/image_large" alt="compselex.jpg" height="597" width="675" />
Unbiased prediction
How to choose good PWMs and proper cutoffs
Case study: bcd23(SELEX) vs. bcd
There are 5 footprint sites. Use the lowest footprint score 3.11 as cutoff, Casey's PWM bcd finds 16 hits while the same cutoff on bcd23 (selex) misses one footprint site but finds only 8 hits. However there are 3 sites that are consistently recovered but not found in footprints for unknown reasons.
eve position= 431 score= 4.29 ln(p-value)= -6.66 sequence= TCCAATCC eve position= 455 score= 5.51 ln(p-value)= -7.57 sequence= CCCAATCC eve position= 461 score= 5.51 ln(p-value)= -7.57 sequence= CCCAATCC
Protocol
- ../patser/patser -A a:t 0.3 c:g 0.2 -m ../OptimalPatserMatrices/bcd23 -b 1 -u2 -s -c -ls 0 -f bcd_test.fasta
| grep "position" > ./temp.txt - Convert the output to a table format <a title="convert.pl" class="internal-link" href="../code/temp3.pl">convert.pl</a>
- Validate that the patser predicted position are consistent with the genome coordinate <a title="Validate the relative position" class="internal-link" href="../code/Unbiased-prediction.pl">Validate the relative position</a>
- Given a cutoff, evaluate the percentage of footprint recovered and the ratio of total hits to hits in footprints. <a title="PWM and cutoff evaluation" class="internal-link" href="../code/pwm-and-cutoff-evaluation">PWM and cutoff evaluation</a>
<img class="image-inline" src="result/hb-cutoff.jpg/image_preview" alt="hb_cutoff" />
Method
type |
mel |
sim | anc |
---|---|---|---|
1 | √ | √ | √ |
2 | √ | √ | |
3 | √ | √ | |
4 | √ | ||
5 | √ | ||
6 |
√ | ? |
|
7 |
√ | ? |
-
Extract the whole enhancer sequence alignment for mel and w501 sim, anc as well. Consider filling up the w501 gaps with other sim sequences (optional).
- Do predictions in both mel and sim sequences with the chosen pwm and cutoff.
- Transform the coordinates (- is not counted in patser) and get rid of overlapping sites
- Store the binding site map in array of arrays. (MySQL and R have more powerful matrix manipulation tools, but they are not good at string processing and looping structure.)
- First examine sites predicted in mel sequences--see if the asymmetry holds.
- To transform the coordinate, I used a hash $coordinate -- $coordinate{$n} stores the absolute position for relative pos $n. This value is used for extracting sequences using substr.
- To deal with overlapping site, I recorded the previous site information and judge if the next site is overlapping. Throw away the lower score one.
- Loop through each of the remaining binding sites. Extract the alignment block, throw away gapped sites. Evaluate the effects of changes. Record "region factor position type effect"
- For each predicted binding site, judge the ancestral sequence if possible. Assign the phylogenetic type of the site (available types listed below)Then examine all changes in those binding sites including mel and sim predicted ones.
- Look at compensatory changes
-
Use Ah-Ram's algorithm to plot
~/Documents/tfbs/patser/patser -A a:t 0.3 c:g 0.2 -m ~/Documents/tfbs/OptimalPatserMatrices/bcd -b 1 -ls 3.11 -s -c S2E sequence CAATATAACCCAATAATTTGAAGTAACTGGCAGGAGCGAGGTATCCTTCCTGGTTACCCGGTACTGCATAACAATGGAACCCGAACCGTAACTGGGACAGATCGAAAAGCTGGCCT GGTTTCTCGCTGTGTGTGCCGTGTTAATCCGTTTGCCATCAGCGAGATTATTAGTCAATTGCAGTTGCAGCGTTTCGCTTTCGTCCTCGTTTCACTTTCGAGTTAGACTTTATTGC AGCATCTTGAACAATCGTCGCAGTTTGGTAACACGCTGTGCCATACTTTCATTTAGACGGAATCGAGGGACCCTGGACTATAATCGCACAACGAGACCGGGTTGCGAAGTCAGGGC ATTCCGCCGATCTAGCCATC--------GCCATCTTCTGCGGGCGTTTGTTTGTTTGTTTGCTGGGATTAGCCAAGGGCTTGACTTGGAATCCAATCCCGATCCCTAGCCCGATCC CAATCCCAATCCCAATCCCTTGTCCTTTTCATTAGAAAGTCATAAAAACACATAATAATGATGTCGAAGGGATTAGGGGCGCGCAGGTCCAGGCAACGCAATTAACGGACTAGCGA ACTGGGTTATTTTTTTGCGCCGACTTAGCCCTGATCCGCGAGCTTAACCCGTTTTGAGCCGGGCAGCAGGTAGTTGTGGGTGG---ACCCCACGATTTTTTTG select fpid, tfbs_start-5489522 pos, tfbs_score, tfbs_pvalue, strand from tfbs where region = 'CE1056' and factor = 'bcd';
Result
- Examine mel-predicted sites only, for hb, cutoff = 4, t=1
In footprint sites: inc dec mel 4 1 sim 0 6 Total: 97, Gapped: 16, 12 ungapped sites contain fixed differences In mel-predicted sites: inc dec mel 6 2 sim 0 10 Total: 206 binding sites, Gapped: 29 of the 10 changes that cannot be rooted, 9/10 shows a stronger binding energy in mel and 1/10 is the other way. So resolving the ancestry for them would only enhancer the asymmetry
-
Examine sim-predicted sites only, for hb, cutoff = 4, t=1
w501 only inc dec mel 0 4 sim 3 2 Total: 151 binding sites, Gapped: 19 mixed sim inc dec mel 0 8 sim 7 3 Total: 191 binding sites, Gapped: 23
Interpretation
- Extreme: if there is no turnover at all, then one would expect equal number of increase and decrease to balance out on each lineage
- the observed extreme asymmetry indicates either evolution of total binding or high turnover plus biased sample of footprint sites from the focal species, likely a mixture of the two.
- Turnover rate estimated from the prediction
Things to get to:
net change in the number of sites per enhancer, in total the number of gains and losses in each species, number of sliding cases
Region Overlap mel sim Sliding CE1030 0 1 0 0 1 1 0 CE1013 10 2 0 0 14 3 2 CE1056 3 1 0 0 4 1 1 CE1149 4 3 1 1 6 3 1 CE1089 16 3 2 1 20 6 3 #has poor alignment! CE1012 14 3 1 1 21 6 7 CE1004 14 3 2 2 16 5 3 CE1035 18 2 0 0 23 3 3 CE1162 38 14 9 5 20 23 6 #has poor alignment! CE1034 7 0 0 0 7 2 2 CE1068 10 12 11 5 14 13 10? #10<11, due to the mixed sim allele in the previous case CE1181 12 1 3 0 11 3 4 CE1029 12 3 4 3 13 4 5 total 158 48 33 18 footpr 66 23 no evidence that the mel-specific ones tend to be non-footprints. this is sort of consistent with Justin's finding that semi-conserved sites do not tend to have lower affinity. Estimate from hb alone: mel: 206 sites in total, 25% mel specific sim: 191 sites in total, 17% sim specific inc dec mel 6 9 sim 7 16 Total: 239 binding sites, Gapped: 45 Among the ambiguous sites, 2/9 (inc/dec) using hb39 SELEX matrix CE1030 3 0 0 0 CE1013 8 0 0 0 CE1056 2 1 0 0 CE1149 8 3 0 0 CE1089 15 0 0 0 CE1012 11 0 0 0 CE1004 12 2 1 0 CE1035 16 1 0 0 CE1162 36 13 9 0 CE1034 7 0 0 0 CE1068 13 10 13 0 CE1181 22 2 4 0 CE1029 14 4 3 0 total 167 36 30
mel Polymorphism
Materials
- eve-stripe 2 enhancer element from Ludwig and Kreitman 1995
- from other literatures
name Alleles Chr_arm Coordinate Region Start End ref Ubx 22+1 3R 12,526,539-12,527,269 CE1034 12526644 12527144 -Phinchongsakuldit, MacArthur & Brookfield, MBE, 2004 Ubx 22+1 3R 12,599,005-12,600,040 CE1181 12598627 12600114 -same as above ftz 13 3R 2,689,483-2,689,995 CE1121 2689374 2690045 Dermitzakis, Bergman & Clark, MBE, 2003 en 12 2R 7,043,788-7,044,631 CE1149 7044399 7044957 same as above hb 12 3R 4,517,317-4,520,669 CE1162 4520300 4524677 Tautz & Nigro, MBE 1998 50 2L 12,537,331-15,942,868 CE1061 13904731 13905265 50 genomes project CE1138 14614893 14616302 CE1078 15474198 15476199 50 X 1,911,660-5,005,056 CE1086 1980273 1980606 CE1085 1981449 1981745 CE1084 1996366 1996806 CE1172 2033396 2033534 CE1192 2304109 2304228 CE1188 2651290 2651409 CE1187 2652700 2654032 CE1099 3105441 3106264 CE1126 4129700 4134947 CE1038 4231703 4233191 CE1146 4908617 4909698 there are altogether 144 footprint sites
| dec inc spacer ----------------+--------------------------------- mel div |5 1 0 4 poly |4 1 1 2 | 4/6 1/6 3/5, 1/5 ----------------+--------------------------------- sim div |7 2 0 5 poly |3 0 0 3 | 2/6, 2*1/6
hb promoter CE1162 overlaps with fpid 6413-6421, 8 bcd sites and 1 hb site disappointingly, there are no polymorphism or fixed substitutions in these 8 sites.
Transcription factors divergence
Methods
- get the TF protein sequence from UCSC
- blast for AA->AA on flybase, save in XML
- run TF_align_XML.pl to get alignment
- run emma on EMBOSS with alignment, save in MSF
- run pretty plot with MSF
- on UCSC go to uniprot to get structure information
- annotate in preview for DNA binding domain
Calculate Pocc and its application
Methods
- patser score = sigma{ ln f(b,j)/p(b) }
- deltaG = RT sigma{ ln f(b,j)/p(b) }
- Pi = [X] / (exp{-deltaG / RT} + [X])
- let [X] = exp{deltaG[consensus]}
- Pi = 1 / (exp{ (deltaG[consensus] - deltaG) / RT} + 1) = 1 / (exp{patserS[consensus] - patserS} + 1)
- Consensus sequence has the highest score -> highest deltaG -> lowest Kd -> more difficult to dissociate (because the more negative deltaG is, the more self-initiating the process is.)
- I've verified that my hand-written way of calculating score is the same as patser calculation, while the later is much slower.
Validation of the method
I manually check some of the pair Pocc in the output. Sometimes there seem to be a lot of changes on sim lineage that decreases binding energy score by quite a bit. For example, for Region CE1057, factor Trl, here is the list of binding sites energy scores for the two species
fpid mel sim 4152 0.13 -0.19 4154 4.93 -0.63 4157 7.36 7.29 4165 9.10 10.59 Pocc_mel = 0.066709956 Pocc_sim = 0.188498675 The reason is better explained below in properties of Pocc. Basically, changes in binding score doesn't correlate linearly with Pocc. Thus change in the strong site 4165 contribute disproportionally to Pocc.
Properties of Pocc
Pocc is the probability that AT LEAST one TF is bound to the enhancer. As expected, abolishing one binding site in an enhancer consisting a handful of strong sites will not affect Pocc to any substantial amount. However, why some enhancers are built with a single site while some contain a dozen similarly strongly sites?
Several levels of considering binding sites profile divergence:
- count the kind and number of binding sites
- calculate Pocc, which integrates over all the binding sites of the same kind
- model the output of the enhancer thus integrating binding sites position and combination information and so on
- Because Pocc is not responsive when there is redundancy. (it is enough to have one or several strong sites to ensure that at least one TF is bound), this comparison will only reveal those cases with single site in the enhancer that is altered in one species or those that have many sites altered at the same time.
One thing to note is what Pocc value reflects. When there are many strong sites in a sequence, abolishing one of them seem to have very small changes on Pocc. Necessary to read Pocc papers.
1. permute the sequence (preserve dinucleotide frequency)
For bcd using 1. selex PWM 2. PWM from Down et al.
<tbody> </tbody>Region |
Pocc | pvalue |
---|---|---|
CE1030 | 0.1604 | 0.232 |
0.2641 |
0.618 |
|
CE1056 |
0.8398 |
0 |
0.9209 |
0.001 |
|
CE1089 |
0.0166 |
0.935 |
0.0881 |
0.989 |
|
CE1012 |
0.0894 |
0.372 |
0.4948 |
0.428 |
|
CE1004 |
0.0480 |
0.389 |
0.6110 |
0.116 |
|
CE1035 |
0.4169 |
0.125 |
0.8910 |
0.029 |
|
CE1028 |
0.0445 |
0.135 |
0.0753 |
0.385 |
|
CE1074 |
0.1199 |
0.315 |
0.8253 |
0.03 |
|
CE1162 |
0.7363 |
0.165 |
0.9691 |
0.667 |
code
<a title="Pocc pair calculation and expected" class="internal-link" href="code/Pocc-pair.pl">Pocc expected</a>
<a title="Pocc pair calculation and expected" class="internal-link" href="code/Pocc-pair.pl">Pocc pair calculation</a>
Result
- General pattern:
<a title="Pocc expected" class="internal-link" href="code/Pocc-pairexp.pl"></a>
For hb Pocc = 0.0490626779367482 p-value = 0.951 CE1030 Pocc = 0.857583818550552 p-value = 0.009 Pocc = 0.644644214013364 p-value = 0.066 Pocc = 0.726122781523663 p-value = 0.164 Pocc = 0.99205125002337 p-value = 0 Pocc = 0.98936891052637 p-value = 0 Pocc = 0.994712693935612 p-value = 0 Pocc = 0.996523508773069 p-value = 0 Pocc = 0.999922656447524 p-value = 0.005 Pocc = 0.953240878140783 p-value = 0 Pocc = 0.985433379804871 p-value = 0.035 Pocc = 0.955171520728597 p-value = 0.08 Pocc = 0.972211787235409 p-value = 0.006
2. same sequence, scramble the pwm (preserve the total information content)
hb 1000 iterations CE1030 Pocc = 0.0485909445226903 p-value = 0.793 CE1013 Pocc = 0.855046878947466 p-value = 0.008 CE1056 Pocc = 0.641396463335304 p-value = 0.162 CE1149 Pocc = 0.723027286433472 p-value = 0.087 CE1089 Pocc = 0.991711235632589 p-value = 0 CE1012 Pocc = 0.988940266852853 p-value = 0 CE1004 Pocc = 0.994466512897498 p-value = 0.002 CE1035 Pocc = 0.996344086058097 p-value = 0.003 CE1162 Pocc = 0.999915746792497 p-value = 0.178 CE1034 Pocc = 0.952078221580422 p-value = 0.001 CE1068 Pocc = 0.984885043860919 p-value = 0.031 CE1181 Pocc = 0.953952078454468 p-value = 0.019 CE1029 Pocc = 0.971370913035186 p-value = 0.016 bcd24 1000 iterations CE1030 Pocc = 0.161788727877099 p-value = 0.13 CE1056 Pocc = 0.842029413353345 p-value = 0.002 CE1089 Pocc = 0.0167758142211993 p-value = 0.733 CE1012 Pocc = 0.0902420614763332 p-value = 0.334 CE1004 Pocc = 0.0484866852062439 p-value = 0.375 CE1035 Pocc = 0.419809501350481 p-value = 0.101 CE1028 Pocc = 0.0449268578762426 p-value = 0.133 CE1074 Pocc = 0.12098154880125 p-value = 0.144 CE1162 Pocc = 0.739505636509969 p-value = 0.109
Pocc pairs of the two species' enhancer
<img class="image-inline" src="result/Pocc-pair.jpg/image_large" alt="Pocc_pair" height="657" width="628" />
Obviously there are more decreases than increases.
Something interesting I discover
Here I can divide all cases into three categories: 1. Affinity in footprints decreases a lot in simulans but Pocc are similar -- functional compensation 2. Affinity in footprints decreases a lot in simulans and Pocc drops dramatically as well -- functional evolution or hetero-factor compensation? 3. Affinity in footprints doesn't change a lot but Pocc changes -- either sim evolves a stronger binding or there are hidden sites somewhere! Category I Changes in footprints: CE1004 3616 Kr 7.69 8640866 poly 0.38 1 4 CE1004 3618 Kr 7.62 8640890 sim -5.81 4 4 CE1004 3620 Kr -1.00 8640926 sim -1.74 4 4 CE1004 3623 Kr -1.00 8641007 poly 0.00 1 5 Pocc values Region factor mel_l sim_l mel_Pocc sim_Pocc CE1004 Kr 548 542 0.391655915895923 0.364895827027112 For this case, actually there doesn't need to be any complementary changes because there are already a lot of strong sites in this region: +------+--------+---------+--------+--------+----------+---------+------------+----------+------------+-------------+-----------+------------+--------+--------+ | fpid | region | chr_arm | factor | target | fp_start | fp_end | tfbs_start | tfbs_end | tfbs_score | tfbs_pvalue | anc_score | anc_pvalue | strand | gapped | +------+--------+---------+--------+--------+----------+---------+------------+----------+------------+-------------+-----------+------------+--------+--------+ | 3612 | CE1004 | 3L | Kr | h | 8640828 | 8640838 | 8640827 | 8640836 | 7.56 | -9.26 | 7.56 | -9.26 | - | N | | 3616 | CE1004 | 3L | Kr | h | 8640856 | 8640866 | 8640857 | 8640866 | 7.69 | -9.39 | 7.69 | -9.39 | + | N | | 3618 | CE1004 | 3L | Kr | h | 8640881 | 8640891 | 8640882 | 8640891 | 7.62 | -9.33 | 7.62 | -9.33 | + | N | | 3620 | CE1004 | 3L | Kr | h | 8640919 | 8640929 | 8640918 | 8640927 | -6.59 | 0 | -1 | 0 | - | N | | 3623 | CE1004 | 3L | Kr | h | 8641003 | 8641011 | 8641003 | 8641012 | -2.72 | 0 | -1 | 0 | - | N | | 3624 | CE1004 | 3L | Kr | h | 8641119 | 8641129 | 8641118 | 8641127 | 8.07 | -9.78 | 8.07 | -9.78 | - | N | | 3627 | CE1004 | 3L | Kr | h | 8641145 | 8641155 | 8641140 | 8641149 | -3.88 | 0 | -1 | 0 | + | N | | 3633 | CE1004 | 3L | Kr | h | 8641252 | 8641262 | 8641251 | 8641260 | 9.42 | -11.46 | 9.42 | -11.46 | - | Y | +------+--------+---------+--------+--------+----------+---------+------------+----------+------------+-------------+-----------+------------+--------+--------+ Simply abolishing one site won't change Pocc a lot. However, what does Pocc means? Category II Changes in footprints: 196 CE1029 260 Kr 3.5 12636476 mel 5.81 5 5 anc ......G............. yak ......G............. mel tgcgtGAAAGGGTGAagcta 9.3 sim nnnn..G............. 3.5 ......G............. ......G............. ......G............. ......G............. Pocc values: CE1029 Kr 1746 1727 0.215347001553784 0.0280790836253705
Interesting literature for CE1029
Shimell, M. J., Peterson, A. J., Burr, J., Simon, J. A., O'Connor, M. B.,
February 2000. Functional analysis of repressor binding sites in the iab-2 regulatory region of the abdominal-a homeotic gene. Developmental Biology 218 (1), 38-52.
URL <a href="http://dx.doi.org/10.1006/dbio.1999.9576">http://dx.doi.org/10.1006/dbio.1999.9576</a>
[abstract] Abdominal-A (ABD-A) homeotic protein. Analysis of a 1.7-kb enhancer element [iab-2(1.7)] from the iab-2 regulatory region shows that in contrast to Ubx enhancer elements, both HB and Kru¨ ppel (KR) are required to set the ABD-A anterior boundary in parasegment 7. DNase I footprinting and site-directed mutagenesis show that HB and KR are direct regulators of this iab-2 enhancer. The single KR site can be moved to a new location 100 bp away and still maintain repressive activity, whereas relocation by 300 bp abolishes activity. These results suggest that KR repression occurs through a local quenching mechanism.
I tried to search flanking sequence 300 bp on each side for compensatory Kr sites for sim but didn't find any. I then calculated Pocc for sim sequence as well as a p-value from 1000 dinucleotide permutation of the sequence. The results indicate that sim sequence just don't contain a good Kr site
Pocc_mel = 0.215347001553784 p-value (for dinucleotide shuffling) = 0.109 Pocc_sim = 0.050892903419477 p-value (for dinucleotide shuffling) = 0.804
Calculate Potential and its application
Method
The calculation is very similar to Pocc and have interesting relationship with it.
- score all the sliding windows of pwm_length and rank the scores
- Pick the highest rank score
- Pick the second highest rank among those that do not overlap with the previous picked ones until no site can be picked because of overlaps
- Sum up the Pi 's
- At equilibrium, this sum value can be interpreted as the "average number of proteins on the enhancer"
- This value will be very close to Pocc while Pocc is small (0.2 for example)
- This value doesn't have the saturation problem with Pocc. So it is better for enhancers containing lots of strong sites and to show the effect of abolishing one of them on the potential of the enhancer to be bound by the factor.
- Its calculation doesn't impose any arbitrary cutoff.
Code
<a title="Potential_pair.pl" class="internal-link" href="code/Potential-pair.pl">Potential_pair.pl</a>
Result
select e.type, e.region, e.fpid, e.factor, e.coordinate, e.effects, e.Pi, p.melpt, p.simpt, p.simpt_exp from effect_changes e, potential p where e.type LIKE "%sim" and e.Pi < -0.01 and e.region = p.region and e.factor = p.factor order by e.type, e.factor, e.region +------+--------+------+--------+------------+----------+------------+-----------+-----------+-----------+ | type | region | fpid | factor | coordinate | effects | Pi | melpt | simpt | simpt_exp | +------+--------+------+--------+------------+----------+------------+-----------+-----------+-----------+ | asim | CE1068 | 6240 | hb | 12590526 | -3.94 | -0.0143405 | 2.91906 | 2.4386 | 2.54402 | | asim | CE1109 | 6321 | Kr | 21022230 | -4.26 | -0.0693678 | 0.121555 | 0.0532071 | 0.0461818 | | asim | CE1109 | 6321 | Kr | 21022232 | -6.22 | -0.070286 | 0.121555 | 0.0532071 | 0.0461818 | | asim | CE1146 | 1599 | ovo | 4909561 | -2.21 | -0.110128 | 0.554871 | 0.370851 | 0.444743 | | asim | CE1131 | 6347 | Ubx | 11455015 | -3.44 | -0.117362 | 0.377224 | 0.231532 | 0.335409 | | asim | CE1068 | 6236 | z | 12590413 | -4.76 | -0.0758525 | 3.14204 | 3.08593 | 3.14403 | | sim | CE1049 | 6089 | abd-A | 2759019 | -3.23 | -0.0409354 | 17.6296 | 18.7606 | 17.3765 | | sim | CE1200 | 6489 | ap | 22997767 | -5.61 | -0.43012 | 3.47621 | 1.64334 | 2.66249 | | sim | CE1099 | 4491 | br-Z3 | 3105855 | -0.85 | -0.0167293 | 3.27923 | 1.67408 | 3.02324 | | sim | CE1016 | 981 | Deaf1 | 2613240 | -2.26 | -0.0356046 | 0.0638928 | 0.0379833 | 0.0283291 | | sim | CE1076 | 6272 | dl | 2581108 | -1.32 | -0.0135067 | 0.495107 | 0.611945 | 0.714109 | | sim | CE1085 | 6314 | en | 1981579 | -4.01 | -0.194292 | 1.31849 | 1.87045 | 1.17026 | | sim | CE1152 | 5108 | en | 2692110 | -1.9 | -0.369892 | 2.5783 | 2.64407 | 2.43348 | | sim | CE1179 | 5366 | en | 5827316 | -0.411 | -0.043412 | 0.838666 | 1.31887 | 0.795254 | | sim | CE1030 | 6017 | hb | 11456134 | -0.87005 | -0.0169816 | 0.0411556 | 0.0238455 | 0.0241783 | | sim | CE1035 | 6032 | hb | 20630455 | -0.4217 | -0.0649021 | 3.04001 | 2.78995 | 2.90002 | | sim | CE1035 | 6035 | hb | 20630490 | -5.82 | -0.0524852 | 3.04001 | 2.78995 | 2.90002 | | sim | CE1056 | 4137 | hb | 5490187 | -5.82 | -0.119851 | 0.728987 | 0.61201 | 0.609136 | | sim | CE1068 | 6238 | hb | 12590469 | -5.82 | -0.234132 | 2.91906 | 2.4386 | 2.54402 | | sim | CE1181 | 5387 | hb | 12599392 | -5.82 | -0.127427 | 2.37144 | 2.74126 | 2.25778 | | sim | CE1004 | 3634 | kni | 8641330 | -1.02 | -0.201558 | 6.03252 | 4.83989 | 5.83096 | | sim | CE1181 | 5390 | kni | 12599840 | -4.586 | -0.0191775 | 7.76955 | 8.90855 | 7.76955 | | sim | CE1004 | 3618 | Kr | 8640890 | -5.81 | -0.0421574 | 0.463782 | 0.421662 | 0.421625 | | sim | CE1035 | 6039 | Kr | 20630567 | -0.38 | -0.085451 | 2.07909 | 1.99488 | 1.99364 | | sim | CE1200 | 6471 | pan | 22997071 | -5.79 | -0.348139 | 4.44358 | 3.00821 | 3.81605 | | sim | CE1199 | 5541 | slbo | 9898793 | -5.2 | -0.455406 | 0.717394 | 0.203589 | 0.262045 | | sim | CE1035 | 6034 | tll | 20630490 | -2.12 | -0.0317674 | 2.14552 | 2.18389 | 2.15739 | | sim | CE1064 | 6172 | Trl | 14722644 | -1.63 | -0.33617 | 0.901073 | 0.540058 | 0.612796 | | sim | CE1068 | 6214 | Trl | 12589882 | -1.49 | -0.156298 | 0.955081 | 0.673408 | 0.716796 | | sim | CE1032 | 300 | twi | 17203868 | -6.2 | -0.268195 | 0.809029 | 0.592913 | 0.564329 | | sim | CE1049 | 6103 | Ubx | 2759403 | -2.54 | -0.310968 | 1.93739 | 1.47474 | 1.33022 | | sim | CE1049 | 6129 | Ubx | 2760478 | -3.18 | -0.151917 | 1.93739 | 1.47474 | 1.33022 | +------+--------+------+--------+------------+----------+------------+-----------+-----------+-----------+ All except 2 cases (CE1035/tll, CE1076/dl), all the others are consistent with sim_exp < mel_pt. In these two cases, CE1035 has one mel sub that decreases; CE1076 has one amb sitethat makes sim a stronger one. Compare sim_pt and sim_pt_exp: inc/dec 14/14! cannot invoke compensation, at least not shown through this Potential calculation.
Look for cases of dramatic Potential changes
See figure in Keynote file for the "Change in Potential, rank ordered".
1. take simPt-melPt > 0.3 or simPt-melPt < -0.7
mysql> select * from Potential where (simpt-melpt) > 0.3 or (simpt-melpt) < -0.7; +--------+--------+------+------+----------+---------+-----------+ | region | factor | mell | siml | melpt | simpt | simpt_exp | +--------+--------+------+------+----------+---------+-----------+ | CE1016 | Dfd | 2658 | 2767 | 23.7409 | 24.5887 | 23.7409 | x possibly non-specific PWM | CE1079 | abd-A | 891 | 863 | 3.04907 | 3.65384 | 3.04907 | | CE1049 | abd-A | 2301 | 2378 | 17.6296 | 18.7606 | 17.3765 | | CE1200 | ap | 812 | 824 | 3.47621 | 1.64334 | 2.66249 | * | CE1004 | bcd | 548 | 542 | 1.16911 | 1.54582 | 1.17638 | | CE1099 | br-Z1 | 823 | 723 | 3.25048 | 2.26729 | 3.22339 | | CE1164 | br-Z1 | 3563 | 3450 | 7.51766 | 5.43528 | 7.36482 | | CE1099 | br-Z2 | 823 | 723 | 5.31897 | 3.67533 | 5.31964 | | CE1099 | br-Z3 | 823 | 723 | 3.27923 | 1.67408 | 3.02324 | * | CE1086 | en | 333 | 252 | 2.62556 | 0.89412 | 2.59706 | * | CE1085 | en | 296 | 288 | 1.31849 | 1.87045 | 1.17026 | ? strong compensation | CE1084 | en | 440 | 434 | 2.62796 | 1.48458 | 2.49766 | | CE1179 | en | 114 | 140 | 0.838666 | 1.31887 | 0.795254 | | CE1134 | eve | 547 | 541 | 4.66232 | 3.91912 | 4.66232 | | CE1181 | hb | 1487 | 1490 | 2.37144 | 2.74126 | 2.25778 | | CE1012 | kni | 933 | 919 | 7.79598 | 6.82782 | 7.7697 | | CE1004 | kni | 548 | 542 | 6.03252 | 4.83989 | 5.83096 | | CE1181 | kni | 1487 | 1490 | 7.76955 | 8.90855 | 7.76955 | | CE1055 | pan | 1578 | 1541 | 8.73959 | 7.96172 | 8.72432 | | CE1200 | pan | 812 | 824 | 4.44358 | 3.00821 | 3.81605 | | CE1187 | z | 1332 | 1250 | 5.32862 | 4.40007 | 5.06548 | +--------+--------+------+------+----------+---------+-----------+ Specifically focus on enhancers that have few numbers of binding site instances. Disruption of sites in these cases might have more profound effects. mysql> select * from Potential order by (simpt-melpt)/simpt limit 30; +--------+--------+------+------+-----------+-----------+-----------+ | region | factor | mell | siml | melpt | simpt | simpt_exp | +--------+--------+------+------+-----------+-----------+-----------+ | CE1029 | Kr | 1746 | 1727 | 0.218233 | 0.0212329 | 0.0275388 | | CE1144 | ovo | 1212 | 1214 | 0.320677 | 0.069831 | 0.123308 | | CE1199 | slbo | 327 | 330 | 0.717394 | 0.203589 | 0.262045 | | CE1086 | en | 333 | 252 | 2.62556 | 0.89412 | 2.59706 | | CE1112 | br-Z1 | 348 | 309 | 1.01843 | 0.38619 | 1.01843 | | CE1042 | Antp | 477 | 447 | 1.02567 | 0.409653 | 0.992764 | | CE1169 | en | 220 | 220 | 0.191158 | 0.0802844 | 0.185892 | | CE1190 | z | 119 | 119 | 0.574322 | 0.249853 | 0.760966 | | CE1109 | Kr | 981 | 1026 | 0.121555 | 0.0532071 | 0.0461818 | | CE1042 | Ubx | 477 | 447 | 1.19577 | 0.565161 | 1.19577 | | CE1200 | ap | 812 | 824 | 3.47621 | 1.64334 | 2.66249 | | CE1159 | Trl | 368 | 366 | 0.190474 | 0.0967834 | 0.190474 | | CE1099 | br-Z3 | 823 | 723 | 3.27923 | 1.67408 | 3.02324 | | CE1134 | zen | 547 | 541 | 0.528344 | 0.28739 | 0.528344 | | CE1084 | en | 440 | 434 | 2.62796 | 1.48458 | 2.49766 | | CE1030 | hb | 517 | 520 | 0.0411556 | 0.0238455 | 0.0241783 | | CE1016 | Deaf1 | 2658 | 2767 | 0.0638928 | 0.0379833 | 0.0283291 | | CE1064 | Trl | 1438 | 1431 | 0.901073 | 0.540058 | 0.612796 | | CE1112 | br-Z2 | 348 | 309 | 1.52899 | 0.925284 | 1.46139 | | CE1131 | Ubx | 1093 | 1076 | 0.377224 | 0.231532 | 0.335409 | | CE1069 | Mad | 352 | 352 | 0.0798586 | 0.0490269 | 0.0848505 | | CE1036 | pan | 888 | 868 | 1.32913 | 0.854274 | 1.32913 | | CE1021 | eve | 505 | 476 | 0.195583 | 0.129641 | 0.195583 | | CE1146 | ovo | 1081 | 1092 | 0.554871 | 0.370851 | 0.444743 | | CE1200 | pan | 812 | 824 | 4.44358 | 3.00821 | 3.81605 | | CE1099 | br-Z2 | 823 | 723 | 5.31897 | 3.67533 | 5.31964 | +--------+--------+------+------+-----------+-----------+-----------+
From this calculation I cannot really see evidence for compensation.
Next step to do:
Identify candidate regions for evolution of regulation
experiment to verify some of them?
look for some existing expression dataset?
Do the unbiased prediction for several different factors and summarize the turnover pattern!
see what to do with MK and selection
find out a model to explain the asymmetry--what's the expected turnover rate to explain 1/7 affected and almost all are decreasing? what's the expectation? chances for compensation easier allopatric?
Recalculate MK table
inc dec mel 22 7 sim 4 26 poly 13 33
Pairs of Overlapping binding sites
select a.region, a.fpid, a.factor, b.fpid, b.factor from tfbs a, tfbs b where a.region = b.region and a.fpid != b.fpid and a.tfbs_start between b.tfbs_start and b.tfbs_end; +--------+------+--------+------+--------+ | region | fpid | factor | fpid | factor | +--------+------+--------+------+--------+ | CE1099 | 4472 | br-Z3 | 4471 | br-Z2 | | CE1099 | 4474 | br-Z2 | 4475 | br-Z1 | | CE1099 | 4483 | br-Z2 | 4482 | br-Z3 | | CE1099 | 4486 | br-Z2 | 4485 | br-Z3 | | CE1099 | 4488 | br-Z1 | 4489 | br-Z3 | | CE1010 | 3637 | abd-A | 3638 | Ubx | | CE1010 | 3639 | Ubx | 3640 | abd-A | | CE1010 | 3643 | abd-A | 3644 | Ubx | | CE1010 | 3643 | abd-A | 3646 | Ubx | | CE1010 | 3646 | Ubx | 3644 | Ubx | | CE1010 | 3650 | abd-A | 3649 | Ubx | | CE1010 | 3653 | Ubx | 3654 | abd-A | | CE1010 | 3656 | abd-A | 3655 | Ubx | | CE1010 | 3659 | Ubx | 3658 | abd-A | | CE1010 | 3661 | abd-A | 3660 | Ubx | | CE1010 | 3660 | Ubx | 3659 | Ubx | | CE1010 | 3663 | abd-A | 3662 | Ubx | | CE1030 | 6003 | bcd | 6002 | Kr | | CE1030 | 6004 | cad | 6002 | Kr | | CE1030 | 6004 | cad | 6003 | bcd | | CE1164 | 5446 | br-Z1 | 5448 | br-Z2 | | CE1164 | 5448 | br-Z2 | 5446 | br-Z1 | | CE1164 | 5449 | br-Z2 | 5450 | br-Z3 | | CE1164 | 5449 | br-Z2 | 5451 | br-Z1 | | CE1164 | 5451 | br-Z1 | 5450 | br-Z3 | | CE1164 | 5457 | br-Z1 | 5459 | br-Z3 | | CE1164 | 5458 | br-Z2 | 5459 | br-Z3 | | CE1164 | 5458 | br-Z2 | 5457 | br-Z1 | | CE1164 | 5461 | br-Z3 | 5460 | br-Z2 | | CE1013 | 2269 | kni | 2270 | hb | | CE1013 | 2277 | kni | 2278 | hb | | CE1013 | 2278 | hb | 2277 | kni | | CE1056 | 4118 | Kr | 4117 | bcd | | CE1056 | 4124 | Kr | 4123 | Kr | | CE1056 | 4133 | bcd | 4132 | Kr | | CE1036 | 2466 | pan | 2467 | tin | | CE1036 | 2467 | tin | 2466 | pan | | CE1036 | 2468 | pan | 2469 | tin | | CE1149 | 6375 | eve | 6376 | en | | CE1149 | 6375 | eve | 6374 | zen | | CE1149 | 6374 | zen | 6376 | en | | CE1149 | 6374 | zen | 6375 | eve | | CE1149 | 6380 | en | 6379 | zen | | CE1149 | 6380 | en | 6378 | eve | | CE1149 | 6378 | eve | 6379 | zen | | CE1149 | 6381 | Kr | 6380 | en | | CE1149 | 6381 | Kr | 6379 | zen | | CE1149 | 6381 | Kr | 6378 | eve | | CE1149 | 6384 | eve | 6385 | en | | CE1149 | 6384 | eve | 6383 | zen | | CE1149 | 6383 | zen | 6385 | en | | CE1149 | 6383 | zen | 6384 | eve | | CE1134 | 6354 | eve | 6353 | zen | | CE1050 | 3453 | prd | 3454 | eve | | CE1089 | 5961 | kni | 5960 | hb | | CE1089 | 5912 | kni | 5914 | kni | | CE1089 | 5918 | tll | 5919 | bcd | | CE1089 | 5919 | bcd | 5918 | tll | | CE1089 | 5920 | kni | 5918 | tll | | CE1012 | 5925 | kni | 5924 | tll | | CE1012 | 5925 | kni | 5926 | bcd | | CE1012 | 5926 | bcd | 5924 | tll | | CE1012 | 5941 | bcd | 5940 | kni | | CE1012 | 5951 | kni | 5950 | hb | | CE1004 | 3610 | kni | 3611 | hb | | CE1004 | 3611 | hb | 3610 | kni | | CE1004 | 3613 | kni | 3614 | hb | | CE1004 | 3614 | hb | 3613 | kni | | CE1004 | 3615 | bcd | 3613 | kni | | CE1004 | 3615 | bcd | 3614 | hb | | CE1004 | 3616 | Kr | 3613 | kni | | CE1004 | 3616 | Kr | 3614 | hb | | CE1004 | 3616 | Kr | 3615 | bcd | | CE1004 | 3617 | kni | 3616 | Kr | | CE1004 | 3620 | Kr | 3619 | tll | | CE1004 | 3625 | bcd | 3624 | Kr | | CE1004 | 3626 | hb | 3624 | Kr | | CE1004 | 3626 | hb | 3625 | bcd | | CE1004 | 3630 | bcd | 3629 | tll | | CE1004 | 3633 | Kr | 3632 | hb | | CE1112 | 1201 | br-Z2 | 1199 | br-Z1 | | CE1039 | 2493 | slbo | 2492 | vvl | | CE1000 | 5968 | z | 5967 | Trl | | CE1035 | 6029 | hb | 6028 | Kr | | CE1035 | 6030 | tll | 6028 | Kr | | CE1035 | 6030 | tll | 6029 | hb | | CE1035 | 6032 | hb | 6031 | tll | | CE1035 | 6032 | hb | 6033 | Kr | | CE1035 | 6033 | Kr | 6031 | tll | | CE1035 | 6034 | tll | 6035 | hb | | CE1076 | 6260 | Mad | 6261 | brk | | CE1076 | 6261 | brk | 6260 | Mad | | CE1076 | 6262 | brk | 6263 | Mad | | CE1076 | 6263 | Mad | 6262 | brk | | CE1076 | 6264 | brk | 6265 | Mad | | CE1076 | 6266 | brk | 6267 | Mad | | CE1076 | 6282 | brk | 6281 | Mad | | CE1016 | 987 | Dfd | 986 | Deaf1 | | CE1049 | 6081 | abd-A | 6082 | Ubx | | CE1049 | 6083 | abd-A | 6084 | Ubx | | CE1049 | 6086 | Ubx | 6085 | abd-A | | CE1049 | 6087 | Ubx | 6088 | abd-A | | CE1049 | 6089 | abd-A | 6090 | Ubx | | CE1049 | 6093 | abd-A | 6092 | Ubx | | CE1049 | 6094 | abd-A | 6095 | Ubx | | CE1049 | 6097 | abd-A | 6096 | Ubx | | CE1049 | 6099 | Ubx | 6098 | abd-A | | CE1049 | 6100 | Ubx | 6101 | abd-A | | CE1049 | 6102 | abd-A | 6103 | Ubx | | CE1049 | 6106 | abd-A | 6107 | Ubx | | CE1049 | 6108 | Ubx | 6109 | abd-A | | CE1049 | 6110 | abd-A | 6111 | Ubx | | CE1049 | 6113 | Ubx | 6111 | Ubx | | CE1049 | 6113 | Ubx | 6110 | abd-A | | CE1049 | 6112 | abd-A | 6113 | Ubx | | CE1049 | 6116 | abd-A | 6117 | Ubx | | CE1049 | 6119 | abd-A | 6118 | Ubx | | CE1049 | 6120 | abd-A | 6121 | Ubx | | CE1049 | 6122 | Ubx | 6123 | abd-A | | CE1049 | 6125 | Ubx | 6124 | abd-A | | CE1049 | 6124 | abd-A | 6125 | Ubx | | CE1049 | 6126 | Ubx | 6127 | abd-A | | CE1049 | 6128 | abd-A | 6129 | Ubx | | CE1049 | 6131 | abd-A | 6130 | Ubx | | CE1049 | 6133 | Ubx | 6132 | abd-A | | CE1049 | 6135 | Ubx | 6134 | abd-A | | CE1042 | 4111 | Ubx | 4112 | Antp | | CE1162 | 6417 | bcd | 6416 | hb | | CE1028 | 3680 | tll | 3679 | bcd | | CE1034 | 2448 | hb | 2449 | tll | | CE1034 | 2454 | en | 2453 | twi | | CE1058 | 6146 | Ubx | 6147 | eve | | CE1058 | 6152 | zen | 6150 | eve | | CE1058 | 6152 | zen | 6153 | Ubx | | CE1058 | 6153 | Ubx | 6150 | eve | | CE1058 | 6164 | z | 6163 | Trl | | CE1068 | 6202 | hb | 6201 | Trl | | CE1068 | 6206 | z | 6207 | Trl | | CE1068 | 6210 | z | 6211 | Trl | | CE1068 | 6214 | Trl | 6215 | z | | CE1068 | 6219 | z | 6218 | Trl | | CE1068 | 6221 | hb | 6220 | Trl | | CE1068 | 6221 | hb | 6222 | z | | CE1068 | 6222 | z | 6220 | Trl | | CE1068 | 6228 | z | 6226 | hb | | CE1068 | 6235 | hb | 6234 | z | | CE1068 | 6245 | Trl | 6246 | z | | CE1181 | 5372 | twi | 5371 | hb | | CE1181 | 5375 | twi | 5374 | tll | | CE1181 | 5378 | en | 5377 | hb | | CE1181 | 5377 | hb | 5378 | en | | CE1029 | 265 | hb | 266 | eve | | CE1069 | 753 | tin | 751 | Mad | | CE1200 | 6487 | ap | 6486 | pan | +--------+------+--------+------+--------+ 154 rows in set (9.60 sec)