Latest revision as of 12:49, 14 January 2011

We've moved to http://drummondlab.org.

This site will not be updated.


Introduction

In 1994, Hiroshi Akashi developed an elegant test for translational accuracy selection on coding sequences [1]. Some codons, particularly those corresponding to abundant tRNAs, are translated more accurately than others. Under selection for translational accuracy, usage of those more-accurate synonymous codons will be favored at important (e.g., evolutionarily conserved) amino-acid sites, where translation errors could disrupt protein folding or function. At less-important (e.g., evolutionarily variable) amino-acid sites, errors are presumably more tolerable, and therefore more-accurate codons are less likely to be favored. Akashi's test asks:

1. How strong is the association between preferred codons and conserved amino acids, controlling for differences between amino acids and between genes?
2. How likely is that association to have occurred by chance?

Akashi's test on a single gene

Akashi's test is technically very simple to carry out. The hard part is just tabulating data. Suppose you have two aligned codon sequences (a target sequence and an orthologous sequence) and a list of preferred codons. From the aligned codon sequences, build a 2x2 contingency table with entries [math]\displaystyle{ a }[/math], [math]\displaystyle{ b }[/math], [math]\displaystyle{ c }[/math], and [math]\displaystyle{ d }[/math] like this:

AA=Ser	Conserved	Variable
Preferred	[math]\displaystyle{ a }[/math]	[math]\displaystyle{ b }[/math]
Unpreferred	[math]\displaystyle{ c }[/math]	[math]\displaystyle{ d }[/math]

for each amino acid. You'll usually have 18 tables; W and M have no synonymous codon alternatives and therefore don't contribute to Akashi's test.

[math]\displaystyle{ a }[/math] = the number of codons in your target sequence that encode amino acid AA, are PREFERRED, and encode an AA which is unchanged (CONSERVED) in the orthologous sequence
[math]\displaystyle{ b }[/math] = the number of codons in your target sequence that encode amino acid AA, are PREFERRED and encode an AA which is different (VARIABLE) in the orthologous sequence
[math]\displaystyle{ c }[/math] = the number of codons in your target sequence that encode amino acid AA, are UNPREFERRED and encode an AA which is unchanged (CONSERVED) in the orthologous sequence
[math]\displaystyle{ d }[/math] = the number of codons in your target sequence that encode amino acid AA, are UNPREFERRED and encode an AA which is different (VARIABLE) in the orthologous sequence
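
The tabulation described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `tabulate`, the decision to skip codons with gaps, ambiguous bases, or stop codons, and the preferred-codon set in the usage line are all hypothetical choices, not part of Akashi's description.

```python
from collections import defaultdict

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def tabulate(target, ortholog, preferred):
    """Return {amino_acid: [a, b, c, d]} for two aligned codon sequences.

    `target` and `ortholog` are aligned in-frame nucleotide strings;
    `preferred` is a set of preferred codons (assumed to be supplied by the user).
    """
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for pos in range(0, min(len(target), len(ortholog)) - 2, 3):
        t_codon = target[pos:pos + 3].upper()
        o_codon = ortholog[pos:pos + 3].upper()
        t_aa = CODON_TABLE.get(t_codon)
        o_aa = CODON_TABLE.get(o_codon)
        # Assumption: skip gaps, ambiguous bases, and stop codons entirely.
        if t_aa is None or o_aa is None or t_aa == "*" or o_aa == "*":
            continue
        # W and M have a single codon each, so they carry no information.
        if t_aa in "WM":
            continue
        is_preferred = t_codon in preferred
        is_conserved = t_aa == o_aa
        # Cell order: a (pref, cons), b (pref, var), c (unpref, cons), d (unpref, var).
        cell = (0 if is_preferred else 2) + (0 if is_conserved else 1)
        tables[t_aa][cell] += 1
    return dict(tables)


# Toy usage with a made-up preferred-codon set.
example_tables = tabulate("GCCGCTTCCATGAAG", "GCAGGTTCTATGAAA", {"GCC", "TCC", "AAG"})
```

Each value in the returned dictionary is one 2x2 table in the order a, b, c, d, so it can be passed directly to whatever computes the statistics.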

Now the statistics. Assuming no association -- that is, assuming that the probability of a codon being preferred (which we designate [math]\displaystyle{ p }[/math]) is independent of the probability that it encodes a conserved amino acid (which we designate [math]\displaystyle{ q }[/math]) -- we can write down estimates for the expected value and variance of [math]\displaystyle{ a }[/math], [math]\displaystyle{ E(a) }[/math] and [math]\displaystyle{ V(a) }[/math]:

[math]\displaystyle{ n = a + b + c + d\! }[/math]

[math]\displaystyle{ \hat{p} = \frac{a + b}{n} }[/math]

[math]\displaystyle{ \hat{q} = \frac{a + c}{n} }[/math]

[math]\displaystyle{ \hat{E}(a) = n \hat{p}\hat{q} = \frac{(a+b)(a+c)}{n} }[/math]

[math]\displaystyle{ \hat{V}(a) = \frac{n}{n-1} n\hat{p}(1-\hat{p}) \hat{q}(1-\hat{q}) = \frac{(a+b)(a+c)(b+d)(c+d)}{n^2(n-1)} }[/math]

With the mean and variance, we can write down a [math]\displaystyle{ Z }[/math]-score for one table:

[math]\displaystyle{ \hat{Z} = \frac{a - \hat{E}(a)}{\sqrt{\hat{V}(a)}} }[/math]
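
As a minimal sketch of the single-table formulas above, the following Python computes the estimated expectation, variance, and Z-score from the four counts. The function names and the example counts are illustrative assumptions; returning NaN for a degenerate table (zero variance) is one possible convention, not something the test itself prescribes.

```python
import math


def table_stats(a, b, c, d):
    """Estimated E(a) and V(a) for one 2x2 table under no association."""
    n = a + b + c + d
    if n < 2:
        raise ValueError("need at least two codons to estimate the variance")
    expected = (a + b) * (a + c) / n
    variance = (a + b) * (a + c) * (b + d) * (c + d) / (n * n * (n - 1))
    return expected, variance


def z_score(a, b, c, d):
    """Z-score for one table; NaN if a row or column total is zero."""
    expected, variance = table_stats(a, b, c, d)
    if variance == 0:
        return float("nan")
    return (a - expected) / math.sqrt(variance)


# Made-up counts: 30 preferred/conserved, 10 preferred/variable,
# 20 unpreferred/conserved, 20 unpreferred/variable.
expected, variance = table_stats(30, 10, 20, 20)  # expected is 40 * 50 / 80 = 25.0
z = z_score(30, 10, 20, 20)
```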

And because a [math]\displaystyle{ Z }[/math]-score only gives us a measure of statistical significance (question #2 above), we also want an effect size -- the magnitude of the association between preferred codons and conserved sites -- which we can compute as an odds ratio: the odds that a preferred codon sits at a conserved site ([math]\displaystyle{ a/b }[/math]) divided by the odds that an unpreferred codon sits at a conserved site ([math]\displaystyle{ c/d }[/math]). The odds ratio answers question #1 above. A simple estimate of the odds ratio [math]\displaystyle{ \psi }[/math], the sample odds ratio, is given by:

[math]\displaystyle{ \hat{\psi} = \frac{ad}{bc} }[/math].
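
A short Python sketch of the single-table odds ratio follows. The handling of zero cells (returning infinity or NaN) is an assumption made here for illustration; the formula itself is undefined when b or c is zero, which happens easily for a single amino acid in a single gene.

```python
import math


def odds_ratio(a, b, c, d):
    """Sample odds ratio ad/bc for one 2x2 table."""
    numerator = a * d
    denominator = b * c
    if denominator == 0:
        # Assumed convention: undefined (NaN) if both products vanish,
        # otherwise infinite.
        return float("nan") if numerator == 0 else math.inf
    return numerator / denominator


# Made-up counts: (30 * 20) / (10 * 20) gives an odds ratio of 3.0,
# i.e. preferred codons are enriched at conserved sites.
psi = odds_ratio(30, 10, 20, 20)
```

Values above 1 indicate that preferred codons are enriched at conserved sites; values below 1 indicate the opposite.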

Akashi's test on multiple genes

Estimating [math]\displaystyle{ Z }[/math] and [math]\displaystyle{ \psi }[/math] for a single amino acid in a single gene is perhaps of limited interest. How do we combine tables so that we can ask questions like, "What is the overall association between preferred codons and conserved sites across the genome?" or, "How statistically significant is the preferred/conserved association for alanine compared to glycine?"

To combine tables, we use the Mantel-Haenszel procedure. The basic principle is that tables are independent. Expectations add, variances add, and observed values add. That is, indexing tables by [math]\displaystyle{ i }[/math], with table [math]\displaystyle{ i }[/math] given by

AA=X	Conserved	Variable
Preferred	[math]\displaystyle{ a_i }[/math]	[math]\displaystyle{ b_i }[/math]
Unpreferred	[math]\displaystyle{ c_i }[/math]	[math]\displaystyle{ d_i }[/math]

and [math]\displaystyle{ \hat{E}(a_i) }[/math] and [math]\displaystyle{ \hat{V}(a_i) }[/math] computed as before for each table, we have the combined [math]\displaystyle{ Z }[/math]-score

[math]\displaystyle{ \hat{Z} = \frac{\sum_i{a_i} - \sum_i{\hat{E}(a_i)}}{\sqrt{\sum_i{\hat{V}(a_i)}}} }[/math]

and the Mantel-Haenszel estimator for the common odds ratio (i.e., the single odds ratio [math]\displaystyle{ \psi }[/math] assumed to underlie all tables being analyzed),

[math]\displaystyle{ \hat{\psi} = \frac{\sum_i{\frac{a_i d_i}{n_i}}}{\sum_i{\frac{b_i c_i}{n_i}}} }[/math]

With enough tables, we assume that [math]\displaystyle{ \hat{Z} }[/math] follows the standard normal distribution, whose cumulative distribution function is [math]\displaystyle{ \Phi }[/math], so that a one-sided [math]\displaystyle{ P }[/math]-value for a positive association can be computed as [math]\displaystyle{ 1 - \Phi(\hat{Z}) }[/math].
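
The combination step can be sketched in Python as follows. The function name, the choice to skip tables with fewer than two codons, and the example tables are assumptions for illustration. The tail probability is computed here as the upper tail of the standard normal, 1 - Φ(Z), which is the one-sided probability of seeing an association at least this positive by chance.

```python
import math


def mantel_haenszel(tables):
    """Combine 2x2 tables given as (a, b, c, d) tuples.

    Returns (combined Z, Mantel-Haenszel odds ratio, upper-tail P-value).
    """
    sum_a = sum_e = sum_v = 0.0
    or_num = or_den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # assumption: tables this small carry no usable information
        sum_a += a
        sum_e += (a + b) * (a + c) / n
        sum_v += (a + b) * (a + c) * (b + d) * (c + d) / (n * n * (n - 1))
        or_num += a * d / n
        or_den += b * c / n
    if sum_v == 0 or or_den == 0:
        raise ValueError("tables are degenerate; statistics are undefined")
    z = (sum_a - sum_e) / math.sqrt(sum_v)
    psi = or_num / or_den
    # Upper tail of the standard normal: 1 - Phi(z).
    p_upper = 0.5 * math.erfc(z / math.sqrt(2))
    return z, psi, p_upper


# Two made-up tables that each have a single-table odds ratio of 3.
z, psi, p_upper = mantel_haenszel([(30, 10, 20, 20), (15, 5, 10, 10)])
```

Note that no continuity correction is applied in this sketch, matching the formulas above; some library implementations apply one by default, so their statistics may differ slightly.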

Akashi's test equates statistical significance of a preferred-codon/conserved-site association with the influence of translational accuracy selection. One simply asks how statistically significant the combined [math]\displaystyle{ \hat{Z} }[/math] is.

Implementation

The open-source statistical package R includes an implementation of the Mantel-Haenszel test (mantelhaen.test), which is sufficient to carry out Akashi's test.

Examples

Coming...

References

1. Akashi H. Synonymous codon usage in Drosophila melanogaster: natural selection and translational accuracy. Genetics. 1994 Mar;136(3):927-35. DOI:10.1093/genetics/136.3.927. PubMed ID: 8005445.


Drummond:Akashi's Test: Difference between revisions
