# User:R. Eric Collins/MBL/Codon

### From OpenWetWare

## Codon models

- "a model is an INTENTIONAL simplification..." Daniel L. Hartl

- parameter estimation
- hypothesis testing
- site identification

- t is mean # of substitutions per CODON site
- t=3 indicates saturation

- assumptions/simplifications about molecular evolution:
- codon frequencies unbiased (1/61 usage)
- all point mutations equally likely (transition/transversion bias)
- single/multiple mutations at one point

- models
- Nei and Gojobori 1986 should be avoided at all costs
- others should be used with caution BASED ON KNOWING YOUR DATA
- Goldman-Yang (GY)
- proportional to target codon
- Muse-Gaut (MG)
- proportional to target nucleotide
- no effect of context (mammals yes, Drosophila no...)
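
The difference between the two model families can be sketched in a few lines. This is a minimal illustration, not the implementation used by any particular package: the function `rate`, the helpers `gy_freq` and `mg_freq`, and the uniform frequencies are all assumptions made for the example. Rates are given only up to an overall scaling constant, and changes at more than one position get rate zero.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codon -> amino acid ("*" marks a stop codon).
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SENSE = [c for c, aa in CODE.items() if aa != "*"]  # the 61 sense codons
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def rate(src, dst, kappa, omega, target_freq):
    """Unscaled instantaneous rate from codon src to codon dst.

    target_freq(dst, pos) is a hypothetical callback that returns the
    frequency the rate is proportional to (codon or nucleotide).
    """
    diffs = [k for k in range(3) if src[k] != dst[k]]
    # Only single-nucleotide changes between sense codons are allowed.
    if len(diffs) != 1 or CODE[src] == "*" or CODE[dst] == "*":
        return 0.0
    pos = diffs[0]
    r = target_freq(dst, pos)
    if (src[pos], dst[pos]) in TRANSITIONS:
        r *= kappa  # transition/transversion bias
    if CODE[src] != CODE[dst]:
        r *= omega  # nonsynonymous change
    return r


# Assumed uniform frequencies, purely for illustration.
codon_pi = {c: 1.0 / len(SENSE) for c in SENSE}
nuc_pi = {b: 0.25 for b in BASES}


def gy_freq(dst, pos):
    """GY-style: proportional to the frequency of the target codon."""
    return codon_pi[dst]


def mg_freq(dst, pos):
    """MG-style: proportional to the frequency of the target nucleotide."""
    return nuc_pi[dst[pos]]


print(rate("TTT", "TTC", 2.0, 0.5, gy_freq))  # synonymous transition
print(rate("TTT", "TTA", 2.0, 0.5, mg_freq))  # nonsynonymous transversion
```

With unequal frequencies the two callbacks give different rates for the same change, which is where the choice between GY and MG starts to matter.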

- likelihood
- of the data at a site: sum, over every possible codon, of the product of that codon's equilibrium frequency and the likelihoods of transition from it to the observed codons
- of entire alignment: product of likelihoods at each site
- in some examples, if we average over the tree, we do NOT detect positive selection
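
The site and alignment likelihoods described above can be sketched on a toy case. The example assumes a two-state alphabet instead of 61 codons, a single common ancestor with every observed sequence descending directly from it, and a transition probability table `P` that has already been computed for the branch length. All names and numbers are made up for illustration. A real tree would need a recursive (pruning) calculation over internal nodes.

```python
import math


def site_likelihood(observed, pi, P):
    """Sum over ancestral states x of pi[x] * product of P[x][obs]."""
    total = 0.0
    for x, freq in pi.items():
        term = freq
        for obs in observed:
            term *= P[x][obs]
        total += term
    return total


def alignment_log_likelihood(sites, pi, P):
    """Sites are independent, so log-likelihoods add (likelihoods multiply)."""
    return sum(math.log(site_likelihood(s, pi, P)) for s in sites)


# Toy equilibrium frequencies and transition probabilities (assumed values).
pi = {"A": 0.5, "B": 0.5}
P = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.1, "B": 0.9}}
sites = [("A", "A"), ("A", "B")]  # two sequences, two sites

for s in sites:
    print(s, site_likelihood(s, pi, P))
print(alignment_log_likelihood(sites, pi, P))
```

Working in log space avoids numerical underflow when the alignment has many sites.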

- Problem: averaging over a pair has very low power if the questions are about “when” or “where”!
- Solution: Phylogenetic estimation of selection pressure
- variable ω over branches (when?)
- variable ω over sites (where?)
- variable ω over branches and sites (when and where?)

- branch models (time)
- episodic large-scale changes
- site-wise: usable if you have hundreds of sequences

- M1a + M2a (plus LRT) is very robust
- best model FOR WHAT? 'best' model overall may not be the best for my purposes
- M1a doesn't allow positive selection
- M2a does allow for positive selection
- so LRT between them can demonstrate positive selection
- for most well-behaved datasets, empirical codon bias (pi) can be used
- M2a is not the best explanation of the data, but its LRT behaves better because it is more robust to changes in assumptions
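
A minimal sketch of the likelihood ratio test between a null and an alternative model, assuming the two maximized log-likelihoods have already been obtained elsewhere. It assumes 2 degrees of freedom, the usual convention for M1a vs M2a since M2a adds two parameters, and uses the fact that the upper tail of a chi-square distribution with 2 degrees of freedom is exactly exp(-x/2). The function name and the log-likelihood values are hypothetical.

```python
import math


def lrt_df2(lnl_null, lnl_alt):
    """Return (test statistic, naive chi-square p-value with df = 2)."""
    # The alternative nests the null, so the statistic should not be
    # negative; clamp small negative values caused by optimizer noise.
    stat = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # Survival function of chi-square with 2 degrees of freedom.
    p_value = math.exp(-stat / 2.0)
    return stat, p_value


# Made-up log-likelihoods for a null (e.g. M1a) and alternative (e.g. M2a).
stat, p = lrt_df2(lnl_null=-1000.0, lnl_alt=-995.0)
print(stat, p)
```

A small p-value here means the model that allows positive selection fits significantly better than the one that does not.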

- M7 + M8
- M7 uses a beta distribution for ω (0..1) that explains 'boring' stuff
- M8 uses a beta distribution for ω (0..1) plus an extra class with ω > 1 to account for positive selection
- M8 is 'best' model but is more sensitive to changes in assumptions
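
As a rough illustration of how an M8-style mixture behaves, the sketch below computes the site-averaged ω from a beta class and one extra class. The parameter names (`p0` for the proportion of sites in the beta class, `p` and `q` for the beta shape parameters, `omega_s` for the extra class) and the values are assumptions for the example, not estimates from any dataset.

```python
def beta_mean(p, q):
    """Mean of a beta(p, q) distribution on (0, 1)."""
    return p / (p + q)


def m8_mean_omega(p0, p, q, omega_s):
    """Site-averaged omega: beta class weighted by p0, extra class by 1 - p0."""
    return p0 * beta_mean(p, q) + (1.0 - p0) * omega_s


# Assumed values: 90% of sites under purifying selection, 10% with omega = 3.
mean_omega = m8_mean_omega(p0=0.9, p=0.5, q=2.0, omega_s=3.0)
print(mean_omega)  # roughly 0.48
```

Even with a tenth of the sites at ω = 3, the average over sites stays well below 1, which is why averaging over sites can miss positive selection.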

- chi-squared & boundary problem
- the LRT statistic does not follow the χ2 distribution (the null model sits on the boundary of the parameter space)
- no other mixture distribution works well
- too costly to do parametric bootstrapping/simulations for each LRT
- so using chi-square anyway (naively), how bad could it be?
- even still, the LRT is conservative AND powerful
- there's a window of sequence divergence where these tests work; outside it (very short or very long branches) they may not

- examples
- need to know data, but how many possible types of "data" are there?
- how many ways are there for a protein family to evolve?

- GFP
- color variation, purifying selection due to antagonistic host interactions

- proteorhodopsin
- Listeria
- looking at variations in omega along the tree, isolating groups/clusters of genes/networks under selection

- average of tree
- average of sites
- variation in sites and tree