# User:R. Eric Collins/MBL/Popgen

### From OpenWetWare

## Population Genetics

- assumption of independence among sequences is violated within a population
- Wright-Fisher
- each individual draws its parent from the previous generation at random
- two lineages wait on average 2N generations to coalesce
- geometric distribution of waiting times assumes discrete, non-overlapping generations
- real populations do not necessarily behave this way, but we assume the results can be extrapolated to them
- because all samples contain information about the MRCA (most recent common ancestor), even samples with few individuals can most often recover the same TMRCA as a large sample
- problem when too many samples are collected relative to population size (more than about sqrt(4N)), because the coalescent simplification assumes that no more than 1 coalescence happens per generation
- large samples coalesce on average in about 4N generations
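
The waiting times above can be checked with a small simulation. This is a minimal sketch, assuming a Wright-Fisher population of 2N gene copies (N diploid individuals); the function names and the value 2N = 100 are illustrative choices, not from the notes. The first function draws a random parent for each of two lineages every generation until the parents match. The second sums the standard coalescent expectations, 2N / C(k,2) generations while k lineages remain, which works out to 4N(1 - 1/n) for a sample of n.

```python
import random


def pairwise_coalescence_time(two_n, rng):
    """Generations until two lineages draw the same parent among two_n gene copies."""
    t = 0
    while True:
        t += 1
        # Each lineage independently picks a parent copy in the previous generation.
        if rng.randrange(two_n) == rng.randrange(two_n):
            return t


def expected_tmrca(n_samples, two_n):
    """Coalescent expectation of the TMRCA: sum over k of two_n / C(k, 2)."""
    return sum(two_n / (k * (k - 1) / 2) for k in range(2, n_samples + 1))


rng = random.Random(1)   # fixed seed so the run is repeatable
two_n = 100              # hypothetical population: N = 50 diploids, 2N = 100 copies
reps = 5000
mean_pair = sum(pairwise_coalescence_time(two_n, rng) for _ in range(reps)) / reps

print("simulated mean pairwise time:", mean_pair)
print("expected TMRCA, n = 10:", expected_tmrca(10, two_n))
```

With 2N = 100 the simulated pairwise mean should land near 100 generations. The expected TMRCA for 10 samples is 180 generations, already 90% of the 4N = 200 limit, which is consistent with small samples often recovering the same TMRCA as large ones.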

- mutation-scaled population size
- hard to disentangle: large pop/small mu looks the same as small pop/large mu
- confounding between migration rate and divergence rate
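
The confounding of population size and mutation rate can be made concrete with a tiny calculation. This sketch assumes the usual diploid scaling, theta = 4 * N_e * mu (a haploid organism would use 2 * N_e * mu); the function name and the numbers are hypothetical and chosen only for illustration.

```python
import math


def theta(n_e, mu):
    """Mutation-scaled population size, assuming diploid scaling: 4 * N_e * mu."""
    return 4 * n_e * mu


# Two very different (hypothetical) populations with the same product N_e * mu.
large_pop = theta(n_e=1_000_000, mu=1e-9)   # large population, low mutation rate
small_pop = theta(n_e=10_000, mu=1e-7)      # small population, high mutation rate

# Sequence data only inform theta, so these two scenarios look alike.
same = math.isclose(large_pop, small_pop)
print(large_pop, small_pop, same)
```

Separating the two factors needs outside information, such as an independent estimate of the mutation rate.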

- recombination
- if you know where the recombination breakpoints are, you can/should break the sequence up into separate loci
- NetRecodon can generate simulated sequences with recombination
- in order for recombination to make big differences you need to have VERY high rates of recombination
- at least 1 in 50 per generation in a 2-population model
- when migration is included it doesn't even seem to matter

- unless you have time-series data, don't bother with estimating population size changes through time (e.g. skyline/skyride)
- even a 5x5 migration model is a large model
- simplify model if possible to improve confidence/power
- "test hypotheses... don't go on fishing expeditions"

- to get good estimates, you need: 1) a lot of data 2) a good computer
- people with lots of data often don't run analyses long enough to guarantee convergence
- F_ST and coalescence are based on same/similar assumptions so really one is not better than the other for recent divergence
- shape of population size over time can really affect coalescence, but you need to know how it changed and how that affects parameter estimation
- e.g. bottlenecks, recoveries, expansions, contractions

- when effective population size ~ generations since divergence it can get dicey to separate divergence from migration

- no existing coalescent program takes selection into account

- Felsenstein 2005: beyond ~10 individuals, add another locus rather than more individuals

## Species Tree Estimation

- with long times between speciation, the gene tree matches the species tree with increasing probability
- two ways to coalesce to ((A,B),(C,D)), one way each to coalesce to (((C,D),B),A) and (((C,D),A),B)
- so symmetric trees can be overrepresented
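
The counting argument above can be verified by brute force. This is a minimal sketch, assuming four lineages in a single random-mating population where every pair is equally likely to coalesce next (no species-tree structure); the function names are illustrative. It enumerates every ordered sequence of coalescences and adds up the probability of each resulting topology.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def labeled_histories(lineages):
    """Yield (topology, probability) for each ordered sequence of pairwise merges."""
    if len(lineages) == 1:
        yield lineages[0], Fraction(1)
        return
    pairs = list(combinations(range(len(lineages)), 2))
    for i, j in pairs:
        # Sort the two children so the same topology always gets the same string.
        merged = "(" + ",".join(sorted([lineages[i], lineages[j]])) + ")"
        rest = [x for k, x in enumerate(lineages) if k not in (i, j)]
        for topo, p in labeled_histories(rest + [merged]):
            # Each pair is equally likely to be the next to coalesce.
            yield topo, p / len(pairs)


def topology_probabilities(taxa):
    """Sum the probabilities of all histories that lead to each topology."""
    probs = Counter()
    for topo, p in labeled_histories(list(taxa)):
        probs[topo] += p
    return dict(probs)


probs = topology_probabilities("ABCD")
print(probs["((A,B),(C,D))"], probs["(((C,D),B),A)"])
```

Under these assumptions there are 18 equally likely histories. The symmetric topology ((A,B),(C,D)) is reached by two of them (probability 1/9), while each asymmetric topology such as (((C,D),B),A) is reached by only one (probability 1/18).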

- concatenating gene sequences is not the way to add information; it can lead to statistically inconsistent results
- but with long branch lengths and lots of genes you get enough power that it's ok
- Bootstrap procedure can be positively misled in this situation

- STEM: applies when the only source of variability in single-gene histories is the coalescent process

- species definition? a group of individuals that fit a model of random branching

- questions:
- if generations are ordered, can follow min and max to find the ancestor of all existing species
- migration as horizontal gene transfer?

"the reason to do a Bayesian analysis is not to get a tree but to get the posterior distributions"

HGT versus huge ancestral population size + long coalescent times
"bacteria are special"

## LAMARC

- if there were exponential growth in a population, estimating the mutation rate assuming a constant population size will UNDERESTIMATE the instantaneous mutation rate.
- given the instantaneous mutation rate and a population growth model you can estimate the past mutation rate
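
The second point can be sketched numerically. This example reads the "rate" above as the mutation-scaled parameter theta and assumes an exponential growth model of the form theta(t) = theta_now * exp(-g * t), with t measured backwards in time in mutational units; both the interpretation and the parameterization are assumptions made here, not statements from the notes, and the function name and values are hypothetical.

```python
import math


def theta_in_past(theta_now, growth, t):
    """Theta at time t before the present under an assumed exponential growth model."""
    return theta_now * math.exp(-growth * t)


theta_now = 0.01   # hypothetical present-day (instantaneous) theta
growth = 100.0     # hypothetical growth rate g; positive means the population grew

for t in (0.0, 0.005, 0.01, 0.02):
    print(t, theta_in_past(theta_now, growth, t))
```

With positive growth, theta shrinks going back in time, so a constant-size estimate that averages over the whole history would be expected to fall below the present-day value, in line with the first point.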