genomic diversity and differentiation • heading toward exam 3
genome region of arbitrary size, what can you measure and describe? • what else might you want to know? • if given these data and nothing else, what could you say about them?
learning goals for coalescent theory • how do patterns in sequence data tell us about effective population size? • what if there are multiple populations contributing information? • how is our answer changed if the population changes in size, or if there is selection for a particular allele? • why is this important for understanding phylogenetics (species trees)?
patterns • mutations happen at a more-or-less constant rate at random locations along the genome (assumptions can be tested) • drift, selection, gene flow, recombination, etc. influence how these mutations turn into patterns • we interpret with statistical models - mostly beyond this class
assume genealogy • descent with modification • focus on non-reticulate gene trees • assume every mutation happens at a new genome location • Avise 1987, 1994
neutral model • assume all these mutations have NO effect on fitness (null model) • thus, only drift influences whether an allele goes to fixation • remember: the probability that an allele goes to fixation is its frequency in the population • so every new mutation has a low but equal probability of becoming FIXED (frequency 100%)
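The claim that fixation probability equals current frequency can be checked with a small simulation. This is only a sketch under a simple neutral Wright-Fisher model; the function name, the 20 gene copies, and the starting count of 6 are illustrative assumptions, not values from the lecture.

```python
import random

def fixation_fraction(copies, start_count, replicates, seed=1):
    """Fraction of neutral Wright-Fisher runs in which the allele reaches 100%."""
    rng = random.Random(seed)
    fixed = 0
    for _ in range(replicates):
        count = start_count
        # Drift continues until the allele is lost (0) or fixed (all copies).
        while 0 < count < copies:
            p = count / copies
            # Each gene copy in the next generation picks its parent at random.
            count = sum(1 for _ in range(copies) if rng.random() < p)
        if count == copies:
            fixed += 1
    return fixed / replicates

# Start at frequency 6/20 = 0.3; the fixed fraction should land near 0.3.
print(fixation_fraction(copies=20, start_count=6, replicates=2000))
```

Setting start_count=1 mimics a brand-new mutation, whose chance of fixation is then about 1/copies.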
• so you are collecting data, generally not knowing the history of inheritance or how discrete these units may be (actually discrete, resolvably discrete) • we are working on how to infer (at least as probabilities) how this diversity partitions in space (population), time (frequencies), across the genome (paralogs), and across species (orthologs) • also: copy number variation among loci, among populations, among species [figure labels: SPECIES, POPULATION (DEME), GENE COPY]
how many whales? • Roman and Palumbi 2003 • currently ~10,000 humpback whales; pre-whaling (genetic estimate) maybe ~250,000
how could there be so many? • count whales - currently done using censusing and monitoring of whaling vessels, about 10,000 humpback whales in the Atlantic • collect DNA samples from some of them, and sequence at least one gene (more is better!) • remember π is proportional to effective population size (times mutation rate µ) • we know µ (~0.00000001, i.e. 10^-8, substitutions per DNA replication/reproduction) from fossil and biogeographic data, and we can calculate π (average # of differences between every pair of sequences) • Ne = π/µ, adjusted for inheritance of marker (haploid, maternally inherited mtDNA, versus diploid, biparental nuclear gene) • Ne of humpback whales ~250,000 even though only ~10,000 whales now! • the genetic diversity is older than human whaling efforts and tells us about the past
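The "adjusted for inheritance of marker" step can be made concrete. The sketch below assumes the standard neutral expectations π ≈ 4·Ne·µ for a diploid nuclear gene and π ≈ 2·Nf·µ for maternally inherited mtDNA (Nf = effective number of females); the function name and the input numbers are illustrative, not the whale data.

```python
DIPLOID_NUCLEAR = 4  # assumed scaling: pi ~ 4 * Ne * mu
MTDNA = 2            # assumed scaling: pi ~ 2 * Nf * mu (females only)

def effective_size(pi_per_site, mu_per_site, scale):
    """Solve pi = scale * N * mu for N."""
    return pi_per_site / (scale * mu_per_site)

# Illustrative inputs: 1% average pairwise difference per site, mu = 1e-8.
print(effective_size(0.01, 1e-8, DIPLOID_NUCLEAR))
print(effective_size(0.01, 1e-8, MTDNA))
```

The same π implies a different population size depending on which marker it came from, which is why the inheritance adjustment matters.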
AUTOSOMES: ALL 4 COPIES CAN CONTRIBUTE MUTATIONS • MTDNA: ONE COMPONENT CONTRIBUTES MUTATIONS • WHEN PEOPLE REFER TO THE SMALLER EFFECTIVE SIZE OF THE MITOCHONDRIAL GENOME, THEY ARE REFERRING TO COPY NUMBER, NOT THE NUMBER OF INDIVIDUALS IN THE POPULATION!
another look at Ne: drift • neutrality: mean Time to Most Recent Common Ancestor (tmrca) = time to homozygosity = -4Ne[p ln p + (1-p) ln(1-p)] generations • proportional to Ne; for p = 0.5, ~2.77Ne generations • heterozygosity declines by a fraction 1/(2Ne) per generation • compare nuclear gene vs. mitochondrial gene...? • DO NOT MEMORIZE THIS
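The two quantities on this slide can be evaluated directly. This is a sketch with hypothetical function names; it plugs numbers into the formulas as written above.

```python
import math

def mean_time_to_homozygosity(p, ne):
    """-4*Ne*[p ln p + (1-p) ln(1-p)] generations, for 0 < p < 1."""
    return -4.0 * ne * (p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def heterozygosity_after(h0, ne, generations):
    """Heterozygosity left after losing a fraction 1/(2Ne) each generation."""
    return h0 * (1.0 - 1.0 / (2.0 * ne)) ** generations

# With Ne = 1 the first result is in units of Ne (about 2.77 at p = 0.5).
print(mean_time_to_homozygosity(0.5, 1))
# Doubling Ne doubles the expected time.
print(mean_time_to_homozygosity(0.5, 2))
print(heterozygosity_after(0.5, 50, 1))
```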
basic summary stats • S, number of segregating sites (how many below?) • π, average number of differences among sequences (what is it below?) • ηi, folded site pattern: how many segregating sites appear i times?
caccgtattagcattatgctggtata
cgccgtactggcattatgctggtata
caccgtactagcattgtgctggtatg
caccgtactagcattatgccggtatg
cactgtactggcattatgctggtgta
cactgtactggcattatgctggtata
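To check hand counts for the six sequences on this slide, the three statistics can be computed as follows. The function name is hypothetical, and the folded count at a site is taken as the count of its rarest base, which is exact only when a site has two states.

```python
from collections import Counter
from itertools import combinations

seqs = [
    "caccgtattagcattatgctggtata",
    "cgccgtactggcattatgctggtata",
    "caccgtactagcattgtgctggtatg",
    "caccgtactagcattatgccggtatg",
    "cactgtactggcattatgctggtgta",
    "cactgtactggcattatgctggtata",
]

def summary_stats(seqs):
    """Return (S, pi, folded) for equal-length aligned sequences."""
    folded = Counter()
    for column in zip(*seqs):
        counts = Counter(column)
        if len(counts) > 1:  # the site is segregating
            folded[min(counts.values())] += 1
    s = sum(folded.values())
    pairs = list(combinations(seqs, 2))
    # pi: mean number of differing sites over all pairs of sequences.
    pi = sum(sum(a != b for a, b in zip(x, y)) for x, y in pairs) / len(pairs)
    return s, pi, dict(folded)

s, pi, folded = summary_stats(seqs)
print("S =", s)
print("pi =", round(pi, 3))
print("folded counts =", folded)
```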
standard coalescent • a sample of size n has n-1 coalescent events • Ti = time during which i lineages are extant; E[Ti] = 2/(i(i-1)), measured in units of N • genetic (label) differences have no fitness consequence • single population • constant population size (for now) • THE TREE IS UNKNOWN; THE ANALYSIS ASKS WHICH TREES FIT THE DATA AND WHAT THAT TELLS US ABOUT THE INTERVALS BETWEEN BRANCH NODES
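Summing the expected interval lengths gives the expected tree height and total branch length. A short sketch, with hypothetical function names and times in the same scaled units as E[Ti] on this slide:

```python
def expected_interval(i):
    """Expected time during which i lineages are extant: 2 / (i (i - 1))."""
    return 2.0 / (i * (i - 1))

def expected_tmrca(n):
    """Expected tree height: sum of E[Ti] for i = 2..n, equal to 2 (1 - 1/n)."""
    return sum(expected_interval(i) for i in range(2, n + 1))

def expected_total_length(n):
    """Expected total branch length: each interval counts once per lineage."""
    return sum(i * expected_interval(i) for i in range(2, n + 1))

for n in (2, 6, 50):
    print(n, round(expected_tmrca(n), 3), round(expected_total_length(n), 3))
```

The height approaches 2 as n grows, and the last step (two lineages left) alone has expectation 1, which is half of that limit.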
mutation • # mutations (K) is Poisson distributed on the genealogy, based on total time t = Ttotal • Poisson process: stochastic, each time interval is independent, waiting time between events is exponentially distributed (but with many branches, the opportunity within an interval is multiplied) • Applications • The classic example of a phenomenon well modelled by a Poisson process is deaths due to horse kick in the Prussian army, as shown by Ladislaus Bortkiewicz in 1898. The following examples are also well modelled by the Poisson process: • Requests for telephone calls at a switchboard. • Goals scored in a soccer match. • Requests for individual documents on a web server. • Particle emissions due to radioactive decay by an unstable substance. In this case the Poisson process is non-homogeneous in a predictable manner - the emission rate declines as particles are emitted.
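The link between exponential waiting times and a Poisson count can be seen by simulating it. In this sketch the rate and total time are arbitrary illustrative values, and the function name is hypothetical; the mean number of events should come out near rate × total time.

```python
import random

def mutations_on_tree(rate, total_time, rng):
    """Count events of a Poisson process by adding exponential waiting times."""
    k = 0
    t = rng.expovariate(rate)  # waiting time until the first mutation
    while t <= total_time:
        k += 1
        t += rng.expovariate(rate)  # waiting time until the next one
    return k

rng = random.Random(42)
draws = [mutations_on_tree(rate=2.0, total_time=3.0, rng=rng) for _ in range(20000)]
mean_k = sum(draws) / len(draws)
print(mean_k)
```

A longer total branch length (more lineages, or longer intervals) raises the expected count in direct proportion.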
Ewens distribution (DO RECOGNIZE THIS) • under the neutral model, mutations arise at rate µ and are lost or drift to higher frequency (frequency proportional to AGE) • thus we’ve come to expect a certain distribution of allele frequencies, e.g. p=q is unlikely • generally a small number of very common alleles, and an increasing number of very rare alleles • DO NOT MEMORIZE THIS
um, huh? • here is the context: DRIFT causes some alleles to increase in frequency, some to be lost (moving forward in time) • moving back in time from NOW, the same process can explain the frequency of alleles in the context of how individuals are related • this means we have expectations for how long it takes for a sample of sequences from NOW to coalesce to a common ancestor in the past (about 2 times effective population size) • one reason two separate evolutionary populations may not APPEAR completely different: it takes time for ancestral diversity to sort out [figure labels: (most recent common ancestor), (now)]
>1 population? • let's imagine two populations that rarely exchange migrants but have a common ancestry in the recent evolutionary past • drift (moving forwards in time from the ancestral population) leads to many copies descended from one particular allele, a different one in each population • how do we know there are two populations? [figure labels: this pop descended from ‘red allele’ ancestor; this pop descended from ‘green allele’ ancestor]
evolutionary biology: the populations tell us who they are! • shown at right are two LOCATIONS, not necessarily two distinct populations • may be one evolutionary population • however: if one is 90% A1 and 10% A2, the other is 10% A1 and 90% A2 • that means overall 50% A1, 50% A2 • should see 25% A1A1 homozygotes, 25% A2A2 if Hardy-Weinberg fits • instead see overall ~41% A1A1, 41% A2A2 because we are ‘pooling’ 2 diverged populations
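The 41% figure follows from averaging genotype frequencies over the two locations. The sketch below assumes equally sized populations that are each in Hardy-Weinberg proportions internally; the function names are hypothetical.

```python
def pooled_genotypes(freqs_a1):
    """Average (A1A1, A1A2, A2A2) frequencies over equally sized populations."""
    k = len(freqs_a1)
    hom_a1 = sum(p * p for p in freqs_a1) / k
    het = sum(2 * p * (1 - p) for p in freqs_a1) / k
    hom_a2 = sum((1 - p) ** 2 for p in freqs_a1) / k
    return hom_a1, het, hom_a2

def expected_if_one_population(freqs_a1):
    """Hardy-Weinberg expectation computed from the pooled allele frequency."""
    p = sum(freqs_a1) / len(freqs_a1)
    return p * p, 2 * p * (1 - p), (1 - p) ** 2

print(pooled_genotypes([0.9, 0.1]))            # what we observe when pooling
print(expected_if_one_population([0.9, 0.1]))  # what one population would show
```

The gap between the two results is the excess homozygosity that signals hidden population structure.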
excess of common alleles • excess homozygosity could mean that two evolutionary populations are being analyzed as though they are one • so we don’t trust “even” allele frequencies: now think frequency dependent selection, balancing selection, or pooling of multiple evolutionary populations
neutral theory: sort of like Goldilocks story • excess common alleles = positive selection or long-term decline: η1=0 η2=1 η3=2 η4=3 (2, +1 for “η5”) • excess rare alleles = purifying selection or population expansion: η1=3 η2=2 η3=1 η4=0 • just right = “neutral”: η1=2 η2=2 η3=1 η4=1
learning goals for coalescent theory • how do patterns in sequence data tell us about effective population size? • what if there are multiple populations contributing information? • how is our answer changed if the population changes in size, or if there is selection for a particular allele? • why is this important for understanding phylogenetics (species trees)?
why is this important for understanding phylogenetics (species trees)? • coalescent theory lets us test our assumptions of how DNA sequences evolve before we use them to reconstruct phylogeny • coalescent theory explains why recently-diverged populations may not yet have synapomorphies despite already being on different evolutionary paths • this model gives us basis for estimating time to ancestor of ANY two sequences
DNA characters are just like phenotypic characters • 4 character states A,C,T,G plus information in insertion-deletion, gene copy number, etc. • same concerns of homology and shared descent apply
“mitochondrial Eve” sets up misunderstanding • every locus sampled now has a point in the past where all current alleles coalesce to a common ancestor • in recently diverged species, diversity is often older than the species [figure label: human population isolated ~200kya]
understanding coalescence • 1. larger effective size (Ne), more diversity • 2. when the time between branching events is short relative to Ne, it is more likely that allelic diversity is older than the branching event [figure labels: isolation, isolation, Ne]
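Point 2 can be put in numbers. As a sketch, assume a diploid population of constant size Ne, where two gene copies share a parent copy with probability 1/(2Ne) per generation; the function name and example values are illustrative.

```python
def prob_not_coalesced(generations, ne):
    """Chance that two gene copies still have distinct ancestors after t generations."""
    return (1.0 - 1.0 / (2.0 * ne)) ** generations

# Same 1,000 generations between branching events, different Ne.
for ne in (100, 1000, 10000):
    print(ne, round(prob_not_coalesced(1000, ne), 3))
```

When the interval is short relative to Ne, the probability stays close to 1, so the diversity sampled today most likely predates the branching event.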
"This coalescence does not mean that the population originally consisted of a single individual with that ancestral allele. It just means that particular individual’s allele was the one that, out of all the alleles present at that time, later became fixed in the population."
phylogeny inference • 2 basic approaches: algorithm vs. criterion • “neighbor joining” shown in book is an algorithm that generates a single tree by finding shortest “distances” (proportion of differences at nucleotide sites) • algorithm approaches do not help identify our uncertainty: one answer comes out, whether well supported or not
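The "distance" used as input here is just the proportion of sites that differ between two aligned sequences. The sketch below covers only that distance step, not neighbor joining itself; the function name and the three short sequences are made up for illustration.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

# Hypothetical toy alignment.
toy = {"sp1": "acgtacgt", "sp2": "acgtacga", "sp3": "ccgtaaga"}
for (name1, s1), (name2, s2) in combinations(toy.items(), 2):
    print(name1, name2, p_distance(s1, s2))
```

A distance algorithm works from this table of pairwise values alone, which is why it returns one tree with no statement of uncertainty.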
criterion-based phylogeny • 30 tips results in 8.7 x 10^36 possible trees • computer search necessary
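The tree count for 30 tips matches the standard formula for unrooted, fully bifurcating trees, (2n-5)!! = 3 × 5 × ... × (2n-5). That the slide counts unrooted trees is an inference from the number, and the function name below is hypothetical.

```python
def unrooted_tree_count(tips):
    """Number of unrooted bifurcating trees for n >= 3 tips: (2n - 5)!!"""
    count = 1
    for k in range(3, 2 * tips - 4, 2):
        count *= k
    return count

for n in (4, 8, 10, 30):
    print(n, unrooted_tree_count(n))
```

Eight tips already give more than 10,000 trees, so exhaustive comparison stops being practical very quickly.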
3 of >10,000 possible trees which fits data best? depends on the criterion
11 changes • 11 changes • 7 changes = most parsimonious of these 3
criteria used in phylogeny • parsimony - the fewest # of changes indicates the most acceptable tree topology • maximum likelihood - both topology (arrangement of branches) and branch lengths are iteratively searched for tree(s) that fit a statistical model of molecular evolution (e.g. transitions > transversions) • Bayesian - criterion is still likelihood-based, search strategy is different (sums result over many similar-likelihood trees)
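For the parsimony criterion, the number of changes a tree requires can be counted with Fitch's algorithm, one standard method (the slides do not say which was used). In this sketch, trees are nested tuples of tip names, and the alignment and both trees are made-up examples.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):  # a tip: its observed state, no changes
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch_site(left, states)
    right_set, right_changes = fitch_site(right, states)
    shared = left_set & right_set
    if shared:  # children agree on some state: no extra change needed
        return shared, left_changes + right_changes
    return left_set | right_set, left_changes + right_changes + 1

def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over all sites."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )

alignment = {"A": "AAG", "B": "AAG", "C": "CAT", "D": "CAT"}  # hypothetical data
tree1 = (("A", "B"), ("C", "D"))
tree2 = (("A", "C"), ("B", "D"))
print(parsimony_score(tree1, alignment), parsimony_score(tree2, alignment))
```

The tree with the lower score is preferred under parsimony; likelihood and Bayesian methods replace this count with a probability under a model of sequence change.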
why different criteria? • we are making our assumptions explicit for inference of the unknown • different scientists have different backgrounds that drive their assumptions • using multiple methods/criteria lets us test how safe our assumptions are • next time: how do we decide if a tree hypothesis is strongly supported?