
genomic diversity and differentiation



Presentation Transcript


  1. genomic diversity and differentiation • heading toward exam 3

  2. given a genome region of arbitrary size, what can you measure and describe? • what else might you want to know? • if given these data and nothing else, what could you say about them?

  3. learning goals for coalescent theory • how do patterns in sequence data tell us about effective population size? • what if there are multiple populations contributing information? • how is our answer changed if the population changes in size, or if there is selection for a particular allele? • why is this important for understanding phylogenetics (species trees)?

  4. patterns • mutations happen at a more-or-less constant rate at random locations along the genome (assumptions can be tested) • drift, selection, gene flow, recombination, etc. influence how these mutations turn into patterns • we interpret these patterns with statistical models - mostly beyond this class

  5. assume genealogy • descent with modification • focus on non-reticulate gene trees • assume every mutation happens at a new genome location (Avise 1987, 1994)

  6. neutral model • assume all these mutations have NO effect on fitness (null model) • thus, only drift influences whether an allele goes to fixation • remember: the probability that an allele goes to fixation is its frequency in the population • so every new mutation has a low but equal probability that it will get FIXED (frequency 100%)

  7. SPECIES, POPULATION (DEME), GENE COPY • so you are collecting data, generally not knowing the history of inheritance or how discrete these units may be (actually discrete, resolvably discrete) • we are working on how to infer (at least as probabilities) how this diversity partitions in space (population), time (frequencies), across the genome (paralogs), across species (orthologs) • also: copy number variation among loci, among populations, among species

  8. how many whales? • Roman and Palumbi 2003 • currently ~10,000 humpback whales; pre-whaling (genetic estimate) maybe ~250,000

  9. how could there be so many? • count whales: currently done by censusing and by monitoring whaling vessels, about 10,000 humpback whales in the Atlantic • collect DNA samples from some of them, and sequence at least one gene (more is better!) • remember π is proportional to effective population size (times mutation rate µ) • we know µ (~0.00000001 substitutions per site per generation) from fossil and biogeographic data, and we can calculate π (average # of differences per site between every pair of sequences) • Ne is proportional to π/µ, adjusted for inheritance of the marker (haploid, maternally inherited mtDNA: π = 2Nfµ; diploid, biparental nuclear gene: π = 4Neµ) • Ne of humpback whales ~250,000 even though only ~10,000 whales now! • the genetic diversity is older than human whaling efforts and tells us about the past
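The arithmetic behind "Ne from π and µ" can be sketched in a few lines. This is a minimal illustration, assuming the standard neutral expectations π = 4Neµ (diploid nuclear gene) and π = 2Nfµ (maternally inherited mtDNA). The function name and the input values are hypothetical and are not the numbers from Roman and Palumbi 2003.

```python
def effective_size(pi, mu, copies_factor):
    """Solve pi = copies_factor * N * mu for N.

    copies_factor is 4 for a diploid, biparentally inherited nuclear gene
    and 2 for haploid, maternally inherited mtDNA (N is then the female
    effective size).
    """
    return pi / (copies_factor * mu)


# Hypothetical illustrative inputs (NOT values from the whale study).
pi_per_site = 0.01          # average pairwise differences per site
mu_per_site_per_gen = 1e-8  # substitutions per site per generation

ne_nuclear = effective_size(pi_per_site, mu_per_site_per_gen, 4)
nf_mtdna = effective_size(pi_per_site, mu_per_site_per_gen, 2)
print(round(ne_nuclear), round(nf_mtdna))
```

The same π gives a different N depending on the marker, which is why the estimate has to be adjusted for mode of inheritance.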

  10. AUTOSOMES: ALL 4 COPIES CAN CONTRIBUTE MUTATIONS • MTDNA: ONE COMPONENT CONTRIBUTES MUTATIONS • WHEN PEOPLE REFER TO THE SMALLER EFFECTIVE SIZE OF THE MITOCHONDRIAL GENOME, THEY ARE REFERRING TO COPY NUMBER, NOT THE NUMBER OF INDIVIDUALS IN THE POPULATION!

  11. another look at Ne: drift • neutrality: mean Time to Most Recent Common Ancestor (tmrca) = time to homozygosity = -4Ne[p ln(p) + (1-p) ln(1-p)] generations, for an allele at starting frequency p • proportional to Ne; for p=0.5, ~2.77Ne generations • heterozygosity declines by a fraction 1/(2Ne) per generation • compare nuclear gene vs. mitochondrial gene...? DO NOT MEMORIZE THIS
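To check the ~2.77Ne figure and to see how quickly heterozygosity decays, the two expressions on this slide can be evaluated directly. This is a rough sketch; the function names are made up for illustration.

```python
import math


def mean_time_to_homozygosity(p, ne):
    """-4*Ne*[p ln p + (1-p) ln(1-p)] generations, for start frequency p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # already fixed or lost, nothing left to wait for
    return -4.0 * ne * (p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def heterozygosity_after(h0, ne, generations):
    """Heterozygosity shrinks by a fraction 1/(2Ne) every generation."""
    return h0 * (1.0 - 1.0 / (2.0 * ne)) ** generations


# With Ne = 1 the result is the multiplier of Ne quoted on the slide.
print(mean_time_to_homozygosity(0.5, 1.0))
# Smaller Ne loses heterozygosity faster (hypothetical Ne values).
print(heterozygosity_after(0.5, 1000, 100), heterozygosity_after(0.5, 250, 100))
```

Comparing a large and a small Ne in the second call is one way to think about the nuclear vs. mitochondrial question.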

  12. basic summary stats • S, number of segregating sites (how many below?) • π, average number of differences among sequences (what is it below?) • ηi, folded site pattern: how many segregating sites appear i times?
      caccgtattagcattatgctggtata
      cgccgtactggcattatgctggtata
      caccgtactagcattgtgctggtatg
      caccgtactagcattatgccggtatg
      cactgtactggcattatgctggtgta
      cactgtactggcattatgctggtata
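One way to answer the "how many below?" questions is to count directly. The sketch below uses the six sequences from this slide. It assumes every variable site has exactly two bases (true for this alignment), and the function names are illustrative only.

```python
from collections import Counter
from itertools import combinations

SEQS = [
    "caccgtattagcattatgctggtata",
    "cgccgtactggcattatgctggtata",
    "caccgtactagcattgtgctggtatg",
    "caccgtactagcattatgccggtatg",
    "cactgtactggcattatgctggtgta",
    "cactgtactggcattatgctggtata",
]


def minor_allele_counts(seqs):
    """For each segregating site, how many sequences carry the rarer base."""
    counts = []
    for column in zip(*seqs):
        tally = Counter(column)
        if len(tally) > 1:  # more than one base present: segregating site
            counts.append(min(tally.values()))
    return counts


def mean_pairwise_differences(seqs):
    """pi: average number of differences over every pair of sequences."""
    pairs = list(combinations(seqs, 2))
    diffs = [sum(a != b for a, b in zip(x, y)) for x, y in pairs]
    return sum(diffs) / len(pairs)


minor_counts = minor_allele_counts(SEQS)
S = len(minor_counts)
folded_sfs = Counter(minor_counts)  # eta_i: number of sites with minor count i
pi = mean_pairwise_differences(SEQS)
print(S, dict(folded_sfs), round(pi, 2))
```

For these sequences the counts come out as S = 8, π ≈ 3.33 differences per pair (about 0.13 per site over 26 sites), and a folded pattern of η1 = 5, η2 = 2, η3 = 1.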

  13. standard coalescent • a sample of size n has n-1 coalescent events • Ti is the length of the interval during which i lineages are extant; E[Ti]=2/(i(i-1)), measured in units of N generations, where N is the number of gene copies (2Ne for a diploid nuclear locus) • genetic (label) differences have no fitness consequence • single population • constant population size (for now) • THE TREE IS UNKNOWN, ANALYSIS IS ASKING WHICH TREES FIT THE DATA AND WHAT THAT TELLS US ABOUT THE INTERVAL BETWEEN BRANCH NODES
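The expectation E[Ti] = 2/(i(i-1)) can be summed to get the expected tree height and total branch length. This is a small sketch using exact fractions, in the same time units as the slide; the function names are hypothetical.

```python
from fractions import Fraction


def expected_interval(i):
    """E[T_i]: expected time during which i lineages are extant."""
    return Fraction(2, i * (i - 1))


def expected_tmrca(n):
    """Sum of the n-1 intervals: expected height of the genealogy."""
    return sum(expected_interval(i) for i in range(2, n + 1))


def expected_total_length(n):
    """Each interval T_i is shared by i branches, so weight it by i."""
    return sum(i * expected_interval(i) for i in range(2, n + 1))


for n in (2, 6, 50):
    print(n, float(expected_tmrca(n)), float(expected_total_length(n)))
```

The height approaches 2 as n grows, while the total length keeps increasing slowly, so adding more sequences mostly adds short recent branches.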

  14. mutation • # mutations (K) is Poisson distributed on the genealogy, based on total branch time t = Ttotal • Poisson process: stochastic, each time interval is independent, waiting time between events is exponentially distributed (but when there are many branches, that multiplies the opportunity for mutation in an interval) • Applications: the classic example of a phenomenon well modelled by a Poisson process is deaths due to horse kick in the Prussian army, as shown by Ladislaus Bortkiewicz in 1898. The following examples are also well modelled by the Poisson process: • requests for telephone calls at a switchboard • goals scored in a soccer match • requests for individual documents on a web server • particle emissions due to radioactive decay by an unstable substance (in this case the Poisson process is non-homogeneous in a predictable manner: the emission rate declines as particles are emitted)
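As a concrete illustration of "K is Poisson distributed on the genealogy", the probability of seeing k mutations depends only on the product of the mutation rate and the total branch time. This is a minimal sketch; the rate and time values are hypothetical.

```python
import math


def poisson_pmf(k, mean):
    """P(K = k) for a Poisson variable with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def mutation_count_distribution(mu, total_time, max_k):
    """P(K = 0..max_k) when mutations fall at rate mu along total_time."""
    mean = mu * total_time
    return [poisson_pmf(k, mean) for k in range(max_k + 1)]


# Hypothetical values: rate 0.5 per unit time over 4 units of branch length.
probs = mutation_count_distribution(0.5, 4.0, 10)
for k, prob in enumerate(probs):
    print(k, round(prob, 4))
```

Doubling the total branch length (more lineages or a deeper tree) doubles the expected number of mutations, which is the "multiplies opportunity" point on the slide.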

  15. DO RECOGNIZE THIS: Ewens distribution • under the neutral model, mutations arise at rate µ and are lost or drift to higher frequency (frequency proportional to AGE) • thus we’ve come to expect a certain distribution of allele frequencies, e.g. p=q is unlikely • generally a small number of very common alleles, and an increasing number of very rare alleles • DO NOT MEMORIZE THIS

  16. um, huh? • here is the context: DRIFT causes some alleles to increase in frequency, some to be lost (moving forward in time) • moving back in time from NOW, the same process can explain the frequency of alleles in the context of how individuals are related • this means we have expectations for how long it takes for a sample of sequences from NOW to coalesce to their most recent common ancestor in the past (about 2N generations, with N counted in gene copies as in the standard coalescent) • one reason two separate evolutionary populations may not APPEAR completely different: it takes time for ancestral diversity to sort out

  17. >1 population? • let’s imagine two populations that rarely exchange migrants but have a common ancestry in the recent evolutionary past • drift (moving forwards in time from the ancestral population) leads to many individuals that descended from one particular allele, a different allele in each population • figure labels: this pop descended from ‘red allele’ ancestor; this pop descended from ‘green allele’ ancestor • how do we know there are two populations?

  18. evolutionary biology: the populations tell us who they are! • shown at right are two LOCATIONS, not necessarily two distinct populations • may be one evolutionary population • however: suppose one is 90% A1 and 10% A2, and the other is 10% A1 and 90% A2 (equal sample sizes) • that means overall 50% A1, 50% A2 • should see 25% A1A1 homozygotes, 25% A2A2 if Hardy-Weinberg fits • instead see overall ~41% A1A1, 41% A2A2 (and only ~18% heterozygotes instead of 50%) because we are ‘pooling’ 2 diverged populations
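The 41% figure follows from averaging the Hardy-Weinberg genotype frequencies of the two locations. This is a short sketch, assuming equal sample sizes and Hardy-Weinberg proportions within each location; the function names are illustrative.

```python
def hardy_weinberg(p):
    """Expected (A1A1, A1A2, A2A2) frequencies for A1 frequency p."""
    return p * p, 2 * p * (1 - p), (1 - p) ** 2


def pooled_genotype_frequencies(p_values):
    """Average the genotype frequencies of equally sized subpopulations."""
    per_pop = [hardy_weinberg(p) for p in p_values]
    n = len(per_pop)
    return tuple(sum(freqs[g] for freqs in per_pop) / n for g in range(3))


locations = [0.9, 0.1]  # A1 frequency at each location, from the slide
observed = pooled_genotype_frequencies(locations)
expected = hardy_weinberg(sum(locations) / len(locations))
print("observed when pooled:", [round(x, 2) for x in observed])
print("expected if one pop: ", [round(x, 2) for x in expected])
```

The gap between the pooled heterozygote frequency and the single-population expectation is the signal that two diverged populations have been lumped together.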

  19. excess of common alleles • excess homozygosity could mean that two evolutionary populations are being analyzed as though they are one • so we are suspicious of “even” allele frequencies: they make us think of frequency-dependent selection, balancing selection, or pooling of multiple evolutionary populations

  20. neutral theory: sort of like Goldilocks story • excess common alleles = positive selection or long-term decline • excess rare alleles = purifying selection or population expansion • just right = “neutral” • site patterns shown in the figure, in order: η1=0 η2=1 η3=2 η4=3 (2, +1 for “η5”); η1=3 η2=2 η3=1 η4=0; η1=2 η2=2 η3=1 η4=1
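To judge what "just right" looks like, one can compare an observed site pattern with the standard neutral expectation, in which the number of sites where a new mutation appears i times is proportional to 1/i. The sketch below assumes a sample of n = 5 sequences (because the slide lists η1 to η4) and treats the slide's patterns as unfolded counts; both are assumptions, and the function name is made up.

```python
def expected_neutral_spectrum(n, segregating_sites):
    """Split S sites across classes i = 1..n-1 in proportion to 1/i."""
    weights = [1.0 / i for i in range(1, n)]
    total = sum(weights)
    return [segregating_sites * w / total for w in weights]


# The three patterns listed on the slide, each with 6 segregating sites.
patterns = {
    "first":  [0, 1, 2, 3],
    "second": [3, 2, 1, 0],
    "third":  [2, 2, 1, 1],
}

neutral = expected_neutral_spectrum(5, 6)
print("neutral expectation:", [round(x, 2) for x in neutral])
for label, counts in patterns.items():
    deviation = [round(obs - exp, 2) for obs, exp in zip(counts, neutral)]
    print(label, counts, "observed minus expected:", deviation)
```

With so few sites the deviations are only suggestive; formal tests of neutrality need the statistical models mentioned earlier in the deck.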

  21. learning goals for coalescent theory • how do patterns in sequence data tell us about effective population size? • what if there are multiple populations contributing information? • how is our answer changed if the population changes in size, or if there is selection for a particular allele? • why is this important for understanding phylogenetics (species trees)?

  22. why is this important for understanding phylogenetics (species trees)? • coalescent theory lets us test our assumptions of how DNA sequences evolve before we use them to reconstruct phylogeny • coalescent theory explains why recently-diverged populations may not yet have synapomorphies despite already being on different evolutionary paths • this model gives us a basis for estimating the time to the ancestor of ANY two sequences

  23. DNA characters are just like phenotypic characters • 4 character states A,C,T,G plus information in insertion-deletion, gene copy number, etc. • same concerns of homology and shared descent apply

  24. “mitochondrial Eve” sets up misunderstanding • every locus sampled now has a point in the past where all current alleles coalesce to a common ancestor • in recently diverged species, diversity is often older than the species • figure label: human population isolated ~200kya

  25. understanding coalescence (figure labels: isolation, isolation, Ne) • 1. larger effective size (Ne), more diversity • 2. when the time between branching events is short relative to Ne, it is more likely that allelic diversity is older than the branching event

  26. "This coalescence does not mean that the population originally consisted of a single individual with that ancestral allele. It just means that particular individual’s allele was the one that, out of all the alleles present at that time, later became fixed in the population."

  27. phylogeny inference • 2 basic approaches: algorithm vs. criterion • “neighbor joining” shown in book is an algorithm that generates a single tree by finding shortest “distances” (proportion of differences at nucleotide sites) • algorithm approaches do not help identify our uncertainty: one answer comes out, whether well supported or not
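The "distances" that a method like neighbor joining starts from can be computed as the proportion of sites that differ between each pair of sequences. The sketch below covers only that distance step, not the neighbor-joining algorithm itself. The sample names and sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Proportion of aligned nucleotide sites at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_table(alignment):
    """Pairwise p-distances for a dict of name -> aligned sequence."""
    return {
        (m, n): p_distance(alignment[m], alignment[n])
        for m, n in combinations(sorted(alignment), 2)
    }


# Hypothetical toy alignment.
toy = {"tipA": "ACGTACGT", "tipB": "ACGTACGA", "tipC": "ACCTTCGA"}
for pair, dist in distance_table(toy).items():
    print(pair, dist)
```

An algorithmic method would then join the closest tips first and return a single tree, with no statement about how well the data support it.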

  28. criterion-based phylogeny • 30 tips result in 8.7 x 10^36 possible unrooted trees • computer search necessary
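The size of tree space can be checked with the standard count of unrooted, strictly bifurcating trees for n tips, (2n-5)!! = 1 x 3 x 5 x ... x (2n-5). This is a brief sketch; the function name is illustrative.

```python
def unrooted_tree_count(tips):
    """Number of unrooted bifurcating tree topologies for the given tips."""
    if tips < 3:
        raise ValueError("need at least 3 tips for an unrooted tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., (2 * tips - 5).
    for k in range(3, 2 * tips - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 30):
    print(n, f"{unrooted_tree_count(n):.3g}")
```

The count grows so fast that exhaustive evaluation is hopeless beyond a handful of tips, which is why heuristic computer searches are needed.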

  29. 3 of >10,000 possible trees which fits data best? depends on the criterion

  30. 11 changes • 11 changes • 7 changes = most parsimonious of these 3 • 3 of >10,000 possible trees: which fits data best? depends on the criterion

  31. criteria used in phylogeny • parsimony - the fewest # of changes indicates the most acceptable tree topology • maximum likelihood - both topology (arrangement of branches) and branch lengths are iteratively searched for tree(s) that fit a statistical model of molecular evolution (e.g. transitions > transversions) • Bayesian - criterion is still based on likelihood (combined with prior probabilities), search strategy is different (sums result over many similar-likelihood trees)
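The parsimony criterion can be illustrated by counting the minimum number of changes each candidate tree requires. The sketch below uses Fitch's small-parsimony counting on rooted binary trees written as nested tuples. The taxa, sequences, and candidate trees are hypothetical and are not the ones shown on the slides.

```python
def fitch_site(tree, states):
    """Return (possible ancestral states, changes) for one site."""
    if isinstance(tree, str):  # a tip: its observed base, no changes yet
        return {states[tree]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:  # children agree on at least one state: no extra change
        return shared, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Total minimum number of changes over all sites of the alignment."""
    length = len(next(iter(alignment.values())))
    total = 0
    for pos in range(length):
        states = {name: seq[pos] for name, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


# Hypothetical toy data: four tips, four sites.
toy = {"A": "AAGT", "B": "AAGC", "C": "CTGC", "D": "CTAT"}
candidates = {
    "((A,B),(C,D))": (("A", "B"), ("C", "D")),
    "((A,C),(B,D))": (("A", "C"), ("B", "D")),
    "((A,D),(B,C))": (("A", "D"), ("B", "C")),
}
scores = {name: parsimony_score(t, toy) for name, t in candidates.items()}
for name, score in scores.items():
    print(name, score, "changes")
```

Under parsimony the tree with the lowest score is preferred; a likelihood or Bayesian analysis could rank the same trees differently because it also weighs branch lengths and a model of substitution.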

  32. why different criteria? • we are making our assumptions explicit for inference of the unknown • different scientists have different backgrounds that drive their assumptions • using multiple methods/criteria lets us test how safe our assumptions are • next time: how do we decide if a tree hypothesis is strongly supported?
