Genome evolution Lecture 5: Species, Genomes and Trees
What is a species?
• Multiple definitions exist.
• Free flow of genetic information within a population
• Weak (or zero) flow of information across species barriers
[Figure labels: Strain 1, Strain 2; Species 1, Species 2]
We modify the Wright-Fisher or Moran model by removing the assumption of random mixing. Instead, we assume that subpopulations are more likely to mate among themselves. Different models are possible; all end up increasing the genetic distance between subpopulations.
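One way to see the effect of removing random mixing is a small simulation. The sketch below is only an illustration under assumptions that are not in the lecture: two subpopulations (demes) of n haploid individuals each, a single two-allele locus, and a hypothetical migration parameter m giving the chance that an offspring's parent comes from the other deme. Setting m = 0.5 recovers random mixing; smaller m means stronger isolation.

```python
import random

def structured_wright_fisher(n=100, m=0.01, generations=200, p0=0.5, seed=0):
    """Track |p1 - p2|, the allele-frequency gap between two demes.

    n, m, p0 and seed are illustrative parameters (assumptions, not from the lecture).
    """
    rng = random.Random(seed)
    p1 = p2 = p0
    diffs = [abs(p1 - p2)]
    for _ in range(generations):
        # Allele frequency among the potential parents of each deme,
        # after mixing in a fraction m of migrants from the other deme.
        q1 = (1 - m) * p1 + m * p2
        q2 = (1 - m) * p2 + m * p1
        # Wright-Fisher step: each of the n offspring samples a parent at random.
        p1 = sum(rng.random() < q1 for _ in range(n)) / n
        p2 = sum(rng.random() < q2 for _ in range(n)) / n
        diffs.append(abs(p1 - p2))
    return diffs

if __name__ == "__main__":
    for m in (0.5, 0.05, 0.001):
        diffs = structured_wright_fisher(m=m)
        print(f"m={m}: mean |p1 - p2| = {sum(diffs) / len(diffs):.3f}")
```

Comparing the printed averages for different values of m gives a rough feel for how restricted mating lets the subpopulations drift apart; a single run is noisy, so averaging over several seeds would be more informative.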
Speciation
The emergence of new species is called speciation. It is well accepted that speciation is driven by the formation of reproductive barriers.
Allopatric speciation – occurs through geographical separation
Parapatric speciation – occurs without geographical separation but with weak flow of genetic information
Sympatric speciation – occurs while information is flowing
Barriers can be genetic, physical, or behavioral.
Allopatric speciation
"Finally, then, I suppose that a large number of closely allied or representative species... were originally formed in parts formerly isolated" (Darwin)
Åland Islands, Glanville fritillary population: same species
Charis butterflies in South America: different species
Factors that limit gene flow are quite diverse and go beyond geography: habitat, sexual preferences, season, pollinator…
Many factors can contribute to forming a barrier: physical incompatibility, hybrid sterility (mule), pre- or post-zygotic lethality…
Sympatric speciation
Following Darwin, and prior to population genetics and genetics in general, evolutionary biologists considered sympatric speciation the leading factor generating new species. The idea was that species adapt to niches while co-existing in the same habitat.
Sympatric speciation is, however, difficult to explain using standard population genetics of interbreeding populations. Mayr (and Dobzhansky) made strong arguments suggesting that allopatric speciation is the major (or only) driver of biodiversity.
Results from the last 20-30 years have, however, suggested that sympatric speciation may still be important.
Studies of cichlid fish species in African lakes showed incredible diversity: 500 endemic species in Lake Victoria, up to 1000 in Lake Malawi.
The history of some of these lakes may have included massive dry-out and geographical separation.
In a smaller lake (shown here is Barombi Mbo in Cameroon), dry-out is geographically unlikely, and several species (7) with a probable common ancestor do suggest sympatry.
Species trees
Speciation is irreversible! (with some minor exceptions – think parasites)
We end up with a branching process that forms a tree.
[Figure: strains diverging over time into Species 1-4; labels: extinction, present time]
Facts on trees
• A tree is a connected graph without cycles
• We will use directed trees: each edge/lineage has a direction (time)
• Directed acyclic graph (DAG): a directed graph without cycles
• A binary tree: each node has one or 0 parents (incoming edges) and two or 0 children (outgoing edges)
• A binary tree on n extant species will have n-1 inner nodes (prove)
• Each node partitions a binary tree into three disconnected parts (up, left, right)
• The root of the tree is the only node without parents
• Topological order: a permutation of the nodes such that each node appears after its parents
• BFS/DFS
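The facts above can be checked on a small example. The following is a minimal sketch, assuming a tree stored as a dictionary from each inner node to its pair of children (a representation chosen for illustration; the names `children`, `topological_order` and `count_inner_and_leaves` are hypothetical). A BFS from the root visits every node after its parent, so the visiting order is a topological order.

```python
from collections import deque

def topological_order(children, root):
    """BFS from the root: every node is listed after its parent."""
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        # Leaves are simply absent from the dictionary.
        queue.extend(children.get(node, ()))
    return order

def count_inner_and_leaves(children, root):
    """Return (number of inner nodes, number of leaves)."""
    order = topological_order(children, root)
    inner = sum(1 for node in order if children.get(node))
    return inner, len(order) - inner

# A binary tree on three extant species S1, S2, S3.
children = {"root": ("x", "S3"), "x": ("S1", "S2")}

if __name__ == "__main__":
    print(topological_order(children, "root"))
    print(count_inner_and_leaves(children, "root"))
```

For this tree the count is 2 inner nodes for 3 leaves, in line with the n-1 claim; the general proof is still left as the exercise stated on the slide.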
Evolutionary inference
• We can usually observe only the extant populations
• But we want to infer the history of the evolutionary process
• What did the ancestral populations/species look like? (nodes in the tree)
• What was the evolutionary process that turned an ancestral genome into an extant one? (edges in the tree)
• So we will develop methods for inference: estimating the values of missing variables based on partial observations
Do we need inference?
Getting direct evidence on evolutionary history is only partially possible:
The fossil record has probably given us more evolutionary understanding than any other resource (definitely more than genomes).
But it cannot teach us much about evolution at the genome level, and we cannot use it to learn how to read the genome itself.
New technologies promise to sequence the genomes of extinct species (mammoth, Neanderthals), but this is inherently limited by material availability.
Why do we have a chance with inference?
We are trying to infer the past based on the present. Does this make any sense at all?
The past is correlated with the present: a low substitution probability means a high correlation between A (past) and B (present).
[Figure: two panels, each showing A: past and B: present]
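The link between substitution probability and correlation can be made concrete with a toy calculation. This is a sketch under assumptions that are not stated on the slide: a uniform prior over the four nucleotides for the past state A, and a symmetric model in which the present state B differs from A with total probability p, split equally among the three other nucleotides.

```latex
P(A=a \mid B=b)
  = \frac{P(B=b \mid A=a)\,P(A=a)}{\sum_{a'} P(B=b \mid A=a')\,P(A=a')}
  = \begin{cases} 1-p & a = b \\ p/3 & a \neq b \end{cases}
```

Under these assumptions the denominator equals 1/4 for every b, so the posterior equals the substitution model itself. When p is small, the present state is strong evidence for the past state; at p = 3/4 the posterior is uniform and the present carries no information about the past.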
Maximum parsimony
If we assume that the traits on the tree change slowly, then an ancestral trait is usually the same as the extant one.
For each ancestral node, we have evidence coming in from 3 directions; almost always two of them should agree.
[Figure: the same observations explained by a scenario with 2 substitutions and by a scenario with 1 substitution]
• Formally: given a tree T, and observations Si (from some alphabet) on the extant species:
• 1) Compute the minimal number of changes along the tree
• 2) Find the possible values at each ancestral node given an evolutionary scenario involving the minimal number of changes
[Figure: tree with extant states A, C, A and two unknown ancestral nodes (?)]
Computing the parsimony score
Maximum parsimony algorithm (following Fitch 1971):
Start with D = 0 and up_set[i], a bitvector over the alphabet, for each node i.
Up(i):
    if (extant(i)) { up_set[i] = Si; return }
    Up(right(i)); Up(left(i))
    up_set[i] = up_set[right(i)] ∩ up_set[left(i)]
    if (up_set[i] = ∅) {
        D += 1
        up_set[i] = up_set[right(i)] ∪ up_set[left(i)]
    }
Compute the minimal number of changes by calling Up(root).
[Figure: tree with extant nodes S1, S2, S3 and two unknown ancestral nodes (?), annotated with up_set[4] and up_set[5]]
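The upward pass can be written compactly as a recursion. The following is a minimal sketch, assuming a tree encoded as nested pairs whose leaves are single-character states, with Python sets standing in for the bitvectors; the encoding and the name `fitch_up` are illustrative choices, not part of the lecture.

```python
def fitch_up(node):
    """Upward pass of Fitch's algorithm on one character.

    node is either a leaf state (a string such as "A") or a pair (left, right).
    Returns (up_set, changes): the candidate states at this node and the
    minimal number of changes in the subtree below it.
    """
    if isinstance(node, str):          # extant species: the observed state
        return {node}, 0
    left, right = node
    left_set, left_changes = fitch_up(left)
    right_set, right_changes = fitch_up(right)
    common = left_set & right_set
    if common:                         # children agree: no extra change
        return common, left_changes + right_changes
    # Children disagree: take the union and count one change.
    return left_set | right_set, left_changes + right_changes + 1

if __name__ == "__main__":
    tree = (("A", "C"), "A")           # ((S1, S2), S3) with states A, C, A
    print(fitch_up(tree))
```

Returning the count alongside the set avoids the global counter D used in the pseudocode; the two formulations compute the same score.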
Parsimony "inference"
Algorithm (following Fitch 1971):
Up(i):
    if (extant(i)) { up_set[i] = Si; return }
    Up(right(i)); Up(left(i))
    up_set[i] = up_set[right(i)] ∩ up_set[left(i)]
    if (up_set[i] = ∅) {
        D += 1
        up_set[i] = up_set[right(i)] ∪ up_set[left(i)]
    }
Down(i):
    down_set[i] = up_set[sib(i)] ∩ down_set[par(i)]
    if (down_set[i] = ∅) {
        down_set[i] = up_set[sib(i)] ∪ down_set[par(i)]
    }
    if (not extant(i)) { Down(left(i)); Down(right(i)) }
Algorithm:
    D = 0
    Up(root)
    down_set[root] = ∅
    Down(right(root)); Down(left(root))
Set[i] = up_set[i] ∩ down_set[i]
[Figure: tree with extant nodes S1, S2, S3 and two unknown ancestral nodes (?), annotated with up_set[3], down_set[4] and down_set[5]]
Genomic sequencing
In its first 100 years, evolutionary theory was about organismal traits.
Starting from the 1960s, molecular traits became available (mostly looking at proteins).
Since the 1990s, and to its full extent today, we can cheaply sequence whole genomes.
It is expected that within a few years, technology will routinely allow the study of whole genomes in large population samples.
For example:
The work of the 3-billion-dollar Human Genome Project can now be done by a single lab within a few weeks for $5,000, and the price is rapidly dropping.
The 1000 Genomes Project
Sequencing technology is rapidly evolving:
Illumina GAII (here at WIS): ~40,000,000 reads of ~36 bp each, $5k-10k
Jan 2010: 300 million reads, 150 bp x 2…
Genome evolution: nucleotides are not simple traits
[Figure: sequence examples of point mutation (substitution), deletion, insertion, and duplication]
We transform nucleotides into traits using alignment.
An alignment specifies which positions in two or more genomes represent the same "trait", assuming they are the outcome of a single genealogy.
As we have seen, this need not be well defined (e.g. duplications), but we will usually have to assume it is.
A basic pairwise alignment optimization problem is solved using dynamic programming.
Pairwise alignment: find the alignment minimizing the number (or some linear cost) of mismatches (including deletion/insertion characters).
Affine gap pairwise alignment: find the alignment minimizing the cost of mismatches plus the cost of gaps (a fixed cost for opening a new gap, another cost for each gap character).
(See any standard text on computational genomics.)
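The alignment dynamic programming graph (for reference)
a.k.a. Needleman-Wunsch (global) and Smith-Waterman (local)
Here δ is a score (higher is better), so maximizing the score corresponds to minimizing the alignment cost.
Global alignment:
$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$
Local alignment:
$$s_{i,j} = \max \begin{cases} 0 \\ s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$
[Figure: dynamic programming grid with Species 1 = ATCTGATC along the columns (j = 0..8) and Species 2 = TGCATAC along the rows (i = 0..7); diagonal edges are match/mismatch]
How can we align an entire query to part of a database?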
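The two recurrences differ only in the extra 0 option and in where the final score is read. The following is a minimal sketch that fills the table for either variant, assuming a simple scoring scheme (match +1, mismatch -1, gap -1) chosen for illustration; the function name `alignment_score` and its parameters are hypothetical, and only the score is computed, not the alignment itself.

```python
def alignment_score(v, w, match=1, mismatch=-1, gap=-1, local=False):
    """Score of the best global (default) or local alignment of v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, n + 1):
            s[i][0] = i * gap
        for j in range(1, m + 1):
            s[0][j] = j * gap
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            delta = match if v[i - 1] == w[j - 1] else mismatch
            score = max(
                s[i - 1][j - 1] + delta,  # align v_i with w_j
                s[i - 1][j] + gap,        # align v_i with a gap
                s[i][j - 1] + gap,        # align w_j with a gap
            )
            if local:
                score = max(0, score)     # a local alignment may restart anywhere
            s[i][j] = score
            best = max(best, score)
    # Global: score of aligning both full sequences. Local: best cell anywhere.
    return best if local else s[n][m]

if __name__ == "__main__":
    print(alignment_score("ATCTGATC", "TGCATAC"))
    print(alignment_score("ATCTGATC", "TGCATAC", local=True))
```

Recovering the alignment itself would additionally require storing, for each cell, which of the options achieved the maximum and tracing back from the final cell.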
Multiple alignment
The problem: given a set of sequences (each from a different species), find their optimal multiple alignment.
Multiple alignment cost: many definitions are possible. For most of them the problem is NP-hard.
In fact, we should be looking for the complete evolutionary history of these sequences. The optimal alignment should therefore in principle define the genealogy of each nucleotide, such that these histories are reasonable.
In practice, multiple alignment algorithms use heuristics based on these ideas:
1. Pairwise alignment (distances)
2. Build a "guide tree"
3. Align from leaves to root, one pair (of sequences or profiles) at a time
Designing and implementing a really principled version of these algorithms is not easy.
…ACGAATAGCAGATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGAT…
…ACGTTTAGCAAATGGGCAGATGGCAGTCTAGA-----AGCATGAGACTAGATAGAT…
…ACGAATAGCAAAT------ATGCCAGTCTAGATCGAAAGCATGCCACTAGATAGAT…
Genome alignment
Given a set of genomes, each consisting of several billion nucleotides, the problem becomes computationally intensive.
Heuristics are used to search for pieces of alignment (BLAST).
Pieces are then combined into chains of large fragments.
A genome alignment can be projected onto some reference genome. Complex situations with duplications and large deletions and insertions require complex solutions and are routinely ignored.