Outline • Cancer Progression Models • SNPs, Haplotypes, and Population Genetics: Introduction
Cancer: Mutation and Selection Clonal theory of cancer: Nowell (Science 1976)
Cancer Genomes Leukemia Breast
“Comparative Genomics” of Cancer • [Figure: human genome, then mutation and selection, then tumor genome and tumor genomes 2, 3, 4] • 1) Identify recurrent aberrations • Mitelman Database, >40,000 aberrations • 2) Reconstruct temporal sequence of aberrations • Linear model: colorectal cancer (Vogelstein, 1988): -5q 12p* -17p -18q • Tree model (Desper et al. 1999) • 3) Find age of tumor, time of clonal expansion
Observing Cancer Progression • Obtaining longitudinal (time-course) data is difficult. • Latitudinal data (multiple patients) are readily available. • [Figure: human genome, then mutation and selection, then tumor genome and tumor genomes 2, 3, 4, with time points t1, t2, t3, t4]
Multiple Mutations • 4-step model for colorectal cancer: Vogelstein et al. (1988), New Engl. J. Med.: -5q 12p* -17p -18q • Inferred from latitudinal data in 172 tumor samples.
Oncogenetic Tree models (Desper et al. JCB 1999, 2000) • Given: measurements of chromosome gain/loss events in multiple tumor samples (CGH), e.g. {+1q}, {-8p}, {+Xq}, {+Xq, -8p}, {-8p, +1q} • Compute: rooted tree that best explains temporal sequence of events.
Oncogenetic Tree models (Desper et al. JCB 1999, 2000) • Given: measurements of chromosome gain/loss events in multiple tumor samples, e.g. {+1q}, {-8p}, {+Xq}, {+Xq, -8p}, {-8p, +1q} • L = set of chromosome alterations observed across all samples • Tumor samples give a probability distribution on 2^L (the subsets of L)
Oncogenetic Tree • T = (V, E, r, p, L), a rooted tree • V = vertices • E = edges • L = set of events (leaves) • r = root • p: E → (0,1], a probability on each edge • T gives a probability distribution on 2^L • [Figure: tree with edges e0, e1, e2, e3, e4]
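A minimal sketch of how a tree T can induce a distribution on 2^L. It assumes each edge is realized independently with probability p(e) and an event is observed when every edge on its path from the root is realized; the slide does not spell this rule out. The tree shape, the internal node "u", and all probabilities below are hypothetical.

```python
from itertools import product

# Hypothetical tree: child -> (parent, edge probability). "r" is the root.
TREE = {
    "u":   ("r", 0.5),
    "+1q": ("u", 0.4),
    "-8p": ("u", 0.6),
    "+Xq": ("r", 0.3),
}
LEAVES = ("+1q", "-8p", "+Xq")


def reachable_leaves(kept_edges):
    """Leaves whose whole path to the root consists of kept edges.

    An edge is identified by its child node.
    """
    observed = set()
    for leaf in LEAVES:
        node = leaf
        while node != "r" and node in kept_edges:
            node = TREE[node][0]
        if node == "r":
            observed.add(leaf)
    return frozenset(observed)


def event_set_distribution():
    """Enumerate every edge subset and add up its probability per observed leaf set."""
    edges = list(TREE)
    dist = {}
    for keep in product([False, True], repeat=len(edges)):
        prob = 1.0
        kept = set()
        for edge, flag in zip(edges, keep):
            p = TREE[edge][1]
            prob *= p if flag else 1.0 - p
            if flag:
                kept.add(edge)
        key = reachable_leaves(kept)
        dist[key] = dist.get(key, 0.0) + prob
    return dist
```

Brute-force enumeration is only practical for small trees, but it makes the mapping from edge probabilities to subset probabilities explicit.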
Results • CGH of 117 cases of kidney cancer
Extensions • Oncogenetic trees based on branching (Desper et al., JCB 1999) • Maximum Likelihood Estimation (von Heydebreck et al, 2004) • Mutagenic trees: mixtures of trees (Beerenwinkel, et al. JCB 2005)
Heterogeneity within a tumor • Final tumor is clonal expansion of single cell lineage. • Can we date the time of clonal expansion? Tsao, … Tavare, et al. Genetic reconstruction of individual colorectal tumor histories, PNAS 2000.
Estimating time of clonal expansion • Microsatellite loci (MS), CA dinucleotides. • In tumors with loss of mismatch repair (e.g. colorectal), MS change size.
Estimating time of clonal expansion • For each MS locus i, measure mean m_i and variance s_i^2 of size. • S^2_allele = average of s_1^2, …, s_L^2 • S^2_loci = variance of m_1, …, m_L
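The two summary statistics can be sketched as follows. The input layout (one list of measured allele sizes per locus) and the use of population variance rather than sample variance are assumptions; the slide does not say which was used.

```python
from statistics import mean, pvariance


def ms_summary(sizes_by_locus):
    """Return (S^2_allele, S^2_loci) for microsatellite size data.

    sizes_by_locus: list of lists; each inner list holds the sizes
    measured at one locus.
    """
    means = [mean(sizes) for sizes in sizes_by_locus]           # m_1 .. m_L
    variances = [pvariance(sizes) for sizes in sizes_by_locus]  # s_1^2 .. s_L^2
    s2_allele = mean(variances)   # average within-locus variance
    s2_loci = pvariance(means)    # variance of the locus means
    return s2_allele, s2_loci
```

For example, `ms_summary([[10, 12], [20, 20], [15, 17]])` averages the within-locus variances 1, 0, 1 and takes the variance of the locus means 11, 20, 16.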
Simulation Estimates of Tumor Age • Y1 = time to clonal expansion • Tumor age = Y1 + Y2 • Branching process simulation: each cell in the population gives birth to 0, 1 or 2 daughter cells, with a ±1 change in MS size (coalescent: forward, backward, forward simulation) • Posterior estimate of Y1, Y2 by running simulations and accepting runs whose simulated S^2_allele, S^2_loci are close to the observed values. • [Figure labels: Y1, Y2]
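The accept/reject step can be sketched as rejection sampling. The simulator below is a deliberately simplified stand-in, not the branching process described on the slide: it assumes one shared lineage for y1 divisions followed by independent lineages for y2 divisions. The uniform prior, the mutation rate `mu`, the tolerance, and all function names are hypothetical.

```python
import random
from statistics import mean, pvariance


def summaries(sizes_by_locus):
    """Return (S^2_allele, S^2_loci) for a list of per-locus size lists."""
    means = [mean(s) for s in sizes_by_locus]
    variances = [pvariance(s) for s in sizes_by_locus]
    return mean(variances), pvariance(means)


def drift(steps, mu, rng):
    """Net size change after `steps` divisions; each mutates by +1 or -1 with probability mu."""
    return sum(rng.choice((-1, 1)) for _ in range(steps) if rng.random() < mu)


def simulate(y1, y2, n_loci, n_cells, mu, rng):
    """Toy simulator: shared drift for y1 divisions, then independent drift per cell for y2."""
    data = []
    for _ in range(n_loci):
        founder = drift(y1, mu, rng)
        data.append([founder + drift(y2, mu, rng) for _ in range(n_cells)])
    return data


def abc_rejection(observed, n_runs, tol, max_y, rng, n_loci=10, n_cells=10, mu=0.05):
    """Keep the (y1, y2) draws whose simulated summaries fall within tol of the observed ones."""
    accepted = []
    for _ in range(n_runs):
        y1 = rng.randint(1, max_y)  # uniform prior (an assumption)
        y2 = rng.randint(1, max_y)
        s2_allele, s2_loci = summaries(simulate(y1, y2, n_loci, n_cells, mu, rng))
        if abs(s2_allele - observed[0]) <= tol and abs(s2_loci - observed[1]) <= tol:
            accepted.append((y1, y2))
    return accepted
```

The accepted pairs approximate the posterior of (Y1, Y2) under the toy model; a call such as `abc_rejection((0.8, 2.0), n_runs=2000, tol=0.3, max_y=200, rng=random.Random(1))` returns them as a list.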
Results • 15 patients, 25 MS loci • Estimate time since clonal expansion from observed S^2_allele, S^2_loci.
Cancer: Mutation and Selection Clonal theory of cancer: Nowell (Science 1976)
Population Genetics • C.C. Maley: selective sweeps of mutations in tumor cell populations • Chin and Gray: solid tumors
Genetics 101 • Humans are diploid: two copies of each chromosome, maternal and paternal • Locus: Region on a chromosome (gene, nucleotide, etc.) • Allele: “Value” at a locus • Genotype: Pair of alleles (maternal and paternal) at loci on a chromosome (homozygous, heterozygous) • Haplotype: Alleles of loci on same chromosome (maternal or paternal)
Allele Measurement • “Old days” (< 1970?): gene variants • More recently (1980s-90s): various sequence-based genetic markers: microsatellites, sequence tagged sites (STS), etc. • Today: single nucleotide polymorphisms (SNPs)
Single Nucleotide Polymorphisms • Infinite Sites Assumption: each site mutates at most once • By convention, SNPs are biallelic: only two of the four possible nucleotides are present in the population • Example (one row per sequence, one column per SNP): 00000101011, 10001101001, 01000101010, 01000000011, 00011110000, 00101100110
Infinite Sites Assumption • [Figure: tree of sequences with labels A and B. Root 00000000; a mutation at site 3 gives 00100000; from there a mutation at site 5 gives 00101000 and a mutation at site 8 gives 00100001] • The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa. • Each sequence has a single parent. • The history of a population can be expressed as a tree. • The tree can be constructed efficiently.
Infinite Sites Assumption and Perfect Phylogeny • Each site is mutated at most once in the history. • All descendants of the mutated sequence must carry the mutated value, and all others must carry the ancestral value. • [Figure: edge labeled i; 1 in position i on one side, 0 in position i on the other]
Perfect Phylogeny • Assume an evolutionary model in which only mutation takes place. • The evolutionary history is explained by a tree in which every mutation is on an edge of the tree. For the site mutated on an edge, all the species in the sub-tree on one side of that edge contain a 0, and all species on the other side contain a 1. Such a tree is called a perfect phylogeny. • How can one reconstruct such a tree?
The 4-gamete condition • A column i partitions the set of species into two sets, i0 and i1 • A column is homogeneous w.r.t. a set of species if it has the same value for all species in the set. Otherwise, it is heterogeneous. • Example: column i with A=0, B=0, C=0, D=1, E=1, F=1 gives i0 = {A, B, C} and i1 = {D, E, F}; i is heterogeneous w.r.t. {A, D, E}
4 Gamete Condition • There exists a perfect phylogeny if and only if, for every pair of columns (i, j), j is homogeneous w.r.t. i0 or w.r.t. i1. • Equivalent to: • There exists a perfect phylogeny if and only if, for every pair of columns (i, j), the 4 rows (0,0), (0,1), (1,0), (1,1) do not all occur.
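The pairwise test can be sketched directly. Sequences are assumed to be given as equal-length strings of '0' and '1', one per row; the function names are illustrative.

```python
from itertools import combinations


def violates_four_gametes(rows, i, j):
    """True if columns i and j together show all of (0,0), (0,1), (1,0), (1,1)."""
    return len({(row[i], row[j]) for row in rows}) == 4


def satisfies_four_gamete_condition(rows):
    """True if no pair of columns contains all four gametes."""
    n_cols = len(rows[0])
    return not any(
        violates_four_gametes(rows, i, j)
        for i, j in combinations(range(n_cols), 2)
    )
```

The check examines every pair of columns, so it takes time proportional to (number of rows) × (number of columns)^2.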
4-gamete condition: proof • Depending on which edge the mutation j occurs on, j must be homogeneous w.r.t. either i0 or i1. • (only if) Every perfect phylogeny satisfies the 4-gamete condition • (if) If the 4-gamete condition is satisfied, does a perfect phylogeny exist? • [Figure: edge i separating i0 and i1]
An algorithm for constructing a perfect phylogeny • We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This restriction is removed later (unrooted case). • In any tree, each node (except the root) has a single parent. • It is sufficient to construct a parent for every node. • In each step, we add a column and refine some of the nodes containing multiple children. • Stop when all columns have been considered.
Inclusion Property • For any pair of columns i, j: • i < j if and only if i1 ⊇ j1 • Note that if i < j then the edge containing i is an ancestor of the edge containing j
Example
    1 2 3 4 5
A   1 1 0 0 0
B   0 0 1 0 0
C   1 1 0 1 0
D   0 0 1 0 1
E   1 0 0 0 0
Initially, there is a single clade r, and each node has r as its parent. [Figure: root r with children A, B, C, D, E]
Sort columns • Sort columns according to the inclusion property (note that the columns are already sorted here). • This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order • Matrix (rows A to E, columns 1 to 5): A 11000, B 00100, C 11010, D 00101, E 10000
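The sorting step can be sketched as below, assuming rows are given as strings of '0' and '1' with the first row as the most significant bit. The function returns 0-based column indices.

```python
def sort_columns(rows):
    """Return column indices ordered by decreasing binary value (row 0 = most significant bit)."""
    n_cols = len(rows[0])

    def column_value(c):
        # Read column c top to bottom as a binary number.
        return int("".join(row[c] for row in rows), 2)

    return sorted(range(n_cols), key=column_value, reverse=True)
```

For the example matrix the column values are 21, 20, 10, 4, 2, so the order is unchanged.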
Add first column • Matrix (rows A to E, columns 1 to 5): A 11000, B 00100, C 11010, D 00101, E 10000 • In adding column i: • Check each edge and decide which side each species belongs on. • Finally, add a node if you can resolve a clade. • [Figure: r with children u, B, D; u with children A, C, E]
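Adding other columns • Matrix (rows A to E, columns 1 to 5): A 11000, B 00100, C 11010, D 00101, E 10000 • Add other columns on edges using the ordering property • [Figure: final tree. Edge 1 from r leads to a node with child E and edge 2; edge 2 leads to a node with child A and edge 4 to C. Edge 3 from r leads to a node with child B and edge 5 to D]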
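One way to sketch the rooted construction is to record, for every column, the mutation directly above it. This is a compact variant rather than the slide's step-by-step refinement of clades: it walks each row through its 1-columns in sorted order and checks that every column always sees the same predecessor. The return format and names are illustrative; columns are reported 1-based to match the example.

```python
def build_perfect_phylogeny(rows, names):
    """Return (parent, attach) for a 0/1 matrix with 0 as the ancestral state.

    parent[c]: the column labelling the edge directly above column c's edge,
               or "r" if that edge hangs off the root.
    attach[s]: the last mutation on the path from the root to species s,
               or "r" if s carries no mutation.
    Raises ValueError when the rows admit no such tree.
    """
    n_cols = len(rows[0])
    # Sort columns by decreasing binary value (first row = most significant bit).
    order = sorted(
        range(n_cols),
        key=lambda c: int("".join(row[c] for row in rows), 2),
        reverse=True,
    )
    parent, attach = {}, {}
    for name, row in zip(names, rows):
        prev = "r"
        for c in order:
            if row[c] == "1":
                # The first row to reach column c fixes its parent; later rows must agree.
                if parent.setdefault(c + 1, prev) != prev:
                    raise ValueError("no perfect phylogeny with 0 as ancestral state")
                prev = c + 1
        attach[name] = prev
    return parent, attach
```

On the example matrix this gives edges 1 and 3 at the root, 2 below 1, 4 below 2, and 5 below 3, with E attached below edge 1, A below 2, C below 4, B below 3, and D below 5.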
Unrooted case • Switch the values in each column, so that 0 is the majority element. • Apply the algorithm for the rooted case
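The relabelling step can be sketched as below, again assuming rows are strings of '0' and '1'. Leaving a column unchanged when 0 and 1 are exactly tied is an arbitrary choice made here, not something the slide specifies.

```python
def to_majority_zero(rows):
    """Flip every column in which 1 is the strict majority.

    Returns (new_rows, flipped), where flipped[c] tells whether column c was switched.
    """
    n_rows = len(rows)
    n_cols = len(rows[0])
    flipped = [
        2 * sum(row[c] == "1" for row in rows) > n_rows
        for c in range(n_cols)
    ]
    swap = {"0": "1", "1": "0"}
    new_rows = [
        "".join(swap[ch] if flipped[c] else ch for c, ch in enumerate(row))
        for row in rows
    ]
    return new_rows, flipped
```

The returned `flipped` list lets the original 0/1 labels be restored after the rooted algorithm has run.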
Summary: No recombination leads to correlation between sites • [Figure: tree of sequences with labels A and B. Root 00000000; a mutation at site 3 gives 00100000; from there a mutation at site 5 gives 00101000 and a mutation at site 8 gives 00100001] • The different sites are linked. A 1 in position 8 implies 0 in position 5, and vice versa. • The history of a population can be expressed as a tree. • The tree can be constructed efficiently.
Haplotype Phasing Problem • Given a set of genotypes, infer the haplotypes. • Use parsimony assumption • Haplotypes satisfy perfect phylogeny (Gusfield) • Find minimum number of haplotypes that explain observed genotypes • Most sequencing technologies measure genotypes, not haplotypes • Example: the pair of haplotypes 0101110 and 1100010 gives genotype 2102210 (2 = heterozygous)
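A small sketch of the encoding used in the example, and of why phasing is ambiguous: a genotype with k heterozygous sites is compatible with 2^(k-1) unordered haplotype pairs when k ≥ 1. The function names are illustrative, and the enumeration is brute force, not a phasing algorithm.

```python
from itertools import product


def genotype(h1, h2):
    """Combine two haplotypes: keep the shared allele, write '2' where they differ."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def compatible_pairs(g):
    """All unordered haplotype pairs that produce genotype g."""
    het_sites = [k for k, ch in enumerate(g) if ch == "2"]
    pairs = set()
    for bits in product("01", repeat=len(het_sites)):
        h1, h2 = list(g), list(g)
        for k, b in zip(het_sites, bits):
            h1[k] = b
            h2[k] = "1" if b == "0" else "0"
        pairs.add(frozenset(("".join(h1), "".join(h2))))
    return pairs
```

Parsimony and perfect-phylogeny constraints are ways of choosing among these candidate pairs across many genotypes at once.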
Recombination • [Figure: 00000000 and 11111111 recombine to give 00011111]
Recombination • A tree is not sufficient, as a sequence may have 2 parents • Recombination leads to violation of the 4-gamete property. • Recombination leads to loss of correlation between columns • [Figure: 00000000 and 11111111 recombine to give 00011111]
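The example can be reproduced with a single-crossover sketch. The three-sequence population is hypothetical, chosen so that it satisfies the 4-gamete condition before the recombinant is added.

```python
def recombine(parent_a, parent_b, cut):
    """Single crossover: the first `cut` sites of parent_a followed by the rest of parent_b."""
    return parent_a[:cut] + parent_b[cut:]


def four_gametes_present(seqs, i, j):
    """True if columns i and j show all of 00, 01, 10, 11."""
    return len({(s[i], s[j]) for s in seqs}) == 4


# Hypothetical tree-like population: sites 0-2 mutate first, then sites 3-7.
population = ["00000000", "11100000", "11111111"]
child = recombine("00000000", "11111111", 3)
```

Before recombination, columns 0 and 5 show only (0,0), (1,0), (1,1); the recombinant contributes (0,1), which completes the four gametes.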
Studying recombination • A tree is not sufficient as a sequence may have 2 parents • Recombination leads to loss of correlation between columns • How can we measure recombination?
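Linkage (Dis)-equilibrium (LD) • Extensive recombination: (A,B) pairs (0,0), (0,1), (0,0), (0,0), (1,1), (1,0), (1,0), (1,0) • Pr[A,B = (0,1)] = 0.125 = Pr[A=0] × Pr[B=1] = 0.5 × 0.25 • Linkage equilibrium • No recombination: (A,B) pairs (0,1), (0,1), (0,0), (0,0), (1,0), (1,0), (1,0), (1,0) • Pr[A,B = (0,1)] = 0.25, not equal to Pr[A=0] × Pr[B=1] = 0.125 • Linkage disequilibrium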
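The comparison can be sketched by computing the joint frequency, the product of the marginals, and their difference. The difference is the standard LD coefficient D, which the slide does not name; the function name is illustrative.

```python
def ld_coefficient(pairs, a=0, b=1):
    """Return (Pr[A=a, B=b], Pr[A=a] * Pr[B=b], D) for a list of (A, B) pairs."""
    n = len(pairs)
    p_ab = sum(1 for x, y in pairs if (x, y) == (a, b)) / n
    p_a = sum(1 for x, _ in pairs if x == a) / n
    p_b = sum(1 for _, y in pairs if y == b) / n
    return p_ab, p_a * p_b, p_ab - p_a * p_b


# The two tables from the slide, as (A, B) pairs.
extensive_recombination = [(0, 0), (0, 1), (0, 0), (0, 0), (1, 1), (1, 0), (1, 0), (1, 0)]
no_recombination = [(0, 1), (0, 1), (0, 0), (0, 0), (1, 0), (1, 0), (1, 0), (1, 0)]
```

D is zero for the first table (linkage equilibrium) and 0.125 for the second (linkage disequilibrium).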