Identifying conserved spatial patterns in genomes. Rose Hoberman, Computer Science Department, Carnegie Mellon University. University of Chicago, Oct 20, 2006. My focus: Spatial Comparative Genomics.
Identifying conserved spatial patterns in genomes. Rose Hoberman, Computer Science Department, Carnegie Mellon University. University of Chicago, Oct 20, 2006
My focus: Spatial Comparative Genomics. Understanding genome structure, especially how the spatial arrangement of elements within the genome changes and evolves.
A simple model of a genome: an ordered list of genes [Figure: chromosomes drawn as ordered lists of numbered genes]
Genomic Change [Figure: an ancestral genome splits by speciation into species 1 and species 2, which diverge through sequence mutation + chromosomal rearrangements]
Types of Genomic Rearrangements • Inversions • Duplications/Insertions • Loss [Figure: numbered gene orders before and after each rearrangement]
Types of Genomic Rearrangements • Inversions • Duplications/Insertions • Loss • Fissions and fusions [Figure: numbered gene orders before and after each rearrangement]
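The list-of-genes model and the rearrangement types above can be sketched as plain list operations. This is only an illustration: the function names and index conventions are hypothetical, and gene orientation (strand) is ignored, as in the simple model on the slides.

```python
# Hypothetical helpers; a chromosome is a Python list of integer gene labels.

def invert(genome, i, j):
    """Inversion: reverse the order of the genes in genome[i:j]."""
    return genome[:i] + genome[i:j][::-1] + genome[j:]

def duplicate(genome, i, j, k):
    """Duplication/insertion: copy genome[i:j] and insert the copy at index k."""
    return genome[:k] + genome[i:j] + genome[k:]

def lose(genome, i, j):
    """Loss: delete the genes in genome[i:j]."""
    return genome[:i] + genome[j:]

def fission(genome, i):
    """Fission: split one chromosome into two at index i."""
    return genome[:i], genome[i:]

def fusion(left, right):
    """Fusion: join two chromosomes end to end."""
    return left + right

ancestor = [1, 2, 3, 4, 5, 6]
print(invert(ancestor, 1, 4))        # [1, 4, 3, 2, 5, 6]
print(duplicate(ancestor, 0, 2, 6))  # [1, 2, 3, 4, 5, 6, 1, 2]
print(lose(ancestor, 2, 4))          # [1, 2, 5, 6]
print(fission(ancestor, 3))          # ([1, 2, 3], [4, 5, 6])
```

Each function returns a new list, so the ancestral order stays available for comparison with its descendants.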
An Essential Task for Spatial Comparative Genomics: Identify chromosomal regions that descended from the same region in the genome of the common ancestor [Figure: gene orders of Species 1 and Species 2]
Outline • Introduction and Motivation • Evolution of spatial organization • Applications: why identify related genomic regions? • Problem Background • Why is this challenging? • Introduction to cluster finding • Results • Statistics for pairwise clusters • Statistics for three-way clusters
Identification of homologous chromosomal segments is a key task in comparative genomics • Genome evolution • Reconstruct history of chromosomal rearrangements • Infer ancestral genetic map • Phylogeny reconstruction • Identify ancient whole genome duplications Pevzner, Tesler. Genome Research 2003
Identification of homologous chromosomal segments is a key task in comparative genomics • Genome evolution • Reconstruct history of chromosomal rearrangements • Infer ancestral genetic map • Phylogeny reconstruction • Identify ancient whole genome duplications Guillaume Bourque et al. Genome Research 2004
Identification of homologous chromosomal segments is a key task in comparative genomics • Genome evolution • Reconstruct history of chromosomal rearrangements • Infer ancestral genetic map • Phylogeny reconstruction • Identify ancient whole genome duplications [Figure: an ancestral chromosome undergoes whole genome duplication, giving chromosome 1 and chromosome 2]
Identification of homologous chromosomal segments is a key task in comparative genomics • Genome evolution • Reconstruct history of chromosomal rearrangements • Infer ancestral genetic map • Phylogeny reconstruction • Identify ancient whole genome duplications McLysaght et al Nature Genetics, 2002.
Identification of conserved chromosomal structure is a key task in comparative genomics • Understand gene function and regulation in bacteria • Infer functional associations • Predict operons • Identify horizontal transfers [Figure labels: insertion, loss, inversions]
Closely related genomes: related regions are easy to identify [Figure: gene orders of Species 1 and Species 2]
More Distantly Related Genomes: Homologous regions are harder to detect, but there is still spatial evidence of common ancestry • Similar gene content • Neither gene content nor order is perfectly preserved [Figure: gene orders of two more distantly related genomes]
The signature of diverged regions: gene clusters • Similar gene content • Neither gene content nor order is perfectly preserved [Figure: gene orders of two more distantly related genomes]
A Framework for Identifying Gene Clusters • Find homologous genes • Formally define a “gene cluster” • Design an algorithm to find clusters • Statistically verify clusters
Why Validate Clusters Statistically? • After sufficient time has passed, gene order will become randomized • Uniform random data tends to be “clumpy” • Some genes will end up close together in both genomes simply by chance [Figure: gene orders of two genomes]
Cluster Statistics Cluster Models
Gene Homology • Identification of homologous gene pairs • generally based on sequence similarity • conserved genomic context is also informative • Assumptions • matches are binary (similarity scores are discarded) • each gene is homologous to at most one other gene in the other genome
A Framework for Identifying Gene Clusters • Find homologous genes • Formally define a “gene cluster” • Devise an algorithm to identify clusters • Statistically verify clusters
Where are the gene clusters? Intuitive notion: pairs of regions that are dense with homologs How can we formalize this intuition?
The r-window cluster definition [Figure label: r = 6] • Two windows of size r that share at least m homologous gene pairs • r is a user-specified parameter (Cavalcanti et al 03, Durand and Sankoff 03, Friedman and Hughes 01, Raghupathy and Durand 05)
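The r-window definition above can be checked by brute force on small inputs. The sketch below is an illustration, not the method of any of the cited papers: it assumes each genome is a list of gene labels in which homologous genes carry the same label, and the function name is hypothetical.

```python
def r_window_clusters(genome1, genome2, r, m):
    """Return (i, j, shared) for every pair of length-r windows, starting at
    index i in genome1 and j in genome2, that share at least m gene labels."""
    hits = []
    for i in range(len(genome1) - r + 1):
        window1 = set(genome1[i:i + r])
        for j in range(len(genome2) - r + 1):
            shared = window1 & set(genome2[j:j + r])
            if len(shared) >= m:
                hits.append((i, j, sorted(shared)))
    return hits

genome1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
genome2 = [11, 3, 12, 1, 2, 13, 14, 15, 9, 16]
print(r_window_clusters(genome1, genome2, r=4, m=3))  # [(0, 1, [1, 2, 3])]
```

The scan compares every window pair, so it is quadratic in genome length and meant only to make the definition concrete.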
A max-gap chain [Figure labels: g = 2, gap = 3] • The distance or “gap” between genes is equal to the number of intervening genes • A set of genes in a genome forms a max-gap chain if the gap between adjacent genes is never greater than g (a user-specified parameter)
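The chain condition can be written as a short check. This sketch assumes each gene is identified by its 0-based index in the genome's ordered gene list, so the gap between two genes at positions p < q is q - p - 1; the function name is hypothetical.

```python
def is_max_gap_chain(positions, g):
    """True if, after sorting, every pair of adjacent positions is separated
    by at most g intervening genes."""
    ordered = sorted(positions)
    return all(b - a - 1 <= g for a, b in zip(ordered, ordered[1:]))

print(is_max_gap_chain([2, 5, 6, 9], g=2))  # True: gaps are 2, 0, 2
print(is_max_gap_chain([2, 6], g=2))        # False: the gap is 3
```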
The max-gap cluster definition [Figure labels: gap = 3, g = 2, g = 3] A set of genes forms a max-gap cluster in two genomes if • the genes form a max-gap chain in each genome • the cluster is maximal (i.e. not contained within a larger cluster)
Max-gap search algorithms • Many genomic studies use a max-gap criterion • Each group designs its own search algorithm • These are often greedy algorithms, but greedy algorithms miss disordered max-gap clusters (Hoberman et al, RECOMB Comp Genomics 2005)
Greedy, Agglomerative Algorithms [Figure label: g = 2] • initialize a cluster as a single homologous pair • search for a gene in proximity on both chromosomes • either extend the cluster and repeat, or terminate
Greedy Algorithms Impose Order Constraints [Figure: a max-gap cluster of size four, g = 2] A greedy, agglomerative algorithm will not find this cluster, since there is no max-gap cluster of size two
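The failure described above can be reproduced on a small made-up example. Here the four genes and their positions are invented for illustration, and `greedy_cluster` is one plausible reading of the three greedy steps (extend with any gene within gap g of some cluster member in both genomes), not the code of any published tool.

```python
def gap(p, q):
    """Number of genes lying strictly between positions p and q."""
    return abs(p - q) - 1

def is_max_gap_chain(positions, g):
    ordered = sorted(positions)
    return all(gap(a, b) <= g for a, b in zip(ordered, ordered[1:]))

def greedy_cluster(seed, pos1, pos2, g):
    """Grow a cluster from one gene, adding any gene that lies within gap g
    of some current cluster member in BOTH genomes."""
    cluster = {seed}
    extended = True
    while extended:
        extended = False
        for gene in pos1:
            if gene in cluster:
                continue
            near1 = any(gap(pos1[gene], pos1[c]) <= g for c in cluster)
            near2 = any(gap(pos2[gene], pos2[c]) <= g for c in cluster)
            if near1 and near2:
                cluster.add(gene)
                extended = True
    return cluster

# Hypothetical positions: gene order a, b, c, d in genome 1 and b, d, a, c in genome 2.
pos1 = {"a": 0, "b": 3, "c": 6, "d": 9}
pos2 = {"b": 0, "d": 3, "a": 6, "c": 9}
g = 2

# All four genes form a max-gap chain in each genome ...
print(is_max_gap_chain(pos1.values(), g), is_max_gap_chain(pos2.values(), g))  # True True
# ... yet no seed can be extended, because no two genes are close in both genomes.
print([sorted(greedy_cluster(s, pos1, pos2, g)) for s in "abcd"])  # [['a'], ['b'], ['c'], ['d']]
```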
Algorithms and Definition Mismatch • Greedy, bottom-up algorithms will not find all max-gap clusters • There is an efficient divide-and-conquer algorithm to find maximal max-gap clusters (Bergeron et al, WABI, 2002) • Cluster statistics depend on the search space, which in turn depends on which algorithm is used
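A top-down search avoids the order constraint that greedy extension imposes. The sketch below is a simple recursive splitting procedure in the divide-and-conquer spirit; it is not the efficient algorithm of Bergeron et al, and the function names and example positions are hypothetical. It relies on the observation that a cluster can never span a gap larger than g between consecutive candidate genes in either genome.

```python
def runs_within_gap(genes, pos, g):
    """Sort genes by position in one genome and cut wherever the gap exceeds g."""
    ordered = sorted(genes, key=lambda x: pos[x])
    runs = [[ordered[0]]]
    for prev, nxt in zip(ordered, ordered[1:]):
        if pos[nxt] - pos[prev] - 1 > g:
            runs.append([])
        runs[-1].append(nxt)
    return runs

def max_gap_clusters(genes, pos1, pos2, g):
    """Return all maximal gene sets that form a max-gap chain in both genomes."""
    genes = list(genes)
    if not genes:
        return []
    runs = runs_within_gap(genes, pos1, g)
    if len(runs) == 1:
        runs = runs_within_gap(genes, pos2, g)
        if len(runs) == 1:
            return [set(genes)]  # a chain in both genomes: nothing left to split
    clusters = []
    for run in runs:
        clusters.extend(max_gap_clusters(run, pos1, pos2, g))
    return clusters

# Hypothetical example: a disordered cluster of four genes plus one distant gene e.
pos1 = {"a": 0, "b": 3, "c": 6, "d": 9, "e": 10}
pos2 = {"b": 0, "d": 3, "a": 6, "c": 9, "e": 20}
print([sorted(c) for c in max_gap_clusters(pos1, pos1, pos2, g=2)])
# [['a', 'b', 'c', 'd'], ['e']]
```

Singleton sets are reported as clusters of size one; a caller interested only in larger clusters can filter them out.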
A Framework for Identifying Gene Clusters • Find homologous genes • Formally define a “gene cluster” • Devise an algorithm to identify clusters • Statistically verify clusters An example…
Example: Whole genome self-comparison to detect duplicated blocks • Chose g=30 • Compared all human chromosomes to all other chromosomes to find max-gap gene clusters McLysaght et al. Nature Genetics, 2002.
How can we use statistical analysis? • Could two regions display this degree of similarity simply by chance? • Is g=30 a reasonable choice of gap size? • Are larger clusters less likely to occur by chance? • How large does a cluster have to be before we are surprised to observe it? [Figure labels: Chr 17, 10 genes duplicated out of ~100, 29 genes; Chr 3, 3 genes duplicated out of ~25, 13 genes] McLysaght et al. Nature Genetics, 2002.
Cluster Statistics Cluster Models
The max-gap definition is the most widely used cluster definition in genomic analyses • Overbeek et al 1999: inferring functional coupling of genes in bacteria • Vision et al 2000: origins of genomic duplications in Arabidopsis • Friedman and Hughes 2001: gene duplication and structure of eukaryotic genomes • Tamames 2001: evolution of gene order conservation in prokaryotes • Vandepoele et al 2002: microcolinearity between Arabidopsis and rice • McLysaght et al 2002: genomic duplication during early chordate evolution • Simillion et al 2002: hidden duplications in Arabidopsis • Blanc et al 2003: recent polyploidy in Arabidopsis • Luc et al 2003: gene teams for comparative genomics • Chen et al 2004: operon prediction in newly sequenced bacteria • Bourque et al 2005: comparison of mammalian and chicken genome architectures • … Yet there is no formal statistical model for max-gap clusters
The Question Suppose two whole genomes were compared, and this max-gap cluster was identified: • Is this cluster biologically meaningful? • Could it have occurred in a comparison of two random genomes?
Statistical Testing • Hypothesis testing • Alternate hypothesis: shared ancestry • Null hypothesis: random gene order • Discard clusters that could have arisen under the null model • Determine the probability of observing a similar cluster under the null hypothesis
The Problem: Given an allowed gap size of g, what is the probability of observing a max-gap cluster containing exactly h matching gene pairs? How do we calculate this probability? [Figure label: h = 4]
The Inputs [Figure labels: n = 22, m = 6, h = 4, g = 2] • n: number of genes in each genome • m: number of matching gene pairs • g: the maximum gap allowed in a cluster • h: number of matching genes in the cluster
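Given the four inputs above, the probability in question can also be estimated by simulation, which is useful as a sanity check on any analytical formula. The sketch below is a Monte Carlo approximation under stated assumptions: the null model places the m matched genes at uniformly random distinct positions in each genome, independently, and clusters are found with a simple recursive splitting search. All function names, the trial count, and the seed are hypothetical.

```python
import random

def runs_within_gap(genes, pos, g):
    """Sort genes by position and cut wherever the gap exceeds g."""
    ordered = sorted(genes, key=lambda x: pos[x])
    runs = [[ordered[0]]]
    for prev, nxt in zip(ordered, ordered[1:]):
        if pos[nxt] - pos[prev] - 1 > g:
            runs.append([])
        runs[-1].append(nxt)
    return runs

def cluster_sizes(genes, pos1, pos2, g):
    """Sizes of all maximal max-gap clusters among a non-empty list of genes."""
    runs = runs_within_gap(genes, pos1, g)
    if len(runs) == 1:
        runs = runs_within_gap(genes, pos2, g)
        if len(runs) == 1:
            return [len(genes)]
    sizes = []
    for run in runs:
        sizes.extend(cluster_sizes(run, pos1, pos2, g))
    return sizes

def random_positions(n, m, rng):
    """Assign m matched genes to distinct random positions among n genes."""
    return dict(enumerate(rng.sample(range(n), m)))

def estimate_probability(n, m, g, h, trials=2000, seed=0):
    """Fraction of random genome pairs containing a cluster of exactly h genes."""
    rng = random.Random(seed)
    genes = list(range(m))
    hits = 0
    for _ in range(trials):
        pos1 = random_positions(n, m, rng)
        pos2 = random_positions(n, m, rng)
        if h in cluster_sizes(genes, pos1, pos2, g):
            hits += 1
    return hits / trials

print(estimate_probability(n=22, m=6, g=2, h=4))
```

Replacing the membership test with `max(...) >= h` would instead estimate the chance of seeing a cluster at least as large as h.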
The probability when m = n If gene content is identical… …the probability of a max-gap cluster is 1 (regardless of the allowed gap size)
Probability of a cluster of size h when m < n [Figure labels: m genes, h genes, m-h genes] Basic approach: enumerate all ways to • place the m-h remaining genes so they do not extend the cluster • create chains of h genes in both genomes; then normalize to get a probability
Probability of observing a cluster of size h: $$P(\text{cluster of size } h) = \frac{\#\{\text{ways to place } h \text{ genes so they form a chain in both genomes}\} \times \#\{\text{ways to place the } m-h \text{ remaining genes so they do not extend the cluster}\}}{\#\{\text{configurations of } m \text{ gene pairs in two genomes of size } n\}}$$
Number of ways to place h genes in two genomes so they form a cluster [Figure labels: m genes, h genes, m-h genes] • Select h spots in each genome, so they form a max-gap chain • Choose h genes to compose the cluster • Assign each gene to a selected spot in each genome (Hoberman, Sankoff, Durand, RECOMB Comparative Genomics, 2004)
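The first step above, selecting h spots that form a max-gap chain, is a counting problem that can be checked numerically for one genome. The sketch below counts such selections in a genome of n genes twice, by brute-force enumeration and by a small dynamic program; both function names are hypothetical and this is not the closed form from the cited paper. Since the step is carried out in each genome, the count applies once per genome.

```python
from itertools import combinations

def count_chains_brute(n, h, g):
    """Count h-subsets of positions 0..n-1 whose adjacent gaps are all <= g."""
    return sum(
        all(b - a - 1 <= g for a, b in zip(spots, spots[1:]))
        for spots in combinations(range(n), h)
    )

def count_chains_dp(n, h, g):
    """Same count by dynamic programming.
    ways[p] = number of chains of the current length whose last spot is p."""
    ways = [1] * n  # chains of length 1
    for _ in range(h - 1):
        # the previous spot q must satisfy p - g - 1 <= q <= p - 1
        ways = [sum(ways[max(0, p - g - 1):p]) for p in range(n)]
    return sum(ways)

print(count_chains_brute(5, 2, 1), count_chains_dp(5, 2, 1))  # 7 7
print(count_chains_brute(22, 4, 2) == count_chains_dp(22, 4, 2))  # True
```

The enumeration is exponential in h and serves only to validate the dynamic program on small inputs such as the slide's n = 22, h = 4, g = 2.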