Identifying conserved spatial patterns in genomes

Identifying conserved spatial patterns in genomes. Rose Hoberman, Computer Science Department, Carnegie Mellon University. University of Chicago, Oct 20, 2006

My focus: Spatial Comparative Genomics • Understanding genome structure, especially how the spatial arrangement of elements within the genome changes and evolves.

A simple model of a genome: an ordered list of genes [Figure: a genome drawn as ordered lists of numbered genes]

Genomic Change • Ancestral genome, speciation, species 1 and species 2 • Sequence Mutation + Chromosomal Rearrangements

Types of Genomic Rearrangements • Inversions • Duplications/Insertions • Loss [Figure: numbered gene orders before and after each rearrangement]

Types of Genomic Rearrangements • Inversions • Duplications/Insertions • Loss • Fissions and fusions [Figure: numbered gene orders before and after each rearrangement]
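Treating a genome as an ordered list of genes, each rearrangement type listed on this slide becomes a simple list operation. The sketch below is illustrative only: the function names are hypothetical, gene orientation (strand) is ignored as in the unsigned list model, and a fission or fusion is modeled as splitting or concatenating chromosome lists.

```python
from typing import List, Tuple

Genome = List[int]  # assumption: one chromosome = an ordered list of gene IDs


def invert(genome: Genome, start: int, end: int) -> Genome:
    """Inversion: reverse the order of the genes in genome[start:end]."""
    return genome[:start] + genome[start:end][::-1] + genome[end:]


def lose(genome: Genome, start: int, end: int) -> Genome:
    """Loss: delete the genes in genome[start:end]."""
    return genome[:start] + genome[end:]


def duplicate(genome: Genome, start: int, end: int, dest: int) -> Genome:
    """Duplication/insertion: insert a copy of genome[start:end] at index dest."""
    return genome[:dest] + genome[start:end] + genome[dest:]


def fission(genome: Genome, pos: int) -> Tuple[Genome, Genome]:
    """Fission: split one chromosome into two at index pos."""
    return genome[:pos], genome[pos:]


def fusion(left: Genome, right: Genome) -> Genome:
    """Fusion: join two chromosomes end to end."""
    return left + right


ancestor = [1, 2, 3, 4, 5, 6, 7]
print(invert(ancestor, 1, 4))
print(lose(ancestor, 2, 4))
print(duplicate(ancestor, 0, 2, 5))
print(fission(ancestor, 3))
```

Each function returns a new list, so a sequence of rearrangements can be applied to the same ancestral order to produce two diverged descendants.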

An Essential Task for Spatial Comparative Genomics • Identify chromosomal regions that descended from the same region in the genome of the common ancestor [Figure: numbered gene orders of Species 1 and Species 2]

Outline • Introduction and Motivation • Evolution of spatial organization • Applications: why identify related genomic regions? • Problem Background • Why is this challenging? • Introduction to cluster finding • Results • Statistics for pairwise clusters • Statistics for three-way clusters

Identification of homologous chromosomal segments is a key task in comparative genomics • Genome evolution • Reconstruct history of chromosomal rearrangements • Infer ancestral genetic map • Phylogeny reconstruction • Identify ancient whole genome duplications Pevzner, Tesler. Genome Research 2003

Identification of homologous chromosomal segments is a key task in comparative genomics • Genome evolution • Reconstruct history of chromosomal rearrangements • Infer ancestral genetic map • Phylogeny reconstruction • Identify ancient whole genome duplications Guillaume Bourque et al. Genome Research 2004

Identification of homologous chromosomal segments is a key task in comparative genomics • Genome evolution • Reconstruct history of chromosomal rearrangements • Infer ancestral genetic map • Phylogeny reconstruction • Identify ancient whole genome duplications [Figure: an ancestral chromosome undergoes whole genome duplication, giving chromosome 1 and chromosome 2]

Identification of homologous chromosomal segments is a key task in comparative genomics • Genome evolution • Reconstruct history of chromosomal rearrangements • Infer ancestral genetic map • Phylogeny reconstruction • Identify ancient whole genome duplications McLysaght et al Nature Genetics, 2002.

Identification of conserved chromosomal structure is a key task in comparative genomics • Understand gene function and regulation in bacteria • Infer functional associations • Predict operons • Identify horizontal transfers Insertion Loss Inversions

Closely related genomes • Related regions are easy to identify [Figure: numbered gene orders of Species 1 and Species 2]

Five hundred million years...

More Distantly Related Genomes • Homologous regions are harder to detect, but there is still spatial evidence of common ancestry • Similar gene content • Neither gene content nor order is perfectly preserved [Figure: numbered gene orders of two diverged genomes]

The signature of diverged regions: gene clusters • Similar gene content • Neither gene content nor order is perfectly preserved [Figure: numbered gene orders of two diverged genomes]

A Framework for Identifying Gene Clusters • Find homologous genes • Formally define a “gene cluster” • Design an algorithm to find clusters • Statistically verify clusters

Why Validate Clusters Statistically? • After sufficient time has passed, gene order will become randomized • Uniform random data tends to be “clumpy” • Some genes will end up close together in both genomes simply by chance [Figure: numbered gene orders of two genomes]
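The claim that random gene orders still look “clumpy” can be checked with a small simulation. This is a sketch under stated assumptions, not the analysis from the talk: both genomes have n genes, m homologous pairs are placed uniformly at random in each, and two homologs count as “close” when at most g other genes lie between them in both genomes. All function names are hypothetical.

```python
import random
from itertools import combinations


def count_close_pairs(pos1, pos2, g):
    """Count homolog pairs separated by at most g intervening genes in BOTH genomes.

    pos1[k] and pos2[k] are the positions of homolog k in genome 1 and genome 2.
    """
    count = 0
    for a, b in combinations(range(len(pos1)), 2):
        gap1 = abs(pos1[a] - pos1[b]) - 1
        gap2 = abs(pos2[a] - pos2[b]) - 1
        if gap1 <= g and gap2 <= g:
            count += 1
    return count


def mean_chance_pairs(n, m, g, trials=200, seed=0):
    """Average number of close pairs when homolog positions are uniformly random."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        pos1 = rng.sample(range(n), m)  # m distinct positions in genome 1
        pos2 = rng.sample(range(n), m)  # independent positions in genome 2
        total += count_close_pairs(pos1, pos2, g)
    return total / trials


print(mean_chance_pairs(n=100, m=30, g=5))
```

Even with no shared ancestry built into the simulation, the average is above zero, which is why small clusters need statistical validation before they are interpreted.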

Cluster Statistics Cluster Models

A Framework for Identifying Gene Clusters • Find homologous genes • Formally define a “gene cluster” • Design an algorithm to find clusters • Statistically verify clusters

Gene Homology • Identification of homologous gene pairs • generally based on sequence similarity • conserved genomic context is also informative • Assumptions • matches are binary (similarity scores are discarded) • each gene is homologous to at most one other gene in the other genome
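The two assumptions on this slide (binary matches, at most one homolog per gene) can be imposed on a list of scored similarity hits. The sketch below uses a greedy best-score-first rule, which is an assumption made here for illustration; the slides do not say how the one-to-one matching is chosen. The function name, the hit format, and the score threshold are all hypothetical.

```python
def one_to_one_homologs(hits, min_score):
    """Reduce scored hits (gene1, gene2, score) to binary one-to-one matches.

    Hits are visited from highest to lowest score; a hit is kept only if
    neither gene has been matched already. Scores are then discarded.
    """
    used1, used2 = set(), set()
    pairs = []
    for gene1, gene2, score in sorted(hits, key=lambda hit: -hit[2]):
        if score < min_score:
            break  # remaining hits are all below the threshold
        if gene1 in used1 or gene2 in used2:
            continue  # each gene may match at most one gene in the other genome
        used1.add(gene1)
        used2.add(gene2)
        pairs.append((gene1, gene2))
    return pairs


hits = [("a1", "b1", 90), ("a1", "b2", 80), ("a2", "b2", 70), ("a3", "b3", 10)]
print(one_to_one_homologs(hits, min_score=50))
```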

A Framework for Identifying Gene Clusters • Find homologous genes • Formally define a “gene cluster” • Devise an algorithm to identify clusters • Statistically verify clusters

Where are the gene clusters? Intuitive notion: pairs of regions that are dense with homologs How can we formalize this intuition?

The r-window cluster definition • Two windows of size r that share at least m homologous gene pairs • r is a user-specified parameter (Cavalcanti et al 03, Durand and Sankoff 03, Friedman and Hughes 01, Raghupathy and Durand 05) [Figure: example with r = 6]
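A brute-force reading of the r-window definition: slide a window of r consecutive genes along each genome and report every pair of windows that shares enough homologs. This is a minimal sketch, assuming homologous genes carry the same ID in both genomes and unmatched genes carry IDs that appear only once; `min_shared` plays the role of the threshold called m on this slide, and the function name is hypothetical.

```python
def r_window_clusters(genome1, genome2, r, min_shared):
    """Return (start1, start2, shared_ids) for every qualifying pair of r-windows."""
    found = []
    for i in range(len(genome1) - r + 1):
        window1 = set(genome1[i:i + r])
        for j in range(len(genome2) - r + 1):
            shared = window1 & set(genome2[j:j + r])
            if len(shared) >= min_shared:
                found.append((i, j, sorted(shared)))
    return found


genome1 = [1, 2, 3, 4, 5, 6, 7, 8]
genome2 = [9, 3, 1, 10, 2, 11, 12, 13]
print(r_window_clusters(genome1, genome2, r=4, min_shared=3))
```

The quadratic scan over window pairs is fine for illustration; it is not meant as an efficient search.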

A max-gap chain g= 2 gap= 3 • The distance or “gap” between genes is equal to the number of intervening genes • A set of genes in a genome form a max-gap chain if • the gap between adjacent genes is never greater than g (a user-specified parameter)

The max-gap cluster definition • A set of genes forms a max-gap cluster in two genomes if • the genes form a max-gap chain in each genome • the cluster is maximal (i.e. not contained within a larger cluster) [Figure labels: gap = 3, g = 2, g = 3]
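Both conditions of the definition can be checked directly on a small example. The sketch below tests maximality by brute force over every superset, which is exponential and only meant to make the definition concrete; `pos1` and `pos2` are hypothetical dictionaries mapping each matched gene to its position in genome 1 and genome 2.

```python
from itertools import combinations


def is_chain(genes, pos, g):
    """True if the genes form a max-gap chain given one genome's positions."""
    ordered = sorted(pos[gene] for gene in genes)
    return all(b - a - 1 <= g for a, b in zip(ordered, ordered[1:]))


def is_max_gap_cluster(genes, pos1, pos2, g):
    """True if genes form a chain in both genomes and no superset also does."""
    genes = set(genes)
    if not (is_chain(genes, pos1, g) and is_chain(genes, pos2, g)):
        return False
    others = [gene for gene in pos1 if gene not in genes]
    for k in range(1, len(others) + 1):
        for extra in combinations(others, k):
            bigger = genes | set(extra)
            if is_chain(bigger, pos1, g) and is_chain(bigger, pos2, g):
                return False  # contained in a larger chain pair, so not maximal
    return True


pos1 = {"a": 0, "b": 1, "c": 4, "d": 9}
pos2 = {"a": 5, "b": 3, "c": 4, "d": 20}
print(is_max_gap_cluster({"a", "b", "c"}, pos1, pos2, g=2))
print(is_max_gap_cluster({"a", "b"}, pos1, pos2, g=2))
```

Note that gene order differs between the two genomes in this example; the definition constrains only gaps, not order.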

A Framework for Identifying Gene Clusters • Find homologous genes • Formally define a “gene cluster” • Devise an algorithm to identify clusters • Statistically verify clusters

Max-gap search algorithms • Many genomic studies use a max-gap criterion • Each group designs its own search algorithm • These are often greedy algorithms, but greedy algorithms miss disordered max-gap clusters (Hoberman et al, RECOMB Comp Genomics 2005)

Greedy, Agglomerative Algorithms g = 2 • initialize a cluster as a single homologous pair • search for a gene in proximity on both chromosomes • either extend the cluster and repeat, or terminate

Greedy Algorithms Impose Order Constraints • [Figure: a max-gap cluster of size four, g = 2] • A greedy, agglomerative algorithm will not find this cluster, since there is no max-gap cluster of size two
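The failure described here can be reproduced with a small constructed example. The greedy procedure below is one plausible reading of the three steps on the greedy slide (seed with one homologous pair, add any gene within the gap on both chromosomes, repeat); published implementations differ, and the gene names and positions are made up for illustration.

```python
def greedy_cluster(seed, pos1, pos2, g):
    """Grow a cluster from one gene by adding genes near it in both genomes."""
    cluster = {seed}
    grew = True
    while grew:
        grew = False
        for gene in pos1:
            if gene in cluster:
                continue
            near1 = any(abs(pos1[gene] - pos1[c]) - 1 <= g for c in cluster)
            near2 = any(abs(pos2[gene] - pos2[c]) - 1 <= g for c in cluster)
            if near1 and near2:
                cluster.add(gene)
                grew = True
    return cluster


def is_chain(genes, pos, g):
    """True if the genes form a max-gap chain given one genome's positions."""
    ordered = sorted(pos[gene] for gene in genes)
    return all(b - a - 1 <= g for a, b in zip(ordered, ordered[1:]))


# Genome 1 order: a b c d; genome 2 order: c a d b (two unmatched genes between each).
pos1 = {"a": 0, "b": 3, "c": 6, "d": 9}
pos2 = {"c": 0, "a": 3, "d": 6, "b": 9}
g = 2

print(is_chain(pos1, pos1, g) and is_chain(pos1, pos2, g))  # all four genes chain in both
print([sorted(greedy_cluster(seed, pos1, pos2, g)) for seed in pos1])
```

All four genes satisfy the max-gap criterion together, yet no two of them are within the gap in both genomes, so every greedy run stalls at its seed.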

Algorithms and Definition Mismatch • Greedy, bottom-up algorithms will not find all max-gap clusters • There is an efficient divide-and-conquer algorithm to find maximal max-gap clusters (Bergeron et al, WABI, 2002) • Cluster statistics depend on the search space, which depends on which algorithm is used
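A top-down search avoids the order constraint of greedy growth. The sketch below is a naive recursive-splitting scheme in the spirit of a divide-and-conquer approach; it is not the efficient algorithm of Bergeron et al, whose details are not given in these slides. It starts from all matched genes and keeps splitting a set wherever a gap larger than g appears in either genome, until every remaining set is a chain in both. Names and the `min_size` filter are assumptions.

```python
def split_chains(genes, pos, g):
    """Split genes into max-gap chains according to one genome's positions."""
    ordered = sorted(genes, key=pos.__getitem__)
    chains = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if pos[cur] - pos[prev] - 1 <= g:  # gap = number of intervening genes
            chains[-1].append(cur)
        else:
            chains.append([cur])
    return chains


def max_gap_clusters(pos1, pos2, g, min_size=2):
    """Return maximal gene sets that form a max-gap chain in both genomes."""
    clusters = []
    stack = [list(pos1)] if pos1 else []
    while stack:
        genes = stack.pop()
        parts = split_chains(genes, pos1, g)
        if len(parts) == 1:
            parts = split_chains(genes, pos2, g)
            if len(parts) == 1:  # unsplit in both genomes: a maximal cluster
                if len(genes) >= min_size:
                    clusters.append(sorted(genes))
                continue
        stack.extend(parts)  # every part is strictly smaller, so this terminates
    return sorted(clusters)


# Disordered example: no two genes are close in both genomes, but all four chain.
pos1 = {"a": 0, "b": 3, "c": 6, "d": 9, "e": 20}
pos2 = {"c": 0, "e": 1, "a": 3, "d": 6, "b": 9}
print(max_gap_clusters(pos1, pos2, g=2))
```

Because a set is only ever split at a gap that no chain could span, every maximal cluster survives intact inside one of the pieces.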

A Framework for Identifying Gene Clusters • Find homologous genes • Formally define a “gene cluster” • Devise an algorithm to identify clusters • Statistically verify clusters An example…

Example: Whole genome self-comparison to detect duplicated blocks • Chose g=30 • Compared all human chromosomes to all other chromosomes to find max-gap gene clusters McLysaght et al. Nature Genetics, 2002.

How can we use statistical analysis? • Could two regions display this degree of similarity simply by chance? • Is g=30 a reasonable choice of gap size? • Are larger clusters less likely to occur by chance? • How large does a cluster have to be before we are surprised to observe it? [Figure labels: Chr 17, 10 genes duplicated out of ~100, 29 genes; Chr 3, 3 genes duplicated out of ~25, 13 genes] McLysaght et al. Nature Genetics, 2002.

Cluster Statistics Cluster Models

The max-gap definition is the most widely used cluster definition in genomic analyses • Overbeek et al 1999: inferring functional coupling of genes in bacteria • Vision et al 2000: origins of genomic duplications in Arabidopsis • Friedman and Hughes 2001: gene duplication and structure of eukaryotic genomes • Tamames 2001: evolution of gene order conservation in prokaryotes • Vandepoele et al 2002: microcolinearity between Arabidopsis and rice • McLysaght et al 2002: genomic duplication during early chordate evolution • Simillion et al 2002: hidden duplications in Arabidopsis • Blanc et al 2003: recent polyploidy in Arabidopsis • Luc et al 2003: gene teams for comparative genomics • Chen et al 2004: operon prediction in newly sequenced bacteria • Bourque et al 2005: comparison of mammalian and chicken genome architectures • … Yet there is no formal statistical model for max-gap clusters

The Question Suppose two whole genomes were compared, and this max-gap cluster was identified: • Is this cluster biologically meaningful? • Could it have occurred in a comparison of two random genomes?

Statistical Testing • Hypothesis testing • Alternate hypothesis: shared ancestry • Null hypothesis: random gene order • Determine the probability of observing a similar cluster under the null hypothesis • Discard clusters that could have arisen under the null model

The Problem • Given an allowed gap size of g, what is the probability of observing a max-gap cluster containing exactly h matching gene pairs? • How do we calculate this probability? [Figure: example cluster with h = 4]

The Inputs • n: number of genes in each genome • m: number of matching gene pairs • g: the maximum gap allowed in a cluster • h: number of matching genes in the cluster [Figure: example with n = 22, m = 6, h = 4, g = 2]
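Given these four inputs, the null probability can also be approximated by simulation, which is useful as a sanity check on any analytical formula. The sketch below is a Monte Carlo estimate under assumptions made here, not the combinatorial method of the talk: the m matched genes are placed uniformly at random in each genome, clusters are found by naive recursive splitting, and the statistic is "some maximal max-gap cluster has at least h genes" rather than exactly h. All names are hypothetical.

```python
import random


def split_chains(genes, pos, g):
    """Split genes into max-gap chains according to one genome's positions."""
    ordered = sorted(genes, key=pos.__getitem__)
    chains = [[ordered[0]]]
    for prev, cur in zip(ordered, ordered[1:]):
        if pos[cur] - pos[prev] - 1 <= g:
            chains[-1].append(cur)
        else:
            chains.append([cur])
    return chains


def max_gap_clusters(pos1, pos2, g):
    """Maximal gene sets (including singletons) that chain in both genomes."""
    clusters = []
    stack = [list(pos1)] if pos1 else []
    while stack:
        genes = stack.pop()
        parts = split_chains(genes, pos1, g)
        if len(parts) == 1:
            parts = split_chains(genes, pos2, g)
            if len(parts) == 1:
                clusters.append(genes)
                continue
        stack.extend(parts)
    return clusters


def estimate_null_probability(n, m, g, h, trials=1000, seed=0):
    """Fraction of random genome pairs containing a cluster of at least h genes."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        pos1 = dict(enumerate(rng.sample(range(n), m)))  # gene k -> position
        pos2 = dict(enumerate(rng.sample(range(n), m)))
        if any(len(c) >= h for c in max_gap_clusters(pos1, pos2, g)):
            hits += 1
    return hits / trials


p = estimate_null_probability(n=22, m=6, g=2, h=4)
print(f"estimated null probability: {p:.3f}")
```

A cluster would be called significant only if this estimate is small; when every gene is matched (m = n) the estimate is 1 for any g, since all genes are adjacent in both genomes.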

The probability when m = n • If gene content is identical, every gene is matched and all gaps are zero, so the probability of a max-gap cluster is 1 (regardless of the allowed gap size)

Probability of a cluster of size h when m < n • Basic approach: enumerate all ways to • Create chains of h genes in both genomes • Place the m-h remaining genes so they do not extend the cluster • Multiply the two counts and normalize to get a probability [Figure: the m matched genes divided into h cluster genes and m-h remaining genes]

Probability of observing a cluster of size h

$$
P(\text{cluster of size } h) = \frac{\#\{\text{ways to place } h \text{ genes so they form a chain in both genomes}\} \times \#\{\text{ways to place the } m-h \text{ remaining genes so they do not extend the cluster}\}}{\#\{\text{all configurations of } m \text{ gene pairs in two genomes of size } n\}}
$$

Number of ways to place h genes in two genomes so they form a cluster • Select h spots in each genome, so they form a max-gap chain • Choose h genes to compose the cluster • Assign each gene to a selected spot in each genome (Hoberman, Sankoff, Durand, RECOMB Comparative Genomics, 2004)
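The first step, selecting h spots in one genome so that they form a max-gap chain, can be counted on small inputs in two independent ways. This sketch covers only that step and is not the closed-form expression from the cited paper: one function enumerates all h-subsets of the n positions, the other enumerates the h-1 gap sizes (each between 0 and g) and counts the start positions that fit. Function names are hypothetical.

```python
from itertools import combinations, product


def count_chains_brute(n, h, g):
    """Count h-subsets of positions 0..n-1 whose adjacent gaps never exceed g."""
    return sum(
        all(b - a - 1 <= g for a, b in zip(spots, spots[1:]))
        for spots in combinations(range(n), h)
    )


def count_chains_by_gaps(n, h, g):
    """Same count: choose each of the h-1 gaps in 0..g, then a start position."""
    total = 0
    for gap_sizes in product(range(g + 1), repeat=h - 1):
        span = h + sum(gap_sizes)        # genes covered from first to last spot
        total += max(0, n - span + 1)    # start positions that keep the chain inside
    return total


print(count_chains_brute(22, 4, 2), count_chains_by_gaps(22, 4, 2))
```

Agreement between the two counts on small cases is a cheap check before trusting any formula for larger n.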

Probability of observing a cluster of size h

$$
P(\text{cluster of size } h) = \frac{\#\{\text{ways to place } h \text{ genes so they form a chain in both genomes}\} \times \#\{\text{ways to place the } m-h \text{ remaining genes so they do not extend the cluster}\}}{\#\{\text{all configurations of } m \text{ gene pairs in two genomes of size } n\}}
$$
