Defining Gene Clusters: 24 Ways of Looking at Mount Fuji. Anne Bergeron, UQAM Dublin, September 19, 2005. 7. Mt Fuji from the Foot.
"It struck me that it would be good to take one thing in life and regard it from many viewpoints, ... " Roger Zelazny
The basic problem. We start with a set of genomes, labeled by gene names, domains, or synteny blocks, and a similarity relation on those labels. Highlighting a gene means selecting all labels that are similar to it. Genes, or other types of signals, can appear in multiple copies in a genome, or even be missing. In this talk, the similarity relation is "given" and is an equivalence relation. Figure: Genome A, Genome B, Genome C.
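The highlighting operation described above can be sketched in a few lines of Python. This is only an illustration: the genomes, the labels (`a1`, `a2`, `b`, `c`) and the `label_class` mapping that encodes the equivalence relation are all made up for the example, not taken from the talk.

```python
def highlight(genomes, label_class, gene):
    """Return, for each genome, the positions whose label is similar to `gene`.

    `label_class` maps every label to its equivalence class, which is one
    way of encoding a "given" equivalence relation on labels.
    """
    target = label_class[gene]
    return {
        name: [i for i, label in enumerate(labels) if label_class[label] == target]
        for name, labels in genomes.items()
    }


# Hypothetical data: a1 and a2 are two similar labels (same class "a").
genomes = {
    "A": ["a1", "b", "c", "a2"],  # two copies of the highlighted gene
    "B": ["c", "a1", "b"],        # one copy
    "C": ["b", "c"],              # the gene is missing
}
label_class = {"a1": "a", "a2": "a", "b": "b", "c": "c"}

hits = highlight(genomes, label_class, "a1")
print(hits)
```

The toy data covers the three situations named on the slide: multiple copies (Genome A), a single copy (Genome B) and a missing gene (Genome C).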
The basic problem. We are interested in what happens when a set of genes is highlighted. Figure: a set of genes highlighted in Genome A, Genome B and Genome C. Boring...
The basic problem. Figure: another set of genes highlighted in Genome A, Genome B and Genome C. Interesting? Measures of surprise are studied by Durand, Haque, Hoberman, Sankoff, Raghupathy, etc.
The basic problem. Goal: Given a (big) set of genomes, automatically identify all potentially interesting sets of genes.
Towards formal models 1. Mount Fuji from Owari
Towards formal models. What do labels stand for? How many labels and genomes do we want to compare? What do we want to do with the resulting clusters?
Towards formal models: Example 1. Definition of labels and similarity: Large homology segments disrupted only by local micro-rearrangements. A total of 281 synteny blocks, colored in the human genome by their mouse chromosome number. Interesting features: Chromosome X, Chromosome 17, Chromosome 20. Application: Genome evolution. From: Eichler and Sankoff, Science (301:793-797), 2003.
Towards formal models: Example 2. Definition of labels and similarity: Gene annotations of chloroplasts. Interesting features: Rearrangements. Application: Phylogeny.
Towards formal models: Example 3. Definition of labels and similarity: PFAM Domain numbers labeling four bacterial genomes. Interesting features: Duplications, Insertions, Rearrangements. Application: Operon identification. From: Pasek et al, Genome Research (15:867-874), 2005.
Towards formal models: Example 4. Definition of labels and similarity: PFAM Domain numbers labeling four bacterial genomes. Application: Identification of orthologs and/or duplicate segments. With such a high E-value, the potential duplicate would have been missed by a comparison based on sequence similarity. From: Pasek et al, Genome Research (15:867-874), 2005.
Towards formal models: Example 5. Definition of labels and similarity: Large homology segments disrupted only by local micro-rearrangements. Comparing 16 segments of the mouse and rat chromosome X. Mouse = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16. Rat = -4 -3 -2 1 -13 -15 14 -16 8 9 10 -11 12 5 6 7. Application: Reconstructing ancestors. From: Bérard et al, WABI 2005.
Down to earth details 2. Mt Fuji from a Teahouse at Yoshida
Down to earth details. Do we allow gaps? Do we allow rearrangements? Do we allow duplicates and missing genes? Do we allow multiple genomes or self-comparison? How about "extensions"?
Down to earth details: Model 1. No gaps, no duplications, any rearrangement. Figure: a set of genes in Genome A, Genome B and Genome C.
Down to earth details: Model 1. No gaps, no duplications, any rearrangement. What about this gene? Should we add it? Figure: a set of genes in Genome A, Genome B and Genome C.
Down to earth details: Model 1. No gaps, no duplications, any rearrangement. What about this gene? Should we add it? Figure: a set of genes in Genome A, Genome B and Genome C, with the label "Extension".
Down to earth details: Model 2. No gaps, duplications, any rearrangement. Figure: a set of genes in Genome A, Genome B and Genome C, with a legend entry for genes not in the set.
Down to earth details: Model 3. Gaps, no duplications, any rearrangement. Figure: a set of genes in Genome A, Genome B and Genome C.
Down to earth details: Model 4. Gaps, missing/inserted genes, any rearrangement. Figure: a set of genes in Genome A, Genome B and Genome C.
Down to earth details: Model 5. Gaps, missing genes, duplications, any rearrangement. With gap size = 1, we get 4 occurrences. Reducing the number of genes... Figure: a set of genes in Genome A, Genome B and Genome C.
Down to earth details: Model 5. ... yields 5 occurrences. Figure: a smaller set of genes in Genome A, Genome B and Genome C.
A general framework 24. Mount Fuji in a Summer Storm
A general framework. Given a gap g, an occurrence of S is a maximal run of genes of S, separated by gaps of at most g genes not in S, that contains at least one copy of each gene of S. A set of genes S is an extension of a set T, included in S, if each occurrence of T is contained in an occurrence of S. Figure: a chromosome with a set S of genes, showing Occurrence #1 and Occurrence #2 and gaps marked > g or ≤ g; a second panel shows sets S and T with the caption "is an extension of".
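The two definitions above translate fairly directly into code. The following is a minimal sketch for a single chromosome, assuming the chromosome is a list of labels already reduced to their equivalence classes; the function names `occurrences` and `is_extension` and the toy chromosome are illustrative choices, not part of the talk.

```python
def occurrences(chromosome, S, g):
    """Occurrences of the gene set S in `chromosome` for gap size g.

    Returns (start, end) positions, inclusive, of each maximal run of genes
    of S whose consecutive members are separated by at most g genes not in S
    and that contains at least one copy of every gene of S.
    """
    S = set(S)
    runs, current = [], []
    for pos, gene in enumerate(chromosome):
        if gene not in S:
            continue
        # Too many non-S genes since the previous S gene: close the run.
        if current and pos - current[-1] - 1 > g:
            runs.append(current)
            current = []
        current.append(pos)
    if current:
        runs.append(current)
    return [(run[0], run[-1]) for run in runs
            if {chromosome[p] for p in run} == S]


def is_extension(chromosome, S, T, g):
    """True if T is included in S and every occurrence of T lies inside an occurrence of S."""
    if not set(T) <= set(S):
        return False
    occ_S = occurrences(chromosome, S, g)
    return all(any(s0 <= t0 and t1 <= s1 for s0, s1 in occ_S)
               for t0, t1 in occurrences(chromosome, T, g))


# Toy chromosome: x and y stand for genes outside the sets of interest.
chromosome = list("abxcyycab")
print(occurrences(chromosome, "abc", 1))         # two occurrences
print(is_extension(chromosome, "abc", "ab", 1))  # {a,b,c} extends {a,b}
print(is_extension(chromosome, "abc", "ab", 0))  # no longer true without gaps
```

On this toy chromosome, {a, b, c} is an extension of {a, b} when g = 1, but not when g = 0, because the first occurrence of {a, b} is then no longer covered by an occurrence of the larger set.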
A general framework. Choices:
• g = 0 or g > 0
When g = 0, the number of candidates is polynomial in the number of genes. When g > 0, the number of candidates can be exponential in the number of genes. Even with g = 1, there are problems. For example, with g = 0, the sequence of genes a b c d e f produces one potential cluster that contains both a and f. But with g = 1, there are 8 of them:
a b c d e f
a b c d f
a b c e f
a b d e f
a c d e f
a c e f
a b d f
a c d f
The number of these sequences grows in a Fibonacci progression!
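The count of 8 and the Fibonacci growth can be checked by brute force. The sketch below enumerates every subsequence that keeps the first and last gene and never skips more than g consecutive genes; the function name `candidates` is an illustrative choice, and the enumeration is deliberately naive (exponential), meant only to confirm the numbers on the slide.

```python
from itertools import combinations


def candidates(genes, g):
    """Subsequences of `genes` that keep both ends and skip at most g consecutive genes.

    Assumes len(genes) >= 2.
    """
    n = len(genes)
    inner = range(1, n - 1)
    found = []
    for k in range(len(inner) + 1):
        for kept_inner in combinations(inner, k):
            kept = (0, *kept_inner, n - 1)
            # Every gap between two kept genes must hold at most g skipped genes.
            if all(b - a - 1 <= g for a, b in zip(kept, kept[1:])):
                found.append("".join(genes[i] for i in kept))
    return found


print(len(candidates("abcdef", 0)))  # 1
print(len(candidates("abcdef", 1)))  # 8
# Counts for sequences of 2 to 8 genes with g = 1: 1, 2, 3, 5, 8, 13, 21
print([len(candidates("abcdefgh"[:n], 1)) for n in range(2, 9)])
```

Each count is the sum of the two previous ones because a valid selection either keeps the second-to-last gene or skips it, in which case it must keep the one before.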
A general framework. Choices:
• g = 0 or g > 0
• Duplications or no duplications
Duplications usually mean an exponential number of candidates but, most of the time, they are unavoidable. Models without duplications are, nevertheless, useful in many situations.
A general framework. Choices:
• g = 0 or g > 0
• Duplications or no duplications
• Three ways of filtering candidates
Filtering is mostly based on the properties of the extension relation. If the number of candidates is low, filtering is not necessary, but it can be relevant. For models with a huge number of candidates, filtering is a must.
A general framework. Choices:
• g = 0 or g > 0
• Duplications or no duplications
• Three ways of filtering candidates
• Formal or heuristic
Formal models have inherent computational problems when applied to real data. Heuristics will always be useful.
A general framework. Choices:
• g = 0 or g > 0
• Duplications or no duplications
• Three ways of filtering candidates
• Formal or heuristic
2 x 2 x 3 x 2 = 24. How convenient!
Common intervals: Voluntary simplicity* *Voluntary simplicity is a lifestyle considered by its adherents to be a sustainable, ecologically sensitive alternative to the typical, western consumerist lifestyle. [Ref. Wikipedia] 20. Mount Fuji from Inume Pass
Common intervals: Voluntary simplicity. A (partial) list of credits: Uno and Yagiura (2000); Heber and Stoye (2001); Bergeron, Heber and Stoye (2002); Didier (2003); Schmidt and Stoye (2004); Figeac and Varré (2004); Bérard, Bergeron and Chauve (2004); Blin, Chauve and Fertin (2005); Landau, Parida and Weizman (2005); Tannier and Sagot (2005); Bérard, Bergeron, Chauve and Paul (2005); Bergeron, Chauve, de Montgolfier and Raffinot (2005).
Common intervals. Choices: • g = 0 • No duplications • No filtering • Formal. The basic model of common intervals often yields a large number of 'uninteresting clusters'. However, filtering provides unusual information on whole genome organization. Figure: Genome A, Genome B, Genome C.
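With g = 0 and no duplications, each genome is a permutation of the same genes, and a set of genes is a common interval when it is contiguous in every genome. The sketch below is a naive quadratic enumeration written for illustration; it is not one of the optimized algorithms from the credited papers, and the three toy genomes are invented. Single genes are left out, since they are trivially common.

```python
def common_intervals(genomes):
    """Common intervals (two or more genes) of a list of permutations of the same genes."""
    ref, others = genomes[0], genomes[1:]
    positions = [{gene: i for i, gene in enumerate(g)} for g in others]
    found = []
    for i in range(len(ref)):
        # Leftmost and rightmost position of ref[i..j] in each other genome.
        lows = [p[ref[i]] for p in positions]
        highs = list(lows)
        for j in range(i + 1, len(ref)):
            for k, p in enumerate(positions):
                lows[k] = min(lows[k], p[ref[j]])
                highs[k] = max(highs[k], p[ref[j]])
            # Contiguous everywhere iff each span has exactly j - i + 1 positions.
            if all(h - l == j - i for l, h in zip(lows, highs)):
                found.append(frozenset(ref[i:j + 1]))
    return found


# Hypothetical genomes over genes 1..5.
genome_a = [1, 2, 3, 4, 5]
genome_b = [2, 1, 3, 5, 4]
genome_c = [4, 5, 3, 1, 2]

for interval in common_intervals([genome_a, genome_b, genome_c]):
    print(sorted(interval))
```

Even on this five-gene example the output contains overlapping candidates such as {1, 2, 3} and {3, 4, 5}, which hints at why the unfiltered model produces many clusters.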
Common intervals -> Strong Intervals. Choices: • g = 0 • No duplications • Filtering • Formal. Both t and u are two different extensions of the common interval s: Remove them. Figure: Genome A and Genome B, with common intervals s, t, u, v and strong intervals s, v.
Strong Intervals. This tree displays the strong intervals between the synteny blocks of the mouse and rat chromosomes X. Mouse = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16. Rat = -4 -3 -2 1 -13 -15 14 -16 8 9 10 -11 12 5 6 7. This kind of tree is known as a PQ-tree. Strong intervals possess a rich combinatorial structure that can be exploited from both the biological and the computational perspectives. From: Bérard et al, WABI 2005.
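The strong intervals of the mouse and rat example can be recomputed with a short brute-force sketch. It assumes the usual definition from the common-interval literature, namely that a common interval is strong when it overlaps no other common interval (two sets overlap when they intersect without one containing the other); the slides phrase the filter in terms of extensions instead. Signs are ignored, single blocks are left out, and the function names are illustrative.

```python
def common_intervals(p, q):
    """Common intervals (two or more elements) of two permutations, as frozensets."""
    pos = {gene: i for i, gene in enumerate(q)}
    found = []
    for i in range(len(p)):
        lo = hi = pos[p[i]]
        for j in range(i + 1, len(p)):
            lo, hi = min(lo, pos[p[j]]), max(hi, pos[p[j]])
            if hi - lo == j - i:  # p[i..j] is contiguous in q as well
                found.append(frozenset(p[i:j + 1]))
    return found


def overlap(a, b):
    """True if a and b intersect and neither contains the other."""
    return bool(a & b) and not (a <= b or b <= a)


def strong_intervals(p, q):
    """Common intervals that overlap no other common interval (assumed definition)."""
    intervals = common_intervals(p, q)
    return [a for a in intervals if not any(overlap(a, b) for b in intervals)]


mouse = list(range(1, 17))
rat = [4, 3, 2, 1, 13, 15, 14, 16, 8, 9, 10, 11, 12, 5, 6, 7]  # unsigned

# Each strong interval is a range of mouse labels, printed as (first, last).
strong = sorted((min(s), max(s)) for s in strong_intervals(mouse, rat))
print(strong)
```

Under these assumptions the sketch reports seven strong intervals with two or more blocks: 1..4, 14..15, 13..16, 8..12, 5..7, 5..16 and the whole chromosome 1..16.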
Strong Intervals: transforming a rat into a mouse. This tree provides guidelines to possible rearrangement scenarios that transform the rat chromosome into a mouse chromosome. These scenarios preserve all common intervals. Figure: tree of strong intervals over the rat order, with nodes 4 3 2 1 13 15 14 16 8 9 10 11 12 5 6 7 (root); 4 3 2 1; 13 15 14 16 8 9 10 11 12 5 6 7; 13 15 14 16; 15 14; 8 9 10 11 12; 5 6 7.
Strong Intervals: transforming a rat into a mouse. Intervals are first labeled (in red) with respect to their relative orientation. Figure: the same tree of strong intervals.
Strong Intervals: transforming a rat into a mouse. Then all strong intervals that disagree with their parent are inverted, one slide per inversion. Each line below gives the inverted interval, followed by the order of the blocks shown on the slide after the inversion (a single block only changes orientation):
Invert 1; order: 4 3 2 1 13 15 14 16 8 9 10 11 12 5 6 7
Invert 4 3 2 1; order: 1 2 3 4 13 15 14 16 8 9 10 11 12 5 6 7
Invert 13; order: 1 2 3 4 13 15 14 16 8 9 10 11 12 5 6 7
Invert 14; order: 1 2 3 4 13 15 14 16 8 9 10 11 12 5 6 7
Invert 16; order: 1 2 3 4 13 15 14 16 8 9 10 11 12 5 6 7
Invert 14 15; order: 1 2 3 4 13 14 15 16 8 9 10 11 12 5 6 7
Invert 13 14 15 16; order: 1 2 3 4 16 15 14 13 8 9 10 11 12 5 6 7
Invert 11; order: 1 2 3 4 16 15 14 13 8 9 10 11 12 5 6 7
Invert 8 9 10 11 12; order: 1 2 3 4 16 15 14 13 12 11 10 9 8 5 6 7
Invert 5 6 7; order: 1 2 3 4 16 15 14 13 12 11 10 9 8 7 6 5
Invert 5 6 7 ... 14 15 16; order: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
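The sequence of inversions shown on these slides can be replayed on the signed rat chromosome. In the sketch below, an inversion reverses a segment and flips the sign of every block in it. The 0-based positions in `steps` were worked out by hand from the interval named on each slide, so they are an assumption of this example rather than data from the talk.

```python
def invert(perm, start, end):
    """Signed inversion: reverse perm[start..end] (inclusive) and flip every sign."""
    segment = [-x for x in reversed(perm[start:end + 1])]
    return perm[:start] + segment + perm[end + 1:]


rat = [-4, -3, -2, 1, -13, -15, 14, -16, 8, 9, 10, -11, 12, 5, 6, 7]

# (start, end) positions of each inverted interval at the time it is inverted,
# in the order of the slides; the comment names the interval on the slide.
steps = [
    (3, 3),    # 1
    (0, 3),    # 4 3 2 1
    (4, 4),    # 13
    (6, 6),    # 14
    (7, 7),    # 16
    (5, 6),    # 14 15
    (4, 7),    # 13 14 15 16
    (11, 11),  # 11
    (8, 12),   # 8 9 10 11 12
    (13, 15),  # 5 6 7
    (4, 15),   # 5 6 7 ... 14 15 16
]

genome = rat
for start, end in steps:
    genome = invert(genome, start, end)
    print(genome)

# The last line printed is the mouse chromosome, 1 2 ... 16, all positive.
```

Every inverted segment is a strong interval or a single block, which is why each intermediate genome keeps all common intervals intact, as the slides state for these scenarios.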