Evaluating the Significance of Max-gap Clusters
Rose Hoberman, David Sankoff, Dannie Durand
original genome
• large-scale duplication or speciation event: gene content and order are preserved
• rearrangement, mutation: similarity in gene content
Local Spatial Evidence of Duplication
[Figure: Chromosome 5 and Chromosome 11 with labeled genes Cryba4, Crybb1/2/3, Lhx5, Tcf2, Tcf1, Cryba1, Nos2, Lhx1, Tbx3/5, Tbx2/4, Nos1]
Ruvinsky and Silver, Gene, 97
Gene Clustering for Functional Inference in Bacterial Genomes
"The Use of Gene Clusters to Infer Functional Coupling", Overbeek et al., PNAS 96: 2896-2901, 1999.
Clusters are Commonly Used in Genomic Analyses
• Need to assess cluster significance
Given:
• a genome: G = 1, …, n unique genes
• a set of m special genes (the "red" genes)
Can we find a significant cluster of (a subset of) the m homologs, i.e. the red genes?
How do we formally define a cluster?
Possible Cluster Parameters
• size: number of red genes in the cluster
  • Example: cluster size ≥ 3 (figure: size = 3 genes)
• length: number of genes between first and last red genes
  • Example: cluster length ≤ 6 (figure: length = 6)
• density: proportion of red genes (size/length)
  • Example: density ≥ 0.5 (figure: density = 6/11)
• compactness: maximum gap between adjacent red genes
  • Example: gap ≤ 4 genes
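As a small illustration of the four parameters, the sketch below computes them for a list of red-gene positions. The function name `cluster_parameters` and the integer position encoding are assumptions made for this example, and length is counted inclusively (first red gene through last red gene), which matches the 6/11 density figure.

```python
def cluster_parameters(red_positions):
    """Return size, length, density and maximum gap for a set of red-gene positions."""
    pos = sorted(red_positions)
    size = len(pos)
    # Genes from the first to the last red gene, both ends included.
    length = pos[-1] - pos[0] + 1
    density = size / length
    # A gap is the number of non-red genes between two adjacent red genes.
    max_gap = max((b - a - 1 for a, b in zip(pos, pos[1:])), default=0)
    return {"size": size, "length": length, "density": density, "max_gap": max_gap}


params = cluster_parameters([3, 4, 7, 8])
```

Here `params` describes a candidate cluster of four red genes spanning six positions, so it can be checked against thresholds such as size ≥ 3 or length ≤ 6.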
Max-Gap Cluster
A set of red genes in which the gap between adjacent red genes is at most g.
• Commonly used in genomic analyses
• Expandable, ensures minimum local and global density
• Efficient algorithm to find them (Bergeron, Corteel, Raffinot 2002)
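For a single genome with a marked set of red genes, maximal max-gap clusters can be found with one pass over the sorted positions. The sketch below is that plain linear scan, not the cited algorithm of Bergeron, Corteel and Raffinot; the name `max_gap_clusters` and the position encoding are assumptions for illustration.

```python
def max_gap_clusters(red_positions, g):
    """Split red-gene positions into maximal runs whose adjacent gaps are all <= g."""
    pos = sorted(red_positions)
    if not pos:
        return []
    clusters = [[pos[0]]]
    for p in pos[1:]:
        # Number of non-red genes between p and the previous red gene.
        gap = p - clusters[-1][-1] - 1
        if gap <= g:
            clusters[-1].append(p)   # extend the current cluster
        else:
            clusters.append([p])     # gap too large: start a new cluster
    return clusters


clusters = max_gap_clusters([1, 2, 5, 9, 10], g=2)
```

With g = 2 the gap of three genes between positions 5 and 9 splits the red genes into two clusters.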
A gap between statistical tests and cluster definitions used in practice
• no analytical statistical model for max-gap clusters
• statistical significance assessed through Monte Carlo randomization
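For reference, a Monte Carlo randomization of the kind mentioned above might look like the following sketch. It assumes a random-gene-order null in which the m red genes occupy a uniformly random subset of the n positions, and it estimates the probability that all m fall in one max-gap cluster. The function names, the default trial count and the fixed seed are choices made for this example.

```python
import random


def has_complete_cluster(red_positions, g):
    """True if every pair of adjacent red genes is separated by at most g other genes."""
    pos = sorted(red_positions)
    return all(b - a - 1 <= g for a, b in zip(pos, pos[1:]))


def monte_carlo_probability(n, m, g, trials=10_000, seed=0):
    """Estimate P(all m red genes form one max-gap cluster) under random gene order."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        reds = rng.sample(range(1, n + 1), m)   # random placement of the red genes
        hits += has_complete_cluster(reds, g)
    return hits / trials


estimate = monte_carlo_probability(n=20, m=2, g=0, trials=2000)
```

The estimate is only as precise as the number of trials allows, which is one reason an analytical test is attractive for very small probabilities.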
This Talk • Analytical statistical tests • reference region model with m genes of interest • complete and incomplete clusters • Relationship between cluster parameters and significance • Future extension to whole-genome comparison
Outline • Introduction • Complete Clusters • Incomplete Clusters • Genome Comparison • Open problems
Complete Clusters
• Given
  • a genome: G = 1, …, n unique genes
  • a set of m special genes (the "red" genes)
  • a maximum-gap size g
• Null hypothesis: random gene order
• Alternate hypotheses: evolutionary history, functional selection
• Test statistic
  • the probability of observing all m genes in a max-gap cluster in G
Probability of a Complete Cluster
Count how many of the permutations of m red genes and n-m blue genes contain a max-gap cluster
• At how many positions in the genome can we locate a cluster (e.g. place the leftmost red gene)?
• Given the location of the first red gene, how many ways are there to place the remaining m-1 red genes so that they form a max-gap cluster?
Maximum cluster length: w = (m-1)g + m
The count combines three terms:
• ways to choose m-1 gaps
• ways to place the first gene and still have w-1 slots left
• edge effects
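One way to read these labels is as the following worked count. This is a reconstruction from the definitions above, assuming n ≥ w, and not the slide's own formula: when the leftmost red gene still has w-1 slots to its right, each of the m-1 gaps can independently take any value from 0 to g.

```latex
\underbrace{(n - w + 1)}_{\text{start positions with } w-1 \text{ slots left}}
\;\times\;
\underbrace{(g + 1)^{m-1}}_{\text{choices of } m-1 \text{ gaps, each in } \{0,\dots,g\}},
\qquad w = (m-1)g + m .
```

For example, n = 500, m = 3 and g = 2 give w = 7 and 494 × 9 = 4446 placements away from the end of the genome. Clusters that start in the last w-1 positions are the edge effects and have to be counted separately.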
Counting clusters at the end of the genome (the last w-1 positions)
For clusters at the end of the genome:
• Length of the cluster is constrained
• Sum of the gaps is constrained: with l positions available, g1 + g2 + … + gm-1 ≤ l-m
How many ways can we form a max-gap cluster of size m with length ≤ l, where l < w?
[Figure: a cluster with gaps g1, g2, g3, …, gm-1]
Number of ways of choosing m-1 gaps between 0 and g so that their sum ≤ l-m
= Number of ways of rolling m-1 dice with faces labeled 0 to g so that the faces total ≤ l-m
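The dice formulation can be counted with a small dynamic program over partial sums. The sketch below is one straightforward way to do it; the function name `count_gap_choices` and its argument names are assumptions for this example.

```python
def count_gap_choices(num_gaps, g, max_sum):
    """Count tuples of num_gaps integers, each in 0..g, whose sum is <= max_sum."""
    if max_sum < 0:
        return 0
    # ways[s] = number of ways the dice rolled so far total exactly s
    ways = [1] + [0] * max_sum
    for _ in range(num_gaps):
        new_ways = [0] * (max_sum + 1)
        for s, count in enumerate(ways):
            if count:
                # Roll one more die, keeping the running total within max_sum.
                for face in range(min(g, max_sum - s) + 1):
                    new_ways[s + face] += count
        ways = new_ways
    return sum(ways)


# Clusters of size m = 3 with gaps of at most g = 1 that fit in length l = 4:
fits = count_gap_choices(num_gaps=2, g=1, max_sum=4 - 3)
```

In the notation above, the number of max-gap clusters of size m and length at most l is `count_gap_choices(m - 1, g, l - m)`.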
Final Probability
• Accounting for edge effects is the least efficient part of the calculation
• Eliminate it for a faster approximation when w << n
Now we can calculate probabilities for various parameter values
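Putting the pieces together, a minimal sketch of the complete-cluster probability is shown below. It assumes the null model places the m red genes on a uniformly random subset of the n positions, and it builds the edge term from the gap-sum count described earlier. All function names are hypothetical, and the brute-force enumerator is included only as a cross-check for small n.

```python
from itertools import combinations
from math import comb


def count_gap_choices(num_gaps, g, max_sum):
    """Count tuples of num_gaps integers, each in 0..g, whose sum is <= max_sum."""
    if max_sum < 0:
        return 0
    ways = [1] + [0] * max_sum
    for _ in range(num_gaps):
        new_ways = [0] * (max_sum + 1)
        for s, count in enumerate(ways):
            for face in range(min(g, max_sum - s) + 1):
                new_ways[s + face] += count
        ways = new_ways
    return sum(ways)


def complete_cluster_probability(n, m, g, include_edges=True):
    """P(all m red genes form one max-gap cluster) under random gene order."""
    w = (m - 1) * g + m                                  # maximum cluster length
    total = max(0, n - w + 1) * (g + 1) ** (m - 1)       # starts with w-1 slots left
    if include_edges:
        # One start position for each available length l < w near the end.
        for l in range(m, min(n, w - 1) + 1):
            total += count_gap_choices(m - 1, g, l - m)
    return total / comb(n, m)


def brute_force_probability(n, m, g):
    """Enumerate every placement of the red genes; only feasible for small n."""
    hits = sum(
        all(b - a - 1 <= g for a, b in zip(c, c[1:]))
        for c in combinations(range(n), m)
    )
    return hits / comb(n, m)


p = complete_cluster_probability(n=500, m=5, g=10)
```

Passing `include_edges=False` gives the faster approximation for w << n, at the cost of slightly underestimating the probability.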
Probability of Observing a Complete Cluster
[Plot: axes labeled g and m; n = 500]
Outline • Motivation • Complete Clusters • Incomplete Clusters • Genome Comparison • Open problems
Incomplete clusters
Given: m genes of interest
• Is it significant to find a max-gap cluster of size h < m?
• Test statistic: probability of finding at least one cluster
Example: m = 5, g = 1, h = 3
Probability of incomplete gene clusters
Enumerating clusters by starting position will lead to overcounting of permutations that include more than one cluster
Example: m = 6, h = 3, g = 1
Alternative Approach • Enumerate all permutations that do not contain any clusters of size h or larger • Dynamic programming • Iteratively place “red” or “blue” genes making sure not to create any cluster of size h or larger by judicious placement of “blue genes”
Example: n = 6, m = 4, h = 3, g = 1
[Figure: tree of dynamic programming states, built up one level per slide. Node labels in extraction order:
• r = 6: k = 4, c = 0, j = 8
• r = 5: k = 4; c = 1, j = 0; c = 0, j = 8; k = 3
• r = 4: k = 3; c = 1, j = 1; c = 1, j = 0; k = 2
• r = 3: c = 2, j = 0; k = 2; one branch marked X]
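The "enumerate permutations with no cluster" idea can be sketched as a memoized recursion that places one gene at a time. The state used here is an assumption modeled loosely on the labels in the example above: r positions remaining, k red genes remaining, c red genes in the currently open cluster, and j blue genes since the last red gene. It is an illustrative reading rather than the authors' exact recurrence.

```python
from functools import lru_cache
from math import comb


def count_without_cluster(n, m, h, g):
    """Count placements of m red genes in n positions with no max-gap cluster of size >= h."""

    @lru_cache(maxsize=None)
    def ways(r, k, c, j):
        if k > r:                      # not enough positions left for the red genes
            return 0
        if r == 0:
            return 1
        # Place a blue gene: the open cluster survives only while the gap stays <= g.
        if c > 0 and j + 1 <= g:
            total = ways(r - 1, k, c, j + 1)
        else:
            total = ways(r - 1, k, 0, 0)
        # Place a red gene, unless that would grow the open cluster to size h.
        if k > 0 and c + 1 < h:
            total += ways(r - 1, k - 1, c + 1, 0)
        return total

    return ways(n, m, 0, 0)


def incomplete_cluster_probability(n, m, h, g):
    """P(at least one max-gap cluster of size >= h) under random gene order."""
    return 1 - count_without_cluster(n, m, h, g) / comb(n, m)


p_example = incomplete_cluster_probability(n=6, m=4, h=3, g=1)
```

For the example parameters only one of the 15 placements (two red genes, two blue genes, two red genes) avoids a cluster of size 3 or more, so `p_example` is 14/15. The recursion depth grows with n, so an iterative table would be a better fit for genomes with hundreds of genes.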
Incomplete cluster significance
[Plots: axis labels h, g and m; parameter labels n = 1000, m = 50 and n = 500]
Significant region of parameter space shown in paper
Outline • Motivation • Complete Clusters • Incomplete Clusters • Extensions • Genome Comparison
Possible Extensions • Tandem duplications • Gene families • Extensions for prokaryotic genomes • Gene orientation • Circular genomes • Physical distance between genes • Whole genome comparison
Whole genome comparison
[Figure: two genomes, each labeled "g 30"]
• Assuming identical gene content …
• What is the probability that at least k genes form a max-gap cluster in both genomes?
An Odd Property
The probability of finding a max-gap cluster of size at least k is always one
• There will always be a cluster of size n
• A cluster of size k does not necessarily contain a cluster of size k-1!
Example: g = 1
Conclusions Presented statistical tests for max-gap clusters • Evaluate the significance of clusters of a pre-specified set of genes • Choose parameters effectively • Understand trends