
Evaluating the Significance of Max-gap Clusters




Presentation Transcript


  1. Evaluating the Significance of Max-gap Clusters Rose Hoberman David Sankoff Dannie Durand

  2. original genome • large scale duplication or speciation event • rearrangement, mutation • Gene content and order are preserved • Similarity in gene content

  3. Local Spatial Evidence of Duplication (Ruvinsky and Silver, Gene, 97) [Figure: maps of Chromosome 5 and Chromosome 11 showing Cryba4, Crybb1/2/3, Lhx5, Tcf2, Tcf1, Cryba1, Nos2, Lhx1, Tbx3/5, Tbx2/4, Nos1]

  4. Gene Clustering for Functional Inference in Bacterial Genomes The Use of Gene Clusters to Infer Functional Coupling, Overbeek et al., PNAS 96: 2896-2901, 1999.

  5. Clusters are Commonly Used in Genomic Analyses

  6. Clusters are Commonly Used in Genomic Analyses Need to assess cluster significance

  7. Given: • a genome: G = 1, …, n unique genes • a set of m special genes • Can we find a significant cluster of (a subset of) the m homologs?

  8. Given: • a genome: G = 1, …, n unique genes • a set of m special genes • Can we find a significant cluster of the red genes? • How do we formally define a cluster?

  9. Possible Cluster Parameters (figure: size = 3 genes) • size: number of red genes in the cluster • Example: cluster size ≥ 3

  10. Possible Cluster Parameters (figure: length = 6) • size: number of red genes in the cluster • Example: cluster size ≥ 3 • length: number of genes between first and last red genes • Example: cluster length ≤ 6

  11. Possible Cluster Parameters (figure: length = 6) • size: number of red genes in the cluster • Example: cluster size ≥ 3 • length: number of genes between first and last red genes • Example: cluster length ≤ 6

  12. Possible Cluster Parameters (figure: density = 6/11) • size: number of red genes in the cluster • Example: cluster size ≥ 3 • length: number of genes between first and last red genes • Example: cluster length ≤ 6 • density: proportion of red genes (size/length) • Example: density ≥ 0.5

  13. Possible Cluster Parameters (figure: density = 6/11) • size: number of red genes in the cluster • Example: cluster size ≥ 3 • length: number of genes between first and last red genes • Example: cluster length ≤ 6 • density: proportion of red genes (size/length) • Example: density ≥ 0.5

  14. Possible Cluster Parameters (figure: gap ≤ 4 genes) • size: number of red genes in the cluster • Example: cluster size ≥ 3 • length: number of genes between first and last red genes • Example: cluster length ≤ 6 • density: proportion of red genes (size/length) • compactness: maximum gap between adjacent red genes

  15. Max-Gap Cluster gap g • Commonly used in genomic analyses • Expandable, ensures minimum local and global density • Efficient algorithm to find them (Bergeron, Corteel, Raffinot 2002)

  16. Max-Gap Clusters are Commonly Used in Genomic Analyses

  17. Statistical models

  18. A gap between statistical tests and cluster definitions used in practice • no analytical statistical model for max-gap clusters • statistical significance assessed through Monte Carlo randomization
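For contrast with an analytical test, a Monte Carlo randomization estimate could look like the sketch below. It assumes the null hypothesis of random gene order is modelled by drawing m red positions uniformly without replacement from n slots; the function names, the trial count, and the seed are all hypothetical choices.

```python
import random


def is_complete_cluster(positions, g):
    """True if all red genes form one max-gap cluster (every adjacent gap <= g)."""
    ordered = sorted(positions)
    return all(b - a - 1 <= g for a, b in zip(ordered, ordered[1:]))


def monte_carlo_p_value(n, m, g, trials=10000, seed=0):
    """Estimate P(all m red genes form a max-gap cluster) under random gene order."""
    rng = random.Random(seed)
    hits = sum(
        is_complete_cluster(rng.sample(range(n), m), g) for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    print(monte_carlo_p_value(n=500, m=5, g=10))
```

With this approach, very small probabilities need a very large number of trials to estimate reliably, which is one practical reason to prefer an exact count.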

  19. A gap between statistical tests and cluster definitions used in practice

  20. This Talk • Analytical statistical tests • reference region model with m genes of interest • complete and incomplete clusters • Relationship between cluster parameters and significance • Future extension to whole-genome comparison

  21. Outline • Introduction • Complete Clusters • Incomplete Clusters • Genome Comparison • Open problems

  22. Complete Clusters • Given • a genome: G = 1, …, n unique genes • a set of m special genes (the “red” genes) • a maximum-gap size g • Null hypothesis: random gene order • Alternate hypotheses: evolutionary history, functional selection • Test statistic • the probability of observing all m genes in a max-gap cluster in G

  23. Probability of a Complete Cluster • Count how many of the permutations of m red genes and n-m blue genes contain a max-gap cluster • At how many positions in the genome can we locate a cluster (e.g. place the leftmost red gene)? • Given the location of the first red gene, how many ways are there to place the remaining m-1 red genes so that they form a max-gap cluster?

  24. w = (m-1)g + m • ways to choose m-1 gaps • ways to place the first gene and still have w-1 slots left • edge effects
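One way to read the labels on this slide, assuming positions are numbered 1 to n, that n ≥ w, and that each of the m-1 gaps independently takes a value from 0 to g, is the following. The two counts are derived here from the definitions in the transcript rather than copied from the slide.

```latex
\begin{align*}
w &= (m-1)\,g + m
  && \text{maximum length of a max-gap cluster of } m \text{ genes} \\
(g+1)^{m-1} &
  && \text{ways to choose } m-1 \text{ gaps, each between } 0 \text{ and } g \\
n - w + 1 &
  && \text{ways to place the first gene and still have } w-1 \text{ slots left}
\end{align*}
```

Under these assumptions, clusters whose first gene lies far enough from the end of the genome contribute $(n-w+1)(g+1)^{m-1}$ arrangements, and starting positions closer to the end require the separate edge-effect count.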

  25. Counting clusters at the end of the genome w-1 For clusters at the end of the genome: • Length of the cluster is constrained • Sum of the gaps is constrained:

  26. How many ways can we form a max-gap cluster of size m with length ≤ l? (figure labels: gaps g1, g2, g3, …, gm-1; l < w)

  27. How many ways can we form a max-gap cluster of size m with length ≤ l? (figure labels: gaps g1, g2, g3, …, gm-1; l < w) • Number of ways of choosing m-1 gaps between 0 and g so sum ≤ l-m = Number of ways of rolling m-1 dice with faces labeled 0 to g so faces total ≤ l-m

  28. How many ways can we form a max-gap cluster of size m with length ≤ l? (figure labels: gaps g1, g2, g3, …, gm-1) • Number of ways of choosing m-1 gaps between 0 and g so sum ≤ l-m = Number of ways of rolling m-1 dice with faces labeled 0 to g so faces total ≤ l-m
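The dice formulation can be counted with a small dynamic program. This is a sketch under the stated reading of the slide (m-1 dice, faces 0 to g, total at most l-m); the function name `count_gap_choices` and its parameters are hypothetical.

```python
def count_gap_choices(num_gaps, g, max_total):
    """Count ways to roll num_gaps dice with faces 0..g so the faces total <= max_total."""
    if max_total < 0:
        return 0
    # Totals above num_gaps * g are unreachable, so cap the table size.
    max_total = min(max_total, num_gaps * g)
    # ways[t] = number of ways to reach total t with the dice rolled so far.
    ways = [1] + [0] * max_total
    for _ in range(num_gaps):
        new_ways = [0] * (max_total + 1)
        for total, count in enumerate(ways):
            if count:
                for face in range(min(g, max_total - total) + 1):
                    new_ways[total + face] += count
        ways = new_ways
    return sum(ways)


if __name__ == "__main__":
    # m = 3 red genes (two gaps), g = 1, cluster length l = 4, so total <= l - m = 1.
    print(count_gap_choices(num_gaps=2, g=1, max_total=1))
```

When the allowed total is at least (m-1)g, the constraint is inactive and the count reduces to (g+1)^(m-1).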

  29. Final Probability • Accounting for edge effects is the least efficient part of the calculation • Eliminate them for a faster approximation when w << n • Now we can calculate probabilities for various parameter values
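The pieces above can be combined into an exact probability, sketched here under the assumption that every choice of m red positions out of n is equally likely. For each position of the first red gene, the code counts gap choices that fit in the remaining room, which covers interior and edge positions uniformly. All function names are hypothetical, and a brute-force enumeration is included only as a cross-check for small inputs.

```python
from itertools import combinations
from math import comb


def count_gap_choices(num_gaps, g, max_total):
    """Ways to choose num_gaps gaps in 0..g whose sum is <= max_total."""
    if max_total < 0:
        return 0
    max_total = min(max_total, num_gaps * g)
    ways = [1] + [0] * max_total
    for _ in range(num_gaps):
        new_ways = [0] * (max_total + 1)
        for total, count in enumerate(ways):
            if count:
                for face in range(min(g, max_total - total) + 1):
                    new_ways[total + face] += count
        ways = new_ways
    return sum(ways)


def complete_cluster_probability(n, m, g):
    """P(all m red genes form one max-gap cluster) under random gene order."""
    arrangements = 0
    for first in range(1, n + 1):
        room = n - first + 1  # slots from the first red gene to the genome end
        # The m red genes use m slots; the gaps must fit in what is left.
        arrangements += count_gap_choices(m - 1, g, room - m)
    return arrangements / comb(n, m)


def brute_force_probability(n, m, g):
    """Cross-check by enumerating every placement (small n only)."""
    hits = sum(
        all(b - a - 1 <= g for a, b in zip(c, c[1:]))
        for c in combinations(range(n), m)
    )
    return hits / comb(n, m)


if __name__ == "__main__":
    print(complete_cluster_probability(6, 2, 1), brute_force_probability(6, 2, 1))
```

For starting positions with at least w = (m-1)g + m slots of room, the inner count is simply (g+1)^(m-1), so dropping the remaining positions gives the faster approximation for w << n that the slide describes.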

  30. Probability of Observing a Complete Cluster (plot labels: g, m; n = 500)

  31. Significant Parameter Values (α = 0.0001), n = 500

  32. Significant Parameter Values (α = 0.0001), n = 500

  33. Outline • Motivation • Complete Clusters • Incomplete Clusters • Genome Comparison • Open problems

  34. Incomplete clusters • Given: m genes of interest • Is it significant to find a max-gap cluster of size h < m? • Test statistic: probability of finding at least one cluster (figure: m = 5, g = 1, h = 3)

  35. Probability of incomplete gene clusters • Enumerating clusters by starting position will lead to overcounting of permutations that include more than one cluster (figure: m = 6, h = 3, g = 1)

  36. Alternative Approach • Enumerate all permutations that do not contain any clusters of size h or larger • Dynamic programming • Iteratively place “red” or “blue” genes, making sure not to create any cluster of size h or larger by judicious placement of “blue” genes
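A minimal sketch of this complement-counting idea, assuming the state tracks r (slots left), k (red genes left), c (size of the currently open cluster), and j (blue genes since the last red gene). These meanings are an interpretation of the variable names in the example slides, and the function names are hypothetical; this is not presented as the authors' exact recurrence.

```python
from functools import lru_cache
from itertools import combinations
from math import comb


def prob_incomplete_cluster(n, m, h, g):
    """P(at least one max-gap cluster of size >= h) for m red genes in n slots."""

    @lru_cache(maxsize=None)
    def no_cluster(r, k, c, j):
        # Counts ways to fill r slots with k red genes without any cluster reaching size h.
        if k > r:
            return 0
        if r == 0:
            return 1
        # Place a blue gene; once the gap exceeds g the open cluster is closed (c = 0).
        if j + 1 > g:
            total = no_cluster(r - 1, k, 0, g + 1)
        else:
            total = no_cluster(r - 1, k, c, j + 1)
        # Place a red gene only if the cluster stays smaller than h.
        if k > 0 and c + 1 < h:
            total += no_cluster(r - 1, k - 1, c + 1, 0)
        return total

    return 1 - no_cluster(n, m, 0, g + 1) / comb(n, m)


def brute_force_incomplete(n, m, h, g):
    """Cross-check by enumerating every placement (small n only)."""
    hits = 0
    for placement in combinations(range(n), m):
        best = run = 1
        for a, b in zip(placement, placement[1:]):
            run = run + 1 if b - a - 1 <= g else 1
            best = max(best, run)
        hits += best >= h
    return hits / comb(n, m)


if __name__ == "__main__":
    # Parameters from the example slides: n = 6, m = 4, h = 3, g = 1.
    print(prob_incomplete_cluster(6, 4, 3, 1))
```

Counting the cluster-free arrangements and subtracting from the total sidesteps the overcounting that arises when arrangements containing several clusters are enumerated by starting position.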

  37. n = 6, m = 4, h = 3, g=1 k = 4 c = 0,j = 8 r = 6

  38. n = 6, m = 4, h = 3, g=1 k = 4 c = 0,j = 8 r = 6 k = 4 c = 1,j = 0 c = 0,j = 8 k = 3 r = 5 r = 5

  39. n = 6, m = 4, h = 3, g=1 k = 4 c = 0,j = 8 r = 6 k = 4 c = 1,j = 0 c = 0,j = 8 k = 3 r = 5 r = 5 k = 3 c = 1,j = 1 c = 1,j = 0 k = 2 r = 4 r = 4

  40. n = 6, m = 4, h = 3, g=1 k = 4 c = 0,j = 8 r = 6 k = 4 c = 1,j = 0 c = 0,j = 8 k = 3 r = 5 r = 5 k = 3 c = 1,j = 1 c = 1,j = 0 k = 2 r = 4 r = 4 c = 2,j = 0 k = 2 r = 3 X

  41. Incomplete cluster significance (plot labels: h, g, m; n = 1000; m = 50; n = 500) • Significant region of parameter space shown in paper

  42. Outline • Motivation • Complete Clusters • Incomplete Clusters • Extensions • Genome Comparison

  43. Possible Extensions • Tandem duplications • Gene families • Extensions for prokaryotic genomes • Gene orientation • Circular genomes • Physical distance between genes • Whole genome comparison

  44. Whole genome comparison (figure labels: g 30, g 30) • Assuming identical gene content … • What is the probability that at least k genes form a max-gap cluster in both genomes?

  45. An Odd Property • The probability of finding a max-gap cluster of size at least k is always one • Example: g = 1

  46. An Odd Property • The probability of finding a max-gap cluster of size at least k is always one • Example: g = 1

  47. An Odd Property • The probability of finding a max-gap cluster of size at least k is always one • Example: g = 1 (figure labels: gap, gap, gap) • A cluster of size k does not necessarily contain a cluster of size k-1!

  48. An Odd Property • The probability of finding a max-gap cluster of size at least k is always one • There will always be a cluster of size n • Example: g = 1

  49. Conclusions Presented statistical tests for max-gap clusters • Evaluate the significance of clusters of a pre-specified set of genes • Choose parameters effectively • Understand trends

  50. Thank You
