Locating conserved genes in whole genome scale. Prudence Wong University of Liverpool June 2005 joint work with HL Chan, TW Lam, HF Ting, SM Yiu (HKU), WK Sung (NUS). Outline. Motivation Challenges of Whole Genome Alignment Four approaches and their performance
Locating conserved genes in whole genome scale • Prudence Wong, University of Liverpool • June 2005 • joint work with HL Chan, TW Lam, HF Ting, SM Yiu (HKU), WK Sung (NUS)
Outline • Motivation • Challenges of Whole Genome Alignment • Four approaches and their performance • Longest Common Subsequence • Clustering Approach • Mutation Sensitive Selection • Hybrid Approach • Remarks
Mouse & Human • Mouse and human are genetically very similar • Do they look the same? • What do we mean by similar? • Many genes found in human are also found in mouse: these are conserved genes • Figure: Mouse Chromosome 16 (m16) and Human Chromosome 03 (h03)
Whole Genome Alignment • Genome A: Gene X, Gene Y, Gene Z • Genome B: Gene X, Gene Z, Gene Y (possibly a mutation) • Identify regions on the genomes that possibly contain their conserved genes • Differences in the ordering of conserved genes could be related to mutations • For related species, the number of mutations is usually small
Outline • Motivation • Challenges of Whole Genome Alignment • Four approaches and their performance • Longest Common Subsequence • Clustering Approach • Mutation Sensitive Selection • Hybrid Approach • Remarks
Data size • Usually very large (e.g., human chromosomes vs mouse chromosomes) Cannot use global alignment tools because of the large size
Observations • A conserved gene may not be identical in the two genomes; nevertheless, some common substrings are unique to this conserved gene (each is called a MUM, a maximal unique match) • Locate all MUMs over the two genomes; not every MUM corresponds to a conserved gene • Figure labels: Gene X, Gene Y, Gene Y, Gene X, Noise
Number of MUMs • The number of MUMs is small compared with the chromosome length
MUMs for M16-H03 • Plot of MUMs and conserved genes; axes: Mouse Chromosome 16, Human Chromosome 03
How to choose the right MUMs? Generation of MUM using suffix tree
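To make the notion of a MUM concrete, here is a minimal sketch that finds MUMs by brute force. It is not the suffix-tree construction mentioned on the slide: it is a naive quadratic scan, practical only for toy inputs. The names `naive_mums`, `count_occurrences` and the `min_len` threshold are illustrative assumptions, not taken from the talk.

```python
def count_occurrences(text, pattern):
    """Count occurrences of pattern in text, including overlapping ones."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def naive_mums(a, b, min_len=3):
    """Return (pos_a, pos_b, length) for every maximal unique match.

    A match is kept if it cannot be extended to the left or right and
    its substring occurs exactly once in a and exactly once in b.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip matches that could be extended to the left.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            k = 0
            while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                k += 1
            if k < min_len:
                continue
            s = a[i:i + k]
            if count_occurrences(a, s) == 1 and count_occurrences(b, s) == 1:
                mums.append((i, j, k))
    return mums


print(naive_mums("GGACGTCC", "TTACGTAA"))  # [(2, 2, 4)]
```

On real chromosomes a suffix tree (or a similar index) is needed to avoid the quadratic scan.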
Outline • Motivation • Challenges of Whole Genome Alignment • Four approaches and their performance • Longest Common Subsequence • Clustering Approach • Mutation Sensitive Selection • Hybrid Approach • Remarks
MUM Selection • MUMmer-1 [Delcher et al. Nucleic Acids Research 1999] • longest common subsequences (effectively assume no mutations) • MUMmer-2 [Delcher et al. Nucleic Acids Research 2002] & MUMmer-3 [Kurtz et al. Genome Biology 2004] • clustering heuristics • most popular tool to uncover conserved genes in WG scale • MaxMinCluster [Wong et al. Bioinformatics 2004*] • clustering, optimization • MSS (Mutation Sensitive Selection) [Chan et al. Bioinformatics 2005*] • capture mutations • Hybrid approach [Chan et al. Bioinformatics 2005*] • combine mutation sensitive and clustering approaches • * our results
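The longest-common-subsequence style of selection can be sketched as picking the heaviest chain of MUMs whose positions increase in both genomes. This is a simplified illustration, not MUMmer-1's actual code: the tuple layout `(pos_a, pos_b, length)`, the use of MUM length as weight, and the function name `heaviest_increasing_chain` are assumptions.

```python
def heaviest_increasing_chain(mums):
    """Select a maximum-weight chain of MUMs that keeps the same order
    in both genomes.

    mums: list of (pos_a, pos_b, length); weight is assumed to be length.
    Plain O(n^2) dynamic programming.
    """
    mums = sorted(mums)
    n = len(mums)
    if n == 0:
        return []
    best = [m[2] for m in mums]   # best[i]: weight of best chain ending at i
    prev = [-1] * n               # back-pointers for reconstruction
    for i in range(n):
        for j in range(i):
            in_order = mums[j][0] < mums[i][0] and mums[j][1] < mums[i][1]
            if in_order and best[j] + mums[i][2] > best[i]:
                best[i] = best[j] + mums[i][2]
                prev[i] = j
    end = max(range(n), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(mums[end])
        end = prev[end]
    return chain[::-1]


mums = [(0, 0, 10), (20, 50, 30), (60, 20, 25), (100, 80, 15)]
print(heaviest_increasing_chain(mums))
# [(0, 0, 10), (20, 50, 30), (100, 80, 15)]
```

In this toy input the MUM at `(60, 20, 25)` is out of order with respect to `(20, 50, 30)`, as it would be after a transposition, so it is dropped. This is the sense in which the approach effectively assumes no mutations.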
Overview of Results • Average coverage (sensitivity), in % • coverage: % of published conserved genes reported • sensitivity: % of MUMs reported that reside in published conserved genes • MSS outperforms MaxMinCluster and MUMmer-3 on closely related species • BUT MSS performs worse on species relatively farther apart • Both hybrid approaches perform well for species farther apart
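The two measures can be computed along the following lines. This is only a sketch under stated assumptions: positions are taken on a single genome, a gene counts as "reported" when at least one reported MUM lies entirely inside it, and the function name and data layout are hypothetical. The talk does not spell out these details.

```python
def coverage_and_sensitivity(genes, reported_mums):
    """Return (coverage %, sensitivity %) as defined on the slide.

    genes: list of (start, end) intervals, end exclusive (published genes).
    reported_mums: list of (start, length) on the same genome.
    Assumption: a gene is 'reported' if some reported MUM lies inside it.
    """
    def inside(mum, gene):
        start, length = mum
        return gene[0] <= start and start + length <= gene[1]

    covered = sum(1 for g in genes if any(inside(m, g) for m in reported_mums))
    in_genes = sum(1 for m in reported_mums if any(inside(m, g) for g in genes))
    coverage = 100.0 * covered / len(genes) if genes else 0.0
    sensitivity = 100.0 * in_genes / len(reported_mums) if reported_mums else 0.0
    return coverage, sensitivity


genes = [(0, 100), (200, 300), (400, 500), (600, 700)]
mums = [(10, 20), (50, 30), (210, 40), (350, 20)]
print(coverage_and_sensitivity(genes, mums))  # (50.0, 75.0)
```

Here two of the four genes contain a reported MUM (coverage 50%) and three of the four reported MUMs fall inside a gene (sensitivity 75%).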
Outline • Motivation • Challenges of Whole Genome Alignment • Four approaches and their performance • Longest Common Subsequence • Clustering Approach • Mutation Sensitive Selection • Hybrid Approach • Remarks
LCS approach (MUMmer-1) does not take mutations into account • MUMmer-2 & -3 cluster by heuristic • MaxMinCluster formalizes clustering as a combinatorial optimization problem
Clustering approach • Observations • Noise MUMs are usually short and isolated • A conserved gene usually contains a sequence of MUMs that are close and have sufficient length => clusters Gene X Gene Y Gene Y Gene X Noise
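The observation above suggests a simple filter: group MUMs that are close together and discard groups that are too short. The sketch below is a greedy illustration of that idea only. It is neither the MUMmer heuristic nor MaxMinCluster, and it does not handle noisy clusters. The function name and tuple layout are assumptions; the default values echo the Gap and MinSize example used on the noisy-cluster slides.

```python
def simple_clusters(mums, gap=100, min_size=40):
    """Greedily group MUMs into clusters and drop short clusters.

    mums: list of (pos_a, pos_b, length).
    Two consecutive MUMs (sorted by pos_a) join the same cluster when the
    distance from the end of the first to the start of the second is
    between 0 and gap in both genomes. A cluster is kept when its total
    MUM length is at least min_size.
    """
    clusters, current = [], []
    for mum in sorted(mums):
        if current:
            pa, pb, pl = current[-1]
            close_a = 0 <= mum[0] - (pa + pl) <= gap
            close_b = 0 <= mum[1] - (pb + pl) <= gap
            if not (close_a and close_b):
                clusters.append(current)
                current = []
        current.append(mum)
    if current:
        clusters.append(current)
    return [c for c in clusters if sum(m[2] for m in c) >= min_size]


mums = [(0, 1000, 15), (30, 1030, 15), (60, 1070, 20),
        (5000, 200, 10),                       # short, isolated: noise
        (9000, 4000, 25), (9040, 4050, 30)]
for cluster in simple_clusters(mums):
    print(cluster)
```

The isolated MUM of length 10 forms a cluster of its own that falls below `min_size`, so it is filtered out, while the two groups of nearby MUMs survive.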
Challenge • Challenge: some conserved genes do not induce clusters of sufficient length • Solution: relax the definition of clusters to allow the presence of noise
Noisy cluster • Suppose Gap = 100, MinSize = 40 • Figure labels: > 100 apart; length = 20; a 1-noisy cluster
Noisy cluster • Suppose Gap = 100, MinSize = 40 • Figure labels: > 100 apart; length = 20; a 2-noisy cluster
MaxMinCluster • Problem formulation • find a collection of k-noisy clusters such that the smallest cluster has the maximum weight • Dynamic programming: O(k²n²) time, O(k²n) space
Outline • Motivation • Challenges of Whole Genome Alignment • Four approaches and their performance • Longest Common Subsequence • Clustering Approach • Mutation Sensitive Selection (captures mutations more directly) • Hybrid Approach • Remarks
Mutation Sensitive Selection • select subsets of MUMs that are transformed by a few mutations • three types of mutations: reversal, transposition, reversed-transposition
k-mutated subsequences • Given two sequences A & B and an integer k, a subsequence X of A and a subsequence Y of B form a pair of k-mutated subsequences if X can be transformed to Y by at most k mutations • Figure: a pair of 2-mutated subsequences (a reversal and a transposition) • MUMs are signed; a reversal flips the sign of the MUMs
Mutation Sensitive Selection • Problem formulation: find a pair of k-mutated subsequences with maximum weight • We believe that the problem is NP-hard • The Genome Rearrangement Problem, believed to be NP-hard, can be reduced to this problem • We give an efficient approximation algorithm • the resulting weight is at least 1/(3k+1) times the maximum possible weight • O(n² log n + kn²) time, O(n²) space
Outline • Motivation • Challenges of Whole Genome Alignment • Four approaches and their performance • Longest Common Subsequence • Clustering Approach • Mutation Sensitive Selection • Hybrid Approach • Remarks
Hybrid Approach • first apply a clustering approach to identify clusters that are obviously conserved genes • either MUMmer-3 or MaxMinCluster can be applied • these clusters are treated as MUMs with bigger weight • then apply MSS to process these MUMs together with the remaining MUMs
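The first half of the hybrid pipeline, turning each cluster into a single heavier MUM, might look like the sketch below. The slide only says the clusters get a "bigger weight"; using the sum of the member lengths as that weight is an assumption made here for illustration, as are the function name and the `(pos_a, pos_b, weight)` layout. The MSS step that would consume the result is not implemented.

```python
def collapse_clusters(clusters, remaining_mums):
    """Replace each cluster by one weighted MUM and merge with the rest.

    clusters: list of clusters, each a list of (pos_a, pos_b, length).
    remaining_mums: MUMs not placed in any cluster, same tuple layout.
    Returns a sorted list of (pos_a, pos_b, weight).
    Assumption: a cluster's weight is the sum of its MUM lengths.
    """
    merged = []
    for cluster in clusters:
        start_a = min(m[0] for m in cluster)
        start_b = min(m[1] for m in cluster)
        weight = sum(m[2] for m in cluster)
        merged.append((start_a, start_b, weight))
    return sorted(merged + list(remaining_mums))


clusters = [[(0, 1000, 15), (30, 1030, 15), (60, 1070, 20)]]
remaining = [(5000, 200, 10)]
print(collapse_clusters(clusters, remaining))
# [(0, 1000, 50), (5000, 200, 10)]
```

The combined list would then be handed to the mutation sensitive selection step, where the heavier cluster entries are less likely to be discarded than isolated MUMs.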
Outline • Motivation • Challenges of Whole Genome Alignment • Four approaches and their performance • Longest Common Subsequence • Clustering Approach • Mutation Sensitive Selection • Hybrid Approach • Remarks
Remarks • Experiments show that • MaxMinCluster > LCS • MSS > MaxMinCluster for closely related species • MSS does not perform well for species relatively farther apart • Hybrid approach is the best for both closely related and farther apart species
Thank you! Q & A
Approximation Algorithm • Super-Backbone • a maximum-weight common subsequence • Identify k mutation blocks • having high weight • not overlapping with the Super-Backbone too much • this is formulated as a sub-problem and solved optimally by dynamic programming • Report Super-Backbone & k mutation blocks • O(n² log n + kn²) time, O(n²) space
Mutations • three types of mutations: reversal, transposition, reversed-transposition • original: a b c d e f g h i j k l m n o p q r s t u v w x y z • after a reversal: a d c b e f g h i j k l m n o p q r s t u v w x y z • after a transposition: a d c b e k l m n o p q r s t u v w x y f g h i j z • after a reversed-transposition: a d c b e k l t s r q p o m n u v w x y f g h i j z
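The alphabet example on this slide can be reproduced with two small helpers. This is a sketch on unsigned letters: the sign flip that a reversal applies to signed MUMs is left out, and the function names and index conventions (0-based, end exclusive) are assumptions.

```python
def reversal(seq, i, j):
    """Reverse the block seq[i:j] in place."""
    return seq[:i] + seq[i:j][::-1] + seq[j:]


def transposition(seq, i, j, dest, reverse=False):
    """Cut the block seq[i:j] and re-insert it at index dest of the
    remaining sequence. With reverse=True the block is also reversed,
    which gives a reversed-transposition."""
    block = seq[i:j]
    rest = seq[:i] + seq[j:]
    if reverse:
        block = block[::-1]
    return rest[:dest] + block + rest[dest:]


s0 = "abcdefghijklmnopqrstuvwxyz"
s1 = reversal(s0, 1, 4)                          # reverse "bcd"
s2 = transposition(s1, 5, 10, 20)                # move "fghij" after "y"
s3 = transposition(s2, 9, 15, 7, reverse=True)   # reverse "opqrst", move before "m"

print(s1)  # adcbefghijklmnopqrstuvwxyz
print(s2)  # adcbeklmnopqrstuvwxyfghijz
print(s3)  # adcbekltsrqpomnuvwxyfghijz
```

Each printed string matches the corresponding row of the slide with the spaces removed.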