Detection of Spaced Motifs using Submotif Pattern Mining. S.M. Yiu, Department of Computer Science, The University of Hong Kong. Joint work with Ken Sung (Genome Institute of Singapore and NUS), Edward Wijaya, and Rajaraman Kanagasabai (Institute for Infocomm Research).
Outline • Introduction to Bioinformatics Research & Motif Finding Problem • Spaced Motifs and Our contributions • Some Technical Background – String representation of Motifs • The Model for Spaced Motif • Gaps in Motifs • Submotif notion • SPACE • Algorithm for Spaced Motifs • Experimental Results & Conclusions • Other Related Projects
Bioinformatics Research Goal: To help the bio-medical research community discover biologically meaningful knowledge. How CS helps: By providing computationally efficient tools for analysing huge amounts of data and/or solving computationally intensive problems. May make use of techniques from Algorithms, Databases, AI, etc.
Typical Process of Bioinformatics Research An example siRNA can be used to de-activate genes causing cancers [How to select good candidates is not easy!] Biological Scenario Math/computational model Formulate criteria for forming siRNA molecules Develop efficient & effective algorithms(solution) Develop a computational tool for selecting siRNA candidates Verify & Test the solution using real data
Typical Process of Bioinformatics Research • Two issues: • Difficult to derive a correct model [even the bio-medical researchers may not know the exact criteria]; • Complexity of the problem and huge amount of data Biological Scenario Math/computational model Develop efficient & effective algorithms(solution) Verify & Test the solution using real data
The motif finding problem - a fundamental and critical problem in bio-medical research • To activate a gene, a corresponding short sequence in our DNA (called the binding site) must first be bound by a particular transcription factor; the gene is then activated to produce a protein. Transcription Factor A …..AGCTAAACCACGTGGCATGGGACGTATGCCCAGTA….. Gene A Binding site • A DNA sequence is considered as a string over {A, C, G, T}.
AGCTAAACCACGTGGCATGG AGCCACGCGCGTGGCATGG AGCTAAACCACGTGGCATGTGC ACCCGTGCCACGTGGCATGG • The binding sites that can interact with the same transcription factor are similar in pattern (but usually not identical). The general pattern of a set of related binding sites is referred to as the motif. Given: A set of sequences that possibly contain binding sites of the same transcription factor. What we know: the binding sites should occur abundantly (more often than expected by chance) inside these sequences. The Motif Finding Problem: to identify a set of similar patterns that occur frequently in these sequences and predict the motif. …GCACGCGGTATCGTTAGCTTGACAATGAAGACCCCCGCTCGACAGGAAT… …GCATACTTTGACACTGACTTCGCTTCTTTAATGTTTAATGAAACATGCG… …CCCTCTGGAAATTAGTGCGGTCTCACAACCCGAGGAATGACCAAAGTTG… …GTATTGAAAGTAAGTAGATGGTGATCCCCATGACACCAAAGATGCTAAG… ………………………………………………………………………………………………………………………………………
Importance of this motif finding problem • Solving this problem provides critical information for the study of genetic diseases and drug design. • Motif finding becomes a daily task for bio-medical researchers. Within the last few years, more than 30 new software tools were developed and more than 100 research papers were published!
Complexity of the motif finding problem: some versions have been proved to be NP-hard. • Practical concern: Dataset with noise • ……… Issues • Representations of Motif • Different representations can model different motifs (binding sites) and have different descriptive power. • Scoring function • To formalize the concept of “occurred abundantly” (not occur by random).
Traditional motif models “Assume that a motif is a contiguous substring in the sequence” Still a lot of known binding sites not being captured by these models Additional Information: e.g. Negative Set – sequences known to be without the binding sites
The Spaced Motif • Our contributions: • Traditional models assume no gaps in a motif, or allow at most one gap. We derive a new model that allows multiple gaps of different lengths in a motif (spaced motifs). • We formulate the spaced motif finding problem as a frequent submotif mining problem (a database technique). • We developed an effective algorithm (SPACE) to solve the problem. • Experimental results verified that the new model represents real motifs better than existing models.
Two common representations for motif: 1. String Representation A motif is modeled as a length-L string - two variations: mismatch model and wildcard model. An (L, d)-mismatch motif M is a length-L string such that any length-L string with at most d mismatches from M is considered a binding site. Note: the mismatches can occur at any position! Example: Motif: TTGACA Binding sites (each with at most 1 mismatch): TTGACA TCGACA TTGACA TTGAAA ATGACA TTGACA GTGACA TTGACT TTGACC TTGACA This is a (6, 1)-mismatch motif!
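As a minimal sketch of the mismatch model, the check below counts mismatches (Hamming distance) between a motif and a candidate string. The function names `hamming` and `is_mismatch_site` are illustrative only and do not come from the slides.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_mismatch_site(motif: str, candidate: str, d: int) -> bool:
    """True if candidate is a binding site of the (L, d)-mismatch motif."""
    return len(candidate) == len(motif) and hamming(motif, candidate) <= d


# The ten binding sites listed on the slide for the motif TTGACA.
sites = ["TTGACA", "TCGACA", "TTGACA", "TTGAAA", "ATGACA",
         "TTGACA", "GTGACA", "TTGACT", "TTGACC", "TTGACA"]
all_match = all(is_mismatch_site("TTGACA", s, 1) for s in sites)
print(all_match)
```

Every listed site differs from TTGACA in at most one position, which is why the slide calls it a (6, 1)-mismatch motif.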
1. String Representation A motif is modeled as a length-L string - two variations: mismatch model and wildcard model. An (L, d)-wildcard motif M is a length-L string with at most d wildcard positions (written *); a length-L string is a binding site if it agrees with M at every non-wildcard position. Example: TTGACA TCGATA TTGATA TGGAAA TTGAGA TCGACA TGGAGA TTGACA TAGATA TCGAGA In this case, the binding sites are better modeled by a (6, 2)-wildcard motif: T*GA*A 2. Positional Weight Matrix (PWM) Representation
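For the wildcard model on this slide, a small sketch of the matching rule follows. It assumes that `*` matches any nucleotide and that all other positions must agree exactly; `matches_wildcard` is a hypothetical helper name.

```python
def matches_wildcard(motif: str, candidate: str, wildcard: str = "*") -> bool:
    """True if candidate agrees with motif at every non-wildcard position."""
    return len(motif) == len(candidate) and all(
        m == wildcard or m == c for m, c in zip(motif, candidate)
    )


# The ten binding sites listed on the slide for the wildcard motif T*GA*A.
wildcard_sites = ["TTGACA", "TCGATA", "TTGATA", "TGGAAA", "TTGAGA",
                  "TCGACA", "TGGAGA", "TTGACA", "TAGATA", "TCGAGA"]
all_wildcard_match = all(matches_wildcard("T*GA*A", s) for s in wildcard_sites)
print(all_wildcard_match)
```

Several of these sites (for example TCGATA) are two mismatches away from TTGACA, so a (6, 1)-mismatch motif would miss them while the wildcard motif covers all ten.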
Gaps in motif Examples of motifs that are not contiguous substrings in the sequence! e.g. ARCA-P (Liu and DeWulf, 2004): the motif is known to be “GTTAAnnnnnnGTTAA”. HAP1 (Scherling and Holmberg, 1996): the motif is known to be “CGGnnnTAnCGGnnnTA”. The runs of n are gaps (spacers). Issues: We don’t know the number of gaps and the length of each gap! The search space may be huge if we try all combinations.
Previous work? • Most previous work does not allow gaps in a motif (such gapless motifs are called monads). • Some algorithms (like MITRA) allow gaps, but only one gap. In this work, we attempt to develop a better approach to locate motifs with an arbitrary number of gaps of possibly different lengths.
Our Proposed Motif Model • We define a (spaced) motif M as a length-L string over the characters {A, C, G, T, n} with at least c x L non-n characters (c is called the coverage). • E.g. L=15, c=1/2, M=ACGCnnGCGTnCTCA Assumption: total gap length is relatively small w.r.t. L. Each maximal substring of {A, C, G, T} characters represents a segment. Each maximal substring of consecutive n characters represents a gap/spacer.
Notion of submotif [to capture similarity!]: We fix a parameter ls; any length-ls substring within any segment of M is called a submotif. • E.g. L=15, c=1/2, M=ACGCnnGCGTnCTCA: if ls = 3, then ACG, CGC, GCG, CGT, CTC, TCA are the submotifs of M. Note: we allow any number of segments (and gaps), and segments (as well as gaps) can have different lengths.
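The segment, gap, and submotif definitions can be sketched as follows. The helper names (`segments`, `submotifs`, `meets_coverage`) are assumptions made for illustration; positions are reported 1-based.

```python
import re


def segments(motif: str):
    """(1-based start, text) of each maximal run of A/C/G/T characters."""
    return [(m.start() + 1, m.group()) for m in re.finditer(r"[ACGT]+", motif)]


def submotifs(motif: str, ls: int):
    """(1-based start, text) of every length-ls substring inside a segment."""
    found = []
    for start, seg in segments(motif):
        for i in range(len(seg) - ls + 1):
            found.append((start + i, seg[i:i + ls]))
    return found


def meets_coverage(motif: str, c: float) -> bool:
    """True if the motif has at least c * L non-n characters."""
    non_gap = sum(ch != "n" for ch in motif)
    return non_gap >= c * len(motif)


M = "ACGCnnGCGTnCTCA"
print(segments(M))                      # three segments separated by two gaps
print([s for _, s in submotifs(M, 3)])  # the six submotifs from the slide
print(meets_coverage(M, 0.5))
```

Note that a window such as GCn is never reported, because a submotif must lie entirely inside one segment.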
Instance (binding site) of a motif A length-L string I in a given sequence is called an instance (a binding site) of M if, for each submotif of M, I contains at the corresponding position a substring with at most d mismatches from that submotif (an (ls, d)-substring). • e.g. Assume (ls, d) = (3,1) • M=ACGCnnGCGTnCTCA • I=ATGCcgGCATtCTTA • I’=ATGCcgCCATtCTTA I is an instance; I’ is not (CCA differs from the submotif GCG in 2 positions)! The problem: Given a set S of t sequences, find all (spaced) motifs M such that M has at least q instances.
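A hedged sketch of the instance test: it assumes each submotif is compared with the substring at the same offset in the candidate, which is how the slide's example reads. The name `is_instance` is illustrative.

```python
def is_instance(motif: str, candidate: str, ls: int, d: int) -> bool:
    """Check every submotif of motif against the aligned window of candidate."""
    if len(motif) != len(candidate):
        return False
    cand = candidate.upper()  # the slide writes gap-aligned characters in lowercase
    for i in range(len(motif) - ls + 1):
        window = motif[i:i + ls]
        if "n" in window:
            continue  # window touches a gap, so it is not a submotif
        mismatches = sum(a != b for a, b in zip(window, cand[i:i + ls]))
        if mismatches > d:
            return False
    return True


M = "ACGCnnGCGTnCTCA"
print(is_instance(M, "ATGCcgGCATtCTTA", 3, 1))  # I from the slide
print(is_instance(M, "ATGCcgCCATtCTTA", 3, 1))  # I' from the slide
```

The second candidate fails only at the submotif GCG, where the aligned window CCA has two mismatches.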
Example of Motif • If q = 4, M=ACGCnnGCGTnCTCA is a spaced motif as M has 4 instances in S TTCAACGCACGCGTTCTCAGCTCAGCTGAACGTCGGTCACT CGACATTCCACGGCTATGCCGGCATACGCACCGTACGCTCC CGGCAATTACTCGTCAGTTCTACGTGCGCGACCCCATCCCA ACTGCGCATGAACACTCCTGAGTGATCACTAAAGTTCGGTG
Idea of the SPACE Algorithm • Input sequences S: TTGATACCGAAGATACCGATTAGAAATCACTCA ACTACAGAAAAGCAGTAGTAAAACTGTACAGTC GAAGACCGTCATGAGAAATCGCATACACGAGCA TTCACCCGATAAAAATAAGGCTGTCTGGACTAA TCGGAACAATTACGAAGAAAAGCAGTAGAAAAA
Finding motif candidates Consider any length-L substring in S. e.g. L = 20 S1: TTGATACCGAAGATACCGATTAGAAATCACTCA “GAAGATACCGATTAGAAATC” starting at pos 9. Then, for each other length-L substring in S, check if the shared (ls, d)-substrings cover at least c x L characters. e.g. L = 20, ls = 5, d = 1, c = ½ (i.e. c x L = 10). S2: ACTACAGAAAAGCAGTAGTAAAACTGTACAGTC
GAAGATACCGATTAGAAATC
GAAAA       TAAAA
GAAAAGCAGTAGTAAAACTG
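The comparison of a candidate with another length-L substring can be sketched as below. This is a direct, unoptimised reading of the definition; `shared_positions` and `coverage` are hypothetical names, and positions are 1-based as on the slides.

```python
def shared_positions(p: str, i: str, ls: int, d: int):
    """1-based starts k where p[k..k+ls-1] and i[k..k+ls-1] differ in <= d places."""
    if len(p) != len(i):
        raise ValueError("p and i must have the same length")
    return [
        k + 1
        for k in range(len(p) - ls + 1)
        if sum(a != b for a, b in zip(p[k:k + ls], i[k:k + ls])) <= d
    ]


def coverage(positions, ls: int) -> int:
    """Number of distinct characters covered by the shared windows."""
    covered = set()
    for k in positions:
        covered.update(range(k, k + ls))
    return len(covered)


P = "GAAGATACCGATTAGAAATC"   # from S1, starting at position 9
I = "GAAAAGCAGTAGTAAAACTG"   # from S2, starting at position 7
positions = shared_positions(P, I, ls=5, d=1)
print(positions, coverage(positions, 5))
```

For this pair the shared windows start at positions 1 and 13 and cover 10 characters, which meets the threshold c x L = 10.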
More examples: • S1:TTGATACCGAAGATACCGATTAGAAATCACTCA • S2:ACTACAGAAAAGCAGTAGTAAAACTGTACAGTC • S3:GAAGACCGTCATGAGAAATCGCATACACGAGCA • S4:TTCACCCGATAAAAATAAGGCTGTCTGGACTAA • S5:TCGGAACAATTACGAAGAAAAGCAGTAGAAAAA Note that the set of shared (ls, d) substrings may not be the same for every case.
Extracting shared (ls, d)-substrings • GAAGATACCGATTAGAAATC • GAAAA TAAAA • GAAAAGCAGTAGTAAAACTG(1,13) • GAAGA ATGAGAAATC • GAAGACCGTCATGAGAAATC(1,11,12,13,14,15,16) • CGATAA ATAAG • CCCGATAAAAATAAGGCTGT(3,4,12) • GAAGAC TAGAAA • GAAGACCAGCAGTAGAAAAA(1,2,13,14)
Mining the frequent pattern • Find the patterns which occur at least q times. • Example (q=3): • (1, 13) • (1, 11, 12, 13, 14, 15, 16) • (3, 4, 12) • (1, 2, 13, 14) • The pattern is (1, 13), which is contained in 3 of the 4 position lists. • If the pattern gives a coverage of at least c x L, then we have found a spaced motif. • Hence, GAAGAnnnnnnnTAGAA • This step is done using a well-known data mining technique.
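The slide does not name the data mining technique, so the sketch below uses a small Apriori-style level-wise search purely as a stand-in. Reading the motif characters off the candidate P is also an assumption, as are all function names.

```python
from itertools import combinations


def frequent_patterns(transactions, q):
    """Map each position set contained in at least q transactions to its support."""
    sets = [frozenset(t) for t in transactions]

    def support(c):
        return sum(c <= s for s in sets)

    items = sorted({x for s in sets for x in s})
    level = [frozenset([x]) for x in items if support(frozenset([x])) >= q]
    frequent = {}
    while level:
        for c in level:
            frequent[c] = support(c)
        # Join step: merge two frequent k-sets that differ in exactly one item.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= q]
    return frequent


def maximal(frequent):
    """Frequent patterns that are not a proper subset of another frequent pattern."""
    return [p for p in frequent if not any(p < other for other in frequent)]


def pattern_to_motif(p, positions, ls):
    """Keep characters of p covered by the pattern, write n elsewhere, trim the ends."""
    covered = {k + j for k in positions for j in range(ls)}
    chars = [ch if idx in covered else "n" for idx, ch in enumerate(p, start=1)]
    return "".join(chars).strip("n")


lists = [(1, 13), (1, 11, 12, 13, 14, 15, 16), (3, 4, 12), (1, 2, 13, 14)]
maximal_patterns = maximal(frequent_patterns(lists, q=3))
motif = pattern_to_motif("GAAGATACCGATTAGAAATC", sorted(maximal_patterns[0]), ls=5)
print(maximal_patterns, motif)
```

With q = 3 only positions 1 and 13 are frequent, so (1, 13) is the single maximal pattern and the resulting string is the slide's GAAGAnnnnnnnTAGAA.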
Generate the results and compute the significance score • M = GAAGAnnnnnnnTAGAA TTGATACCGAAGATACCGATTAGAAATCACTCA ACTACAGAAAAGCAGTAGTAAAACTGTACAGTC GAAGACCGTCATGAGAAATCGCATACACGAGCA TTCACCCGATAAAAATAAGGCTGTCTGGACTAA TCGGAACAATTACGAAGAAAAGCAGTAGAAAAA • The motif M is scored based on how unlikely it is for this pattern to occur by chance. [Details omitted!]
Summary of the algorithm • Finding motif candidates • For every length-L substring P of S • Find all motif instances of P in S • Extract the shared (ls,d)-substrings in all motif instances • Mine patterns occurring at least q times • Generate the motif and its significance score • Sort the motifs based on significance score • Report the motifs
Some technical details • A straightforward implementation of the previous algorithm is slow. • The bottleneck is the step of finding motif candidates (it takes $O(Ln^2)$ time, where n is the total length of the sequences).
Idea: Window shifting • Suppose we know the coverage between P and I • P=GAAGATACCGATTAGAAATC • GAAGA ATGAGAAATC (1, 11, 12, 13, 14, 15, 16) • I=GAAGACCGTCATGAGAAATC • Coverage = 15 • We can find the coverage between P’ and I’ (both shifted right by one character) efficiently: drop the first length-ls substring of P and check only the last length-ls substring of P’ • P’=AAGATACCGATTAGAAATCG • ATGAGAAATCT (10, 11, 12, 13, 14, 15, 16) • I’=AAGACCGTCATGAGAAATCT • Coverage = 11 Time complexity reduced to $O(n^2)$.
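One way to picture the saving is a sliding mismatch count: when a window moves right by one character, one comparison leaves and one enters, so nothing else needs recomputing. The sketch below applies that idea to the length-ls windows of two aligned strings; it illustrates the principle under that assumption and is not the authors' implementation. `window_mismatches` is a hypothetical name.

```python
def window_mismatches(s: str, t: str, ls: int):
    """Mismatch count of every aligned length-ls window, using O(1) work per shift."""
    n = min(len(s), len(t))
    if n < ls:
        return []
    mism = [int(s[k] != t[k]) for k in range(n)]
    count = sum(mism[:ls])           # first window, computed directly
    counts = [count]
    for k in range(ls, n):
        count += mism[k] - mism[k - ls]   # add entering char, drop leaving char
        counts.append(count)
    return counts


P = "GAAGATACCGATTAGAAATC"
I = "GAAAAGCAGTAGTAAAACTG"
counts = window_mismatches(P, I, 5)
shared = [k + 1 for k, c in enumerate(counts) if c <= 1]
print(shared)
```

Running the same scan along a whole diagonal of two input sequences yields the shared windows for every pair of length-L substrings on that diagonal without restarting from scratch.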
Idea: Pruning on Coverage Let Sa = S[a..a+L-1] be the motif candidate and Tb = T[b..b+L-1] the substring being considered. Let C be the coverage of Tb on Sa. We can easily get an upper bound on the coverage of Sa+p and Tb+p for any p > 0. • Suppose we know the coverage between Sa and Tb • Sa=CGAAGATACCGTTAGAAATC • GAAGA (2) • Tb=TGAAGACCGTCTGACCGATC • Coverage = 5 • If we move both substrings to the right by one character • Sa+1=GAAGATACCGTTAGAAATCG • GAAGA AATCG (2, 16) • Tb+1=GAAGACCGTCTGACCGATCG • Coverage = 10 The coverage of Sa+p and Tb+p is upper bounded by C + (ls-1) + p
Experimental Results Compare the effectiveness of our tool with 13 other existing software tools. • Testing Data: • Motif Assessment Benchmark Datasets (Martin Tompa, 2005): 56 datasets constructed from 4 different species (fly, human, mouse, yeast). • 9 datasets extracted from the literature with some identified spaced motifs. Evaluation Measures: Sensitivity (Sn): % of known binding sites identified by the tool. Specificity (PPV): % of predicted binding sites that match with the known binding sites.
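The two evaluation measures can be sketched as set operations. The example assumes a predicted site counts as a hit only when it matches a known site exactly; the benchmark's actual matching rule is not given on the slide, and the site coordinates below are made up for illustration.

```python
def sensitivity_and_ppv(known, predicted):
    """Return (Sn, PPV) as fractions for two collections of binding sites."""
    known, predicted = set(known), set(predicted)
    hits = known & predicted
    sn = len(hits) / len(known) if known else 0.0
    ppv = len(hits) / len(predicted) if predicted else 0.0
    return sn, ppv


# Hypothetical (sequence id, start position) pairs, not real benchmark data.
known_sites = {("seq1", 12), ("seq2", 40), ("seq3", 7), ("seq4", 25)}
predicted_sites = {("seq1", 12), ("seq2", 40), ("seq2", 3)}
sn, ppv = sensitivity_and_ppv(known_sites, predicted_sites)
print(sn, ppv)
```

Here two of four known sites are recovered (Sn = 0.5) and two of three predictions are correct (PPV of about 0.67).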
Benchmark Dataset – Comparison on Four Organisms with the best-performing software [Figure not reproduced: per-organism comparison charts; tools named in the chart labels: Weeder, AnnSPEC, Improbizer + YMF, SeSiMCMC.]
Experimental Results – Spaced motif real datasets SPACE also performs better (in terms of both measures) in all cases. An example (ARCA-P): Literature: GTTAAnnnnnnGTTAA; SPACE: GTTAnnnnnATGTTA; MITRA: GTTAACT
Conclusions • SPACE is found to be effective in locating spaced motifs (as well as motifs without gaps). • After the paper was published, quite a number of biologists emailed us for the tool. Some major genome laboratories, such as The Computational Biology Research Center (CBRC) of Japan, are considering listing our tool (with a link to our web-based version) in their collection. • We are extending our work to handle motifs for which the same gap may have different lengths across different instances.
Some Related Projects Traditional motif models “Assume that a motif is a contiguous substring in the sequence” Spaced Motifs Motifs with Character Dependence Additional Information: e.g. Negative Set – sequences known to be without the binding sites
Motifs with Nucleotide Dependence • Joint work with Prof. Chin & Henry Leung • A new model to capture the dependency of nucleotides (characters) in a motif The SPSP (Scored Position specific Pattern) Model e.g The 5th, 6th, 7th characters are dependent, i.e., if 5th character is “T”, then the 6th and 7th characters must be “GA”.
Some on-going Bioinformatics Projects Traditional motif models “Assume that a motif is a contiguous substring in the sequence” Spaced Motifs Ensemble Approaches for Combining Results from different Motif Finders Motifs with Nucleotide Dependence Additional Information: e.g. Negative Set – sequences known to be without the binding sites
Protein Motif Pairs Motif Modules Traditional motif models “Assume that a motif is a contiguous substring in the sequence” Spaced Motifs Single Motif Ensemble Approaches for Combining Results from different Motif Finders Additional Information: e.g. Negative Set – sequences known to be without the binding sites Motifs with Nucleotide Dependence <Thank you!>
Positional Weight Matrix (PWM) 2. Positional Weight Matrix (PWM) Representation • A motif is modeled as a 4 x L matrix. • The binding site is of length L. • The 4 rows are labeled by “A”, “C”, “G”, “T”. • The jth column provides the probability of the occurrence of each nucleotide at position j of the binding site. Example binding sites: TTGACA TCGACA TTGACA TTGAAA ATGACA TTGACA GTGACA TTGACT TTGACC TTGACA Remark: Solution space is infinite, only suboptimal answers will be produced.
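As an illustration of the 4 x L matrix, the sketch below estimates column probabilities as plain relative frequencies over the ten example sites. Using raw frequencies without pseudocounts is a simplifying assumption, and `build_pwm` is a hypothetical name.

```python
from collections import Counter


def build_pwm(sites):
    """Return {nucleotide: [probability at position 1..L]} from aligned sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pwm = {nt: [0.0] * length for nt in "ACGT"}
    for j in range(length):
        counts = Counter(site[j] for site in sites)
        for nt in "ACGT":
            pwm[nt][j] = counts[nt] / len(sites)
    return pwm


sites = ["TTGACA", "TCGACA", "TTGACA", "TTGAAA", "ATGACA",
         "TTGACA", "GTGACA", "TTGACT", "TTGACC", "TTGACA"]
pwm = build_pwm(sites)
for nt in "ACGT":
    print(nt, pwm[nt])
```

Each column sums to 1; for example, position 1 is T in 8 of the 10 sites, so the T row starts with 0.8.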
SPSP model • Remark: A length-11 string is considered a binding site if • it matches P and • its score (sum of corresponding entries) is at most 3.1. For example, the score of “CGGATGAATGG” is $-\log(1) - \log(0.6) - \log(1) - \log(0.8) - \log(1) \approx 1.06 < 3.1$ (logarithms to base 2). The score of “CGGACGGAAGG” is $-\log(1) - \log(0.4) - \log(1) - \log(0.2) - \log(1) \approx 3.6 > 3.1$. The string “TGGATGAATGG” does not match P.
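The two scores above can be reproduced with a short calculation. The probabilities are the entries quoted on the slide; taking logarithms to base 2 is an inference from the quoted totals, and `spsp_score` is an illustrative name (the pattern P itself is not reproduced in this transcript).

```python
import math


def spsp_score(probabilities):
    """Sum of -log2(p) over the entries selected by a candidate string."""
    return sum(-math.log2(p) for p in probabilities)


THRESHOLD = 3.1
score_accept = spsp_score([1, 0.6, 1, 0.8, 1])   # entries for CGGATGAATGG
score_reject = spsp_score([1, 0.4, 1, 0.2, 1])   # entries for CGGACGGAAGG
print(round(score_accept, 2), score_accept <= THRESHOLD)
print(round(score_reject, 2), score_reject <= THRESHOLD)
```

A lower score means the string uses more probable entries, so the first string passes the threshold and the second does not.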
Significance Test & Scoring • Intuitively, a motif is significant if • the number of occurrences is much larger than expected • the pattern is either very conserved or occurs in many of the input sequences. • Let Occ_s(M, e) be the total number of observed occurrences of M with at most e mutations. • Let E(M, e) be the expected frequency of M with at most e mutations, based on a set of background sequences. • (1) Occurrence score: $X(M) = \log \dfrac{\mathrm{Occ}_s(M, e)}{E(M, e) \times \text{total length of seq}}$
(2) For a sequence s with an occurrence of M, consider the most conserved instance (let e’ be its number of mutations). Then E(M, e’) x Len(s) is the expected frequency of this motif in s. Note that this value is small if the motif is very conserved. Sequence-specific score: $Y(M) = \sum_{s} \log \dfrac{1}{E(M, e') \times \mathrm{Len}(s)}$, summed over all sequences s with an occurrence of M. Overall score = X(M) + Y(M).
GCACGCGGTATCGTTAGCTTGACAATGAAGACCCCCGCTCGACAGGAAT GCATACTTTGACACTGACTTCGCTTCTTTAATGTTTAATGAAACATGCG CCCTCTGGAAATTAGTGCGGTCTCACAACCCGAGGAATGACCAAAGTTG GTATTGAAAGTAAGTAGATGGTGATCCCCATGACACCAAAGATGCTAAG CAACGCTCAGGCAACGTTGACAGGTGACACGTTGACTGCGGCCTCCTGC GTCTCTTGACCGCTTAATCCTAAAGGGAGCTATTAGTATCCGCACATGT GAACAGGAGCGCGAGAAAGCAATTGAAGCGAAGTTGACACCTAATAACT Given a set of (regulatory) sequences that possibly bind to the same transcription factor, the problem is to search for the common binding site pattern (motif) that is over-represented in the sequences, i.e. unlikely to occur by chance. Frequency alone is not enough: e.g. TT occurs 17 times, but the length of each seq. = 49, total length ~ 350, so the expected no. of occurrences is ~ 20 and TT is not over-represented.
GCACGCGGTATCGTTAGCTTGACAATGAAGACCCCCGCTCGACAGGAAT GCATACTTTGACACTGACTTCGCTTCTTTAATGTTTAATGAAACATGCG CCCTCTGGAAATTAGTGCGGTCTCACAACCCGAGGAATGACCAAAGTTG GTATTGAAAGTAAGTAGATGGTGATCCCCATGACACCAAAGATGCTAAG CAACGCTCAGGCAACGTTGACAGGTGACACGTTGACTGCGGCCTCCTGC GTCTCTTGACCGCTTAATCCTAAAGGGAGCTATTAGTATCCGCACATGT GAACAGGAGCGCGAGAAAGCAATTGAAGCGAAGTTGACACCTAATAACT TTGACA TCGACA TTGACA TTGAAA ATGACA TTGACA GTGACA TTGACT TTGACC TTGACA Motif: TTGACA Binding sites: with at most 1 mismatch It occurs 10 times, but the expected number of occurrences is only $(1/4)^6 \times 6 \times 3 \times 350 \approx 1.5$
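The estimate on this slide can be reproduced as a quick calculation. It assumes uniform, independent nucleotides (probability 1/4 each) and, like the slide, counts the 6 x 3 strings that differ from the motif in exactly one position; `expected_one_mismatch_count` is an illustrative name.

```python
def expected_one_mismatch_count(motif_len: int, total_len: int, alphabet: int = 4) -> float:
    """Expected occurrences of the one-mismatch neighbours of a motif in random DNA."""
    neighbours = motif_len * (alphabet - 1)           # 6 positions x 3 substitutions
    per_position = (1 / alphabet) ** motif_len        # chance of one specific 6-mer
    return neighbours * per_position * total_len


expected = expected_one_mismatch_count(6, 350)
print(round(expected, 2))
```

Ten observed occurrences against an expectation of about 1.5 is what makes TTGACA stand out, in contrast to a short pattern such as TT.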
Synthetic Dataset • Motif AGTTGTC with no spacer • Motif CCTGTnnnAGTTGTC containing 2 segments with a spacer. • Motif ATCGTnnnTGACCnnnCTTTC containing 3 segments of length 5 with spacers. • Motif CGGCnnnnnnTCTAA containing 2 segments with a spacer. Note: for each segment, the instances may have one mismatch.
Datasets • 9 real datasets with known spaced motifs. • Motif Assessment Benchmark Datasets (Tompa, 2005). • Consists of 56 datasets from 4 different species (fly, human, mouse, yeast). • Each dataset contains 1 to 35 sequences. • Sequence length up to 3K bp. • Six synthetic datasets containing implanted motifs with variations in: • spacer length, • number of motif parts (segments), and • segment length.
Experimental Results Parameterization • Perform intersections with 12 runs of these parameter combinations (3 x 2 x 2 = 12): • L = 8, 15, 20 (max motif length) • c = 0.5, 0.8 (coverage) • q = t, 0.5t (min support) • ls = 5 (substring length) • d = 1 (mismatches)