250 likes | 272 Views
An Implementation of The Teiresias Algorithm. Na Zhao Chengjun Zhan. Outline. Introduction of Pattern Discovery Basic Definitions Teiresias Algorithm & Implementation Scan Phase Convolution Phase Results & Conclusion Q & A. What is Pattern Discovery?. Pattern:
E N D
An Implementation of The Teiresias Algorithm Na Zhao Chengjun Zhan
Outline • Introduction of Pattern Discovery • Basic Definitions • Teiresias Algorithm & Implementation • Scan Phase • Convolution Phase • Results & Conclusion • Q & A
What is Pattern Discovery? • Pattern: • A region or portion of a protein sequence. It may has specific structure and functionally significant. • Protein family should have similar patterns so they can be characterized. • Pattern Discovery: • Detect patterns from known protein sequences. • The result can be used to detect unknown protein sequences.
Why Pattern Discovery Useful? • For some proteins that have similar biological properties on structural or functional features: • Group together these protein sequences • Discover a set of common sub-sequences • Study and observe these sub-sequences • The detected patterns may help to classify a protein
Basic Definitions • ∑ (Basic alphabet set): • The amino-acid with names can be abbreviated as the listed symbols in alphabetical order • ∑={A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y} • (Character/Equivalency group alphabets): • a character corresponding to a subset of Σ. • Ex: A-[CJ]-G • . (Wild-card or don’t care): • a special kind of ambiguous character that matches any character in ∑. • Ex: X in protein sequences and N in nucleotide
Basic Definitions (Cont.) • Pattern P is a (L, W) pattern iff: • P is a string of characters (Σ and wild cards ‘.’). • P starts and ends with a character from Σ. Characters in Σ are called residues. • Any sub pattern of P (i.e subsequence starting and ending with a character from Σ) containing exactly L non-wildcard characters (residues) has length of at most W. • Ex. L=3 and W=5, “CD..E” is a (3, 5) pattern.
TEIRESIAS Algorithm • Target: • Search for patterns consisting of characters of the alphabet Σ and wild cards ‘.’ • Basic Idea: • If a pattern P is a (L, W) pattern occurring in at least K sequences, then its sub patterns are also (L, W) patterns occurring in at least K sequences (K>=2). • Ex: pattern “A.BC” is more specific than “A..C” • Designed for non-aligned sequences
Two Phases • Scan Phase: • Scan the input sequences • Identify all the elementary patterns which satisfy the minimum support • Convolution Phase: • Sort the element patterns and extract useful prefixes and suffixes • Convolve the elementary patterns until find the maximal length and composition of the patterns
Scan Phase • Input: • Non-aligned sequences • Parameter: K, L, W • Output: a set of “Elementary Patterns” • Are (L, W) patterns • Occur in at least K sequences • Contain exactly L non-wildcards
Elementary Pattern Examples K=2, L=3, W=5 SDFBASTSLP LFCASTSCP FDASTSNP TGLPACTTCPTFCP AST T.LP
Scan Phase (Cont.) • Empty the stack of elementary patterns. (EP) • For each letter in the alphabet, count how many sequences contain this letter. • If less than K sequences contain this letter, ignore it. • Otherwise, extend it until it is ignored or it is accepted.
Convolution Phase (Overall) • There is a sub-phase named pre-convolution before the convolution phase. • The goal of the convolution phase is to extend a elementary pattern with other elementary patterns.
Pre-convolution Phase • Prefix of a pattern p is the unique substring containing the first L-1 residues of p. • Suffix of pattern p is the unique substring containing the last L-1 residues of p. • Example: for pattern a..c.e, prefix is a..c, suffix is c.e
Pre-convolution Phase (cont.) • Given two pattern p and q, p convolve q is empty if the suffix of p is not the prefix of q, • Otherwise, it is the pattern p followed by the part of q which remains when its prefix is removed.
Pre-convolution Phase (cont.) • Left vector of a pattern q contains all the patterns p that can be left-convolved with it. • Right vector of a pattern p contains all the patterns q that can be right-convolved with it. • Example: Left a.c.d: ea.c b.a.c right a.c.d: fc.d
Pre-convolution Phase (cont.) • Sort elementary patterns by prefix-wise < • Construct Left and Right vector for each elementary pattern
Prefix-wise < Examples • Example 0 abc 5 bc..f 1 bcd 6 a.cd 2 ab.d 7 b.de 3 cd.f 8 a.c.e 4 ab..e 9 a..de
Convolution Phase • For each pattern t of the sorted EPs, firstly it is extended to the left by convolving it with every member in Left[t]. If the result r doesn’t have sufficient support or is redundant, discard it. • If r has the same number of occurrences as t, t is non-maximal and is discarded.
Convolution Phase (cont.) • If result r has adequate support and is not redundant, r is placed on the stack and becomes the new top. • When the pattern at the top of the stack can not be extended to the left any more, extend it to the right. • When all the extensions have been done, the current top is popped from the stack. If it is maximal, it is put into the output set.
An Example Scenario • Suppose there are three sequences: SDFEASTS LFCASTS FDASTSNP • Our goal: to find a maximal pattern, given K = 2, L = 3, W = 5
An Example Scenario (cont.) • First, the scan phase is implemented • The elementary patterns (EP) found are: AST AS.S A.TS F.AS F.A.T F..ST STS • Then these EP are sorted, the result is: AST STS AS.S A.TS F.AS F.A.T F..ST
An Example Scenario (cont.) • Then, the Left and Right vectors are constructed: Left Right AST: F.AS AST: STS STS: AST, F..ST STS: AS.S: F.AS AS.S: A.TS: F.A.T A.TS: F.AS: F.AS: AST, AS.S F.A.T: F.A.T: A.TS F..ST: F..ST: STS
An Example Scenario (cont.) • Now the convolution phase is executed, and the result is as following: F.ASTS
About Our Implementation • The program is written using Visual C++ 6.0 • Command line arguments are provided to users for them to specify the input data file, the K, L, W value (which have default values (2, 3, 5) if users don’t specify) >Teiresias <Filename> [<K>] [<L>] [<W>]