Motif Extraction and Grammar Induction in Language and Biology David Horn Tel Aviv University http://horn.tau.ac.il
Analogy between languages
• Human languages are realized in streams of speech or in lines of text
• All computational problems can be realized by a Turing machine
• Hereditary biology makes use of chains of nucleotides in DNA and RNA and chains of amino acids in proteins
All are one-dimensional representations of reality
Reality, however, is not one-dimensional; hence one needs a set of syntactic rules to make sense of the text, and a semantic mapping into actions in the real world. The distinction between syntactic and semantic levels (Chomsky 1957) is intuitively clear in human languages. In biology, syntax refers to the structure of the sequences, whereas semantics relates the different sequence elements to the complicated process leading from transcription, the birth of mRNA, to translation, the birth of the protein.
Examples of syntax (semantics):
• segmentation of the chromosome into genes (originators of proteins), promoters (transcription regulation) and 3’ UTRs (translation regulation)
• segmentation of genes into exons (coding) and introns (noncoding)
• finding motifs on promoters (TFBS relevant to enabling or forbidding transcription)
• finding motifs on proteins (relevant for protein interaction and protein functionality), etc.
Can we induce the rules (and, sometimes, the words) from the texts?
• ADIOS is an algorithm that induces syntax, or grammar, from the text.
• MEX is an algorithm that extracts motifs, or patterns, from the text.
Zach Solan, David Horn, Eytan Ruppin and Shimon Edelman. Unsupervised learning of natural languages. Proc. Natl. Acad. Sci. USA, 102 (2005) 11629-11634.
MEX: motif extraction algorithm
• Create a graph whose vertices are words (for text) or letters (for biological sequences)
• Load all strings of text onto the graph as paths over the vertices
• Given the loaded graph, consider trial-paths that may coincide with original strings of text
• Use context-sensitive statistics to define left- and right-moving probabilities that help to define motifs
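The first two steps above (building the graph and loading strings as paths) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `load_paths`, the `begin`/`end` sentinel vertex names, and the choice to label each edge with (path index, position) pairs are all hypothetical.

```python
from collections import defaultdict

BEGIN, END = "begin", "end"  # hypothetical sentinel vertex names

def load_paths(strings):
    """Load each string as a path over a directed graph of symbols.

    Returns (paths, edges): paths is a list of vertex lists, and edges maps
    each directed edge (u, v) to the (path_index, position) pairs crossing it.
    """
    paths = []
    edges = defaultdict(list)
    for k, s in enumerate(strings, start=1):
        path = [BEGIN, *s, END]  # works for a string of letters or a list of words
        paths.append(path)
        for pos, (u, v) in enumerate(zip(path, path[1:]), start=1):
            edges[(u, v)].append((k, pos))
    return paths, dict(edges)

paths, edges = load_paths(["alice", "lice"])
print(edges[("l", "i")])  # the edge l -> i is shared by both paths
```

Edges traversed by many paths are the raw material for the statistics used later; the vertices themselves stay unique, so repeated sub-strings show up as bundles of paths over the same edges.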
Motifs EXtraction (MEX): find patterns in strings of letters
Given a set of strings:
a l i c e w a s b e g i n n i n g t o g e t v e r y t i r e d o f s i t t i n g b y h e r s i s t e r o n t h e b a n k a n d o f h a v i n g n o t h i n g t o d o o n c e o r t w i c e s h e h a d p e e p e d i n t o t h e b o o k h e r s i s t e r w a s r e a d i n g b u t i t h a d n o p i c t u r e s o r c o n v e r s a t i o n s i n i t a n d w h a t i s t h e u s e o f a b o o k t h o u g h t a l i c e w i t h o u t p i c t u r e s o r c o n v e r s a t i o n
Extracted motifs:
alicewas beginning toget verytiredof sitting by hersister onthebank and of having nothing todo onceortwice shehad peep ed intothe book hersister was reading butit hadno pictures or conversation s init and what is theuseof abook thoughtalice without pictures or conversation
Creating the graph (directed)…
• ∑ = {a-z}
• Example string: alice was
[Figure: directed graph with one vertex per letter a-z plus begin and end vertices; edges carry (path, position) labels (1,1) to (1,6) and (2,1) to (2,4)]
Creating the graph… (Cont’d)
[Figure: structured graph over vertices a to p; edges carry {path;position} labels such as {1002;1}, {1002;2} and {1003;3} to {1003;14}]
Creating the graph… (Cont’d)
[Figure: random graph over vertices a to p]
Creating the graph (Cont’d)
(1) a l i c e w a s b e g i n n i n g t o g e t v e r y t i r e d o f s i t t i n g b y h e r s i s t e r o n t h e b a n k a n d o f c o n v e r s a t i o n
(2) w h e n a l i c e h a d b e e n a l l t h e w a y d o w n o n e s i d e a n d u p t h e o t h e r t r y i n g e v e r y d o o r s h e w a l k e d s a d l y
[Figure: the two strings loaded as paths between the begin and end vertices]
Searching for patterns
[Figure: structured graph over vertices 1 to 16; edges carry {path;position} labels such as {1002;1}, {1002;2} and {1003;3} to {1003;14}]
Searching for patterns (Cont’d)
[Figure: a search path compared with other paths over numbered vertices 1 to 9]
L matrix (numbers of paths)
• $l(e_i; e_j)$ = number of occurrences of the sub-path $(e_i; e_j)$
• where $(e_i; e_j)$ is: $e_i \to e_{i+1} \to e_{i+2} \to \dots \to e_{j-1} \to e_j$
Calculate conditional probabilities
From L to P: calculating conditional probabilities
• P(a) = 0.08
• P(l|a) = 1046/8770
• P(i|al) = 486/1046
• P(c|ali) = 397/486
• P(e|alic) = 397/397
• P(w|alice) = 48/397
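The ratios above are quotients of consecutive sub-path counts, e.g. P(l|a) = l(a;l) / l(a). The sketch below computes such right-moving probabilities along a trial path on a toy corpus. The function names `count_subpath` and `right_probabilities` and the three-string corpus are illustrative assumptions; the counts on the slide come from the full ALICE text, which is not reproduced here.

```python
def count_subpath(paths, sub):
    """l(e_i; e_j): number of occurrences of the contiguous sub-path `sub`."""
    sub = list(sub)
    n = len(sub)
    return sum(
        1
        for path in paths
        for start in range(len(path) - n + 1)
        if list(path[start:start + n]) == sub
    )

def right_probabilities(paths, trial):
    """For k = 2..len(trial): l(e_1; e_k) / l(e_1; e_{k-1}) along the trial path."""
    probs = []
    prev = count_subpath(paths, trial[:1])
    for k in range(2, len(trial) + 1):
        cur = count_subpath(paths, trial[:k])
        probs.append(cur / prev if prev else 0.0)
        prev = cur
    return probs

# Toy corpus (an assumption for illustration, not the slide's data).
corpus = ["alicewas", "alicehad", "alltheway"]
print(right_probabilities(corpus, "alice"))
```

A sharp decrease in these ratios after a run of values near 1 (as in P(w|alice) = 48/397 following P(e|alic) = 397/397 on the slide) is the signal used to mark the end of a candidate motif.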
Probability Matrix
• P_R = right-moving probability (end of pattern)
• P_L = left-moving probability (beginning of pattern)
[Equation: three-case definition of the matrix entries in terms of P_R and P_L; the cases are not recoverable]
Significance test
• $P_n = (e_1, e_n)$: a potential pattern edge
• $m$ = # of paths from $e_1$ to $e_n$; $r$ = # of paths from $e_1$ to $e_{n+1}$
• […] ≡ ($P_n$ is a pattern edge)
• Assume $P^*_{n+1}$ = the “true” $P_{n+1}$, given $(e_1, e_n)$
• $H_0$: $P^*_{n+1} \geq P_n \cdot \eta$ (does not diverge); $H_1$: $P^*_{n+1} < P_n \cdot \eta$
• The odds of receiving results at least as “extreme” as $r$ and $m$ are: […]
• If the outcome is less than a predetermined $\alpha$, the pattern is significant
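The expression for the odds did not survive on this slide. One plausible reading, offered here purely as an assumption, is a one-sided binomial test: at the boundary of H0 each of the m paths continues with probability η·P_n, and the odds are the probability of seeing r or fewer continuations. The function names and the values of η and α below are illustrative only.

```python
from math import comb

def binomial_lower_tail(r, m, p0):
    """P(X <= r) for X ~ Binomial(m, p0)."""
    return sum(comb(m, x) * p0**x * (1 - p0) ** (m - x) for x in range(r + 1))

def is_significant_drop(r, m, p_n, eta, alpha):
    """Assumed test: reject H0 (P*_{n+1} >= P_n * eta) when the tail is below alpha."""
    p0 = min(1.0, p_n * eta)
    return binomial_lower_tail(r, m, p0) < alpha

# Counts from the ALICE example: 397 paths reach 'alice', 48 continue to 'w'.
# p_n = 1.0 is P(e|alic); eta = 0.65 and alpha = 0.01 are illustrative choices.
print(is_significant_drop(r=48, m=397, p_n=1.0, eta=0.65, alpha=0.01))
```

Under this reading, a drop is accepted as a pattern edge only when the observed count r is too small to be explained by sampling noise around the threshold η·P_n.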
Rewiring the graph
• Once the algorithm has reached the stop criterion (e.g. it ceases to locate new patterns), for each significant pattern the sub-paths it subsumes are merged into a new vertex
• The graph is rewired in a length–significance descending order
ALICE motifs
Motifs are selected in order of:
- length
- weight (significance of drop)
Shown here are the results of one run over a trial-path and the beginning of the list of motifs extracted from it
Application to Biology
• Vertices of the graph: 4 or 20 letters
• Paths: gene-sequences, promoter-sequences, protein-sequences
• Conditional probabilities on the graph are proportional to the number of paths
• Trial-path: testing transition probabilities to extract motifs
Extracting Motifs from Enzymes
• Each enzyme sequence corresponds to a single path
• Applying MEX to oxidoreductases
• 6602 enzyme sequences
• MEX motifs are specific subsequences
>P54233 | 1.7.1.1
LLDPRDEGTADQWIPRNASMVRFTGKHPFNGEGPLPRLMHHGFITPSPLRYVRNHGPVPKIKWDEWTVEVTGLVKRSTHFTMEKLMREFPHREFPATLVCAGNRRKEHNMVKQSIGFNWGAAGGSTSVWRGVPLRHVLKRCGILARMKGAMYVSFEGAEDLPGGGGSKYGTSVKREMAMDPSRDIILAFMQNGEPLAPDHGFPVRMIIPGFIGGRMVKWLKRIVVTEHECDSHYHYKDNRVLPSHVDAELANDEGWWYKPEYIINELNINSVITTPCHEEILPINSWTTQMPYFIRGYAYSGGGRKVTRVEVTLDGGGTWQVCTLDCPEKPNKYGKYWCWCFWSVEVEVLDLLGAREIAVRAWDEALNTQPEKLIWNVMGMMNNCWFRVKTNVCRPHKGEIGIVFEHPTQPGNQSGGWMAKEKHLEKSSES
V. Kunik, Z. Solan, S. Edelman, E. Ruppin and D. Horn. CSB2005
Enzyme Motifs
• 3165 motifs were obtained
• Distribution of MEX motifs
[Figure: distribution of MEX motifs; axes: number of motifs, length of motif]
Enzyme Representation
• Each enzyme is represented as a ‘bag of motifs’
>P54233 | 1.7.1.1
LLDPRDEGTADQWIPRNASMVRFTGKHPFNGEGPLPRLMHHGFITPSPLRYVRNHGPVPKIKWDEWTVEVTGLVKRSTHFTMEKLMREFPHREFPATLVCAGNRRKEHNMVKQSIGFNWGAAGGSTSVWRGVPLRHVLKRCGILARMKGAMYVSFEGAEDLPGGGGSKYGTSVKREMAMDPSRDIILAFMQNGEPLAPDHGFPVRMIIPGFIGGRMVKWLKRIVVTEHECDSHYHYKDNRVLPSHVDAELANDEGWWYKPEYIINELNINSVITTPCHEEILPINSWTTQMPYFIRGYAYSGGGRKVTRVEVTLDGGGTWQVCTLDCPEKPNKYGKYWCWCFWSVEVEVLDLLGAREIAVRAWDEALNTQPEKLIWNVMGMMNNCWFRVKTNVCRPHKGEIGIVFEHPTQPGNQSGGWMAKEKHLEKSSES
>P54233 | 1.7.1.1
RDEGTAD, TGKHPFN, LMHHGFITP, YVRNHGPVP, WTVEVTG, PDHGFP, YHYKDN, KVTRVE, YGKYWCW, MGMMNNCWF
• These 1222 MEX motifs cover 3739 enzymes
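The ‘bag of motifs’ representation amounts to recording which motifs occur as subsequences of an enzyme. A minimal sketch, assuming plain substring matching; the function names `bag_of_motifs` and `indicator_vector` and the non-matching motif "XXXXXX" are made up for illustration, while the sequence fragment and the other motifs are taken from the P54233 example above.

```python
def bag_of_motifs(sequence, motifs):
    """Return the motifs (in the order given) that occur as substrings of the sequence."""
    return [m for m in motifs if m in sequence]

def indicator_vector(sequence, motifs):
    """Binary feature vector: 1 if the motif occurs in the sequence, else 0."""
    return [1 if m in sequence else 0 for m in motifs]

# First part of the P54233 sequence; "XXXXXX" is a made-up motif that does not occur.
fragment = "LLDPRDEGTADQWIPRNASMVRFTGKHPFNGEGPLPRLMHHGFITPSPLRYVRNHGPVP"
motifs = ["RDEGTAD", "TGKHPFN", "XXXXXX", "LMHHGFITP", "YVRNHGPVP"]

print(bag_of_motifs(fragment, motifs))
print(indicator_vector(fragment, motifs))
```

A fixed-length indicator vector of this kind is one simple way to hand a bag of motifs to a classifier; whether the original work used binary indicators or motif index lists is not stated here.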
Enzyme Function
• The functionality of an enzyme is determined according to its EC number
• Classification hierarchy [Webb, 1992]
• EC number: n1.n2.n3.n4 (a unique identifier)
  n1: class
  n1.n2: sub-class / 2nd level
  n1.n2.n3: sub-subclass / 3rd level
  n1.n2.n3.n4: precise enzymatic activity
An example: EC 1.12.1.n4
• 1: oxidoreductases
• 1.12: hydrogen as electron donors
• 1.12.1: NAD+ / NADP+ as electron acceptors
• EC 1.12.1.2: NAD+ oxidoreductase, H2 + NAD+ = H+ + NADH
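Since each level of the hierarchy is a prefix of the EC number, the class labels for the 2nd and 3rd level tasks can be read off directly. A small sketch; the function name `ec_levels` is a hypothetical helper, not part of any existing library.

```python
def ec_levels(ec_number):
    """Split an EC number n1.n2.n3.n4 into its cumulative hierarchy prefixes."""
    parts = ec_number.split(".")
    return [".".join(parts[:k]) for k in range(1, len(parts) + 1)]

levels = ec_levels("1.12.1.2")
print(levels)      # class, sub-class, sub-subclass, precise activity
print(levels[1])   # 2nd level label
print(levels[2])   # 3rd level label
```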
Current knowledge regarding enzyme classification
• High sequence similarity is required to guarantee functional similarity of proteins.
• A recent analysis of enzymes by Tian and Skolnick (2003) suggests that 40% pairwise sequence identity can be used as a threshold for safe transferability of the first three digits of the Enzyme Commission (EC) number.
• The EC number, which is of the form n1.n2.n3.n4, specifies the location of the enzyme on a tree of functionalities.
Current knowledge regarding enzyme classification
• Using pairwise sequence similarity, and combining it with the Support Vector Machine (SVM) classification approach, Liao and Noble (2003) have argued that they obtain significantly improved remote homology detection relative to existing state-of-the-art algorithms.
• Cai et al. (2003, 2004) have applied SVM to a protein description based on physico-chemical features of the amino acids, such as hydrophobicity, normalized Van der Waals volume, polarity, polarizability, charge, surface tension, secondary structure and solvent accessibility.
• Ben-Hur and Brutlag (2004) used the eMotif approach and analyzed the oxidoreductases as ‘bags of motifs’. With appropriate feature selection methods they obtain success rates over 90% for a variety of classifiers.
The MEX method
SVM classifier input:
O17433  1148  262   463   610   7987  1627   260
P19992  124   7290  27    111   3706  18128  3432
Q01284  6652  198   1489  710   425   64     55
Q12723  693   145   7290  3712  65    543    522
P14060  455   2664  848   55    128   256    74
Q60555  7290  3712  65    543   522   6748   7159
• Classification Tasks:
• 16 2nd level subclasses
• 32 3rd level sub-subclasses
Methods
• A linear SVM is applied to evaluate the predictive power of MEX motifs
• Enzyme sequences are randomly partitioned into a training-set and a test-set (75%-25%)
• 16 2nd level classification tasks
• 32 3rd level classification tasks
• Performance measurement: Jaccard score, $J = \frac{tp}{tp + fp + fn}$
• The train-test procedure was repeated 40 times to gather sufficient statistics
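The Jaccard score J = tp / (tp + fp + fn) for a single binary classification task can be computed directly from the true and predicted labels. A minimal sketch; the function name and the toy label vectors are illustrative, and returning 0.0 when there are no positives at all is an assumed convention.

```python
def jaccard_score(y_true, y_pred):
    """J = tp / (tp + fp + fn) for binary labels (1 = in class, 0 = not in class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0  # assumed convention for the empty case

# Toy labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 1]
print(jaccard_score(y_true, y_pred))
```

Unlike plain accuracy, this score ignores true negatives, which matters when each subclass contains only a small fraction of all enzymes.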
Results
• Average Jaccard scores:
• 2nd level: 0.88 ± 0.06
• 3rd level: 0.84 ± 0.09
[Figure: 2nd level results; Jaccard score and # of sequences per EC subclass]
2nd Level Classification (α < 0.01)
[Figure: Jaccard score per EC subclass]
3rd Level Classification (α < 0.01)
[Figure: Jaccard score per EC sub-subclass]
Summary so far…
• MEX is an unsupervised motif extraction method that finds biologically meaningful motifs
• The meaning of the motifs is established by the classification tasks
• MEX classifies enzymes better than Smith-Waterman and outperforms SVMProt
• There is a correspondence between MEX motifs and biologically significant PROSITE patterns
• Specific motifs of length 6 and longer specify the functionality of enzymes
ADIOS (Automatic DIstillation Of Structure)
• Representation of a corpus (of sentences) as paths over a graph whose vertices are lexical elements (words)
• Motif Extraction (MEX) procedure for establishing new vertices, thus progressively redefining the graph in an unsupervised fashion
• Recursive Generalization
ADIOS: recursive generalization
• For each path
  • Slide a context window of size L
  • For each location i {0 < i <= L}
    • Look for all paths with identical prefix at ip {0 < ip < i} and identical suffix at is {i < is < L}
    • If a ‘generalized’ pattern is found (and significant…)
      • Add E, the equivalence class, to the graph
      • Rewire
• Continue with MEX and ADIOS until no significant pattern is found
Loading sentences onto a graph whose vertices are words
• Is that a cat?
• Is that a dog?
• And is that a horse?
• Where is the dog?
The result is the pattern P = ‘is that a E ?’ with an equivalence class E = {cat, horse, dog}
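The example above can be reproduced with a deliberately simplified sketch: given a fixed prefix and suffix, collect every word that fills the single open slot between them. This omits the sliding window over all slot positions and the significance test that ADIOS applies; the function name `equivalence_class` and the tokenization are assumptions made for illustration.

```python
def equivalence_class(sentences, prefix, suffix):
    """Collect the words that fill a single slot between `prefix` and `suffix`."""
    n_pre, n_suf = len(prefix), len(suffix)
    width = n_pre + 1 + n_suf
    fillers = set()
    for words in sentences:
        for start in range(len(words) - width + 1):
            window = words[start:start + width]
            if window[:n_pre] == prefix and window[n_pre + 1:] == suffix:
                fillers.add(window[n_pre])
    return fillers

# Lower-case the sentences and treat "?" as its own token (an assumed tokenization).
raw = ["Is that a cat?", "Is that a dog?", "And is that a horse?", "Where is the dog?"]
corpus = [s.lower().replace("?", " ?").split() for s in raw]

print(sorted(equivalence_class(corpus, ["is", "that", "a"], ["?"])))
```

The fourth sentence contributes nothing because its words do not match the prefix ‘is that a’, which is why the equivalence class comes only from the first three sentences.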
The Model: The training process
[Figure, shown on two consecutive slides: paths over numbered vertices (120, 132, 234, 321, 567, 621, 987, 1203, 1204, 1205, 2000, 2001) during training]
First pattern formation
• Higher hierarchies: patterns (P) constructed of other Ps, equivalence classes (E) and terminals (T)
• Trees to be read from top to bottom and from left to right
• Final stage: root pattern
• CFG: context-free grammar
Student learns from teacher
• Teacher generates a corpus of sentences
• Student generates syntax out of significant patterns and equivalence classes
• Unseen teacher-generated patterns are checked by student (recall)
• Student-generated patterns are checked by teacher (significance)