420 likes | 552 Views
A Similar Fragments Merging Approach to Learn Automata on Proteins. Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes. Outline of the talk. Protein families signatures Similar Fragment Merging Approach (Protomata-L) Characterization Similar Fragment Pairs (SFPs) Ordering the SFPs
E N D
A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes
Outline of the talk • Protein families signatures • Similar Fragment Merging Approach (Protomata-L) • Characterization • Similar Fragment Pairs (SFPs) • Ordering the SFPs • Generalization • Merging of SFP in an automaton • Gap generalization • Identification of Physico-chemical properties • Experiments
Protein families • Amino acid alphabet : • Protein sequence : • Protein data set : {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V} >AQP1_BOVIN MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHYPIKSNQTTGAVQDNVKVSLAFGLSI… >AQP1_BOVIN MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHYPIKSNQTTGAVQDNVKVSLAFGLSI… >AQP2_RAT MWELRSIAFSRAVLAEFLATLLFVFFGLGSALQWASSPPSVLQIAVAFGLGIGILVQALGH… >AQP3_MOUSE MGRQKELMNRCGEMLHIRYRLLRQALAECLGTLILVMFGCGSVAQVVLSRGTHGGFLT… Common function & Common topology (3D structure)
Characterization of a protein family • ZBT11 ...Csi..CgrtLpklyslriHmlk..H... • ZBT10 ...Cdi..CgklFtrrehvkrHslv..H... • ZBT34 ...Ckf..CgkkYtrkdqleyHirg..H... Zinc Finger Pattern C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H x x x x x x x x x x x x CH x \ / x x Zn x x / \ x CH x x x x
Expressivity classes of patterns PRATTTEIRESIAS PROSITE PROTOMATA-L
Dialign: Similar Fragment Pairs • Significantly similar fragment pairs (SFPs) • Natural selection • Important area characterization Data set D:
Ordering the SFPs • Problem : • Solution : ordering the SFPs by scoring each SFP • S(f1,f2)= ? • 3 different scoring functions : • dialign Sd • support Ss • implication Si
Dialign Score • Sd ( f1 , f2 ) = - log P ( L , Sim ) • L = |f1| = |f2| • Sim = Sum of the individual similarity values • P = Probability that a random SFP of the same L has the same S Blossum62similarity
Support Score • Taking into account the representativeness of SFP • Ss (f1,f2,D) = Number of sequences supporting <f1,f2> f1 f2 f <f1,f2> is supported by f with respectthe triangular inequality :Sd(f,f1) + Sd(f,f2) Sd(f1,f2)
Implication Score • Taking into account a counter-example set N • Discriminative fragments • Lerman index:Si(f1,f2,D,N) = • avec P(X) = -P( Ss(f1,f2,N) ) + P( Ss(f1,f2,D) ) x P(N) P( Ss(f1 ,f2 ,D) ) x |N| |X| |D| + |N|
From protein data sets to automata MASEIKLFW F W A L S M I K E
From protein data sets to automata MASEIKLFW MGYEVKYRV F W A L S M I K E R V G Y Y M V K E
Merging SFPs MASEIKLFW MGYEVKYRV F W A L S M I K E R V G Y Y M V K E
F W A M S L [I,V] K E Y Y R V G M Merging SFPs MASEIKLFW MGYEVKYRV
Merging SFPs MASEIKLFW MGYEVKYRV F W A M S L [I,V] K E Y Y R V G M MASEVKLFM MGYEIKYRV MASEIKYRV MGYEVKLFW MASEVKYRV MGYEIKLFW
Protein Sequence Data Set List of SFPs Ordered List of SFPs MCA MERGING Automaton / Regular Expression
Gap Generalization • Merging on themself non-representative transitions • Treat them as "gaps"
Identification of Physico-chemical properties • Similar Fragments ~ potential function area • Amino acids share out the same position • Physicochemical property at play • => Generalization from a group (of amino acids) to a Taylor group I,V C I,Q,W,P aliphatic no information C I,L,V x [I,V] [I,L,V] C C[I,Q,W,P] X {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}
Likelihood ratio test • To decide if the multi-set A has been generated according to a physico-chemical group G or not by a likelihood ratio test: • Given a threshold , we test the expansion of A to G and reject it when LRG/A <
MIP : the Major Intrinsic Protein Family Family MIP Subfamilies AQP, Glpf, Gla
Set « C» (49 seq) Blast(1<e<100) not MIP Data sets Water-specific Set « W+» (24 seq) Set « W-» (16 seq) Set « M» (44 seq) identity<90% Set « E» (79 seq) Set « T » (159 seq) Set « U » (911 seq) MIP in SWISS-PROT UNIPROT
Experiments • First Common Fragment on a Family • MIP family • Positive set • Comparison with pattern discovery tools • Teiresias • Pratt • Protomata-L (short pattern) • Water-specific Characterization • MIP sub-families • Positive and negative sets • Leave-one-out cross-validation • Protomata-L (short to long pattern)
Set « T » (159 seq) Target set Learning set Learning Set Set « M» (44 seq) First Common Fragment Automaton • Results of 4 patterns scannedon Swiss-Prot protein Database
From short automata to long automata • Previous experiment • only the first SFPs of the ordered list of SFPs • short automaton • first common fragment automaton • Next experiment • larger cut-offs in the list of SFPs • Protomat-L is able to create longer automata with more common subparts • Long patterns are closed of the topoly (3D-structure) of the family
Water-specific characterization • Leave-one-out cross-validation • Learning set • W+ \ Si : Positive learning set • W- \ Sj : Negative learning set • Test set • { Si U Sj } • Control set • Set T • Implication score Set « W+» (24 seq) Set « W-» (16 seq) Set « C» (49 seq)
Error Correcting Cost • The error correcting cost of a sequence S represents the distance (blossum similarity) between S and the closest sequence given by the automaton A. • Distibution of sequences with long automata (size Approx. 100)
Conclusion & Perspective • Good characterization of protein family using automata (-> hmm structure) • No need of a multiple alignment • greedy data-driven algorithm • Important subparts localization • Physico-chemical identification and generalization • Counter example sets • Bringing of knowledge is possible in automata (-> 2D structure)
Questions ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?
Protomata-L ’s Approach • First Common Fragment
Protomata-L ’s Approach • To get a more precise automaton
Data set (Protein sequences) Pairs of fragments EXTRACTION Initial Automaton(MCA) SORT MERGING IDENTIFICATION OF PHYSICOCHEMICAL GROUPS IDENTIFICATION OF « GAPS »
Generalization of an Aquaporins automaton Aromatique Hydrophobe Non Informatif
Physico-chemical properties identification Ratio likelihood test Aliphatic x Small