A Similar Fragments Merging Approach to Learn Automata on Proteins

Goulven KERBELLEC & François COSTE • IRISA / INRIA Rennes

Outline of the talk • Protein families signatures • Similar Fragment Merging Approach (Protomata-L) • Characterization • Similar Fragment Pairs (SFPs) • Ordering the SFPs • Generalization • Merging of SFP in an automaton • Gap generalization • Identification of Physico-chemical properties • Experiments

Protein families • Amino acid alphabet: {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V} • Protein sequence: >AQP1_BOVIN MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHYPIKSNQTTGAVQDNVKVSLAFGLSI… • Protein data set: >AQP1_BOVIN MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHYPIKSNQTTGAVQDNVKVSLAFGLSI… >AQP2_RAT MWELRSIAFSRAVLAEFLATLLFVFFGLGSALQWASSPPSVLQIAVAFGLGIGILVQALGH… >AQP3_MOUSE MGRQKELMNRCGEMLHIRYRLLRQALAECLGTLILVMFGCGSVAQVVLSRGTHGGFLT… • Common function & common topology (3D structure)

Characterization of a protein family • ZBT11 ...Csi..CgrtLpklyslriHmlk..H... • ZBT10 ...Cdi..CgklFtrrehvkrHslv..H... • ZBT34 ...Ckf..CgkkYtrkdqleyHirg..H... • Zinc finger pattern: C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H • [Figure: zinc finger loop with two C and two H residues bound to a Zn ion]
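
The zinc finger pattern on this slide can be read as a regular expression. The following is a minimal sketch of that reading. The helper name `prosite_to_regex` is hypothetical (it is not part of any PROSITE tool), it covers only the syntax elements used here, and it assumes that the dots inside the three aligned fragments are alignment gaps that can be dropped before matching.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression.

    Hypothetical helper for illustration. It handles single residues, x,
    [..] classes, {..} exclusions and (n) or (n,m) repetitions only.
    """
    parts = []
    for element in pattern.split("-"):
        # Split an element such as "x(2,4)" into its core and repetition.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                       # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = re.compile(prosite_to_regex(ZINC_FINGER))

# Aligned fragments copied from the slide; dots are treated as gaps.
fragments = {
    "ZBT11": "Csi..CgrtLpklyslriHmlk..H",
    "ZBT10": "Cdi..CgklFtrrehvkrHslv..H",
    "ZBT34": "Ckf..CgkkYtrkdqleyHirg..H",
}
for name, aligned in fragments.items():
    residues = aligned.replace(".", "").upper()
    print(name, bool(regex.fullmatch(residues)))
```

Under these assumptions each of the three fragments matches the pattern from its first C to its last H.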

Expressivity classes of patterns • PRATT • TEIRESIAS • PROSITE • PROTOMATA-L

Characterization

Dialign: Similar Fragment Pairs • Significantly similar fragment pairs (SFPs) • Natural selection • Important area characterization Data set D:

Ordering the SFPs • Problem : • Solution : ordering the SFPs by scoring each SFP • S(f1,f2)= ? • 3 different scoring functions : • dialign Sd • support Ss • implication Si

Dialign Score • Sd(f1, f2) = -log P(L, Sim) • L = |f1| = |f2| • Sim = sum of the individual similarity values (BLOSUM62 similarity) • P = probability that a random SFP of the same length L has a similarity of at least Sim
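
The score Sd = -log P(L, Sim) can be illustrated with a small exact computation. This is only a sketch under loud assumptions: `toy_similarity` (match = 4, mismatch = -1) is a stand-in for the BLOSUM62 matrix, residues are assumed independent and uniformly distributed, P is taken as the probability of reaching at least the observed similarity, and all function names are hypothetical. It is not Dialign's own implementation.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

def toy_similarity(a, b):
    """Assumed stand-in for BLOSUM62: 4 for a match, -1 otherwise."""
    return 4 if a == b else -1

def fragment_similarity(f1, f2, sim=toy_similarity):
    """Sim: sum of the per-position similarity values of two fragments."""
    if len(f1) != len(f2):
        raise ValueError("fragments of an SFP must have the same length")
    return sum(sim(a, b) for a, b in zip(f1, f2))

def tail_probability(length, threshold, sim=toy_similarity):
    """P(L, Sim): chance that a random pair of length L scores >= threshold."""
    # Distribution of the score at a single position (uniform residues).
    single = defaultdict(float)
    weight = 1.0 / len(AMINO_ACIDS) ** 2
    for a in AMINO_ACIDS:
        for b in AMINO_ACIDS:
            single[sim(a, b)] += weight
    # Convolve the single-position distribution `length` times.
    dist = {0: 1.0}
    for _ in range(length):
        nxt = defaultdict(float)
        for score, p_score in dist.items():
            for value, p_value in single.items():
                nxt[score + value] += p_score * p_value
        dist = nxt
    return sum(p for score, p in dist.items() if score >= threshold)

def dialign_score(f1, f2, sim=toy_similarity):
    """Sd(f1, f2) = -log P(L, Sim) under the assumptions stated above."""
    observed = fragment_similarity(f1, f2, sim)
    return -math.log(tail_probability(len(f1), observed, sim))

print(dialign_score("SEIK", "SEIK"), dialign_score("SEIK", "SEVK"))
```

With this toy scoring, two identical fragments of length L get the score L × log(20), and any mismatch lowers the score because more random pairs reach the observed similarity.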

Support Score • Taking into account the representativeness of an SFP • Ss(f1, f2, D) = number of sequences supporting <f1,f2> • <f1,f2> is supported by a fragment f with respect to the triangular inequality: Sd(f, f1) + Sd(f, f2) ≥ Sd(f1, f2)

Implication Score • Taking into account a counter-example set N • Discriminative fragments • Lerman index: Si(f1, f2, D, N) = ( -P(Ss(f1,f2,N)) + P(Ss(f1,f2,D)) × P(N) ) / ( P(Ss(f1,f2,D)) × |N| ) • with P(X) = |X| / (|D| + |N|)

Generalization

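From protein data sets to automata • MASEIKLFW • [Figure: linear automaton with one transition per residue of MASEIKLFW]

From protein data sets to automata • MASEIKLFW • MGYEVKYRV • [Figure: automaton with one linear branch per sequence]

Merging SFPs • MASEIKLFW • MGYEVKYRV • [Figure: the two-branch automaton before merging]

Merging SFPs • MASEIKLFW • MGYEVKYRV • [Figure: merged automaton in which both branches share the path E, [I,V], K]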

Merging SFPs • MASEIKLFW • MGYEVKYRV • [Figure: merged automaton in which both branches share the path E, [I,V], K] • Newly accepted sequences: MASEVKLFW, MGYEIKYRV, MASEIKYRV, MGYEVKLFW, MASEVKYRV, MGYEIKLFW
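
The effect of merging an SFP on the language of the automaton can be sketched as follows. This is an illustrative reconstruction, not the Protomata-L code: the function names are hypothetical, states are merged with a small union-find structure, and the SFP is assumed to be the fragment pair EIK / EVK (positions 3 to 5 of both sequences), which is a guess based on the shared [I,V] transition.

```python
def build_merged_automaton(sequences, sfps):
    """Build one linear branch per sequence, then merge states along SFPs.

    A state is (sequence index, position). Each SFP is given as
    (seq_a, start_a, seq_b, start_b, length).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # All branches share a single initial state.
    for k in range(1, len(sequences)):
        union((0, 0), (k, 0))
    # Merge the aligned states of every similar fragment pair.
    for seq_a, start_a, seq_b, start_b, length in sfps:
        for offset in range(length + 1):
            union((seq_a, start_a + offset), (seq_b, start_b + offset))

    transitions = {}
    for k, seq in enumerate(sequences):
        for i, residue in enumerate(seq):
            source, target = find((k, i)), find((k, i + 1))
            transitions.setdefault(source, set()).add((residue, target))
    finals = {find((k, len(seq))) for k, seq in enumerate(sequences)}
    return find((0, 0)), finals, transitions

def enumerate_language(start, finals, transitions, max_len):
    """List every accepted word of length <= max_len."""
    words, stack = set(), [(start, "")]
    while stack:
        state, prefix = stack.pop()
        if state in finals:
            words.add(prefix)
        if len(prefix) < max_len:
            for residue, target in transitions.get(state, ()):
                stack.append((target, prefix + residue))
    return words

sequences = ["MASEIKLFW", "MGYEVKYRV"]
sfps = [(0, 3, 1, 3, 3)]   # assumed SFP: EIK (seq 0) with EVK (seq 1)
start, finals, transitions = build_merged_automaton(sequences, sfps)
for word in sorted(enumerate_language(start, finals, transitions, 9)):
    print(word)
```

Under these assumptions the merged automaton accepts eight sequences: the two training sequences plus six recombinations of prefix, middle residue and suffix.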

Protein sequence data set • List of SFPs • Ordered list of SFPs • MCA • Merging • Automaton / regular expression

Gap Generalization • Merge non-representative transitions onto themselves • Treat them as "gaps"

Identification of Physico-chemical properties • Similar fragments ~ potential functional area • Amino acids sharing the same position • Physico-chemical property at play • => Generalization from a group (of amino acids) to a Taylor group • [I,V] -> aliphatic -> [I,L,V] • C -> C • [I,Q,W,P] -> no information -> x = {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}
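
The three examples on this slide can be reproduced with a simple lookup. This sketch replaces the statistical decision used in the talk (a likelihood ratio test) with plain set inclusion, so it only illustrates the direction of the generalization. Only the aliphatic group {I, L, V} comes from the slide; the aromatic and small groups listed in `GROUPS` are assumptions, and the function name is hypothetical.

```python
ALPHABET = frozenset("ARNDCQEGHILKMFPSTWYV")

# Illustrative subset of Taylor groups. Only "aliphatic" is taken from the
# slide; the membership of the two other groups is an assumption.
GROUPS = {
    "aliphatic": frozenset("ILV"),
    "aromatic": frozenset("FWYH"),
    "small": frozenset("ACDGNPSTV"),
}

def generalize(observed):
    """Return (label, residues) for the amino acids seen at one position.

    A single residue is kept as is, a set covered by a known group is
    expanded to the smallest such group, and anything else becomes x.
    """
    observed = frozenset(observed)
    if len(observed) == 1:
        return "".join(observed), observed
    candidates = [
        (len(members), name, members)
        for name, members in GROUPS.items()
        if observed <= members
    ]
    if candidates:
        _, name, members = min(candidates)
        return name, members
    return "x", ALPHABET

for position in ("IV", "C", "IQWP"):
    label, residues = generalize(position)
    print(position, "->", label, "".join(sorted(residues)))
```

The set-inclusion rule is deliberately naive: it ignores how often each residue was observed, which is the information a likelihood ratio test would use.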

Likelihood ratio test • A likelihood ratio test decides whether the multi-set A has been generated according to a physico-chemical group G or not • Given a threshold, we test the expansion of A to G and reject it when the likelihood ratio LR(G/A) falls below that threshold

Experiments

MIP: the Major Intrinsic Protein family • Family: MIP • Subfamilies: AQP, Glpf, Gla

Data sets • Set « U » (911 seq): MIP in UNIPROT • Set « T » (159 seq): MIP in SWISS-PROT • Set « E » (79 seq) • Set « M » (44 seq): identity < 90% • Water-specific: Set « W+ » (24 seq), Set « W- » (16 seq) • Set « C » (49 seq): Blast (1 < e < 100), not MIP

Experiments • First Common Fragment on a Family • MIP family • Positive set • Comparison with pattern discovery tools • Teiresias • Pratt • Protomata-L (short pattern) • Water-specific Characterization • MIP sub-families • Positive and negative sets • Leave-one-out cross-validation • Protomata-L (short to long pattern)

First Common Fragment Automaton • Learning set: Set « M » (44 seq) • Target set: Set « T » (159 seq) • Results of 4 patterns scanned on the Swiss-Prot protein database

From short automata to long automata • Previous experiment • only the first SFPs of the ordered list of SFPs • short automaton • first common fragment automaton • Next experiment • larger cut-offs in the list of SFPs • Protomata-L is able to create longer automata with more common subparts • Long patterns are close to the topology (3D structure) of the family

Water-specific characterization • Leave-one-out cross-validation • Learning set • W+ \ Si : Positive learning set • W- \ Sj : Negative learning set • Test set • { Si U Sj } • Control set • Set T • Implication score • Sets shown: Set « W+ » (24 seq), Set « W- » (16 seq), Set « C » (49 seq)
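
The leave-one-out protocol on this slide can be sketched as a generator of learning and test sets. The slide does not say how the left-out positive Si and the left-out negative Sj are paired, so this sketch assumes every (Si, Sj) combination is used; the function name is hypothetical.

```python
from itertools import product

def leave_one_out_splits(positives, negatives):
    """Yield (positive learning set, negative learning set, test pair).

    Assumption: every combination of one left-out positive sequence Si and
    one left-out negative sequence Sj forms a fold.
    """
    for i, j in product(range(len(positives)), range(len(negatives))):
        positive_learning = positives[:i] + positives[i + 1:]   # W+ \ Si
        negative_learning = negatives[:j] + negatives[j + 1:]   # W- \ Sj
        yield positive_learning, negative_learning, (positives[i], negatives[j])

# Placeholder identifiers standing in for real sequences.
w_plus = ["p1", "p2", "p3"]
w_minus = ["n1", "n2"]
for pos, neg, test in leave_one_out_splits(w_plus, w_minus):
    print(pos, neg, test)
```

With the set sizes given in the talk (24 and 16 sequences) this pairing would produce 384 folds; pairing by index or leaving out one sequence at a time are equally plausible readings.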

Leave-one-out cross-validation

Error Correcting Cost • The error correcting cost of a sequence S represents the distance (BLOSUM similarity) between S and the closest sequence accepted by the automaton A. • Distribution of sequences with long automata (size approx. 100)
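
The idea of an error correcting cost can be sketched with a small dynamic program over the automaton. This is a simplification under stated assumptions: only substitutions are allowed (no insertions or deletions), each substitution costs 1 instead of a BLOSUM-derived penalty, and the automaton encoding (`transitions` mapping a state to `(allowed residues, next state)` pairs) is hypothetical.

```python
import math
from functools import lru_cache

def error_correcting_cost(sequence, start, finals, transitions):
    """Cheapest number of substitutions turning `sequence` into an accepted word.

    Returns math.inf when no accepted word has the same length.
    """
    @lru_cache(maxsize=None)
    def best(state, pos):
        if pos == len(sequence):
            return 0 if state in finals else math.inf
        costs = [
            (0 if sequence[pos] in allowed else 1) + best(target, pos + 1)
            for allowed, target in transitions.get(state, ())
        ]
        return min(costs, default=math.inf)

    return best(start, 0)

# Toy automaton accepting E, then I or V, then K.
toy_transitions = {
    0: (("E", 1),),
    1: (("IV", 2),),
    2: (("K", 3),),
}
toy_finals = frozenset({3})
for candidate in ("EIK", "ELK", "ALR", "EI"):
    print(candidate, error_correcting_cost(candidate, 0, toy_finals, toy_transitions))
```

Replacing the unit cost with a penalty derived from a substitution matrix would bring the sketch closer to the distance described on the slide.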

Leave-one-out cross-validation with Error Correcting Cost

Leave-one-out cross-validation

Conclusion & Perspective • Good characterization of protein families using automata (-> HMM structure) • No need for a multiple alignment • Greedy data-driven algorithm • Localization of important subparts • Physico-chemical identification and generalization • Counter-example sets • Knowledge can be brought into automata (-> 2D structure)

Questions?

Demo

Protomata-L's Approach • First Common Fragment

Protomata-L's Approach • To get a more precise automaton

Data set (protein sequences) • Extraction of pairs of fragments • Initial automaton (MCA) • Sort • Merging • Identification of physico-chemical groups • Identification of « gaps »

Structural discrimination

Generalization of an aquaporin automaton • Aromatic • Hydrophobic • Non-informative

Physico-chemical properties identification • Likelihood ratio test • Aliphatic • x • Small
