Ivan Dimitrov. School of Pharmacy Medical University of Sofia. Application of machine learning techniques for allergenicity prediction. 2nd Regional Conference “Supercomputing Applications in Science and Industry” Rodopi Hotel, Sunny Beach, Bulgaria, September 20-21, 2011.
Allergen processing pathways C. M. Hawrylowicz & A. O'Garra, Nature Reviews Immunology 2005, 271-283
FAO and WHO Codex Alimentarius guidelines for evaluating the potential allergenicity of novel proteins
A query protein is potentially allergenic if it:
• has an identity of 6 to 8 contiguous amino acids, or
• has > 35% sequence similarity over a window of 80 amino acids
when compared with known allergens.
Codex Principles and Guidelines on Foods Derived from Biotechnology. 2003 Rome, Italy: Codex Alimentarius Commission, Joint FAO/WHO Food Standards Programme, Food and Agriculture Organization.
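The first criterion above (a shared run of identical contiguous amino acids) can be sketched in a few lines. This is a minimal illustration, not the procedure used by any of the servers discussed here: the function names and the toy sequences are hypothetical, and the second criterion (> 35% similarity over an 80-residue window) is deliberately left out because it requires a proper alignment tool.

```python
def shares_contiguous_identity(query: str, allergen: str, k: int = 6) -> bool:
    """Return True if the two sequences share at least one identical k-mer."""
    if len(query) < k or len(allergen) < k:
        return False
    # All k-residue windows of the known allergen.
    allergen_kmers = {allergen[i:i + k] for i in range(len(allergen) - k + 1)}
    # Slide a k-residue window over the query and look for an exact hit.
    return any(query[i:i + k] in allergen_kmers
               for i in range(len(query) - k + 1))


def flag_by_contiguous_rule(query: str, known_allergens: dict, k: int = 6) -> list:
    """Names of known allergens that share an identical k-mer with the query."""
    return [name for name, seq in known_allergens.items()
            if shares_contiguous_identity(query, seq, k)]


# Toy sequences, invented for illustration only (not real allergens).
known = {
    "toy_allergen_A": "MKTAYIAKQRQISFVK",
    "toy_allergen_B": "GGGGSSSSGGGG",
}
query = "AAAYIAKQRAAA"
hits_k6 = flag_by_contiguous_rule(query, known, k=6)
hits_k8 = flag_by_contiguous_rule(query, known, k=8)
```

With k = 6 the rule is at its most permissive and with k = 8 at its strictest, which is one way to see why alignment-style rules of this kind tend to trade precision for sensitivity.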
Bioinformatics approaches to allergen prediction
1. Sequence-alignment search of query protein
• Extensive databases of known allergen proteins and the FAO/WHO guidelines
- Structural Database of Allergenic Proteins
- Allermatch
Characteristics:
- High sensitivity (true positives/(true positives + false negatives))
- Produce many false positives and low precision (true positives/(true positives + false positives))
- Discovery of novel antigens is restricted by their lack of similarity to known allergens.
Ivanciuc et al. Nucleic Acids Res. 2003, 31, 359–362
Fiers et al. BMC Bioinformatics 2004, 5, 133
Bioinformatics approaches to allergen prediction
2. Identification of conserved allergenicity-related linear motifs
• Comparing allergens to non-allergens by MEME motif discovery tool
- Clustering of known allergens, wavelet analysis and hidden Markov model
- Automated Selection of Allergen-Representative Peptides (DASARP).
• Motif search by Support Vector Machines (SVM), MEME/MAST, IgE epitopes and Allergen-Representative Peptides (ARP)
- Iterative pairwise sequence similarity encoding scheme with SVM as the discriminating engine
Both approaches are based on the assumption that allergenicity is a linearly coded property.
Stadler and Stadler FASEB J. 2003, 17, 1141-1143
Saha and Raghava Nucleic Acids Research, 2006, 34, 202-209
Li et al. Bioinformatics 2004, 20, 2572-2578
Muh et al. PLoS ONE, 2009, 4 (6), art. no. e5861
Björklund et al. Bioinformatics 2005, 21, 39–50
Aim of the study
To create an alignment-free method for in silico identification of allergens based on the main chemical properties of amino acid sequences, and to implement it on a web server.
Obstacles:
• The choice of appropriate descriptors to represent the physicochemical properties of amino acid sequences.
• Allergens are proteins of different lengths.
The z-scales
Each amino acid in a sequence (…Phe – Arg – Trp…) is represented by three z descriptors: z1 (hydrophobicity), z2 (molecular size) and z3 (polarity).
Residue   z1      z2      z3
Phe      -4.22    1.94    1.08
Arg       3.62    2.60   -3.60
Trp      -4.36    3.94    0.69
Hellberg et al. J. Med. Chem. 1987, 30, 1126-1135
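ACC transformation
Auto-covariance:
\[ ACC_{jj}(l) = \sum_{i=1}^{n-l} \frac{z_{j,i}\, z_{j,i+l}}{n-l} \]
Cross-covariance:
\[ ACC_{jk}(l) = \sum_{i=1}^{n-l} \frac{z_{j,i}\, z_{k,i+l}}{n-l}, \quad j \neq k \]
j, k are the z-scales (j, k = 1, 2, 3); i is the amino acid position; n is the number of amino acids in the sequence; l is the lag.
Example for the protein Phe – Arg – Trp – Phe – Arg – Trp (n = 6) at lag 1:
ACC11(1): the sum of the five products z1 (position i) x z1 (position i + 1), divided by 5
ACC13(1): the sum of the five products z1 (position i) x z3 (position i + 1), divided by 5
Wold et al. Anal. Chim. Acta 1993, 277, 239-253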
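A minimal sketch of the auto- and cross-covariance transformation follows, assuming the usual form in which each term is the sum of products z(j, i) x z(k, i + lag) over all valid positions, divided by (n - lag), as in the "/5" of the six-residue example above. The dictionary `Z` holds only the three residues whose values are printed on the z-scales slide; a real application would need all 20 amino acids. The function name `acc_transform`, the default of five lags, and the ordering of the output vector are assumptions made for illustration.

```python
# z1, z2, z3 values as printed on the z-scales slide (three residues only).
Z = {
    "F": (-4.22, 1.94, 1.08),   # Phe
    "R": (3.62, 2.60, -3.60),   # Arg
    "W": (-4.36, 3.94, 0.69),   # Trp
}


def acc_transform(seq, z=Z, max_lag=5):
    """Convert a sequence of any length into a fixed-length ACC vector.

    For each lag 1..max_lag and each ordered pair (j, k) of z-scales, the
    term is sum_i z[j][i] * z[k][i + lag] / (n - lag). With three z-scales
    this gives 3 * 3 * max_lag values (45 for max_lag = 5).
    """
    n = len(seq)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    rows = [z[aa] for aa in seq]          # one (z1, z2, z3) tuple per residue
    features = []
    for lag in range(1, max_lag + 1):
        for j in range(3):                # z-scale taken at position i
            for k in range(3):            # z-scale taken at position i + lag
                total = sum(rows[i][j] * rows[i + lag][k]
                            for i in range(n - lag))
                features.append(total / (n - lag))
    return features


short_vec = acc_transform("FRWFRW")       # the six-residue example
long_vec = acc_transform("FRWFRWFRW")     # a longer sequence, same vector length
```

Both calls return 45 values, which is the point of the transformation: proteins of different lengths become vectors of the same size.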
Preliminary study
Data:
• 595 food allergens from the CSL allergen database (http://allergen.csl.gov.uk)
• 595 non-allergens from the NCBI database (http://www.ncbi.nlm.nih.gov/)
Training set: 475 food allergens, 475 non-allergens
Test set: 120 food allergens, 120 non-allergens
ACC transformation of z descriptors: matrix with 45 variables (3² x 5) and 950 observations
Statistical methods, machine learning:
• PLS - discriminant analysis
• Logistic regression
• Naïve Bayes algorithm
• Decision tree algorithm
• k Nearest Neighbours
External validation: Sensitivity, Specificity, Accuracy
Results from preliminary study TP – true positive, FP – false positive TN – true negative, FN – false negative
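For reference, the measures used throughout this presentation can be computed from the four counts as sketched below. The function name and the example counts are hypothetical; the counts are made up to illustrate the arithmetic and are not results from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, precision and accuracy from confusion counts."""
    def ratio(num, den):
        # Avoid a ZeroDivisionError when a class is absent.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),   # true positives / all real allergens
        "specificity": ratio(tn, tn + fp),   # true negatives / all real non-allergens
        "precision": ratio(tp, tp + fp),     # true positives / all predicted allergens
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Illustrative counts only (not taken from the slides).
metrics = classification_metrics(tp=100, fp=30, tn=90, fn=20)
```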
Web servers on the test set
Test set: 120 food allergens, 120 non-allergens
Servers:
• AlgPred
- SVM with single amino acid composition
- SVM with dipeptide composition
• Evaller
• APPEL
• Allerhunter
Measures: Sensitivity, Specificity, Accuracy
Saha and Raghava Nucleic Acids Research, 2006, 34, 202-209
Barrio et al. Nucleic Acids Research 2007, 35, 694-700
http://jing.cz3.nus.edu.sg/cgi-bin/APPEL
Muh et al. PLoS ONE, 2009, 4 (6), art. no. e5861
Conclusions from the preliminary study
1. The model developed by the k Nearest Neighbours method shows the best performance on the test set compared with the other methods. It has a good balance between specificity and sensitivity, and the highest accuracy. kNN was used further in the study.
2. Allerhunter is the best-performing of the known servers for allergen prediction. kNN needs further improvement.
3. A large imbalance exists between sensitivity and specificity for almost all servers. This indicates that the dataset needs improvement too.
The kNN algorithm
Training set (475 allergens, 475 non-allergens): ACC transformation of z descriptors gives a matrix of 45 variables (3² x 5) and 950 observations
Unknown protein: ACC transformation of z descriptors gives a vector with 45 variables (3² x 5)
1. Calculate the Euclidean distance between the vector and each observation
2. Sort the distances in ascending order
3. Determine the k nearest neighbours
4. Determine the class of the unknown protein according to the majority of the nearest neighbours
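The kNN classification steps (Euclidean distance to every training observation, ascending sort, selection of the k nearest, majority vote) can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the code behind the web server: the function name `knn_predict` is hypothetical, and the toy data use two-dimensional vectors where the real method uses 45-variable ACC vectors. The default k = 3 follows the value quoted for the kNN models in this presentation.

```python
import math
from collections import Counter


def knn_predict(query, train_vectors, train_labels, k=3):
    """Classify `query` by majority vote among its k nearest training vectors."""
    # Euclidean distance from the query to every training observation,
    # sorted in ascending order.
    distances = sorted(
        (math.dist(query, vec), label)
        for vec, label in zip(train_vectors, train_labels)
    )
    # Keep the k nearest neighbours and count their class labels.
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D data standing in for 45-variable ACC vectors (illustration only).
train_vectors = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_labels = ["allergen"] * 3 + ["non-allergen"] * 3

pred_a = knn_predict((0.2, 0.1), train_vectors, train_labels, k=3)
pred_b = knn_predict((5.5, 5.2), train_vectors, train_labels, k=3)
```

An odd k avoids tied votes in a two-class problem such as this one.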
Next: Extend the data sets
Allergens:
• CSL allergen database (http://allergen.csl.gov.uk)
• FARRP allergen database (http://www.allergenonline.org/)
• SDAP database (http://fermi.utmb.edu/SDAP/)
• ADFS database (http://allergen.nihs.go.jp/ADFS/index.jsp)
684 food, 1157 inhalant, 553 toxins, venom or salivary allergens
Non-allergens:
• Allergen species
• NCBI database (http://www.ncbi.nlm.nih.gov/)
• Create local database: proteins from allergen species
• BLAST search against all allergens
684 non-allergens from food origin
1157 non-allergens from inhalant origin
553 non-allergens from species with toxins, venom or salivary allergens
Next: kNN optimization
Data: 684 food allergens, 684 non-allergens
Training set: 528 allergens, 528 non-allergens
Test set: 156 allergens, 156 non-allergens
Machine learning: k nearest neighbours
External validation: Sensitivity, Specificity, Accuracy
kNN models
Food model: 684 food allergens, 684 non-allergens
- Training set: 528 allergens, 528 non-allergens
- Test set: 156 allergens, 156 non-allergens
- kNN, k = 3
Inhalant model: 1157 inhalant allergens, 1157 non-allergens
- Training set: 933 allergens, 933 non-allergens
- Test set: 224 allergens, 224 non-allergens
- kNN, k = 3
External validation: Sensitivity, Specificity, Accuracy
AllerTOP web tool for allergenicity prediction
Training set: 1952 food, inhalant and other allergens and 1952 non-allergens
ACC transformation of z descriptors
kNN model, external validation
AllerTOP: http://www.pharmfac.net/allertop
Server performance on the united test set
United test set of 441 food and inhalant allergens and 441 non-allergens.
Two of the servers from the preliminary study, APPEL and Evaller, were not available during the recent study.
The results for the Allerhunter server were obtained with a smaller test set because it cannot process short sequences (< 21 amino acids).
Conclusions
1. An alignment-free method for in silico prediction of allergens based on the main physicochemical properties of proteins was developed.
2. The method uses z descriptors to represent the amino acids in the protein sequences and the ACC transformation to convert proteins into uniform vectors.
3. The k Nearest Neighbours method showed the best performance among the classification algorithms tested in the study; the others were PLS - discriminant analysis, logistic regression, Naïve Bayes and the decision tree algorithm.
4. The kNN algorithm was optimized and its performance was compared with that of the freely available web servers for prediction of allergens.
5. The kNN algorithm was implemented on a web server, freely available at: http://www.pharmfac.net/allertop
Drug Design Group, School of Pharmacy, Medical University of Sofia
Irini Doytchinova, Ivan Dimitrov, Mariyana Atanasova, Panaiot Garnev
Acknowledgements
Darren R. Flower, Aston University, Birmingham, UK
Funding: National Research Fund, Ministry of Education and Science, Bulgaria, Grant 02-1/2009