Associating Biomedical Terms: Case Study for Acetylation
Aaron Buechlein, Indiana University School of Informatics
Advisor: Dr. Predrag Radivojac
Overview
• Background
• Previous Work
• Methods
• Results
Central Dogma
[Figure: http://www.accessexcellence.org/RC/VL/GG/images/central.gif]
Post-Translational Modifications (PTMs)
Acetylation
• Acetylation involves the substitution of an acetyl group (-COCH3) for hydrogen
• Typically occurs on N-terminal tails and lysine residues (Lys or K)
Previous Predictors
• Several PTM predictors were created prior to this work
• Acetylation predictors also predate this work
  • NetAcet predicts N-terminal sites only
  • AutoMotif Server predicts various PTMs and includes an acetylation component
  • PAIL is a lysine acetylation predictor
Methods
• Create Dataset
  • Download articles relevant to acetylation and extract sites
  • Rank articles in order to elucidate sites quickly
  • Swiss-Prot and Human Protein Reference Database (HPRD)
• Create Predictors
  • Leave-one-protein-out validation
  • MATLAB
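Leave-one-protein-out validation holds out all candidate sites from one protein at a time, so sites from the same protein never appear in both the training and test sets. A minimal Python sketch of the split, assuming each example is tagged with a protein identifier (the function name and toy IDs are illustrative and are not taken from the original MATLAB code):

```python
from collections import defaultdict


def leave_one_protein_out(protein_ids):
    """Yield (held_out_protein, train_idx, test_idx), one fold per protein.

    protein_ids[i] is the protein that example i (one lysine) belongs to.
    """
    by_protein = defaultdict(list)
    for idx, pid in enumerate(protein_ids):
        by_protein[pid].append(idx)

    for pid, test_idx in by_protein.items():
        held_out = set(test_idx)
        train_idx = [i for i in range(len(protein_ids)) if i not in held_out]
        yield pid, train_idx, test_idx


# Toy example: five lysines drawn from three hypothetical proteins.
ids = ["P1", "P1", "P2", "P3", "P3"]
folds = list(leave_one_protein_out(ids))
```

Each fold trains a predictor on `train_idx` and evaluates it on `test_idx`; the per-fold predictions are then pooled to compute the performance measures.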
Article Retrieval
• Searched individual journal sites for articles relevant to acetylation
• Saved the resulting HTML pages for each journal
• These pages were then used as input for a web crawler that downloaded the articles
• Because journal sites are constructed differently, each journal required its own regular expression to extract article links
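The per-journal link extraction might look roughly like the following Python sketch. The journal keys and URL patterns are hypothetical placeholders; the actual expressions depend on each site's HTML and are not given in the slides.

```python
import re

# Hypothetical per-journal patterns; real ones depend on each site's HTML.
LINK_PATTERNS = {
    "journal_a": re.compile(r'href="(/content/[^"]+\.full)"'),
    "journal_b": re.compile(r'href="(/articles/[^"#?]+)"'),
}


def extract_article_links(journal, html):
    """Return unique article links from a saved search-results page, in order."""
    seen = []
    for link in LINK_PATTERNS[journal].findall(html):
        if link not in seen:
            seen.append(link)
    return seen


# Made-up snippet of a saved results page.
sample = (
    '<a href="/content/281/1/10.full">A</a> '
    '<a href="/about">x</a> '
    '<a href="/content/281/1/10.full">A</a>'
)
links = extract_article_links("journal_a", sample)
```

Keeping one pattern per journal in a lookup table lets the crawler stay generic while the site-specific details live in one place.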
Rank Articles
• First locate occurrences of the first phrase, “phrase 1”: A = {a1, a2, …, a|A|}
• Next locate occurrences of the second phrase, “phrase 2”: R = {r1, r2, …, r|R|}
• c and d are constants
• x is the distance in characters between r and the nearest occurrence a
An example: acetylation
1. Word “acetylat”: A = {a1, a2, …, am}
2. Regular expression (k|lys|lysine)(space)*(digit)+: R = {r1, r2, …, rn}
An example: acetylation
Score for article S: [equation not recovered]
• Papers with S > 100 are rich in sites
• Papers with S < 30 fall in the “twilight” zone
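The two matching steps and the distance x can be sketched in Python as follows. This is an illustration under assumptions: distances are measured between match start offsets, the site pattern is one reading of the regular expression on the slide, and the weighting function is supplied by the caller because the scoring formula itself is not reproduced here.

```python
import re

TERM = re.compile(r"acetylat", re.IGNORECASE)
# k, lys, or lysine, optional whitespace, then one or more digits.
SITE = re.compile(r"\b(?:lysine|lys|k)\s*\d+", re.IGNORECASE)


def nearest_distances(text):
    """For each site mention r, return x: characters to the nearest term a."""
    term_pos = [m.start() for m in TERM.finditer(text)]
    if not term_pos:
        return []
    return [
        min(abs(m.start() - a) for a in term_pos)
        for m in SITE.finditer(text)
    ]


def score_article(text, weight):
    """Sum a caller-supplied weight(x) over all site mentions (assumed form)."""
    return sum(weight(x) for x in nearest_distances(text))


# Made-up sentence pair for illustration.
text = "Acetylation of K382 was observed. Unrelated text mentions Lys 120."
distances = nearest_distances(text)
```

A weight that decreases with x rewards site mentions that sit close to the word “acetylat”, which is the intuition behind ranking articles this way.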
Elucidate Sites
• Sites were manually extracted from articles, beginning with the highest-ranked
• The original experimental paper for each site was checked for traceable evidence
• Sites were extracted from Swiss-Prot
• Sites were extracted from HPRD
Predictors
• Support Vector Machine
• Artificial Neural Network
• Decision Tree
Predictor Input
• Positives taken as all lysines found to be acetylated
• Negatives taken as all lysines not found to be acetylated
• Features created based on characteristics surrounding lysines
  • Amino acid content, hydrophobicity, charge, disorder, etc.
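As an illustration of window-based features, the Python sketch below extracts the residues around each lysine and computes amino acid content. The window half-width, the truncation at sequence ends, and the example sequence are assumptions; the slides do not specify them, and the other feature types (hydrophobicity, charge, disorder) are not shown.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def lysine_windows(sequence, half_width=5):
    """Yield (1-based position, window) for every lysine in the sequence.

    Windows are truncated at the sequence ends rather than padded.
    """
    for i, residue in enumerate(sequence):
        if residue == "K":
            start = max(0, i - half_width)
            yield i + 1, sequence[start:i + half_width + 1]


def composition(window):
    """Relative frequency of each of the 20 standard amino acids."""
    counts = Counter(window)
    return [counts[aa] / len(window) for aa in AMINO_ACIDS]


seq = "MAKTLLKQAGK"  # made-up sequence for illustration
windows = dict(lysine_windows(seq, half_width=3))
features = {pos: composition(win) for pos, win in windows.items()}
```

Each lysine then contributes one feature vector, labeled positive or negative according to whether that position is a known acetylation site.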
Article and Ranking Results
• 4888 articles from 10 sites were searched
  • Nature provided 2147 articles
  • Science Direct provided 1519 articles
• The highest-ranking article came from the Journal of Biological Chemistry
  • Score of 151.87
  • Contained 10 acetylation sites
• When histones are excluded, the highest-ranking article came from Nature
  • Previously ranked #5
  • Score of 116.36
  • Contained 9 unique acetylation sites
Top 25
Ranking Results
• Articles with scores greater than 30 had potential for providing at least one site
• As scores approached 30, articles became less fruitful
Dataset Results
• Dataset included 1442 total sites and 1085 non-redundant sites
  • HPRD contributed 90 total sites
  • Swiss-Prot contributed 825
  • Our study contributed 527
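Sensitivity, Specificity, and Precision
• Sensitivity (sn) = TP / (TP + FN)
• Specificity (sp) = TN / (TN + FP)
• Precision (pr) = TP / (TP + FP)
where TP, FP, TN, and FN are the numbers of true positives, false positives, true negatives, and false negatives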
Accuracy and AUC
• Accuracy (acc)
• Area Under Curve (AUC)
  • Refers to the area under the Receiver Operating Characteristic (ROC) curve
  • The ROC curve plots sensitivity against 1 − specificity
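A minimal Python sketch of sensitivity, specificity, precision, and AUC using the standard textbook definitions. The function names and toy numbers are illustrative and are not results from this study; AUC is computed through its rank-sum interpretation rather than by tracing the curve.

```python
def confusion_rates(tp, fp, tn, fn):
    """Return (sensitivity, specificity, precision) from confusion counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    pr = tp / (tp + fp)
    return sn, sp, pr


def auc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    Equals the probability that a random positive outscores a random
    negative, counting ties as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy counts and scores, not data from the study.
sn, sp, pr = confusion_rates(tp=8, fp=2, tn=18, fn=2)
area = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

The pairwise AUC is quadratic in the number of examples, which is acceptable for a sketch; a sort-based version would be preferable for large datasets.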
SVM Predictor
Artificial Neural Network
Decision Tree
Algorithm Comparison
I would like to acknowledge those who helped me throughout this project: Dr. Predrag Radivojac, Dr. Haixu Tang, and Wyatt Clark.