Exploiting Likely-Positive and Unlabeled Data to Improve the Identification of Protein-Protein Interaction Articles Richard Tzong-Han Tsai & Wen-Lian Hsu et al. Presenter: Wen-Lian Hsu Intelligent Agent Systems Lab, IIS, Academia Sinica Taiwan Aug. 29, 2007
Outline
• Background
• Traditional Method
  • Formulation
  • Traditional weighting functions
• Our Proposed Method
  • New term weighting functions
  • Selecting likely data
  • Exploiting likely-positive and negative data
• Results
Relevant vs. Irrelevant Articles
• Relevant: "Physical interactions among circadian clock proteins KaiA, KaiB and KaiC in cyanobacteria."
• Irrelevant: "Differential protein expression in human gliomas and molecular insights."
Curation of Protein-Protein Interaction Databases (PPI-DB)
• Pipeline: Unstructured Texts → Filtering and Ranking → Human Verification → Information Extraction → PPI Database
Example of a PPI Record UniProt Protein ID
Difficulties of PPI-Text Classification
• Annotation cost is very high
  • Annotators must be experienced biological researchers
• Unbalanced numbers of documents between the relevant and irrelevant classes
• Varying definitions of "PPI-relevance"
  • The PPI taxonomy is defined in the GO ontology
  • MINT: physical interactions only
  • BIND: both physical and genetic interactions
Data Source
• Second BioCreAtIvE Challenge Workshop: Critical Assessment of Information Extraction in Molecular Biology
Formulation
• PPI abstract identification is formulated as a text classification problem (PPI-TC)
• Class 1: PPI-relevant (+), extracted from PPI-DBs
• Class 2: PPI-irrelevant (−), annotated by experts
Schema: Relevant vs. Irrelevant to PPI
• Example abstract: "An APAF-1 cytochrome c multimeric complex is a functional apoptosome that activates procaspase-9"
• Words are converted to a feature vector by weighting functions
• The vector is classified by an SVM
About BM25
• BM25 is one of the best-known ranking functions used by search engines
• It ranks documents by their relevance to a given search query
• It is also commonly used as a weighting function in text classification
[1] S. Robertson, "Understanding Inverse Document Frequency: On Theoretical Arguments for IDF," Journal of Documentation, 60, 503–520, 2004
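As a concrete illustration, the standard Okapi BM25 term weight can be sketched as below. This is the conventional textbook formulation with default parameters k1 = 1.2 and b = 0.75, not necessarily the exact variant the slides use:

```python
import math

def bm25_weight(tf, doc_len, avg_doc_len, n_docs, df, k1=1.2, b=0.75):
    """Okapi BM25 weight of a term in a document (standard formulation).

    tf: term frequency in the document
    doc_len, avg_doc_len: document length and corpus average (normalization)
    n_docs: total number of documents; df: number of documents containing the term
    """
    # Smoothed IDF: rare terms score higher than common ones
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # Saturating TF component: weight grows sublinearly with tf
    tf_component = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_component
```

The saturation in the TF component means a word occurring four times does not count four times as much as a word occurring once, which is what makes BM25 more robust than raw term frequency for both ranking and classification.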
BM25 (1)
• A weighting function for estimating each word's discrimination ability
• Built on a saturating function of term frequency, abbreviated TFd(wi)
• Variants: Relative Frequency and Balanced Relative Frequency
BM25 (2)
• A weighting function for estimating each word's discrimination ability
• Baseball analogy: (wi, player), (article, game), (positive, winning), (negative, losing)
• For articles containing wi: the positive/negative proportions; for articles not containing wi: the negative/positive proportions
• The denominators cancel out, giving: (winning % when the player is in) × (losing % when he is out)
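The baseball analogy can be made concrete. A word's discrimination score is the share of articles containing it that are positive, multiplied by the share of articles lacking it that are negative. The function below is an illustrative sketch of that verbal description, not the paper's exact formula:

```python
def discrimination(pos_with, neg_with, pos_without, neg_without):
    """Discrimination ability of a word, following the slide's analogy:
    (winning % when the player is in) x (losing % when he is out).

    pos_with, neg_with: positive/negative articles containing the word
    pos_without, neg_without: positive/negative articles lacking the word
    """
    in_total = pos_with + neg_with
    out_total = pos_without + neg_without
    if in_total == 0 or out_total == 0:
        return 0.0
    return (pos_with / in_total) * (neg_without / out_total)
```

A word appearing mostly in relevant articles and rarely in irrelevant ones scores near 1; an uninformative word that appears everywhere scores near 0.25.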
Basic Idea
• Develop better weighting functions
• Expand the training set: select likely data
  • Select PPI-relevant articles recorded in other PPI-DBs
  • Select articles not recorded in any PPI-DB
• Exploit the likely data
Proposed Variants of BM25 (Robertson, 2004)
• Relative Frequency
• Balanced Relative Frequency
Likely (Positive, Negative) Data
• Advantages
  • Improve the generality and robustness of the classification model
  • Reduce the number of unseen features
• Sources for PPI article classification
  • Likely positive: PPI articles recorded in other PPI-DBs
  • Likely negative: PubMed articles not recorded in any PPI-DB
Select Likely Data
• A filter model is generated from the labeled positive and negative articles
• The filter model filters unlabeled data into likely-positive and likely-negative sets
• BIND: contains both genetic and physical PPIs
• MINT: contains only physical PPIs
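The selection of candidate likely data by database membership can be sketched with simple set operations. The function name and arguments are hypothetical; in the actual system a trained filter model further screens these candidates:

```python
def select_likely_data(train_pos_ids, other_db_ids, pubmed_ids, all_db_ids):
    """Select candidate likely-positive / likely-negative article IDs.

    likely positive: articles recorded in other PPI databases (e.g. MINT, BIND)
                     that are not already among the training positives
    likely negative: PubMed articles not recorded in any PPI database
    """
    likely_pos = other_db_ids - train_pos_ids
    likely_neg = pubmed_ids - all_db_ids
    return likely_pos, likely_neg
```

Because database membership is only a proxy for relevance, these sets are "likely" rather than certain labels, which is why the paper treats them differently from the expert-annotated training data.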
Mixed Model
• Likely data are added to the training set as additional training data
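In the mixed model, the likely data simply augment the training set before a single classifier is trained. A minimal sketch of that assembly step (the helper name is hypothetical):

```python
def build_mixed_training_set(pos, neg, likely_pos, likely_neg):
    """Mixed model: likely-positive and likely-negative articles are
    appended to the training set as ordinary labeled examples."""
    X = pos + likely_pos + neg + likely_neg
    y = [1] * (len(pos) + len(likely_pos)) + [-1] * (len(neg) + len(likely_neg))
    return X, y
```

The combined (X, y) is then fed to the SVM exactly as if all examples were hand-annotated; the model never distinguishes likely from certain labels.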
Hierarchical Model
• The filter model's output value is used as an additional feature for the final classifier
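The hierarchical model instead keeps the filter model as a first stage whose output score becomes one extra feature of the final classifier. A hypothetical sketch (the two models are passed in as plain callables):

```python
def hierarchical_predict(article, filter_model, final_model, featurize):
    """Hierarchical model (illustrative): a filter model trained with likely
    data scores the article first; that score is appended as an additional
    feature for the final classifier."""
    feats = featurize(article)
    score = filter_model(feats)            # first-stage output value
    return final_model(feats + [score])    # second stage sees it as a feature
```

This design choice lets the final classifier learn how much to trust the likely-data model, rather than treating its examples as ground truth as the mixed model does.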
Evaluation Metrics for Ranking
• Receiver Operating Characteristic (ROC) curve: plots sensitivity (TPR) against 1 − specificity (FPR)
• AUC: the area under the ROC curve
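AUC has an equivalent rank interpretation: the probability that a randomly chosen relevant article is scored above a randomly chosen irrelevant one. A small self-contained sketch using that Mann-Whitney formulation (fine for evaluation-sized lists; a real implementation would sort once instead of comparing all pairs):

```python
def roc_auc(scores, labels):
    """AUC via the rank (Mann-Whitney) formulation: the probability that a
    randomly chosen positive is ranked above a randomly chosen negative.
    labels: 1 for PPI-relevant, 0 for irrelevant. Ties count as half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranker scores 1.0, a random one 0.5, which is why AUC suits the unbalanced class distribution noted earlier: it is insensitive to the ratio of relevant to irrelevant articles.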
Model for Filtering
• The hierarchical model is the most appropriate for filtering out irrelevant articles
Model for Ranking
• The mixed model is the most appropriate for ranking articles
Conclusion
• PPI-TC can save a great deal of annotation effort
• Integrating multiple PPI resources
  • Likely data are effective for improving PPI-TC in both filtering and ranking
• Suitable term-weighting functions
  • BM25 and its variants are the most effective
• For filtering: the hierarchical model with BM25
• For ranking: the mixed model with TFBRF
Classification Algorithm
• Support Vector Machines (SVM)
• Find a maximal-margin separating hyperplane ⟨w, φ(x)⟩ − b = 0, where
  • x(i) is the i-th training instance
  • y(i) ∈ {1, −1} is its label
  • ξ(i) denotes its training error (slack)
  • C is the cost factor
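The soft-margin objective above, (1/2)‖w‖² + C·Σξ(i), can be minimized for a linear kernel by subgradient descent on the hinge loss. The toy trainer below is only a sketch of the formulation, not the solver the authors actually used:

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize (1/2)||w||^2 + C * sum_i max(0, 1 - y_i * (<w, x_i> - b))
    by subgradient descent. X: feature vectors, y: labels in {1, -1}."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) - b)
            if margin < 1:
                # Hinge loss active: this instance has slack xi_i > 0
                w = [wj - lr * (wj - C * yi * xj) for wj, xj in zip(w, xi)]
                b -= lr * C * yi
            else:
                w = [wj - lr * wj for wj in w]  # only the regularizer acts
    return w, b

def svm_predict(w, b, x):
    """Sign of the decision function <w, x> - b."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) - b > 0 else -1
```

The cost factor C trades margin width against training error: a large C penalizes slack heavily and fits the training data tightly, while a small C tolerates misclassified likely data, which matters when the likely labels are noisy.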