Exploiting Likely-Positive and Unlabeled Data to Improve the Identification of Protein-Protein Interaction Articles Richard Tzong-Han Tsai & Wen-Lian Hsu et al. Presenter: Wen-Lian Hsu Intelligent Agent Systems Lab, IIS, Academia Sinica Taiwan Aug. 29, 2007
Outline
• Background
• Traditional Method
  • Formulation
  • Traditional weighting functions
• Our Proposed Method
  • New term weighting functions
  • Selecting likely data
  • Exploiting likely-positive and negative data
• Results
Relevant vs. Irrelevant Articles
• Relevant: "Physical interactions among circadian clock proteins KaiA, KaiB and KaiC in cyanobacteria."
• Irrelevant: "Differential protein expression in human gliomas and molecular insights."
Curation of Protein-Protein Interaction Databases (PPI-DB)
• Pipeline: Unstructured Texts → Filtering and Ranking → Human Verification → Information Extraction → PPI Database
Example of a PPI Record UniProt Protein ID
Difficulties of PPI-Text Classification
• Annotation cost is very high
  • Annotators must be experienced biological researchers
• Unbalanced numbers of documents between the relevant and irrelevant classes
• Varying definitions of "PPI-relevance"
  • The PPI taxonomy is defined in the GO ontology
  • MINT: physical interactions only
  • BIND: both physical and genetic interactions
Data Source
• Second BioCreAtIvE Challenge Workshop: Critical Assessment of Information Extraction in Molecular Biology
Formulation
• PPI abstract identification is formulated as a text classification problem (PPI-TC)
• Class 1: PPI-relevant (+), extracted from PPI-DBs
• Class 2: PPI-irrelevant (−), annotated by experts
Schema: Relevant vs. Irrelevant to PPI
• Example abstract: "An APAF-1 cytochrome c multimeric complex is a functional apoptosome that activates procaspase-9"
• Words are converted to a feature vector by weighting functions
• The vector is classified by an SVM
About BM25
• BM25 is one of the best-known ranking functions used by search engines
• It ranks documents by their relevance to a given search query
• It is also commonly used as a weighting function in text classification
[1] S. Robertson, "Understanding Inverse Document Frequency: On Theoretical Arguments for IDF," Journal of Documentation, 60, 503–520, 2004
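As a concrete illustration, the standard Okapi BM25 term weight can be sketched as below. This is the conventional textbook formulation with default parameters k1 = 1.2 and b = 0.75, not necessarily the exact variant the slides use:

```python
import math

def bm25_weight(tf, doc_len, avg_doc_len, n_docs, df, k1=1.2, b=0.75):
    """Okapi BM25 weight of a term in a document (standard formulation).

    tf: term frequency in the document
    doc_len, avg_doc_len: document length and corpus average (normalization)
    n_docs: total number of documents; df: number of documents containing the term
    """
    # Smoothed IDF: rare terms score higher than common ones
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # Saturating TF component: weight grows sublinearly with tf
    tf_component = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_component
```

The saturation in the TF component means a word occurring four times does not count four times as much as a word occurring once, which is what makes BM25 more robust than raw term frequency for both ranking and classification.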
BM25 (1)
• A weighting function for estimating each word's discrimination ability
• Built on a saturating function of term frequency, abbreviated TFd(wi)
• Variants: Relative Frequency and Balanced Relative Frequency
BM25 (2)
• A weighting function for estimating each word's discrimination ability
• Baseball analogy: (wi, player), (article, game), (positive, winning), (negative, losing)
• For articles containing wi: the positive/negative proportions; for articles not containing wi: the negative/positive proportions
• The denominators cancel out, giving: (winning % when the player is in) × (losing % when he is out)
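The baseball analogy can be made concrete. A word's discrimination score is the share of articles containing it that are positive, multiplied by the share of articles lacking it that are negative. The function below is an illustrative sketch of that verbal description, not the paper's exact formula:

```python
def discrimination(pos_with, neg_with, pos_without, neg_without):
    """Discrimination ability of a word, following the slide's analogy:
    (winning % when the player is in) x (losing % when he is out).

    pos_with, neg_with: positive/negative articles containing the word
    pos_without, neg_without: positive/negative articles lacking the word
    """
    in_total = pos_with + neg_with
    out_total = pos_without + neg_without
    if in_total == 0 or out_total == 0:
        return 0.0
    return (pos_with / in_total) * (neg_without / out_total)
```

A word appearing mostly in relevant articles and rarely in irrelevant ones scores near 1; an uninformative word that appears everywhere scores near 0.25.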
Basic Idea
• Develop better weighting functions
• Expand the training set: select likely data
  • Select PPI-relevant articles recorded in other PPI-DBs
  • Select articles not recorded in any PPI-DB
• Exploit the likely data
Proposed Variants of BM25 (Robertson, 2004)
• Relative Frequency
• Balanced Relative Frequency
Likely (Positive, Negative) Data
• Advantages
  • Improve the generality and robustness of the classification model
  • Reduce the number of unseen features
• Sources for PPI article classification
  • Likely positive: PPI articles recorded in other PPI-DBs
  • Likely negative: PubMed articles not recorded in any PPI-DB
Select Likely Data
• A filter model is generated from the labeled positive and negative articles
• The filter model filters unlabeled data into likely-positive and likely-negative sets
• BIND: contains both genetic and physical PPIs
• MINT: contains only physical PPIs
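The selection of candidate likely data by database membership can be sketched with simple set operations. The function name and arguments are hypothetical; in the actual system a trained filter model further screens these candidates:

```python
def select_likely_data(train_pos_ids, other_db_ids, pubmed_ids, all_db_ids):
    """Select candidate likely-positive / likely-negative article IDs.

    likely positive: articles recorded in other PPI databases (e.g. MINT, BIND)
                     that are not already among the training positives
    likely negative: PubMed articles not recorded in any PPI database
    """
    likely_pos = other_db_ids - train_pos_ids
    likely_neg = pubmed_ids - all_db_ids
    return likely_pos, likely_neg
```

Because database membership is only a proxy for relevance, these sets are "likely" rather than certain labels, which is why the paper treats them differently from the expert-annotated training data.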
Mixed Model
• Likely data are added to the training set as additional training data
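In the mixed model, the likely data simply augment the training set before a single classifier is trained. A minimal sketch of that assembly step (the helper name is hypothetical):

```python
def build_mixed_training_set(pos, neg, likely_pos, likely_neg):
    """Mixed model: likely-positive and likely-negative articles are
    appended to the training set as ordinary labeled examples."""
    X = pos + likely_pos + neg + likely_neg
    y = [1] * (len(pos) + len(likely_pos)) + [-1] * (len(neg) + len(likely_neg))
    return X, y
```

The combined (X, y) is then fed to the SVM exactly as if all examples were hand-annotated; the model never distinguishes likely from certain labels.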
Hierarchical Model
• The filter model's output value is used as an additional feature for the final classifier
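The hierarchical model instead keeps the filter model as a first stage whose output score becomes one extra feature of the final classifier. A hypothetical sketch (the two models are passed in as plain callables):

```python
def hierarchical_predict(article, filter_model, final_model, featurize):
    """Hierarchical model (illustrative): a filter model trained with likely
    data scores the article first; that score is appended as an additional
    feature for the final classifier."""
    feats = featurize(article)
    score = filter_model(feats)            # first-stage output value
    return final_model(feats + [score])    # second stage sees it as a feature
```

This design choice lets the final classifier learn how much to trust the likely-data model, rather than treating its examples as ground truth as the mixed model does.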
Evaluation Metrics for Ranking
• Receiver Operating Characteristic (ROC) curve: plots sensitivity (TPR) against 1 − specificity (FPR)
• AUC: the area under the ROC curve
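AUC has an equivalent rank interpretation: the probability that a randomly chosen relevant article is scored above a randomly chosen irrelevant one. A small self-contained sketch using that Mann-Whitney formulation (fine for evaluation-sized lists; a real implementation would sort once instead of comparing all pairs):

```python
def roc_auc(scores, labels):
    """AUC via the rank (Mann-Whitney) formulation: the probability that a
    randomly chosen positive is ranked above a randomly chosen negative.
    labels: 1 for PPI-relevant, 0 for irrelevant. Ties count as half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranker scores 1.0, a random one 0.5, which is why AUC suits the unbalanced class distribution noted earlier: it is insensitive to the ratio of relevant to irrelevant articles.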
Model for Filtering
• The hierarchical model is the most appropriate for filtering out irrelevant articles
Model for Ranking
• The mixed model is the most appropriate for ranking articles
Conclusion
• PPI-TC can save a great deal of annotation effort
• Integrating multiple PPI resources
  • Likely data are effective for improving PPI-TC in both filtering and ranking
• Suitable term-weighting functions
  • BM25 and its variants are the most effective
• For filtering: the hierarchical model with BM25
• For ranking: the mixed model with TFBRF
Classification Algorithm
• Support Vector Machines (SVM)
• Find a maximal-margin separating hyperplane ⟨w, φ(x)⟩ − b = 0, where
  • x(i) is the i-th training instance
  • y(i) ∈ {1, −1} is its label
  • ξ(i) denotes its training error (slack)
  • C is the cost factor
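The soft-margin objective above, (1/2)‖w‖² + C·Σξ(i), can be minimized for a linear kernel by subgradient descent on the hinge loss. The toy trainer below is only a sketch of the formulation, not the solver the authors actually used:

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Minimize (1/2)||w||^2 + C * sum_i max(0, 1 - y_i * (<w, x_i> - b))
    by subgradient descent. X: feature vectors, y: labels in {1, -1}."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) - b)
            if margin < 1:
                # Hinge loss active: this instance has slack xi_i > 0
                w = [wj - lr * (wj - C * yi * xj) for wj, xj in zip(w, xi)]
                b -= lr * C * yi
            else:
                w = [wj - lr * wj for wj in w]  # only the regularizer acts
    return w, b

def svm_predict(w, b, x):
    """Sign of the decision function <w, x> - b."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) - b > 0 else -1
```

The cost factor C trades margin width against training error: a large C penalizes slack heavily and fits the training data tightly, while a small C tolerates misclassified likely data, which matters when the likely labels are noisy.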