Introduction: RLIMS-P is a rule-based text-mining program designed to extract protein phosphorylation information (protein kinase, substrate, and phosphorylation site) from abstracts (Hu et al., 2005). The program was originally developed by Narayanaswamy, Ravikumar, and Vijay-Shanker (2005), and was tested and benchmarked by PIR using iProLINK annotated datasets (Hu et al., 2004). RLIMS-P has now been adopted at PIR and is being developed into an online text-mining tool for extracting protein phosphorylation information from PubMed literature (Yuan et al., 2006). The online RLIMS-P currently provides the following functions: 1) determine whether a MEDLINE abstract contains protein phosphorylation information, and extract the protein kinase, protein substrate, and phosphorylation site/residue when available; 2) tag extracted phosphorylation objects in the abstract in different colors; 3) map the protein substrate to UniProtKB protein entries based on PMID; 4) map protein names to UniProtKB protein entries based on BioThesaurus. Coupled with BioThesaurus, RLIMS-P can facilitate UniProtKB protein phosphorylation feature annotation.

RLIMS-P System Design
Manual tagging assisted with computational extraction: training and testing sets of positive and negative samples for RLIMS-P development
Evidence attribution
Pattern 1: <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
Example: ATR/FRP-1 also phosphorylated p53 in Ser 15 (3 objects)
RLIMS-P (Bioinformatics. 21:2759-65, 2005)
Protein entity mapping
Information retrieval and extraction

Benchmarking of RLIMS-P

A preliminary case study: using RLIMS-P to facilitate UniProtKB feature annotation
Nuclear receptor (NR) phosphorylation was under-annotated in databases. Text mining of 2170 PubMed abstracts (retrieved with a query for NR phosphorylation) with RLIMS-P found significantly more phosphorylation sites that could be added to UniProt feature annotation.
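To make the Pattern 1 rule concrete, here is a minimal Python sketch that approximates it with a regular expression and applies it to the poster's example sentence. This is an illustration only, not the RLIMS-P implementation: the actual system applies its rules to parsed phrases, whereas the `PROTEIN` and `SITE` expressions, the `AUXILIARIES` filter, and the `extract_pattern_1` function are simplified, hypothetical stand-ins. The mapping of <AGENT> to kinase and <THEME> to substrate follows the three object types named in the introduction.

```python
import re

# Hypothetical, simplified stand-ins for the slots of Pattern 1.
# A real system would rely on entity tagging and phrase chunking instead.
PROTEIN = r"[A-Za-z0-9][A-Za-z0-9/\-]*"   # crude protein-name token
SITE = r"(?:Ser|Thr|Tyr)[ -]?\d+"         # residue plus position, e.g. "Ser 15"

# <AGENT> <VG-active-phosphorylate> <THEME> (in/at <SITE>)?
PATTERN_1 = re.compile(
    rf"\b(?P<agent>{PROTEIN})\s+"
    rf"(?:also\s+)?phosphorylate[sd]?\s+"
    rf"(?P<theme>{PROTEIN})"
    rf"(?:\s+(?:in|at)\s+(?P<site>{SITE}))?"
)

# Words that signal a passive construction ("was phosphorylated by ..."),
# which the active-voice pattern is not meant to cover.
AUXILIARIES = {"is", "are", "was", "were", "be", "been", "being"}


def extract_pattern_1(sentence):
    """Return kinase, substrate and site for the first active-voice match."""
    for match in PATTERN_1.finditer(sentence):
        if match.group("agent").lower() in AUXILIARIES:
            continue  # skip passive-voice false positives
        return {
            "kinase": match.group("agent"),
            "substrate": match.group("theme"),
            "site": match.group("site"),  # None when no site is stated
        }
    return None


if __name__ == "__main__":
    print(extract_pattern_1("ATR/FRP-1 also phosphorylated p53 in Ser 15"))
```

Because the site clause is optional, as in the rule, a sentence that names only a kinase and a substrate still yields a result with `site` set to `None`. Passive constructions would need their own pattern with the agent and theme slots swapped.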
Benchmarking result: high recall for paper retrieval and high precision for information extraction.

RLIMS-P: A Rule-Based Literature Mining System for Protein Phosphorylation
Hu ZZ1, Yuan X1, Torii M2, Vijay-Shanker K3, and Wu CH1
1Protein Information Resource, 2Department of Biostatistics, Bioinformatics, and Biomathematics, 4Department of Computational Linguistics, Georgetown University, Washington, DC 20007; 3University of Delaware, DE 19716

Annotation-tagged literature sets for PTMs from the iProLINK literature mining resource: training/benchmarking data sets and pattern rules can be downloaded.

Web-based RLIMS-P
The online RLIMS-P text-mining results: (A) The summary table lists PMIDs with top-ranking phosphorylation annotation. (B) The full report provides detailed annotation results with evidence tagging and automatic mapping to the UniProtKB entry containing the citation (e.g., KPB1_RABIT).
Name mapping of the phosphorylated protein in the RLIMS-P report (C) to a UniProtKB entry using BioThesaurus (D). Name mapping includes options to use names appearing in the abstract or user-specified names to search the online BioThesaurus. Here, “PBPA” retrieves 10 entries sharing the same name, including PBPA of Mycobacterium tuberculosis (P71586_MYCTU), the phosphorylated protein discussed in the abstract.

Future development of RLIMS-P program:
• Extend to mine full-length articles
• Mine in vivo protein phosphorylation and its cellular context, such as cell types and pathways

References:
Hu ZZ, et al., Comp Biol Chem. 28:409-16, 2004.
Hu ZZ, et al., Bioinformatics. 21:2759-65, 2005.
Narayanaswamy M, et al., Bioinformatics, Suppl.1 21: i319-i327, 2005.
Yuan X, et al., Bioinformatics, April 27, 2006.

Acknowledgements: NIH (UniProt), NSF (Entity Tagging). PIR team: Wu HT, Fang C, Huang H, Arminski L. Collaborators: Liu H, Narayanaswamy M, Ravikumar KE.

Contact: pirmail@georgetown.edu
http://pir.georgetown.edu/iprolink/rlimsp