Protein Information Resource (PIR) for Functional Annotation: Protein Family Classification, Literature Mining and Protein Ontology. In-Silico Analysis of Proteins Celebrating the 20th anniversary of Swiss-Prot Fortaleza, Brazil August 4, 2006 Cathy H. Wu, Ph.D.
Cathy H. Wu, Ph.D. Director, Protein Information Resource; Professor, Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center
Wu CH, Zhao S, Chen HL. (1996) A protein class database organized with PROSITE protein groups and PIR superfamilies. Journal of Computational Biology, 3 (4), 547-562.
Protein Information Resource (PIR): Integrated Protein Informatics Resource for Genomic/Proteomic Research • UniProt Universal Protein Resource: Central Resource of Protein Sequence and Function • PIRSF Family Classification System: Protein Classification and Functional Annotation • iProClass Integrated Protein Database: Data Integration and Protein Mapping • iProLINK Literature Mining Resource: Annotation Extraction • Other Projects: NIAID Proteomics, caBIG Grid-Enablement • http://pir.georgetown.edu
PIR Protein Sequence Database • The PIR-International Protein Sequence Database (PIR-PSD) grew out of the Atlas of Protein Sequence and Structure (1965-1978), Vol 1-5, Suppl 1-3. • Margaret Dayhoff collected all the known protein sequences to study protein evolution. • The first Atlas contained 65 proteins; the final volume had 1081 proteins. • The PIR-PSD was produced from 1984 (Release 1, 2900 proteins) to 2004 (Release 80, 283,416 proteins). • PIR-PSD has been integrated into UniProt since 2002.
UniProt Activities at PIR • Integration of PIR-PSD into UniProtKB • Incorporation of unique PIR entries • Incorporation of PIR annotations: references, experimental features with literature evidence tag • Functional annotation of UniProtKB proteins • Development of PIRSF family classification system & PIRSF curation => Comprehensive coverage of all UniProtKB proteins • Development of rule-based annotation system & PIRNR (name rule) /PIRSR (site rule) curation => Rule curation and integration into Swiss-Prot/TrEMBL annotation pipelines & propagation of annotations (e.g., name, GO, site feature) • Production of UniRef100/90/50 databases • Creation of UniProt web site and help system => Unified UniProt web site & user community interaction
PIRSF Classification System Protein Classification and Functional Annotation • PIRSF: Evolutionary relationships of proteins from super- to sub-families • Curated families with name rules and site rules • Curation platform with classification/visualization tools • Dissemination: UniProtKB annotations, InterPro families, PIRSF reports, PIRSF curation platform
iProClass Integrated Protein Database Data Integration and Protein Mapping • Data integration from >90 databases • Underlying data warehouse for protein ID/name/bibliography mapping & pre-computed BLAST results • Integration of protein family, function, structure for functional annotation • Rich link (link + summary) for value-added reports of UniProt proteins
iProLINK Text Mining Resource: Annotation Extraction and Literature-Based Protein Annotation • Curated datasets and literature corpus for development of literature mining and annotation extraction tools • RLIMS-P text-mining tool for extracting protein phosphorylation data • BioThesaurus of gene/protein names to resolve synonyms and name ambiguity
NIAID Biodefense Proteomic Program [Figure labels: Admin Center, PRC, Data Type, Organism] • Goals • Characterize proteomes of pathogens and host cells • Identify proteins associated with the biology of the microbes • Elucidate mechanisms of microbial pathogenesis • Understand immune responses and non-immune mediated host responses
Data Integration at NIAID Admin Center • Multiple data types from Proteomics Research Centers • Master Protein Directory & complete proteomes at GU-PIR (PIRSF, UniProt, iProClass) • Protein ID and peptide/protein sequence mapping • Integrated data at VBI • Data exchange format, controlled vocabulary, ontology • Rich annotation: capture experimental data and scientific conclusions; integrate with major databases • http://pir.georgetown.edu/proteomics/
NCI caBIG Initiative [Figure: caGrid Architecture] • caBIG (cancer Biomedical Informatics Grid) • Cancer research platform to enable sharing of research infrastructure, data, tools • Designed and built by an open federation of organizations • Based on common standards and open source/open access principles • One of four caBIG grid reference projects • PIR Grid-Enablement: UniProtKB as central protein information resource for cancer research • caBIG Workspaces • Integrative Cancer Research: PIR Developer Project (Grid Enablement of PIR); PIR Adopter Project (SEED Genome Annotation); PIR Adopter Project (GeneConnect ID mapping) • Vocabularies and Common Data Elements: PIR Participant Project (protein models, objects, vocabularies, ontologies)
UniProt Knowledgebase: Accurate, Consistent, and Rich Annotation of Protein Sequence and Function • Family Classification-Driven and Rule-Based Curation • Functional inference of uncharacterized hypothetical proteins • Systematic detection and correction of genome annotation errors • Improvement of under- or over-annotated proteins • Text Mining-Assisted and Literature-Based Curation • Annotation extraction from scientific literature • Attribution of experimental evidence • Ontology and Controlled Vocabulary-Based Curation • Standardization of protein/gene/family names and annotation terms • Annotation of specific protein entities
PIR Superfamily Classification • Tree of Life and Evolution of Protein Families (Dayhoff) • The protein superfamily concept (1976) was based on sequence similarity: sequences were categorized into superfamilies, families, subfamilies, and entries using different % identity thresholds.
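The threshold idea on this slide can be sketched in a few lines of Python. This is only an illustration: the slide does not give the actual cutoffs, so the values in `EXAMPLE_THRESHOLDS` are placeholders, and the function names are hypothetical. The sequences are assumed to be pre-aligned (equal length, `-` for gaps).

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be pre-aligned to the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Placeholder cutoffs (assumption): the slide does not state the real values.
# Ordered from the most specific level to the least specific.
EXAMPLE_THRESHOLDS = [("entry", 95.0), ("subfamily", 80.0), ("family", 50.0)]


def closest_shared_level(identity: float, thresholds=EXAMPLE_THRESHOLDS) -> str:
    """Return the most specific level whose cutoff the identity reaches."""
    for level, cutoff in thresholds:
        if identity >= cutoff:
            return level
    return "superfamily or unrelated"
```

For instance, two aligned sequences that match in three of four ungapped columns have 75% identity, which falls in the "family" band under these placeholder cutoffs.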
PIRSF Classification System • A network classification system from superfamily to subfamily levels to reflect the evolutionary relationships of full-length proteins and domains • Basic unit is homeomorphic family: Full-length similarity, common domain architecture • Provide annotation of generic biochemical and specific biological functions • Basis for evolutionary and comparative genomics research • Basis for accurate and consistent automated protein annotation (protein name, biochemical and biological functions, functional sites) • Basis for standardization of protein names and development of ontology for protein evolution
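As a rough illustration of the homeomorphic criterion (full-length similarity plus a common domain architecture), the check below compares two proteins. It is a minimal sketch, not the PIRSF implementation: the `Protein` type, the `min_coverage` default of 0.8, and the domain names in the usage note are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Protein:
    accession: str
    length: int                # sequence length in residues
    domains: Tuple[str, ...]   # ordered domain names, N- to C-terminus


def is_homeomorphic_pair(p: Protein, q: Protein, aligned_length: int,
                         min_coverage: float = 0.8) -> bool:
    """True if p and q share a domain architecture and align over most of both.

    min_coverage is an illustrative cutoff (assumption), applied to the
    longer of the two sequences so that both are covered end to end.
    """
    same_architecture = p.domains == q.domains
    covers_both = aligned_length >= min_coverage * max(p.length, q.length)
    return same_architecture and covers_both
```

With hypothetical proteins `a = Protein("A", 300, ("dom1", "dom2"))` and `b = Protein("B", 320, ("dom1", "dom2"))`, an alignment of 290 residues passes the check, while a protein carrying only `dom1` fails on architecture even if it aligns over its whole length.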
PIRSF Classification/Curation Workflow • Computational generation of homeomorphic clusters • Computational domain mapping and annotation of preliminary clusters • Automatic placement of new proteins into families • Computer-assisted expert analysis to define homeomorphic families • Family hierarchy created as needed • Expert annotation • Name rules and optional site rules created • Seed members to generate family HMMs
PIRSF Classification Tools • Iterative BlastClust Tree with Annotation Table • Multiple Alignment and Phylogenetic Tree • PIRSF Classification in DAG Editor ISMB: PIRSF Protein Classification System Demo
PIRSF Analysis/Visualization Tools • Taxonomy Distribution and Phylogenetic Pattern • Domain Display • Family Hierarchy (DAG Browser)
PIRSF Family Report • Curated family name • Description of family • Sequence analysis tools
Classification and Functional Annotation Example • Phosphofructokinase (PFK) classification shows that functional specialization can result not only from major sequence changes but also from mutation of a single amino-acid residue. • E. coli (P06998): Gly105, Gly125 • ATP-PFK: Gly105 + Gly125 • PPi-PFK: Gly/Asp105 + Lys125 • Families in the classification tree: ATP_PFK_DR0635, ATP_PFK_euk, PPi_PFK_PfpB, PPi_PFK_TM0289, PPi_PFK_TP0108, PPi_PFK_SMc01852, PFK_XF0274
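The residue signatures on this slide translate directly into a small decision function. The sketch below assumes the caller has already aligned the query protein to E. coli P06998 and read off the residues at positions 105 and 125 (one-letter codes); the function name and the "undetermined" fallback are illustrative choices, not part of PIRSF.

```python
def classify_pfk(residue_105: str, residue_125: str) -> str:
    """Classify a phosphofructokinase by the two signature residues.

    Positions follow E. coli P06998 numbering, as on the slide:
      ATP-PFK: Gly105 + Gly125
      PPi-PFK: Gly/Asp105 + Lys125
    """
    r105, r125 = residue_105.upper(), residue_125.upper()
    if r105 == "G" and r125 == "G":
        return "ATP-PFK"
    if r105 in {"G", "D"} and r125 == "K":
        return "PPi-PFK"
    # Any other combination is not covered by the slide.
    return "undetermined"
```

Note that position 105 alone cannot separate the two classes, since Gly occurs in both; the call depends on position 125, which is the single-residue difference the slide highlights.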
Family-Based Rules for Annotation • Functional Site Rule: tags active site, binding, and other residue-specific information • Functional Name Rule: gives name, EC, GO, and other function-specific information
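One way to picture the two rule types is as records keyed by family, applied to a protein once it has been placed in that family. The sketch below is an assumption-laden illustration: the class names, fields, and the `apply_rules` helper are hypothetical, and the real PIRNR/PIRSR rules carry more conditions than shown here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NameRule:
    family_id: str       # family the rule is attached to
    protein_name: str    # name to propagate (EC and GO would travel the same way)


@dataclass
class SiteRule:
    family_id: str
    feature: str                        # e.g. "active site" or "binding site"
    required_residues: Dict[int, str]   # aligned position -> expected residue


def apply_rules(family_id: str, residues_at: Dict[int, str],
                name_rules: List[NameRule], site_rules: List[SiteRule]) -> dict:
    """Collect the annotations a family member inherits from matching rules."""
    annotations = {"names": [], "sites": []}
    for rule in name_rules:
        if rule.family_id == family_id:
            annotations["names"].append(rule.protein_name)
    for rule in site_rules:
        if rule.family_id != family_id:
            continue
        # A site is tagged only when every expected residue is actually present.
        if all(residues_at.get(pos) == aa
               for pos, aa in rule.required_residues.items()):
            annotations["sites"].extend(
                (rule.feature, pos) for pos in sorted(rule.required_residues))
    return annotations
```

The residue check is the point of the design: family membership alone is enough to propagate a name in this sketch, but a site feature is propagated only to members that conserve the residues the rule names.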
iProLINK Literature Mining Resource • UniProtKB Bibliography mapping in iProClass • RLIMS-P: rule-based NLP method for extracting protein phosphorylation data • Substring-based machine learning method for PTM text categorization • BioThesaurus of protein/gene names with UniProtKB association • Named-entity tagging guide
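To give a feel for what rule-based extraction of phosphorylation data means, here is a deliberately tiny pattern matcher. It is not RLIMS-P and does not reproduce its rules; it is a single regular expression, written for this example, that pulls a kinase, a substrate, and an optional site out of sentences of the form "X phosphorylates Y at Ser123".

```python
import re

# One hand-written pattern (assumption): real systems use many rules plus
# shallow parsing; this only handles the simplest active-voice sentence.
PATTERN = re.compile(
    r"(?P<kinase>[A-Za-z0-9-]+) phosphorylates (?P<substrate>[A-Za-z0-9-]+)"
    r"(?: (?:at|on) (?P<residue>Ser|Thr|Tyr)-?(?P<position>\d+))?"
)


def extract_phosphorylation(sentence: str) -> list:
    """Return one dict per match with kinase, substrate, and optional site."""
    results = []
    for m in PATTERN.finditer(sentence):
        position = m.group("position")
        results.append({
            "kinase": m.group("kinase"),
            "substrate": m.group("substrate"),
            "residue": m.group("residue"),  # None when no site is stated
            "position": int(position) if position else None,
        })
    return results
```

Calling `extract_phosphorylation("PKA phosphorylates CREB at Ser133.")` yields a single record with kinase `PKA`, substrate `CREB`, residue `Ser`, and position `133`; passive constructions and nominalizations ("phosphorylation of Y by X") would need additional rules.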
Literature Corpus for Text Mining • Literature survey and manual tagging for evidence attribution • Training and benchmarking sets for information retrieval and extraction • Protein phosphorylation data used to develop RLIMS-P for extracting phosphorylation information • The five PTM datasets used to develop a machine learning algorithm for text categorization
Online RLIMS-P • Name mapping searches BioThesaurus • Summary table: PMIDs & top-ranking annotation • Report: Full annotation with evidence tagging and PMID mapping to UniProtKB entry
BioThesaurus • Comprehensive collection of protein/gene names from 23 databases • Associate names (~3.2 million) with UniProtKB entries (>2 million) • Web-based searches to retrieve synonymous names, resolve ambiguous names, evaluate name coverage • FTP download for automatic dictionary-based named entity tagging
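Dictionary-based named entity tagging with a name-to-entry table can be sketched as follows. The table format, the function names, and the entry identifiers in the usage note (`ENTRY_1` and so on) are hypothetical; the actual BioThesaurus download format is not described on the slide.

```python
import re
from collections import defaultdict


def build_dictionary(name_entry_pairs):
    """Map each lower-cased name to the set of entries it is associated with."""
    index = defaultdict(set)
    for name, entry in name_entry_pairs:
        index[name.lower()].add(entry)
    return dict(index)


def tag_names(text, dictionary):
    """Return (matched text, sorted entries, is_ambiguous) for each hit."""
    if not dictionary:
        return []
    # Longest names first, so a multi-word name wins over a shorter name
    # that starts at the same position.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(" + "|".join(re.escape(n) for n in names)
        + r")(?![A-Za-z0-9])",
        re.IGNORECASE,
    )
    hits = []
    for m in pattern.finditer(text):
        entries = dictionary[m.group(1).lower()]
        hits.append((m.group(1), sorted(entries), len(entries) > 1))
    return hits
```

Given toy pairs such as `("TIMP-3", "ENTRY_1")`, `("Metalloproteinase inhibitor 3", "ENTRY_1")`, `("CLIM1", "ENTRY_2")`, and `("CLIM1", "ENTRY_3")`, the two names that point to `ENTRY_1` are synonyms, while `CLIM1` is flagged as ambiguous because it maps to more than one entry.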
Online BioThesaurus • 1. Search protein entries sharing the same names • 2. Retrieve BioThesaurus report • Examples shown: name ambiguity of CLIM1; annotation error detection
BioThesaurus Report: Gene/Protein Name Mapping • Search synonyms (example: synonyms for Metalloproteinase inhibitor 3) • Resolve name ambiguity (example: TIMP-3) • Underlying ID mapping
Protein Ontology (PRO) • PRotein Ontology (PRO) in OBO (Open Biomedical Ontologies) Framework • Two sub-ontologies: • Ontology for Protein Evolution (ProEvo) for the classification of proteins on the basis of evolutionary relationships • Ontology for Protein Modified Forms (ProMod) to represent the multiple protein forms of a gene (genetic variation, alternative splicing, proteolytic cleavage, and post-translational modification). • Why PRO? • Allow the specification of relationships between PRO and other ontologies, such as GO and Disease Ontology • Facilitate precise protein annotation of specific proteins/classes • The PRO prototype is illustrated using human proteins from the TGF-beta signaling pathway (http://pir.georgetown.edu/pro).
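The layering described here, from an evolutionary class down to a specific modified form, can be pictured as a chain of is_a links. The sketch below is a simplification under stated assumptions: the term identifiers (`EX:0001` and so on) and labels are invented for illustration and are not PRO identifiers, and each term has a single parent, whereas an OBO ontology is a directed acyclic graph that allows several.

```python
from typing import Dict, List, Optional

# Hypothetical terms: child -> parent via is_a (None marks the root).
IS_A: Dict[str, Optional[str]] = {
    "EX:0001": None,        # protein (root)
    "EX:0002": "EX:0001",   # family-level class (ProEvo side)
    "EX:0003": "EX:0002",   # product of one gene
    "EX:0004": "EX:0003",   # phosphorylated form of that product (ProMod side)
}


def ancestors(term: str, is_a: Dict[str, Optional[str]]) -> List[str]:
    """Walk is_a links from a term up to the root, nearest ancestor first."""
    chain: List[str] = []
    seen = {term}
    parent = is_a[term]
    while parent is not None:
        if parent in seen:
            raise ValueError("cycle detected in is_a links")
        chain.append(parent)
        seen.add(parent)
        parent = is_a[parent]
    return chain
```

Here `ancestors("EX:0004", IS_A)` returns `["EX:0003", "EX:0002", "EX:0001"]`, which is the sense in which an annotation attached to a modified form can still be related back to its gene product and family.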
Acknowledgements • PIR Team • Protein Science Team: Darren Natale, Winona Barker, Peter McGarvey, Zhangzhi Hu, Lai-Su Yeh, Anastasia Nikolskaya, Raja Mazumder, CR Vinayaka, Sona Vasudevan, Cecilia Arighi, Xin Yuan • Informatics Team: Hongzhan Huang, Baris Suzek, Leslie Arminski, Hsing-Kuo Hua, Yongxing Chen, Jing Zhang, Robel Kahsay, Jess Cannata • Students: Natalia Petrova, Paul Ramos, Ti-Cheng Chang, Anna Bank • Collaborators • UniProt: Rolf Apweiler, Amos Bairoch and EBI/SIB Teams • NIAID: Margaret Moore (SSS), Bruno Sobral (VBI) • Text Mining: Hongfang Liu (GUMC), Inderjeet Mani (MITRE), Vijay Shanker (U Delaware), Zoran Obradovic (Temple U) • Funding Support • NHGRI/NIGMS (UniProt) • NCI caBIG • NIAID (Proteomic Admin Center) • NSF: iProClass, text mining