
Protein Information Resource (PIR) for Functional Annotation: Protein Family Classification, Literature Mining and Protein Ontology. In-Silico Analysis of Proteins Celebrating the 20th anniversary of Swiss-Prot Fortaleza, Brazil August 4, 2006 Cathy H. Wu, Ph.D.


Presentation Transcript


  1. Protein Information Resource (PIR) for Functional Annotation: Protein Family Classification, Literature Mining and Protein Ontology. In-Silico Analysis of Proteins: Celebrating the 20th anniversary of Swiss-Prot. Fortaleza, Brazil, August 4, 2006. Cathy H. Wu, Ph.D., Director, Protein Information Resource; Professor, Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center

  2. Wu CH, Zhao S, Chen HL. (1996) A protein class database organized with PROSITE protein groups and PIR superfamilies. Journal of Computational Biology, 3 (4), 547-562.

  3. Protein Information Resource (PIR): Integrated Protein Informatics Resource for Genomic/Proteomic Research • UniProt Universal Protein Resource: Central Resource of Protein Sequence and Function • PIRSF Family Classification System: Protein Classification and Functional Annotation • iProClass Integrated Protein Database: Data Integration and Protein Mapping • iProLINK Literature Mining Resource: Annotation Extraction • Other Projects: NIAID Proteomics, caBIG Grid-Enablement • http://pir.georgetown.edu

  4. PIR Protein Sequence Database • The PIR-International Protein Sequence Database (PIR-PSD) grew out of the Atlas of Protein Sequence and Structure (1965-1978), Vol 1-5, Suppl 1-3. • Margaret Dayhoff collected all the known protein sequences to study protein evolution. • The first Atlas contained 65 proteins; the final volume had 1081 proteins. • The PIR-PSD was produced from 1984 (Release 1, 2900 proteins) to 2004 (Release 80, 283,416 proteins). • PIR-PSD has been integrated into UniProt since 2002.

  5. UniProt Activities at PIR • Integration of PIR-PSD into UniProtKB • Incorporation of unique PIR entries • Incorporation of PIR annotations: references, experimental features with literature evidence tag • Functional annotation of UniProtKB proteins • Development of PIRSF family classification system & PIRSF curation => Comprehensive coverage of all UniProtKB proteins • Development of rule-based annotation system & PIRNR (name rule) /PIRSR (site rule) curation => Rule curation and integration into Swiss-Prot/TrEMBL annotation pipelines & propagation of annotations (e.g., name, GO, site feature) • Production of UniRef100/90/50 databases • Creation of UniProt web site and help system => Unified UniProt web site & user community interaction

  6. PIRSF Classification System Protein Classification and Functional Annotation • PIRSF: Evolutionary relationships of proteins from super- to sub-families • Curated families with name rules and site rules • Curation platform with classification/visualization tools • Dissemination: UniProtKB annotations, InterPro families, PIRSF reports, PIRSF curation platform

  7. iProClass Integrated Protein Database Data Integration and Protein Mapping • Data integration from >90 databases • Underlying data warehouse for protein ID/name/bibliography mapping & pre-computed BLAST results • Integration of protein family, function, structure for functional annotation • Rich link (link + summary) for value-added reports of UniProt proteins

  8. iProLINK Text Mining Resource Annotation Extraction and Literature-Based Protein Annotation • Curated datasets and literature corpus for development of literature mining and annotation extraction tools • RLIMS-P text-mining tool for extracting protein phosphorylation data • BioThesaurus of gene/protein names to resolve synonym and ambiguity

  9. NIAID Biodefense Proteomic Program • Goals • Characterize proteomes of pathogens and host cells • Identify proteins associated with the biology of the microbes • Elucidate mechanisms of microbial pathogenesis • Understand immune responses and non-immune mediated host responses [Figure labels: Adm Ctr, PRC, Data Type, Organism]

  10. Data Integration at NIAID Admin Center (http://pir.georgetown.edu/proteomics/) • Multiple Data Types from Proteomics Research Centers • Master Protein Directory & Complete Proteomes at GU-PIR • Protein ID and Peptide/Protein Sequence Mapping • Integrated Data at VBI • Data Exchange Format, Controlled Vocabulary, Ontology • Rich annotation: capture experimental data and scientific conclusions; integrate with major databases [Diagram labels: PIRSF, UniProt, iProClass]

  11. NCI caBIG Initiative • caBIG (cancer Biomedical Informatics Grid) • Cancer research platform to enable sharing of research infrastructure, data, tools • Designed and built by an open federation of organizations • Based on common standards and open source/open access principles • One of four caBIG grid reference projects • PIR Grid-Enablement: UniProtKB as central protein information resource for cancer research • caBIG Workspaces • Integrative Cancer Research: PIR Developer Project (Grid Enablement of PIR); PIR Adopter Project (SEED Genome Annotation); PIR Adopter Project (GeneConnect ID mapping) • Vocabularies and Common Data Elements: PIR Participant Project (Protein models, objects, vocabularies, ontologies) [Figure: caGrid Architecture]

  12. UniProt Knowledgebase: Accurate, Consistent, and Rich Annotation of Protein Sequence and Function • Family Classification-Driven and Rule-Based Curation • Functional inference of uncharacterized hypothetical proteins • Systematic detection and correction of genome annotation errors • Improvement of under- or over-annotated proteins • Text Mining-Assisted and Literature-Based Curation • Annotation extraction from scientific literature • Attribution of experimental evidence • Ontology and Controlled Vocabulary-Based Curation • Standardization of protein/gene/family names and annotation terms • Annotation of specific protein entities

  13. PIR Superfamily Classification • Tree of Life and Evolution of Protein Families (Dayhoff) • The protein superfamily concept (1976) was based on sequence similarity, where sequences were categorized into superfamilies, families, subfamilies, and entries using different % identity thresholds.
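The threshold idea on this slide can be illustrated with a minimal sketch. The cutoff values below are hypothetical placeholders (the slide does not state the actual thresholds), the function names are invented for illustration, and the identity calculation assumes two sequences that have already been aligned to the same length.

```python
# Hypothetical cutoffs for illustration only; the slide does not give real values.
# Ordered from the most specific level to the most general.
THRESHOLDS = [
    ("entry", 95.0),
    ("subfamily", 80.0),
    ("family", 50.0),
    ("superfamily", 30.0),
]


def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns, skipping columns where both are gaps."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y and x != "-")
    return 100.0 * matches / len(columns)


def closest_shared_level(identity: float) -> str:
    """Return the most specific level whose (hypothetical) cutoff is met."""
    for level, cutoff in THRESHOLDS:
        if identity >= cutoff:
            return level
    return "unrelated"
```

With these placeholder cutoffs, two aligned sequences at 75% identity would share a family but not a subfamily. Real classification would also need an alignment step, which is outside this sketch.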

  14. PIRSF Classification System • A network classification system from superfamily to subfamily levels to reflect the evolutionary relationships of full-length proteins and domains • Basic unit is homeomorphic family: Full-length similarity, common domain architecture • Provide annotation of generic biochemical and specific biological functions • Basis for evolutionary and comparative genomics research • Basis for accurate and consistent automated protein annotation (protein name, biochemical and biological functions, functional sites) • Basis for standardization of protein names and development of ontology for protein evolution

  15. PIRSF Classification/Curation Workflow • Computational generation of homeomorphic clusters • Computational domain mapping and annotation of preliminary clusters • Automatic placement of new proteins into families • Computer-assisted expert analysis to define homeomorphic families • Family hierarchy created as needed • Expert annotation • Name rules and optional site rules created • Seed members to generate family HMMs

  16. PIRSF Classification Tools • Iterative BlastClust Tree with Annotation Table • Multiple Alignment and Phylogenetic Tree • PIRSF Classification in DAG Editor • ISMB: PIRSF Protein Classification System Demo

  17. PIRSF Analysis/Visualization Tools • Taxonomy Distribution and Phylogenetic Pattern • Domain Display • Family Hierarchy (DAG Browser)

  18. PIRSF Family Report • Curated family name • Description of family • Sequence analysis tools

  19. Classification and Functional Annotation Example: Phosphofructokinase (PFK) • PFK classification shows that functional specialization can result not only from major sequence changes but also from mutation of a single amino-acid residue. • Reference residues: Gly105 and Gly125 in E. coli PFK (P06998) • ATP-PFK: Gly105 + Gly125 • PPi-PFK: Gly/Asp105 + Lys125 • Families in the classification tree: ATP_PFK_DR0635, ATP_PFK_euk, PPi_PFK_PfpB, PPi_PFK_TM0289, PPi_PFK_TP0108, PPi_PFK_SMc01852, PFK_XF0274
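The residue pattern on this slide can be written as a small decision function. This is only a sketch of the stated pattern (ATP-PFK: Gly105 + Gly125; PPi-PFK: Gly/Asp105 + Lys125), not the PIRSF implementation. It assumes the residues at the positions aligned to E. coli PFK 105 and 125 have already been extracted from an alignment; the function name and the "unclassified" label are hypothetical.

```python
def classify_pfk(res105: str, res125: str) -> str:
    """Classify a PFK by the one-letter residues aligned to E. coli positions 105 and 125.

    Encodes only the pattern given on the slide:
      ATP-PFK: Gly105 + Gly125
      PPi-PFK: Gly/Asp105 + Lys125
    """
    res105 = res105.upper()
    res125 = res125.upper()
    if res105 == "G" and res125 == "G":
        return "ATP-PFK"
    if res105 in ("G", "D") and res125 == "K":
        return "PPi-PFK"
    # Any other combination is not covered by the slide's pattern.
    return "unclassified"
```

For example, `classify_pfk("G", "G")` returns `"ATP-PFK"`, while changing only position 125 to lysine, `classify_pfk("G", "K")`, returns `"PPi-PFK"`, which mirrors the slide's point that a single residue can mark the functional switch.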

  20. Family-Based Rules for Annotation • Functional Site Rule: tags active site, binding, and other residue-specific information • Functional Name Rule: gives name, EC, GO, and other function-specific information

  21. iProLINK Literature Mining Resource

  22. iProLINK Literature Mining Resource • UniProtKB Bibliography mapping in iProClass • RLIMS-P Rule-based NLP method for extracting protein phosphorylation data • Substring-based machine learning method for PTM text categorization • BioThesaurus of protein/gene names with UniProtKB association • Named-entity tagging guide

  23. Literature Corpus for Text Mining • Literature survey and manual tagging for evidence attribution • Training and benchmarking sets for information retrieval and extraction • Protein phosphorylation data used to develop RLIMS-P for extracting phosphorylation information • The five PTM datasets used to develop a machine learning algorithm for text categorization

  24. Online RLIMS-P • Name mapping searches BioThesaurus • Summary table: PMIDs & top-ranking annotation • Report: Full annotation with evidence tagging and PMID mapping to UniProtKB entry

  25. BioThesaurus • Comprehensive collection of protein/gene names from 23 databases • Associate names (~3.2 million) with UniProtKB entries (>2 million) • Web-based searches to retrieve synonymous names, resolve ambiguous names, evaluate name coverage • FTP download for automatic dictionary-based named entity tagging
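Dictionary-based named entity tagging, as mentioned in the last bullet, can be sketched in a few lines. This is a generic illustration, not the BioThesaurus tooling: the dictionary is a toy stand-in for the downloadable name list, and the entry identifiers (`ENTRY_A` and so on) are hypothetical placeholders rather than real UniProtKB accessions.

```python
import re

# Toy name dictionary: name -> hypothetical entry identifiers (not real accessions).
NAME_TO_ENTRIES = {
    "TIMP-3": ["ENTRY_A"],
    "Metalloproteinase inhibitor 3": ["ENTRY_A"],
    "CLIM1": ["ENTRY_B", "ENTRY_C"],
}


def build_pattern(names):
    """Compile one case-insensitive pattern; longer names are tried first."""
    ordered = sorted(names, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # Lookarounds stop matches inside longer tokens (e.g. "TIMP-30").
    return re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)", re.IGNORECASE)


def tag_entities(text, name_to_entries=NAME_TO_ENTRIES):
    """Return (start, end, dictionary_name, entry_ids) for each dictionary hit."""
    lookup = {name.lower(): (name, ids) for name, ids in name_to_entries.items()}
    pattern = build_pattern(name_to_entries)
    hits = []
    for match in pattern.finditer(text):
        name, ids = lookup[match.group(0).lower()]
        hits.append((match.start(), match.end(), name, ids))
    return hits
```

A hit that maps to more than one entry, such as the toy `CLIM1` record here, is the kind of ambiguous name that would need further resolution before annotation.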

  26. Online BioThesaurus • 1. Search protein entries sharing the same names • 2. Retrieve BioThesaurus report [Screenshot callouts: Name ambiguity of CLIM1; Annotation error detection]

  27. BioThesaurus Report: Gene/Protein Name Mapping • Search Synonyms (example: synonyms for Metalloproteinase inhibitor 3) • Resolve Name Ambiguity (example: name ambiguity of TIMP-3) • Underlying ID Mapping
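The three functions of the report (synonym search, ambiguity detection, ID mapping) all follow from a two-way index between names and entries. The sketch below shows that idea under stated assumptions: the name/entry pairs are invented toy data, the entry identifiers are hypothetical, and the function names are not part of any BioThesaurus interface.

```python
from collections import defaultdict

# Toy (name, entry) pairs; entry identifiers are hypothetical placeholders.
PAIRS = [
    ("TIMP-3", "ENTRY_1"),
    ("Metalloproteinase inhibitor 3", "ENTRY_1"),
    ("TIMP-3", "ENTRY_2"),
    ("CLIM1", "ENTRY_3"),
    ("CLIM1", "ENTRY_4"),
]


def build_indexes(pairs):
    """Build name -> entries (case-insensitive keys) and entry -> names indexes."""
    name_to_entries = defaultdict(set)
    entry_to_names = defaultdict(set)
    for name, entry in pairs:
        name_to_entries[name.casefold()].add(entry)
        entry_to_names[entry].add(name)
    return name_to_entries, entry_to_names


def synonyms(name, name_to_entries, entry_to_names):
    """Other names attached to any entry that carries the given name."""
    found = set()
    for entry in name_to_entries.get(name.casefold(), ()):
        found |= entry_to_names[entry]
    return sorted(n for n in found if n.casefold() != name.casefold())


def is_ambiguous(name, name_to_entries):
    """A name is ambiguous when it maps to more than one entry."""
    return len(name_to_entries.get(name.casefold(), ())) > 1
```

In this toy data, `TIMP-3` is flagged as ambiguous because it is attached to two entries, while `Metalloproteinase inhibitor 3` maps to exactly one entry and has `TIMP-3` as its synonym.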

  28. Protein Ontology (PRO) • PRotein Ontology (PRO) in OBO (Open Biomedical Ontologies) Framework • Two sub-ontologies: • Ontology for Protein Evolution (ProEvo) for the classification of proteins on the basis of evolutionary relationships • Ontology for Protein Modified Forms (ProMod) to represent the multiple protein forms of a gene (genetic variation, alternative splicing, proteolytic cleavage, and post-translational modification). • Why PRO? • Allow the specification of relationships between PRO and other ontologies, such as GO and Disease Ontology • Facilitate precise protein annotation of specific proteins/classes • The PRO prototype is illustrated using human proteins from the TGF-beta signaling pathway (http://pir.georgetown.edu/pro).

  29. PRO Conceptual Framework

  30. Protein Ontology (PRO)

  31. Acknowledgements • PIR Team • Protein Science Team: Darren Natale, Winona Barker, Peter McGarvey, Zhangzhi Hu, Lai-Su Yeh, Anastasia Nikolskaya, Raja Mazumder, CR Vinayaka, Sona Vasudevan, Cecilia Arighi, Xin Yuan • Informatics Team: Hongzhan Huang, Baris Suzek, Leslie Arminski, Hsing-Kuo Hua, Yongxing Chen, Jing Zhang, Robel Kahsay, Jess Cannata • Students: Natalia Petrova, Paul Ramos, Ti-Cheng Chang, Anna Bank • Collaborators • UniProt: Rolf Apweiler, Amos Bairoch and EBI/SIB Teams • NIAID: Margaret Moore (SSS), Bruno Sobral (VBI) • Text Mining: Hongfang Liu (GUMC), Inderjeet Mani (MITRE), Vijay Shanker (U Delaware), Zoran Obradovic (Temple U) • Funding Support • NHGRI/NIGMS (UniProt) • NCI caBIG • NIAID (Proteomic Admin Center) • NSF: iProClass, text mining
