COMPLEMENTING GENE ONTOLOGY WITH PIRSF CLASSIFICATION-BASED PROTEIN ONTOLOGY
Anastasia Nikolskaya, PIR (Protein Information Resource), Georgetown University Medical Center
www.uniprot.org
http://pir.georgetown.edu/
Why Protein Classification?
• Automatic annotation of protein sequences based on protein families (propagation of annotation)
• Systematic correction of annotation errors
• Protein name standardization in UniProt
• Functional predictions for uncharacterized protein families
PIRSF Classification System
• PIRSF: a network structure with hierarchies from superfamilies to subfamilies that reflects the evolutionary relationships of full-length proteins
• Definitions:
  • Basic unit = Homeomorphic Family
  • Homologous (common ancestry): inferred from sequence similarity
  • Homeomorphic: full-length sequence similarity and common domain architecture
  • Network Structure: flexible number of levels with varying degrees of sequence conservation
• Advantages:
  • Annotation of both generic biochemical and specific biological functions
  • Accurate propagation of annotation and development of standardized protein nomenclature and ontology
PIRSF Classification System
A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes. Each homeomorphic family may have as many domain superfamily parents as its members have domains.
Example subfamily nodes:
• SF500001: stimulates trophoblast migration
• SF500002: stimulates proliferation of prostate cancer cells
• SF500003: anti-proliferative and pro-apoptotic effects on cancer cells
• SF500004: inhibitor of IGF
• SF500005: stimulates bone formation
• SF500006: inhibitor of IGF-II
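The assignment rule on this slide can be illustrated with a small data structure. The following is a minimal sketch, not PIR's implementation: the class names, the `level` labels, and all identifiers (`DOM0001`, `FAM0001`, `SUB0001`, `X00001`) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PirsfNode:
    """One node of a PIRSF-style network (hypothetical model)."""
    node_id: str
    name: str
    level: str  # assumed labels: "superfamily", "family", "subfamily"
    parents: set = field(default_factory=set)
    children: set = field(default_factory=set)


class PirsfNetwork:
    """Nodes may have several parents and children; proteins get one family."""

    def __init__(self):
        self.nodes = {}
        self.family_of = {}  # protein accession -> homeomorphic family id

    def add_node(self, node_id, name, level):
        self.nodes[node_id] = PirsfNode(node_id, name, level)

    def link(self, parent_id, child_id):
        # Record the edge in both directions so the network can be walked
        # upward (to superfamilies) or downward (to subfamilies).
        self.nodes[parent_id].children.add(child_id)
        self.nodes[child_id].parents.add(parent_id)

    def assign(self, accession, family_id):
        # Enforce "a protein may be assigned to only one homeomorphic family".
        if self.nodes[family_id].level != "family":
            raise ValueError(f"{family_id} is not a homeomorphic family")
        current = self.family_of.get(accession)
        if current is not None and current != family_id:
            raise ValueError(f"{accession} already belongs to {current}")
        self.family_of[accession] = family_id


# Hypothetical example: a two-domain family with two domain superfamily parents.
net = PirsfNetwork()
net.add_node("DOM0001", "hypothetical domain superfamily 1", "superfamily")
net.add_node("DOM0002", "hypothetical domain superfamily 2", "superfamily")
net.add_node("FAM0001", "hypothetical homeomorphic family", "family")
net.add_node("SUB0001", "hypothetical subfamily", "subfamily")
net.link("DOM0001", "FAM0001")
net.link("DOM0002", "FAM0001")
net.link("FAM0001", "SUB0001")
net.assign("X00001", "FAM0001")
```

Storing parents and children as sets keeps the structure a general network rather than a strict tree, which is what allows one family to sit under several domain superfamilies.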
Creation and curation of PIRSFs
• Computer-Generated (Uncurated) Clusters (36,000 PIRSFs)
• Preliminary Curation (5,000 PIRSFs)
  • Membership
  • Signature Domains
• Full Curation (1,300 PIRSFs)
  • Family Name with evidence tag
  • Description, Bibliography
[Workflow diagram. Labels: UniProt proteins; New proteins; Unassigned proteins; Automatic Procedure (automatic clustering, map domains on families, automatic placement); Preliminary Homeomorphic Families; Orphans; Computer-assisted Manual Curation (merge/split clusters, add/remove members); Curated Homeomorphic Families (name, refs, abstract, domain arch.); Final Homeomorphic Families (protein name rule/site rule); Create hierarchies (superfamilies/subfamilies); Build and test HMMs]
PIRSF-Based Protein Annotation in UniProt
• UniProt is developing protein name standards and guidelines
• Classification of proteins into families provides a convenient and accurate mechanism to propagate curated information to individual protein members
• Rule-based annotation system using curated PIRSFs
  • Site Rules (PIRSR): position-specific site features (active sites, binding sites, modified sites, other functional sites)
  • Name Rules (PIRNR): transfer name from PIRSF to individual proteins (define a subgroup if necessary)
    • Protein Name (may differ from family name), synonyms, acronyms
    • EC
    • Misnomers
    • GO Terms (homeomorphic family-based, propagatable GO annotation)
    • Function
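The name-rule idea (transfer a curated name from a family to its members, optionally restricted to a subgroup) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the PIRNR rule format: the `NameRule` fields, the `propagate_name` function, the rule and family identifiers, and the `kingdom` attribute used to define a subgroup are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class NameRule:
    """A simplified, hypothetical stand-in for a family-based name rule."""
    rule_id: str
    pirsf_id: str
    protein_name: str
    ec_number: Optional[str] = None
    go_terms: tuple = ()
    # Optional predicate that restricts the rule to a subgroup of the family.
    condition: Optional[Callable[[dict], bool]] = None


def propagate_name(protein, rules):
    """Return the annotation from the first rule that matches, else None."""
    for rule in rules:
        if rule.pirsf_id != protein["pirsf_id"]:
            continue
        if rule.condition is not None and not rule.condition(protein):
            continue
        return {
            "accession": protein["accession"],
            "protein_name": rule.protein_name,
            "ec_number": rule.ec_number,
            "go_terms": list(rule.go_terms),
            "evidence": rule.rule_id,  # keep track of which rule was applied
        }
    return None


# Hypothetical rules: a subgroup-specific rule is listed before the general one.
rules = [
    NameRule("RULE_A", "FAM0001", "hypothetical enzyme, bacterial type",
             condition=lambda p: p.get("kingdom") == "Bacteria"),
    NameRule("RULE_B", "FAM0001", "hypothetical enzyme"),
]

bacterial = {"accession": "X00001", "pirsf_id": "FAM0001", "kingdom": "Bacteria"}
other = {"accession": "X00002", "pirsf_id": "FAM0001", "kingdom": "Eukaryota"}
unclassified = {"accession": "X00003", "pirsf_id": None}
```

Ordering the rules from most to least specific is one simple way to let a subgroup name override the family-wide name; a protein with no family assignment receives no propagated annotation.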
PIRSF-Based Protein Ontology
• PIRSF family hierarchy is based on evolutionary relationships
• Standardized PIRSF family names
• Network structure (a DAG, directed acyclic graph) for the PIRSF family classification system
PIRSF to GO Mapping
• PIRSF to GO mapping provides a link between GO concepts and protein objects
• Mapped 5500 curated PIRSF homeomorphic families and subfamilies to the GO hierarchy
DynGO viewer (Hongfang Liu, University of Maryland)
• Superimposes GO and PIRSF hierarchies
• Bidirectional display (GO-centric or PIRSF-centric views)
Protein Ontology Can Complement GO: Expanding a Node
• Identification of GO subtrees that need expansion if GO concepts are too broad
• ~67% of curated PIRSF families and subfamilies map to GO leaf nodes
• Among these, 2209 PIRSFs have shared GO leaf nodes (many PIRSFs to 1 GO leaf)
• Example: PIRSF001969 vs PIRSF018239 and PIRSF036495: high- vs low-affinity IGF binding
Identification of missing GO nodes
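The many-to-one case on this slide (several PIRSFs mapped to the same GO leaf) can be detected mechanically once a PIRSF-to-GO mapping is available. The sketch below assumes the mapping and the GO child lists are plain dictionaries; the function name and the `TOY:` term identifiers are hypothetical placeholders, not real GO IDs. The three PIRSF IDs are taken from the slide's example, and the toy mapping assumes they share one leaf, as the example suggests.

```python
from collections import defaultdict


def shared_leaf_terms(pirsf_to_go, go_children):
    """Return {leaf term: sorted PIRSF ids} for leaves mapped by 2+ PIRSFs.

    A term counts as a leaf when it has no children in `go_children`.
    Such leaves are candidates for expansion into more specific terms.
    """
    by_term = defaultdict(set)
    for pirsf_id, terms in pirsf_to_go.items():
        for term in terms:
            if not go_children.get(term):  # missing or empty child list
                by_term[term].add(pirsf_id)
    return {term: sorted(ids) for term, ids in by_term.items() if len(ids) > 1}


# Toy data: term ids are placeholders; FAM_X is a hypothetical family.
go_children = {
    "TOY:binding": ["TOY:igf_binding"],
    "TOY:igf_binding": [],
}
pirsf_to_go = {
    "PIRSF001969": ["TOY:igf_binding"],
    "PIRSF018239": ["TOY:igf_binding"],
    "PIRSF036495": ["TOY:igf_binding"],
    "FAM_X": ["TOY:binding"],
}

candidates = shared_leaf_terms(pirsf_to_go, go_children)
```

In this toy case the shared leaf cannot distinguish high-affinity from low-affinity binding families, which is the kind of node the slide flags as too broad.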
Protein Ontology Can Complement GO: Identification of Missing GO Nodes (higher levels)
Protein Ontology Can Complement GO: Linking Function, Biological Process, and Cellular Component through a Protein Object Based on Protein Annotations
• Mechanism to examine the relationships between the three GO ontologies based on the shared annotations at different protein family levels
• Example: molecular function “estrogen receptor activity” and biological processes “signal transduction” and “estrogen receptor signaling pathway”
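One simple way to examine such cross-ontology relationships is to count how often a molecular function term and a biological process term are annotated to the same family. The sketch below is an assumption about how this could be done, not the method used by the authors: the function name, the dictionary layout, and the family identifiers `FAM_A` and `FAM_B` are hypothetical, while the term names come from the slide's example.

```python
from collections import Counter
from itertools import product


def cross_ontology_links(family_annotations):
    """Count (molecular function, biological process) pairs that co-occur.

    `family_annotations` maps a family id to a dict with optional
    "molecular_function" and "biological_process" term lists.
    """
    links = Counter()
    for annotation in family_annotations.values():
        functions = annotation.get("molecular_function", ())
        processes = annotation.get("biological_process", ())
        for pair in product(functions, processes):
            links[pair] += 1
    return links


# Hypothetical families annotated with the terms named on the slide.
family_annotations = {
    "FAM_A": {
        "molecular_function": ["estrogen receptor activity"],
        "biological_process": ["signal transduction",
                               "estrogen receptor signaling pathway"],
    },
    "FAM_B": {
        "molecular_function": ["estrogen receptor activity"],
        "biological_process": ["signal transduction"],
    },
}

links = cross_ontology_links(family_annotations)
```

Pairs supported by many families are stronger candidates for an explicit relationship between the two ontologies; the same counting could be extended to cellular component terms.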
PIRSF Protein Classification: a link between GO and protein objects
• Annotation Quality
  • Annotation of biological function of whole proteins
  • Annotation of uncharacterized “hypothetical” proteins
  • Correction of annotation errors and underannotations
• Standardization of Protein Names
• PIRSF to GO mapping provides a link between GO sub-ontologies and protein objects
PIRSF-based Protein Ontology Can Complement GO
• Identification of GO subtrees that need expansion if GO concepts are too broad
• Comprehensive classification of related protein families in PIRSF can help in identification of missing GO nodes when entire groups of PIRSF superfamilies or families cannot be mapped to existing GO terms
• Mechanism to examine the relationships between the three GO ontologies (molecular function, biological process, and cellular component), as well as between GO concepts, based on the shared annotations at different protein family levels
Acknowledgements
• Hongfang Liu, University of Maryland
• Judith Blake, The Jackson Laboratory
• Dr. Cathy Wu, Director
• Protein Classification team: Dr. Winona Barker, Dr. Lai-Su Yeh, Dr. Anastasia Nikolskaya, Dr. Darren Natale, Dr. Zhangzhi Hu, Dr. Raja Mazumder, Dr. CR Vinayaka, Dr. Xianying Wei, Dr. Sona Vasudevan
• Informatics team: Dr. Hongzhan Huang, Baris Suzek, M.S., Sehee Chung, M.S., Dr. Leslie Arminski, Dr. Hsing-Kuo Hua, Yongxing Chen, M.S., Jing Zhang, M.S., Amar Kalelkar
• Students: Christina Fang, Vincent Hormoso, Natalia Petrova, Jorge Castro-Alvear
PIR Team: http://pir.georgetown.edu/
UniProt (SwissProt, TrEMBL, PIR): www.uniprot.org