Processing Natural Language Comments in Biological Databases: Molecular Assemblies and Their Catalytic Functions. A Case Study. Uwe Reyle, Institute of Computational Linguistics, University of Stuttgart
EML European Media Laboratory, Heidelberg • INRIA Institut National de Recherche en Informatique et en Automatique, Grenoble
Biological Databases: Enzymes, Compounds, Pathways, Proteins • flat files, no relational/deductive databases • made for biologists, not for machines
Biological Databases: Enzymes, Compounds, Pathways, Proteins • Data-Model • Ontology • Efficient Querying
Overview • Genes, Proteins and Enzymes • Swissprot Protein Database • Two Examples • Semantic Processing • Parsing Protein Names • Merits for • Coreference Resolution • Extraction/Detection of Molecular Assemblies
[Diagram] Entities: chromosome, gene, polypeptide, molecular assembly, enzyme (EC), compounds (e.g. sugar...), biochemical reactions, pathways. Processes and relations: Transcription, Translation, Posttranslational Modifications, Molecular Assembly, Catalytic Activity.
[Diagram] Entities: chromosome, gene, polypeptide, molecular assembly, enzyme (EC), compounds (e.g. sugar...), biochemical reactions. Parts of Swissprot entries attached to them: DE, POS, EC, FUNCTION, SUBUNIT, CATALYTIC ACTIVITY, INCLUDES, CONTAINS, PATHWAY.
Reference Database Enzymes Compounds Pathways Proteins Swissprot
Swissprot vs. Medline • Medline abstracts, IE papers: fact + organism + experimental context • enormous vocabulary • coreference = intra-document coreference + coreference to database • Swissprot: fact • much smaller vocabulary • coreference = intra-document coreference
Coreference to DE-line of database entry • SYNONYMS: peptidase, dipeptidyl, IV; Pep X; leukocyte antigen CD26; glycylprolyl dipeptidylaminopeptidase; glycylproline-dipeptidyl-aminopeptidase; glycylproline aminopeptidase; Xaa-Pro-dipeptidyl-aminopeptidase; dipeptidyl-peptide hydrolase; lymphocyte, antigen CD26; postproline dipeptidyl aminopeptidase IV; glycylprolyl aminopeptidase; dipeptidyl-aminopeptidase IV; Gly-Pro-naphthylamidase; DPP IV/CD26; glycoprotein GP110; amino acyl-prolyl dipeptidyl aminopeptidase; dipeptidyl aminopeptidase IV; T cell triggering molecule Tp103; dipeptidyl-peptidase IV (CD26); X-prolyl dipeptidyl aminopeptidase; X-PDAP; aminopeptidase, glycylproline • RECOMMENDED NAME: dipeptidyl-peptidase IV
Structure of Swissprot Entries • Quality of information by marking: experiment, similarity, ... • Each entry refers to a polypeptide in one single organism • 3. The different line types: 3.1 The ID line; 3.2 The AC line; 3.3 The DT line; 3.4 The DE line; 3.5 The GN line; 3.6 The OS line; 3.7 The OG line; 3.8 The OC line; 3.9 The OX line; 3.10 The reference (RN, RP, RC, RX, RA, RT, RL) lines; 3.11 The CC line; 3.12 The DR line; 3.13 The KW line; 3.14 The FT line; 3.15 The SQ line; 3.16 The sequence data line; 3.17 The // line
An Example
ID   ACCD_ECOLI STANDARD; PRT; 304 AA.
AC   P08193; P78251; P76937;
DE   ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA
     (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC   -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A
         CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE
         CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE
         TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC   -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN
         CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS
         OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
     -!- CATALYTIC ACTIVITY, PATHWAY, SIMILARITY, FEATURES, ...
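A minimal sketch of how such an entry could be read into a structure that later processing steps can query. It assumes the flat-file layout in which every physical line starts with its two-letter line code and CC topics are introduced by "-!-"; the function name parse_entry and the shortened sample entry are illustrative, not part of the talk.

```python
import re
from collections import defaultdict

def parse_entry(text):
    """Group the lines of one flat-file entry by their two-letter line code.

    CC lines are additionally split into topics (FUNCTION, SUBUNIT, ...),
    each of which is introduced by the "-!-" marker.
    """
    fields = defaultdict(list)   # line code -> list of line contents
    comments = {}                # CC topic -> comment text
    topic = None
    for line in text.strip().splitlines():
        if not line.strip():
            continue
        code, _, content = line.partition(" ")
        content = content.strip()
        if code == "//":         # end-of-entry marker
            break
        if code == "CC":
            m = re.match(r"-!-\s*([A-Z ]+):\s*(.*)", content)
            if m:                # a new comment topic starts here
                topic = m.group(1)
                comments[topic] = m.group(2)
            elif topic:          # continuation line of the current topic
                comments[topic] += " " + content
        else:
            fields[code].append(content)
    return dict(fields), comments

# Shortened sample entry (assumed layout: line code on every line).
ENTRY = """\
ID   ACCD_ECOLI     STANDARD;      PRT;   304 AA.
AC   P08193; P78251; P76937;
DE   ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA
DE   (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC   -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN
CC       CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS
CC       OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
//
"""

fields, comments = parse_entry(ENTRY)
print(" ".join(fields["DE"]))
print(comments["SUBUNIT"])
```

Keeping the CC topics separate from the other line types mirrors the split the talk relies on: DE supplies the name to be parsed, SUBUNIT and FUNCTION supply the free text.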
Variety of SUBUNIT-lines • HETERODIMER ... • PP2 CONSISTS OF A COMMON HETERODIMERIC CORE ENZYME, COMPOSED OF A 36 KDA CATALYTIC SUBUNIT (SUBUNIT C) AND A 65 KDA CONSTANT REGULATORY SUBUNIT (PR65 OR SUBUNIT A), THAT ASSOCIATES WITH A VARIETY OF REGULATORY SUBUNITS. PROTEINS THAT ASSOCIATE WITH THE CORE DIMER INCLUDE THREE FAMILIES OF ...
Subunit-lines of type NP
<DE> A-kinase anchor protein 5
<SUBUNIT> BINDING PROTEIN FOR DIMER OF PKA AND ALSO FOR PKC AND PP2B. EACH ENZYME IS INHIBITED WHEN BOUND TO THE ANCHOR PROTEIN.
<DE> Potassium-transporting ATPase alpha chain
<SUBUNIT> HETERODIMER COMPOSED OF TWO SUBUNITS, ALPHA AND BETA.
Subunit-lines of type NP
<DE> A-kinase anchor protein 5
<SUBUNIT> BINDING PROTEIN FOR DIMER OF PKA AND ALSO FOR PKC AND PP2B. EACH ENZYME IS INHIBITED WHEN BOUND TO THE ANCHOR PROTEIN.
(AKAP5 : (PKA:PKA)), where PKA is inhibited
PKC : AKAP5 and PP2B : AKAP5, where PKC and PP2B are inhibited
<DE> Potassium-transporting ATPase alpha chain
<SUBUNIT> HETERODIMER COMPOSED OF TWO SUBUNITS, ALPHA AND BETA.
Subunit-lines of type NP
<DE> A-kinase anchor protein 5
<SUBUNIT> BINDING PROTEIN FOR DIMER OF PKA AND ALSO FOR PKC AND PP2B. EACH ENZYME IS INHIBITED WHEN BOUND TO THE ANCHOR PROTEIN.
<DE> Potassium-transporting ATPase alpha chain
<SUBUNIT> HETERODIMER COMPOSED OF TWO SUBUNITS, ALPHA AND BETA.
Potassium-transporting ATPase (alpha : beta)
Potassium-transporting ATPase alpha chain (alpha : beta)
Subunit-lines of type NP
<DE> A-kinase anchor protein 5
<SUBUNIT> BINDING PROTEIN FOR DIMER OF PKA AND ALSO FOR PKC AND PP2B. EACH ENZYME IS INHIBITED WHEN BOUND TO THE ANCHOR PROTEIN.
<DE> Potassium-transporting ATPase alpha chain
<SUBUNIT> HETERODIMER COMPOSED OF TWO SUBUNITS, ALPHA AND BETA.
Potassium-transporting ATPase (alpha : beta)
Potassium-transporting ATPase alpha chain (alpha : beta)
Task: parse recommended name
Structure of Polypeptide Names that Refer to Subunits of Proteins
Name = AssemblyName SubunitRef [homolog] [precursor] [; phrase(s)]
AssemblyName = [vacuolar | soluble | anaerobic] (Protein Name | Enzyme Name)
SubunitRef, observed forms: {beta 1, ASHI, lacH, ...} subunit • 30 kda subunit • {small, major, second largest, ...} subunit • type B catalytic subunit • subunit {alpha 3, 2 type B, ...} • iron-sulfur subunit • alpha-2 • {alpha, light, catalytic, ...} chain • cytochrome B-558
Problems • We cannot assume a dictionary of assembly names • AssemblyName very often ends with a highly ambiguous symbol that may also be used to start the SubunitRef expression: F, A1, I, II, i, ..., geneName, ... • Nomenclature of subunits does not exist • Contextual knowledge is needed to disambiguate, e.g., XYase A1 large chain
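A rough sketch of the name-splitting task, under strong simplifying assumptions: only two SubunitRef shapes are covered ("<label> subunit/chain" and "subunit <label>"), and the ambiguity problems listed above are not handled at all. The pattern and the function name split_name are hypothetical and only modelled on the examples in the slides.

```python
import re

# Hypothetical, deliberately small pattern list modelled on the slide examples.
SUBUNIT_REF = re.compile(
    r"""\s+(?:
        (?P<label1>\S+(?:\s+\d+)?)\s+(?:subunit|chain)   # "alpha chain", "beta 1 subunit"
      | subunit\s+(?P<label2>\S+(?:\s+\d+)?)              # "subunit beta"
    )\s*$""",
    re.IGNORECASE | re.VERBOSE,
)

def split_name(name):
    """Return (assembly_name, subunit_label); the label is None if no SubunitRef is found."""
    # Strip trailing modifiers such as "precursor" or "homolog" first.
    core = re.sub(r"\s+(?:precursor|homolog)\s*$", "", name.strip(), flags=re.IGNORECASE)
    m = SUBUNIT_REF.search(core)
    if not m:
        return core, None
    label = m.group("label1") or m.group("label2")
    return core[:m.start()], label

print(split_name("Potassium-transporting ATPase alpha chain"))
print(split_name("ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA"))
print(split_name("A-kinase anchor protein 5"))
```

On a name like "XYase A1 large chain" this sketch simply takes "large" as the label and leaves "A1" in the assembly name, which is exactly the kind of decision that needs the contextual knowledge mentioned above.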
Assembly Names • Mitogen-activated protein kinase kinase kinase: kinase acting on a kinase that acts on a protein kinase • one of these kinases is mitogen-activated, not the protein, however • "kinase" has 1 semantic argument, namely the molecule X that it phosphorylates • [Function: transfer, Group: phosphoryl, Acceptor/Donor: [Function: transfer, Group: phosphoryl, Acceptor/Donor: ...]]
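The compositional reading of stacked "kinase" heads can be sketched as follows. This is an illustration only: the attribute names follow the feature structure on the slide, the function name kinase_semantics is hypothetical, and the attachment of modifiers such as "mitogen-activated" is deliberately left unresolved, since the slide notes that it does not modify the protein.

```python
def kinase_semantics(name):
    """Build a nested feature structure for names of the form '<modifiers> <X> kinase kinase ...'.

    Returns (structure, modifiers). Each trailing "kinase" contributes one
    phosphoryl-transfer layer whose Acceptor/Donor is the layer below it.
    """
    words = name.split()
    depth = 0
    while words and words[-1].lower() == "kinase":   # peel off trailing "kinase" heads
        words.pop()
        depth += 1
    structure = words[-1] if words else None          # innermost argument, e.g. "protein"
    modifiers = words[:-1]                            # attachment site not decided here
    for _ in range(depth):
        structure = {
            "Function": "transfer",
            "Group": "phosphoryl",
            "Acceptor/Donor": structure,
        }
    return structure, modifiers

structure, modifiers = kinase_semantics("Mitogen-activated protein kinase kinase kinase")
print(structure)
print(modifiers)
```

Returning the modifiers separately keeps the sketch honest: deciding which of the three kinases is mitogen-activated needs lexical or domain knowledge that a purely structural pass does not have.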
Assembly Names • CoA Carboxylase Carboxyl Transferase • either [Acceptor/Donor: CoA Carboxylase, Group: carboxyl, Function: transfer] • or [Acceptor/Donor: X, Group: carboxyl, Function: transfer, ADJ-Rel: CoA Carboxylase] • with ADJ-Rel { , is_expressed_by, ...}
Semantic Relations projected from the Lexicon • carboxyl transferase transcarboxylase (IUPAC) • transcarboxylation carboxylation • transcarboxylate carboxylate • phosphorylate, biotinylate, adenylylate, ... • transphosphorylate, ... • crossphosphorylate, ...
Coreference (local)
ID   ACCD_ECOLI STANDARD; PRT; 304 AA.
AC   P08193; P78251; P76937;
DE   ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA
     (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC   -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A
         CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE
         CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE
         TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC   -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN
         CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS
         OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
Coreference (local)
ID   ACCD_ECOLI STANDARD; PRT; 304 AA.
AC   P08193; P78251; P76937;
DE   ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA
     (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC   -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A
         CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE
         CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE
         TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC   -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN
         CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS
         OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
PP-attachment: semantics of Heterohexamer
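One way to make the "semantics of Heterohexamer" usable for attachment decisions is to decompose the oligomer term into a kind (homo/hetero) and a subunit count, which the constituents named in the "OF ..." phrase then have to satisfy. The following is a small sketch of that decomposition only; the prefix table and the function name oligomer are assumptions, not the lexicon used in the project.

```python
import re

# Greek numeral prefixes and the subunit counts they stand for.
MULTIPLICITY = {"mono": 1, "di": 2, "tri": 3, "tetra": 4,
                "penta": 5, "hexa": 6, "hepta": 7, "octa": 8}

OLIGOMER = re.compile(r"(homo|hetero)?(" + "|".join(MULTIPLICITY) + r")mer")

def oligomer(term):
    """Decompose e.g. 'HETEROHEXAMER' into its kind and number of subunits."""
    m = OLIGOMER.fullmatch(term.lower())
    if not m:
        return None
    kind, prefix = m.groups()
    return {"kind": kind, "subunits": MULTIPLICITY[prefix]}

print(oligomer("HETEROHEXAMER"))
print(oligomer("HOMODIMER"))
print(oligomer("MONOMER"))
```

For the SUBUNIT line above, a count of six is consistent with reading the coordinated phrase as one carrier protein, one biotin carboxylase and a 2:2 complex of the two carboxyl transferase subunits.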
Coreference (non-local)
ID   ACCD_ECOLI STANDARD; PRT; 304 AA.
AC   P08193; P78251; P76937;
DE   ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA
     (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC   -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A
         CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE
         CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE
         TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC   -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN
         CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS
         OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
Name patterns to search for in other entries:
ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT XYZ
XYZ SUBUNIT OF ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE
...
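The two name patterns can be turned into a search over the DE lines of other entries, here sketched with a regular expression in which XYZ is a free slot. The helper names and the tiny in-memory dictionary of DE lines are illustrative assumptions; a real system would query the full database and would need more patterns.

```python
import re

def sibling_pattern(assembly):
    """Regex for '<assembly> SUBUNIT XYZ' or 'XYZ SUBUNIT OF <assembly>' at the start of a DE line."""
    a = re.escape(assembly)
    return re.compile(rf"^(?:{a} SUBUNIT (?P<post>\w+)|(?P<pre>\w+) SUBUNIT OF {a})\b")

def find_siblings(assembly, de_lines):
    """Map entry IDs to the subunit label found in their DE line."""
    pattern = sibling_pattern(assembly)
    found = {}
    for entry_id, de in de_lines.items():
        m = pattern.match(de)
        if m:
            found[entry_id] = m.group("post") or m.group("pre")
    return found

# DE lines copied from the example entries in the slides.
DE_LINES = {
    "ACCA_ECOLI": "ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT ALPHA (EC 6.4.1.2).",
    "ACCD_ECOLI": "ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA (EC 6.4.1.2) (ACCASE BETA CHAIN).",
    "ACCC_ECOLI": "BIOTIN CARBOXYLASE (EC 6.3.4.14) (A SUBUNIT OF ACETYL-COA CARBOXYLASE) (EC 6.4.1.2) (ACC).",
}

siblings = find_siblings("ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE", DE_LINES)
print(siblings)
```

Biotin carboxylase is not found by these patterns, since its relation to the complex is only stated in a parenthesised synonym; linking it requires the comment text rather than the name pattern.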
Coreference (non-local)
ID   ACCA_ECOLI STANDARD; PRT; 318 AA.
AC   P30867;
DE   ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT ALPHA
     (EC 6.4.1.2).
CC   -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A
         CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE
         CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE
         TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC   -!- CATALYTIC ACTIVITY: CARBOXYBIOTIN CARBOXYL CARRIER PROTEIN +
         ACETYL-COA = BIOTIN CARBOXYL CARRIER PROTEIN + MALONYL-COA.
CC   -!- PATHWAY: FIRST STEP IN LONG-CHAIN FATTY ACID SYNTHESIS.
CC   -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN
         CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS
         OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
CC   -!- SIMILARITY: TO THE C-TERMINUS OF MAMMALIAN PROPIONYL-COA
         CARBOXYLASE BETA CHAIN.
Coreference (non-local)
ID   BCCP_ECOLI STANDARD; PRT; 156 AA.
AC   P02905;
DE   BIOTIN CARBOXYL CARRIER PROTEIN OF ACETYL-COA CARBOXYLASE (BCCP).
CC   -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A
         CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE
         CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE
         TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC   -!- PATHWAY: FIRST STEP IN LONG-CHAIN FATTY ACID SYNTHESIS.
CC   -!- SUBUNIT: HOMODIMER.
Coreference (non-local)
ID   ACCC_ECOLI STANDARD; PRT; 449 AA.
AC   P24182;
DE   BIOTIN CARBOXYLASE (EC 6.3.4.14) (A SUBUNIT OF ACETYL-COA CARBOXYLASE)
     (EC 6.4.1.2) (ACC).
CC   -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A
         CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE
         CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE
         TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC   -!- CATALYTIC ACTIVITY: ATP + BIOTIN-CARBOXYL-CARRIER PROTEIN + CO(2) =
         ADP + ORTHOPHOSPHATE + CARBOXYBIOTIN-CARBOXYL-CARRIER PROTEIN.
CC   -!- PATHWAY: FIRST STEP IN LONG-CHAIN FATTY ACID SYNTHESIS.
CC   -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN
         CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS
         OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
CC   -!- SIMILARITY: TO OTHER BIOTIN-DEPENDENT ENZYMES AND
         CARBAMOYL-PHOSPHATE SYNTHETASES.
Extraction
ID   ACCD_ECOLI STANDARD; PRT; 304 AA.
AC   P08193; P78251; P76937;
DE   ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA
     (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC   -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A
         CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE
         CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE
         TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC   -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN
         CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS
         OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
CC   -!- SIMILARITY: BELONGS TO THE ACCD / PCCB FAMILY.
HETEROHEXAMER: complex consisting of 6 subunits
Extraction
ID   ACCD_ECOLI STANDARD; PRT; 304 AA.
AC   P08193; P78251; P76937;
DE   ACETYL-COENZYME A CARBOXYLASE CARBOXYL TRANSFERASE SUBUNIT BETA
     (EC 6.4.1.2) (ACCASE BETA CHAIN).
CC   -!- FUNCTION: THIS PROTEIN IS A COMPONENT OF THE ACETYL COENZYME A
         CARBOXYLASE COMPLEX; FIRST, BIOTIN CARBOXYLASE CATALYZES THE
         CARBOXYLATION OF THE CARRIER PROTEIN AND THEN THE TRANSCARBOXYLASE
         TRANSFERS THE CARBOXYL GROUP TO FORM MALONYL-COA.
CC   -!- SUBUNIT: ACETYL-COA CARBOXYLASE IS AN HETEROHEXAMER OF BIOTIN
         CARBOXYL CARRIER PROTEIN, BIOTIN CARBOXYLASE AND THE TWO SUBUNITS
         OF CARBOXYL TRANSFERASE IN A 2:2 COMPLEX.
CC   -!- SIMILARITY: BELONGS TO THE ACCD / PCCB FAMILY.
[Diagram] Acetyl-CoA Carboxylase: Carrier Protein; Biotin Carboxylase; Carboxyl Transferase (Alpha, Alpha, Beta, Beta)
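The extracted assembly can be represented as a small part-of tree and checked against the subunit count implied by HETEROHEXAMER. The sketch below is only one possible representation: the class name Component and its fields are hypothetical, and the copy numbers follow the reading of the diagram (one carrier protein, one biotin carboxylase, two alpha and two beta chains).

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    """A molecular assembly or a single polypeptide, with its copy number."""
    name: str
    copies: int = 1
    parts: list = field(default_factory=list)

    def polypeptide_count(self):
        """Number of polypeptide chains; a component without parts is one chain per copy."""
        if not self.parts:
            return self.copies
        return self.copies * sum(p.polypeptide_count() for p in self.parts)

acc = Component("Acetyl-CoA carboxylase", parts=[
    Component("Biotin carboxyl carrier protein"),
    Component("Biotin carboxylase"),
    Component("Carboxyl transferase", parts=[   # the 2:2 complex
        Component("alpha", copies=2),
        Component("beta", copies=2),
    ]),
])

print(acc.polypeptide_count())
```

Comparing this count with the number derived from the oligomer term gives a cheap consistency check on the extracted structure.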
Completing the Picture
ID   BIRA_ECOLI STANDARD; PRT; 321 AA.
AC   P06709;
CC   -!- FUNCTION: BIRA ACTS BOTH AS A BIOTIN-OPERON REPRESSOR AND AS THE
         ENZYME THAT SYNTHESIZES THE COREPRESSOR, ACETYL COA:CARBON-DIOXIDE
         LIGASE. THIS PROTEIN ALSO ACTIVATES BIOTIN TO FORM
         BIOTINYL-5'-ADENYLATE AND TRANSFERS THE BIOTIN MOIETY TO
         BIOTIN-ACCEPTING PROTEINS.
CC   -!- CATALYTIC ACTIVITY: ATP + BIOTIN + APO-[ACETYL-COA:CARBON-DIOXIDE
         LIGASE (ADP FORMING)] = AMP + PYROPHOSPHATE +
         [ACETYL-COA:CARBON-DIOXIDE LIGASE (ADP FORMING)].
CC   -!- SUBUNIT: MONOMER.
CC   -!- SIMILARITY: WITH OTHER BACTERIAL BIRA AND WITH EUKARYOTIC BIOTIN
         APO-PROTEIN LIGASE.
ACETYL-COA:CARBON-DIOXIDE LIGASE (ADP FORMING) = Acetyl CoA Carboxylase
Conclusion • Sophisticated IE must incorporate • Domain Ontology (EML, INRIA, IMS) • Lexical Semantics (IMS) • Morphological Analysis + Compositional Semantics (IMS) • Discourse Semantics (IMS) • Work on the Lexicon of Cell-Biology • Organic Chemical Compounds: "Was bedeutet UREYLEN" ("What does UREYLEN mean", C. Gerstenberger, IMS) • Semantic/ontological classification of 100 chemical verbs (Phillip Cimiano Lavin, IMS) • Enzyme and Protein Names (work in progress)