Seed-based Generation of Personalized Bio-Ontologies for Information Extraction. Cui Tao & David W. Embley, Data Extraction Research Group, Department of Computer Science, Brigham Young University. Supported by NSF.
Personalized Information Harvesting • Biology domain huge (other domains too) • Data collection • Many (web) sources • Only a tiny subpart wanted • Personalized view • Personalized extraction ontology • Creation: Form specification • Application: Seed-based harvesting
Example • Harvest information about large proteins in humans and the functions of these proteins • Find proteins in humans that are >20 kDa • Find all the proteins in humans that serve as receptors • ... • Information sources: various online repositories • NCBI • GeneCards • The Gene Ontology • GPM Proteomics Database • …
Extraction Ontology • Name lexicon: T-complex protein 1 subunit theta; TCP-1-theta; CCT-theta; Renal carcinoma antigen NY-REN-15; … • Instance: ^\d{1,5}(\.\d{1,2})? • Context: weight|wght|wt\. • Unit: kilodaltons?|kdas?|kds?|das?|daltons? • …
Extraction Ontology: Unfortunately Hard to Construct
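As a rough illustration of why such recognizers take effort to write, the sketch below combines the Instance, Context, and Unit patterns from the slide into one Python regular expression. How the original system combines the three parts is not stated on the slide; the ordering (context keyword, then number, then unit), the 20-character gap, and the name `find_weights` are assumptions made for this example only.

```python
import re

# Patterns copied from the slide (the leading ^ of the instance pattern is
# dropped because the number is searched for inside running text here).
INSTANCE = r"\d{1,5}(?:\.\d{1,2})?"
CONTEXT = r"weight|wght|wt\."
UNIT = r"kilodaltons?|kdas?|kds?|das?|daltons?"

# Assumption: a context keyword, up to 20 non-digit characters, the number,
# optional whitespace, and then a unit.
WEIGHT = re.compile(
    rf"(?:{CONTEXT})\D{{0,20}}?(?P<value>{INSTANCE})\s*(?P<unit>{UNIT})\b",
    re.IGNORECASE,
)


def find_weights(text):
    """Return (value, unit) pairs for weight mentions found in text."""
    return [(float(m.group("value")), m.group("unit")) for m in WEIGHT.finditer(text)]


if __name__ == "__main__":
    print(find_weights("Molecular weight: 59.6 kDa"))
    print(find_weights("Mol. wt. 20000 Da"))
```

A number with a unit but no context keyword is deliberately ignored in this sketch; a real recognizer would probably need a softer rule.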
Can We Make Construction Easier? • Forms • General familiarity • Reasonable conceptual framework • Appropriate correspondence • Transformable to ontological descriptions • Capable of accepting source data • Instance recognizers • Some pre-existing instance recognizers • Lexicons • Need for a full extraction ontology?
Form Creation User Interface • Basic form-construction facilities: • single-entry field • multiple-entry field • nested form • …
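The three form-construction facilities can be pictured as a small recursive data structure. The following is a minimal sketch, not the tool's actual data model: the class names, the example protein form, and the reading of single-entry as "at most one value" and multiple-entry as "many values" are all assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class SingleEntry:
    label: str  # assumed to hold at most one value


@dataclass
class MultipleEntry:
    label: str  # assumed to hold any number of values


@dataclass
class NestedForm:
    label: str
    fields: List["Field"] = field(default_factory=list)


Field = Union[SingleEntry, MultipleEntry, NestedForm]


def field_paths(form, prefix=""):
    """List (path, cardinality) for every leaf field; '1' = single, '*' = multiple."""
    paths = []
    for f in form.fields:
        path = f"{prefix}/{f.label}"
        if isinstance(f, NestedForm):
            paths.extend(field_paths(f, path))
        else:
            paths.append((path, "1" if isinstance(f, SingleEntry) else "*"))
    return paths


# Hypothetical form for the protein example.
protein_form = NestedForm("Protein", [
    MultipleEntry("Name"),
    SingleEntry("Molecular Weight"),
    NestedForm("Function", [MultipleEntry("Molecular Function")]),
])

if __name__ == "__main__":
    for path, kind in field_paths(protein_form):
        print(path, kind)
```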
Source-to-Form Mapping: Establishing a Seed
Almost Ready to Harvest … • Need reading path: DOM-tree structure • Need to resolve mapping problems • Split/Merge • Union/Selection • Name: Voltage-dependent anion-selective channel protein 3; VDAC-3; hVDAC3; Outer mitochondrial membrane protein porin 3
Almost Ready to Harvest … • Name: T-complex protein 1 subunit theta; TCP-1-theta; CCT-theta; Renal carcinoma antigen NY-REN-15
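The reading-path and split ideas can be sketched with the standard library alone. This is an illustration under stated assumptions, not the system's algorithm: the two page snippets are invented, the path is stored as a list of child indexes, and the semicolon separator in `split_values` is a guess at how several names might share one page cell.

```python
import xml.etree.ElementTree as ET


def reading_path(root, seed_text):
    """Return child indexes leading from root to the first element whose text contains seed_text."""
    def walk(node, path):
        if node.text and seed_text in node.text:
            return path
        for i, child in enumerate(node):
            found = walk(child, path + [i])
            if found is not None:
                return found
        return None
    return walk(root, [])


def follow(root, path):
    """Follow a reading path on another page that has the same structure."""
    node = root
    for i in path:
        node = node[i]
    return node


def split_values(text, sep=";"):
    """Split one page cell into several form entries (the separator is an assumption)."""
    return [part.strip() for part in text.split(sep) if part.strip()]


# Hypothetical, well-formed page fragments with identical layout.
SEED_PAGE = ("<html><body><h1>Protein</h1><table><tr><td>Name</td>"
             "<td>TCP-1-theta; CCT-theta</td></tr></table></body></html>")
OTHER_PAGE = ("<html><body><h1>Protein</h1><table><tr><td>Name</td>"
              "<td>VDAC-3; hVDAC3</td></tr></table></body></html>")

path = reading_path(ET.fromstring(SEED_PAGE), "CCT-theta")
names = split_values(follow(ET.fromstring(OTHER_PAGE), path).text)

if __name__ == "__main__":
    print(path, names)
```

Real pages are rarely well-formed XML, so an actual harvester would need a tolerant HTML parser; `xml.etree.ElementTree` is used here only to keep the example self-contained.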
Can Now Harvest Name
Can Now Harvest • Name: 14-3-3 protein epsilon; Mitochondrial import stimulation factor L subunit; Protein kinase C inhibitor protein-1; KCIP-1; 14-3-3E
Can Now Harvest • Name: Voltage-dependent anion-selective channel protein 3; VDAC-3; hVDAC3; Outer mitochondrial membrane protein porin 3
Can Now Harvest • Name: Tryptophanyl-tRNA synthetase, mitochondrial precursor; EC 6.1.1.2; Tryptophan--tRNA ligase; TrpRS; (Mt)TrpRS
Harvesting Populates Ontology • Also helps adjust ontology constraints
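One way harvested data could help adjust constraints is by comparing a declared cardinality with what was actually observed. The slide does not say how the adjustment is done, so the sketch below is purely illustrative: the function names, the record layout (a dict from field label to a list of values), and the widening rule are assumptions.

```python
def observed_cardinality(records, field_name):
    """Return the (min, max) number of values seen for field_name across records."""
    counts = [len(r.get(field_name, [])) for r in records]
    if not counts:
        return (0, 0)
    return (min(counts), max(counts))


def suggest_constraint(declared_max, records, field_name):
    """Suggest a participation constraint, widening the maximum if the data exceed it."""
    lo, hi = observed_cardinality(records, field_name)
    return {"min": lo, "max": max(declared_max, hi), "widened": hi > declared_max}


# Hypothetical harvested records.
harvested = [
    {"Name": ["VDAC-3", "hVDAC3"]},
    {"Name": ["KCIP-1"]},
    {},
]

if __name__ == "__main__":
    print(suggest_constraint(1, harvested, "Name"))
```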
Can Harvest from Additional Sites • Name: T-complex protein 1 subunit theta; TCP-1-theta; CCT-theta; Renal carcinoma antigen NY-REN-15
Larger Picture • Information Harvesting • Not only for biology, but for any application • Not only from one site, but from many sites • Opportunities • Extraction ontology creation • Automating site-to-site information harvesting • Automatic semantic annotation • Data/Ontology transformations
Extraction Ontology Creation: Lexicons • Name (harvested): 14-3-3 protein epsilon; Mitochondrial import stimulation factor L subunit; Protein kinase C inhibitor protein-1; KCIP-1; 14-3-3E • Name (harvested): T-complex protein 1 subunit theta; TCP-1-theta; CCT-theta; Renal carcinoma antigen NY-REN-15 • Name (harvested): Tryptophanyl-tRNA synthetase, mitochondrial precursor; EC 6.1.1.2; Tryptophan--tRNA ligase; TrpRS; (Mt)TrpRS • Lexicon: the harvested names collected together, …
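Collecting harvested names into a lexicon and then using it as a recognizer might look like the sketch below. The function names, the case-insensitive matching, and the longest-first ordering are assumptions for this example; the slides do not describe the lexicon format.

```python
import re


def build_lexicon(harvested_name_lists):
    """Merge the Name values harvested from several pages into one set."""
    return {name.strip() for names in harvested_name_lists for name in names if name.strip()}


def compile_lexicon(lexicon):
    """Compile the lexicon into one pattern, longest entries first so they win over shorter ones."""
    alternatives = sorted(lexicon, key=len, reverse=True)
    return re.compile("|".join(re.escape(a) for a in alternatives), re.IGNORECASE)


# Names taken from the slides.
harvested_names = [
    ["14-3-3 protein epsilon", "KCIP-1", "14-3-3E"],
    ["T-complex protein 1 subunit theta", "TCP-1-theta", "CCT-theta"],
]
lexicon = build_lexicon(harvested_names)
name_pattern = compile_lexicon(lexicon)

if __name__ == "__main__":
    print(name_pattern.findall("Cells lacking CCT-theta or kcip-1 were compared."))
```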
Extraction Ontology Creation: Instance Recognizers • Number patterns • Context keywords and phrases
Automatic Semantic Annotation • Recognize and annotate with respect to an ontology
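Annotation with respect to an ontology can be pictured as running each concept's recognizer over a text and recording which span belongs to which concept. The sketch below assumes two concepts, `Name` and `MolecularWeight`, with hand-written patterns; the concept names, the tuple layout, and the `annotate` function are hypothetical.

```python
import re

# Hypothetical recognizers, one per ontology concept.
RECOGNIZERS = {
    "Name": re.compile(r"TCP-1-theta|CCT-theta", re.IGNORECASE),
    "MolecularWeight": re.compile(
        r"\d{1,5}(?:\.\d{1,2})?\s*(?:kilodaltons?|kdas?|kds?|das?|daltons?)\b",
        re.IGNORECASE,
    ),
}


def annotate(text):
    """Return (start, end, concept, matched_text) tuples sorted by position."""
    spans = []
    for concept, pattern in RECOGNIZERS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), concept, m.group(0)))
    return sorted(spans)


sample = "CCT-theta has a weight of 59.6 kDa."
annotations = annotate(sample)

if __name__ == "__main__":
    for start, end, concept, matched in annotations:
        print(f"{concept}: {matched!r} at [{start}, {end})")
```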
Ontology Transformation • OWL & RDF: standard ontology languages • XML & XMLS: data exchange • Forms: form filling to populate an ontology
Ontology Transformation • Transformations to and from all
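As one small instance of a form-to-RDF transformation, filled-in form values can be written out as N-Triples lines. This is a sketch only: the `http://example.org/` namespace, the subject and predicate naming, and the function `to_ntriples` are invented for illustration, and every value is emitted as a plain string literal.

```python
BASE = "http://example.org/"  # hypothetical namespace, not from the slides


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples string literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(subject_id, record):
    """Turn one filled-in form (field label -> list of values) into N-Triples lines."""
    subject = f"<{BASE}protein/{subject_id}>"
    lines = []
    for field_name, values in record.items():
        predicate = f"<{BASE}onto#{field_name}>"
        for value in values:
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


triples = to_ntriples("1", {"name": ["TCP-1-theta", "CCT-theta"]})

if __name__ == "__main__":
    print("\n".join(triples))
```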
Contributions • Personalized ontology creation • Mapping from sources • Information harvesting • Opportunities for further work • Extraction ontology creation • Semantic Annotation • Data/Ontology transformations www.deg.byu.edu