
Seed-based Generation of Personalized Bio-Ontologies for Information Extraction

Explore creating personalized extraction ontologies for targeted information harvesting in the biology domain from various online repositories through seed-based harvesting. Enhance ontology creation and automate site-to-site information extraction for broader applications.



Presentation Transcript


  1. Seed-based Generation of Personalized Bio-Ontologies for Information Extraction. Cui Tao & David W. Embley, Data Extraction Research Group, Department of Computer Science, Brigham Young University. Supported by NSF.

  2. Personalized Information Harvesting • Biology domain: huge (other domains too) • Data collection • Many (web) sources • Only a tiny subpart wanted • Personalized view • Personalized extraction ontology • Creation: Form specification • Application: Seed-based harvesting

  3. Example • Harvest information about large proteins in humans and the functions of these proteins • Find proteins in humans that are >20 kDa • Find all the proteins in humans that serve as receptors • ... • Information sources: various online repositories • NCBI • GeneCards • The Gene Ontology • GPM Proteomics Database • …

  4. Extraction Ontology T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15 … Instance: ^\d{1,5}(\.\d{1,2})? Context: weight|wght|wt\. Unit: kilodaltons?|kdas?|kds?|das?|daltons? …

  5. Extraction Ontology Unfortunately Hard to Construct T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15 … Instance: ^\d{1,5}(\.\d{1,2})? Context: weight|wght|wt\. Unit: kilodaltons?|kdas?|kds?|das?|daltons? …
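The recognizer on slides 4 and 5 pairs a value pattern with context keywords and unit expressions. A minimal Python sketch of how the three pieces might be combined is given here. The three expressions are copied from the slide (the leading `^` is dropped so that a value can be found mid-sentence); the combining rule, the 20-character gap, and the names `find_weights` and `to_kda` are assumptions for illustration, not the authors' implementation.

```python
import re

# Expressions taken from the slide; the leading "^" of the instance
# pattern is dropped so values can be located inside running text.
INSTANCE = r"\d{1,5}(?:\.\d{1,2})?"
CONTEXT = r"weight|wght|wt\."
UNIT = r"kilodaltons?|kdas?|kds?|das?|daltons?"

# Assumed combining rule: a context keyword, a short run of non-word
# characters, then the value followed by a unit.
MOLECULAR_WEIGHT = re.compile(
    rf"(?:{CONTEXT})\W{{0,20}}?(?P<value>{INSTANCE})\s*(?P<unit>{UNIT})\b",
    re.IGNORECASE,
)


def find_weights(text):
    """Return (value, unit) pairs for weights that appear near a context keyword."""
    return [
        (float(m.group("value")), m.group("unit"))
        for m in MOLECULAR_WEIGHT.finditer(text)
    ]


def to_kda(value, unit):
    """Normalize a recognized weight to kilodaltons (hypothetical helper)."""
    return value if unit.lower().startswith("k") else value / 1000.0
```

With this sketch, `find_weights("Molecular weight: 59.62 kDa")` yields one pair, while a bare "20 kDa" with no context keyword is ignored. That is the point of a context expression: it trades some recall for fewer false matches.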

  6. Can We Make Construction Easier? • Forms • General familiarity • Reasonable conceptual framework • Appropriate correspondence • Transformable to ontological descriptions • Capable of accepting source data • Instance recognizers • Some pre-existing instance recognizers • Lexicons • Need for a full extraction ontology?

  7. Form Creation User Interface • Basic form-construction facilities: • single-entry field • multiple-entry field • nested form • …

  8. Created Sample Form

  9. Generated Ontology View
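The step from a user-built form (slides 7 and 8) to an ontology view (slide 9) can be pictured as a walk over the form's fields. The sketch below is only an illustration: the `FormField` class, the edge tuples, the "1"/"*" cardinality markers, and the sample field labels are all hypothetical, since the transcript does not show the actual form or the generated ontology.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class FormField:
    label: str
    multiple: bool = False  # True for a multiple-entry field
    children: List["FormField"] = field(default_factory=list)  # nested form


def ontology_view(parent: str, fields: List[FormField]) -> List[Tuple[str, str, str]]:
    """Turn a form into (parent, child, max-cardinality) edges.

    Assumption: a single-entry field allows at most one value ("1"),
    a multiple-entry field allows many ("*").
    """
    edges = []
    for f in fields:
        edges.append((parent, f.label, "*" if f.multiple else "1"))
        edges.extend(ontology_view(f.label, f.children))
    return edges


# Hypothetical form for the protein example; the labels are illustrative.
protein_form = [
    FormField("Name", multiple=True),
    FormField("MolecularWeight"),
    FormField("Function", multiple=True, children=[FormField("Evidence")]),
]

edges = ontology_view("Protein", protein_form)
```

Each edge corresponds to one relationship in the ontology view; a nested form contributes edges whose parent is the enclosing field.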

  10. Source-to-Form Mapping: Establishing a Seed

  11. Source-to-Form Mapping: Establishing a Seed

  12. Source-to-Form Mapping: Establishing a Seed

  13. Source-to-Form Mapping: Establishing a Seed

  14. Almost Ready to Harvest … • Need reading path: DOM-tree structure • Need to resolve mapping problems • Split/Merge • Union/Selection

  15. Almost Ready to Harvest … • Need reading path: DOM-tree structure • Need to resolve mapping problems • Split/Merge • Union/Selection Name Voltage-dependent anion-selective channel protein 3 VDAC-3 hVDAC3 Outer mitochondrial membrane Protein porin 3

  16. Almost Ready to Harvest … • Need reading path: DOM-tree structure • Need to resolve mapping problems • Split/Merge • Union/Selection Name Voltage-dependent anion-selective channel protein 3 VDAC-3 hVDAC3 Outer mitochondrial membrane Protein porin 3

  17. Almost Ready to Harvest … • Need reading path: DOM-tree structure • Need to resolve mapping problems • Split/Merge • Union/Selection Name T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15

  18. Almost Ready to Harvest … • Need reading path: DOM-tree structure • Need to resolve mapping problems • Split/Merge • Union/Selection Name T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15
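Slides 14 to 18 name two prerequisites for harvesting: a reading path through the DOM tree and a way to resolve mapping problems such as splitting. A rough Python sketch of both ideas follows. It assumes well-formed markup so that the standard-library XML parser can stand in for an HTML DOM, and it assumes the names in a cell are separated by semicolons; the page snippets and the functions `find_path`, `follow`, and `harvest` are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-ins for two pages from the same site.
SEED_PAGE = (
    "<html><body><table><tr><td>Name</td>"
    "<td>T-complex protein 1 subunit theta; TCP-1-theta; CCT-theta</td>"
    "</tr></table></body></html>"
)
OTHER_PAGE = (
    "<html><body><table><tr><td>Name</td>"
    "<td>Voltage-dependent anion-selective channel protein 3; VDAC-3; hVDAC3</td>"
    "</tr></table></body></html>"
)


def find_path(node, seed, path=()):
    """Return child indexes leading to the first element whose text contains seed."""
    if node.text and seed in node.text:
        return path
    for i, child in enumerate(node):
        found = find_path(child, seed, path + (i,))
        if found is not None:
            return found
    return None


def follow(node, path):
    """Walk a child-index path down from node."""
    for i in path:
        node = node[i]
    return node


def harvest(page, path, sep=";"):
    """Read the cell at path and split it into individual values."""
    text = follow(ET.fromstring(page), path).text or ""
    return [part.strip() for part in text.split(sep) if part.strip()]


# The seed value the user mapped to the form field fixes the reading path.
reading_path = find_path(ET.fromstring(SEED_PAGE), "TCP-1-theta")
names = harvest(OTHER_PAGE, reading_path)
```

The path learned from one seed page is reused on a sibling page with the same layout. Real pages are rarely well-formed, so a tolerant HTML parser would be needed in practice.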

  19. Can Now Harvest Name

  20. Can Now Harvest Name 14-3-3 protein epsilon Mitochondrial import stimulation factor L subunit Protein kinase C inhibitor protein-1 KCIP-1 14-3-3E

  21. Can Now Harvest Name Voltage-dependent anion-selective channel protein 3 VDAC-3 hVDAC3 Outer mitochondrial membrane Protein porin 3

  22. Can Now Harvest Name Tryptophanyl-tRNA synthetase, mitochondrial precursor EC 6.1.1.2 Tryptophan--tRNA ligase TrpRS (Mt)TrpRS

  23. Harvesting Populates Ontology

  24. Harvesting Populates Ontology Also helps adjust ontology constraints

  25. Can Harvest from Additional Sites Name T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15

  26. Larger Picture • Information Harvesting • Not only for biology, but for any application • Not only from one site, but from many sites • Opportunities • Extraction ontology creation • Automating site-to-site information harvesting • Automatic semantic annotation • Data/Ontology transformations

  27. Extraction Ontology Creation: Lexicons Name 14-3-3 protein epsilon Mitochondrial import stimulation factor L subunit Protein kinase C inhibitor protein-1 KCIP-1 14-3-3E … 14-3-3 protein epsilon Mitochondrial import stimulation factor L subunit Protein kinase C inhibitor protein-1 KCIP-1 14-3-3E … T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15 … Tryptophanyl-tRNA synthetase, mitochondrial precursor EC 6.1.1.2 Tryptophan--tRNA ligase TrpRS (Mt)TrpRS … Name T-complex protein 1 subunit theta TCP-1-theta CCT-theta Renal carcinoma antigen NY-REN-15 Name Tryptophanyl-tRNA synthetase, mitochondrial precursor EC 6.1.1.2 Tryptophan--tRNA ligase TrpRS (Mt)TrpRS
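Slide 27 suggests that the Name values harvested from individual pages accumulate into a lexicon. One plausible way to do this, sketched here in Python, is to take the union of the harvested lists and compile it into a single matcher. The function names and the longest-first ordering are assumptions; the sample names come from the slides.

```python
import re


def build_lexicon(harvested_lists):
    """Union the harvested name lists, ignoring case-only duplicates."""
    lexicon = {}
    for names in harvested_lists:
        for name in names:
            lexicon.setdefault(name.lower(), name)
    return lexicon


def lexicon_pattern(lexicon):
    """Compile the lexicon into one regex, longest entries first so that
    a long name is preferred over a shorter name it contains."""
    entries = sorted(lexicon.values(), key=len, reverse=True)
    return re.compile("|".join(re.escape(e) for e in entries), re.IGNORECASE)


harvested = [
    ["14-3-3 protein epsilon", "KCIP-1", "14-3-3E"],
    ["T-complex protein 1 subunit theta", "TCP-1-theta", "CCT-theta"],
    ["TrpRS", "(Mt)TrpRS"],
]
name_lexicon = build_lexicon(harvested)
name_matcher = lexicon_pattern(name_lexicon)
hits = [m.group(0) for m in name_matcher.finditer("CCT-theta binds KCIP-1.")]
```

Because every entry is passed through `re.escape`, names containing parentheses or dots, such as "(Mt)TrpRS", are matched literally.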

  28. Automatic Source-to-Form Mapping

  29. Automatic Semantic Annotation

  30. Extraction Ontology Creation: Instance Recognizers • Number patterns • Context keywords and phrases

  31. Automatic Source-to-Form Mapping

  32. Automatic Semantic Annotation Recognize and annotate with respect to an ontology
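Annotating "with respect to an ontology" can be read as attaching a concept label to each recognized span of text. The Python sketch below produces stand-off annotations under that reading. The `RECOGNIZERS` table, the concept labels, and the tuple layout are hypothetical; the unit expression is the one shown on slide 4, and the names are taken from the harvested examples.

```python
import re

# One recognizer per ontology concept (illustrative subset).
RECOGNIZERS = {
    "Name": re.compile(r"TCP-1-theta|CCT-theta|VDAC-3", re.IGNORECASE),
    "MolecularWeight": re.compile(
        r"\d{1,5}(?:\.\d{1,2})?\s*(?:kilodaltons?|kdas?|kds?|das?|daltons?)\b",
        re.IGNORECASE,
    ),
}


def annotate(text):
    """Return (start, end, concept, matched text) tuples in document order."""
    spans = []
    for concept, pattern in RECOGNIZERS.items():
        for m in pattern.finditer(text):
            spans.append((m.start(), m.end(), concept, m.group(0)))
    return sorted(spans)


sentence = "CCT-theta has a weight of 59.62 kDa."
annotations = annotate(sentence)
```

Stand-off tuples leave the source text untouched, so the same page can be annotated against several ontologies without the annotations interfering with one another.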

  33. Ontology Transformation • OWL & RDF: standard ontology languages • XML & XMLS: data exchange • Forms: form filling to populate an ontology

  34. Ontology Transformation Transformations to and from all
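As a small illustration of transforming "to and from", the Python sketch below converts a filled-in form record to XML and back. It covers only the form and XML corners of the picture; the record layout, the element naming, and the functions `record_to_xml` and `xml_to_record` are assumptions, and OWL, RDF, and XML Schema output are not attempted.

```python
import xml.etree.ElementTree as ET


def record_to_xml(concept, record):
    """Serialize {field: [values]} as <concept><field>value</field>...</concept>."""
    root = ET.Element(concept)
    for field_name, values in record.items():
        for value in values:
            ET.SubElement(root, field_name).text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_record(xml_text):
    """Inverse of record_to_xml: rebuild the concept name and field values."""
    root = ET.fromstring(xml_text)
    record = {}
    for child in root:
        record.setdefault(child.tag, []).append(child.text or "")
    return root.tag, record


# Hypothetical filled-in form for one harvested protein.
filled_form = {"Name": ["VDAC-3", "hVDAC3"], "Function": ["porin"]}
xml_text = record_to_xml("Protein", filled_form)
concept, restored = xml_to_record(xml_text)
```

A round trip that returns the original record is a quick check that no information is lost by the transformation, at least for flat records with valid XML element names.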

  35. Contributions • Personalized ontology creation • Mapping from sources • Information harvesting • Opportunities for further work • Extraction ontology creation • Semantic Annotation • Data/Ontology transformations www.deg.byu.edu
