
Populating a Database from Parallel Texts using “Ontology-based” Information Extraction

Mary McGee Wood, Shenghui Wang (Dept of Computer Science, U. of Manchester); Valentin Tablan, Diana Maynard, Hamish Cunningham (Dept of Computer Science, U. of Sheffield); Susannah Lydon


Presentation Transcript


  1. Populating a Database from Parallel Texts using “Ontology-based” Information Extraction
     - Mary McGee Wood, Shenghui Wang: Dept of Computer Science, U. of Manchester
     - Valentin Tablan, Diana Maynard, Hamish Cunningham: Dept of Computer Science, U. of Sheffield
     - Susannah Lydon: Earth Science Education Unit, U. of Keele

  2. The hypothesis

  3. Overview
     - Parallel texts
     - Legacy data in the natural sciences
     - “Ontology-based” Information Extraction

  4. NLDB’04: a few running threads
     - Multiple / semi-overlapping text sources
     - Sophisticated vs shallow or statistical text processing
     - “Ontologies” are not the same as gazetteers or lexicons (or semantic nets!)
     - Autonomous agents vs HCC (Human-Computer Collaborative) approaches

  5. We are doing…
     - Highly homogeneous data sources
     - Shallow text processing
     - “Ontologies” only as a last resort
     - HCC approach

  6. We are not doing…
     - Heterogeneous data sources
     - Sophisticated language processing
     - Improvement of single-source IE or question-answering
     - Autonomous agents

  7. Parallel texts
     - Text descriptions in the traditional descriptive sciences
     - Descriptions of protein sequences and functions in molecular biology
     - Press coverage of news stories
     - Police witness-of-crime reports
     - (Semi-) automatic marking of free text answers in examinations

  8. Legacy data in the natural sciences
     Text descriptions in the traditional descriptive sciences:
     - Species descriptions in botany and zoology
     - Descriptions of diseases in medicine

  9. Data sources
     - Five species of Ranunculus (buttercups)
     - Six botanists’ text descriptions (Floras)

  10. Typical data
     R. acris L. - Meadow Buttercup. Erect perennial to 1m; basal leaves deeply palmately lobed, pubescent; flowers 15-25mm across; sepals not reflexed; achenes 2-3.5mm, glabrous, smooth, with short hooked beak; 2n=14.

  11. Hand Parsing & Correlation

  12. Results of hand-analysis of Ranunculus descriptions from six sources
     - Most data from one source only
     - Individual texts contain on average 39% of the total information for each species
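
One way to read the 39% figure is as per-source coverage: the share of the pooled facts for a species that a single description states. The sketch below illustrates that reading with invented toy data. The source names, the facts, and the functions `coverage_by_source` and `single_source_facts` are all hypothetical, and this is not the procedure used in the hand analysis, which the slides do not describe.

```python
from statistics import mean

# Hypothetical toy data: each source maps to the set of (character, value)
# facts it states for one species. These are NOT the hand-analysis data.
sources = {
    "flora_a": {("habit", "perennial"), ("height", "to 1m"), ("sepals", "not reflexed")},
    "flora_b": {("habit", "perennial"), ("achenes", "glabrous")},
    "flora_c": {("flowers", "15-25mm across")},
}


def coverage_by_source(sources):
    """Fraction of the pooled facts that each individual source contains."""
    pooled = set().union(*sources.values())
    return {name: len(facts) / len(pooled) for name, facts in sources.items()}


def single_source_facts(sources):
    """Facts that are stated by exactly one source."""
    pooled = set().union(*sources.values())
    return {
        fact
        for fact in pooled
        if sum(fact in facts for facts in sources.values()) == 1
    }


coverage = coverage_by_source(sources)
average_coverage = mean(coverage.values())
unique_facts = single_source_facts(sources)
```

With this toy input, four of the five pooled facts come from one source only, which mirrors the pattern reported on the slide without reproducing its numbers.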

  13. MultiFlora I
     Automatic compilation of accurate taxonomic databases from multiple non-computerised sources
     - Department of Computer Science, University of Manchester: Mary McGee Wood, David Rydeheard, Susannah Lydon
     - Department of Botany, Natural History Museum, London: Rob Huxley, David Sutton
     Supported by the BBSRC / EPSRC joint Bioinformatics Initiative, grant reference number 34/BIO12072

  14. GATE I

  15. Tagger output

  16. Parse trees

  17. Names & verbs
     ‘Basal leaves more or less deeply divided…’
     1231 semantics 179 191
       (qlf:[ne_tag(e13, offsets(179, 184)),
             name(e13, 'Basal'),
             realisation(e13, offsets(179, 184)),
             leave(e12),
             time(e12, present),
             aspect(e12, simple),
             voice(e12, active),
             realisation(e12, offsets(185, 191)),
             realisation(e12, offsets(185, 191)),
             lsubj(e12, e13)])
     1247 semantics 200 226
       (qlf:[divide(e14),
             adv(e14, less),
             adv(e14, deeply),
             time(e14, none),
             aspect(e14, simple),
             voice(e14, passive),
             into(e14, e15),
             count(e15, 3),
             realisation(e15, offsets(225, 226)),
             realisation(e14, offsets(200, 226)),
             realisation(e14, offsets(200, 226))])

  18. Template output (1)
     Erect perennial to 1m; basal leaves deeply palmately lobed, pubescent;
     HEAD | KIND | FEATURE | TYPE | KIND
     Erect Perennial to 1m measure unknown basal position pubescent leaves Prefix deeply palmately lobed

  19. Template output (2)
     flowers 15-25mm across; sepals not reflexed; achenes 2-3.5mm, glabrous, smooth, with short hooked beak;
     HEAD | KIND | FEATURE | TYPE | KIND | NEGATION
     flowers 15-25mm measure width across sepals reflexed true achenes short hooked smooth glabrous 2-3.5mm measure unknown
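
The template slides tag values such as "to 1m", "15-25mm" and "2-3.5mm" as type "measure". A shallow pattern match is enough to pull such values out of a description. The following is a minimal sketch, assuming a regular expression of my own (`MEASURE`) and a hypothetical helper `parse_measures`; it is not the grammar the system used, which the slides do not show.

```python
import re

# Assumed pattern: optional "to", a number, an optional "-number" range end,
# then a metric unit. The unit list is an illustrative assumption.
MEASURE = re.compile(
    r"(?:\b(?P<upto>to)\s+)?"
    r"(?P<low>\d+(?:\.\d+)?)"
    r"(?:\s*-\s*(?P<high>\d+(?:\.\d+)?))?"
    r"\s*(?P<unit>mm|cm|m)\b"
)


def parse_measures(text):
    """Return (low, high, unit) tuples; low is None for 'to N' upper bounds."""
    results = []
    for match in MEASURE.finditer(text):
        low = float(match.group("low"))
        high = float(match.group("high")) if match.group("high") else low
        if match.group("upto"):
            low = None  # "to 1m" gives only an upper bound
        results.append((low, high, match.group("unit")))
    return results


text = (
    "Erect perennial to 1m; flowers 15-25mm across; "
    "achenes 2-3.5mm, glabrous; 2n=14."
)
measures = parse_measures(text)
# measures == [(None, 1.0, "m"), (15.0, 25.0, "mm"), (2.0, 3.5, "mm")]
```

The chromosome count "2n=14" is not matched because no unit follows the digits, which is the behaviour a measure slot would want.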

  20. MultiFlora II: Combining Information Extraction and Knowledge Representation for Biodiversity Informatics
     - Department of Computer Science, University of Manchester: Mary McGee Wood, Susannah Lydon, Alan Rector
     - Natural Language Processing Group, University of Sheffield: Hamish Cunningham, Valentin Tablan, Diana Maynard
     - Department of Botany, Natural History Museum, London: Rob Huxley
     Supported by the BBSRC Bioinformatics and E-science Programme, grant reference number 34/BEP17049

  21. GATE II

  22. “Ontology-based” Information Extraction
     - “Ontology”: classes of heads, properties, and features
     - Gazetteers: instances of these classes
     - (Lexicons: not currently used)

  23. Head categories
     Specific plant parts:
     - Flower: Flower, floret, Fl
     - Leaf: leaf, leaves, Fronds
     - Petal: petal, honey-leaf, vexillum
     Collective categories:
     - PlantSeparatablePart: appendage, glume, tuber
     - PlantUnseparatablePart: beak, lobe, segment
     - SpecificRegionOfWhole: apex, border, head
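
The split between ontology classes and gazetteer instances can be pictured as a lookup from surface terms to class names. The sketch below uses the entries listed on this slide; the dictionary layout, the case-insensitive matching, and the function `tag_heads` are illustrative assumptions rather than the GATE configuration the project used.

```python
# Gazetteer entries copied from the slide; the data layout is an assumption.
HEAD_GAZETTEER = {
    "Flower": ["Flower", "floret", "Fl"],
    "Leaf": ["leaf", "leaves", "Fronds"],
    "Petal": ["petal", "honey-leaf", "vexillum"],
    "PlantSeparatablePart": ["appendage", "glume", "tuber"],
    "PlantUnseparatablePart": ["beak", "lobe", "segment"],
    "SpecificRegionOfWhole": ["apex", "border", "head"],
}

# Invert the gazetteer into a case-insensitive term -> class index.
TERM_TO_CLASS = {
    term.lower(): cls for cls, terms in HEAD_GAZETTEER.items() for term in terms
}


def tag_heads(text):
    """Return (token, class) pairs for tokens found in the head gazetteer."""
    tagged = []
    for raw in text.split():
        token = raw.strip(".,;:()").lower()
        if token in TERM_TO_CLASS:
            tagged.append((token, TERM_TO_CLASS[token]))
    return tagged


tags = tag_heads("basal leaves deeply palmately lobed; achenes with short hooked beak;")
# tags == [("leaves", "Leaf"), ("beak", "PlantUnseparatablePart")]
```

Exact-match lookup like this treats "lobed" and "lobe" as different strings, so "lobed" is left for a property or feature list to handle rather than being tagged as a head.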

  24. Ontology: Heads (figure: ontology-heads.eps)

  25. Properties
     - 2DShape: arching, linear, toothed
     - 3DShape: branching, thickened, tube
     - Colour: glossy, golden, greenish
     - Count: numerous, several

  26. Ontology: Properties

  27. Features
     - Habit: bush, shrub, succulent
     - MorphologicalProperty: dense, contiguous, separate
     - SurfaceProperty: pilose, pitted, rugose

  28. Ontology: Features

  29. More typical data
     Perennial herb with overwintering lf-rosettes from the short oblique to erect premorse stock up to 5 cm, rarely longer and more rhizome-like; roots white, rather fleshy, little branched.

  30. System output
     Head Class   | Head        | Property       | FeatClass   | Feature
     Plant        | herb        | hasLifeform    | Lifeform    | Perennial
     Leaf         | lf-rosettes | hasLifeform    | Lifeform    | overwintering
     PlantSepPart | stock       | hasRelProperty | RelProperty | short
     PlantSepPart | stock       | hasOrientation | Orientation | oblique to erect
     PlantSepPart | stock       | hasLength      | Length      | up to 5 cm
     PlantSepPart | stock       | hasRelProperty | RelProperty | rhizome-like
     Root         | roots       | hasColour      | Colour      | white
     Root         | roots       | hasShape3D     | Shape3D     | rather fleshy
     Root         | roots       | hasShape3D     | Shape3D     | little branched
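
Each row of the system output pairs a classified head with a classified feature, and in every row shown the property name is "has" followed by the feature class. A small record type makes that structure explicit. This is a sketch only: the class `Extraction`, its field names, and `features_by_head` are hypothetical and say nothing about how the system stores its output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    head_class: str
    head: str
    feat_class: str
    feature: str

    @property
    def prop(self):
        # In the rows shown on the slide, the property is "has" + feature class.
        return "has" + self.feat_class

    def as_row(self):
        """Return the five columns in the order used on the slide."""
        return (self.head_class, self.head, self.prop, self.feat_class, self.feature)


# Three rows taken from the slide's table.
rows = [
    Extraction("Plant", "herb", "Lifeform", "Perennial"),
    Extraction("Root", "roots", "Colour", "white"),
    Extraction("Root", "roots", "Shape3D", "little branched"),
]


def features_by_head(rows):
    """Group extracted (property, feature) pairs under each (head class, head)."""
    grouped = {}
    for row in rows:
        grouped.setdefault((row.head_class, row.head), []).append((row.prop, row.feature))
    return grouped


grouped = features_by_head(rows)
```

Grouping by head gives one entry per plant part, which is the shape a database record for a species description would need.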

  31. Precision
                                                     | R. acris | R. bulbosus | R. hederaceus | Avg
     Single description, average                     | 78       | 60          | 83            | 74
     Single description, average, for whole template | 78       | 60          | 83            | 74
     Merged, for whole template                      |          | 58          | 69            | 63

  32. Recall
                                                     | R. acris | R. bulbosus | R. hederaceus | Avg
     Single description, average                     | 70       | 55          | 74            | 66
     Single description, average, for whole template | 22       | 18          | 26            | 22
     Merged, for whole template                      | 69       | 61          | 82            | 71

  33. F-measure
                                                     | R. acris | R. bulbosus | R. hederaceus | Avg
     Single description, average                     | 73.78    | 57.39       | 78.24         | 69.77
     Single description, average, for whole template | 34.32    | 27.69       | 39.60         | 33.92
     Merged, for whole template                      | 65.86    | 59.46       | 74.94         | 66.76
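
The F-measure values on this slide appear consistent with the balanced F-measure, F = 2PR / (P + R), applied to the precision and recall figures on the two preceding slides. The slides do not state the formula, so that is an assumption, checked below against the single-description figures. The same formula can be solved for precision: the transcript gives only three precision values (58, 69, 63) for the merged row, and the merged recall and F-measure values imply per-species precisions of roughly 63, 58 and 69. That is an inference from the arithmetic, not a figure stated in the source. The function names are mine.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_from(f, recall):
    """Solve F = 2PR / (P + R) for P, given F and R."""
    return f * recall / (2 * recall - f)


# Single description, average: (precision, recall) from the slides.
single = {
    "R. acris": (78, 70),
    "R. bulbosus": (60, 55),
    "R. hederaceus": (83, 74),
}
single_f = {sp: round(f_measure(p, r), 2) for sp, (p, r) in single.items()}

# Merged, for whole template: (recall, F-measure) from the slides.
merged = {
    "R. acris": (69, 65.86),
    "R. bulbosus": (61, 59.46),
    "R. hederaceus": (82, 74.94),
}
implied_precision = {sp: round(precision_from(f, r)) for sp, (r, f) in merged.items()}
```

The `single_f` values come out as 73.78, 57.39 and 78.24, matching the first row of the F-measure table.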

  34. Information merging
     - Of all instances of missed information, percentage compensated for by merging: 50 46 55 50
     - Of total number of slots in template, percentage where merging allowed compensation for missed information: 25 29 18 24

  35. Information merging
     - These figures based on human judgement
     - Automated “merging reasoner” under active construction
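
To make the idea of compensation concrete, the sketch below merges slot-filled templates from several descriptions and measures how many slots missed by one description are filled by the merged result. It is a deliberately naive illustration: the slides say the merging figures rest on human judgement and that the automated reasoner was still under construction, so this is not that reasoner. The slot names, the toy templates, and the functions `merge_templates` and `compensated_share` are all hypothetical.

```python
def merge_templates(templates):
    """Naively merge slot -> value templates.

    The first filled value for a slot wins; differing later values are
    recorded as conflicts instead of being resolved.
    """
    merged, conflicts = {}, {}
    for template in templates:
        for slot, value in template.items():
            if value is None:
                continue  # this description did not fill the slot
            if slot not in merged:
                merged[slot] = value
            elif merged[slot] != value:
                conflicts.setdefault(slot, {merged[slot]}).add(value)
    return merged, conflicts


def compensated_share(template, merged):
    """Share of the slots missed by one template that the merged template fills."""
    missed = [slot for slot, value in template.items() if value is None]
    if not missed:
        return 0.0
    return sum(slot in merged for slot in missed) / len(missed)


# Hypothetical extraction results for one species from three descriptions.
t1 = {"habit": "perennial", "flower_width": "15-25mm", "sepals": None, "achene_surface": None}
t2 = {"habit": "perennial", "flower_width": None, "sepals": "not reflexed", "achene_surface": None}
t3 = {"habit": "perennial", "flower_width": "15-25mm", "sepals": None, "achene_surface": None}

merged, conflicts = merge_templates([t1, t2, t3])
share = compensated_share(t1, merged)
```

Here the first description misses two slots and merging recovers one of them, so `share` is 0.5. Recording conflicts rather than resolving them leaves the decision to a human, in keeping with the human-computer collaborative approach described earlier.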

  36. Future work: short term
     - Fine-tuning to improve precision
     - (Semi-) automatic template correlation heuristics
     - (Semi-) automatic data correlation heuristics
     - Extend coverage and evaluation

  37. Future targets
     Techniques:
     - Merging reasoner
     - Temporal reasoner
     Data types:
     - Large-scale legacy data in biodiversity studies
     - Free text annotations in Bioinformatics databases
     - …
