Populating a Database from Parallel Texts using “Ontology-based” Information Extraction
Mary McGee Wood, Shenghui Wang
Dept of Computer Science, U. of Manchester
Valentin Tablan, Diana Maynard, Hamish Cunningham
Dept of Computer Science, U. of Sheffield
Susannah Lydon
Earth Science Education Unit, U. of Keele
Overview
- Parallel texts
- Legacy data in the natural sciences
- “Ontology-based” Information Extraction
NLDB’04: a few running threads
- Multiple / semi-overlapping text sources
- Sophisticated vs shallow or statistical text processing
- “Ontologies” are not the same as gazetteers or lexicons (or semantic nets!)
- Autonomous agents vs HCC (Human-Computer Collaborative) approaches
We are doing…
- Highly homogeneous data sources
- Shallow text processing
- “Ontologies” only as a last resort
- HCC approach
We are not doing…
- Heterogeneous data sources
- Sophisticated language processing
- Improvement of single-source IE or question-answering
- Autonomous agents
Parallel texts
- Text descriptions in the traditional descriptive sciences
- Descriptions of protein sequences and functions in molecular biology
- Press coverage of news stories
- Police witness-of-crime reports
- (Semi-) automatic marking of free text answers in examinations
Legacy data in the natural sciences
Text descriptions in the traditional descriptive sciences:
- Species descriptions in botany and zoology
- Descriptions of diseases in medicine
Data sources
- Five species of Ranunculus (buttercups)
- Six botanists’ text descriptions (Floras)
Typical data
R. acris L. - Meadow Buttercup. Erect perennial to 1m; basal leaves deeply palmately lobed, pubescent; flowers 15-25mm across; sepals not reflexed; achenes 2-3.5mm, glabrous, smooth, with short hooked beak; 2n=14.
Results of hand-analysis of Ranunculus descriptions from six sources
- Most data from one source only
- Individual texts contain on average 39% of the total information for each species
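A minimal sketch of how a per-source coverage figure such as the 39% above could be computed, assuming each description has been reduced by hand to a set of (head, feature) facts. The source names and fact sets below are invented for illustration; they are not the project's data, and the slides do not say how the hand-analysis was scored.

```python
def coverage(descriptions):
    """Return each source's share of the union of all facts, as a fraction."""
    all_facts = set().union(*descriptions.values())
    return {src: len(facts) / len(all_facts) for src, facts in descriptions.items()}


# Hypothetical fact sets for one species from two Floras (illustrative only).
descriptions = {
    "flora_a": {("leaves", "pubescent"), ("sepals", "not reflexed")},
    "flora_b": {("leaves", "pubescent"), ("achenes", "glabrous"), ("achenes", "smooth")},
}

shares = coverage(descriptions)
average = sum(shares.values()) / len(shares)
print(shares, average)
```

With more sources that overlap only partly, the average share drops, which is the situation the hand-analysis describes.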
MultiFlora I
Automatic compilation of accurate taxonomic databases from multiple non-computerised sources
Department of Computer Science, University of Manchester: Mary McGee Wood, David Rydeheard, Susannah Lydon
Department of Botany, Natural History Museum, London: Rob Huxley, David Sutton
Supported by the BBSRC / EPSRC joint Bioinformatics Initiative, grant reference number 34/BIO12072
Names & verbs
‘Basal leaves more or less deeply divided…’
1231 semantics 179 191
(qlf:[ne_tag(e13, offsets(179, 184)), name(e13, 'Basal'), realisation(e13, offsets(179, 184)),
      leave(e12), time(e12, present), aspect(e12, simple), voice(e12, active),
      realisation(e12, offsets(185, 191)), realisation(e12, offsets(185, 191)), lsubj(e12, e13)])
1247 semantics 200 226
(qlf:[divide(e14), adv(e14, less), adv(e14, deeply), time(e14, none), aspect(e14, simple), voice(e14, passive),
      into(e14, e15), count(e15, 3), realisation(e15, offsets(225, 226)),
      realisation(e14, offsets(200, 226)), realisation(e14, offsets(200, 226))])
Template output (1)
Erect perennial to 1m; basal leaves deeply palmately lobed, pubescent;
HEAD KIND FEATURE TYPE KIND
Erect Perennial to 1m measure unknown basal position pubescent leaves Prefix deeply palmately lobed
Template output (2)
flowers 15-25mm across; sepals not reflexed; achenes 2-3.5mm, glabrous, smooth, with short hooked beak;
HEAD KIND FEATURE TYPE KIND NEGATION
flowers 15-25mm measure width across sepals reflexed true achenes short hooked smooth glabrous 2-3.5mm measure unknown
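The templates above tag values such as 15-25mm and 2-3.5mm as measures. A minimal sketch of how shallow pattern matching might pick these out, assuming a regular expression over the raw description; the pattern and the function name `find_measures` are illustrative and are not taken from the MultiFlora system.

```python
import re

# Optional lower bound, an upper bound, and a unit, e.g. "15-25mm" or "1m".
MEASURE = re.compile(r"(?:(\d+(?:\.\d+)?)\s*-\s*)?(\d+(?:\.\d+)?)\s*(mm|cm|m)\b")


def find_measures(text):
    """Return (low, high, unit) tuples; low is None when only one bound is given."""
    measures = []
    for match in MEASURE.finditer(text):
        low, high, unit = match.groups()
        measures.append((float(low) if low else None, float(high), unit))
    return measures


sample = "Erect perennial to 1m; flowers 15-25mm across; achenes 2-3.5mm"
print(find_measures(sample))
```

A pattern like this finds the value but not what it measures; deciding between width, length, or "unknown" still depends on nearby words such as "across".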
MultiFlora II: Combining Information Extraction and Knowledge Representation for Biodiversity Informatics
Department of Computer Science, University of Manchester: Mary McGee Wood, Susannah Lydon, Alan Rector
Natural Language Processing Group, University of Sheffield: Hamish Cunningham, Valentin Tablan, Diana Maynard
Department of Botany, Natural History Museum, London: Rob Huxley
Supported by the BBSRC Bioinformatics and E-science Programme, grant reference number 34/BEP17049
“Ontology-based” Information Extraction
- “Ontology”: classes of heads, properties, and features
- Gazetteers: instances of these classes
- (Lexicons: not currently used)
Head categories
Specific plant parts:
- Flower: Flower, floret, Fl
- Leaf: leaf, leaves, Fronds
- Petal: petal, honey-leaf, vexillum
Collective categories:
- PlantSeparatablePart: appendage, glume, tuber
- PlantUnseparatablePart: beak, lobe, segment
- SpecificRegionOfWhole: apex, border, head
Ontology: Heads
[Figure: ontology-heads.eps]
Properties
- 2DShape: arching, linear, toothed
- 3DShape: branching, thickened, tube
- Colour: glossy, golden, greenish
- Count: numerous, several
Features
- Habit: bush, shrub, succulent
- MorphologicalProperty: dense, contiguous, separate
- SurfaceProperty: pilose, pitted, rugose
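The split between ontology classes and gazetteer instances can be sketched as a plain lookup table. This is an illustration only, assuming a dictionary keyed by class name and whitespace-style tokenisation; it is not the gazetteer component the project used, and the names `GAZETTEER` and `annotate` are hypothetical. The entries are a few of the examples listed on the slides.

```python
import re

# Class name -> instance terms (a small subset of the examples on the slides).
GAZETTEER = {
    "Flower": ["flower", "floret", "fl"],
    "Leaf": ["leaf", "leaves", "fronds"],
    "Petal": ["petal", "honey-leaf", "vexillum"],
    "Colour": ["glossy", "golden", "greenish"],
    "SurfaceProperty": ["pilose", "pitted", "rugose"],
}

# Invert to term -> class for constant-time lookup.
TERM_TO_CLASS = {term: cls for cls, terms in GAZETTEER.items() for term in terms}


def annotate(text):
    """Return (token, class) pairs for every token found in the gazetteer."""
    tokens = re.findall(r"[A-Za-z-]+", text.lower())
    return [(tok, TERM_TO_CLASS[tok]) for tok in tokens if tok in TERM_TO_CLASS]


print(annotate("Leaves glossy, greenish; petals golden."))
```

In this sketch "petals" goes unmatched because only "petal" is listed, which shows why a gazetteer has to enumerate variants such as "leaf" and "leaves" explicitly.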
More typical data
Perennial herb with overwintering lf-rosettes from the short oblique to erect premorse stock up to 5 cm, rarely longer and more rhizome-like; roots white, rather fleshy, little branched.
System output
Head Class    Head         Property        FeatClass    Feature
Plant         herb         hasLifeform     Lifeform     Perennial
Leaf          lf-rosettes  hasLifeform     Lifeform     overwintering
PlantSepPart  stock        hasRelProperty  RelProperty  short
PlantSepPart  stock        hasOrientation  Orientation  oblique to erect
PlantSepPart  stock        hasLength       Length       up to 5 cm
PlantSepPart  stock        hasRelProperty  RelProperty  rhizome-like
Root          roots        hasColour       Colour       white
Root          roots        hasShape3D      Shape3D      rather fleshy
Root          roots        hasShape3D      Shape3D      little branched
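Since the stated goal is to populate a database, one way the rows above might be stored is sketched here with Python's built-in `sqlite3` module. The table name `extraction` and its column names are assumptions made for this example; the slides do not describe the actual database schema.

```python
import sqlite3

# The nine rows of system output shown on the slide.
ROWS = [
    ("Plant", "herb", "hasLifeform", "Lifeform", "Perennial"),
    ("Leaf", "lf-rosettes", "hasLifeform", "Lifeform", "overwintering"),
    ("PlantSepPart", "stock", "hasRelProperty", "RelProperty", "short"),
    ("PlantSepPart", "stock", "hasOrientation", "Orientation", "oblique to erect"),
    ("PlantSepPart", "stock", "hasLength", "Length", "up to 5 cm"),
    ("PlantSepPart", "stock", "hasRelProperty", "RelProperty", "rhizome-like"),
    ("Root", "roots", "hasColour", "Colour", "white"),
    ("Root", "roots", "hasShape3D", "Shape3D", "rather fleshy"),
    ("Root", "roots", "hasShape3D", "Shape3D", "little branched"),
]


def build_db(rows):
    """Create an in-memory table (hypothetical schema) and load the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE extraction "
        "(head_class TEXT, head TEXT, property TEXT, feat_class TEXT, feature TEXT)"
    )
    conn.executemany("INSERT INTO extraction VALUES (?, ?, ?, ?, ?)", rows)
    return conn


conn = build_db(ROWS)
query = "SELECT feature FROM extraction WHERE head_class = ? ORDER BY rowid"
root_features = [row[0] for row in conn.execute(query, ("Root",))]
print(root_features)
```

Keeping one row per (head, property, feature) triple lets a head such as "stock" carry several properties without changing the schema.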
Precision
                                                 R. acris  R. bulbosus  R. hederaceus  Avg
Single description, average                      78        60           83             74
Single description, average, for whole template  78        60           83             74
Merged, for whole template                       -         58           69             63
Recall
                                                 R. acris  R. bulbosus  R. hederaceus  Avg
Single description, average                      70        55           74             66
Single description, average, for whole template  22        18           26             22
Merged, for whole template                       69        61           82             71
F-measure
                                                 R. acris  R. bulbosus  R. hederaceus  Avg
Single description, average                      73.78     57.39        78.24          69.77
Single description, average, for whole template  34.32     27.69        39.60          33.92
Merged, for whole template                       65.86     59.46        74.94          66.76
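The F-measure figures are consistent with the balanced harmonic mean of precision and recall, F = 2PR / (P + R). The slides do not state the formula, so the sketch below assumes it and checks it against the published values. The helper `precision_from` is an added convenience that solves the same formula for precision.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_from(f, recall):
    """Solve F = 2PR / (P + R) for P, given F and R."""
    return f * recall / (2 * recall - f)


# R. acris, single description: P = 78, R = 70
print(round(f_measure(78, 70), 2))
# R. acris, single description scored against the whole template: P = 78, R = 22
print(round(f_measure(78, 22), 2))
# Average, merged: P = 63, R = 71
print(round(f_measure(63, 71), 2))
# R. acris, merged: F = 65.86, R = 69
print(round(precision_from(65.86, 69)))
```

The merged row on the Precision slide lists only three values. Solving the formula for R. acris (F = 65.86, R = 69) gives a precision of about 63; that is an inference from the other two tables, not a figure reported in the source.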
Information merging
                                                                    R. acris  R. bulbosus  R. hederaceus  Avg
Of all instances of missed information,
percentage compensated for by merging                               50        46           55             50
Of total number of slots in template, percentage where
merging allowed compensation for missed information                 25        29           18             24
Information merging
- These figures are based on human judgement
- Automated “merging reasoner” under active construction
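The merging reasoner itself is not described in the slides, so the following is not that reasoner. It is a minimal sketch of the underlying idea, assuming each description yields a mapping from slot name to value: a slot missed in one source is filled from another, and disagreements are recorded for a human to resolve, in keeping with the HCC approach. The slot names and the second source's flower width are invented for illustration.

```python
def merge_templates(templates):
    """Merge per-source slot fills; the first value seen wins, clashes are recorded."""
    merged, conflicts = {}, {}
    for slots in templates.values():
        for slot, value in slots.items():
            if slot not in merged:
                merged[slot] = value
            elif merged[slot] != value:
                # Keep every competing value so a person can decide.
                conflicts.setdefault(slot, {merged[slot]}).add(value)
    return merged, conflicts


# Hypothetical slot fills from two descriptions of the same species.
templates = {
    "source_a": {"flowers.width": "15-25mm", "sepals.reflexed": "no"},
    "source_b": {"flowers.width": "18-25mm", "achenes.length": "2-3.5mm"},
}

merged, conflicts = merge_templates(templates)
print(merged)
print(conflicts)
```

Here the merged template gains the achene length that the first source lacks, while the two flower widths are flagged rather than silently reconciled.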
Future work – short term
- Fine-tuning to improve precision
- (Semi-) automatic template correlation heuristics
- (Semi-) automatic data correlation heuristics
- Extend coverage and evaluation
Future targets
Techniques:
- Merging reasoner
- Temporal reasoner
Data types:
- Large-scale legacy data in biodiversity studies
- Free text annotations in Bioinformatics databases
- …