Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?

Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time? Nancy Ide and Jean Veronis Proc KB&KB’93 Workshop, 1993, pp257-266 http://www.cs.vassar.edu/faculty/ide/pubs.html As (mis-)interpreted by Peter Clark

The Postulates of MRD Work • P1: MRDs contain information that is useful for NLP e.g.:

The Postulates of MRD Work • P1: MRDs contain information that is useful for NLP • P2: This info is relatively easy to extract from MRDs e.g., extraction of hypernyms (generalizations):  Dipper isa Ladle isa Spoon isa Utensil

But… • Not much to show for it so far (1993) • handful of limited and imperfect taxonomies • few studies on the quality of knowledge in MRDs • few studies on extracting more complex info

Complaints… • P1: useful info in MRDs: • C1a: 50%-70% of info in dictionaries is “garbled” • C1b: sense definitions  concept usage (“real concepts”) • C1c: some types of knowledge simply not there • P2: Info can be easily extracted • Most successes have been for hypernyms only • C2a: MRD formats are a nightmare to deal with • C2b: A virtually open-ended set of ways of describing facts • C2c: Bootstrapping: Need a KB to build a KB from a MRD

C1a: MRD information is “garbled” • Multiple people, multiple years effort • Space restrictions, syntactic restrictions • Particular problem 1: • Attachment of terms too high (21%-34%) • e.g., “pan” and “bottle” are “vessels”, but “cup” and “bowl” are simply “containers” • occurs fairly randomly • Categories less clear at top levels • “fork” and “spoon” is ok, but “implement” and “utensil” = ? • Sometimes no word there to refer to a concept • leads to circular definitions

C1a: MRD information is “garbled”

C1a: MRD information is “garbled” • Particular problem 2: • Categories less clear at top levels • “fork” and “spoon” is ok, but “implement” and “utensil” = ? • Leads to disjuncts e.g. “implement or utensil” • Sometimes no word there to refer to a concept • leads to circular definitions • leads to “covert categories”, e.g., INSTRUMENTAL-OBJECT (a hypernym for “tool”, “utensil”, “instrument”, and “implement”)

C1a: MRD information is “garbled” • Particular problem 3: • And hypernyms are relatively consistent!! Other semantic relations are given in a less consistent way, e.g., smell, taste, etc.

C1b: sense definitions  concept usage (“real concepts”) • Ambiguity of word senses, e.g., • 87% of words in a sample fit > 1 word sense • Word senses don’t reflect actual use • Word sense distinctions differ between MRDs • level of detail • way lines are drawn between senses • no definitive set of distinctions

C1c: some types of knowledge simply not there • no broad contextual or world knowledge, e.g., • no connection between “lawn” and “house”, or between “ash” and “tobacco” • “restaurant, eating house, eating place -- (a building where people go to eat)” [WordNet] • No mention that it’s a commercial business, e.g., for “the waitress collected the check.”

C2a: MRD formats are a nightmare to deal with • Ambiguities / inconsistencies in typesetter format • Complex grammars for entries • Conventions are inconsistent, e.g. bracketing for • “Canopic jar, urn, or vase” vs. • “Junggar Pendi, Dzungaria, or Zungaria” • Need a lot of hand pre-processing • not much general value to this • is a vast task in itself • not many processed dictionaries available

C2b: A virtually open-ended set of ways of describing facts But… There is “virtually an open-ended set of phrases…”

C2c: Bootstrapping: Need a KB to build a KB Need knowledge to do NLP on MRDs! • e.g. “carry by means of a handle” vs. “carry by means of a wagon” • But undisambiguated hierarchy is unusable, e.g., • “saucepan” isa “pan” isa “leaf” • need to build your KB before you even start on the MRD

Synthesis • Underlying postulate of P1 and P2: • P0: Large KBs cannot be built by hand • Counterexamples: • Cyc • Dictionaries themselves! • And besides… • KBs are too hard to extract from MRDs • don’t contain all the knowledge needed • But: MRD contributions: • understanding the structure of dictionaries • convergance of NLP, lexicography, and electronic publishing interests

Ways forward… • Combining Knowledge Sources: • One dictionary has 55%-70% of “problematic cases” [of incompleteness], but 5 dictionaries reduced this to 5% • Also should combine knowledge from corpora as a means of “filling out” KBs • Prediction: • KBs built by people, using corpora and text extraction technology tools, and combined together by hand (Schubert-style; Code4; Ikarus)

Ways forward… • MRDs will become encoded more consistently • Better analysis needed of the types of knowledge needed for NLP • perhaps don’t need the kind of precision in a KB • Exploitation of associational information • Very useful for sense disambiguation (e.g., Harabagiu)

Ways forward… • Lexicographers increasingly interested in using lexical databases for their work • Could create a NLP-like KB directly • Create explicit semantic links between word entries • Ensure consistency of content (e.g., using templates/frames ensures all the important information is provided)

Ways forward…

Ways forward… • Lexicographers increasingly interested in using lexical databases for their work • Could create a NLP-like KB directly • Create explicit semantic links between word entries • Ensure consistency of content (e.g., using templates/frames ensures all the important information is provided) • Ensure consistency of “metatext” (i.e., be consistent about how semantic relations are stated) • Ensure consistency of sense division • e.g., “cup” and “bowl” have two senses (literal and metonymic) but “glass” only has one (literal)  could spot this inconsistency

Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?

Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?

Presentation Transcript

Video Conferencing

Enabling Intensive Life Cycle Support Management by Machine – Readable Unique Item Identifiers

Extracting Information from Spoken User Input A Machine Learning Approach

Extracting Regional Knowledge from Spatial Datasets

An Alternative Machine-Readable Coding Scheme for Unit Dose Medications

Statistical NLP: Lecture 10

Knowledge acquisition (Ch. 17 Durkin)

Automatic Derivation of Thesauri Thesauri Construction

INTER-LAKES BASES PRODUCTS

Dictionaries

Ontologies

Dictionaries

MARC21 Bibliographic Record

Quick Guide to CAM Dictionaries

Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping

CS240A: Databases and Knowledge Bases Time Ontology and Representations

Natural Language to Machine Readable Format

Nested Dictionaries

Logical Agents

Dictionaries and the Order of Knowledge