200 likes | 313 Views
Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?. Nancy Ide and Jean Veronis Proc KB&KB’93 Workshop, 1993, pp257-266 http://www.cs.vassar.edu/faculty/ide/pubs.html As (mis-)interpreted by Peter Clark. The Postulates of MRD Work.
E N D
Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time? Nancy Ide and Jean Veronis Proc KB&KB’93 Workshop, 1993, pp257-266 http://www.cs.vassar.edu/faculty/ide/pubs.html As (mis-)interpreted by Peter Clark
The Postulates of MRD Work • P1: MRDs contain information that is useful for NLP e.g.:
The Postulates of MRD Work • P1: MRDs contain information that is useful for NLP • P2: This info is relatively easy to extract from MRDs e.g., extraction of hypernyms (generalizations): Dipper isa Ladle isa Spoon isa Utensil
But… • Not much to show for it so far (1993) • handful of limited and imperfect taxonomies • few studies on the quality of knowledge in MRDs • few studies on extracting more complex info
Complaints… • P1: useful info in MRDs: • C1a: 50%-70% of info in dictionaries is “garbled” • C1b: sense definitions concept usage (“real concepts”) • C1c: some types of knowledge simply not there • P2: Info can be easily extracted • Most successes have been for hypernyms only • C2a: MRD formats are a nightmare to deal with • C2b: A virtually open-ended set of ways of describing facts • C2c: Bootstrapping: Need a KB to build a KB from a MRD
C1a: MRD information is “garbled” • Multiple people, multiple years effort • Space restrictions, syntactic restrictions • Particular problem 1: • Attachment of terms too high (21%-34%) • e.g., “pan” and “bottle” are “vessels”, but “cup” and “bowl” are simply “containers” • occurs fairly randomly • Categories less clear at top levels • “fork” and “spoon” is ok, but “implement” and “utensil” = ? • Sometimes no word there to refer to a concept • leads to circular definitions
C1a: MRD information is “garbled” • Particular problem 2: • Categories less clear at top levels • “fork” and “spoon” is ok, but “implement” and “utensil” = ? • Leads to disjuncts e.g. “implement or utensil” • Sometimes no word there to refer to a concept • leads to circular definitions • leads to “covert categories”, e.g., INSTRUMENTAL-OBJECT (a hypernym for “tool”, “utensil”, “instrument”, and “implement”)
C1a: MRD information is “garbled” • Particular problem 3: • And hypernyms are relatively consistent!! Other semantic relations are given in a less consistent way, e.g., smell, taste, etc.
C1b: sense definitions concept usage (“real concepts”) • Ambiguity of word senses, e.g., • 87% of words in a sample fit > 1 word sense • Word senses don’t reflect actual use • Word sense distinctions differ between MRDs • level of detail • way lines are drawn between senses • no definitive set of distinctions
C1c: some types of knowledge simply not there • no broad contextual or world knowledge, e.g., • no connection between “lawn” and “house”, or between “ash” and “tobacco” • “restaurant, eating house, eating place -- (a building where people go to eat)” [WordNet] • No mention that it’s a commercial business, e.g., for “the waitress collected the check.”
C2a: MRD formats are a nightmare to deal with • Ambiguities / inconsistencies in typesetter format • Complex grammars for entries • Conventions are inconsistent, e.g. bracketing for • “Canopic jar, urn, or vase” vs. • “Junggar Pendi, Dzungaria, or Zungaria” • Need a lot of hand pre-processing • not much general value to this • is a vast task in itself • not many processed dictionaries available
C2b: A virtually open-ended set of ways of describing facts But… There is “virtually an open-ended set of phrases…”
C2c: Bootstrapping: Need a KB to build a KB Need knowledge to do NLP on MRDs! • e.g. “carry by means of a handle” vs. “carry by means of a wagon” • But undisambiguated hierarchy is unusable, e.g., • “saucepan” isa “pan” isa “leaf” • need to build your KB before you even start on the MRD
Synthesis • Underlying postulate of P1 and P2: • P0: Large KBs cannot be built by hand • Counterexamples: • Cyc • Dictionaries themselves! • And besides… • KBs are too hard to extract from MRDs • don’t contain all the knowledge needed • But: MRD contributions: • understanding the structure of dictionaries • convergance of NLP, lexicography, and electronic publishing interests
Ways forward… • Combining Knowledge Sources: • One dictionary has 55%-70% of “problematic cases” [of incompleteness], but 5 dictionaries reduced this to 5% • Also should combine knowledge from corpora as a means of “filling out” KBs • Prediction: • KBs built by people, using corpora and text extraction technology tools, and combined together by hand (Schubert-style; Code4; Ikarus)
Ways forward… • MRDs will become encoded more consistently • Better analysis needed of the types of knowledge needed for NLP • perhaps don’t need the kind of precision in a KB • Exploitation of associational information • Very useful for sense disambiguation (e.g., Harabagiu)
Ways forward… • Lexicographers increasingly interested in using lexical databases for their work • Could create a NLP-like KB directly • Create explicit semantic links between word entries • Ensure consistency of content (e.g., using templates/frames ensures all the important information is provided)
Ways forward… • Lexicographers increasingly interested in using lexical databases for their work • Could create a NLP-like KB directly • Create explicit semantic links between word entries • Ensure consistency of content (e.g., using templates/frames ensures all the important information is provided) • Ensure consistency of “metatext” (i.e., be consistent about how semantic relations are stated) • Ensure consistency of sense division • e.g., “cup” and “bowl” have two senses (literal and metonymic) but “glass” only has one (literal) could spot this inconsistency