
Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?



  1. Extracting Knowledge-Bases from Machine-Readable Dictionaries: Have We Wasted Our Time?
  Nancy Ide and Jean Veronis
  Proc KB&KB’93 Workshop, 1993, pp257-266
  http://www.cs.vassar.edu/faculty/ide/pubs.html
  As (mis-)interpreted by Peter Clark

  2. The Postulates of MRD Work
  • P1: MRDs contain information that is useful for NLP

  3. The Postulates of MRD Work
  • P1: MRDs contain information that is useful for NLP
  • P2: This info is relatively easy to extract from MRDs
    • e.g., extraction of hypernyms (generalizations), sketched below:
      Dipper isa Ladle isa Spoon isa Utensil
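The extraction heuristic behind chains like this is simple enough to show concretely. Below is a minimal sketch of the classic genus-finding trick (take the head noun of the defining phrase as the hypernym), assuming toy definitions I made up; real MRD entries are far messier, as the complaints on later slides make clear.

```python
import re

# Toy definitions standing in for real MRD entries. The defining phrase
# is usually "a/an/the <modifiers> HEAD <post-modifiers>"; the HEAD noun
# is taken as the hypernym (genus), the rest as differentia.
DEFINITIONS = {
    "dipper": "a ladle with a long handle",
    "ladle": "a large deep spoon for serving liquids",
    "spoon": "a utensil consisting of a small shallow bowl on a handle",
}

# Words that typically begin the post-modifier (differentia) part.
STOP = {"with", "for", "of", "used", "consisting", "that", "which", "in", "on"}

def genus_term(definition):
    """Return the last word before the first post-modifier marker."""
    words = re.findall(r"[a-z]+", definition.lower())
    if words and words[0] in {"a", "an", "the"}:
        words = words[1:]
    head = []
    for w in words:
        if w in STOP:
            break
        head.append(w)
    return head[-1] if head else None

for word in ("dipper", "ladle", "spoon"):
    print(word, "isa", genus_term(DEFINITIONS[word]))
# dipper isa ladle / ladle isa spoon / spoon isa utensil
```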

  4. But…
  • Not much to show for it so far (1993):
    • handful of limited and imperfect taxonomies
    • few studies on the quality of knowledge in MRDs
    • few studies on extracting more complex info

  5. Complaints…
  • P1: useful info in MRDs:
    • C1a: 50%-70% of info in dictionaries is “garbled”
    • C1b: sense definitions ≠ concept usage (“real concepts”)
    • C1c: some types of knowledge simply not there
  • P2: Info can be easily extracted:
    • Most successes have been for hypernyms only
    • C2a: MRD formats are a nightmare to deal with
    • C2b: A virtually open-ended set of ways of describing facts
    • C2c: Bootstrapping: Need a KB to build a KB from an MRD

  6. C1a: MRD information is “garbled”
  • Multiple people, multiple years of effort
  • Space restrictions, syntactic restrictions
  • Particular problem 1: attachment of terms too high (21%-34%)
    • e.g., “pan” and “bottle” are “vessels”, but “cup” and “bowl” are simply “containers”
    • occurs fairly randomly
  • Categories less clear at top levels
    • “fork” and “spoon” is ok, but “implement” and “utensil” = ?
  • Sometimes no word there to refer to a concept
    • leads to circular definitions (a detection sketch follows)
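Circularity, at least, is mechanically detectable once a word-level hypernym graph has been extracted. A small sketch, assuming hypothetical toy edges (not taken from any actual dictionary):

```python
# Walk up the extracted word-level isa chain from a start word and
# report when a word recurs, i.e., the definitions are circular.
HYPERNYM = {
    "implement": "tool",
    "tool": "instrument",
    "instrument": "implement",  # the circle closes: no higher word exists
    "cup": "container",
    "container": "object",
}

def find_cycle(start, hypernym):
    """Return the cycle reachable from `start`, or None."""
    seen, chain, node = set(), [], start
    while node in hypernym:
        if node in seen:
            return chain[chain.index(node):]
        seen.add(node)
        chain.append(node)
        node = hypernym[node]
    return None

print(find_cycle("implement", HYPERNYM))  # ['implement', 'tool', 'instrument']
print(find_cycle("cup", HYPERNYM))        # None: the chain tops out at 'object'
```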

  7. C1a: MRD information is “garbled”

  8. C1a: MRD information is “garbled”
  • Particular problem 2:
    • Categories less clear at top levels
      • “fork” and “spoon” is ok, but “implement” and “utensil” = ?
      • Leads to disjuncts, e.g., “implement or utensil”
    • Sometimes no word there to refer to a concept
      • leads to circular definitions
      • leads to “covert categories”, e.g., INSTRUMENTAL-OBJECT (a hypernym for “tool”, “utensil”, “instrument”, and “implement”)
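One speculative way to surface such covert categories is to treat disjunct genus terms as evidence that the disjuncts are siblings under an unnamed parent, and union them into synthetic groups. A sketch of that idea; the union-find grouping and all the data are my own illustration, not a method from the paper:

```python
from collections import defaultdict

# headword -> disjunct genus terms appearing in its definition (toy data)
DISJUNCT_GENERA = {
    "tool": ["implement", "instrument"],
    "utensil": ["implement", "container"],
    "spatula": ["implement", "utensil"],
}

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb

# Terms that co-occur in a disjunction end up in the same group.
for terms in DISJUNCT_GENERA.values():
    for t in terms[1:]:
        union(terms[0], t)

groups = defaultdict(set)
for t in parent:
    groups[find(t)].add(t)
print(list(groups.values()))
# one covert category (call it INSTRUMENTAL-OBJECT) covering
# implement / instrument / container / utensil
```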

  9. C1a: MRD information is “garbled”
  • Particular problem 3:
    • And hypernyms are the relatively consistent case!! Other semantic relations (e.g., smell, taste) are given far less consistently.

  10. C1b: sense definitions ≠ concept usage (“real concepts”)
  • Ambiguity of word senses, e.g.:
    • 87% of words in a sample fit > 1 word sense
  • Word senses don’t reflect actual use
  • Word sense distinctions differ between MRDs:
    • level of detail
    • way lines are drawn between senses
    • no definitive set of distinctions

  11. C1c: some types of knowledge simply not there
  • no broad contextual or world knowledge, e.g.:
    • no connection between “lawn” and “house”, or between “ash” and “tobacco”
    • “restaurant, eating house, eating place -- (a building where people go to eat)” [WordNet]
      • No mention that it’s a commercial business, which is needed, e.g., to interpret “the waitress collected the check.”

  12. C2a: MRD formats are a nightmare to deal with
  • Ambiguities / inconsistencies in typesetter format
  • Complex grammars for entries
  • Conventions are inconsistent, e.g., bracketing for:
    • “Canopic jar, urn, or vase” vs.
    • “Junggar Pendi, Dzungaria, or Zungaria”
  • Need a lot of hand pre-processing:
    • not much general value to this
    • is a vast task in itself
    • not many processed dictionaries available
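To make the "complex grammars for entries" point concrete, here is a hedged sketch of a parser for one hypothetical typesetter convention ("headword pos. 1. sense. 2. sense."). Everything about the format is an assumption; the hand pre-processing the slide mentions is exactly what it takes to get real dictionary tapes into any shape this regular.

```python
import re

# A toy entry in an invented, already-cleaned convention. Real tapes mix
# font codes and inconsistent bracketing, breaking any single grammar.
ENTRY = ("pan n. 1. a broad shallow vessel used in cooking. "
         "2. either of the receptacles on a balance.")

def parse_entry(raw):
    # headword, part of speech, then the numbered sense definitions
    head, pos, body = re.match(r"(\w+)\s+(\w+)\.\s+(.*)", raw).groups()
    senses = [s.strip(" .") for s in re.split(r"\s*\d+\.\s*", body) if s]
    return {"headword": head, "pos": pos, "senses": senses}

print(parse_entry(ENTRY))
# {'headword': 'pan', 'pos': 'n',
#  'senses': ['a broad shallow vessel used in cooking',
#             'either of the receptacles on a balance']}
```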

  13. C2b: A virtually open-ended set of ways of describing facts
  • But… there is “virtually an open-ended set of phrases…”
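The obvious attack is a list of defining-formula patterns, and the complaint is precisely that the list never closes. A small sketch of the approach; the patterns and relation names are purely illustrative assumptions, not a real inventory:

```python
import re

# A few hand-written defining formulas mapped to relation labels.
# C2b's point: no finite list like this ever covers real definitions.
PATTERNS = [
    (r"\bused (?:for|in) (\w+)", "PURPOSE"),
    (r"\bpart of (?:a|an|the) (\w+)", "PART-OF"),
    (r"\bmade of (\w+)", "MATERIAL"),
]

def extract_relations(definition):
    found = []
    for pattern, relation in PATTERNS:
        for m in re.finditer(pattern, definition):
            found.append((relation, m.group(1)))
    return found

print(extract_relations("a utensil made of metal, used for frying"))
# [('PURPOSE', 'frying'), ('MATERIAL', 'metal')]
```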

  14. C2c: Bootstrapping: Need a KB to build a KB
  • Need knowledge to do NLP on MRDs!
    • e.g., “carry by means of a handle” vs. “carry by means of a wagon”
  • But undisambiguated hierarchy is unusable, e.g.:
    • “saucepan” isa “pan” isa “leaf” (via the betel-leaf sense of “pan”)
  • need to build your KB before you even start on the MRD
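A tiny sketch of why the chain breaks, assuming illustrative sense labels: sense-tagging the links stops the nonsense, but assigning those tags is itself a disambiguation task that needs knowledge, which is the bootstrapping circle.

```python
# Word-level links mix senses; sense-level links don't. The labels
# pan_1 / pan_2 etc. are invented for illustration.
WORD_ISA = {"saucepan": "pan", "pan": "leaf"}         # 'pan' here = betel leaf!
SENSE_ISA = {"saucepan_1": "pan_1", "pan_2": "leaf_1"}

def ancestors(node, isa):
    out = []
    while node in isa:
        node = isa[node]
        out.append(node)
    return out

print(ancestors("saucepan", WORD_ISA))     # ['pan', 'leaf']  -- nonsense
print(ancestors("saucepan_1", SENSE_ISA))  # ['pan_1']        -- stops correctly
```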

  15. Synthesis
  • Underlying postulate of P1 and P2:
    • P0: Large KBs cannot be built by hand
  • Counterexamples:
    • Cyc
    • Dictionaries themselves!
  • And besides…
    • KBs are too hard to extract from MRDs
    • don’t contain all the knowledge needed
  • But: MRD contributions:
    • understanding the structure of dictionaries
    • convergence of NLP, lexicography, and electronic publishing interests

  16. Ways forward…
  • Combining Knowledge Sources:
    • One dictionary has 55%-70% of “problematic cases” [of incompleteness], but 5 dictionaries reduced this to 5%
    • Also should combine knowledge from corpora as a means of “filling out” KBs
  • Prediction:
    • KBs built by people, using corpora and text-extraction tools, and combined together by hand (Schubert-style; Code4; Ikarus)
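One way combining sources can work, sketched under my own assumptions (the slide reports the numbers, not the mechanism): take the majority hypernym proposed across several dictionaries, so a single garbled entry is outvoted. The five "dictionaries" here are hypothetical stand-ins.

```python
from collections import Counter

# word -> hypernym extracted from each of five (invented) dictionaries
EXTRACTED = {
    "cup": ["container", "vessel", "vessel", "vessel", "container"],
    "pan": ["vessel", "vessel", "utensil", "vessel", "vessel"],
}

def merged_hypernym(word):
    """Majority vote over the per-dictionary extractions."""
    return Counter(EXTRACTED[word]).most_common(1)[0][0]

for w in EXTRACTED:
    print(w, "isa", merged_hypernym(w))
# cup isa vessel / pan isa vessel
```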

  17. Ways forward…
  • MRDs will become encoded more consistently
  • Better analysis needed of the types of knowledge needed for NLP
    • perhaps we don’t need that kind of precision in a KB
  • Exploitation of associational information
    • Very useful for sense disambiguation (e.g., Harabagiu)
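A bare-bones sketch of what associational information buys for disambiguation, in the spirit of Lesk-style definition overlap (my choice of technique for illustration, not necessarily what Harabagiu's work uses); the glosses are toy stand-ins:

```python
# Pick the sense whose definition shares the most words with the context.
SENSES = {
    "ash_1": "the powdery residue left after something has burned",
    "ash_2": "a tall deciduous tree with winged seeds",
}

def disambiguate(context, senses):
    ctx = set(context.lower().split())
    return max(senses,
               key=lambda s: len(ctx & set(senses[s].lower().split())))

print(disambiguate("he tapped the ash from his cigarette after it burned down",
                   SENSES))
# 'ash_1' -- 'the', 'after', 'burned' overlap with the residue sense
```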

  18. Ways forward…
  • Lexicographers increasingly interested in using lexical databases for their work
  • Could create an NLP-like KB directly:
    • Create explicit semantic links between word entries
    • Ensure consistency of content (e.g., using templates/frames ensures all the important information is provided)

  19. Ways forward…

  20. Ways forward…
  • Lexicographers increasingly interested in using lexical databases for their work
  • Could create an NLP-like KB directly:
    • Create explicit semantic links between word entries
    • Ensure consistency of content (e.g., using templates/frames ensures all the important information is provided)
    • Ensure consistency of “metatext” (i.e., be consistent about how semantic relations are stated)
    • Ensure consistency of sense division
      • e.g., “cup” and “bowl” have two senses (literal and metonymic) but “glass” only has one (literal) → could spot this inconsistency (a validation sketch follows)
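A sketch of the template/frame idea: if every entry must fill the same slots, gaps like the missing metonymic sense of "glass" fall out of a mechanical validation pass. The slot and sense-type names are my illustrative assumptions.

```python
# Required structure for a frame-based lexical entry (invented names).
REQUIRED_SLOTS = {"hypernym", "senses"}
CONTAINER_SENSE_TYPES = {"literal", "metonymic"}  # expected for container words

ENTRIES = {
    "cup":   {"hypernym": "container", "senses": {"literal", "metonymic"}},
    "bowl":  {"hypernym": "container", "senses": {"literal", "metonymic"}},
    "glass": {"hypernym": "container", "senses": {"literal"}},
}

def validate(word, entry):
    """Report missing slots and missing expected sense types."""
    problems = [f"{word}: missing slot '{s}'"
                for s in REQUIRED_SLOTS - entry.keys()]
    if entry.get("hypernym") == "container":
        for t in CONTAINER_SENSE_TYPES - entry.get("senses", set()):
            problems.append(f"{word}: no {t} sense recorded")
    return problems

for w, e in ENTRIES.items():
    for p in validate(w, e):
        print(p)
# glass: no metonymic sense recorded
```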
