Fixing a Legacy Lexicon
Mike Maxwell, maxwell@ldc.upenn.edu, University of Pennsylvania

Presentation Transcript


  1. Fixing a Legacy Lexicon Mike Maxwell maxwell@ldc.upenn.edu University of Pennsylvania Linguistic Data Consortium and Department of Linguistics 3600 Market Street, Philadelphia, PA 19104 U.S.A. www.ldc.upenn.edu

  2. The Problem
     • Shoebox lexicon of Mawukakan
     • Inconsistencies:
       • Inconsistencies among POSs etc. (fixable in Shoebox)
       • Spelling errors: English, French and Mawu (import into Word, use English and French spell correctors)
       • Errors in hierarchy: missing fields, mis-ordered fields
       • Missing reciprocal cross-references
     • Absolutely typical of Shoebox-style lexicons
     • Repairs needed for
       • Archiving
       • Publication
       • Export/import

  3. Old Solution
     • Parse until error, characterize error, find error in Shoebox, fix error…
     • Find all errors, send list to user, user fixes them, re-do…

  4. Partial solutions
     • Inconsistencies among POSs etc.
       • Fixable in Shoebox
       • Helpful addition: counts of POS tokens
     • Spelling errors
       • Import into Word with automatic marking of language, use English and French spell correctors to fix errors, export back to Shoebox
       • No solution for Mawu spelling (n-grams)
     • Missing cross-references
       • Easy to find with shell script, send list to users
       • Would be better to mark errors in lexicon
       • Missing bi-directional references
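The POS token counts and the search for missing reciprocal cross-references can be sketched in a few lines of Python. This is only an illustration, not the shell script mentioned on the slide. It assumes one field per line, that every record starts with \w, and that \syn holds the headword of the cross-referenced entry; the sample entries are made up.

```python
from collections import Counter

# Made-up sample data in Shoebox-style standard format (one field per line).
SAMPLE = r"""
\w big
\pos adj
\syn large

\w large
\pos adj

\w small
\pos adj.
\syn little
"""

def parse_records(text):
    """Split SFM text into records, each a list of (marker, value) pairs.

    Assumption: a \\w field starts a new record. Lines that do not start
    with a backslash (blank or continuation lines) are ignored here.
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue
        marker, _, value = line[1:].partition(" ")
        if marker == "w" or not records:
            records.append([])
        records[-1].append((marker, value.strip()))
    return records

def pos_counts(records):
    """Count how often each \\pos value occurs; rare values are suspects."""
    return Counter(v for rec in records for m, v in rec if m == "pos")

def missing_reciprocals(records, ref_marker="syn"):
    """Return (source, target) pairs where target does not point back."""
    refs = {}
    for rec in records:
        head = rec[0][1]
        refs.setdefault(head, set()).update(v for m, v in rec if m == ref_marker)
    return sorted(
        (src, tgt)
        for src, targets in refs.items()
        for tgt in targets
        if src not in refs.get(tgt, set())
    )

if __name__ == "__main__":
    recs = parse_records(SAMPLE)
    print(pos_counts(recs))
    print(missing_reciprocals(recs))
```

In the sample, the count separates "adj" from the one-off "adj.", and both cross-references are reported: "large" has no link back to "big", and "little" has no entry at all.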

  5. Partial solutions
     • Errors in hierarchy
       \w ba’el
       \pos v.i
       \ex Yax bo’on ta sna Antonio.
       \exEn I’m going to Antonio’s house.|
       \ex Ban yax ba’at?
       \exEn Where are you going?
       \exFr Ou allez-vous?

  6. Repairing the hierarchy
     • Solution: special-purpose parser, mark SFM file with errors and suggested fixes
     • Need hierarchy
       • Cannot (reliably) extract hierarchy from Shoebox typ file
       • User or consultant must provide definition of hierarchy, as regex:
         (w ( (pos defn (ex exEn exFr)* (syn)?) | (num pos defn (ex exEn exFr)* (syn)?)+ ))
     • Tool to extract a list of all occurring record/field patterns
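One way to put the hierarchy regex to work is to reduce each record to its sequence of field markers and match that sequence as a whole. The sketch below is an assumption about how this could be done in Python, not the parser described in the talk. It turns every marker name in the slide's regex into a token followed by a space, then compiles the result in verbose mode so the layout whitespace is ignored. Marker names are assumed to consist of letters only, and all function names are hypothetical.

```python
import re
from collections import Counter

# Hierarchy definition copied from the slide.
SLIDE_HIERARCHY = (
    "(w ( (pos defn (ex exEn exFr)* (syn)?) "
    "| (num pos defn (ex exEn exFr)* (syn)?)+ ))"
)

def compile_hierarchy(spec):
    """Compile a marker-level regex into a Python regex over 'm1 m2 ... '.

    Each marker name becomes a non-capturing group ending in an escaped
    space, so 'ex' can never match the start of 'exEn'.
    """
    pattern = re.sub(r"[A-Za-z]+", lambda m: "(?:" + m.group() + r"\ )", spec)
    return re.compile(pattern, re.VERBOSE)

def markers(record_text):
    """Return the field markers of one record, in order of appearance."""
    return re.findall(r"^[ \t]*\\(\S+)", record_text, re.MULTILINE)

def conforms(hierarchy, marker_seq):
    """True if the whole marker sequence matches the hierarchy."""
    return hierarchy.fullmatch("".join(m + " " for m in marker_seq)) is not None

def pattern_counts(record_texts):
    """Count every record/field pattern that occurs in the lexicon."""
    return Counter(" ".join(markers(r)) for r in record_texts)

if __name__ == "__main__":
    hierarchy = compile_hierarchy(SLIDE_HIERARCHY)
    record = "\\w yax\n\\pos Adj\n\\defn green\n"
    print(markers(record), conforms(hierarchy, markers(record)))
    print(pattern_counts([record, record]))
```

Sorting the output of `pattern_counts` by frequency gives a quick view of which patterns are common (probably legitimate) and which occur once or twice (probably errors), which may help when writing the hierarchy in the first place.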

  7. Sample output
     • Regex: … (ex exEn exFr)* …
     • Input:
       …
       \ex Yax bo’on ta sna Antonio.
       \exEn I’m going to Antonio’s house.|
       \ex Ban yax ba’at?
       \exEn Where are you going?
       \exFr Ou allez-vous?
     • Output:
       …
       \ex Yax bo’on ta sna Antonio.
       \exEn I’m going to Antonio’s house.|
       \exFr ***Missing field inserted***
       \ex Ban yax ba’at?
       \exEn Where are you going?
       \exFr Ou allez-vous?
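The repair shown in this sample can be approximated for the single group (ex exEn exFr) with a short Python function. This is a simplified sketch that handles only that one fixed group, not a general parser driven by the hierarchy regex; the function name and the (marker, value) representation are assumptions made for illustration.

```python
MISSING = "***Missing field inserted***"
GROUP = ("ex", "exEn", "exFr")  # the group from the slide's regex

def fill_example_groups(fields, group=GROUP):
    """Insert a placeholder for each group member missing after group[0].

    fields is a list of (marker, value) pairs in record order. Fields
    outside a group are copied through unchanged.
    """
    out = []
    i = 0
    while i < len(fields):
        marker, _ = fields[i]
        out.append(fields[i])
        i += 1
        if marker != group[0]:
            continue
        # Inside a group: the remaining members must follow in order.
        for expected in group[1:]:
            if i < len(fields) and fields[i][0] == expected:
                out.append(fields[i])
                i += 1
            else:
                out.append((expected, MISSING))
    return out

if __name__ == "__main__":
    record = [
        ("ex", "Yax bo’on ta sna Antonio."),
        ("exEn", "I’m going to Antonio’s house."),
        ("ex", "Ban yax ba’at?"),
        ("exEn", "Where are you going?"),
        ("exFr", "Ou allez-vous?"),
    ]
    for marker, value in fill_example_groups(record):
        print(f"\\{marker} {value}")
```

Writing the placeholder into the file, rather than into a separate error list, lets the user search for the marker text in Shoebox and fix each case in place.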

  8. More sample output
     • Input:
       \w yax
       \pos AUX-V
       \pos Adj
       \defn green
     • Output:
       \w yax
       \pos AUX-V
       \pos Adj ***Erroneous field***
       \defn green

  9. More sample output
     • Input:
       \w yax
       \pos AUX-V
       \foo bar
       \degn green
     • Output:
       \w yax
       \error ***Unable to parse record structure***
       \pos AUX-V
       \foo bar
       \degn green
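The two kinds of output in these samples (a single erroneous field versus a record that cannot be parsed at all) can be imitated with a simple heuristic. The sketch below is not the talk's parser. It assumes the hierarchy from slide 6, hand-translated into a Python regex over space-terminated marker names, and it tries deleting one field at a time, starting from the end of the record. If the record conforms without some field, that field is marked as erroneous; if no single deletion helps, an \error field is inserted after the headword.

```python
import re

# Slide 6 hierarchy, hand-translated: each marker is followed by a space.
HIERARCHY = re.compile(
    r"w (?:pos defn (?:ex exEn exFr )*(?:syn )?"
    r"|(?:num pos defn (?:ex exEn exFr )*(?:syn )?)+)"
)

def conforms(markers):
    """True if the whole marker sequence matches the hierarchy."""
    return HIERARCHY.fullmatch("".join(m + " " for m in markers)) is not None

def diagnose(fields):
    """Return the record's SFM lines, annotated if the structure is wrong.

    fields is a list of (marker, value) pairs; the first is the headword.
    """
    markers = [m for m, _ in fields]
    lines = [f"\\{m} {v}" for m, v in fields]
    if conforms(markers):
        return lines
    # Heuristic: scan from the end so that, of two duplicate fields,
    # the later one is the one marked.
    for i in range(len(fields) - 1, 0, -1):
        if conforms(markers[:i] + markers[i + 1:]):
            lines[i] += " ***Erroneous field***"
            return lines
    error = "\\error ***Unable to parse record structure***"
    return lines[:1] + [error] + lines[1:]

if __name__ == "__main__":
    slide8 = [("w", "yax"), ("pos", "AUX-V"), ("pos", "Adj"), ("defn", "green")]
    slide9 = [("w", "yax"), ("pos", "AUX-V"), ("foo", "bar"), ("degn", "green")]
    print("\n".join(diagnose(slide8)))
    print("\n".join(diagnose(slide9)))
```

On the two inputs from the slides this reproduces the annotations shown, but a record with two or more independent problems falls straight through to the \error case, which a real parser with error recovery would presumably handle more gracefully.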

  10. The next language
     • Nahuatl lexicon
       • 11,000 entries
       • 5,000 record/field patterns
       • 147 SFMs…
