Fixing a Legacy Lexicon
Mike Maxwell
maxwell@ldc.upenn.edu
University of Pennsylvania, Linguistic Data Consortium and Department of Linguistics
3600 Market Street, Philadelphia, PA 19104 U.S.A.
www.ldc.upenn.edu
The Problem
• Shoebox lexicon of Mawukakan
• Inconsistencies:
  • Inconsistencies among POSs etc. (fixable in Shoebox)
  • Spelling errors: English, French and Mawu (import into Word, use English and French spell correctors)
  • Errors in hierarchy: missing fields, mis-ordered fields
  • Missing reciprocal cross-references
• Absolutely typical of Shoebox-style lexicons
• Repairs needed for
  • Archiving
  • Publication
  • Export/import
Old Solution
• Parse until error, characterize error, find error in Shoebox, fix error…
• Find all errors, send list to user, user fixes them, re-do…
Partial solutions
• Inconsistencies among POSs etc.
  • Fixable in Shoebox
  • Helpful addition: counts of POS tokens
• Spelling errors
  • Import into Word with automatic marking of language, use English and French spell correctors to fix errors, export back to Shoebox
  • No solution for Mawu spelling (n-grams)
• Missing cross-references
  • Easy to find with shell script, send list to users
  • Would be better to mark errors in lexicon
  • Missing bi-directional references
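The POS counts and the cross-reference check mentioned above can be sketched in a few lines of Python. This is an illustration only, not the shell script the slides refer to. It assumes one field per line, `\w` as the headword marker and `\pos` as the part-of-speech marker (both appear in the sample records), and it treats `\syn` as the cross-reference marker, which is an assumption. The function names are hypothetical.

```python
import re
from collections import Counter


def parse_records(text, head="w"):
    """Split SFM text into records, each a list of (marker, value) pairs."""
    records = []
    for line in text.splitlines():
        match = re.match(r"\\(\S+)\s*(.*)", line)
        if match is None:
            continue  # this sketch skips blank and continuation lines
        marker, value = match.group(1), match.group(2).strip()
        if marker == head or not records:
            records.append([])  # a headword field opens a new record
        records[-1].append((marker, value))
    return records


def pos_counts(records, pos="pos"):
    """Count each part-of-speech value; rare values are often typos."""
    return Counter(v for rec in records for m, v in rec if m == pos)


def missing_reciprocals(records, head="w", xref="syn"):
    """Return sorted (source, target) pairs whose reverse link is absent.

    Assumption: `xref` values are headwords of other records.
    """
    links = {}
    for rec in records:
        heads = [v for m, v in rec if m == head]
        if not heads:
            continue
        links.setdefault(heads[0], set()).update(v for m, v in rec if m == xref)
    return sorted(
        (src, dst)
        for src, targets in links.items()
        for dst in targets
        if src not in links.get(dst, set())
    )
```

A cross-reference to a headword that does not exist at all is reported the same way as a one-directional link, since in both cases the target record has no link back.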
Partial solutions
• Errors in hierarchy

  \w ba’el
  \pos v.i
  \ex Yax bo’on ta sna Antonio.
  \exEn I’m going to Antonio’s house.|
  \ex Ban yax ba’at?
  \exEn Where are you going?
  \exFr Ou allez-vous?
Repairing the hierarchy
• Solution: special-purpose parser, mark SFM file with errors and suggested fixes
• Need hierarchy
  • Cannot (reliably) extract hierarchy from Shoebox typ file
  • User or consultant must provide definition of hierarchy, as regex:
    (w ( (pos defn (ex exEn exFr)* (syn)?) | (num pos defn (ex exEn exFr)* (syn)?)+ ))
• Tool to extract a list of all occurring record/field patterns
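One way to put such a hierarchy definition to work is to translate it into an ordinary regular expression over marker names and test each record's marker sequence against it. The sketch below does that in Python and also tallies the record/field patterns that occur. It is not the special-purpose parser described on the slide: it only says whether a record conforms and cannot suggest fixes. The names `compile_hierarchy`, `conforms` and `pattern_counts` are hypothetical.

```python
import re
from collections import Counter

# Hierarchy definition as given on the slide (marker names without backslashes).
HIERARCHY = ("(w ( (pos defn (ex exEn exFr)* (syn)?) "
             "| (num pos defn (ex exEn exFr)* (syn)?)+ ))")


def compile_hierarchy(expr):
    """Translate the marker-level regex into a Python regex.

    Each marker name becomes a literal followed by one space, so that
    `ex` cannot match the start of `exEn`.
    """
    parts = []
    for tok in re.findall(r"[A-Za-z_]\w*|[()|*+?]", expr):
        if tok[0].isalpha() or tok[0] == "_":
            parts.append("(?:" + re.escape(tok) + " )")
        elif tok == "(":
            parts.append("(?:")  # grouping only, no capture needed
        else:
            parts.append(tok)  # ")", "|", "*", "+", "?" pass through
    return re.compile("".join(parts))


def conforms(markers, pattern):
    """True if the record's marker sequence matches the whole hierarchy."""
    return pattern.fullmatch("".join(m + " " for m in markers)) is not None


def pattern_counts(marker_lists):
    """Tally every distinct marker sequence that occurs in the lexicon."""
    return Counter(" ".join(markers) for markers in marker_lists)


pattern = compile_hierarchy(HIERARCHY)
good = ["w", "pos", "defn", "ex", "exEn", "exFr"]
bad = ["w", "pos", "defn", "ex", "exEn", "ex", "exEn", "exFr"]
print(conforms(good, pattern), conforms(bad, pattern))
```

Sorting the output of `pattern_counts` by frequency gives a quick view of which patterns are common (probably intended) and which occur once or twice (probably errors).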
Sample output
• Regex:
  … (ex exEn exFr)* …
• Input:
  …
  \ex Yax bo’on ta sna Antonio.
  \exEn I’m going to Antonio’s house.|
  \ex Ban yax ba’at?
  \exEn Where are you going?
  \exFr Ou allez-vous?
• Output:
  …
  \ex Yax bo’on ta sna Antonio.
  \exEn I’m going to Antonio’s house.|
  \exFr ***Missing field inserted***
  \ex Ban yax ba’at?
  \exEn Where are you going?
  \exFr Ou allez-vous?
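The missing-field repair shown above can be imitated, for the single case of a fixed group such as `(ex exEn exFr)`, with a greedy pass over the fields. This is a simplified sketch and not the parser used to produce the sample output. It assumes the fields arrive as (marker, value) pairs, and the name `fill_group` is hypothetical. Only the placeholder text is taken from the slide.

```python
PLACEHOLDER = "***Missing field inserted***"


def fill_group(fields, group=("ex", "exEn", "exFr")):
    """Insert placeholder fields so every `group` occurrence is complete.

    Greedy sketch: assumes markers inside a group keep their order and
    that only omissions occur (no mis-ordered or duplicated fields).
    """
    out = []
    pos = 0  # index in `group` of the next expected marker

    def pad(markers):
        out.extend((m, PLACEHOLDER) for m in markers)

    for marker, value in fields:
        if marker not in group:
            if pos:  # a group was left unfinished; close it first
                pad(group[pos:])
                pos = 0
            out.append((marker, value))
            continue
        idx = group.index(marker)
        if idx < pos:  # a new group started before the old one ended
            pad(group[pos:])
            pos = 0
        pad(group[pos:idx])  # fill anything skipped inside this group
        out.append((marker, value))
        pos = (idx + 1) % len(group)
    if pos:
        pad(group[pos:])
    return out


fields = [
    ("ex", "Yax bo’on ta sna Antonio."),
    ("exEn", "I’m going to Antonio’s house."),
    ("ex", "Ban yax ba’at?"),
    ("exEn", "Where are you going?"),
    ("exFr", "Ou allez-vous?"),
]
for marker, value in fill_group(fields):
    print("\\" + marker, value)
```

Writing the placeholder into the file, instead of reporting the error in a separate list, lets the user search for the marked spots directly in Shoebox.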
More sample output
• Input:
  \w yax
  \pos AUX-V
  \pos Adj
  \defn green
• Output:
  \w yax
  \pos AUX-V
  \pos Adj ***Erroneous field***
  \defn green
More sample output
• Input:
  \w yax
  \pos AUX-V
  \foo bar
  \degn green
• Output:
  \w yax
  \error ***Unable to parse record structure***
  \pos AUX-V
  \foo bar
  \degn green
The next language
• Nahuatl lexicon
  • 11,000 entries
  • 5,000 record/field patterns
  • 147 SFMs…