Corpus Creation for Lexicography Adam Kilgarriff, Michael Rundell Lexicography MasterClass, UK Elaine Ui Dhonnchadha ITE (Linguistics Institute of Ireland)
Tasks • Design • Collection • Encoding Kilgarriff: Asialex June 2005
The project • A New English-Irish Dictionary • Authoritative, general purpose • Academics, translators, students, secretaries • One year ‘set-up’ phase • Limited time, limited budget • Many tasks, including corpus development • Irish and UK Government funded • Lead contractor: LexMasterClass • Subcontractor: ITE
Languages • English • Irish
The Irish language • A Celtic language • Long literary tradition • Irish-Latin dictionary from 9th century • Main language of Ireland until 1850-1900 • English took over (British imperialist policies) • 62,000 speakers as main language • Gaeltacht: Irish-speaking areas • Three dialects
Gaeltacht areas
Design: English • Source language for NEID • Very large resource wanted • E.g. for word sketches, see Friday talk • Three language varieties • Irish (Hiberno-English) • British • American
American • 100M words • Journalistic text available • British • 100M words • British National Corpus (BNC) • Model balanced corpus • Spoken conversation (10%) • Books, newspapers, magazines • Popular, academic, technical
Hiberno-English • 25 M words • Goal: balanced like BNC except • No budget for spoken corpus collection • New category: web • Dates: since independence (1922) • Emphasis on current language
Design: Irish • 30 M words • Starting point: BNC-like • Native speakers • Native speakers’ language “better” • Many texts written by non-native speakers • Record status where possible • Newspapers, websites: no info available • Dialect • Record where possible
“High quality Irish” • Smaller than 150 years ago • Many documents are translations • Learners’ errors, inelegant prose • Samuel Johnson: “writers of the first reputation” Con • Who judges? • Risk of literary or backward-looking bias • Lexicographers need a corpus to translate “boot the computer” as well as “the babbling brook” • Trench and the OED: “an historian, not a critic” • Will a quality filter limit corpus breadth (and size)?
Quality: outcome • Wide range of text types wanted • Particular effort to gather native speaker non-translations • Period for corpus: 1883-present • Most earlier texts: literary • Most text types: usually recent
Collection • Use existing • Ask publishers • Web
Use existing • Irish: PAROLE corpus (8M words, ITE) • English • British: BNC • American: LDC Gigaword – wds journalism • Limerick Corpus of Spoken English • Northern Ireland Corpus of Transcribed Speech
Ask publishers • The junkmail problem • Appeals to national pride • Charm and persistence • Team member who knows them all
Web • Fast becoming the usual place to look • Kilgarriff and Grefenstette, CL 2003 • Preliminary experiments • at least 15 M words of Irish out there • Hiberno-English • English as found on sites where Irish was found
Web issues • Formats • conversion from pdf etc needed • Character representation • Not many pages “do the right thing” • Navigational material: “click here” • Lists • Mixed languages • Duplication
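Two of the web issues above, navigational material and duplication, can be illustrated with a minimal Python sketch. It is not the project's actual pipeline: the phrase list `NAV_PHRASES`, the word limit, and the function names are all hypothetical choices made for illustration, and only exact duplicates (after whitespace and case normalisation) are caught.

```python
import hashlib
import re

# Hypothetical list of navigational phrases; a real project would tune this.
NAV_PHRASES = ("click here", "home", "next page", "back to top")


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_navigational(line: str, max_words: int = 4) -> bool:
    """Treat a short line containing a known navigational phrase as boilerplate."""
    norm = normalise(line)
    return len(norm.split()) <= max_words and any(p in norm for p in NAV_PHRASES)


def clean_pages(pages):
    """Drop navigational lines, then drop pages whose remaining text is a duplicate."""
    seen = set()
    kept = []
    for page in pages:
        lines = [ln for ln in page.splitlines()
                 if ln.strip() and not is_navigational(ln)]
        body = "\n".join(lines)
        digest = hashlib.sha1(normalise(body).encode("utf-8")).hexdigest()
        if body and digest not in seen:
            seen.add(digest)
            kept.append(body)
    return kept
```

Near-duplicates (pages that differ by a date or a banner) would need a fuzzier comparison than the exact hash used here.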
Encoding • Clean-up • Linguistic processing • Delivery formalism
Clean-up • Deletion of: Title pages, table of contents, tables, figures, footnotes, endnotes, page headers and footers, crosswords, TV listings, sports results, team listings …
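One item on the clean-up list, page headers and footers, lends itself to a simple heuristic: a line that recurs on many pages of the same document is probably a running header or footer. The sketch below assumes the text is already split into pages; the function name `strip_running_lines` and the `min_share` threshold are illustrative assumptions, not the method used in the project.

```python
from collections import Counter


def strip_running_lines(pages, min_share=0.5):
    """Remove lines that recur on at least `min_share` of the pages."""
    page_lines = [[ln.strip() for ln in p.splitlines() if ln.strip()]
                  for p in pages]
    counts = Counter()
    for lines in page_lines:
        counts.update(set(lines))  # count each distinct line once per page
    # Require at least two occurrences so single-page documents are untouched.
    threshold = max(2, min_share * len(pages))
    repeated = {ln for ln, n in counts.items() if n >= threshold}
    return ["\n".join(ln for ln in lines if ln not in repeated)
            for lines in page_lines]
```

The other items on the list (crosswords, TV listings, sports results) would need their own detectors or manual removal.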
Linguistic processing • Lemmatize • give giving gives given gave => give (verb) • Part-of-speech tagging • bank (verb) or bank (noun)? • English: existing tools used • Irish: tools developed from scratch • Elaine Ui Dhonnchadha: thesis work • Finite state methods, constraint grammar • Separate talk
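The two examples on this slide can be made concrete with a toy Python sketch: a lookup table maps each form of "give" to its lemma, and a crude context rule picks between the noun and verb readings of "bank". This is only an illustration under stated assumptions. The `LEXICON` table, the `DETERMINERS` set, and the rules are invented for the example and are not the finite-state or constraint-grammar tools mentioned above.

```python
# Toy lexicon: word form -> list of (lemma, part of speech) readings.
LEXICON = {
    "give": [("give", "verb")],
    "gives": [("give", "verb")],
    "giving": [("give", "verb")],
    "given": [("give", "verb")],
    "gave": [("give", "verb")],
    "bank": [("bank", "noun"), ("bank", "verb")],
}

DETERMINERS = {"the", "a", "an"}


def analyse(form):
    """Return all readings for a word form (empty list if unknown)."""
    return list(LEXICON.get(form.lower(), []))


def tag(tokens):
    """Attach readings to each token, pruning ambiguous ones by the previous word."""
    tagged = []
    prev = None
    for tok in tokens:
        readings = analyse(tok)
        if len(readings) > 1 and prev is not None:
            if prev.lower() in DETERMINERS:
                # After a determiner, prefer the noun reading.
                readings = [r for r in readings if r[1] == "noun"] or readings
            elif prev.lower() == "to":
                # After "to", prefer the verb reading.
                readings = [r for r in readings if r[1] == "verb"] or readings
        tagged.append((tok, readings))
        prev = tok
    return tagged
```

With no helpful context, as in `tag(["bank"])`, both readings are kept, which is the ambiguity the slide points to.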
Delivery formalism • Both • XML Corpus Encoding Standards (XCES) • For longevity, interchange format • And • Loaded into Word Sketch Engine • Corpus query tool optimised for lexicography, linguistic research • Good for searching on grammar, text type etc • Friday talk
Conclusion • Large corpora for high-quality lexicography • Developed in one year, modest budget • Design, collection and encoding • Delivered in a convenient form for the lexicographer Thank you