Corpus Creation for Lexicography

Presentation Transcript


  1. Corpus Creation for Lexicography Adam Kilgarriff, Michael Rundell Lexicography MasterClass, UK Elaine Ui Dhonnchadha ITE (Linguistics Institute of Ireland)

  2. Tasks • Design • Collection • Encoding Kilgarriff: Asialex June 2005

  3. The project • A New English-Irish Dictionary • Authoritative, general purpose • Academics, translators, students, secretaries • One year ‘set-up’ phase • Limited time, limited budget • Many tasks, including corpus development • Irish and UK Government funded • Lead contractor: LexMasterClass • Subcontractor: ITE

  4. Languages • English • Irish

  5. The Irish language • A Celtic language • Long literary tradition • Irish-Latin dictionary from 9th century • Main language of Ireland until 1850-1900 • English took over (British imperialist policies) • 62,000 speakers as main language • Gaeltacht: Irish-speaking areas • Three dialects

  6. Gaeltacht areas

  7. Design: English • Source language for NEID • Very large resource wanted • Eg for word sketches, see Friday talk • Three language varieties • Irish (Hiberno-English) • British • American

  8. American • 100M words • Journalistic text available • British • 100M words • British National Corpus (BNC) • Model balanced corpus • Spoken conversation (10%) • Books, newspapers, magazines • Popular, academic, technical

  9. Hiberno-English • 25 M words • Goal: balanced like BNC except • No budget for spoken corpus collection • New category: web • Dates: since independence (1922) • Emphasis on current language

  10. Design: Irish • 30 M words • Starting point: BNC-like • Native speakers • Native speakers' language “better” • Many texts written by non-native speakers • Record status where possible • Newspapers, websites: no info available • Dialect • Record where possible

  11. “High quality Irish” • Smaller than 150 years ago • Many documents are translations • Learners’ errors, inelegant prose • Samuel Johnson: “writers of the first reputation” Con • Who judges? • Risk of literary or backward-looking bias • Lexicographers need a corpus to translate “Boot the computer” as well as “the babbling brook” • Trench and the OED: “an historian, not a critic” • Will a quality filter limit corpus breadth (and size)?

  12. Quality: outcome • Wide range of text types wanted • Particular effort to gather native speaker non-translations • Period for corpus: 1883-present • Most earlier texts: literary • Most text types: usually recent

  14. Collection • Use existing • Ask publishers • Web

  15. Use existing • Irish: PAROLE corpus (8M words, ITE) • English • British: BNC • American: LDC Gigaword – wds journalism • Limerick Corpus of Spoken English • Northern Ireland Corpus of Transcribed Speech

  16. Ask publishers • The junkmail problem • Appeals to national pride • Charm and persistence • Team member who knows them all

  17. Web • Fast becoming the usual place to look • Kilgarriff and Grefenstette, CL 2003 • Preliminary experiments • At least 15 M words of Irish out there • Hiberno-English • English as found on sites where Irish was found

  18. Web issues • Formats • Conversion from PDF etc. needed • Character representation • Not many pages “do the right thing” • Navigational material: “click here” • Lists • Mixed languages • Duplication
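
Two of the web issues above, navigational material and duplication, can be illustrated with a minimal Python sketch. This is not the project's actual pipeline: the phrase list in `NAV_PATTERNS` and the function names are hypothetical, and the duplicate check only catches pages that are identical after whitespace and case normalisation.

```python
import hashlib
import re

# Hypothetical list of navigational phrases; a real project would build its own.
NAV_PATTERNS = re.compile(
    r"^(click here|home|next|previous|back to top)$", re.IGNORECASE
)


def strip_navigation(lines):
    """Drop empty lines and lines that consist only of navigational text."""
    return [
        ln for ln in lines
        if ln.strip() and not NAV_PATTERNS.match(ln.strip())
    ]


def normalise(text):
    """Lower-case and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(pages):
    """Keep the first copy of each page whose normalised text is new."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha1(normalise(page).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique
```

Near-duplicates (for example, the same article with a different page footer) would need a fuzzier comparison than the exact hash used here.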

  20. Encoding • Clean-up • Linguistic processing • Delivery formalism

  21. Clean-up • Deletion of: Title pages, table of contents, tables, figures, footnotes, endnotes, page headers and footers, crosswords, TV listings, sports results, team listings …

  22. Linguistic processing • Lemmatize • give, giving, gives, given, gave => give (verb) • Part-of-speech tagging • bank (verb) or bank (noun)? • English: existing tools used • Irish: tools developed from scratch • Elaine Ui Dhonnchadha: thesis work • Finite state methods, constraint grammar • Separate talk
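
The lemmatization and tagging examples on this slide can be made concrete with a toy Python sketch. It is a plain dictionary lookup, not the finite-state and constraint-grammar tools mentioned above; the `LEXICON` table and the function names are invented for illustration.

```python
# Toy lexicon (hypothetical): word form -> list of (lemma, part of speech).
LEXICON = {
    "give": [("give", "verb")],
    "giving": [("give", "verb")],
    "gives": [("give", "verb")],
    "given": [("give", "verb")],
    "gave": [("give", "verb")],
    # "bank" is ambiguous: a tagger must pick one analysis from context.
    "bank": [("bank", "noun"), ("bank", "verb")],
}


def analyse(token):
    """Return every (lemma, part of speech) analysis known for a token."""
    key = token.lower()
    return LEXICON.get(key, [(key, "unknown")])


def lemmas(tokens):
    """Return the lemma of the first listed analysis for each token."""
    return [analyse(tok)[0][0] for tok in tokens]
```

Here `analyse("bank")` returns two candidate analyses, which is the ambiguity that part-of-speech tagging has to resolve.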

  23. Delivery formalism • Both • XML Corpus Encoding Standards (XCES) • For longevity, interchange format • And • Loaded into Word Sketch Engine • Corpus query tool optimised for lexicography, linguistic research • Good for searching on grammar, text type etc • Friday talk
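
As a rough illustration of how lemma, part-of-speech and text-type information might travel with the text into a corpus query tool, the sketch below writes one token per line inside a document element. The slides do not specify the loading format, so the layout, the `doc` element, its `id` and `type` attributes, and the function name `to_vertical` are all assumptions made for this example; it is not XCES.

```python
from xml.sax.saxutils import quoteattr


def to_vertical(doc_id, text_type, tokens):
    """Serialise tokens as tab-separated 'word, tag, lemma' lines.

    tokens is an iterable of (word, tag, lemma) triples. The text type is
    stored as an attribute so that searches can be restricted by it.
    """
    lines = [f"<doc id={quoteattr(doc_id)} type={quoteattr(text_type)}>"]
    for word, tag, lemma in tokens:
        lines.append(f"{word}\t{tag}\t{lemma}")
    lines.append("</doc>")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [("gave", "verb", "give"), ("banks", "noun", "bank")]
    print(to_vertical("d1", "web", sample))
```

Keeping the text type on the document element rather than on each token keeps the file compact while still allowing queries by text type.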

  24. Conclusion • Large corpora for high-quality lexicography • Developed in one year, modest budget • Design, collection and encoding • Delivered in a convenient form for the lexicographer • Thank you