1 / 20

The Crúbadán Project: Corpus building for under-resourced languages

The Crúbadán Project: Corpus building for under-resourced languages. Kevin P. Scannell Presented by: Ben King. Problems with Minority Languages. Often no public funding available and no commercial interest in this work Few native speakers with NLP training

tavita
Download Presentation

The Crúbadán Project: Corpus building for under-resourced languages

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The Crúbadán Project: Corpus building for under-resourced languages Kevin P. Scannell Presented by: Ben King

  2. Problems with Minority Languages • Often no public funding available and no commercial interest in this work • Few native speakers with NLP training • Data are too scarce for “representativeness”

  3. Overcoming these Problems • An open source development model • Volunteer labor by language enthusiasts • Free, web-crawled corpora • Language-independent tools (like the crawler) deployed for a large number of languages • Unsupervised machine learning algorithms

  4. An Crúbadán • A web-crawler that builds corpora for minority languages • Phase 1 aims to collect text in all languages on the web • Phase 2 aims to produce NLP resources in these languages

  5. Statistics • 1872 languages • 836,000 documents • 1,900,000,000 words

  6. Design Principles • Orthographies instead of languages • A corpus should consist of real running text • Not just word lists • It should be possible to collect all the Web text for small languages

  7. The pipeline – step 1: finding languages • Lots of manual work • Web searching • Scanning and OCRing books • Monitor for new languages at • Wikipedia • UDHR • Watchtower • bible.is • Add language metadata • Expected polluting languages • Encodings

  8. The pipeline – step 1: finding languages

  9. The pipeline – step 1: finding languages

  10. The pipeline – step 2: building a query • Separate the known words into stopwords and other words • OR together content words • AND with a stopword

  11. The pipeline – step 3: processing documents • Google returns many types of documents • Convert to plain text • pdf2text • Remove markup • Remove boilerplate • Convert to Unicode • Identify the document language • Update lexicon, lang-id trigrams

  12. The pipeline – step 4: repeat • Repeat

  13. Applications

  14. Spell checkers • Most languages do not have a spell checker • Building a spell checker requires • List of roots • Basic morphological rules • A native speaker must identify roots and affixes • 28 new spellcheckers created

  15. Indigenous Tweets and Language Revitalization

  16. Indigenous Tweets and Language Revitalization • Indigenous Tweets project identifies and tracks Twitter accounts that tweet in minority languages • New usage of endangered languages online: • Gamilaraay (Australia, ~3 speakers), • Nawat (El Salvador, ~20 speakers), • Delaware (USA, ~78 speakers) • Encourages a younger generation to use endangered languages

  17. Diacritic Restoration • Many languages with diacritics have developed an ASCII orthography on the Web • Examples • Oll skulu vera fraels at hava sinar askodanir og bera taer fram uttan fordan → • Øll skulu vera fræls at hava sínar áskoðanir og bera tær fram uttan forðan • Ua noa i na kanaka apau ke kuokoa o ka manao a me ka hoike ana i ka manao → • Ua noa i nā kānaka apau ke kūʻokoʻa o ka manaʻo a me ka hōʻike ʻana i ka manaʻo • Moi nguoi deu co quyen tu do ngon luan va bay to quan diem → • Mọi người đều có quyền tự do ngôn luận và bầy tỏ quan điểm • Eni kookan lo ni eto si omi nira lati ni imoran ti o wu u, ki o si so iru imoran bee jade→ • Ẹnì kọ̣ọ̣kan ló ní ẹ̀tọ̣ sí òmì nira láti ní ìmọ̣ràn tí ó wù ú, kí ó sì sọ irú ìmọ̣ràn bẹ̀ẹ̀ jáde • Released as an open-source Firefox add-on, accentuate.us • Supports 84 languages

  18. Diacritic Restoration

  19. Continuing problems • Variations in scripts • Example: Azerbaijani • ИНСАН ҺҮГУГЛАРЫ ҺАГГЫНДА ҮМУМИ БӘЈАННАМӘ • İNSAN HÜQUQLARI HAQQINDA ÜMUMİ BƏYANNAMƏ • INSAN HUQUQLARI HAQQINDA UMUMI BEYANNAME • Mixed-language documents • Manual intervention

More Related