210 likes | 374 Views
The Crúbadán Project: Corpus building for under-resourced languages. Kevin P. Scannell Presented by: Ben King. Problems with Minority Languages. Often no public funding available and no commercial interest in this work Few native speakers with NLP training
E N D
The Crúbadán Project: Corpus building for under-resourced languages Kevin P. Scannell Presented by: Ben King
Problems with Minority Languages • Often no public funding available and no commercial interest in this work • Few native speakers with NLP training • Data are too scarce for “representativeness”
Overcoming these Problems • An open source development model • Volunteer labor by language enthusiasts • Free, web-crawled corpora • Language-independent tools (like the crawler) deployed for a large number of languages • Unsupervised machine learning algorithms
An Crúbadán • A web-crawler that builds corpora for minority languages • Phase 1 aims to collect text in all languages on the web • Phase 2 aims to produce NLP resources in these languages
Statistics • 1872 languages • 836,000 documents • 1,900,000,000 words
Design Principles • Orthographies instead of languages • A corpus should consist of real running text • Not just word lists • It should be possible to collect all the Web text for small languages
The pipeline – step 1: finding languages • Lots of manual work • Web searching • Scanning and OCRing books • Monitor for new languages at • Wikipedia • UDHR • Watchtower • bible.is • Add language metadata • Expected polluting languages • Encodings
The pipeline – step 2: building a query • Separate the known words into stopwords and other words • OR together content words • AND with a stopword
The pipeline – step 3: processing documents • Google returns many types of documents • Convert to plain text • pdf2text • Remove markup • Remove boilerplate • Convert to Unicode • Identify the document language • Update lexicon, lang-id trigrams
The pipeline – step 4: repeat • Repeat
Spell checkers • Most languages do not have a spell checker • Building a spell checker requires • List of roots • Basic morphological rules • A native speaker must identify roots and affixes • 28 new spellcheckers created
Indigenous Tweets and Language Revitalization • Indigenous Tweets project identifies and tracks Twitter accounts that tweet in minority languages • New usage of endangered languages online: • Gamilaraay (Australia, ~3 speakers), • Nawat (El Salvador, ~20 speakers), • Delaware (USA, ~78 speakers) • Encourages a younger generation to use endangered languages
Diacritic Restoration • Many languages with diacritics have developed an ASCII orthography on the Web • Examples • Oll skulu vera fraels at hava sinar askodanir og bera taer fram uttan fordan → • Øll skulu vera fræls at hava sínar áskoðanir og bera tær fram uttan forðan • Ua noa i na kanaka apau ke kuokoa o ka manao a me ka hoike ana i ka manao → • Ua noa i nā kānaka apau ke kūʻokoʻa o ka manaʻo a me ka hōʻike ʻana i ka manaʻo • Moi nguoi deu co quyen tu do ngon luan va bay to quan diem → • Mọi người đều có quyền tự do ngôn luận và bầy tỏ quan điểm • Eni kookan lo ni eto si omi nira lati ni imoran ti o wu u, ki o si so iru imoran bee jade→ • Ẹnì kọ̣ọ̣kan ló ní ẹ̀tọ̣ sí òmì nira láti ní ìmọ̣ràn tí ó wù ú, kí ó sì sọ irú ìmọ̣ràn bẹ̀ẹ̀ jáde • Released as an open-source Firefox add-on, accentuate.us • Supports 84 languages
Continuing problems • Variations in scripts • Example: Azerbaijani • ИНСАН ҺҮГУГЛАРЫ ҺАГГЫНДА ҮМУМИ БӘЈАННАМӘ • İNSAN HÜQUQLARI HAQQINDA ÜMUMİ BƏYANNAMƏ • INSAN HUQUQLARI HAQQINDA UMUMI BEYANNAME • Mixed-language documents • Manual intervention