1 / 29

Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing

Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing. Matt Post, Chris Callison-Burch, and Miles Osborne June 8, 2012. Languages. Bengali বাংলা. Hindi मानकहिनद्ी. Malayalam മലയാളം. Tamil தமிழ். Telugu తెలుగు. Urdu اردو. Indo-Aryan languages. Dravidian languages.

dorjan
Download Presentation

Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing • Matt Post, Chris Callison-Burch, and Miles Osborne • June 8, 2012

  2. Languages Bengali বাংলা Hindi मानकहिनद्ी Malayalam മലയാളം Tamil தமிழ் Telugu తెలుగు Urdu اردو Indo-Aryan languages Dravidian languages

  3. Languages Bengali Hindi Malayalam Tamil Telugu Urdu Dravidian languages Indo-Aryan languages

  4. Characteristics “The senatorprepared her remarks”செனட்டர் அவளை கருத்துக்கள் தயார் .senator her remarks prepared . ஈறு  + இன் + ஈனeeRu in eenclimb past 1p-sing-neuter • Head-final • Subject-object-verb (SOV) word order • Agglutinative morphology • inflectional: tense, person, number, gender, mood, voice • e.g., ஈரிநீன் / eeRineen (“climbed”) www.google.com/transliterate/tamil www.emille.lancs.ac.uk/lesal/tamil.pdf

  5. Data collection • We took the 100 most-popular Wikipedia articles in each language and translated them using Amazon’s Mechanical Turk in a 3-step process • Dictionary construction • Page translation • Vote gathering

  6. 1. Dictionary construction • We built a source-language vocabulary, and solicited translations for each word from Turkers • Each word was presented along with four sentences it occurred in • As controls, we used titles, which link pages in Wikipedia across languages

  7. 2. Translation அண்ணாதுரை மிகச் சிறந்த தமிழ் சொற்பொழிவாளரும் மேடைப் பேச்சாளரும் ஆவார் . Annadurai best Tamil lecturer stage spokesman is . • We took the 100 most-popular Wikipedia articles in each language and translated them using Amazon’s Mechanical Turk four non-expert translations annadurai was an excellent orator and a public speaker .annadurai is a very good speakerannadurai is one of the best tamil speeches and also stage speeches .annathurai was the great reader and also the stage speaker .

  8. 3. Votes 5 அண்ணாதுரை மிகச் சிறந்த தமிழ் சொற்பொழிவாளரும் மேடைப் பேச்சாளரும் ஆவார் . Annadurai best Tamil lecturer stage spokesman is . • In a final task, we collected five votes on which of the four translations was the best training sentences 8504-8507 four non-expert translations annadurai was an excellent orator and a public speaker .annadurai is a very good speakerannadurai is one of the best tamil speeches and also stage speeches .annathurai was the great reader and also the stage speaker .

  9. Data collection • Obtained about 500K English words for training, another 35K for tuning and testing Europarl ES-EN 50m English words Indic languages 0.5m English words

  10. Data splits • We produce four datasets: train, dev, devtest, test • Steps: • We manually assigned documents to one of seven categories • We assigned categories to datasets in round-robin fashion

  11. Data splits

  12. Data quality issues அஜித் குமாரின்முதல்வெற்றிப்படம்ஆசை . ajithkumarfirstsuccessfulmovieassai . training sentence 17 • Translation quality • Translations • aasai was the firstsuccessfullmovie for ajithkumar . • firstfilm by ajithkumar was ' asai ' • ajithkumarfirstvictory is aasai • ajithkumarfirstsuccessmovie is aasai

  13. Data quality issues • Inconsistent orthography • Translations • in srilanka solargovernment • cholarule in sri lanka . • in srilankachozhasruled • cholareign in sri lanka இலங்கையில் சோழர் ஆட்சி In Sri LankaChola ruled

  14. Legend Data quality issues false positive true positive false negative • Poor alignments

  15. Research Questions • How well does SMT work on these languages? • Do linguistic annotations help? • How important is translation quality?

  16. Q1: How well can we do? • Hiero • Linguistically un-informed grammars that define lexicalized (re)orderings, extracted from aligned textX → X(1) உறுதி செய்கிறது X(2), X(1) confirmed X(2)

  17. Q1: How well can we do? scores are the mean of three MERT runs

  18. Q2: Do linguistic annotations help? • Syntax-augmented machine translation (SAMT) • Linguistically informed grammars extracted with the aid of a target-side parse treeS+. → PRP+VBZ(1) உறுதி செய்கிறது .(2), PRP+VBZ(1) confirmed .(2) • SAMT grammars are particularly well-motivated • Syntax should help describe high-level SOV → SVO reordering • Previously well-attested for Urdu (Baker et al., SCALE 2009)

  19. Q2: Do linguistic annotations help? — BLEU-4 scores — scores are the mean of three MERT runs

  20. Q2: Do linguistic annotations help? scores are the mean of three MERT runs

  21. Q3: Does translation quality matter? • Recall that we have four redundant translations for each Indian language sentence, along with independently-obtained votes about which is best • We trained models on a quarter of the data • selected randomly • selected by plurality (breaking ties randomly) • And tested on the same test sets

  22. Q3: Does text quality matter? scores are the mean of three MERT runs

  23. Q3: Does text quality matter? scores are the mean of three MERT runs

  24. Future directions • Morphology • We took the word segmentations as given, yet we know these languages to be highly agglutinative • Better segmentation should help at all stages, from alignment to decoding

  25. Future directions • Text normalization: standardizing orthography would help immensely Urdu-English dataset Tamil-English dataset இலங்கையில் சோழர் ஆட்சிin srilankasolar governmentcholarulein sri lanka .in srilankachozhasruledcholareignin sri lanka

  26. Summary joshua-decoder.org/indian-parallel-corpora • A suite of six low-resource, head-final, morphologically rich languages from the Indian subcontinent • Provided data splits for comparisons • Ideas can be tested in an afternoon on a variety of languages • We suggest future work in the areas of morphology, normalization, and domain adaptation • The website will track uses of the data as well as the best test-set scores

  27. joshua-decoder.org/indian-parallel-corpora

  28. Thanks • Support: Google, Microsoft, EuroMatrixPlus, DARPA • Lexi Birch • Ghouse Ismail

More Related