
MT For Low-Density Languages


Presentation Transcript


  1. MT For Low-Density Languages Ryan Georgi Ling 575 – MT Seminar Winter 2007

  2. What is “Low Density”?

  3. What is “Low Density”? • In NLP, languages are usually chosen for: • Economic Value • Ease of development • Funding (NSA, anyone?)

  4. What is “Low Density”? • As a result, NLP work until recently has focused on a rather small set of languages. • e.g. English, German, French, Japanese, Chinese

  5. What is “Low Density”? • “Density” refers to the availability of resources (primarily digital) for a given language. • Parallel text • Treebanks • Dictionaries • Chunked, semantically tagged, or other annotation

  6. What is “Low Density”? • “Density” is not necessarily linked to speaker population • Our favorite example: Inuktitut

  7. So, why study LDL?

  8. So, why study LDL? • Preserving endangered languages • Spreading benefits of NLP to other populations • (Tegic has T9 for Azerbaijani now) • Benefits of wide typological coverage for cross-linguistic research • (?)

  9. Problem of LDL?

  10. Problem of LDL? • “The fundamental problem for annotation of lower-density languages is that they are lower density” – Maxwell & Hughes • The easiest (and often best) NLP development is done with statistical methods • Training requires lots of resources • Resources require lots of money • A cost/benefit chicken-and-egg problem

  11. What are our options? • Create corpora by hand • Very time-consuming (= expensive) • Requires trained native speakers • Digitize printed resources • Also time-consuming • May require trained native speakers • e.g. for orthographies not covered by Unicode

  12. What are our options? • Traditional requirements are going to be difficult to satisfy, no matter how we slice it. • We need to, then: • Maximize information extracted from resources we can get • Reduce requirements for building a system

  13. Maximizing Information with IGT

  14. Maximizing Information with IGT • Interlinear Glossed Text • Traditional form of transcription for linguistic field researchers and grammarians • Example:
Rhoddodd yr athro lyfr i’r bachgen ddoe
gave-3sg the teacher book to-the boy yesterday
“The teacher gave a book to the boy yesterday”
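
For concreteness, an IGT instance can be thought of as a triple of token lists. The sketch below is one minimal way to represent it in Python; the class and field names are illustrative, not taken from any particular tool.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class IGTInstance:
    """One interlinear-glossed example: source line, gloss line, free translation."""
    source: List[str]       # language line, e.g. the Welsh words above
    gloss: List[str]        # gloss line, token-aligned with the source
    translation: List[str]  # free English translation

welsh = IGTInstance(
    source="Rhoddodd yr athro lyfr i'r bachgen ddoe".split(),
    gloss="gave-3sg the teacher book to-the boy yesterday".split(),
    translation="The teacher gave a book to the boy yesterday".split(),
)

# By IGT convention the source and gloss lines are token-aligned.
assert len(welsh.source) == len(welsh.gloss)
```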

  15. Benefits of IGT • As IGT is frequently used in fieldwork, it is often available for low-density languages • IGT provides information about syntax and morphology • The translation line is usually in a high-density language that we can use as a pivot language.

  16. Drawbacks of IGT • Data can be ‘abnormal’ in a number of ways • Usually quite short • May be used by grammarians to illustrate fringe usages • Often uses a purposely limited vocabulary • Still, in working with LDL it might be all we’ve got

  17. Utilizing IGT • First, a big nod to Fei (this is her paper!) • As we saw in HW#2, word alignment is hard. • IGT, however, often gets us halfway there!

  18. Utilizing IGT • Take the previous example:
Rhoddodd yr athro lyfr i’r bachgen ddoe
gave-3sg the teacher book to-the boy yesterday
“The teacher gave a book to the boy yesterday”

  23. Utilizing IGT • Take the previous example:
Rhoddodd yr athro lyfr i’r bachgen ddoe
gave-3sg the teacher book to-the boy yesterday
“The teacher gave a book to the boy yesterday”
• The interlinear already aligns the source with the gloss • Often, the gloss uses words found in the translation already
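
Roughly, in code, the heuristic looks like the sketch below: source and gloss are token-aligned by IGT convention, and any gloss piece that literally reappears in the translation yields a source-to-translation link. This is only a toy illustration of the idea, not Fei's actual algorithm, and it handles repeated translation words naively.

```python
def gloss_align(source, gloss, translation):
    """Link source words to translation words through the gloss line.

    source[i] corresponds to gloss[i] by IGT convention; a gloss piece that
    literally reappears in the translation then aligns source[i] to that word.
    (Repeated translation words are handled naively here.)"""
    trans_lower = [w.lower().strip('".,') for w in translation]
    links = []
    for i, g in enumerate(gloss):
        # Gloss tokens like "to-the" or "gave-3sg" bundle several morphemes.
        for piece in g.lower().split("-"):
            if piece in trans_lower:
                links.append((i, trans_lower.index(piece)))
    return links

source = "Rhoddodd yr athro lyfr i'r bachgen ddoe".split()
gloss = "gave-3sg the teacher book to-the boy yesterday".split()
translation = "The teacher gave a book to the boy yesterday".split()

print(gloss_align(source, gloss, translation))
# e.g. (0, 2) links "Rhoddodd" to "gave", (2, 1) links "athro" to "teacher", ...
```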

  24. Utilizing IGT • Alignment isn’t always this easy…
xaraju mina lgurfati wa nah.nu nadxulu
xaraj-u: mina ?al-gurfat-i wa nah.nu na-dxulu
exited-3MPL from DEF-room-GEN and we 1PL-enter
'They left the room as we were entering it'
(Source: Modern Arabic: Structures, Functions, and Varieties; Clive Holes)

  25. Utilizing IGT • Alignment isn’t always this easy…
xaraju mina lgurfati wa nah.nu nadxulu
xaraj-u: mina ?al-gurfat-i wa nah.nu na-dxulu
exited-3MPL from DEF-room-GEN and we 1PL-enter
'They left the room as we were entering it'
(Source: Modern Arabic: Structures, Functions, and Varieties; Clive Holes)
• We can get a little more by stemming…

  26. Utilizing IGT • Alignment isn’t always this easy…
xaraju mina lgurfati wa nah.nu nadxulu
xaraj-u: mina ?al-gurfat-i wa nah.nu na-dxulu
exited-3MPL from DEF-room-GEN and we 1PL-enter
'They left the room as we were entering it'
(Source: Modern Arabic: Structures, Functions, and Varieties; Clive Holes)
• We can get a little more by stemming… • …but we’re going to need more.
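
A hedged sketch of the stemming step, using NLTK's PorterStemmer purely as an example of an off-the-shelf English stemmer: stemming both the gloss morphemes and the translation words recovers matches like enter/entering, while exited/left still fails, which is exactly the gap that calls for more machinery.

```python
from nltk.stem import PorterStemmer  # any off-the-shelf English stemmer would do

stem = PorterStemmer().stem

gloss = "exited-3MPL from DEF-room-GEN and we 1PL-enter".split()
translation = "They left the room as we were entering it".split()

def stemmed_align(gloss, translation):
    """Match gloss morphemes against translation words after stemming both sides."""
    trans_stems = [stem(w.lower()) for w in translation]
    links = []
    for i, g in enumerate(gloss):
        for piece in g.lower().split("-"):
            if stem(piece) in trans_stems:
                links.append((i, trans_stems.index(stem(piece))))
    return links

print(stemmed_align(gloss, translation))
# "1PL-enter" now links to "entering"; "exited" still misses "left",
# which is why gloss matching alone won't be enough.
```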

  27. Utilizing IGT • Thankfully, with an English translation, we already have tools to get phrase and dependency structures that we can project: (Source: Will & Fei’s NAACL 2007 Paper!)

  29. Utilizing IGT • What can we get from this? • Automatically generated CFGs • Can infer word order from these CFGs • Can infer possible constituents • …suggestions? • From a small amount of data, this is a lot of information, but what about…
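
As a toy illustration of “automatically generated CFGs”, the sketch below reads production rules off a bracketed tree. The tree and its labels are invented for this example; in the paper the structures come from projecting an English parse through the gloss alignment.

```python
from collections import Counter
from nltk import Tree

# A hypothetical projected phrase structure for the Welsh example, written out
# by hand purely for illustration.
projected = Tree.fromstring(
    "(S (V Rhoddodd) (NP (D yr) (N athro)) (NP (N lyfr)) "
    "(PP (P i'r) (NP (N bachgen))) (ADV ddoe))"
)

# Read off CFG productions; over many IGT instances the counts suggest
# word order (here the verb precedes the subject NP) and constituent shapes.
rule_counts = Counter(str(p) for p in projected.productions())
for rule, n in rule_counts.most_common():
    print(n, rule)
```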

  30. Reducing Data Requirements with Prototyping

  31. Grammar Induction • So, we have a way to get production rules from a small amount of data. • Is this enough? • Probably not. • CFGs aren’t known for their robustness • How about using what we have as a bootstrap?

  32. Grammar Induction • Given unannotated text, we can derive PCFGs • Without annotation, though, we just have unlabelled trees: [tree diagram: an unlabelled parse of “the dog fell asleep” with arbitrary nonterminal labels (ROOT, C2, X0, X1, Y2, Z3, N4) and rule probabilities (p=0.02, p=0.45e-4, p=0.003, p=0.09, p=5.3e-2)] • Such an unlabelled parse doesn’t give us S -> NP VP, though.
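
To make the PCFG step concrete, here is a minimal maximum-likelihood estimate over two invented “induced” trees with arbitrary labels; it also shows why the resulting rules carry no S -> NP VP interpretation on their own.

```python
from collections import Counter
from nltk import Tree

# Two toy "induced" trees with arbitrary nonterminal labels: nothing tells us
# that X1 is an NP or that X0 is an S. (Invented data, for illustration only.)
trees = [
    Tree.fromstring("(X0 (X1 (X2 the) (X3 dog)) (X4 (X5 fell) (X6 asleep)))"),
    Tree.fromstring("(X0 (X1 (X2 the) (X3 cat)) (X4 (X5 ran) (X6 away)))"),
]

# Maximum-likelihood PCFG estimation: p(A -> beta) = count(A -> beta) / count(A).
rule_counts, lhs_counts = Counter(), Counter()
for t in trees:
    for p in t.productions():
        rule_counts[p] += 1
        lhs_counts[p.lhs()] += 1

for p, n in rule_counts.items():
    print(f"{p}  p={n / lhs_counts[p.lhs()]:.2f}")
```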

  33. Grammar Induction • Can we get labeled trees without annotated text? • Haghighi & Klein (2006) • Propose a way in which production rules can be passed to a PCFG induction algorithm as “prototypical” constituents • Think of these prototypes as a rubric that could be given to a human annotator • e.g. for English, NP -> DT NN

  34. Grammar Induction • Let’s take the possible constituent DT NN • We could tell our PCFG algorithm to apply this as a constituent everywhere it occurs • But what about DT NN NN (“the train station”)? • We would like to catch this as well

  35. Grammar Induction • H&K’s solution? Distributional clustering • “a similarity measure between two items on the basis of their immediate left and right contexts” • …to be honest, I get a bit lost in the math here. • Importantly, however, weighting constituent probabilities with the right similarity measure improves the F-measure from the unsupervised baseline of 35.3 to 62.2
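
The toy sketch below gestures at the distributional idea, as a drastic simplification of Haghighi & Klein's actual model and with invented data: a span type is represented by counts of its immediate left/right contexts and compared with the context signature of a prototype such as NP -> DT NN via cosine similarity.

```python
from collections import Counter
from math import sqrt

def context_signature(sentences, is_match):
    """Count (left neighbour, right neighbour) contexts of every span for which
    is_match(span) is true; sentences are lists of POS tags for brevity."""
    sig = Counter()
    for sent in sentences:
        n = len(sent)
        for i in range(n):
            for j in range(i + 1, n + 1):
                if is_match(sent[i:j]):
                    left = sent[i - 1] if i > 0 else "<s>"
                    right = sent[j] if j < n else "</s>"
                    sig[(left, right)] += 1
    return sig

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy POS-tagged corpus (invented).
sents = [
    "DT NN VBD IN DT NN".split(),
    "DT NN NN VBD RB".split(),
]

# Prototype constituent handed to the inducer as a "rubric": NP -> DT NN.
proto_sig = context_signature(sents, lambda s: s == ["DT", "NN"])
# Candidate span type not on the prototype list: DT NN NN ("the train station").
cand_sig = context_signature(sents, lambda s: s == ["DT", "NN", "NN"])

# Contexts overlap, so the candidate can be scored as NP-like as well.
print(cosine(proto_sig, cand_sig))
```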

  36. So… what now?

  37. Next Steps • By extracting production rules from a very small amount of data using IGT and using Haghighi & Klein’s unsupervised methods, it may be possible to bootstrap an effective language model from very little data!

  38. Next Steps • Possible applications: • Automatic generation of language resources • (While a fully automatic system would only compound its own errors, automatically annotated data may be easier for a human to correct than to generate by hand) • Assist linguists in the field • (Better model performance could imply better grammar coverage) • …you tell me!
