MT For Low-Density Languages
Ryan Georgi
Ling 575 – MT Seminar, Winter 2007
What is “Low Density”? • In NLP, languages are usually chosen for: • Economic Value • Ease of development • Funding (NSA, anyone?)
What is “Low Density”? • As a result, NLP work until recently has focused on a rather small set of languages. • e.g. English, German, French, Japanese, Chinese
What is “Low Density”? • “Density” refers to the availability of resources (primarily digital) for a given language. • Parallel text • Treebanks • Dictionaries • Chunked, semantically tagged, or other annotation
What is “Low Density”? • “Density” is not necessarily linked to speaker population • Our favorite example: Inuktitut
So, why study LDL? • Preserving endangered languages • Spreading benefits of NLP to other populations • (Tegic has T9 for Azerbaijani now) • Benefits of wide typological coverage for cross-linguistic research • (?)
Problem of LDL? • “The fundamental problem for annotation of lower-density languages is that they are lower density” – Maxwell & Hughes • The easiest (and often best) NLP development is done with statistical methods • Training requires lots of resources • Resources require lots of money • A cost/benefit chicken-and-egg problem
What are our options? • Create corpora by hand • Very time-consuming (= expensive) • Requires trained native speakers • Digitize printed resources • Also time-consuming • May require trained native speakers • e.g. an orthography without Unicode support
What are our options? • Traditional requirements are going to be difficult to satisfy, no matter how we slice it. • We therefore need to: • Maximize the information extracted from the resources we can get • Reduce the requirements for building a system
Maximizing Information with IGT • Interlinear Glossed Text • The traditional form of transcription for linguistic field researchers and grammarians • Example:
Rhoddodd yr athro lyfr i’r bachgen ddoe
gave-3sg the teacher book to-the boy yesterday
“The teacher gave a book to the boy yesterday”
Benefits of IGT • As IGT is frequently used in fieldwork, it is often available for low-density languages • IGT provides information about syntax and morphology • The translation line is usually in a high-density language that we can use as a pivot language.
Drawbacks of IGT • Data can be ‘abnormal’ in a number of ways • Examples are usually quite short • May be used by a grammarian to illustrate fringe usages • Vocabulary is often purposely limited • Still, in working with LDLs it might be all we’ve got
Utilizing IGT • First, a big nod to Fei (this is her paper!) • As we saw in HW#2, word alignment is hard. • IGT, however, often gets us halfway there!
Utilizing IGT • Take the previous example:
Rhoddodd yr athro lyfr i’r bachgen ddoe
gave-3sg the teacher book to-the boy yesterday
“The teacher gave a book to the boy yesterday”
• The interlinear already aligns the source with the gloss • Often, the gloss uses words found in the translation already (see the sketch below)
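To make that concrete, here is a minimal sketch of gloss-mediated alignment (my own toy code, not from the paper; names are illustrative). The assumption is that the gloss line aligns word-for-word with the source line, so any gloss piece that also appears in the translation gives us a source-translation pair.

```python
def align_via_gloss(source, gloss, translation):
    """Return (source_word, translation_word) pairs found via exact gloss matches."""
    src_words = source.split()
    gloss_words = gloss.lower().split()
    trans_words = [w.strip('".,').lower() for w in translation.split()]

    pairs = []
    for src_word, gloss_word in zip(src_words, gloss_words):
        # A hyphenated gloss like "to-the" or "gave-3sg" contains several pieces;
        # any piece that also appears in the translation yields an alignment.
        for piece in gloss_word.split('-'):
            if piece in trans_words:
                pairs.append((src_word, piece))
    return pairs

source = "Rhoddodd yr athro lyfr i'r bachgen ddoe"
gloss = "gave-3sg the teacher book to-the boy yesterday"
translation = "The teacher gave a book to the boy yesterday"
print(align_via_gloss(source, gloss, translation))
```

On the Welsh example this recovers the content-word alignments essentially for free; leftover material (the English article “a”, the -3sg morpheme) still needs other treatment.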
Utilizing IGT • Alignment isn’t always this easy…
xaraju mina lgurfati wa nah.nu nadxulu
xaraj-u: mina ?al-gurfat-i wa nah.nu na-dxulu
exited-3MPL from DEF-room-GEN and we 1PL-enter
‘They left the room as we were entering it’
(Source: Modern Arabic: Structures, Functions, and Varieties; Clive Holes)
• We can get a little more by stemming (sketch below)… • …but we’re going to need more.
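A minimal sketch of the stemming idea, assuming nothing beyond what the slide says (the suffix list and names are illustrative, not from any of the papers): strip common English suffixes from both the gloss pieces and the translation words, so the gloss piece enter can match the translation word entering.

```python
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(word):
    """Strip one common English suffix, if the remainder is still long enough."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

gloss_pieces = ["exited", "from", "room", "and", "we", "enter"]
trans_words = ["they", "left", "the", "room", "as", "we", "were", "entering", "it"]

stemmed_trans = {crude_stem(w): w for w in trans_words}
matches = [(g, stemmed_trans[crude_stem(g)]) for g in gloss_pieces
           if crude_stem(g) in stemmed_trans]
print(matches)  # [('room', 'room'), ('we', 'we'), ('enter', 'entering')]
```

Stemming buys us enter/entering, but exited vs. left shows why lexical matching alone will never be enough.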
Utilizing IGT • Thankfully, with an English translation, we already have tools to get phrase and dependency structures that we can project: (Source: Will & Fei’s NAACL 2007 Paper!)
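A hedged sketch of what the projection step could look like for dependencies, in the spirit of that paper but not taken from it (the toy parse and dictionary below are hand-written simplifications): each English (dependent, head) pair is copied onto the aligned source words.

```python
def project_dependencies(english_deps, eng_to_src):
    """Copy English (dependent, head) pairs onto their aligned source words."""
    projected = []
    for dependent, head in english_deps:
        if dependent in eng_to_src and head in eng_to_src:
            projected.append((eng_to_src[dependent], eng_to_src[head]))
    return projected

# Simplified English dependencies for "The teacher gave a book to the boy yesterday":
english_deps = [("The", "teacher"), ("teacher", "gave"), ("book", "gave"),
                ("a", "book"), ("to", "gave"), ("boy", "to"),
                ("the", "boy"), ("yesterday", "gave")]

# Alignment recovered from the IGT gloss (English word -> Welsh word):
eng_to_src = {"gave": "Rhoddodd", "The": "yr", "teacher": "athro",
              "book": "lyfr", "to": "i'r", "the": "i'r",
              "boy": "bachgen", "yesterday": "ddoe"}

print(project_dependencies(english_deps, eng_to_src))
```

Unaligned English words (“a” here) simply drop out, which is one of the places where the projected structure ends up with gaps.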
Utilizing IGT • What can we get from this? • Automatically generated CFGs • Can infer word order from these CFGs • Can infer possible constituents • …suggestions? • From a small amount of data, this is a lot of information, but what about…
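As a concrete illustration (my own toy code, not the paper’s), reading production rules off a projected tree just means recording parent -> children at every internal node; a rule like S -> V NP NP immediately exposes verb-initial word order.

```python
from collections import Counter

class Tree:
    """Minimal tree node; leaves are nodes whose label is a word."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def extract_rules(tree, counts):
    """Record one production per internal node, then recurse."""
    if tree.children:
        rhs = " ".join(child.label for child in tree.children)
        counts[f"{tree.label} -> {rhs}"] += 1
        for child in tree.children:
            extract_rules(child, counts)

# Hand-written stand-in for a projected tree over "Rhoddodd yr athro lyfr":
tree = Tree("S", [Tree("V", [Tree("Rhoddodd")]),
                  Tree("NP", [Tree("DT", [Tree("yr")]), Tree("N", [Tree("athro")])]),
                  Tree("NP", [Tree("N", [Tree("lyfr")])])])

counts = Counter()
extract_rules(tree, counts)
for rule, n in counts.most_common():
    print(n, rule)
```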
Grammar Induction • So, we have a way to get production rules from a small amount of data. • Is this enough? • Probably not. • CFGs aren’t known for their robustness • How about using what we have as a bootstrap?
Grammar Induction • Given unannotated text, we can derive PCFGs • Without annotation, though, we just have unlabelled trees: [diagram: an unlabelled parse of “the dog fell asleep”, with placeholder nonterminals (X0, X1, Y2, Z3, N4) and rule probabilities such as p=0.02, p=0.003] • Such an unlabelled parse doesn’t give us S -> NP VP, though.
Grammar Induction • Can we get labeled trees without annotated text? • Haghighi & Klein (2006) • Propose a way in which production rules can be passed to a PCFG induction algorithm as “prototypical” constituents • Think of these prototypes as a rubric that could be given to a human annotator • e.g. for English, NP -> DT NN
Grammar Induction • Let’s take the possible constituent DT NN • We could tell our PCFG algorithm to apply this as a constituent everywhere it occurs • But what about DT NN NN (the train station)? • We would like to catch this as well (see the sketch below)
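Here is a minimal sketch of the literal-matching baseline being described (names are mine, not Haghighi & Klein’s): treat the prototype as a POS sequence and mark every exact occurrence. It finds “the train” but misses the full NP “the train station”, which is exactly the generalization problem.

```python
def mark_prototype_spans(pos_tags, prototype=("DT", "NN")):
    """Return (start, end) spans whose POS tags exactly match the prototype."""
    spans = []
    k = len(prototype)
    for i in range(len(pos_tags) - k + 1):
        if tuple(pos_tags[i:i + k]) == prototype:
            spans.append((i, i + k))
    return spans

tags = ["DT", "NN", "NN", "VBD"]      # "the train station closed"
print(mark_prototype_spans(tags))     # [(0, 2)]: only "the train", not the full NP
```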
Grammar Induction • H&K’s solution? • Distributional clustering • “a similarity measure between two items on the basis of their immediate left and right contexts” • …to be honest, I lose them in the math here (rough sketch of the context idea below). • Importantly, however, weighting constituent probabilities with the right measure improves the unsupervised baseline from an F-measure of 35.3 to 62.2
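Without reproducing their math, here is a rough sketch of the context idea only (the actual Haghighi & Klein model is considerably more involved; every name below is mine): represent a candidate span by the counts of its immediate left and right POS contexts across the corpus, then compare spans to the prototype with a simple cosine similarity.

```python
from collections import Counter
from math import sqrt

def context_signature(tagged_sents, span_tags):
    """Count (left_tag, right_tag) contexts of every occurrence of span_tags."""
    sig = Counter()
    k = len(span_tags)
    for tags in tagged_sents:
        padded = ["<s>"] + tags + ["</s>"]
        for i in range(1, len(padded) - k):
            if tuple(padded[i:i + k]) == tuple(span_tags):
                sig[(padded[i - 1], padded[i + k])] += 1
    return sig

def cosine(a, b):
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

sents = [["DT", "NN", "VBD"], ["IN", "DT", "NN", "NN", "VBD"], ["DT", "NN", "NN", "VBD"]]
proto = context_signature(sents, ["DT", "NN"])
candidate = context_signature(sents, ["DT", "NN", "NN"])
print(cosine(proto, candidate))   # nonzero overlap: DT NN NN appears in NP-like contexts
```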
Next Steps • By extracting production rules from a small amount of IGT and feeding them to Haghighi & Klein’s unsupervised methods, it may be possible to bootstrap an effective language model from very little data!
Next Steps • Possible applications: • Automatic generation of language resources • (A fully automatic system with the same goals would only compound errors, but automatically annotated data should be easier for a human to correct than to generate by hand) • Assist linguists in the field • (Better model performance could imply better grammar coverage) • …you tell me!