Towards automatic enrichment and analysis of linguistic data for low-density languages Fei Xia University of Washington Joint work with William Lewis and Dan Jinguji
Motivation: theoretical linguistics • For a particular language (e.g., Yaqui), find answers to questions such as: • What is the word order: SVO, SOV, VSO, …? • Does it have a double-object construction? • Can a coordinated phrase be discontinuous (e.g., “NP1 Verb and NP2”)? • … • We want the answers for hundreds of languages.
Motivation: computational linguistics • For a particular language, we want to build • a part-of-speech (POS) tagger and a parser • Common approach: create a treebank • an MT system • Common approach: • collect parallel data • study translation divergence (Dorr, 1994; Fox, 2002; Hwa et al., 2002)
Main ideas • Projecting structures from a resource-rich language (e.g., English) to a low-density language • Tapping the large body of Web-based linguistic data using the ODIN dataset
Structure projection • Previous work • (Yarowsky & Ngai, 2001): POS tags and NP boundaries • (Xi & Hwa, 2005): POS tags • (Hwa et al., 2002): dependency structures • (Quirk et al., 2005): dependency structures • Our work: • Projecting both dependency structures and phrase structures • It does not require a large amount of parallel data or hand-aligned data. • It can be applied to hundreds of languages.
Outline • Background: IGT and ODIN • Data enrichment • Word alignment • Structure projection • Grammar extraction • Experiments • Conclusion and future work
Interlinear Glossed Text (IGT) Rhoddodd yr athro lyfr i’r bachgen ddoe Gave-3sg the teacher book to-the boy yesterday The teacher gave a book to the boy yesterday (Bailyn, 2001)
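The three-line IGT instance above can be represented as a simple record; a minimal sketch (the `IGT` class name and field names are illustrative, not from ODIN itself):

```python
from dataclasses import dataclass

@dataclass
class IGT:
    """One Interlinear Glossed Text instance: three aligned lines."""
    source: str   # language line (here Welsh)
    gloss: str    # word-by-word gloss line
    trans: str    # free English translation

example = IGT(
    source="Rhoddodd yr athro lyfr i'r bachgen ddoe",
    gloss="Gave-3sg the teacher book to-the boy yesterday",
    trans="The teacher gave a book to the boy yesterday",
)
```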
ODIN • Online Database of INterlinear text • Stores and indexes IGT found in scholarly documents on the Web • Searchable by language name, language family, concept/gram, etc. • Current size: • 36,439 instances • 725 languages
The goal • Original IGT: three lines • Enriched IGT: • English phrase structure (PS), dependency structure (DS) • Source PS and DS • Word alignment between source and English translation
Three steps • Parse the English translation • Align the source sentence and its English translation • Project the English PS and DS onto the source side
Step 1: Parsing the English translation The teacher gave a book to the boy yesterday
Heuristic word aligner Gave-3sg the teacher book to-the boy yesterday The teacher gave a book to the boy yesterday The aligner aligns two words if they have the same root form.
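The root-matching heuristic can be sketched as follows (the `root` normalization, lowercasing and stripping hyphenated grams, is an assumption about how "same root form" is computed):

```python
def root(word):
    """Crude root form: lowercase, keep the text before the first hyphen."""
    return word.lower().split("-")[0]

def heuristic_align(gloss_line, trans_line):
    """Link gloss word i to translation word j when their roots match."""
    gloss, trans = gloss_line.split(), trans_line.split()
    links = []
    for i, g in enumerate(gloss):
        for j, t in enumerate(trans):
            if root(g) == root(t):
                links.append((i, j))
    return links

links = heuristic_align(
    "Gave-3sg the teacher book to-the boy yesterday",
    "The teacher gave a book to the boy yesterday",
)
```

Exact root matching links content words reliably but misses paraphrases, which is consistent with the high-precision, low-recall behavior reported later.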
Limitation of heuristic word aligner 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG I caught the pig and the cat
Statistical word aligner • GIZA++ package (Och and Ney, 2000) • Implements the IBM models (Brown et al., 1993) • Widely used in the statistical MT field • Parallel corpus formed from the gloss and translation lines of all the IGT examples in ODIN
Improving word aligner • Train in both directions (gloss→trans and trans→gloss) and combine the results • Split words in the gloss line into morphemes: 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG → 1SG pig -NNOM -SG grasp -PST and cat -NNOM -SG
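The morpheme-splitting step can be sketched like this (splitting on both `-` and `.` and marking non-initial morphemes with a leading hyphen, as in the example above; the exact delimiter set is an assumption):

```python
import re

def split_morphemes(gloss_line):
    """Split each gloss word on '-' and '.' so that every gram
    (PST, SG, NNOM, ...) becomes its own token for the aligner."""
    tokens = []
    for word in gloss_line.split():
        parts = [p for p in re.split(r"[-.]", word) if p]
        tokens.append(parts[0])                    # the stem stays bare
        tokens.extend("-" + p for p in parts[1:])  # grams keep a hyphen marker
    return tokens
```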
Improving word aligner (cont) Pedro-NOM Goyo-ACC yesterday horse-ACC steal-PRFV-SAY-PRES Pedro says Goyo has stolen the horse yesterday. Add (x, x) sentence pairs: (Pedro, Pedro), (Goyo, Goyo), …
Step 3: Projecting structures • Projecting DS • Previous work: • (Hwa et al., 2002) • (Quirk et al., 2005) • Projecting PS
Projecting PS • Copy the English PS and remove all the unaligned English words • Replace English words with corresponding source words • Starting from the root, reorder children of each node. • Attach unaligned source words
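The first two steps, pruning unaligned English leaves and substituting source words, can be sketched as one recursive pass (a sketch only; the `(label, children)` tuple encoding of trees and the word-to-word `align` dictionary are illustrative assumptions, not the paper's data structures):

```python
def project_leaves(tree, align):
    """Copy a PS tree, dropping English leaves with no alignment and
    replacing aligned leaves with their source-language words.
    Internal nodes are (label, children); leaves are plain strings."""
    if isinstance(tree, str):                  # leaf = English word
        return align.get(tree)                 # None if unaligned -> pruned
    label, children = tree
    kept = [c for c in (project_leaves(c, align) for c in children) if c]
    return (label, kept) if kept else None     # drop nodes emptied by pruning

tree = ("S", [("NP", ["The", "teacher"]),
              ("VP", ["gave", ("NP", ["a", "book"])])])
align = {"The": "yr", "teacher": "athro", "gave": "Rhoddodd", "book": "lyfr"}
projected = project_leaves(tree, align)        # "a" has no alignment
```

Reordering by phrase span and attaching the remaining unaligned source words (the last two steps) are handled separately, as described on the following slides.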
Starting with English PS The teacher gave a book to the boy yesterday
After reordering
Reordering two children of x: y1 and y2. Let Si be the phrase span of yi: • S1 and S2 do not overlap: order the two nodes according to their spans. • S1 ⊂ S2: remove y2. • S1 ⊃ S2: remove y1. • S1 and S2 overlap and neither is a strict subset of the other: remove both nodes; if y1 and y2 are leaf nodes, merge them.
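The four span cases can be sketched as a small decision function (a sketch; representing each span as a set of source-word positions is an assumption for illustration):

```python
def reorder_decision(s1, s2):
    """Classify sibling nodes y1, y2 by their source phrase spans s1, s2
    (sets of source-word positions), following the four cases above."""
    if not (s1 & s2):
        return "reorder"        # disjoint spans: order by span positions
    if s1 < s2:                 # s1 a strict subset of s2
        return "remove_y2"
    if s1 > s2:                 # s2 a strict subset of s1
        return "remove_y1"
    return "remove_both"        # overlapping, neither contains the other
```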
Attaching unaligned source words
Information that can be extracted from enriched IGT • Grammars for source language • Transfer rules • Examples with interesting properties (e.g., crossing dependencies)
Grammars S → VBD NP NP PP NP NP → DT NN NP → NN PP → IN+DT NN
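Reading such CFG productions off a projected phrase structure can be sketched as follows (the `(label, children)` tree encoding with string leaves is an illustrative assumption; the example tree is the projected Welsh sentence from earlier):

```python
def extract_rules(tree):
    """Collect CFG productions (lhs, rhs) from a projected PS tree.
    A node is (label, children); a preterminal's children are strings."""
    label, children = tree
    if all(isinstance(c, str) for c in children):
        return []                              # preterminal: no phrasal rule
    rules = [(label, tuple(c[0] for c in children))]
    for c in children:
        rules.extend(extract_rules(c))
    return rules

welsh = ("S", [("VBD", ["Rhoddodd"]),
               ("NP", [("DT", ["yr"]), ("NN", ["athro"])]),
               ("NP", [("NN", ["lyfr"])]),
               ("PP", [("IN+DT", ["i'r"]), ("NN", ["bachgen"])]),
               ("NP", [("NN", ["ddoe"])])])
rules = extract_rules(welsh)
```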
Examples of crossing dependencies Inepo kow-ta bwuise-k into mis-ta 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG I caught the pig and the cat (Martinez Fabian, 2006)
Outline • Background: IGT and ODIN • Data enrichment • Experiments • Conclusion and future work
Experiments • Test on a small set of IGT examples for seven languages: • SVO: German (GER) and Hausa (HUA) • SOV: Korean (KKN) and Yaqui (YAQ) • VSO: Irish (GLI) and Welsh (WLS) • VOS: Malagasy (MEX)
Test set Numbers in the last row come from the Ethnologue (Gordon, 2005). Human annotators checked the system output and corrected: - English DS - word alignment - source DS
Heuristic word aligner High precision, low recall.