Towards automatic enrichment and analysis of linguistic data for low-density languages Fei Xia University of Washington Joint work with William Lewis and Dan Jinguji
Motivation: theoretical linguistics • For a particular language (e.g., Yaqui), find answers to questions such as: • What is the word order: SVO, SOV, VSO, …? • Does it have a double-object construction? • Can a coordinated phrase be discontinuous (e.g., “NP1 Verb and NP2”)? • … • We want the answers for hundreds of languages.
Motivation: computational linguistics • For a particular language, we want to build • a part-of-speech (POS) tagger and a parser • Common approach: create a treebank • an MT system • Common approach: • collect parallel data • study translation divergence (Dorr, 1994; Fox, 2002; Hwa et al., 2002)
Main ideas • Projecting structures from a resource-rich language (e.g., English) to a low-density language • Tapping the large body of Web-based linguistic data using the ODIN dataset
Structure projection • Previous work • (Yarowsky & Ngai, 2001): POS tags and NP boundaries • (Xi & Hwa, 2005): POS tags • (Hwa et al., 2002): dependency structures • (Quirk et al., 2005): dependency structures • Our work: • Projecting both dependency structures and phrase structures • It does not require a large amount of parallel data or hand-aligned data. • It can be applied to hundreds of languages.
Outline • Background: IGT and ODIN • Data enrichment • Word alignment • Structure projection • Grammar extraction • Experiments • Conclusion and future work
Interlinear Glossed Text (IGT) Rhoddodd yr athro lyfr i’r bachgen ddoe Gave-3sg the teacher book to-the boy yesterday The teacher gave a book to the boy yesterday (Bailyn, 2001)
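The three-line IGT instance above can be represented as a simple record; a minimal sketch (the `IGT` class name and field names are illustrative, not from ODIN itself):

```python
from dataclasses import dataclass

@dataclass
class IGT:
    """One Interlinear Glossed Text instance: three aligned lines."""
    source: str   # language line (here Welsh)
    gloss: str    # word-by-word gloss line
    trans: str    # free English translation

example = IGT(
    source="Rhoddodd yr athro lyfr i'r bachgen ddoe",
    gloss="Gave-3sg the teacher book to-the boy yesterday",
    trans="The teacher gave a book to the boy yesterday",
)
```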
ODIN • Online Database of INterlinear text • Stores and indexes IGT found in scholarly documents on the Web • Searchable by language name, language family, concept/gram, etc. • Current size: • 36,439 instances • 725 languages
The goal • Original IGT: three lines • Enriched IGT: • English phrase structure (PS), dependency structure (DS) • Source PS and DS • Word alignment between source and English translation
Three steps • Parse the English translation • Align the source sentence and its English translation • Project the English PS and DS onto the source side
Step 1: Parsing the English translation The teacher gave a book to the boy yesterday
Heuristic word aligner Gave-3sg the teacher book to-the boy yesterday The teacher gave a book to the boy yesterday The aligner aligns two words if they have the same root form.
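The root-matching heuristic can be sketched as follows (the `root` normalization, lowercasing and stripping hyphenated grams, is an assumption about how "same root form" is computed):

```python
def root(word):
    """Crude root form: lowercase, keep the text before the first hyphen."""
    return word.lower().split("-")[0]

def heuristic_align(gloss_line, trans_line):
    """Link gloss word i to translation word j when their roots match."""
    gloss, trans = gloss_line.split(), trans_line.split()
    links = []
    for i, g in enumerate(gloss):
        for j, t in enumerate(trans):
            if root(g) == root(t):
                links.append((i, j))
    return links

links = heuristic_align(
    "Gave-3sg the teacher book to-the boy yesterday",
    "The teacher gave a book to the boy yesterday",
)
```

Exact root matching links content words reliably but misses paraphrases, which is consistent with the high-precision, low-recall behavior reported later.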
Limitation of heuristic word aligner 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG I caught the pig and the cat
Statistical word aligner • GIZA++ package (Och and Ney, 2000) • Implements the IBM models (Brown et al., 1993) • Widely used in the statistical MT field • Parallel corpus formed from the gloss and translation lines of all the IGT examples in ODIN
Improving word aligner • Train in both directions (gloss→trans and trans→gloss) and combine the results • Split words in the gloss line into morphemes: 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG → 1SG pig -NNOM -SG grasp -PST and cat -NNOM -SG
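The morpheme-splitting step can be sketched like this (splitting on both `-` and `.` and marking non-initial morphemes with a leading hyphen, as in the example above; the exact delimiter set is an assumption):

```python
import re

def split_morphemes(gloss_line):
    """Split each gloss word on '-' and '.' so that every gram
    (PST, SG, NNOM, ...) becomes its own token for the aligner."""
    tokens = []
    for word in gloss_line.split():
        parts = [p for p in re.split(r"[-.]", word) if p]
        tokens.append(parts[0])                    # the stem stays bare
        tokens.extend("-" + p for p in parts[1:])  # grams keep a hyphen marker
    return tokens
```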
Improving word aligner (cont) Pedro-NOM Goyo-ACC yesterday horse-ACC steal-PRFV-SAY-PRES Pedro says Goyo has stolen the horse yesterday. Add (x, x) sentence pairs: (Pedro, Pedro), (Goyo, Goyo), …
Step 3: Projecting structures • Projecting DS • Previous work: • (Hwa et al., 2002) • (Quirk et al., 2005) • Projecting PS
Projecting PS • Copy the English PS and remove all the unaligned English words • Replace English words with corresponding source words • Starting from the root, reorder children of each node. • Attach unaligned source words
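The first two steps, pruning unaligned English leaves and substituting source words, can be sketched as one recursive pass (a sketch only; the `(label, children)` tuple encoding of trees and the word-to-word `align` dictionary are illustrative assumptions, not the paper's data structures):

```python
def project_leaves(tree, align):
    """Copy a PS tree, dropping English leaves with no alignment and
    replacing aligned leaves with their source-language words.
    Internal nodes are (label, children); leaves are plain strings."""
    if isinstance(tree, str):                  # leaf = English word
        return align.get(tree)                 # None if unaligned -> pruned
    label, children = tree
    kept = [c for c in (project_leaves(c, align) for c in children) if c]
    return (label, kept) if kept else None     # drop nodes emptied by pruning

tree = ("S", [("NP", ["The", "teacher"]),
              ("VP", ["gave", ("NP", ["a", "book"])])])
align = {"The": "yr", "teacher": "athro", "gave": "Rhoddodd", "book": "lyfr"}
projected = project_leaves(tree, align)        # "a" has no alignment
```

Reordering by phrase span and attaching the remaining unaligned source words (the last two steps) are handled separately, as described on the following slides.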
Starting with English PS The teacher gave a book to the boy yesterday
After reordering
Reordering two children of x: y1 and y2. Let Si be the phrase span of yi: • S1 and S2 do not overlap: order the two nodes according to their spans. • S1 ⊂ S2: remove y2. • S1 ⊃ S2: remove y1. • S1 and S2 overlap and neither is a strict subset of the other: remove both nodes; if y1 and y2 are leaf nodes, merge them.
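The four span cases can be sketched as a small decision function (a sketch; representing each span as a set of source-word positions is an assumption for illustration):

```python
def reorder_decision(s1, s2):
    """Classify sibling nodes y1, y2 by their source phrase spans s1, s2
    (sets of source-word positions), following the four cases above."""
    if not (s1 & s2):
        return "reorder"        # disjoint spans: order by span positions
    if s1 < s2:                 # s1 a strict subset of s2
        return "remove_y2"
    if s1 > s2:                 # s2 a strict subset of s1
        return "remove_y1"
    return "remove_both"        # overlapping, neither contains the other
```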
Attaching unaligned source words
Information that can be extracted from enriched IGT • Grammars for source language • Transfer rules • Examples with interesting properties (e.g., crossing dependencies)
Grammars S → VBD NP NP PP NP NP → DT NN NP → NN PP → IN+DT NN
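Reading such CFG productions off a projected phrase structure can be sketched as follows (the `(label, children)` tree encoding with string leaves is an illustrative assumption; the example tree is the projected Welsh sentence from earlier):

```python
def extract_rules(tree):
    """Collect CFG productions (lhs, rhs) from a projected PS tree.
    A node is (label, children); a preterminal's children are strings."""
    label, children = tree
    if all(isinstance(c, str) for c in children):
        return []                              # preterminal: no phrasal rule
    rules = [(label, tuple(c[0] for c in children))]
    for c in children:
        rules.extend(extract_rules(c))
    return rules

welsh = ("S", [("VBD", ["Rhoddodd"]),
               ("NP", [("DT", ["yr"]), ("NN", ["athro"])]),
               ("NP", [("NN", ["lyfr"])]),
               ("PP", [("IN+DT", ["i'r"]), ("NN", ["bachgen"])]),
               ("NP", [("NN", ["ddoe"])])])
rules = extract_rules(welsh)
```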
Examples of crossing dependencies Inepo kow-ta bwuise-k into mis-ta 1SG pig-NNOM.SG grasp-PST and cat-NNOM.SG I caught the pig and the cat (Martinez Fabian, 2006)
Outline • Background: IGT and ODIN • Data enrichment • Experiments • Conclusion and future work
Experiments • Test on a small set of IGT examples for seven languages: • SVO: German (GER) and Hausa (HUA) • SOV: Korean (KKN) and Yaqui (YAQ) • VSO: Irish (GLI) and Welsh (WLS) • VOS: Malagasy (MEX)
Test set Numbers in the last row come from the Ethnologue (Gordon, 2005). Human annotators checked the system output and corrected: - English DS - word alignment - source DS
Heuristic word aligner High precision, low recall.