Morphosyntactic correspondence: a progress report on bitext parsing

Alexander Fraser, Renjing Wang, Hinrich Schütze • Institute for NLP, University of Stuttgart • INFuture2009: Digital Resources and Knowledge Sharing • Nov 4th 2009, Zagreb

Outline • The Institute for Natural Language Processing at the University of Stuttgart • Bitext parsing • Using morphosyntactic correspondence

IfNLP Stuttgart • The Institute for Natural Language Processing (IfNLP/IMS) at the University of Stuttgart • Large department • Dogil (Phonetics and Speech) • Kuhn/Rohrer (LFG syntax and semantics) • Cahill (LFG generation) • Heid (Terminology extraction, morphology) • Padó (Semantics, lexical semantics) • Schütze (Statistical NLP and Information Retrieval) • More on next slide

IfNLP – Statistical NLP Group • Hinrich Schütze (director since 2004) • Bernd Möbius – Speech recognition and synthesis • Helmut Schmid – Parsing, morphology (known for TreeTagger, BitPar) • Sabine Schulte im Walde – NLP and cognitive modeling of lexical semantics • Michael Walsh – Speech, exemplar theoretic syntax • Alex Fraser – Statistical machine translation, parsing, cross-lingual information retrieval • General department areas of research • New statistical NLP models and methods • Semi-supervised and active learning • Cognitive/linguistic representation models • Applied to: NLP, retrieval, MT, speech, e-learning, …

IfNLP – Partnerships • Stuttgart: large projects with linguistics, computer science, EE signal processing, high performance computing • Germany: Darmstadt, Tübingen, DSPIN/CLARIN consortium (UIMA-based German processing) • International: large French-led European project (6 universities, 4 industrial partners), collaborations on South African languages, Edinburgh, CLARIN • Industrial: various projects with publishers (many focusing on terminology)

What is bitext parsing? • Bitext: a text and its translation • Sentences and their translations are aligned • Sometimes called a parallel corpus • Syntactic parsing: automatically find the syntactic structure of a sentence (syntactic parse) • Bitext parsing: automatically find the syntactic structure of the parallel sentences in a bitext • We will use the complementarity of the syntax of the two languages to obtain improved parses

Motivation for bitext parsing • Many advances in syntactic parsing come from better modeling • But the overall bottleneck is the size of the treebank • Our research asks a different question: • Where can we (cheaply) obtain additional information that supplements the treebank? • A translation is a new information source for resolving ambiguity • The human translator understands the sentence and disambiguates for us! • Our research goal was to build large databases of improved parses to help establish preferences for difficult phenomena like PP-attachment

Clause attachment ambiguity • Parse 1: high attachment (wrong) • Parse 2: low attachment (correct)

Not ambiguous in German • Number agreement disambiguates • FRAU (woman) and HATTE (had) agree • Unambiguous low attachment

Parse reranking of bitext • Goal: improve English parsing accuracy • Parse English sentence, obtain list of 100 best parse candidates • Parse German sentence, obtain single best parse • Determine the correspondence of German to English words using a word alignment • Calculate syntactic divergence of each English parse candidate and the projection of the German parse • Choose probable English parse candidate with low syntactic divergence
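The reranking steps above can be sketched in a few lines. This is a minimal illustration, not the system described in the talk: it assumes parses are reduced to sets of token spans, uses a single crossing-bracket count as a stand-in for "syntactic divergence", and the names `project_span`, `divergence`, `rerank` and the `weight` parameter are all hypothetical.

```python
from typing import List, Optional, Sequence, Set, Tuple

Span = Tuple[int, int]   # inclusive (start, end) token indices
Link = Tuple[int, int]   # (german_index, english_index) word alignment link


def project_span(span: Span, alignment: Set[Link]) -> Optional[Span]:
    """Project a German span onto English: smallest span covering all aligned words."""
    g_start, g_end = span
    english = [e for g, e in alignment if g_start <= g <= g_end]
    if not english:
        return None  # nothing aligned, so nothing to project
    return (min(english), max(english))


def crosses(a: Span, b: Span) -> bool:
    """True if the two spans overlap without one containing the other."""
    (a1, a2), (b1, b2) = a, b
    return (a1 < b1 <= a2 < b2) or (b1 < a1 <= b2 < a2)


def divergence(english_spans: Sequence[Span],
               german_spans: Sequence[Span],
               alignment: Set[Link]) -> int:
    """Toy divergence: number of (projected German, English) span pairs that cross."""
    projected = []
    for span in german_spans:
        p = project_span(span, alignment)
        if p is not None:
            projected.append(p)
    return sum(1 for p in projected for e in english_spans if crosses(p, e))


def rerank(candidates: Sequence[Tuple[float, List[Span]]],
           german_spans: Sequence[Span],
           alignment: Set[Link],
           weight: float = 1.0) -> int:
    """Return the index of the candidate with the best log-prob minus weighted divergence.

    `candidates` is an n-best list of (parser log-probability, constituent spans).
    """
    scores = [log_prob - weight * divergence(spans, german_spans, alignment)
              for log_prob, spans in candidates]
    return max(range(len(scores)), key=scores.__getitem__)
```

In this sketch a slightly less probable English candidate wins when the more probable one has brackets that cross the projected German constituents.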

Measuring syntactic divergence • Define features to capture different (overlapping) aspects of syntactic divergence. Functions of: • Candidate English parse e • German parse g • Word alignment a • Combine in log-linear model: $P(e \mid g) = \dfrac{\exp \sum_m \lambda_m h_m(g, e, a)}{\sum_{e'} \exp \sum_m \lambda_m h_m(g, e', a)}$ • Discriminatively train λ parameters to maximize parsing accuracy on a training set (minimum error rate training)
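The log-linear model on this slide normalizes the exponentiated weighted feature sum over all English candidates. A minimal sketch, assuming the feature values h_m(g, e, a) have already been computed for each candidate in the n-best list; the function name `loglinear_posterior` is illustrative, not taken from the paper.

```python
import math
from typing import List, Sequence


def loglinear_posterior(features: Sequence[Sequence[float]],
                        weights: Sequence[float]) -> List[float]:
    """Compute P(e | g) for each candidate e in an n-best list.

    features[i][m] holds h_m(g, e_i, a); weights[m] holds lambda_m.
    """
    for feats in features:
        if len(feats) != len(weights):
            raise ValueError("each candidate needs one value per weight")
    # Weighted feature sum per candidate: sum_m lambda_m * h_m(g, e, a)
    scores = [sum(w * h for w, h in zip(weights, feats)) for feats in features]
    # Subtract the maximum before exponentiating for numerical stability;
    # this cancels in the ratio and leaves the probabilities unchanged.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]
```

For reranking only the argmax matters, so the normalization can be skipped in practice; it is shown here to mirror the formula.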

Rich bitext projection features • Defined 36 features by looking at common English parsing errors • No monolingual features, except baseline parser probability • General features • Is there a probable label correspondence between German and the hypothesized English parse? • How expected is the size of each constituent in the hypothesized English parse given the German parse? • Specific features • Are coordinations realized identically? • Is the NP structure the same? • Mix of probabilistic and heuristic features

Training • Use BitPar syntactic forest parser • English BitPar trained on Penn Treebank • German BitPar trained on Tiger Treebank • Probabilistic feature functions built using large parallel text (Europarl) • Weights on feature functions (lambda vector) trained on portion of the Penn Treebank together with its translation into German • Minimum error rate training using F score
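Training the weights to maximize F score can be illustrated with a deliberately simplified stand-in for minimum error rate training: a grid search over a single weight, choosing the value whose reranked output has the highest bracket F1 on the training set. Real MERT optimizes many weights with a line search; the grid, the single divergence feature, and the names `bracket_f1` and `tune_weight` are assumptions made for this sketch.

```python
from typing import FrozenSet, List, Sequence, Tuple

Bracket = Tuple[str, int, int]                       # (label, start, end)
Candidate = Tuple[float, float, FrozenSet[Bracket]]  # (log-prob, divergence, brackets)


def bracket_f1(gold: Sequence[FrozenSet[Bracket]],
               predicted: Sequence[FrozenSet[Bracket]]) -> float:
    """Corpus-level labeled bracket F1 over paired gold/predicted parses."""
    matched = sum(len(g & p) for g, p in zip(gold, predicted))
    n_gold = sum(len(g) for g in gold)
    n_pred = sum(len(p) for p in predicted)
    if matched == 0:
        return 0.0
    precision = matched / n_pred
    recall = matched / n_gold
    return 2 * precision * recall / (precision + recall)


def tune_weight(nbest: Sequence[List[Candidate]],
                gold: Sequence[FrozenSet[Bracket]],
                grid: Sequence[float]) -> Tuple[float, float]:
    """Pick the divergence weight from `grid` that maximizes training-set F1."""
    best_w, best_f = grid[0], -1.0
    for w in grid:
        # Rerank every sentence with this weight and keep the chosen brackets.
        chosen = [max(cands, key=lambda c: c[0] - w * c[1])[2] for cands in nbest]
        score = bracket_f1(gold, chosen)
        if score > best_f:  # ties keep the earlier (smaller) weight
            best_w, best_f = w, score
    return best_w, best_f
```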

Reranking English parses • Difficult task • German is difficult to parse • Our knowledge source, the German parser, is out-of-domain (poor performance) • Baseline English parser we are trying to improve is in-domain (good performance) • Test set has long sentences • Result: 0.70% F1 improvement on test data (stat. significant)

New results • Reranking German parses • We needed German gold standard parses (and English translations) • Sebastian Padó has made a small parallel treebank for Europarl available • No engineering on German yet • We are using the same syntactic divergence features which were designed to improve English parsing • There are German-specific ambiguities which could be modeled, such as subject-object ambiguity (e.g., Die Maus jagt die Katze, “the mouse chases the cat” or “the cat chases the mouse”) • But this is an easier task because the parser we are trying to improve is weaker (German is hard to parse, Europarl is out of domain) • 2.3% F1 improvement currently; we think this can be further improved

Summary: bitext parsing • I showed you an approach for bitext parsing • Reranking the parses of English to minimize syntactic divergence with an automatically generated German parse • I then showed our first results for reranking German parses using a single English parse • The approach we used for this kind of morphosyntactic correspondence is more general than just parse reranking • Machine translation involves morphosyntactic correspondence • And this is where we are interested in looking at Croatian

Morphosyntactic processing • I am co-PI of a new IfNLP project funded by the DFG (German Research Foundation) • Project: morphosyntactic modeling for statistical machine translation (SMT) • SMT research, up until recently, has been dominated by translation into English • English expresses a lot of information through word order, very little through inflection • Approaches to translating morphologically rich languages to English are preprocessing based

Present: linguistic preprocessing • Linguistic preprocessing for SMT (stat. machine translation) • From: freer syntax, morphologically rich language • To: rigid syntax, morphologically poor language • Existing examples: German to English, Czech to English

Present: linguistic preprocessing • How this works • Produce morphosyntactic analysis of German (or Czech) • Reorder words in the German/Czech sentence to be in English order • Reduce morphological inflection (for instance, remove case marking, remove all agreement on adjectives, etc) • For Czech: insert pseudo-words (e.g. indicate PRO-drop pronouns) • Use statistics on this “simplified” German or Czech to map directly to English using SMT
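Two of the preprocessing steps above, reordering and inflection reduction, can be sketched on a toy example. The sketch assumes tokens already carry a lemma and an STTS-style tag from some morphosyntactic analysis; the single reordering rule (move a clause-final participle next to its auxiliary) and the names `Token`, `move_participle`, `reduce_inflection` and `preprocess` are illustrative only and are not the rules of the Stuttgart system.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class Token:
    surface: str  # inflected form as it appears in the sentence
    lemma: str    # base form from the (assumed) morphological analysis
    tag: str      # STTS-style POS tag, e.g. VAFIN, VVPP, ADJA


def move_participle(tokens: Sequence[Token]) -> List[Token]:
    """Toy reordering rule: put a clause-final participle right after the finite auxiliary."""
    tokens = list(tokens)
    if len(tokens) < 2 or tokens[-1].tag != "VVPP":
        return tokens
    for i, tok in enumerate(tokens[:-1]):
        if tok.tag == "VAFIN":
            return tokens[:i + 1] + [tokens[-1]] + tokens[i + 1:-1]
    return tokens


def reduce_inflection(tokens: Sequence[Token],
                      reduce_tags: FrozenSet[str] = frozenset({"ADJA"})) -> List[str]:
    """Replace the surface form by the lemma for tags whose agreement is dropped."""
    return [t.lemma if t.tag in reduce_tags else t.surface for t in tokens]


def preprocess(tokens: Sequence[Token]) -> List[str]:
    """Reorder into English-like order, then strip adjective agreement."""
    return reduce_inflection(move_participle(tokens))


# Hand-written analysis of "ich habe das neue Buch gelesen" ("I have read the new book").
sentence = [
    Token("ich", "ich", "PPER"),
    Token("habe", "haben", "VAFIN"),
    Token("das", "das", "ART"),
    Token("neue", "neu", "ADJA"),
    Token("Buch", "Buch", "NN"),
    Token("gelesen", "lesen", "VVPP"),
]
simplified = preprocess(sentence)  # ['ich', 'habe', 'gelesen', 'das', 'neu', 'Buch']
```

The SMT system would then be trained on this simplified German paired with the unchanged English side.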

Present: linguistic preprocessing • How well does this work? • German to English SMT with linguistic preprocessing (Stuttgart system) • Results from 2008 ACL workshop on machine translation (extensive human evaluation) • The only system limited to the organizers' data that was competitive with: • The best of 5 rule-based MT systems • Saarbrücken hybrid rule-based/SMT system • Google Translate, which does not use linguistic preprocessing but does use vastly more data

Future: modeling • What about translating from English to German or to Slavic languages? • Problem: morphological generation is more difficult • It is easy to reduce multiple inflections to one (for instance, stemming) • Harder to learn to generate the right inflection

Future: modeling • Current work on morphological generation • Work at Charles University in Prague on Czech • Tectogrammatical representation is not (yet) competitive with simple statistics (little explicit knowledge of morphology or syntax) • Best English to German SMT systems also use little or no morphological knowledge • And they are much worse than rule-based English to German systems • Challenge: using morphosyntactic knowledge with statistical approaches requires more than just linguistic preprocessing; it requires morphosyntactic modeling

Morphosyntactic correspondence • In fact, all multilingual problems involve morphosyntactic correspondence: • If we have a source parse tree, and source text, and we would like a target text, this is machine translation • If we have a source parse tree, source text and target text, and we would like a target parse, this is bitext parsing • If we would like to know which word in the target text is a translation of a particular word in the source text and we use morphosyntactic analysis, this is syntactic word alignment • The same thinking can be used for cross-lingual information retrieval • Very relevant when one of the languages is morphologically rich

Conclusion • I introduced the IfNLP Stuttgart • I presented a new approach to improving parsing using morphosyntactic correspondence: bitext parsing • I discussed the general challenge of using morphosyntactic correspondence, focusing on statistical machine translation • The biggest challenge is translating into morphologically rich languages with freer word order (e.g., German and particularly Slavic languages) • We are interested in the challenge of building systems to translate to Croatian • To do this, we need partners who are working on Croatian analysis! • We also request that you think about multilingual applications when producing Croatian NLP resources • The type of approach I showed for bitext parsing is useful for other multilingual applications

Thank you!


Statistical Approach • Using statistical models • Create many alternatives, called hypotheses • Give a score to each hypothesis • Find the hypothesis with the best score through search • Disadvantages • Difficulties handling structurally rich models (math and computation) • Need data to train the model parameters • Difficult to understand the decision process of the system • Advantages • Avoid hard decisions • Speed can be traded for quality, no all-or-nothing • Works better in the presence of unexpected input • Learns automatically as more data becomes available • (Modified from Vogel)

Morphosyntactic knowledge • We use: morphological analyzers & treebanks, which are combined in parsing models learned from treebanks • English models have little morphological analysis (suffix analysis to determine POS for unknown words) • German syntactic parser BitPar (Schmid) uses SMOR (Stuttgart Morphological Analyzer) • Given inflected form, SMOR returns possible fine-grained POS tags • E.g., for nouns/adjectives: POS, case, gender, number, definiteness • BitPar puts possible analyses in the chart, and disambiguates • Slavic languages require even more morphological knowledge than German
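The idea that an analyzer returns several fine-grained readings per inflected form, and that the parser disambiguates among them, can be illustrated with a toy lexicon. This does not use SMOR or BitPar and does not reflect their interfaces; `TOY_LEXICON`, `analyze` and `agreeing_readings` are hypothetical names, and the readings are a small hand-written subset of (case, gender, number) combinations.

```python
from typing import Dict, Set, Tuple

Reading = Tuple[str, str, str]  # (case, gender, number)

GENDERS = ("Masc", "Fem", "Neut")

# Hand-written toy lexicon; a real analyzer covers far more forms and features.
TOY_LEXICON: Dict[str, Set[Reading]] = {
    "die": {("Nom", "Fem", "Sg"), ("Acc", "Fem", "Sg")}
           | {(c, g, "Pl") for c in ("Nom", "Acc") for g in GENDERS},
    "der": {("Nom", "Masc", "Sg"), ("Dat", "Fem", "Sg"), ("Gen", "Fem", "Sg")}
           | {("Gen", g, "Pl") for g in GENDERS},
    "Frau": {(c, "Fem", "Sg") for c in ("Nom", "Acc", "Dat", "Gen")},
}


def analyze(form: str) -> Set[Reading]:
    """Return all readings of an inflected form (empty set if unknown)."""
    return set(TOY_LEXICON.get(form, set()))


def agreeing_readings(determiner: str, noun: str) -> Set[Reading]:
    """Keep only readings on which determiner and noun agree, as a parser chart would."""
    return analyze(determiner) & analyze(noun)
```

Here `agreeing_readings("der", "Frau")` leaves only the dative and genitive readings, while `agreeing_readings("die", "Frau")` leaves nominative and accusative, which is the kind of residual ambiguity a parser (or a translation) then has to resolve.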

Transferring syntactic knowledge • Need knowledge source! • English syntactic parser • About 90% bracketing accuracy • Mapping • Requires bitext • Work discussed here uses German/English Europarl (European Parliament Proceedings) • Resource for Croatian: Acquis Communautaire • Automatically generated word alignment

Additional details in the paper • Formalization of bitext parsing as a parse reranking task • Definitions of bitext feature functions • Analysis of feature functions through feature selection • Comparison of MERT (minimum error rate training) with SVM-Rank
