Feasibility of Human-in-the-loop Minimum Error Rate Training
Omar F. Zaidan, Chris Callison-Burch
The Center for Language and Speech Processing, Johns Hopkins University
EMNLP 2009 – Singapore, Thursday August 6th, 2009
{ ozaidan | ccb } @ cs.jhu.edu
CCB ’09: "quixotic things like human-in-the-loop minimum error rate training"
quixotic: foolishly impractical especially in the pursuit of ideals; especially: marked by rash lofty romantic ideas or extravagantly chivalrous action
Log-linear MT in One Slide
• MT systems rely on several models.
• A candidate c is represented as a feature vector: h(c) = ⟨h1(c), …, hM(c)⟩
• Corresponding weight vector: λ = ⟨λ1, …, λM⟩
• Each candidate is assigned a score: score(c) = λ · h(c) = Σm λm hm(c)
• System selects the highest-scoring translation: ĉ = argmaxc score(c)
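A minimal sketch of the scoring and selection steps, with hypothetical feature values and feature names (not any particular decoder's internals):

```python
# Log-linear candidate selection: score = weight vector . feature vector.

def score(weights, features):
    """Dot product of the weight vector and a candidate's feature vector."""
    return sum(w * h for w, h in zip(weights, features))

def select_best(weights, candidates):
    """Return the highest-scoring candidate translation."""
    return max(candidates, key=lambda c: score(weights, c["features"]))

# Hypothetical candidates with (e.g.) LM, TM, and word-penalty features:
candidates = [
    {"text": "the patient was isolated .", "features": [-12.1, -4.3, 5.0]},
    {"text": "the patient isolated .",     "features": [-13.5, -3.9, 4.0]},
]
weights = [1.0, 0.8, -0.2]
print(select_best(weights, candidates)["text"])
```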
Minimum Error Rate Training
• Och (2003): the weight vector should be chosen by optimizing directly toward the evaluation metric of interest (the MERT phase).
• But the error surface is ugly: piecewise constant, with many local optima.
• Och suggests an efficient line optimization method…
Visualizing Och’s Method
[Animated figure: "We want to plot this" – the error metric as a function of a single weight along a search direction, computed from TER-like sufficient statistics cached per candidate. With an automatic metric, each evaluation along the line is fast. Fast! … Fast?]
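A simplified sketch of the idea behind the line optimization (not the exact upper-envelope algorithm of Och 2003): along a search direction, each candidate's score is linear in the step size gamma, so the induced 1-best outputs, and hence the corpus-level metric computed from cached sufficient statistics, can be evaluated cheaply for many gamma values. All data structures here are hypothetical.

```python
# Sweep gamma along a search direction; each candidate c has a linear
# score a + gamma * b and cached metric sufficient statistics "stats".

def line_sweep(sentences, gammas, corpus_error):
    """sentences: list of candidate lists; corpus_error: maps summed
    sufficient statistics to a corpus-level error. Returns best gamma."""
    best_gamma, best_err = None, float("inf")
    for g in gammas:
        # 1-best candidate per sentence at this gamma (cheap: linear scores)
        tops = [max(cands, key=lambda c: c["a"] + g * c["b"])
                for cands in sentences]
        # Sum cached sufficient statistics instead of re-scoring outputs
        totals = [sum(t["stats"][i] for t in tops)
                  for i in range(len(tops[0]["stats"]))]
        err = corpus_error(totals)
        if err < best_err:
            best_gamma, best_err = g, err
    return best_gamma, best_err
```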
BLEU & MERT • The metric most often optimized is BLEU: • Why BLEU? • Usually the reported metric, • it has been shown to correlate well with human judgment, and • it can be computed efficiently.
Problems with BLEU MERT
• General critiques of BLEU:
• Chiang et al. (2008): weaknesses in BLEU.
• Callison-Burch et al. (2006): not always appropriate to use BLEU to compare systems.
• Metric disparity:
• Actual evaluations have a human component (e.g. GALE uses H-TER).
• What is the alternative? H-TER MERT?
H-TER MERT?
• In theory, MERT is applicable to any metric.
• In practice, scoring 1000’s of candidate translations with H-TER is expensive.
• H-TER cost estimate:
• Assume a sentence takes 10 seconds to post-edit, at a cost of $0.10.
• 100 candidates for each of 1,000 source sentences → 35 work days and $10,000 per iteration(!)
• vs. BLEU: minutes per iteration (and free).
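The estimate follows directly from the stated assumptions; a quick back-of-the-envelope check:

```python
# H-TER cost estimate from the slide's assumptions.
seconds_per_edit, cost_per_edit = 10, 0.10
n_candidates, n_sentences = 100, 1000

edits = n_candidates * n_sentences          # 100,000 post-edits
hours = edits * seconds_per_edit / 3600     # ~278 hours
work_days = hours / 8                       # ~35 eight-hour work days
dollars = edits * cost_per_edit             # $10,000
print(f"{work_days:.0f} work days, ${dollars:,.0f} per MERT iteration")
```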
A Human-Based Automatic Metric
• We suggest a metric that is:
• viable to be used in MERT, yet
• based on human judgment.
• Viability: relies on a prebuilt database; no human involvement during MERT.
• Human-based: the database is a repository of human judgments.
Our Metric: RYPT
• Main idea: reward syntactic constituents in the source that are aligned to "acceptable" substrings in the candidate translation.
• When scoring a candidate:
• Obtain a parse tree for the source sentence.
• Align source words to candidate words.
• Count the number of subtrees translated in an "acceptable" manner.
• RYPT = Ratio of Yes nodes in the Parse Tree.
RYPT (Ratio of Y in Parse Tree)
[Animated figure: a source parse tree aligned to the candidate translation being scored. A Y label on a node indicates an acceptable translation of that constituent, e.g. forecasts deemed an acceptable translation of prognosen; the Y/N labels over all nodes determine the score.]
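A minimal sketch of the RYPT computation over a labeled parse tree; the node representation (label, children) is hypothetical, not the authors' data structure:

```python
# RYPT = fraction of parse-tree nodes labeled "Y" (acceptable).

def count_labels(node):
    """Return (Y-labeled nodes, all nodes) in the subtree rooted here."""
    yes = 1 if node["label"] == "Y" else 0
    total = 1
    for child in node.get("children", []):
        y, t = count_labels(child)
        yes += y
        total += t
    return yes, total

def rypt(root):
    yes, total = count_labels(root)
    return yes / total

# Toy tree: 2 of 3 nodes labeled Y -> RYPT = 0.667
tree = {"label": "Y", "children": [
    {"label": "Y", "children": []},
    {"label": "N", "children": []},
]}
print(rypt(tree))
```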
Is RYPT Good?
• Is RYPT acceptable? Must show, empirically, that RYPT is a reasonable substitute for human judgment.
• Is RYPT feasible? Must show collecting the necessary judgments is efficient and affordable. ← Next…
Feasibility: Reusing Judgments
• For each source sentence, we build a database, where each entry is a tuple: <source substring, candidate substring, judgment>
• A judgment is reused across candidates, e.g. <der patient, the patient, YES> applies to all of:
der patient wurde isoliert .
→ the patient was isolated .
→ the patient isolated .
→ the patient was in isolation .
→ the patient has been isolated .
Feasibility: Reusing Judgments
• Likewise, <der patient, of the patient, NO> is reused across candidates:
der patient wurde isoliert .
→ of the patient was isolated .
→ of the patient isolated .
→ of the patient was in isolation .
→ of the patient has been isolated .
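A sketch of how such a database enables reuse, assuming a simple keyed lookup (the actual storage format is not specified in the talk):

```python
# One database per source sentence, keyed by
# (source substring, candidate substring) -> "Y" or "N".
judgments = {}

def get_judgment(src_sub, cand_sub):
    """Return a cached human judgment, or None if a worker must be asked."""
    return judgments.get((src_sub, cand_sub))

judgments[("der patient", "the patient")] = "Y"
# Reused for "the patient was isolated .", "the patient isolated .", etc.
assert get_judgment("der patient", "the patient") == "Y"
assert get_judgment("der patient", "of the patient") is None  # must query
```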
Feasibility: Label Percolation
• Minimize label collection even further by percolating labels through the parse tree:
• If a node is labeled NO, its ancestors are likely labeled NO → percolate NO up the tree.
• If a node is labeled YES, its descendants are likely labeled YES → percolate YES down the tree.
[Figure: a parse tree where a Y judgment spreads to the nodes below it and an N judgment spreads to the nodes above it.]
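A sketch of the two percolation rules under a hypothetical tree representation (the paper's exact bookkeeping may differ):

```python
# YES percolates down to descendants; NO percolates up to ancestors.

def percolate_yes_down(node):
    """Label this node and all of its descendants Y."""
    node["label"] = "Y"
    for child in node.get("children", []):
        percolate_yes_down(child)

def percolate_no_up(path_to_node):
    """path_to_node: nodes from the root down to the judged node.
    Label the judged node and all of its ancestors N."""
    for node in path_to_node:
        node["label"] = "N"

def apply_judgment(judgment, path_to_node):
    if judgment == "Y":
        percolate_yes_down(path_to_node[-1])
    else:
        percolate_no_up(path_to_node)
```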
Maximizing Label Percolation
• Queries are performed in batch mode.
• For maximum percolation, queries should avoid overlapping substrings.
• One extreme: select only the root node. (A YES for the whole sentence essentially never happens…)
• Other extreme: select all preterminals. (Too much focus on individual words; no percolation.)
Query Selection
• Middle ground: query a frontier node set, where each node covers at most maxLen source words.
[Animated figure: descending from the root, frontier nodes are selected so that the chosen constituents cover the sentence without overlapping.]
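A sketch of this middle-ground selection, assuming each node knows the length of the source span it covers (field names are hypothetical):

```python
# Select the highest nodes whose source span is at most max_len words,
# yielding a frontier node set with no overlapping substrings.

def frontier(node, max_len):
    if node["span"] <= max_len or not node.get("children"):
        return [node]              # query this whole constituent
    nodes = []                     # too long: recurse into children
    for child in node["children"]:
        nodes.extend(frontier(child, max_len))
    return nodes
```

With max_len = 1 this degenerates to querying all preterminals; with a very large max_len it degenerates to querying the root, matching the two extremes above.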
Amazon Mechanical Turk
• We use Amazon Mechanical Turk (AMT) to collect judgment labels.
• AMT: a virtual marketplace that allows "requesters" to create and post tasks to be completed by "workers" around the world.
• The requester provides an HTML template and a CSV database.
• AMT creates individual tasks for workers.
• Task = Human Intelligence Task = HIT
HIT Example
Source: prozent
Reference: percent
Candidate translations: % | per cent
HIT Example
Source: des zentralen statistischen amtes
Reference: statistics office
Candidate translations: data from the central statistical office | from the central statistics office | in the central statistical office | in the central statistics office | of central statistical office | of central statistics office | of the central statistical office | of the central statistics office
Data Summary
• 3,873 HITs created, each with 3.4 judgments on average → 13k labels.
• 115 distinct workers put in 30.8 hours.
• One label per 8.4 seconds (426 labels/hr); hourly ‘wage’: $1.95.
• Cost: $53.47 wages + $6.54 bonuses + $21.43 Amazon fees = $81.44 → 161 labels per $.
Is RYPT Good?
• Is RYPT acceptable? Must show RYPT is a reasonable substitute for human judgment. ← Next…
• Is RYPT feasible? Must show collecting the necessary judgments is efficient and affordable. ← Yes!
Is RYPT Acceptable?
• Is RYPT a reasonable alternative to human judgment?
• Our experiment: compare the predictive power of RYPT vs. BLEU.
• Compare the top-1 candidate by BLEU score vs. the top-1 candidate by RYPT score.
• Which candidate looks better to a human?
RYPT vs. BLEU
[Figure: the candidate list (cand 1 … cand 7 …) ranked by RYPT and by BLEU; RYPT’s choice (e.g. cand 5) vs. BLEU’s choice (e.g. cand 3).]
• Which one would be preferred by a human?
• Ask a Turker! Actually, ask 3 Turkers…
• 3 judgments × 250 sentence pairs = 750 judgments
RYPT vs. BLEU
• RYPT’s choice is preferred 46.1% of the time, vs. 36.0% for BLEU’s choice.
• Majority vote breakdown:
Majority vote picks RYPT’s choice: 48.0% (strongly prefers it: 24.0%)
No majority: 16.8%
Majority vote picks BLEU’s choice: 35.2% (strongly prefers it: 13.2%)
(Strong preference for X = no votes for Y)
BLEU’s Inherent Advantage
• When comparing candidate translations, workers were shown the references.
• BLEU’s choice, by definition, has high overlap with the reference.
• An annotator might judge BLEU’s choice to be ‘better’ because it ‘looks’ like the reference.
• With references shown: RYPT 46.1% vs. no preference 17.9% vs. BLEU 36.0%.
• With no references shown (and restricted to workers in Germany): RYPT 45.2% vs. no preference 25.6% vs. BLEU 29.2%.
See Paper for…
• Source-candidate alignment method, which takes advantage of the derivation trees given by Joshua (see 3.1).
• Percolation coverage and accuracy, and the effect of maxLen (see 5.1).
• Related work (see 6):
• Nießen et al. (2000): a database of judgments.
• WMT workshops: manual evaluation; metric correlation with human judgment.
• Snow et al. (2008): AMT is "fast and cheap."
Future Work
• This was a pilot study…
• Complete MERT run (already in progress):
• beyond a single iteration;
• using AMT’s API.
• Probabilistic approach to labeling nodes:
• treat a node label as a random variable;
• existing labels = observed, others inferred.
Stay tuned for our next paper