Feasibility of Human-in-the-loop Minimum Error Rate Training
Omar F. Zaidan, Chris Callison-Burch
The Center for Language and Speech Processing, Johns Hopkins University
EMNLP 2009 – Singapore, Thursday August 6th, 2009
{ ozaidan | ccb } @ cs.jhu.edu
CCB ’09: "quixotic things like human-in-the-loop minimum error rate training"
quixotic: foolishly impractical especially in the pursuit of ideals; especially: marked by rash lofty romantic ideas or extravagantly chivalrous action
Log-linear MT in One Slide
• MT systems rely on several models.
• A candidate c is represented as a feature vector: h(c) = ⟨h1(c), …, hM(c)⟩
• Corresponding weight vector: λ = ⟨λ1, …, λM⟩
• Each candidate is assigned a score: score(c) = λ · h(c) = Σm λm hm(c)
• System selects the highest-scoring translation: ĉ = argmaxc score(c)
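A minimal sketch of the scoring and selection steps, with hypothetical feature values and feature names (not any particular decoder's internals):

```python
# Log-linear candidate selection: score = weight vector . feature vector.

def score(weights, features):
    """Dot product of the weight vector and a candidate's feature vector."""
    return sum(w * h for w, h in zip(weights, features))

def select_best(weights, candidates):
    """Return the highest-scoring candidate translation."""
    return max(candidates, key=lambda c: score(weights, c["features"]))

# Hypothetical candidates with (e.g.) LM, TM, and word-penalty features:
candidates = [
    {"text": "the patient was isolated .", "features": [-12.1, -4.3, 5.0]},
    {"text": "the patient isolated .",     "features": [-13.5, -3.9, 4.0]},
]
weights = [1.0, 0.8, -0.2]
print(select_best(weights, candidates)["text"])
```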
Minimum Error Rate Training
• Och (2003): the weight vector should be chosen by optimizing directly toward the evaluation metric of interest (the MERT phase).
• But the error surface is ugly: piecewise constant, with many local optima.
• Och suggests an efficient line optimization method…
Visualizing Och’s Method
[Animated figure: "We want to plot this" – the error metric as a function of a single weight along a search direction, computed from TER-like sufficient statistics cached per candidate. With an automatic metric, each evaluation along the line is fast. Fast! … Fast?]
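A simplified sketch of the idea behind the line optimization (not the exact upper-envelope algorithm of Och 2003): along a search direction, each candidate's score is linear in the step size gamma, so the induced 1-best outputs, and hence the corpus-level metric computed from cached sufficient statistics, can be evaluated cheaply for many gamma values. All data structures here are hypothetical.

```python
# Sweep gamma along a search direction; each candidate c has a linear
# score a + gamma * b and cached metric sufficient statistics "stats".

def line_sweep(sentences, gammas, corpus_error):
    """sentences: list of candidate lists; corpus_error: maps summed
    sufficient statistics to a corpus-level error. Returns best gamma."""
    best_gamma, best_err = None, float("inf")
    for g in gammas:
        # 1-best candidate per sentence at this gamma (cheap: linear scores)
        tops = [max(cands, key=lambda c: c["a"] + g * c["b"])
                for cands in sentences]
        # Sum cached sufficient statistics instead of re-scoring outputs
        totals = [sum(t["stats"][i] for t in tops)
                  for i in range(len(tops[0]["stats"]))]
        err = corpus_error(totals)
        if err < best_err:
            best_gamma, best_err = g, err
    return best_gamma, best_err
```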
BLEU & MERT • The metric most often optimized is BLEU: • Why BLEU? • Usually the reported metric, • it has been shown to correlate well with human judgment, and • it can be computed efficiently.
Problems with BLEU MERT
• General critiques of BLEU:
• Chiang et al. (2008): weaknesses in BLEU.
• Callison-Burch et al. (2006): not always appropriate to use BLEU to compare systems.
• Metric disparity:
• Actual evaluations have a human component (e.g. GALE uses H-TER).
• What is the alternative? H-TER MERT?
H-TER MERT?
• In theory, MERT is applicable to any metric.
• In practice, scoring 1000’s of candidate translations with H-TER is expensive.
• H-TER cost estimate:
• Assume a sentence takes 10 seconds to post-edit, at a cost of $0.10.
• 100 candidates for each of 1,000 source sentences → 35 work days and $10,000 per iteration(!)
• vs. BLEU: minutes per iteration (and free).
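The estimate follows directly from the stated assumptions; a quick back-of-the-envelope check:

```python
# H-TER cost estimate from the slide's assumptions.
seconds_per_edit, cost_per_edit = 10, 0.10
n_candidates, n_sentences = 100, 1000

edits = n_candidates * n_sentences          # 100,000 post-edits
hours = edits * seconds_per_edit / 3600     # ~278 hours
work_days = hours / 8                       # ~35 eight-hour work days
dollars = edits * cost_per_edit             # $10,000
print(f"{work_days:.0f} work days, ${dollars:,.0f} per MERT iteration")
```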
A Human-Based Automatic Metric
• We suggest a metric that is:
• viable to be used in MERT, yet
• based on human judgment.
• Viability: relies on a prebuilt database; no human involvement during MERT.
• Human-based: the database is a repository of human judgments.
Our Metric: RYPT
• Main idea: reward syntactic constituents in the source that are aligned to "acceptable" substrings in the candidate translation.
• When scoring a candidate:
• Obtain a parse tree for the source sentence.
• Align source words to candidate words.
• Count the number of subtrees translated in an "acceptable" manner.
• RYPT = Ratio of Yes nodes in the Parse Tree.
RYPT (Ratio of Y in Parse Tree)
[Animated figure: a source parse tree aligned to the candidate translation being scored. A Y label on a node indicates an acceptable translation of that constituent, e.g. forecasts deemed an acceptable translation of prognosen; the Y/N labels over all nodes determine the score.]
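A minimal sketch of the RYPT computation over a labeled parse tree; the node representation (label, children) is hypothetical, not the authors' data structure:

```python
# RYPT = fraction of parse-tree nodes labeled "Y" (acceptable).

def count_labels(node):
    """Return (Y-labeled nodes, all nodes) in the subtree rooted here."""
    yes = 1 if node["label"] == "Y" else 0
    total = 1
    for child in node.get("children", []):
        y, t = count_labels(child)
        yes += y
        total += t
    return yes, total

def rypt(root):
    yes, total = count_labels(root)
    return yes / total

# Toy tree: 2 of 3 nodes labeled Y -> RYPT = 0.667
tree = {"label": "Y", "children": [
    {"label": "Y", "children": []},
    {"label": "N", "children": []},
]}
print(rypt(tree))
```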
Is RYPT Good?
• Is RYPT acceptable? Must show, empirically, that RYPT is a reasonable substitute for human judgment.
• Is RYPT feasible? Must show collecting the necessary judgments is efficient and affordable. ← Next…
Feasibility: Reusing Judgments
• For each source sentence, we build a database, where each entry is a tuple: <source substring, candidate substring, judgment>
• A judgment is reused across candidates, e.g. <der patient, the patient, YES> applies to all of:
der patient wurde isoliert .
→ the patient was isolated .
→ the patient isolated .
→ the patient was in isolation .
→ the patient has been isolated .
Feasibility: Reusing Judgments
• Likewise, <der patient, of the patient, NO> is reused across candidates:
der patient wurde isoliert .
→ of the patient was isolated .
→ of the patient isolated .
→ of the patient was in isolation .
→ of the patient has been isolated .
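A sketch of how such a database enables reuse, assuming a simple keyed lookup (the actual storage format is not specified in the talk):

```python
# One database per source sentence, keyed by
# (source substring, candidate substring) -> "Y" or "N".
judgments = {}

def get_judgment(src_sub, cand_sub):
    """Return a cached human judgment, or None if a worker must be asked."""
    return judgments.get((src_sub, cand_sub))

judgments[("der patient", "the patient")] = "Y"
# Reused for "the patient was isolated .", "the patient isolated .", etc.
assert get_judgment("der patient", "the patient") == "Y"
assert get_judgment("der patient", "of the patient") is None  # must query
```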
Feasibility: Label Percolation
• Minimize label collection even further by percolating labels through the parse tree:
• If a node is labeled NO, its ancestors are likely labeled NO → percolate NO up the tree.
• If a node is labeled YES, its descendants are likely labeled YES → percolate YES down the tree.
[Figure: a parse tree where a Y judgment spreads to the nodes below it and an N judgment spreads to the nodes above it.]
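A sketch of the two percolation rules under a hypothetical tree representation (the paper's exact bookkeeping may differ):

```python
# YES percolates down to descendants; NO percolates up to ancestors.

def percolate_yes_down(node):
    """Label this node and all of its descendants Y."""
    node["label"] = "Y"
    for child in node.get("children", []):
        percolate_yes_down(child)

def percolate_no_up(path_to_node):
    """path_to_node: nodes from the root down to the judged node.
    Label the judged node and all of its ancestors N."""
    for node in path_to_node:
        node["label"] = "N"

def apply_judgment(judgment, path_to_node):
    if judgment == "Y":
        percolate_yes_down(path_to_node[-1])
    else:
        percolate_no_up(path_to_node)
```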
Maximizing Label Percolation
• Queries are performed in batch mode.
• For maximum percolation, queries should avoid overlapping substrings.
• One extreme: select only the root node. (A YES for the whole sentence essentially never happens…)
• Other extreme: select all preterminals. (Too much focus on individual words; no percolation.)
Query Selection
• Middle ground: query a frontier node set, where each node covers at most maxLen source words.
[Animated figure: descending from the root, frontier nodes are selected so that the chosen constituents cover the sentence without overlapping.]
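A sketch of this middle-ground selection, assuming each node knows the length of the source span it covers (field names are hypothetical):

```python
# Select the highest nodes whose source span is at most max_len words,
# yielding a frontier node set with no overlapping substrings.

def frontier(node, max_len):
    if node["span"] <= max_len or not node.get("children"):
        return [node]              # query this whole constituent
    nodes = []                     # too long: recurse into children
    for child in node["children"]:
        nodes.extend(frontier(child, max_len))
    return nodes
```

With max_len = 1 this degenerates to querying all preterminals; with a very large max_len it degenerates to querying the root, matching the two extremes above.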
Amazon Mechanical Turk
• We use Amazon Mechanical Turk (AMT) to collect judgment labels.
• AMT: a virtual marketplace that allows "requesters" to create and post tasks to be completed by "workers" around the world.
• The requester provides an HTML template and a CSV database.
• AMT creates individual tasks for workers.
• Task = Human Intelligence Task = HIT
HIT Example
Source: prozent
Reference: percent
Candidate translations: % | per cent
HIT Example
Source: des zentralen statistischen amtes
Reference: statistics office
Candidate translations: data from the central statistical office | from the central statistics office | in the central statistical office | in the central statistics office | of central statistical office | of central statistics office | of the central statistical office | of the central statistics office
Data Summary
• 3,873 HITs created, each with 3.4 judgments on average → 13k labels.
• 115 distinct workers put in 30.8 hours.
• One label per 8.4 seconds (426 labels/hr); hourly ‘wage’: $1.95.
• Cost: $53.47 wages + $6.54 bonuses + $21.43 Amazon fees = $81.44 → 161 labels per $.
Is RYPT Good?
• Is RYPT acceptable? Must show RYPT is a reasonable substitute for human judgment. ← Next…
• Is RYPT feasible? Must show collecting the necessary judgments is efficient and affordable. ← Yes!
Is RYPT Acceptable?
• Is RYPT a reasonable alternative to human judgment?
• Our experiment: compare the predictive power of RYPT vs. BLEU.
• Compare the top-1 candidate by BLEU score vs. the top-1 candidate by RYPT score.
• Which candidate looks better to a human?
RYPT vs. BLEU
[Figure: the candidate list (cand 1 … cand 7 …) ranked by RYPT and by BLEU; RYPT’s choice (e.g. cand 5) vs. BLEU’s choice (e.g. cand 3).]
• Which one would be preferred by a human?
• Ask a Turker! Actually, ask 3 Turkers…
• 3 judgments × 250 sentence pairs = 750 judgments
RYPT vs. BLEU
• RYPT’s choice is preferred 46.1% of the time, vs. 36.0% for BLEU’s choice.
• Majority vote breakdown:
Majority vote picks RYPT’s choice: 48.0% (strongly prefers it: 24.0%)
No majority: 16.8%
Majority vote picks BLEU’s choice: 35.2% (strongly prefers it: 13.2%)
(Strong preference for X = no votes for Y)
BLEU’s Inherent Advantage
• When comparing candidate translations, workers were shown the references.
• BLEU’s choice, by definition, has high overlap with the reference.
• An annotator might judge BLEU’s choice to be ‘better’ because it ‘looks’ like the reference.
• With references shown: RYPT 46.1% vs. no preference 17.9% vs. BLEU 36.0%.
• With no references shown (and restricted to workers in Germany): RYPT 45.2% vs. no preference 25.6% vs. BLEU 29.2%.
See Paper for…
• Source-candidate alignment method, which takes advantage of the derivation trees given by Joshua (see 3.1).
• Percolation coverage and accuracy, and the effect of maxLen (see 5.1).
• Related work (see 6):
• Nießen et al. (2000): a database of judgments.
• WMT workshops: manual evaluation; metric correlation with human judgment.
• Snow et al. (2008): AMT is "fast and cheap."
Future Work
• This was a pilot study…
• Complete MERT run (already in progress):
• beyond a single iteration;
• using AMT’s API.
• Probabilistic approach to labeling nodes:
• treat a node label as a random variable;
• existing labels = observed, others inferred.
Stay tuned for our next paper