
Undergrad Status Report: August 1, 2003 John Blatz & Alex Kulesza






Presentation Transcript


  1. Undergrad Status Report: August 1, 2003 John Blatz & Alex Kulesza

  2. Presentation Overview
  • Project Summary
  • Semantic Similarity as a Confidence Feature
  • Dealing with 100GB of Data
  • Human MT Evaluations

  3. Project Summary
  • Goal: estimate probabilities of correctness for MT outputs
  • Approach: apply machine learning techniques to confidence features
  • The relevant features differ from those used by the MT system itself

  4. Project Summary
  • Major tasks:
    • Develop a set of relevant features, extracted from the source sentence, target hypothesis, and potentially the base MT system
    • Apply machine learning approaches to these features and tune to maximize CE performance
    • Additionally, justify the use of an automatic evaluation metric for “correctness”

  5. Presentation Overview
  • Project Summary
  • Semantic Similarity as a Confidence Feature
  • Dealing with 100GB of Data
  • Human MT Evaluations

  6. Semantic Similarity for CE
  • Goal: exploit knowledge about relationships between word meanings to better match human judgements
  • We would like to be able to identify sentence pairs that “mean the same thing”:
    • “The accomplishments of economic construction in China’s fourteen open-border municipalities is remarkable.”
    • “China’s 14 open border cities economic development has been remarkably successful.”
  • Semantic distance is difficult to quantify: how do you compare “apples” with “oranges”?

  7. Word-level Semantic Similarity: Approaches
  • Word-similarity metrics used in contextual word sense disambiguation
  • Semantic network approaches: WordNet
    • Edge counting (Rada et al.)
    • Weighted path length (Hirst & St. Onge)
  • Information content approaches using corpus statistics
    • Lowest super-ordinate (Resnik)
    • Combination of edge counts and information content (Jiang & Conrath)
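For illustration only, the sketch below uses NLTK's WordNet interface (an assumption; it is not the toolkit used in this project) to score a noun pair with path-based, Resnik, and Jiang & Conrath similarity:

    # Illustrative sketch using NLTK's WordNet interface (not the tool used here).
    # Requires: nltk.download('wordnet'); nltk.download('wordnet_ic')
    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    brown_ic = wordnet_ic.ic('ic-brown.dat')       # corpus counts for information content
    city = wn.synset('city.n.01')
    town = wn.synset('municipality.n.01')

    print(city.path_similarity(town))              # edge counting in the IS-A hierarchy (Rada et al.)
    print(city.res_similarity(town, brown_ic))     # information content of lowest super-ordinate (Resnik)
    print(city.jcn_similarity(town, brown_ic))     # edge counts combined with information content (Jiang & Conrath)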

  8. Word-level Semantic Similarity: Issues
  • Problems with the above approaches:
    • Restriction to the IS-A hierarchy means only nouns can be compared
    • Part of speech can vary over valid translations
    • Strong bias towards identity
    • Slow
  • Alternative: dictionary-based approach using WordNet glosses [Banerjee & Pedersen 2002]

  9. Dictionary-Based Word Similarity: Algorithm
  • Similarity = degree of overlap between glosses
  • Weight of an overlap is proportional to the square of its length
  • Scores are normalized by gloss length
  • Glosses are also compared for words related to each target word (e.g. hypernyms, hyponyms, holonyms, etc.)
  • Final similarity of two word senses is a weighted sum over relation pairs
  • Similarity between words = maximum similarity over their senses
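A minimal sketch of the core overlap scoring (the actual implementation is the Perl script shown on the next slide; the names below are hypothetical, and the related-word expansion and weighted sum over relation pairs are omitted):

    # Hypothetical sketch of the gloss-overlap score: repeatedly pull out the longest
    # shared contiguous word sequence, weight it by the square of its length, and
    # normalize by the combined gloss length.
    def gloss_overlap_score(gloss1, gloss2):
        a, b = gloss1.lower().split(), gloss2.lower().split()
        total_len = len(a) + len(b)
        score = 0
        while True:
            best = None                          # (length, start in a, start in b)
            for i in range(len(a)):
                for j in range(len(b)):
                    k = 0
                    while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
                        k += 1
                    if k > 0 and (best is None or k > best[0]):
                        best = (k, i, j)
            if best is None:
                break
            k, i, j = best
            score += k * k                       # weight proportional to square of overlap length
            del a[i:i + k]                       # remove the matched span so it is not reused
            del b[j:j + k]
        return score / max(total_len, 1)         # normalize by gloss length

    # Illustrative, made-up glosses:
    print(gloss_overlap_score("high quality porcelain made of fine clay",
                              "ware made of fine clay fired at high temperature"))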

  10. Dictionary-Based Word Similarity: Output

    s38:~/sim/test> perl scoretest.pl china chinese
    (china, chinese)
    (china#n, chinese#n)
    (china#n#1, chinese#n#1): score 0.0830807267196561
    (china#n#1, chinese#n#2): score 0.60600285617774
    (china#n#2, chinese#n#1): score 0.135138184584178
    (china#n#2, chinese#n#2): score 0.266995506535948
    (china#n#3, chinese#n#1): score 0.142382054673721
    (china#n#3, chinese#n#2): score 1.31397266313933
    (china#n#4, chinese#n#1): score 0.137027374470659
    (china#n#4, chinese#n#2): score 0.232544191919192
    (china#n, chinese#a)
    (china#n#1, chinese#a#1): score 0.0748615531590043
    (china#n#1, chinese#a#2): score 0.089509428311705
    (china#n#2, chinese#a#1): score 0.176181917211329
    (china#n#2, chinese#a#2): score 0.0528322440087146
    (china#n#3, chinese#a#1): score 1.07827160493827
    (china#n#3, chinese#a#2): score 1.09197530864198
    (china#n#4, chinese#a#1): score 0.128265107212476
    (china#n#4, chinese#a#2): score 0.0450779727095517
    score: 1.31397266313933
    15.37u 0.75s 0:18.24 88.3%

  11. Dictionary-Based Word Similarity: Output
  • “economic” <-> “economic”: 4.048
  • “construction” <-> “development”: 0.746
  • “cities” <-> “municipalities”: 1.963
  • “accomplishments” <-> “successful”: 0.085
  • “remarkable” <-> “remarkably”: 0.492
  • “China” <-> “Chinese”: 1.314
  • “accomplishments” <-> “achievements”: 2.142
  • “apples” <-> “oranges”: 0.338

  12. Sentence-level Semantic Similarity: Approaches
  • Compare words in translation outputs aligned to the same source word, ignoring non-content words
  • High scores should be reserved for sentence pairs in which all aligned words are similar
  • A linear average over source words is too biased in favor of exact matches and not biased enough against zero-similarity word pairs
  • Possible fixes:
    • Log-linear average
    • Thresholding
    • Harmonic mean (average distance = 1/similarity)
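For illustration, a small sketch contrasting the linear baseline with the log-linear and harmonic-mean alternatives (hypothetical function; the per-word similarities are assumed to come from the aligned content words):

    import math

    # Hypothetical sketch: combine the per-word similarity scores of one hypothesis.
    def combine(sims, eps=1e-6):
        n = len(sims)
        linear = sum(sims) / n                                          # baseline: linear average
        loglinear = math.exp(sum(math.log(s + eps) for s in sims) / n)  # geometric mean; zeros are punished
        harmonic = n / sum(1.0 / (s + eps) for s in sims)               # average distance = 1/similarity
        return linear, loglinear, harmonic

    # Rounded word scores from slide 11 for the example sentence pair:
    print(combine([4.048, 0.746, 1.963, 0.085, 0.492, 1.314]))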

  13. Sentence-level Semantic Similarity: Outputs

    x17:~/sim> perl nbestsim.pl debug.nbest.gz
    Comparing to: china 's 14 open border cities marked economic achievements
    #1: china 's 14 open border cities marked economic achievements [score 1.44419355017311]
    #2: china 's 14 open border cities achievements remarkable [score 1.14704148074796]
    #3: china 's 14 open border cities building remarkable achievements [score 0.940209528271938]
    #4: china 's 14 open border cities , remarkable achievements [score 0.918073109130525]
    #5: china 's 14 open border cities construction remarkable achievements [score 0.939190065215347]
    #6: china 's 14 open border cities achievements marked [score 1.28994034511384]
    #7: china 's 14 open border cities achievements significant [score 1.1481214046316]
    #8: china 's 14 open border cities economic achievements remarkable [score 1.30129468580722]
    #9: china 's 14 open border cities economic remarkable achievements [score 1.08304969019374]
    #10: china 's 14 open border cities economic construction remarkable achievements [score 1.07232631418979]
    #11: china 's 14 open border cities significant achievement in economic construction [score 1.02599875824328]
    #12: china 's 14 open border cities achievements significantly [score 1.135364189977]
    #13: china 's 14 open border cities economic achievements marked [score 1.44419355017311]
    #14: china 's 14 open border cities significant economic achievements [score 1.30237460969086]
    #15: china 's 14 open border cities economic achievements significant [score 1.30237460969086]
    #16: china 14 open border cities marked economic achievements [score 1.28053447962537]
    #17: china 14 open border cities achievements remarkable [score 0.983382410200229]
    #18: china 14 open border cities building remarkable achievements [score 0.776550457724207]

  14. Semantic Similarity Confidence Features
  • Similarity to a “good” translation: for now, average similarity to the top 3 hypotheses
    • Using the log-linear average of word similarities
  • Number of word similarities > 1
  • Harmonic mean of word similarities
  • Cache size
    • Word similarity queries are cached to improve speed
    • Size of cache ≈ # words in nbest ≈ semantic complexity

  15. Future Semantic Features
  • Intrasentential semantic coherence
  • Semantic coherence of the nbest list as a whole
  • Comparison of each sentence to a different archetype:
    • Center hypothesis
    • Combination of the most probable translation for each source word
  • Bag o’ words vector space similarity

  16. Presentation Overview
  • Project Summary
  • Semantic Similarity as a Confidence Feature
  • Dealing with 100GB of Data
  • Human MT Evaluations

  17. Dealing with 100GB of Data
  • 5,700 source sentences, each with ~16,000 translations ≈ 100,000,000 examples
  • Why do we need so much data?
    • Sparseness of source sentences
    • The oracle shows increased performance (~10%) going from 1,000-best to 16,000-best
  • Raw data size nears 100GB, which is our disk limit

  18. Dealing with 100GB of Data
  • Obvious solution: compression
  • However, data must now be decompressed from disk each time the training algorithms need it, so speed is also critical
  • Implementing C++ file readers for various compression formats under a uniform interface allowed a direct comparison
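The readers themselves were written in C++; the sketch below shows the same uniform-interface idea in Python (illustrative only, not the actual code):

    import bz2
    import gzip

    # Illustrative Python version of the uniform reader interface; the real readers
    # were C++ classes so that decompression speed could be compared directly.
    def open_examples(path):
        """Return a line iterator regardless of how the file is compressed."""
        if path.endswith('.gz'):
            return gzip.open(path, 'rt')
        if path.endswith('.bz2'):
            return bz2.open(path, 'rt')
        return open(path, 'r')

    # Training code then reads plain, gzip, or bzip2 data identically, e.g.:
    #   for line in open_examples('examples.feat.gz'):
    #       ...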

  19. Switching to gzip brings the anticipated training time down to under one day (hopefully).

  20. Dealing with 100GB of Data
  • Gradient descent training typically requires stochastic example presentation in order to be effective
  • This randomization is non-trivial when the example set won’t fit in memory and reading from disk is expensive

  21. Dealing with 100GB of Data

  22. Dealing with 100GB of Data
  • The cache is loaded sequentially and unloaded randomly
  • The expected position of each example is in-order…
  • …but the standard deviation is linear in the size of the cache
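A sketch of this cache (hypothetical names, not the actual training code): examples stream in from disk in order, and each training example is drawn from a random cache slot, so the presentation order is approximately random without holding the data set in memory.

    import random

    # Hypothetical sketch of the sequential-load / random-unload cache.
    def shuffled_stream(examples, cache_size=100000):
        cache = []
        for ex in examples:                        # cache is loaded sequentially from disk
            cache.append(ex)
            if len(cache) == cache_size:
                i = random.randrange(cache_size)   # ...and unloaded randomly
                cache[i], cache[-1] = cache[-1], cache[i]
                yield cache.pop()
        random.shuffle(cache)                      # drain whatever remains at the end of the pass
        yield from cache

Each example's expected output position matches its input position, while the spread around that position grows with the cache size, which is the trade-off noted above.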


  25. Presentation Overview
  • Project Summary
  • Semantic Similarity as a Confidence Feature
  • Dealing with 100GB of Data
  • Human MT Evaluations

  26. Human MT Evaluation
  • Motivation: we ought to justify (or at least evaluate) our choice of target metric (WER, BLEU, NIST, etc.)
  • We developed a server/multi-client evaluation system to dynamically distribute sentences, ensuring each receives two independent votes
  • Users can start/stop as they please, from any terminal, and we receive data in real time
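A minimal sketch of the server's assignment policy (hypothetical class; the actual system also handles networking, help text, and real-time logging): each sentence is handed out until it has two votes, and never twice to the same user.

    # Hypothetical sketch of the vote-assignment policy used by the eval server.
    class EvalServer:
        def __init__(self, sentence_ids, votes_needed=2):
            self.needed = votes_needed
            self.votes = {sid: [] for sid in sentence_ids}     # sid -> list of (uid, rating)

        def next_sentence(self, uid):
            """Pick a sentence this user has not yet rated and that still needs votes."""
            for sid, votes in self.votes.items():
                if len(votes) < self.needed and all(u != uid for u, _ in votes):
                    return sid
            return None                                        # nothing left for this user

        def record_vote(self, uid, sid, rating):
            assert 1 <= rating <= 5                            # the 1-5 scale shown on the next slide
            self.votes[sid].append((uid, rating))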

  27. Human MT Evaluation

    ************************************************************************
                            Human MT Eval Client
    ************************************************************************
    Please rate the quality of a given hypothesis translation with respect
    to the reference on a scale from 1 to 5 as follows:

      Reference ex: bob walked the dog.
      1: Useless; captures absolutely none of the reference's meaning.
         ex: franklin is a doctor.
      2: Poor; contains a few key words, but little or no meaning.
         ex: dog banana walk.
      3: Mediocre; contains some meaning, but with serious errors.
         ex: the dog walked bob.
      4: Acceptable; captures most of the meaning with only small errors.
         ex: bob walk the dog.
      5: Human quality; captures all of the reference's meaning.
         ex: bob took the dog for a walk.

    Press return to continue...

  28. Human MT Evaluation

    ************************************************************************
                            Human MT Eval Client
    ************************************************************************
    Hypothesis: he said : " we should really street in order to improve the
    law @-@ abiding citizens went to security in the streets in a peaceful
    life , do not fear being attacked . "

    Reference: " we really have to tackle the problem of street @-@ crime .
    law @-@ abiding citizens want to feel safe when they walk on the street .
    they want a peaceful life and untroubled by attack , " he said .

    Enter your rating (1-5), 'h' for help, or 'q' to quit:

  29. Human MT Evaluation

    ************************************************************************
                            Human MT Eval Server
    ************************************************************************
    Running: 1196 votes recorded (69.4777 votes/user-hour)
             583/98336 sentences evaluated

    uid        votes   total time
    weiwei        12   0h  9m 57s
    edrabek      102   0h 43m 27s
    jason          2   0h  1m 28s
    viren         11   0h  3m 53s
    dengyg         0   0h  0m  0s
    keng          24   0h 13m  9s
    kyamada        2   0h  3m 18s
    skumar         7   0h  7m 34s
    libin         36   0h 19m 21s
    kmach          5   0h  5m  4s
    holub          6   0h  4m 13s
    cuijia        12   0h  6m 14s
    fraser        47   0h 26m  0s
    dguthrie      20   0h 20m 37s
    hari           2   0h  3m 35s
    louise        12   0h  9m 49s
    hamish        21   0h 27m 53s
    cimartin      21   0h  7m 39s
    zhenjin        4   0h  3m 17s
    och           60   0h 53m 15s
    erin          67   0h 38m  9s
    anoop         36   0h 20m  2s
    foster        39   0h 39m 38s
    kulesza      139   0h 55m 54s
    asanchis      65   3h 57m 22s
    simona       112   2h  0m 13s
    ueffing      152   1h 19m 41s
    jblatz       113   1h 45m 46s
    cyril         67   1h  6m 23s

    (q)uit :

