Implementing MapReduce for scalable coreference resolution, evaluating using ACE-Usenet, generating features, context analysis, and pairwise similarity for efficient resolution.
Using MapReduce for Scalable Coreference Resolution Tamer Elsayed, Doug Oard, Jimmy Lin, Asad Sayeed and Tan Xu HLT COE and UMIACS Laboratory for Computational Linguistics and Information Processing COE Quarterly Technical Exchange, June 10th 2008
COE ACE System
English Pipeline: Within-Doc Coref. → Pairs Filtering → Feature Generation → Clustering
Arabic Pipeline: Within-Doc Coref. → Feature Generation → Clustering
Features: Context Features, Conversational-Genre Features
Roadmap
• Context Features
  • Pairwise similarity
  • Efficiency vs. effectiveness
  • Generating features for ACE
• Conversational-genre Features
  • New generative model
  • Joint Resolution
• Evaluation using ACE-Usenet
Context Features

Close friends and colleagues of Cheney -- including former Gen. Brent Scowcroft, who was national security adviser when Cheney was Gerald Ford's chief of staff and George H. W. Bush's defense secretary -- have been famously quoted they just don't recognize the Cheney they served along side and the Cheney of today who repeatedly made false assertions about the Iraq war and weapons of mass destruction. Now, an article in Vanity Fair Magazine by Todd S. Purdum has published a number of strikingly similar assessments from Clinton's former confidants -- plus medically authoritative guesswork speculating about how health problems of the sort Clinton experienced can change a person. But we avoid that trash talk to focus only on the real, striking changes in the public performances of Bill Clinton and Dick Cheney today. Compared to the way they were, back when they were greatly admired by those who knew them best, back in the day. Once, Clinton and Cheney were considered consummate political performers. Now they utter gaffes and commit blunders. And they leave the lasting impression that they just don't care about what you think about it. Once, they were smart and savvy strategic forces that always seemed to boost the political fortunes of their team (Clinton with sterling public performances; Cheney with rock-steady behind-the-scenes guidance). Now they have become liabilities to their causes, grand grist for late-night monologues, caricatures on "Saturday Night Live."

It barely seems credible now but there was a time when it seemed the Democratic nomination was Hillary Clinton's for the taking. The air of certainty in January was convincing when Clinton declared from a sofa at her Washington home: "I'm in and I'm in to win." Two Democratic senators and two former governors swiftly pulled out rather than get between Clinton and White House. Then along came Barack Obama and the aura of inevitability that was crucial to Clinton's strategy vanished. "The Clinton campaign was meant to be shock and awe: big events in big states, sweep the board on Super Tuesday, overwhelm the less well-known competitors," said Chip Smith, who was deputy campaign manager for Al Gore in 2000. "Unfortunately, Obama uprooted that strategy. Inevitability isn't a viable strategy against a well-funded candidate with a powerful message." It is unclear whether there was anything Clinton could have done to stop a gifted politician such as Obama, once his early win in Iowa and prodigious fundraising ability established that he really did have a chance of winning the Democratic nomination. Clinton also may have destroyed any chance of a comeback after being caught out in her fib about coming under sniper fire while in Bosnia in the 1990s. The lie crystallised voter unease with Clinton, and held back chances of a grand comeback in Pennsylvania. In April, a Washington Post/ABC News poll found that 61% of American voters considered her dishonest and untrustworthy.
Abstract Problem
[Figure: a collection of document vectors mapped to a matrix of pairwise similarity scores]
Goal: Scalable Pairwise Similarity
• ~10K docs: ~50 million doc pairs
• ~140K entities: ~10 billion entity pairs
Solutions
• Trivial
  • Loads each vector O(N) times
  • Loads each term t O(df_t^2) times
• Better
  • Each term contributes only if it appears in both documents of a pair
  • Loads each term (with its posting list) once
  • Each term contributes O(df_t^2) partial results
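The "better" solution can be written down as a formula. The slides do not show one, so the following is a sketch that assumes similarity is an inner product of term weights, with $w_{t,d}$ (an assumed symbol) denoting the weight of term $t$ in document $d$:

```latex
\mathrm{sim}(d_i, d_j) \;=\; \sum_{t \,\in\, d_i \cap d_j} w_{t,d_i} \, w_{t,d_j}
```

Only terms shared by both documents appear in the sum, so a term with document frequency $df_t$ touches $\tfrac{1}{2}\,df_t\,(df_t - 1) = O(df_t^2)$ document pairs, which is where the per-term cost on the slide comes from.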
Indexing (3-doc toy collection)
Documents: "Clinton Obama Clinton", "Clinton Cheney", "Clinton Barack Obama"
Postings (term frequencies):
• Clinton: 1, 2, 1
• Cheney: 1
• Barack: 1
• Obama: 1, 1
Standard IR Indexing
Pairwise Similarity
(a) Generate pairs (b) Group pairs (c) Sum pairs
[Figure: toy example over the postings Clinton (1, 2, 1), Cheney (1), Barack (1), Obama (1, 1): products of term frequencies are generated per term, grouped by document pair, and summed]
Pairwise Similarity (abstract)
(a) Generate pairs: term postings → multiply
(b) Group pairs: grouping
(c) Sum pairs: sum → similarity
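The three steps (generate, group, sum) can be sketched in a few lines of plain Python. This is an illustration only, not the authors' implementation: the toy postings come from the 3-doc example, the document ids 1 to 3 are assumed because the transcript does not preserve them, and raw term frequencies stand in for term weights.

```python
from collections import defaultdict
from itertools import combinations

# Toy postings from the 3-doc example: term -> {doc_id: term frequency}.
# Document ids are assumed; the slide residue does not preserve them.
postings = {
    "Clinton": {1: 2, 2: 1, 3: 1},
    "Cheney": {2: 1},
    "Barack": {3: 1},
    "Obama": {1: 1, 3: 1},
}

def generate_pairs(postings):
    """(a) For each term, emit one partial product per document pair in its posting list."""
    for plist in postings.values():
        for (d1, w1), (d2, w2) in combinations(sorted(plist.items()), 2):
            yield (d1, d2), w1 * w2

def group_pairs(partials):
    """(b) Collect all partial products that belong to the same document pair."""
    grouped = defaultdict(list)
    for pair, value in partials:
        grouped[pair].append(value)
    return grouped

def sum_pairs(grouped):
    """(c) Add up the partial products of each document pair."""
    return {pair: sum(values) for pair, values in grouped.items()}

similarity = sum_pairs(group_pairs(generate_pairs(postings)))
```

Terms that occur in a single document (here "Cheney" and "Barack") generate no pairs at all, which is the saving over the trivial all-pairs comparison.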
MapReduce!
(a) Map: input → map
(b) Shuffle: group values by keys
(c) Reduce: reduce → output
And indexing .. of course!
(a) Map: doc → tokenize
(b) Shuffle: group values by keys
(c) Reduce: combine → posting list
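Indexing as map, shuffle, and reduce can be imitated in a single process. The sketch below is not Hadoop code: `run_mapreduce`, `index_mapper`, and `index_reducer` are hypothetical names introduced here, tokenization is a plain lowercase whitespace split, and the document ids are assumed.

```python
from collections import Counter, defaultdict

def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for map -> shuffle -> reduce."""
    shuffled = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):   # (a) map
            shuffled[out_key].append(out_value)         # (b) shuffle: group values by key
    return {k: reducer(k, vs) for k, vs in sorted(shuffled.items())}  # (c) reduce

def index_mapper(doc_id, text):
    """Tokenize one document and emit (term, (doc_id, term frequency))."""
    for term, tf in Counter(text.lower().split()).items():
        yield term, (doc_id, tf)

def index_reducer(term, postings):
    """Combine the postings of one term into a posting list sorted by doc id."""
    return sorted(postings)

docs = [
    (1, "Clinton Obama Clinton"),
    (2, "Clinton Cheney"),
    (3, "Clinton Barack Obama"),
]
index = run_mapreduce(docs, index_mapper, index_reducer)
```

The pairwise-similarity job has the same shape: only the mapper (multiply postings) and the reducer (sum) change.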
Terms: Zipfian Distribution
• Each term t contributes O(df_t^2) partial results
• Very few terms dominate the computations
  • most frequent term (“said”): 3%
  • most frequent 10 terms: 15%
  • most frequent 100 terms: 57%
  • most frequent 1000 terms: 95%
• ~0.1% of total terms (99.9% df-cut)
[Figure: doc freq (df) vs. term rank]
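A df-cut simply drops the highest-df terms before pairs are generated. The sketch below shows the effect on the number of intermediate pairs; the document frequencies are made-up toy values, not figures from the talk, and `df_cut` and `pair_count` are hypothetical helper names.

```python
def pair_count(df):
    """Number of document pairs generated by a term with document frequency df."""
    return df * (df - 1) // 2

def df_cut(doc_freq, drop_fraction):
    """Drop the given fraction of terms, taking the highest document frequencies first."""
    ranked = sorted(doc_freq, key=doc_freq.get, reverse=True)
    n_drop = round(len(ranked) * drop_fraction)
    dropped = set(ranked[:n_drop])
    kept = {t: df for t, df in doc_freq.items() if t not in dropped}
    return dropped, kept

# Toy values (assumed): one very frequent term and nine moderately frequent ones.
doc_freq = {"said": 100}
doc_freq.update({f"term{i}": 10 for i in range(1, 10)})

dropped, kept = df_cut(doc_freq, drop_fraction=0.1)
pairs_before = sum(pair_count(df) for df in doc_freq.values())
pairs_after = sum(pair_count(df) for df in kept.values())
```

In this toy collection, removing one term out of ten removes most of the intermediate pairs, the same effect a 99.9% df-cut relies on at scale.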
Efficiency (disk space)
• Aquaint-2 Collection, ~ million doc
• 8 trillion intermediate pairs
Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk
Efficiency (disk space)
• Aquaint-2 Collection, ~ million doc
• 8 trillion intermediate pairs
• 0.5 trillion intermediate pairs
Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk
Effectiveness
• Drop 0.1% of terms
• “Near-linear” growth
• Fit on disk
• Cost: 2% in effectiveness
For more details, check “Pairwise Document Similarity in Large Collections with MapReduce” at ACL 2008 (presented next week!)
Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk
In ACE!
• ~10K docs
  • each document is a vector
• ~140K entities
  • each has multiple mentions
  • each entity context is a vector
• Generated 8 feature matrices (6 English + 2 Arabic)
English Pipeline: Within-Doc Coref. → Pairs Filtering → Feature Generation → Clustering
Arabic Pipeline: Within-Doc Coref. → Feature Generation → Clustering
Roadmap
• Context Features
  • Pairwise similarity
  • Efficiency vs. effectiveness
  • Generating features for ACE
• Conversational-genre Features
  • New generative model
  • Joint Resolution
• Evaluation using ACE-Usenet
Identity Resolution in Email
Date: Wed Dec 20 08:57:00 EST 2000
From: Kay Mann <kay.mann@enron.com>
To: Mary Adams <mary.adams@enron.com>
Subject: Re: tennis tomorrow!

Did Sue want Scott to join? Looks like the game will be too late for him.

Identity resolution: who is “Sue”? I.e., label the mention with an email address.
New Generative Model
• Choose “person” c to mention: p(c)
• Choose appropriate “context” X to mention c: p(X | c)
• Choose a “mention” l: p(l | X, c)
Example: context “playing tennis”, mention “sue”
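Read as a generative story, the three choices multiply into a joint probability, and resolving a mention means inverting it with Bayes' rule. The slides do not show this equation; it is a sketch that follows directly from the three factors listed above:

```latex
p(c, X, l) = p(c)\, p(X \mid c)\, p(l \mid X, c)
\qquad\Longrightarrow\qquad
p(c \mid X, l) = \frac{p(c)\, p(X \mid c)\, p(l \mid X, c)}{\sum_{c'} p(c')\, p(X \mid c')\, p(l \mid X, c')}
```

For the observed mention "sue" in a "playing tennis" context, each candidate person $c$ is scored by this posterior and the highest-scoring candidate is taken as the resolution.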
Context
• Local Context
• Conversational Context
• Social Context
• Topical Context
Single-Mention: 2-Step Solution
(1) Identity Modeling: prior distribution
(2) Mention Resolution: evidence, posterior distribution
Improved Results
• +8.9%
• +8.6%
For more details, check “Resolving Personal Names in Email using Context Expansion” at ACL 2008 (also presented next week!)
Limitation!
• Context-Free Resolution
[Figure: mention graph linking “sjhonson@enron.com”, “Susan Scott”, “Sue”, “Sue”, “Suebob”, “Susan Jones”, and “Susan” through social, topical, and conversational edges]
• Joint Resolution!
Joint Resolution
Mention Graph: Spread Current Resolution → Combine Context Info → Update Resolution
Joint Resolution
Work in Progress!
Mention Graph: map → shuffle → reduce
MapReduce!
Roadmap
• Context Features
  • Pairwise similarity
  • Efficiency vs. effectiveness
  • Generating features for ACE
• Conversational-genre Features
  • New generative model
  • Joint Resolution
• Evaluation using ACE-Usenet
Email Message
From: Machiavegli <machia@aol.com>
To: Mark <mk@hotmail>
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election

In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War.

(receiver is an email address)
Usenet Message
From: Machiavegli <machia@aol.com>
Newsgroup: soc.history.what-if
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election

In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War.

(receiver is a newsgroup!)
ACE Usenet Document
<DOCID> soc.history.what-if_20350205910 </DOCID>
<POSTER> Machiavegli </POSTER>
<POSTDATE> 29 Jan 2005 22:04:38 GMT </POSTDATE>
<SUBJECT> The 1860 Presidential Election </SUBJECT>

In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War.

(no email addresses in headers!)
Reconstruct from automatically
From: Machiavegli <machia@aol.com>
Newsgroup: soc.history.what-if
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election

In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War.

(Got the address back!)
Handling it as @
From: Machiavegli <machia@aol.com>
To: soc.history.what-if@usenet.com
Date: 29 Jan 2005 22:04:38 GMT
Subject: The 1860 Presidential Election

In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War.

(handle group as receiver)
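Treating the newsgroup as the receiver can be sketched as a small header rewrite. The code assumes that an ACE Usenet DOCID has the form `<newsgroup>_<digits>`, which is inferred from the single example shown here; the function names are hypothetical, and the `usenet.com` domain is the one used on the slide.

```python
import re

def newsgroup_from_docid(docid):
    """Strip the trailing '_<digits>' suffix from a DOCID (format assumed from one example)."""
    match = re.fullmatch(r"\s*(?P<group>.+)_\d+\s*", docid)
    if match is None:
        raise ValueError(f"unexpected DOCID format: {docid!r}")
    return match.group("group")

def as_receiver(newsgroup, domain="usenet.com"):
    """Turn a newsgroup name into a pseudo email address usable as a 'To:' header."""
    return f"{newsgroup}@{domain}"

docid = " soc.history.what-if_20350205910 "
to_header = "To: " + as_receiver(newsgroup_from_docid(docid))
```

With the receiver expressed as an address, a Usenet post has the same header shape as an email message, so an email-oriented resolver can read it without a special case.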
Feature Value: same label
• Need for feature matrix (pairwise score)
[Figure: two entities with mentions “Steph”, “Stephan”, “Stephan”, “S. Smith”; both are labeled sjhonson@hotmail.com, so the feature value is +1.0]
Feature Value: different labels
• Need for feature matrix (pairwise score)
[Figure: two entities with mentions “Steph”, “Stephan”, “Stephan”, “S. Smith”; one is labeled sjhonson@hotmail.com and the other smith_s@aol.com, so the feature value is -1.0]
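The two cases above amount to a pairwise score over resolved labels. A minimal sketch follows; the +1.0 and -1.0 values come from the slides, while the 0.0 for an entity with no resolved label, the case-insensitive comparison, and the entity ids `e1` to `e3` are assumptions added for illustration.

```python
from itertools import combinations

def label_feature(label_a, label_b):
    """+1.0 if two entities resolve to the same address, -1.0 if to different ones."""
    if label_a is None or label_b is None:
        return 0.0  # assumed neutral value when an entity has no resolved label
    return 1.0 if label_a.lower() == label_b.lower() else -1.0

def feature_matrix(entity_labels):
    """Sparse pairwise matrix: (entity_a, entity_b) -> feature value."""
    return {
        (a, b): label_feature(entity_labels[a], entity_labels[b])
        for a, b in combinations(sorted(entity_labels), 2)
    }

# Hypothetical entity ids; the addresses are the ones shown on the slides.
entity_labels = {
    "e1": "sjhonson@hotmail.com",
    "e2": "sjhonson@hotmail.com",
    "e3": "smith_s@aol.com",
}
matrix = feature_matrix(entity_labels)
```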
Conclusion
• MapReduce can be applied to many HLT applications
  • easy, cheap, and fast for distributed processing
  • e.g., scalable pairwise similarity for coreference resolution
  • calls for new ways of thinking
• Identity resolution in email
  • new generative model yields improved accuracy
  • scalable joint resolution needed
• Usenet-ACE is a new test collection
Thank You!
MapReduce and Text Analysis
• Computing pairwise similarity in large collections
• Joint resolution of mentions in email collections
• Search engines (of course!)
• Building language models
• Clustering applications
• Machine translation
• …