Using MapReduce for Scalable Coreference Resolution


Presentation Transcript


  1. Using MapReduce for Scalable Coreference Resolution. Tamer Elsayed, Doug Oard, Jimmy Lin, Asad Sayeed, and Tan Xu. HLT COE and UMIACS Laboratory for Computational Linguistics and Information Processing. COE Quarterly Technical Exchange, June 10th 2008

  2. COE ACE System. English pipeline: within-document coreference, pairs filtering, feature generation, clustering. Feature types: context features; conversational-genre features. Arabic pipeline: within-document coreference, feature generation, clustering.

  3. Roadmap • Context Features • Pairwise similarity • Efficiency vs. effectiveness • Generating features for ACE • Conversational-genre Features • New generative model • Joint Resolution • Evaluation using ACE-Usenet

  4. Context Features

Close friends and colleagues of Cheney -- including former Gen. Brent Scowcroft, who was national security adviser when Cheney was Gerald Ford's chief of staff and George H. W. Bush's defense secretary -- have been famously quoted they just don't recognize the Cheney they served along side and the Cheney of today who repeatedly made false assertions about the Iraq war and weapons of mass destruction. Now, an article in Vanity Fair Magazine by Todd S. Purdum has published a number of strikingly similar assessments from Clinton's former confidants -- plus medically authoritative guesswork speculating about how health problems of the sort Clinton experienced can change a person. But we avoid that trash talk to focus only on the real, striking changes in the public performances of Bill Clinton and Dick Cheney today. Compared to the way they were, back when they were greatly admired by those who knew them best, back in the day. Once, Clinton and Cheney were considered consummate political performers. Now they utter gaffes and commit blunders. And they leave the lasting impression that they just don't care about what you think about it. Once, they were smart and savvy strategic forces that always seemed to boost the political fortunes of their team (Clinton with sterling public performances; Cheney with rock-steady behind-the-scenes guidance). Now they have become liabilities to their causes, grand grist for late-night monologues, caricatures on "Saturday Night Live."

It barely seems credible now but there was a time when it seemed the Democratic nomination was Hillary Clinton's for the taking. The air of certainty in January was convincing when Clinton declared from a sofa at her Washington home: "I'm in and I'm in to win." Two Democratic senators and two former governors swiftly pulled out rather than get between Clinton and White House. Then along came Barack Obama and the aura of inevitability that was crucial to Clinton's strategy vanished.

"The Clinton campaign was meant to be shock and awe: big events in big states, sweep the board on Super Tuesday, overwhelm the less well-known competitors," said Chip Smith, who was deputy campaign manager for Al Gore in 2000. "Unfortunately, Obama uprooted that strategy. Inevitability isn't a viable strategy against a well-funded candidate with a powerful message." It is unclear whether there was anything Clinton could have done to stop a gifted politician such as Obama, once his early win in Iowa and prodigious fundraising ability established that he really did have a chance of winning the Democratic nomination. Clinton also may have destroyed any chance of a comeback after being caught out in her fib about coming under sniper fire while in Bosnia in the 1990s. The lie crystallised voter unease with Clinton, and held back chances of a grand comeback in Pennsylvania. In April, a Washington Post/ABC News poll found that 61% of American voters considered her dishonest and untrustworthy.

  5. Abstract Problem. [Figure: a set of documents and a matrix of pairwise similarity scores.] Goal: scalable pairwise similarity. ~10K docs → ~50 million doc pairs; ~140K entities → ~10 billion entity pairs.
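The pair counts on this slide follow from counting unordered pairs, n(n - 1)/2. A quick check of the arithmetic (the function name is just for illustration):

```python
from math import comb


def unordered_pairs(n: int) -> int:
    """Number of distinct unordered pairs among n items: n * (n - 1) / 2."""
    return comb(n, 2)


doc_pairs = unordered_pairs(10_000)      # about 50 million
entity_pairs = unordered_pairs(140_000)  # about 10 billion

print(f"{doc_pairs:,} document pairs")
print(f"{entity_pairs:,} entity pairs")
```

The quadratic growth is what makes the naive all-pairs comparison impractical at the entity level.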

  6. Solutions • Trivial • Loads each vector O(N) times • Loads each term t O(df_t^2) times • Better • Each term contributes only if it appears in both documents of a pair • Loads each term (with its posting list) once • Each term contributes O(df_t^2) partial results

  7. Indexing (3-doc toy collection). Documents: "Clinton Obama Clinton", "Clinton Cheney", "Clinton Barack Obama". Standard IR indexing produces one posting list per term: Clinton: 1, 2, 1; Cheney: 1; Barack: 1; Obama: 1, 1.
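As a rough illustration of the indexing step, the sketch below builds term-frequency posting lists for a three-document collection. The document texts are reconstructed from the slide and the document ids (`d1`, `d2`, `d3`) and function name are assumptions made for this example, so the posting order may differ from the original figure.

```python
from collections import Counter, defaultdict

# Assumed toy collection; the ids d1..d3 are hypothetical labels.
docs = {
    "d1": "Clinton Obama Clinton",
    "d2": "Clinton Cheney",
    "d3": "Clinton Barack Obama",
}


def build_index(collection):
    """Map each term to a posting list of (doc_id, term_frequency) entries."""
    index = defaultdict(list)
    for doc_id in sorted(collection):
        term_counts = Counter(collection[doc_id].split())
        for term, tf in sorted(term_counts.items()):
            index[term].append((doc_id, tf))
    return dict(index)


index = build_index(docs)
for term, postings in sorted(index.items()):
    print(term, postings)
```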

  8. Pairwise Similarity: (a) generate pairs, (b) group pairs, (c) sum pairs. Posting lists: Clinton: 1, 2, 1; Cheney: 1; Barack: 1; Obama: 1, 1. [Figure: partial products generated from each posting list, grouped by document pair, and summed.]

  9. Pairwise Similarity (abstract): (a) generate pairs, (b) group pairs, (c) sum pairs. Data flow: term postings → multiply → grouping → sum → similarity.
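The three steps can be sketched in plain Python as follows. This is a minimal single-machine illustration, assuming term-frequency weights and an inner-product similarity; the posting lists and document ids are hypothetical values modeled on the toy collection.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical postings: term -> [(doc_id, term_frequency), ...]
postings = {
    "Clinton": [("d1", 2), ("d2", 1), ("d3", 1)],
    "Cheney": [("d2", 1)],
    "Barack": [("d3", 1)],
    "Obama": [("d1", 1), ("d3", 1)],
}


def pairwise_similarity(postings):
    # (a) Generate pairs: each pair of postings of a term yields one partial product.
    partials = []
    for plist in postings.values():
        for (di, wi), (dj, wj) in combinations(sorted(plist), 2):
            partials.append(((di, dj), wi * wj))

    # (b) Group pairs: collect the partial products of each document pair.
    grouped = defaultdict(list)
    for pair, product in partials:
        grouped[pair].append(product)

    # (c) Sum pairs: the similarity is the sum of the partial products.
    return {pair: sum(products) for pair, products in grouped.items()}


similarities = pairwise_similarity(postings)
print(similarities)
```

A term with a single posting (here "Cheney" and "Barack") generates no pairs at all, which is why only shared terms contribute.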

  10. MapReduce! (a) Map, (b) Shuffle, (c) Reduce. Data flow: input → map → shuffling (group values by keys) → reduce → output.

  11. And indexing, of course! (a) Map, (b) Shuffle, (c) Reduce. Data flow: doc → tokenize → shuffling (group values by keys) → combine → posting list.
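To make the map, shuffle, and reduce roles concrete, here is a small in-memory sketch that chains an indexing job into a pairwise-similarity job. `run_mapreduce` is a hypothetical stand-in written for this example, not the Hadoop API; it does no distribution or disk I/O, and the toy documents and ids are assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def run_mapreduce(records, mapper, reducer):
    """Hypothetical single-process stand-in for a MapReduce job."""
    # (a) Map: each input record emits zero or more (key, value) pairs.
    intermediate = [kv for record in records for kv in mapper(*record)]
    # (b) Shuffle: group values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # (c) Reduce: each key and its grouped values emit output pairs.
    return dict(kv for key in sorted(groups) for kv in reducer(key, groups[key]))


def index_mapper(doc_id, text):
    # Tokenize the document and emit one posting per distinct term.
    for term, tf in Counter(text.split()).items():
        yield term, (doc_id, tf)


def index_reducer(term, term_postings):
    # Combine the postings of a term into one posting list.
    yield term, sorted(term_postings)


def similarity_mapper(term, plist):
    # Multiply term weights for every pair of documents sharing the term.
    for (di, wi), (dj, wj) in combinations(sorted(plist), 2):
        yield (di, dj), wi * wj


def similarity_reducer(pair, products):
    # Sum the partial products of a document pair.
    yield pair, sum(products)


docs = {"d1": "Clinton Obama Clinton", "d2": "Clinton Cheney", "d3": "Clinton Barack Obama"}
index = run_mapreduce(docs.items(), index_mapper, index_reducer)
similarities = run_mapreduce(index.items(), similarity_mapper, similarity_reducer)
print(similarities)
```

The output of the first job (term, posting list) is exactly the input the second job expects, which is why the two jobs compose cleanly.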

  12. Terms: Zipfian Distribution. Each term t contributes O(df_t^2) partial results, so very few terms dominate the computations: most frequent term ("said") → 3%; most frequent 10 terms → 15%; most frequent 100 terms → 57%; most frequent 1000 terms → 95%, i.e., ~0.1% of total terms (99.9% df-cut). [Plot: doc freq (df) vs. term rank.]
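The effect of a df-cut can be illustrated with synthetic numbers. The sketch below assumes a made-up Zipf-like distribution (df of the term at rank r is about 10000 / r over 1000 terms) and counts df(df - 1)/2 intermediate pairs per term; it does not reproduce the percentages on the slide, and all names are hypothetical.

```python
def intermediate_pairs(df):
    """Document pairs generated by a term that appears in df documents."""
    return df * (df - 1) // 2


def apply_df_cut(doc_freqs, drop_fraction=0.001):
    """Drop the given fraction of terms with the highest document frequency."""
    ranked = sorted(doc_freqs, key=doc_freqs.get, reverse=True)
    n_drop = round(len(ranked) * drop_fraction)
    return {term: doc_freqs[term] for term in ranked[n_drop:]}


# Synthetic Zipf-like document frequencies (an assumption, not corpus data).
doc_freqs = {f"term{r}": max(1, 10_000 // r) for r in range(1, 1001)}

kept = apply_df_cut(doc_freqs, drop_fraction=0.001)
before = sum(intermediate_pairs(df) for df in doc_freqs.values())
after = sum(intermediate_pairs(df) for df in kept.values())
print(f"terms kept: {len(kept)} of {len(doc_freqs)}")
print(f"intermediate pairs: {before:,} -> {after:,}")
```

Because the cost is quadratic in df, removing even one very frequent term removes a disproportionate share of the intermediate pairs.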

  13. Efficiency (disk space). Aquaint-2 Collection, ~ million doc: 8 trillion intermediate pairs. Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk.

  14. Efficiency (disk space). Aquaint-2 Collection, ~ million doc: 8 trillion intermediate pairs; 0.5 trillion intermediate pairs. Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk.

  15. Effectiveness. Drop 0.1% of terms: "near-linear" growth; fit on disk; cost 2% in effectiveness. For more details, check "Pairwise Document Similarity in Large Collections with MapReduce" at ACL 2008 (presented next week!). Hadoop, 19 PCs, each: 2 single-core processors, 4GB memory, 100GB disk.

  16. In ACE! • ~10K docs • each document is a vector • ~140K entities • each has multiple mentions • each entity context is a vector • Generated 8 feature matrices (6 English + 2 Arabic). English pipeline: within-document coreference, pairs filtering, feature generation, clustering. Arabic pipeline: within-document coreference, feature generation, clustering.

  17. Roadmap • Context Features • Pairwise similarity • Efficiency vs. effectiveness • Generating features for ACE • Conversational-genre Features • New generative model • Joint Resolution • Evaluation using ACE-Usenet

  18. Identity Resolution in Email. Date: Wed Dec 20 08:57:00 EST 2000 From: Kay Mann <kay.mann@enron.com> To: Mary Adams <mary.adams@enron.com> Subject: Re: tennis tomorrow! Did Sue want Scott to join? Looks like the game will be too late for him. Identity resolution: who is "Sue"? I.e., label the mention with an email address.

  19. New Generative Model • Choose "person" c to mention: p(c) • Choose appropriate "context" X to mention c: p(X | c) • Choose a "mention" l: p(l | X, c). Example: context "playing tennis", mention "sue".
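Reading the three steps as a chain of conditional choices, one way to write the joint probability and the resulting posterior over candidate persons is shown below. This is a sketch derived from the bullet points with Bayes' rule, using the slide's symbols c, X, and l; the candidate set C is an assumed name, and the formulation in the accompanying paper may differ in detail.

```latex
% Joint probability of choosing person c, context X, and mention l
p(c, X, l) = p(c)\, p(X \mid c)\, p(l \mid X, c)

% Posterior over candidate persons c in a candidate set C, given the observed mention and context
p(c \mid l, X) = \frac{p(c)\, p(X \mid c)\, p(l \mid X, c)}
                      {\sum_{c' \in C} p(c')\, p(X \mid c')\, p(l \mid X, c')}
```

Resolving a mention then amounts to ranking candidates by this posterior.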

  20. Context: topical context, social context, conversational context, local context.

  21. Single-Mention: 2-Step Solution. (1) Identity modeling: prior distribution. (2) Mention resolution: evidence, posterior distribution.

  22. Improved Results: +8.9%, +8.6%. For more details, check "Resolving Personal Names in Email using Context Expansion" at ACL 2008 (also presented next week!).

  23. Limitation! Context-free resolution. [Figure: a graph of mentions ("sjhonson@enron.com", "Susan Scott", "Sue", "Sue", "Suebob", "Susan Jones", "Susan") linked by social, topical, and conversational context.] Joint resolution!

  24. Joint Resolution. Mention graph: spread current resolution, combine context info, update resolution.

  25. Joint Resolution: work in progress! Mention graph: map, shuffle, reduce. MapReduce!

  26. Roadmap • Context Features • Pairwise similarity • Efficiency vs. effectiveness • Generating features for ACE • Conversational-genre Features • New generative model • Joint Resolution • Evaluation using ACE-Usenet

  27. Email Message. From: Machiavegli <machia@aol.com> To: Mark <mk@hotmail> Date: 29 Jan 2005 22:04:38 GMT Subject: The 1860 Presidential Election In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War. (The receiver is an email address.)

  28. Usenet Message. From: Machiavegli <machia@aol.com> Newsgroup: soc.history.what-if Date: 29 Jan 2005 22:04:38 GMT Subject: The 1860 Presidential Election In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War. (The receiver is a newsgroup!)

  29. ACE Usenet Document. <DOCID> soc.history.what-if_20350205910 </DOCID> <POSTER> Machiavegli </POSTER> <POSTDATE> 29 Jan 2005 22:04:38 GMT </POSTDATE> <SUBJECT> The 1860 Presidential Election </SUBJECT> In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War. (No email addresses in headers!)

  30. Reconstruct from automatically. Got the address back! From: Machiavegli <machia@aol.com> Newsgroup: soc.history.what-if Date: 29 Jan 2005 22:04:38 GMT Subject: The 1860 Presidential Election In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War.

  31. Handling it as @. From: Machiavegli <machia@aol.com> To: soc.history.what-if@usenet.com Date: 29 Jan 2005 22:04:38 GMT Subject: The 1860 Presidential Election In 1860 there was a four-way race between the Republican Party with Abraham Lincold, the Democratic Party with Stephen Douglas, the Southern Democratic Party with John Breckenridge, and the Constitutional Union Party with John Bell. Lincoln won a plurality with about 40% of the vote. WI it was only a two-way race between Lincoln and Douglas? I believe Douglas would have won. This would have delayed secession and the Civil War. (Handle the group as the receiver.)
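Treating the newsgroup as the receiver can be sketched as a small header rewrite. The function name and dictionary layout below are assumptions made for illustration; only the `@usenet.com` pseudo-address pattern comes from the slide, and how the sender address is recovered is not covered here.

```python
def usenet_to_email_headers(sender, newsgroup, date, subject, domain="usenet.com"):
    """Build email-style headers in which the newsgroup acts as the receiver."""
    return {
        "From": sender,
        "To": f"{newsgroup}@{domain}",  # pseudo-address standing in for the group
        "Date": date,
        "Subject": subject,
    }


headers = usenet_to_email_headers(
    sender="Machiavegli <machia@aol.com>",
    newsgroup="soc.history.what-if",
    date="29 Jan 2005 22:04:38 GMT",
    subject="The 1860 Presidential Election",
)
for name, value in headers.items():
    print(f"{name}: {value}")
```

With headers in this shape, a model built for email can treat every post to the same newsgroup as sharing a receiver.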

  32. Feature Value: same label • Need for feature matrix (pairwise score). Mentions "Steph", "Stephan", "Stephan", "S. Smith"; labels sjhonson@hotmail.com and sjhonson@hotmail.com: feature value +1.0.

  33. Feature Value: different labels • Need for feature matrix (pairwise score). Mentions "Steph", "Stephan", "Stephan", "S. Smith"; labels sjhonson@hotmail.com and smith_s@aol.com: feature value -1.0.
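The two cases above suggest a simple pairwise feature: +1.0 when two entities resolve to the same email address and -1.0 otherwise. A minimal sketch, assuming every entity already has a resolved label (function names are hypothetical, and how unresolved entities are scored is not stated on the slides):

```python
def label_feature(label_a, label_b):
    """+1.0 if both entities resolve to the same address, -1.0 otherwise."""
    return 1.0 if label_a == label_b else -1.0


def feature_matrix(labels):
    """Pairwise score matrix over a list of resolved labels."""
    return [[label_feature(a, b) for b in labels] for a in labels]


labels = ["sjhonson@hotmail.com", "sjhonson@hotmail.com", "smith_s@aol.com"]
matrix = feature_matrix(labels)
for row in matrix:
    print(row)
```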

  34. Conclusion • MapReduce can be applied to many HLT applications • easy, cheap, and fast for distributed processing • e.g., scalable pairwise similarity for coreference resolution • calls for new ways of thinking • Identity resolution in email • new generative model yields improved accuracy • scalable joint resolution needed • Usenet-ACE is a new test collection

  35. Thank You!

  36. MapReduce and Text Analysis • Computing pairwise similarity in large collections • Joint resolution of mentions in email collections • Search engines (of course!) • Building language models • Clustering applications • Machine translation • …
