
SQuaD the starting point of web intelligence Natural Language Computing, Microsoft Research Asia


Presentation Transcript


  1. SQuaD: the starting point of web intelligence. Natural Language Computing, Microsoft Research Asia. Chin-Yew LIN, cyl@microsoft.com

  2. Web 2.0 Web as a platform • Connect people and services anywhere, anytime, on any device Harnessing collective intelligence • Aggregated grassroots contribution Data is the next “Intel Inside” • Data-centric computing How do we turn DATA into VALUE? Tim O’Reilly’s “What is Web 2.0” http://www.oreillynet.com/pub/a/oreilly/tim/news/2005/09/30/what-is-web-20.html?page=1

  3. Baidu Zhidao (百度知道) 17,012,767 resolved questions in two years of operation. 8,921,610 are knowledge related. 96.7% of questions are resolved. 10,000,000 daily visitors. 71,308 new questions per day. 3.14 answers per question. http://www.searchlab.com.cn (中国人搜索行为研究/User Research Lab of Chinese Search)

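  4. Stickiness of Baidu Zhidao According to a survey by Zhengwang Consulting (正望咨询), Baidu Zhidao is closely tied to search and does much to increase search stickiness; by its statistics, Baidu Zhidao has become one of Baidu's core products. “50% of Baidu users search Zhidao; its user base has already surpassed Baidu Tieba and is comparable to Baidu's MP3 search.” 50% of Baidu users search Baidu Zhidao. Zhidao search traffic is comparable to MP3 search. (http://news.csdn.net/n/20080425/115453.html; 04/25/2008)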

  5. A Traditional QA Architecture • A QA system gives direct answers to a question instead of documents • Falcon QA system (LCC) • Moldovan et al. ACL 2000 • Surdeanu et al. IEEE Trans. PDS 2002 • Best QA system in TREC 8 & 9 • Average question answering time • TREC 8: 48 seconds • TREC 9: 94 seconds Figure: Falcon QA system module analysis (processing time); labels: Traditional IR, Not Scalable

  6. Scalable Question Answering & Distillation Goal: • Create a scalable question and answering service Methods: • Index all question and answer pairs on the web • Enrich QnA through summarization
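
The indexing method named on this slide can be illustrated with a toy inverted index over question and answer pairs. This is only a minimal sketch under stated assumptions: the class name `QAIndex`, the tokenizer, and the term-overlap ranking are hypothetical and are not taken from the talk, which does not describe its index.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class QAIndex:
    """Toy inverted index over (question, answer) pairs (hypothetical design)."""

    def __init__(self):
        self.pairs = []                   # pair_id -> (question, answer)
        self.postings = defaultdict(set)  # term -> set of pair_ids

    def add(self, question, answer):
        """Store a pair and index the terms of its question."""
        pair_id = len(self.pairs)
        self.pairs.append((question, answer))
        for term in set(tokenize(question)):
            self.postings[term].add(pair_id)
        return pair_id

    def search(self, query):
        """Return pairs ranked by the number of query terms they share."""
        counts = defaultdict(int)
        for term in set(tokenize(query)):
            for pair_id in self.postings.get(term, ()):
                counts[pair_id] += 1
        ranked = sorted(counts, key=lambda pid: (-counts[pid], pid))
        return [self.pairs[pid] for pid in ranked]
```

A production system would index answers as well and use a proper retrieval model; plain term overlap is used here only to keep the sketch short.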

  7. Challenges (figure labels: ACL 2008 (twice), SIGIR 2008, AAAI 2008, COLING 2008, WWW 2008)

  8. List of Papers Accepted • Recommending Questions Using the MDL-based Tree Cut Model – Cao et al.; WWW 2008 • Searching Questions by Identifying Question Topic and Question Focus – Duan et al.; ACL 2008 • Using Conditional Random Fields to Extract Contexts and Answers of Questions from Online Forums – Ding et al.; ACL 2008 • Finding Question Answer Pairs from Online Forums – Cong et al.; SIGIR 2008 • Question Utility: A Novel Static Ranking of Question Search – Song et al.; AAAI 2008 • Answer Summarization: Understanding and Summarizing Answers in Community-Based Question Answering Services – Liu et al.; COLING 2008

  9. QA Pairs in Online Forums (figure labels: Context, Questions, Answers)

  10. Question Mining & Answering (ACL 2008 & SIGIR 2008) Extract question and answer pairs • Community QnA • Create a resolved question list • Extract & index question, best answer, and other answers • Yahoo! Answers, Baidu Zhidao, … • Forum • Extract and index threads and postings, find questions and their answers • 6 travel forums

  11. Question Utility (AAAI 2008) Motivation • How useful is a question? • How should we rank questions without queries? Definition • How likely is a question to be asked again? Components • The probability of generating query Q’ from question Q (relevance score) • The prior probability of question Q, reflecting a static rank of the question, i.e. Question Utility
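
The formula itself did not survive in this transcript. Assuming it follows the standard Bayesian reading of the two quantities named on the slide, ranking a question Q for a query Q' would combine them as:

```latex
P(Q \mid Q') \;\propto\; \underbrace{P(Q' \mid Q)}_{\text{relevance score}} \;\cdot\; \underbrace{P(Q)}_{\text{question utility}}
```

Under this reading the prior P(Q) is independent of the query, which is what makes it usable as a static rank.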

  12. Answer Summarization (COLING 2008) Example: “Where to stay in Paris?” • 1,822 answers (Yahoo! Answers 06/23/08) • Is the “best answer” the best answer? Question clustering • Find similar questions Answer summarization • Aggregate answers for a question cluster Figure labels: Answer Taxonomy, Question Taxonomy

  13. Question Search & Recommendation Yunbo CAO & Chin-Yew LIN WWW 2008 & ACL 2008

  14. Question Search & Recommendation (WWW 2008 & ACL 2008) Query • We would like to know what will be available to see in the Forbidden City because we understand that it will be under repairs. Question search • Is it true that the Forbidden City is undergoing renovation & we won't be allow to enter? Question recommendation • Would you get a lower price by not needing a guide for the Forbidden City and etc? • Can anybody recommend a budget hotel near Forbidden City? Question = Topic + Focus + Others (TFO) • Search: same topic, similar focuses • Recommend: same topic, different focuses How can we discriminate topic from focus?

  15. Identifying Topic and Focus Specificity: the inverse of the entropy of the topic term's distribution over the sub-categories. Order topic terms by their specificity. China • Anyone know where to see the Dragon Boat Festival in Beijing? • Where is a good (less expensive) place to shop in Beijing? • What's the cheapest way to get from Beijing to Hong Kong? Europe • How far is it from Berlin to Hamburg? • What is the cheapest way from Berlin to Hamburg? • Where to see between Hamburg and Berlin? • How long does it take from Hamburg to Berlin? Figure (category tree) labels: Travel @Yahoo! Answers, Asia Pacific, China, Japan, …, Europe, …
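
The specificity definition above can be sketched in a few lines. The function name, the smoothing constant `epsilon` (added so that a term confined to one sub-category does not divide by zero), and the counts in the usage note are assumptions made for illustration, not values from the talk.

```python
import math


def specificity(category_counts, epsilon=0.001):
    """Inverse entropy of a term's distribution over sub-categories.

    category_counts maps a sub-category name to the number of questions
    in that sub-category that contain the term. epsilon is an assumed
    smoothing constant that keeps the result finite at zero entropy.
    """
    total = sum(category_counts.values())
    if total <= 0:
        raise ValueError("term must occur at least once")
    entropy = 0.0
    for count in category_counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return 1.0 / (entropy + epsilon)
```

With made-up counts, a place name seen only under one sub-category, such as `specificity({"Europe": 10})`, scores far higher than a generic phrase spread evenly over two, such as `specificity({"China": 5, "Europe": 5})`, which matches the intuition that place names are more topic-like than phrases such as "where to see".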

  16. Order Topic Terms by Specificity • Query: Any cool clubs in Hamburg or Berlin? • Topic Terms: cool clubs, Hamburg, Berlin • Topic Chain: Hamburg → Berlin → cool clubs • Related question: Where to see in Hamburg or Berlin? • Topic Terms: where to see, Hamburg, Berlin • Topic Chain: Hamburg → Berlin → where to see • Related question: How far is it from Berlin to Hamburg? • Topic Chain: Hamburg → Berlin → how far Figure labels: Question Topic, Question Focus; nodes: Hamburg, Berlin, cool clubs, where to see, how far
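
Building a topic chain from topic terms amounts to a sort by descending specificity. The sketch below assumes the specificity scores are already available; the numeric scores in the usage note are invented purely to reproduce the ordering shown on the slide.

```python
def topic_chain(terms, specificity_of):
    """Order topic terms from most to least specific.

    specificity_of maps each term to its specificity score. The sort is
    stable, so terms with equal scores keep their input order.
    """
    return sorted(terms, key=lambda term: -specificity_of[term])


def format_chain(chain):
    """Render a chain in the arrow notation used on the slides."""
    return " → ".join(chain)
```

For example, with hypothetical scores `{"Hamburg": 5.0, "Berlin": 4.0, "cool clubs": 0.5}`, calling `format_chain(topic_chain(["cool clubs", "Hamburg", "Berlin"], scores))` yields the chain Hamburg → Berlin → cool clubs.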

  17. Determine the Cut on a Question Tree. The Use of the MDL (Minimum Description Length) Based Tree Cut Model (Li & Abe 1998). Figure (question tree) nodes: ROOT, Berlin, Hamburg, Berlin; leaves: cheap hotel (1), fun club (1), how long does it take (1), cool club (1), nice hotel (1), …

  18. The MDL-based Tree Cut Model (Li & Abe, CL1998)
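
The slide content for the model was an image and is not preserved. As a rough sketch of how a cut is scored in the Li & Abe formulation, the description length of a cut is the parameter cost, (k/2)·log2|S| with k one less than the number of classes in the cut, plus the data cost, where each leaf under a class receives an equal share of that class's probability. The function name, the input representation, and the toy frequencies below are assumptions for illustration.

```python
import math


def description_length(cut, sample_size):
    """Description length (in bits) of a tree cut.

    cut is a list of (class_frequency, num_leaves) pairs, one per class
    in the cut; class frequencies must sum to sample_size.
    """
    k = len(cut) - 1  # number of free parameters
    parameter_length = (k / 2) * math.log2(sample_size)
    data_length = 0.0
    for frequency, num_leaves in cut:
        if frequency == 0:
            continue  # a class with no samples adds no data cost
        # Class probability spread uniformly over the leaves it covers.
        p_leaf = (frequency / sample_size) / num_leaves
        data_length -= frequency * math.log2(p_leaf)
    return parameter_length + data_length


# Toy tree: ROOT -> X (leaves a:4, b:4) and Y (leaves c:1, d:1); |S| = 10.
root_cut = [(10, 4)]
middle_cut = [(8, 2), (2, 2)]
leaf_cut = [(4, 1), (4, 1), (1, 1), (1, 1)]
best_cut = min([root_cut, middle_cut, leaf_cut],
               key=lambda cut: description_length(cut, 10))
```

On this toy data the middle cut has the smallest description length: it is cheaper than the root cut because it fits the data better, and cheaper than the leaf cut because it needs fewer parameters. Li & Abe find the best cut with a recursive bottom-up procedure rather than by enumerating cuts as done here.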

  19. Scoring the Candidates Given a queried question and a candidate • The search relevance score is (formula not preserved in this transcript) • The recommendation score is (formula not preserved in this transcript) Figure labels: Question Topic, Question Focus

  20. Flow of Question Search/Recommendation Query: any cool clubs in Berlin or Hamburg? STEP 1: Retrieve Related Questions (from the index) • Related Questions: • 1. Where to see between Hamburg and Berlin? • 2. How far is it from Berlin to Hamburg? • 3. Any good hostels in Hamburg or Berlin? • 4. What are the most/best fun club in Hamburg? STEP 2: Discriminate Question Topic from Question Focus (tree nodes: Hamburg, Berlin, cool club, where to see, how far, good hostel, fun club) STEP 3: Rank Questions on the basis of the cut • Search: • 1. What are the most/best fun club in Hamburg? • Recommendation: • 1. Where to see between Hamburg and Berlin? • 2. How far is it from Berlin to Hamburg? • 3. Any good hostels in Hamburg or Berlin?
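
The final step of this flow can be illustrated with a deliberately simplified stand-in. The talk's actual scoring formulas are not preserved here, so this sketch assumes each question has already been split into topic terms and a focus phrase, and it uses plain token overlap between focus phrases as a hypothetical proxy for "similar focus".

```python
def split_candidates(query, candidates):
    """Separate related questions into search and recommendation lists.

    query is (topic_terms, focus_phrase). Each candidate is
    (question_text, topic_terms, focus_phrase). A candidate must share
    at least one topic term with the query; it goes to the search list
    when its focus shares a token with the query focus (assumed proxy
    for similarity), otherwise to the recommendation list.
    """
    query_topic, query_focus = query
    query_focus_tokens = set(query_focus.lower().split())
    search, recommend = [], []
    for text, topic, focus in candidates:
        if not set(topic) & set(query_topic):
            continue  # different topic: neither searched nor recommended
        if set(focus.lower().split()) & query_focus_tokens:
            search.append(text)
        else:
            recommend.append(text)
    return search, recommend


query = ({"Berlin", "Hamburg"}, "cool club")
candidates = [
    ("Where to see between Hamburg and Berlin?", {"Hamburg", "Berlin"}, "where to see"),
    ("How far is it from Berlin to Hamburg?", {"Berlin", "Hamburg"}, "how far"),
    ("Any good hostels in Hamburg or Berlin?", {"Hamburg", "Berlin"}, "good hostel"),
    ("What are the most/best fun club in Hamburg?", {"Hamburg"}, "fun club"),
]
search, recommend = split_candidates(query, candidates)
```

On the slide's example this puts the "fun club" question in the search list and the other three in the recommendation list, in their original order. The real system ranks candidates with scores derived from the tree cut rather than with a binary overlap test.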

  21. Experimental Results (Search) Data (Yahoo! Answers) • Query: 200 questions about ‘travel’; 200 questions about ‘computers & internet’ • Relevance: human judgment Baselines • VSM (Vector Space Model), LMIR (Language Model for Information Retrieval) Results • Travel • Computers & Internet

  22. Experimental Results (Recommendation) Data (Yahoo! Answers) • Query: 100 questions about ‘travel’; 100 questions about ‘computers & internet’ • Relevance: human judgment Baselines • VSM (Vector Space Model), PVSM (Phrase-based Vector Space Model) Results • Travel • Computers & Internet

  23. Error Analysis (Search) Statistics on question topic/focus identification errors The reason: data sparseness (more than 0.04 MAP drop) • No question focus (data sparseness over question topics) • Does anyone know anything about West Suburban Dialysis in Chicago? • West Suburban Dialysis → Chicago → anything • Remedy: search question descriptions and answers as well as question titles • Inaccurate specificity (data sparseness over question foci) • Any nightlife activities near Generator Hostel, Berlin? • Incorrect: Generator Hostel → nightlife activity → Berlin • Correct: Generator Hostel → Berlin → nightlife activity • Remedy: cluster topic terms (e.g., nightlife activity vs. night life activity)

  24. Knowledge Distillation & Dissemination

  25. Q&A = Knowledge = Power • Q&A is a complement to web keyword search • Q&A can enhance existing QnA and search services • Leverage existing knowledge in question and answer form • Question and Answer = Knowledge • Knowledge = Power

  26. Discussion
