Reyyan Yeniterzi

Weakly-Supervised Discovery of Named Entities Using Web Search Queries. CIKM 2007. Marius Pasca, Google. Reyyan Yeniterzi. Motivation. Named entities are essential during the construction of knowledge bases from the Web and helpful in various NLP tasks, such as parsing and coreference resolution …

Presentation Transcript


  1. Weakly-Supervised Discovery of Named Entities Using Web Search Queries. CIKM 2007. Marius Pasca, Google. Reyyan Yeniterzi

  2. Motivation • Named entities • are essential during the construction of knowledge bases from the Web • are helpful in various NLP tasks, such as parsing and coreference resolution • constitute a significant part of Web search queries • are helpful in building verticals in Web search

  3. Previous work • Mining query logs • to improve various IR tasks • re-ranking of retrieved documents • query expansion • spelling correction • Large-scale IE • mainly on document collections • ignoring the collective knowledge embedded in noisy search queries • This is the first work that applies named entity finding to Web search query logs

  4. Extraction from Query Logs • Given • A set of target classes • A set of seed instances • The goal • To extract relevant class instances from query logs • Without using • Any domain knowledge • Any handcrafted extraction pattern

  5. Overview of the System

  6. Step 1: identification of query templates that match the seed instances

  7. Step 2: identification of candidate instances
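
Steps 1 and 2 can be illustrated with a small sketch. This is a hedged illustration, not the paper's implementation: the function names `find_templates` and `find_candidates`, the whitespace-based matching, and the toy drug queries are all assumptions made for the example.

```python
from collections import defaultdict


def find_templates(queries, seeds):
    """Step 1 (sketch): collect (prefix, postfix) pairs around seed instances."""
    templates = set()
    for query in queries:
        padded = f" {query} "  # padding lets us match whole words only
        for seed in seeds:
            needle = f" {seed} "
            pos = padded.find(needle)
            if pos == -1:
                continue
            prefix = padded[:pos].strip()
            postfix = padded[pos + len(needle):].strip()
            # Assumption: skip the empty template, which would match every query.
            if prefix or postfix:
                templates.add((prefix, postfix))
    return templates


def find_candidates(queries, templates):
    """Step 2 (sketch): any phrase filling a template slot is a candidate."""
    candidates = defaultdict(set)
    for query in queries:
        for prefix, postfix in templates:
            head = prefix + " " if prefix else ""
            tail = " " + postfix if postfix else ""
            fits = query.startswith(head) and query.endswith(tail)
            if fits and len(query) > len(head) + len(tail):
                slot = query[len(head):len(query) - len(tail)]
                candidates[slot].add((prefix, postfix))
    return candidates


# Toy data (hypothetical queries, one seed for a "Drug" class).
queries = [
    "lipitor side effects",
    "zocor side effects",
    "buy lipitor online",
    "buy xanax online",
    "weather today",
]
templates = find_templates(queries, ["lipitor"])
candidates = find_candidates(queries, templates)
```

In this toy run the seed itself is also returned as a candidate; whether seeds are filtered out before ranking is a design choice the slides do not specify.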

  8. Step 3: internal representation of candidate instances • query: prefix candidate_instance postfix • entry: prefix postfix • weight of an entry = frequency of the query
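
A minimal sketch of the Step 3 representation, assuming each candidate is stored as a sparse vector keyed by (prefix, postfix) and weighted by query frequency. The name `signature_vector`, the `query_counts` mapping, and the toy frequencies are hypothetical.

```python
from collections import Counter


def signature_vector(instance, query_counts):
    """Build a sparse vector: (prefix, postfix) entry -> summed query frequency."""
    vector = Counter()
    needle = f" {instance} "
    for query, freq in query_counts.items():
        padded = f" {query} "  # padding restricts matches to whole words
        pos = padded.find(needle)
        if pos == -1:
            continue
        prefix = padded[:pos].strip()
        postfix = padded[pos + len(needle):].strip()
        # Assumption: a query consisting only of the instance adds no entry.
        if prefix or postfix:
            vector[(prefix, postfix)] += freq
    return vector


# Toy query log (hypothetical frequencies).
query_counts = {
    "zocor side effects": 40,
    "buy zocor online": 10,
    "zocor": 100,
    "weather today": 5,
}
zocor_vector = signature_vector("zocor", query_counts)
```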

  9. Step 4: internal representation of seed instances • introduces weak supervision into the extraction process • the vectors associated with the seed instances are merged into a reference search-signature vector • a loose search fingerprint of the desired output type with respect to the class
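
The slide does not say how the seed vectors are merged. The sketch below assumes the simplest option, summing the weights entry by entry and then normalizing to a probability distribution; both choices, and the name `merge_seed_vectors`, are assumptions rather than the paper's stated method.

```python
from collections import Counter


def merge_seed_vectors(seed_vectors):
    """Sum per-seed vectors and normalize (assumed merge strategy)."""
    merged = Counter()
    for vector in seed_vectors:
        merged.update(vector)  # Counter.update adds counts entry by entry
    total = sum(merged.values())
    if total == 0:
        return {}
    return {entry: weight / total for entry, weight in merged.items()}


# Toy seed vectors (hypothetical entries and weights).
seed_a = {("", "side effects"): 30, ("buy", "online"): 20}
seed_b = {("", "side effects"): 10, ("", "dosage"): 40}
reference = merge_seed_vectors([seed_a, seed_b])
```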

  10. Step 5: instance ranking • ranking based on the similarity score (computed with Jensen-Shannon divergence) between each candidate vector and the class vector
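
As a sketch of Step 5, the code below computes the Jensen-Shannon divergence (base-2 logarithm) between normalized sparse vectors and ranks candidates by ascending divergence, so the most similar candidate comes first. The function names, the normalization step, and the toy vectors are assumptions; how the score is turned into a ranking in the original system is not detailed on the slide.

```python
import math


def normalize(vector):
    """Scale weights so they sum to 1; an empty vector stays empty."""
    total = sum(vector.values())
    return {k: w / total for k, w in vector.items()} if total else {}


def jensen_shannon(p, q):
    """JS divergence between two distributions given as sparse dicts."""
    divergence = 0.0
    for key in set(p) | set(q):
        pk, qk = p.get(key, 0.0), q.get(key, 0.0)
        mk = 0.5 * (pk + qk)  # mixture distribution
        if pk > 0:
            divergence += 0.5 * pk * math.log2(pk / mk)
        if qk > 0:
            divergence += 0.5 * qk * math.log2(qk / mk)
    return divergence


def rank_candidates(candidate_vectors, reference):
    """Return candidate names, most similar to the reference first."""
    ref = normalize(reference)
    scored = [
        (jensen_shannon(normalize(vec), ref), name)
        for name, vec in candidate_vectors.items()
    ]
    return [name for _, name in sorted(scored)]


# Toy vectors (hypothetical entries and weights).
reference = {("", "side effects"): 4, ("buy", "online"): 1}
candidate_vectors = {
    "zocor": {("", "side effects"): 8, ("buy", "online"): 2},
    "paris": {("hotels in", ""): 5},
}
ranking = rank_candidates(candidate_vectors, reference)
```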

  11. Reference search-signature vectors • A series of queries that can be asked about instances of a class • Given a set of candidate phrases, the system guesses which candidate phrases are more likely to belong to the target class by looking at the queries

  12. Experimental setting - 1 • Target Classes • 10 classes with 5 seed instances for each class • City • Country • Drug • Food • Location • Movie • Newspaper • Person • University • VideoGame

  13. Experimental setting - 2 • Data • A random sample of 50 million unique fully-anonymized queries submitted to Google • Evaluation Procedure • Top 250 candidates of each class are manually assigned a correctness label • 1 : correct • 0 : incorrect • Precision at rank N has been calculated for several N values
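
Precision at rank N over the 0/1 correctness labels can be computed as in this short sketch. The name `precision_at` and the toy labels are illustrative, and dividing by N even when fewer than N labels exist is an assumption.

```python
def precision_at(labels, n):
    """Fraction of the top-n ranked candidates labeled correct (1)."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Assumption: missing ranks count as incorrect, so we divide by n.
    return sum(labels[:n]) / n


# Toy labels for a ranked candidate list (1 = correct, 0 = incorrect).
labels = [1, 1, 0, 1, 0]
p_at_1 = precision_at(labels, 1)
p_at_3 = precision_at(labels, 3)
p_at_5 = precision_at(labels, 5)
```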

  14. Quality of Extracted Instances

  15. Is the popularity of seed instances in query logs correlated with precision? • (Diagram labels: more queries with seed instances; better internal representation; more accurate scoring; "?". Values shown: -0.17, -0.19)

  16. Comparing the usefulness of query logs vs. Web documents in NE finding • M. Pasca. Acquisition of categorized named entities for Web search. CIKM 2004 • Target classes are incrementally acquired from Web documents, along with their respective instances, by using hand-crafted extraction patterns (D-patt) • Class [such as|including] Instance • Manual one-to-one mapping of chosen target classes to acquired classes
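
To make the D-patt idea concrete, the pattern "Class [such as|including] Instance" can be approximated with a regular expression. This is a simplified sketch, not the CIKM 2004 system: treating the class as a single word and the instance as a run of capitalized words are assumptions, as are the name `extract_pairs` and the toy sentence.

```python
import re

# Assumed simplification: class = one word, instance = capitalized word run.
D_PATT = re.compile(
    r"\b(\w+) (?:such as|including) ((?:[A-Z]\w*)(?: [A-Z]\w*)*)"
)


def extract_pairs(text):
    """Return (class, instance) pairs matched by the simplified pattern."""
    return D_PATT.findall(text)


# Toy sentence (hypothetical).
pairs = extract_pairs(
    "cities such as Paris and newspapers including The Guardian"
)
```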

  17. Comparing the usefulness of query logs vs. Web documents in NE finding • Instances extracted from Web documents are also manually labeled as correct or incorrect • Except for the City, Newspaper and Country classes, seed-based extraction from queries outperformed D-patt in every class

  18. Conclusion • Search queries, often regarded as noisy, keyword-based approximations of underspecified user information needs, proved to be useful for named entity discovery even with a small set of seed instances • Absolute precision (and precision improvement relative to the Web-based hand-crafted system) • 0.96 (29%) for prec@50 • 0.90 (26%) for prec@150 • 0.80 (15%) for prec@250

  19. Questions?