Reyyan Yeniterzi

Weakly-Supervised Discovery of Named Entities Using Web Search Queries • CIKM 2007 • Marius Pasca • Google • Reyyan Yeniterzi

Motivation • Named entities • are essential when constructing knowledge bases from the Web • are helpful in various NLP tasks, such as parsing and coreference resolution • constitute a significant part of Web search queries • are helpful in building verticals in Web search

Previous work • Mining query logs • to improve various IR tasks • re-ranking of retrieved documents • query expansion • spelling correction • Large-scale IE • mainly on document collections • ignoring the collective knowledge embedded in noisy search queries • This is the first work that applies named entity discovery to Web search query logs

Extraction from Query Logs • Given • A set of target classes • A set of seed instances • The goal • To extract relevant class instances from query logs • Without using • Any domain knowledge • Any handcrafted extraction pattern

Overview of the System

Step 1: identification of query templates that match the seed instances

Step 2: identification of candidate instances
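Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes whitespace tokenization, lowercasing, and exact token matching, and the function names find_templates and find_candidates are hypothetical.

```python
from collections import defaultdict

def find_templates(queries, seeds):
    """Step 1: collect (prefix, postfix) token tuples from queries containing a seed."""
    templates = set()
    for query in queries:
        tokens = query.lower().split()
        for seed in seeds:
            seed_tokens = seed.lower().split()
            n = len(seed_tokens)
            for i in range(len(tokens) - n + 1):
                if tokens[i:i + n] == seed_tokens:
                    prefix = tuple(tokens[:i])
                    postfix = tuple(tokens[i + n:])
                    if prefix or postfix:  # a query that is only the seed gives no template
                        templates.add((prefix, postfix))
    return templates

def find_candidates(queries, templates):
    """Step 2: any non-empty phrase that fills a known template is a candidate."""
    candidates = defaultdict(set)
    for query in queries:
        tokens = tuple(query.lower().split())
        for prefix, postfix in templates:
            end = len(tokens) - len(postfix)
            if (len(prefix) < end
                    and tokens[:len(prefix)] == prefix
                    and tokens[end:] == postfix):
                phrase = " ".join(tokens[len(prefix):end])
                candidates[phrase].add((prefix, postfix))
    return candidates
```

With the seed "lipitor", the query "lipitor side effects" yields a template with an empty prefix and the postfix "side effects", so "zoloft side effects" contributes "zoloft" as a candidate.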

Step 3: internal representation of candidate instances • each candidate instance is represented as a vector • query: prefix + candidate_instance + postfix • entry: (prefix, postfix) • weight of an entry = frequency of the query
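A small sketch of this representation, assuming the matches from the earlier steps are already available as (prefix, candidate, postfix, query_frequency) tuples. That input shape and the name build_search_signatures are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_search_signatures(matches):
    """Map each candidate to a sparse vector: (prefix, postfix) -> summed query frequency."""
    vectors = defaultdict(Counter)
    for prefix, candidate, postfix, frequency in matches:
        vectors[candidate][(prefix, postfix)] += frequency
    return vectors

# Hypothetical matches with made-up frequencies, for illustration only.
example_matches = [
    ("", "zoloft", "side effects", 120),
    ("buy", "zoloft", "online", 30),
    ("", "aspirin", "side effects", 80),
]
example_vectors = build_search_signatures(example_matches)
```

A sparse mapping is used here because each candidate typically occurs with only a small fraction of all templates.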

Step 4: internal representation of seed instances • introduces weak supervision into the extraction process • the vectors associated with the seed instances are merged into a reference search-signature vector • a loose search fingerprint of the desired output type with respect to the class
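The slides do not say how the seed vectors are merged. The sketch below assumes the simplest option, an entry-by-entry sum of weights; merge_seed_vectors is a hypothetical name.

```python
from collections import Counter

def merge_seed_vectors(seed_vectors):
    """Sum the seed vectors entry by entry (an assumed merge rule)."""
    reference = Counter()
    for vector in seed_vectors:
        reference.update(vector)  # Counter.update adds counts rather than replacing them
    return reference
```

Under this assumption, templates shared by several seeds receive more weight in the reference vector than templates seen with a single seed.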

Step 5: instance ranking • candidates are ranked by the similarity score (computed with Jensen-Shannon divergence) between each candidate vector and the reference search-signature vector of the class
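A minimal sketch of this ranking, assuming each vector is a non-empty mapping from template to weight that is normalized into a probability distribution before comparison. The base-2 logarithm and the function names are choices made here, not details taken from the slides.

```python
import math

def normalize(vector):
    """Scale weights so they sum to 1 (the vector must be non-empty)."""
    total = sum(vector.values())
    return {key: value / total for key, value in vector.items()}

def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    p, q = normalize(p), normalize(q)
    divergence = 0.0
    for key in set(p) | set(q):
        pi, qi = p.get(key, 0.0), q.get(key, 0.0)
        mi = 0.5 * (pi + qi)  # mixture distribution
        if pi > 0:
            divergence += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            divergence += 0.5 * qi * math.log2(qi / mi)
    return divergence

def rank_candidates(candidate_vectors, reference):
    """Return candidate names ordered from most to least similar to the reference."""
    scored = [(jensen_shannon(vector, reference), name)
              for name, vector in candidate_vectors.items()]
    return [name for _, name in sorted(scored)]
```

Because a lower divergence means a closer match, the candidates are sorted in ascending order of their score.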

Reference search-signature vectors • A series of queries that can be asked about instances of a class • Given a set of candidate phrases, the system guesses which candidate phrases are more likely to belong to the target class by looking at the queries that contain them

Experimental setting - 1 • Target Classes • 10 classes with 5 seed instances for each class • City • Country • Drug • Food • Location • Movie • Newspaper • Person • University • VideoGame

Experimental setting - 2 • Data • A random sample of 50 million unique fully-anonymized queries submitted to Google • Evaluation Procedure • Top 250 candidates of each class are manually assigned a correctness label • 1 : correct • 0 : incorrect • Precision at rank N has been calculated for several N values
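Precision at rank N over the manually assigned labels can be computed as in the sketch below. It assumes the labels are given as a list of 0/1 values in ranked order; precision_at is a hypothetical helper name.

```python
def precision_at(labels, n):
    """Fraction of correct (label 1) items among the top n ranked candidates."""
    top = labels[:n]
    if not top:
        raise ValueError("need at least one labeled candidate")
    return sum(top) / len(top)
```

For the labels [1, 1, 0, 1, 0], precision at rank 2 is 2/2 and precision at rank 5 is 3/5.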

Quality of Extracted Instances

Is the popularity of seed instances in query logs correlated with precision? • Hypothesis: more queries with seed instances → better internal representation → more accurate scoring? • Values shown on the slide: -0.17, -0.19

Comparing the usefulness of query logs vs. Web documents in NE finding • M. Pasca. Acquisition of categorized named entities for Web search. CIKM 2004 • Target classes are incrementally acquired from Web documents, along with their respective instances, using handcrafted extraction patterns (D-patt) • Pattern: Class [such as|including] Instance • Manual one-to-one mapping of the chosen target classes to the acquired classes
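To make the pattern "Class [such as|including] Instance" concrete, here is a rough regular-expression sketch. It is not the D-patt implementation: it assumes the class is a single word and the instance is a run of capitalized words, and D_PATT and extract_pairs are hypothetical names.

```python
import re

# Assumed simplification: one-word class, instance = consecutive capitalized words.
D_PATT = re.compile(
    r"\b(\w+)\s+(?:such as|including)\s+([A-Z]\w*(?:\s+[A-Z]\w*)*)"
)

def extract_pairs(text):
    """Return (class, instance) pairs matched by the simplified pattern."""
    return [(m.group(1).lower(), m.group(2)) for m in D_PATT.finditer(text)]
```

Applied to "cities such as Paris", this yields the pair ("cities", "Paris"). Only the first instance of an enumeration is captured, which is one of several limitations of this simplified version.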

Comparing the usefulness of query logs vs. Web documents in NE finding • Instances extracted from Web documents are also manually labeled as correct or incorrect • Except for the City, Newspaper, and Country classes, seed-based extraction from queries outperformed D-patt in every class

Conclusion • Search queries, which are often thought of as noisy, keyword-based approximations of underspecified user information needs, proved useful for named entity discovery even with a small set of seed instances • Absolute precision (and improvement relative to the handcrafted, Web-document-based system) • 0.96 (29%) for prec@50 • 0.90 (26%) for prec@150 • 0.80 (15%) for prec@250

Questions?
