Minimally Supervised Learning of Semantic Knowledge from Query Logs

Mamoru Komachi(†) and Hisami Suzuki(‡)
(†) Nara Institute of Science and Technology, Japan
(‡) Microsoft Research, USA
IJCNLP-08, Hyderabad, India

Task
• Learn semantic categories from web search query logs by bootstrapping with minimal supervision
• Semantic category: a set of interrelated words
  • Named entities, technical terms, paraphrases, …
• Can be useful for search ads, etc.
Example: Darjeeling, Kombucha (Japanese tea), and Chai (Indian tea) are similar to one another.

Our Contribution
• First to use Japanese query logs for the task of learning named entities
• Propose an efficient method suited to query logs, based on the general-purpose Espresso algorithm (Pantel and Pennacchiotti, 2006)

Table of Contents
• Related work
  • Bootstrapping techniques for relation extraction
  • Scoring metrics
• The Tchai algorithm
  • Problems of Espresso
  • Extension to Espresso
• Experiment
  • System performance and comparison to other algorithms
  • Samples of extracted instances and patterns

Bootstrapping
• Iteratively conduct pattern induction and instance extraction, starting from seed instances
• Can grow a small set of seed instances into a larger one
Example (#: slot): the instance "vaio" occurs in the query "Compare vaio laptop" in the query log (corpus), which gives the contextual pattern "Compare # laptop". That pattern matches "Compare toshiba satellite laptop" and "Compare HP xb3000 laptop", which gives the new instances "Toshiba satellite" and "HP xb3000".
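The loop described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`induce_patterns`, `extract_instances`, `bootstrap`) are hypothetical, patterns are whole queries with the instance replaced by the slot `#`, and no scoring or filtering is applied.

```python
def induce_patterns(queries, instances, slot="#"):
    """Turn every query that contains a known instance into a pattern."""
    patterns = set()
    for query in queries:
        for inst in instances:
            if inst in query:
                # Replace the first occurrence of the instance with the slot.
                patterns.add(query.replace(inst, slot, 1))
    return patterns


def extract_instances(queries, patterns, slot="#"):
    """Collect whatever string fills the slot when a pattern matches a query."""
    found = set()
    for query in queries:
        for pattern in patterns:
            prefix, _, suffix = pattern.partition(slot)
            inner_len = len(query) - len(prefix) - len(suffix)
            if inner_len > 0 and query.startswith(prefix) and query.endswith(suffix):
                found.add(query[len(prefix):len(prefix) + inner_len])
    return found


def bootstrap(queries, seeds, iterations=1):
    """Alternate pattern induction and instance extraction (no scoring)."""
    instances = set(seeds)
    for _ in range(iterations):
        patterns = induce_patterns(queries, instances)
        instances |= extract_instances(queries, patterns)
    return instances


# The three queries from the slide, with "vaio" as the only seed.
queries = [
    "Compare vaio laptop",
    "Compare toshiba satellite laptop",
    "Compare HP xb3000 laptop",
]
learned = bootstrap(queries, seeds={"vaio"})
print(sorted(learned))
```

Because every matching string is accepted, a sketch like this drifts quickly on real logs; the scoring and filtering steps are what keep the extracted set on topic.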

Instance lookup and pattern induction
• Use all strings but instances as patterns = requires no segmentation
• Broad coverage, but noisy patterns
• Problems
  • Semantic drift
  • Computational efficiency
Example: the query "ANA 予約" (ANA reservation) in the query log contains the instance "ANA" and gives the extracted pattern "# 予約" (# reservation). This is a generic pattern: restaurant reservation? Flight reservation?
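One way to see why such a pattern is generic is to list which instances fill its slot. The sketch below is illustrative only: `fillers_per_pattern` is a hypothetical helper, and the four queries are made-up examples (予約 = reservation, レストラン = restaurant, 時刻表 = timetable). No word segmentation is needed, because the pattern is simply the query string with the instance cut out.

```python
from collections import defaultdict


def fillers_per_pattern(queries, instances, slot="#"):
    """Map each induced pattern to the set of instances that fill its slot."""
    fillers = defaultdict(set)
    for query in queries:
        for inst in instances:
            if inst in query:
                fillers[query.replace(inst, slot, 1)].add(inst)
    return fillers


# Made-up queries: two airlines, one restaurant query, one timetable query.
queries = ["ANA 予約", "JAL 予約", "レストラン 予約", "ANA 時刻表"]
instances = ["ANA", "JAL", "レストラン"]

fillers = fillers_per_pattern(queries, instances)
for pattern, insts in fillers.items():
    print(pattern, sorted(insts))
```

Here "# 予約" is filled by airlines and by a restaurant alike, while "# 時刻表" is filled only by an airline, so the first pattern is the one likely to cause semantic drift.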

Instance/Pattern Scoring Metrics
• Sekine & Suzuki (2007)
  • Starts from a large named entity dictionary
  • Assign low scores to generic patterns and ignore them
• Basilisk (Thelen and Riloff, 2002)
  • Balance the recall and precision of generic patterns
• Espresso (Pantel and Pennacchiotti, 2006)
  • Reliability of an instance and a pattern is mutually defined
  • PMI is normalized by the maximum over all P and I
  • P: patterns in corpus; I: instances in corpus; PMI: pointwise mutual information; r: reliability score
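The mutual definition of Espresso's reliability scores can be written out as follows. This follows the form given by Pantel and Pennacchiotti (2006), adapted here to a single-slot pattern; the subscripts \pi (pattern) and \iota (instance) and the count notation |i, p| are taken from that paper rather than from this slide, and details such as PMI discounting are left out.

```latex
% Reliability of a pattern p, averaged over all instances I
r_\pi(p) = \frac{\sum_{i \in I} \frac{\mathrm{pmi}(i, p)}{\max_{\mathrm{pmi}}} \, r_\iota(i)}{|I|}

% Reliability of an instance i, averaged over all patterns P
r_\iota(i) = \frac{\sum_{p \in P} \frac{\mathrm{pmi}(i, p)}{\max_{\mathrm{pmi}}} \, r_\pi(p)}{|P|}

% Pointwise mutual information from co-occurrence counts
\mathrm{pmi}(i, p) = \log \frac{|i, p|}{|i, *| \, |*, p|}
```

Here \max_{\mathrm{pmi}} is the maximum PMI over all instances and patterns in the corpus, and seed instances start with a reliability of 1.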

The Tchai Algorithm
• Filter generic patterns/instances
  • Do not select generic patterns and instances
• Replace the scaling factor in reliability scores
  • Take the maximum PMI for a given instance/pattern rather than the maximum over all instances and patterns
  • This modification has a large impact on the effectiveness of our algorithm
• Only induce patterns at the beginning
  • Tchai runs 400X faster than Espresso
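The effect of changing the scaling factor can be illustrated with a small sketch. It is not the authors' code: `pattern_reliability` is a hypothetical function, the PMI values are invented and assumed positive, and the "local" maximum is assumed to be the largest PMI among the pairs that involve the pattern being scored.

```python
def pattern_reliability(pmi, inst_rel, local_max=True):
    """Score patterns from instance reliabilities.

    pmi:       dict mapping (instance, pattern) -> PMI value (assumed > 0)
    inst_rel:  dict mapping instance -> reliability score
    local_max: True  -> scale by the maximum PMI of the pattern being scored
               False -> scale by the maximum PMI over all pairs
    """
    global_max = max(pmi.values())
    patterns = {p for _, p in pmi}
    scores = {}
    for p in patterns:
        row = {i: v for (i, q), v in pmi.items() if q == p}
        scale = max(row.values()) if local_max else global_max
        total = sum(v / scale * inst_rel.get(i, 0.0) for i, v in row.items())
        scores[p] = total / len(inst_rel)
    return scores


# Invented PMI values: pattern "p1" has much larger PMI values than "p2".
pmi = {
    ("ana", "p1"): 8.0, ("jal", "p1"): 4.0,
    ("ana", "p2"): 2.0, ("jal", "p2"): 1.0,
}
inst_rel = {"ana": 1.0, "jal": 1.0}

global_scores = pattern_reliability(pmi, inst_rel, local_max=False)
local_scores = pattern_reliability(pmi, inst_rel, local_max=True)
print(global_scores)
print(local_scores)
```

With the global maximum, one large PMI value shrinks the score of every other pattern; with the per-pattern maximum, each pattern is scored relative to its own best pair, which keeps scores on a comparable scale. The instance-side score can be computed symmetrically.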

Experiments
• Japanese query logs from 2007/01-02
  • One million unique queries (166 million tokens)
• Target categories
  • Manually classified the 10,000 most frequent search words (in the log of 2006/12), hereafter referred to as the 10K list
  • Travel: the largest category (712 words)
  • Finance: the smallest category (240 words)

Results (Travel and Finance)
• High precision (92.1%)
• Learned 251 novel words
• Due to the ambiguity of hand labeling (e.g. Tokyo Disney Land)
• Include common nouns related to Travel (e.g. Rental car)

Sample of Instances (Travel category)
• Able to learn several sub-categories for which no seed words were given

System Performance
• Travel: high precision and recall
• Finance: high precision but low relative recall due to strict filtering
• Relative recall is computed as in Pantel et al. (2004)
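Relative recall compares two systems without knowing the true number of correct instances in the corpus. As defined by Pantel et al. (2004), the recall of system A relative to system B is the ratio of their estimated correct counts, (P_A × |A|) / (P_B × |B|), where P is precision and |·| is the number of extracted instances. The helper name and the numbers below are hypothetical and are not results from this work.

```python
def relative_recall(prec_a, size_a, prec_b, size_b):
    """Recall of system A relative to system B.

    Each system's number of correct instances is estimated as
    precision * number of extracted instances; the unknown total
    number of correct instances cancels out of the ratio.
    """
    return (prec_a * size_a) / (prec_b * size_b)


# Hypothetical figures: A extracts 200 instances at 50% precision,
# B extracts 100 instances at 25% precision.
ratio = relative_recall(0.5, 200, 0.25, 100)
print(ratio)
```

A value above 1 means A found more correct instances than B; strict filtering lowers the number of extracted instances and therefore pulls this ratio down even when precision stays high.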

Cumulative precision: Travel
• Tchai achieved the best precision

Sample Extracted Patterns
• Basilisk and Espresso extracted location names as context patterns, which may be too generic for the Travel domain
• Tchai found context patterns that are characteristic of the domain

Conclusion and future work
• Conclusion
  • Use of query logs for semantic category learning
  • Improved the Espresso algorithm in both precision and performance
• Future work
  • Generalize the bootstrapping method by graph-based matrix calculation

Tchai Thank you for listening!
