Minimally Supervised Learning of Semantic Knowledge from Query Logs
Mamoru Komachi (†) and Hisami Suzuki (‡)
(†) Nara Institute of Science and Technology, Japan
(‡) Microsoft Research, USA
IJCNLP-08, Hyderabad, India
Task
• Learn semantic categories from web search query logs by bootstrapping with minimal supervision
• Semantic category: a set of interrelated words
  • Named entities, technical terms, paraphrases, …
  • Can be useful for search ads, etc.
• Example: Darjeeling, Kombucha (Japanese tea), and Chai (Indian tea) are similar to one another
Our Contribution
• First to use Japanese query logs for the task of learning named entities
• Propose an efficient method suited to query logs, based on the general-purpose Espresso algorithm (Pantel and Pennacchiotti, 2006)
Table of Contents
• Related work
  • Bootstrapping techniques for relation extraction
  • Scoring metrics
• The Tchai algorithm
  • Problems of Espresso
  • Extension to Espresso
• Experiment
  • System performance and comparison to other algorithms
  • Samples of extracted instances and patterns
Bootstrapping
• Iteratively conduct pattern induction and instance extraction, starting from seed instances
• Can grow a small set of seed instances into a larger set
• Example (#: slot): the seed instance "vaio" matches the query "Compare vaio laptop" in the query log (corpus), which induces the contextual pattern "Compare # laptop"; that pattern then extracts the instances "Toshiba satellite" (from "Compare toshiba satellite laptop") and "HP xb3000" (from "Compare HP xb3000 laptop")
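The loop on this slide can be sketched in a few lines of Python. This is a minimal illustration on the slide's three toy queries, not the authors' implementation: the function names (`induce_patterns`, `extract_instances`, `bootstrap`) are hypothetical, matching is plain substring replacement, and no scoring or filtering is applied.

```python
from collections import Counter

SLOT = "#"  # slot marker, as on the slide

def induce_patterns(queries, instances):
    """Replace each known instance found in a query with the slot marker."""
    patterns = Counter()
    for query in queries:
        lowered = query.lower()
        for inst in instances:
            pos = lowered.find(inst.lower())
            if pos != -1:
                pattern = query[:pos] + SLOT + query[pos + len(inst):]
                if pattern != SLOT:  # a bare slot carries no context
                    patterns[pattern] += 1
    return patterns

def extract_instances(queries, patterns):
    """Fill the slot of each pattern with whatever the query has in its place."""
    found = Counter()
    for query in queries:
        for pattern in patterns:
            prefix, suffix = pattern.split(SLOT, 1)
            long_enough = len(query) > len(prefix) + len(suffix)
            if long_enough and query.startswith(prefix) and query.endswith(suffix):
                found[query[len(prefix):len(query) - len(suffix)]] += 1
    return found

def bootstrap(queries, seeds, iterations=1):
    """Alternate pattern induction and instance extraction (unscored sketch)."""
    instances = set(seeds)
    for _ in range(iterations):
        patterns = induce_patterns(queries, instances)
        instances |= set(extract_instances(queries, patterns))
    return instances

query_log = [
    "Compare vaio laptop",
    "Compare toshiba satellite laptop",
    "Compare HP xb3000 laptop",
]
learned = bootstrap(query_log, {"vaio"})
```

With the seed "vaio", one iteration induces "Compare # laptop" and picks up the other two product names. A real system would score and filter patterns between the two steps, since an unscored loop drifts quickly.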
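Instance lookup and pattern induction
• Use every string in the query other than the instance as the pattern, so no word segmentation is required
• Example: the instance "ANA" in the query "ANA 予約" (ANA reservation) yields the extracted pattern "# 予約" (# reservation)
• Broad coverage, but noisy patterns
• Generic patterns: "# 予約" could mean a restaurant reservation or a flight reservation
• Resulting problems
  • Semantic drift
  • Computational efficiency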
Instance/Pattern Scoring Metrics
• Sekine & Suzuki (2007)
  • Starts from a large named entity dictionary
  • Assigns low scores to generic patterns and ignores them
• Basilisk (Thelen and Riloff, 2002)
  • Balances the recall and precision of generic patterns
• Espresso (Pantel and Pennacchiotti, 2006)
  • The reliability of an instance and the reliability of a pattern are mutually defined
  • PMI is normalized by the maximum over all P and I
  • P: patterns in corpus; I: instances in corpus; PMI: pointwise mutual information; r: reliability score
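The Espresso formulas did not survive in this transcript. As a reference point, the reliability scores as commonly stated from Pantel and Pennacchiotti (2006) are reproduced below in the slide's symbols; the subscripts π (pattern) and ι (instance) and the name max_pmi follow that paper's notation rather than anything visible here.

```latex
r_\pi(p) = \frac{\displaystyle\sum_{i \in I} \frac{\mathrm{pmi}(i,p)}{\mathit{max}_{\mathrm{pmi}}}\, r_\iota(i)}{|I|}
\qquad
r_\iota(i) = \frac{\displaystyle\sum_{p \in P} \frac{\mathrm{pmi}(i,p)}{\mathit{max}_{\mathrm{pmi}}}\, r_\pi(p)}{|P|}
```

Each score is defined in terms of the other, which is why the computation is iterated from the seed instances. In Espresso, max_pmi is a single value: the maximum PMI over all instance and pattern pairs.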
The Tchai Algorithm
• Filter generic patterns/instances
  • Do not select generic patterns and instances
• Replace the scaling factor in reliability scores
  • Take the maximum PMI for a given instance/pattern rather than the maximum over all instances and patterns
  • This modification has a large impact on the effectiveness of our algorithm
• Only induce patterns at the beginning
• Tchai runs 400 times faster than Espresso
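The effect of swapping the scaling factor can be seen on a toy example. The sketch below is one reading of the slide, not the authors' code: the function `pattern_reliability`, the PMI values, and the instance reliabilities are all made up for illustration, and all PMI values are assumed positive so that dividing by the maximum is well defined.

```python
def pattern_reliability(pmi, inst_rel, local_max):
    """Score each pattern by PMI-weighted instance reliability.

    pmi:       {(instance, pattern): PMI value}, assumed positive here
    inst_rel:  {instance: reliability score}
    local_max: False -> scale by the maximum PMI over all pairs (Espresso-style)
               True  -> scale by the maximum PMI for the given pattern
                        (the replacement described on the slide)
    """
    instances = {i for i, _ in pmi}
    global_max = max(pmi.values())
    scores = {}
    for p in {q for _, q in pmi}:
        row = {i: v for (i, q), v in pmi.items() if q == p}
        scale = max(row.values()) if local_max else global_max
        scores[p] = sum(v / scale * inst_rel[i] for i, v in row.items()) / len(instances)
    return scores

# Hypothetical PMI values: "# mileage" has much larger raw PMI than "# reservation".
toy_pmi = {
    ("jal", "# reservation"): 2.0,
    ("ana", "# reservation"): 1.0,
    ("ana", "# mileage"): 8.0,
    ("jal", "# mileage"): 4.0,
}
toy_rel = {"ana": 1.0, "jal": 1.0}

global_scaled = pattern_reliability(toy_pmi, toy_rel, local_max=False)
local_scaled = pattern_reliability(toy_pmi, toy_rel, local_max=True)
```

With the global maximum, "# reservation" scores 0.1875 against 0.75 for "# mileage", purely because its raw PMI values are smaller. With the per-pattern maximum, both score 0.75, so a single pair with a very large PMI no longer suppresses every other score.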
Experiments
• Japanese query logs from 2007/01-02
  • One million unique queries (166 million tokens)
• Target categories
  • Manually classified the 10,000 most frequent search words (in the log of 2006/12), hereafter referred to as the 10K list
  • Travel: the largest category (712 words)
  • Finance: the smallest category (240 words)
Results (Travel, Finance)
• High precision (92.1%)
• Learned 251 novel words
• Due to the ambiguity of hand labeling (e.g. Tokyo Disney Land)
• Include common nouns related to Travel (e.g. rental car)
Sample of Instances (Travel category)
• Able to learn several sub-categories for which no seed words were given
System Performance
• Travel: high precision and recall
• Finance: high precision but low relative recall due to strict filtering
• Recall is reported as relative recall (Pantel et al., 2004)
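Relative recall compares two systems without knowing the true number of correct instances: since that unknown total cancels out, the recall of system A relative to system B reduces to (P_A × |A|) / (P_B × |B|), where P is precision and |·| is the number of extracted instances. A small helper makes the arithmetic concrete; the function name and the numbers below are hypothetical and are not results from this work.

```python
def relative_recall(precision_a, size_a, precision_b, size_b):
    """Recall of system A relative to system B.

    Each system's count of correct instances is estimated as
    precision * number of extracted instances; the unknown total number
    of true instances cancels in the ratio.
    """
    correct_a = precision_a * size_a
    correct_b = precision_b * size_b
    return correct_a / correct_b

# Hypothetical systems: A is precise but extracts few instances, B the opposite.
ratio = relative_recall(precision_a=0.75, size_a=200, precision_b=0.5, size_b=600)
```

Here A has about 150 correct instances against 300 for B, giving a relative recall of 0.5. This shows how strict filtering can keep precision high while lowering relative recall.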
Cumulative precision: Travel Tchai achieved the best precision
Sample Extracted Patterns
• Basilisk and Espresso extracted location names as context patterns, which may be too generic for the Travel domain
• Tchai found context patterns that are characteristic of the domain
Conclusion and future work
• Conclusion
  • Used query logs for semantic category learning
  • Improved the Espresso algorithm in both precision and performance
• Future work
  • Generalize the bootstrapping method by graph-based matrix calculation
Tchai Thank you for listening!