Introduction to Oluolu, a Query Log Mining Tool on Hadoop. Takahiko Ito
Preliminaries: DidYouMean • Recent search engines such as Google and Yahoo support a ‘DidYouMean’ feature. • A search engine with DidYouMean works as follows: • The user submits a query to the search engine. • When the submitted query contains a spelling mistake, the search engine suggests the correct query to the user.
Implementation of DidYouMean • In many cases, the DidYouMean feature is implemented with fuzzy matching such as edit distance [Levenshtein, 1966]. • Unfortunately, fuzzy matching algorithms do NOT work for Japanese queries. • The reason is that a query with a spelling mistake can be spelled completely differently from the query the user meant.
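To make the limitation concrete, here is a minimal Python sketch of edit distance, applied to examples from these slides. The function name `edit_distance` is chosen for illustration only; this is the textbook dynamic-programming formulation, not code taken from Oluolu.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("Pthon", "Python"))            # 1
print(edit_distance("ひらたかパーク", "ひらかたパーク"))  # 2
print(edit_distance("墨ともふどうさん", "住友不動産"))    # 8
```

A simple spelling mistake stays within a small distance, so a threshold-based fuzzy matcher can catch it. The Kana-Kanji conversion mistake shares no characters with the intended query, so its distance equals the length of the longer string and no reasonable threshold would link the two.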
Mistakes of Japanese Queries • Mistakes in queries can be grouped as follows: • 1. Simple spelling mistake: ひらたかパーク (correct: ひらかたパーク) • 2. Kana-Kanji conversion mistake: 墨ともふどうさん (correct: 住友不動産), 歌だ光る (correct: 宇多田ヒカル), 米事案セット (correct: ベイジアンセット) • 3. Mixture of cases 1 and 2
Oluolu approach • Oluolu creates a dictionary from query log data. • It does NOT use the spelling similarity between queries. • It can therefore extract pairs of queries even if their spellings are quite different. • It extracts pairs of queries (query with a spelling mistake, correctly spelled query) from user sessions.
Query log data • Query log data have three components: • User ID (or IP address) • Time the query was submitted • Query string • A session is a set of queries submitted by the same user within a short time span.
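The session definition above can be sketched in Python as follows. The names `LogRecord` and `split_sessions`, the use of seconds as the time unit, and the 300-second gap threshold are all assumptions made for illustration; the slides only say "short time span" and do not describe Oluolu's actual record format or threshold.

```python
from collections import defaultdict
from typing import NamedTuple


class LogRecord(NamedTuple):
    user_id: str    # user ID or IP address
    timestamp: int  # submission time in seconds (assumed unit)
    query: str      # query string


def split_sessions(records, max_gap_seconds=300):
    """Group records by user, then start a new session whenever the gap
    between consecutive queries exceeds max_gap_seconds (hypothetical)."""
    by_user = defaultdict(list)
    for record in records:
        by_user[record.user_id].append(record)

    sessions = []
    for user_records in by_user.values():
        user_records.sort(key=lambda r: r.timestamp)
        current = [user_records[0]]
        for record in user_records[1:]:
            if record.timestamp - current[-1].timestamp > max_gap_seconds:
                sessions.append([r.query for r in current])
                current = [record]
            else:
                current.append(record)
        sessions.append([r.query for r in current])
    return sessions


log = [
    LogRecord("u1", 0, "Pthon"),
    LogRecord("u1", 20, "Python"),
    LogRecord("u1", 4000, "hadoop"),
    LogRecord("u2", 10, "Python"),
]
print(split_sessions(log))
# [['Pthon', 'Python'], ['hadoop'], ['Python']]
```

The first two queries from user `u1` fall into one session because they are 20 seconds apart, while the third arrives much later and opens a new session.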
Extract query pairs • Oluolu extracts pairs of queries that appear in the same session and validates them using their frequency rate. • E.g., the pair (Pthon, Python) is extracted.
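One possible reading of this step is sketched below in Python. The slides do not define the frequency rate, so the rule used here is an assumption: a consecutive pair (first, second) from a session is kept only when the second query occurs at least `min_ratio` times as often in the log as the first. The function name `extract_pairs` and the default threshold are hypothetical, not Oluolu's actual method.

```python
from collections import Counter


def extract_pairs(sessions, min_ratio=2.0):
    """Return (suspected mistake, suspected correction) pairs.

    Assumed validation rule: keep a consecutive in-session pair only if
    the second query is at least min_ratio times as frequent as the first.
    """
    query_freq = Counter(q for session in sessions for q in session)

    # Candidate pairs: consecutive, distinct queries within one session.
    candidates = Counter()
    for session in sessions:
        for first, second in zip(session, session[1:]):
            if first != second:
                candidates[(first, second)] += 1

    return [
        (wrong, right)
        for (wrong, right) in candidates
        if query_freq[right] / query_freq[wrong] >= min_ratio
    ]


sessions = [
    ["Pthon", "Python"],
    ["Python", "hadoop"],
    ["Python"],
    ["Pthon", "Python"],
]
print(extract_pairs(sessions))
# [('Pthon', 'Python')]
```

Here "Python" appears four times and "Pthon" twice, so the pair passes the assumed ratio test, while (Python, hadoop) is rejected because "hadoop" is the rarer query. No spelling comparison is involved, which is why the same procedure would also apply to pairs whose spellings are completely different.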
Scalability: Oluolu • The amount of query log data is HUGE! • Oluolu works on the Hadoop distributed environment!
References • V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl., 6:707-710, 1966.