Introduction to Oluolu, a Query Log Mining Tool on Hadoop

Takahiko Ito


  1. Introduction to Oluolu, a Query Log Mining Tool on Hadoop. Takahiko Ito

  2. Preliminaries: DidYouMean
     • Recent search engines such as Google and Yahoo support a ‘DidYouMean’ feature.
     • Search engines with DidYouMean work as follows:
       • The user submits a query to the search engine.
       • When the submitted query contains a spelling mistake, the search engine suggests the correct query to the user.

  3. Implementation of DidYouMean
     • In many cases, the DidYouMean feature is implemented with fuzzy matching such as edit distance [Levenshtein, 1966].
     • Unfortunately, fuzzy matching algorithms do NOT work for Japanese queries,
     • because the spelling of a mistaken query can be completely different from the spelling of the query the user intended.
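To make the limitation concrete, here is a minimal sketch of the classic edit distance (insertions, deletions, and substitutions), applied to three of the example pairs from the next slide. The function name `levenshtein` is only illustrative, and nothing here is taken from Oluolu itself.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


pairs = [
    ("ひらたかパーク", "ひらかたパーク"),    # simple spelling mistake
    ("墨ともふどうさん", "住友不動産"),      # Kana-Kanji conversion mistake
    ("歌だ光る", "宇多田ヒカル"),            # Kana-Kanji conversion mistake
]
for wrong, right in pairs:
    print(wrong, right, levenshtein(wrong, right))
```

The simple spelling mistake is only two edits away from the correct query. The two Kana-Kanji conversion mistakes share no characters with their corrections, so the distance equals the length of the longer string and fuzzy matching has nothing to latch onto.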

  4. Mistakes in Japanese Queries
     Mistakes in queries can be grouped as follows:
     • Simple spelling mistake
       ひらたかパーク (correct: ひらかたパーク, "Hirakata Park")
     • Kana-Kanji conversion mistake
       墨ともふどうさん (correct: 住友不動産, "Sumitomo Fudosan")
       歌だ光る (correct: 宇多田ヒカル, "Hikaru Utada")
       米事案セット (correct: ベイジアンセット, "Bayesian Sets")
     • A mixture of the two cases above

  5. Oluolu approach
     Oluolu creates a dictionary from query log data.
     • It does NOT use the spelling similarity between queries.
     • It can therefore extract pairs of queries even when their spellings are quite different.
     • It extracts pairs of queries (misspelled query, correctly spelled query) from user sessions.

  6. Query log data
     Query log data has three components:
     • User ID (or IP address)
     • Time the query was submitted
     • Query string
     A session is a set of queries submitted by the same user within a short time span.
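The session definition above can be sketched as follows. This is only an illustration: the `LogRecord` type, the `split_sessions` helper, and the 300-second gap are assumptions made for the example, since the slides do not say how Oluolu represents records or how short the "short time span" is.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class LogRecord:
    user_id: str      # user ID or IP address
    timestamp: float  # submission time, in seconds
    query: str        # query string


SESSION_GAP_SECONDS = 300  # hypothetical threshold, not from the slides


def split_sessions(records, max_gap=SESSION_GAP_SECONDS):
    """Group queries by user, then start a new session whenever two
    consecutive queries from that user are more than max_gap apart."""
    sessions = []
    ordered = sorted(records, key=lambda r: (r.user_id, r.timestamp))
    for _, user_records in groupby(ordered, key=lambda r: r.user_id):
        current = []
        last_time = None
        for rec in user_records:
            if last_time is not None and rec.timestamp - last_time > max_gap:
                sessions.append(current)
                current = []
            current.append(rec.query)
            last_time = rec.timestamp
        sessions.append(current)
    return sessions


log = [
    LogRecord("u1", 0, "Pthon"),
    LogRecord("u1", 20, "Python"),
    LogRecord("u2", 30, "hadoop"),
    LogRecord("u1", 4000, "ひらかたパーク"),
]
sessions = split_sessions(log)
print(sessions)
```

Here the first two queries from `u1` fall into one session, while the much later third query starts a new one.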

  7. Extract query pairs
     • Oluolu extracts pairs of queries that appear in the same session and validates them by their frequency rate.
     • E.g. the pair (Pthon, Python) is extracted.
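The slides do not spell out the validation rule, so the following is a sketch under stated assumptions rather than Oluolu's actual method. It assumes a candidate pair is two consecutive queries in a session, and accepts a pair when it occurs often enough, when most sessions containing the first query go on to the second, and when the second query is the more common of the two. The function `extract_pairs` and the parameters `min_pair_count` and `min_rate` are hypothetical.

```python
from collections import Counter


def extract_pairs(sessions, min_pair_count=2, min_rate=0.5):
    """Return (suspected mistake, suspected correction) pairs.

    Assumed rule (not from the slides): keep a pair when it is seen at
    least min_pair_count times, follows the first query in at least
    min_rate of the sessions containing it, and the second query
    appears in more sessions than the first.
    """
    query_count = Counter()  # number of sessions containing each query
    pair_count = Counter()   # number of times `first` is followed by `second`
    for session in sessions:
        query_count.update(set(session))
        for first, second in zip(session, session[1:]):
            if first != second:
                pair_count[(first, second)] += 1

    accepted = []
    for (first, second), n in pair_count.items():
        rate = n / query_count[first]
        if (n >= min_pair_count
                and rate >= min_rate
                and query_count[second] > query_count[first]):
            accepted.append((first, second))
    return accepted


example_sessions = [
    ["Pthon", "Python"],
    ["Pthon", "Python"],
    ["Python", "hadoop"],
    ["Python"],
    ["hadoop", "Python"],
]
print(extract_pairs(example_sessions))
```

With this toy data only (Pthon, Python) survives: the other consecutive pairs each occur once, which is below the assumed minimum count.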

  8. Scalability: Oluolu
     The amount of query log data is HUGE!
     • Oluolu runs on the Hadoop distributed environment!
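Counting query pairs fits the MapReduce model that Hadoop provides. The sketch below imitates that model in plain Python on one machine: a mapper emits a count of 1 per consecutive query pair, a local grouping step stands in for Hadoop's shuffle, and a reducer sums the counts. The function names and the job layout are assumptions for illustration and do not describe how Oluolu's Hadoop jobs are actually structured.

```python
from collections import defaultdict


def map_session(session):
    """Mapper: emit ((earlier query, later query), 1) for each
    consecutive pair of distinct queries in one session."""
    for first, second in zip(session, session[1:]):
        if first != second:
            yield (first, second), 1


def reduce_counts(key, values):
    """Reducer: sum all the counts emitted for one query pair."""
    return key, sum(values)


def run_local(sessions):
    """Local stand-in for the shuffle phase: group mapper output by key,
    then apply the reducer to each group."""
    grouped = defaultdict(list)
    for session in sessions:
        for key, value in map_session(session):
            grouped[key].append(value)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


counts = run_local([
    ["Pthon", "Python"],
    ["Pthon", "Python"],
    ["hadoop", "Python"],
])
print(counts)
```

Because each session is mapped independently and each pair is reduced independently, this kind of counting can be spread across many machines as the log grows.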

  9. References
     • V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals. Sov. Phys. Dokl., 6:707-710, 1966.
