Named Entity Mining From Click-Through Data Using Weakly Supervised LDA
Gu Xu (1), Shuang-Hong Yang (1,2), Hang Li (1)
(1) Microsoft Research Asia, China
(2) College of Computing, Georgia Tech, USA
Talk Outline
• Named Entity Mining
  • Exploiting click-through data
  • Applying Latent Dirichlet Allocation
  • Developing a weakly supervised learning approach
• Weakly Supervised LDA
• Experimental Results
• Summary
Named Entity Mining
• Named Entity Mining (NEM)
  • Mine information about named entities of a given class from a large amount of data.
  • Example: mine movie titles from a textual data collection
  • Applications: Web search, etc.
• Three challenges, and how each is addressed
  • Suitable data source for NEM: click-through data
  • Ambiguity in classes of named entities: LDA (topic model)
  • Supervision from human knowledge: weakly supervised learning
Click-through Data
• Query context
  • [movie] trailer, [game] cheats
• Click context
  • imdb.com for movies, gamespot.com for games
• Wisdom of crowds
  • Very large-scale data that keeps growing
  • Frequently updated with emerging named entities
• New data source for NEM
  • Over 70% of queries contain named entities.
  • Rich context for determining the classes of entities.
Latent Dirichlet Allocation
• Deal with ambiguity in classes of named entities
  • Classes of named entities are ambiguous.
  • Harry Potter: book, movie, and game
• Topic models (LDA): classes of named entities as topics
[Figure: click-through records for "Harry Potter" (query, clicked website)]
  harry potter trailer, imdb.com
  harry potter dvd, movies.yahoo.com
  harry potter cheats, cheats.ign.com
  harry potter game, gamespot.com
[Figure: classes of named entities as topics]
  Movie
    Query context: # trailer, # dvd, # movie
    Click context: imdb.com, movies.yahoo.com, disney.go.com
  Game
    Query context: # cheats, # walkthrough, # game
    Click context: gamespot.com, cheats.ign.com, gamefaqs.com
Weakly Supervised Learning
• Supervise LDA training with examples
  • LDA is an unsupervised model.
  • Topics in LDA are latent and do not align with predefined semantic classes such as book, movie, and game.
• Human labels are inaccurate and partial.
  • Binary indicators rather than proportions
  • Labels only indicate that a named entity belongs to certain classes; they do not exclude the possibility that it also belongs to other classes.
• Weakly supervised LDA
  • Supervise LDA training with partial labels
Weakly Supervised LDA
• Overview
[Figure: pipeline from seeds to newly discovered entities]
  1. Seeds: … Harry Potter …
  2. Click-through data for each seed:
     harry potter book, http://www.amazon.com
     harry potter cheats, http://cheats.ign.com
     harry potter trailer, http://www.imdb.com
     …
  3. Create a virtual document for each seed and train WS-LDA. Virtual document:
     # book, http://www.amazon.com
     # cheats, http://cheats.ign.com
     # trailer, http://www.imdb.com
     …
  4. Training yields contexts and websites for each class.
  5. Find new named entities, as well as their classes, by using the obtained query contexts and clicked websites.
Weakly Supervised LDA (cont.)
• LDA with two types of virtual words
  • w1: Query context
  • w2: Click context
[Figure: one virtual document split into its two word types]
  w1: # book, # cheats, # trailer, …
  w2: http://www.amazon.com, http://cheats.ign.com, http://www.imdb.com, …
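The construction of a virtual document with its two word types can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `query_context` and `build_virtual_documents`, the dictionary layout with keys `w1` and `w2`, and the `http://www.example.com` record are all hypothetical. It assumes a query contributes a context only when it contains the seed entity as a whole-token match plus at least one other word.

```python
PLACEHOLDER = "#"  # stands in for the entity, as in "# trailer"


def query_context(query, entity):
    """Replace the entity inside the query with '#'.

    Returns None when the entity does not occur as a whole-token match,
    or when the query is the bare entity (no context left).
    """
    tokens = query.lower().split()
    ent = entity.lower().split()
    n = len(ent)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == ent:
            context = tokens[:i] + [PLACEHOLDER] + tokens[i + n:]
            if len(context) == 1:
                return None  # nothing but the entity itself
            return " ".join(context)
    return None


def build_virtual_documents(seeds, click_log):
    """Collect, per seed entity, its query contexts (w1) and clicked URLs (w2).

    click_log is an iterable of (query, clicked_url) pairs.
    """
    docs = {seed: {"w1": [], "w2": []} for seed in seeds}
    for query, url in click_log:
        for seed in seeds:
            context = query_context(query, seed)
            if context is not None:
                docs[seed]["w1"].append(context)  # query-context word
                docs[seed]["w2"].append(url)      # click-context word
    return docs


# Toy click-through records; the last one is a made-up non-matching query.
click_log = [
    ("harry potter book", "http://www.amazon.com"),
    ("harry potter cheats", "http://cheats.ign.com"),
    ("harry potter trailer", "http://www.imdb.com"),
    ("weather today", "http://www.example.com"),
]
docs = build_virtual_documents(["harry potter"], click_log)
for word_type, words in docs["harry potter"].items():
    print(word_type, words)
```

Keeping w1 and w2 in separate lists reflects the slide's distinction between the two kinds of virtual words; a real pipeline over billions of query-URL pairs would need an index over seeds rather than the nested loop used here.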
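Weakly Supervised LDA (cont.)
• Introduce weak supervision
  • Objective: LDA log likelihood + soft constraints
• Soft constraints
  • Relate the document's probability on the i-th class to the document's binary label on the i-th class
[Equation omitted in this transcript; its annotated parts were: LDA probability, soft constraints, document probability on the i-th class, document binary label on the i-th class]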
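The slide's formula did not survive in this transcript, so the following is only one plausible form consistent with its labels, not necessarily the authors' exact equation. All symbols are assumptions introduced here: \(\mathbf{w}\) is a virtual document, \(\Theta\) the model parameters, \(y_i \in \{0, 1\}\) the document's binary label on the i-th class, \(\bar{z}_i\) the document's probability (average topic assignment) on the i-th class, \(K\) the number of classes, and \(\lambda\) a weight balancing the two terms.

```latex
O(\mathbf{w} \mid \mathbf{y}, \Theta)
  = \underbrace{\log p(\mathbf{w} \mid \Theta)}_{\text{LDA log likelihood}}
  + \lambda \, \underbrace{C(\mathbf{y}, \Theta)}_{\text{soft constraints}},
\qquad
C(\mathbf{y}, \Theta) = \sum_{i=1}^{K} y_i \, \bar{z}_i
```

Under this form, a label \(y_i = 1\) rewards probability mass on class i, while \(y_i = 0\) contributes nothing rather than a penalty, which matches the earlier point that labels do not exclude membership in other classes.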
Experimental Results
• Dataset
  • Seed named entities
    • About 1,000 seeds for each class, and 3,767 unique named entities in total
  • Click-through data
    • 1.5 billion query-URL pairs, containing 240 million unique queries and 17 million unique URLs
Experimental Results (cont.)
• Top contexts and websites
[Tables: top contexts and top websites for each of the four classes: movie, game, book, and music]
Experimental Results (cont.)
• Accuracy of mined entities
Summary
• Proposed click-through data as a new data source for NEM
• Employed a topic model to deal with ambiguity in classes of named entities
• Devised weakly supervised LDA for modeling click-through data
  • Two types of virtual words
  • Introduced weakly supervised learning into LDA
• Experiments on large-scale data verified the effectiveness of the proposed approach