Web Search Clustering and Labeling with Hidden Topics. Presenter: Chien-Hsing Chen. Authors: Cam-Tu Nguyen, Xuan-Hieu Phan, Susumu Horiguchi, Thu-Trang Nguyen, Quang-Thuy Ha. 2009.TALIP.40.
Outline • Motivation • Objective • Method • Experiments • Conclusion • Comment
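Motivation
• d1:
  • ezPeer+ music downloads, music previews, lyrics, MP3, music site - Jolin Tsai (蔡依林) - albums over the years
  • ezPeer+ - Jolin Tsai - J1 Live Concert full audio and video record, J-game, Magic (看我72變), Castle (城堡), J9 Party selection, JolinJ-Top champion selection, Dancing Diva (舞孃), Jolin Tsai Dancing Forever (唯舞獨尊) concert live edition & remix album & 花... web.ezpeer.com/singer/s120.html - Cached - Similar
• d2:
  • ezPeer+ music downloads, music pre...
  • Butterfly (花蝴蝶) sounds good…
  • web.ezpeer.com/singer/s120.html - Cached - Similar
• Snippets are usually noisier, less topic-focused, and much shorter than ordinary documents
  • 花?? (the title in d1 is cut off after "花")
• Similarity evaluation between snippets may therefore fail
  • d3: {He is an author}
  • d4: {The writer is standing behind you}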
Objective • The similarity evaluator refers to a set of hidden topics rather than to surface words alone • di: {He is an author} • dj: {The writer is standing behind you} • (a document may be related to multiple topics)
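The di/dj example can be sketched in a few lines. This is only an illustration under stated assumptions: the stopword list and the topic mixtures (`topics_i`, `topics_j`, and the topic names) are made up for the example and are not taken from the paper.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"he", "is", "an", "the", "behind", "you"}

def bag_of_words(text):
    """Count the non-stopword tokens of a snippet."""
    return Counter(w for w in text.lower().split() if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[k] * b.get(k, 0.0) for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

d_i = "He is an author"
d_j = "The writer is standing behind you"

# Surface words: "author" and "writer" never match, so the similarity is zero.
word_sim = cosine(bag_of_words(d_i), bag_of_words(d_j))

# Hypothetical hidden-topic mixtures for the two snippets.
topics_i = {"literature": 0.9, "other": 0.1}
topics_j = {"literature": 0.6, "other": 0.4}
topic_sim = cosine(topics_i, topics_j)

print(word_sim, round(topic_sim, 3))
```

With these made-up mixtures the two snippets share no content word, yet they are close in topic space, which is the effect the objective relies on.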
Framework • [Figure: snippets di and dj mapped to hidden topics; labels in the figure: music, movie, radio, player] • di → topic 10 • dj → topic 10 • (label candidate generation)
LDA
• [Figure: example topics labeled cul., hel., politics, edu., entertainment]
• In the training step, a keyword is related to a topic (e.g., show business) when it often occurs in the documents on that topic
• Indices: k = topic, m = document, n = word
• z_{m,n} refers to topic k (z1, z2, z3)
• w_{m,n} refers to the vocabulary (w1, w2, w3)
• K = 60 topics; k = 10 is "show business"
• The word "music" in topic 10 can explain the occurrence of the words in documents m = 1, 2, 3
LDA (build-up of the previous slide)
• [Figure: document d_m, topic assignment z_{m,n} (z1), word w_{m,n} (w1); indices k = topic, m = document, n = word; k = topic 10, K = 60]
• p(.|.) = ?
• p(.|.) = 1/60
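The slides do not say how the topic assignments z_{m,n} are estimated. A common choice for LDA is collapsed Gibbs sampling, sketched below as an assumption rather than as the authors' implementation. Reading p(.|.) = 1/60 as a uniform starting assignment over K = 60 topics is likewise an assumption; the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy corpus are all hypothetical.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer term ids."""
    rng = random.Random(seed)
    n_mk = [[0] * num_topics for _ in docs]               # document-topic counts
    n_kt = [[0] * vocab_size for _ in range(num_topics)]  # topic-term counts
    n_k = [0] * num_topics                                # tokens per topic
    z = []
    for m, doc in enumerate(docs):
        z_m = []
        for t in doc:
            k = rng.randrange(num_topics)  # uniform start: each topic has probability 1/K
            z_m.append(k)
            n_mk[m][k] += 1
            n_kt[k][t] += 1
            n_k[k] += 1
        z.append(z_m)
    for _ in range(iters):
        for m, doc in enumerate(docs):
            for n, t in enumerate(doc):
                k = z[m][n]
                # Remove the current assignment of word n in document m.
                n_mk[m][k] -= 1
                n_kt[k][t] -= 1
                n_k[k] -= 1
                # Unnormalized p(z_{m,n} = j | all other assignments, words).
                weights = [
                    (n_mk[m][j] + alpha) * (n_kt[j][t] + beta) / (n_k[j] + vocab_size * beta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[m][n] = k
                n_mk[m][k] += 1
                n_kt[k][t] += 1
                n_k[k] += 1
    # Per-document topic mixture estimated from the final counts.
    theta = [
        [(n_mk[m][k] + alpha) / (len(doc) + num_topics * alpha) for k in range(num_topics)]
        for m, doc in enumerate(docs)
    ]
    return theta, n_kt

# Toy corpus: three documents over a five-term vocabulary, two topics.
docs = [[0, 1, 0, 2], [3, 4, 3, 4], [0, 2, 1, 1]]
theta, n_kt = lda_gibbs(docs, num_topics=2, vocab_size=5)
```

Each row of `theta` is a topic distribution for one document; the slides use K = 60, while the toy run uses two topics only to stay readable.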
Similarity between di and dj • t: the t-th term in the vocabulary V • k: the k-th topic
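The formula on this slide did not survive extraction, so the following is only a sketch of one plausible form: each snippet has a topic vector (indexed by k) and a term vector (indexed by t), and the two cosine similarities are mixed with a weight. The use of cosine similarity, the function `snippet_similarity`, and the mixing weight `lam` are assumptions, not the slide's definition.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def snippet_similarity(topics_i, terms_i, topics_j, terms_j, lam=0.5):
    """Mix topic-part and term-part similarity; lam is a hypothetical weight."""
    return lam * cosine(topics_i, topics_j) + (1 - lam) * cosine(terms_i, terms_j)

# Toy vectors: K = 3 topics and |V| = 4 terms (values are made up).
topics_i, terms_i = [0.8, 0.1, 0.1], [1, 0, 0, 0]
topics_j, terms_j = [0.7, 0.2, 0.1], [0, 1, 0, 0]

sim = snippet_similarity(topics_i, terms_i, topics_j, terms_j)
```

Here the term vectors are orthogonal, so the whole score comes from the topic part; with `lam=0` the two snippets would look unrelated.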
Framework • similarity matrix between snippets
Label Candidate Generation • [Figure; labels: D, Topic k]
Label assignment for clustering snippets • [Figure; labels: D, Topic k, di, dj]
Framework • [Figure: snippets di and dj mapped to hidden topics; labels in the figure: music, movie, radio, player] • di → topic 10 • dj → topic 4, topic 10 • (label candidate generation)
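The mapping from a snippet to one or several topics can be illustrated with a simple threshold on the snippet's topic distribution. The threshold value, the probabilities, and the helper `dominant_topics` are hypothetical; the slides do not state the actual selection rule.

```python
def dominant_topics(topic_probs, threshold=0.3):
    """Return the topic ids whose probability reaches the threshold, most probable first."""
    picked = [(p, k) for k, p in topic_probs.items() if p >= threshold]
    return [k for p, k in sorted(picked, reverse=True)]

# Made-up topic distributions keyed by topic id.
d_i = {10: 0.85, 4: 0.05, 7: 0.10}
d_j = {10: 0.50, 4: 0.40, 7: 0.10}

topics_of_di = dominant_topics(d_i)  # [10]
topics_of_dj = dominant_topics(d_j)  # [10, 4]
```

Under these numbers di maps to topic 10 only, while dj maps to both topic 10 and topic 4, matching the multi-topic case on the slide.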
Experiment • Wikipedia dataset • Vnexpress dataset
Experimental dataset • The Web dataset consists of 2,357 snippets in 9 categories • 20 queries were sent to Google, obtaining about 150 distinct snippets
Experiments • F-measure
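For reference, the F-measure combines precision and recall. The sketch below uses the standard definition; how true and false positives were counted for clusters in these experiments is not stated on the slide, so the counts in the example are purely illustrative.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """Standard F-measure; beta = 1 gives the harmonic mean of precision and recall."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Illustrative counts: 8 correct assignments, 2 wrong, 2 missed.
score = f_measure(8, 2, 2)
```

With equal precision and recall (0.8 each in this example), the F-measure equals that shared value.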
Conclusion • clustering snippets with hidden topics • labeling clusters using hidden topic analysis
My Comment
• Advantage
  • labeling clusters with the help of hidden topics
  • the snippet datasets are small
    • two datasets: 2,357 and about 150 snippets
    • (in our work: more than 2 million snippets)
• Disadvantage
  • the method depends less on the snippets themselves
• Application
  • snippets are useful for making sense of search results