Web People Search via Connection Analysis. Dmitri V. Kalashnikov, Zhaoqi (Stella) Chen, Sharad Mehrotra, Member, IEEE, and Rabia Nuray-Turan. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 20, NO. 11, NOVEMBER 2008. Advisors: Prof. 陳彥良, Prof. 許秉瑜. Presenters: 楊詠喬, 龍晶珠.
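Introduction (1/2) • People search accounts for more than 5% of today's web searches. • Search engines such as Google or Yahoo, given a person's name as the keyword, return a long list of web pages about different people who share that name. • Next-generation search engines will use clustering when searching for people, making it easier to find the right person.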
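Introduction (2/2) This paper presents: 1. A new web people search approach that produces high-quality clustering results 2. A thorough empirical evaluation of the approach 3. The impact of the approach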
Outline • Overview of the approach • Generating a graph representation • Disambiguation algorithm • Interpreting clustering results • Related work • Experimental results • Conclusions and future work
Overview of the approach (1/3) • User input • Web page retrieval: retrieves a fixed number (top K) of relevant pages • Preprocessing: compute TF/IDF; extract named entities (NEs) and Web-related information
Overview of the approach (2/3) • Graph creation: the entity-relationship (ER) graph • Clustering • Cluster processing: (1) sketch (2) cluster ranking (3) web page ranking
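The graph creation step can be illustrated with a minimal sketch. It assumes each page has already been reduced to a list of extracted (type, name) entities; the function names, the node encoding, and the sample entities are hypothetical and not taken from the paper. Counting shared entities is only a simple stand-in for the paper's connection analysis.

```python
from collections import defaultdict

def build_er_graph(page_entities):
    """Build an undirected graph linking each page node to its entity nodes.

    page_entities: dict mapping a page id to a list of (entity_type, name)
    tuples (hypothetical input format).
    """
    graph = defaultdict(set)
    for page, entities in page_entities.items():
        page_node = ("page", page)
        graph[page_node]  # make sure pages without entities still get a node
        for entity in entities:
            graph[page_node].add(entity)
            graph[entity].add(page_node)
    return graph

def shared_entities(graph, page_u, page_v):
    """Entities connected to both pages, i.e. two-hop paths between them."""
    neighbours_u = graph.get(("page", page_u), set())
    neighbours_v = graph.get(("page", page_v), set())
    return neighbours_u & neighbours_v

# Hypothetical sample data for illustration only.
pages = {
    "p1": [("person", "Alice Lee"), ("organization", "Acme Corp")],
    "p2": [("organization", "Acme Corp")],
    "p3": [],
}
er_graph = build_er_graph(pages)
```

Two pages that share an entity node are connected through that node, which is the kind of evidence a connection-based similarity can build on.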
Disambiguation algorithm (1/5) • Correlation Clustering (CC): the focus is on developing and learning a new, accurate similarity function s(u,v) • Connection strength c(u,v): c(u,v) can help design a better similarity function s(u,v)
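To make correlation clustering concrete, here is a minimal sketch of the well-known randomized pivot heuristic for it (often called KwikCluster). This is a generic textbook technique swapped in for illustration, not the algorithm used in the paper; the function name and the input format (a set of edges labeled "+") are assumptions.

```python
import random

def pivot_correlation_clustering(nodes, positive_edges, seed=0):
    """Cluster nodes with the pivot heuristic for correlation clustering.

    positive_edges: iterable of (u, v) pairs labeled '+' ("same person").
    Every pair not listed is treated as labeled '-'.
    """
    positive = {frozenset(edge) for edge in positive_edges}
    remaining = list(nodes)
    random.Random(seed).shuffle(remaining)  # random pivot order, reproducible
    clusters = []
    while remaining:
        pivot = remaining.pop(0)
        # The pivot takes every remaining node it has a '+' edge to.
        cluster = [pivot] + [v for v in remaining
                             if frozenset((pivot, v)) in positive]
        remaining = [v for v in remaining if v not in cluster]
        clusters.append(sorted(cluster))
    return clusters
```

When the "+" edges form disjoint cliques, the result does not depend on the pivot order; with noisy labels, different seeds can give different clusterings.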
Disambiguation algorithm (2/5) • Similarity function s(u,v): s(u,v) labels the data using the threshold τ and the δ-band approach
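A minimal sketch of threshold-based edge labeling follows. The slide does not spell out the δ-band rule, so this assumes one plausible reading: similarities clearly above τ are labeled "+", those clearly below are labeled "-", and values inside the band of half-width δ around τ are left unlabeled. The function name and the sample values are hypothetical.

```python
def label_edge(s, tau, delta):
    """Label an edge from its similarity s(u, v).

    Assumed rule (not confirmed by the slides):
      s >= tau + delta  -> '+'   (same person)
      s <= tau - delta  -> '-'   (different people)
      otherwise         -> None  (inside the delta-band, left unlabeled)
    """
    if s >= tau + delta:
        return "+"
    if s <= tau - delta:
        return "-"
    return None

# Hypothetical similarities for three page pairs.
labels = [label_edge(s, tau=0.5, delta=0.1) for s in (0.9, 0.55, 0.2)]
```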
Disambiguation algorithm (3/5) • TF/IDF: used to compute the feature-based similarity f(u,v) between two documents u and v
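One common way to obtain a feature-based similarity from TF/IDF is the cosine between the two documents' TF/IDF vectors. The sketch below assumes that choice and one particular weighting (relative term frequency times log inverse document frequency); the slides do not state which variant the paper uses, and the function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF/IDF vector (term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors

def cosine_similarity(vec_u, vec_v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * vec_v.get(term, 0.0) for term, weight in vec_u.items())
    norm_u = math.sqrt(sum(w * w for w in vec_u.values()))
    norm_v = math.sqrt(sum(w * w for w in vec_v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Documents that share rare terms score higher than documents that share only common ones, and documents with no terms in common score zero.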
Disambiguation algorithm (4/5) • For each (u,v) edge, we should require that • Adding slack
Disambiguation algorithm (5/5) • Choosing negative weight: The value of w-( ) is chosen to be zero when is less than a certain threshold, and it is chosen to be 1 when it is above this threshold. The value for this threshold itself is learned from the data.
Interpreting clustering results • Cluster rank • Cluster sketch • Web page rank: the remaining pages are displayed in order of their affinity to the selected cluster.
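The web page ranking step can be sketched as follows. The slides do not define affinity, so this assumes it is the average pairwise similarity between a page and the members of the selected cluster; the function name and the `similarity` callback are hypothetical.

```python
def rank_by_affinity(cluster, other_pages, similarity):
    """Order pages outside the selected cluster by decreasing affinity to it.

    Affinity is assumed here to be the mean similarity to the cluster's
    members; `similarity(a, b)` is any pairwise similarity function.
    The cluster must be non-empty.
    """
    def affinity(page):
        return sum(similarity(page, member) for member in cluster) / len(cluster)

    return sorted(other_pages, key=affinity, reverse=True)
```

For example, with a selected cluster of pages a and b, a page whose average similarity to a and b is 0.8 would be listed before one whose average is 0.5.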
Related work (1/2) • Disambiguation
Related work (2/2) Web people search 1. server-side setting 2. middleware approach (✓)
Experimental results • 1. Experimental setup • 2. Testing disambiguation quality • 3. Impact on search • 4. Efficiency
Experimental setup • Data sets (leave-one-out cross validation): 1. WWW 2005 data set 2. WEPS data set 3. Context data set • Quality evaluation measures: B-cubed, Fp • Baseline methods: Agglomerative Vector Space Clustering • Statistical significance test: t-test
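As an illustration of the B-cubed measure, here is a minimal sketch using its standard definition: for every page, precision is the fraction of pages in its predicted cluster that belong to the same real person, recall is the fraction of that person's pages found in the predicted cluster, and both are averaged over all pages. Whether the paper uses exactly this variant is an assumption, and the function name and input format are hypothetical.

```python
def b_cubed(predicted, truth):
    """Return (precision, recall, F1) under the standard B-cubed definition.

    predicted: dict page -> predicted cluster id
    truth:     dict page -> true person id (same keys as `predicted`)
    """
    pages = list(truth)
    precision_sum = 0.0
    recall_sum = 0.0
    for page in pages:
        same_cluster = {q for q in pages if predicted[q] == predicted[page]}
        same_person = {q for q in pages if truth[q] == truth[page]}
        overlap = len(same_cluster & same_person)  # at least 1: the page itself
        precision_sum += overlap / len(same_cluster)
        recall_sum += overlap / len(same_person)
    precision = precision_sum / len(pages)
    recall = recall_sum / len(pages)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Putting every page into one cluster gives perfect recall but low precision, while putting every page into its own cluster gives the opposite, which is why the two are combined into a single F score.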
Testing disambiguation quality: Experiment 1 (disambiguation quality: overall)
Testing disambiguation quality: Experiment 2 (disambiguation quality: group identification)
Testing disambiguation quality: Experiment 3 (disambiguation quality: queries with context)
Testing disambiguation quality: Experiment 4 (quality of generating cluster sketches)
Impact on search: measures (Experiment 5) • First-dominant cluster • Regular cluster
Impact on search: measures (Experiment 5) • Average
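Efficiency (Experiment 6) 1. Because NEs are extracted with a third-party tool (the NE extractor GATE), the initial download and preprocessing take 3.82 seconds per web page. 2. With a server-side approach, the preprocessing can be done offline in advance. 3. The clustering algorithm itself takes 4.7 seconds per name on average.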
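The reported timings can be combined into a rough back-of-the-envelope estimate. The sketch below assumes pages are downloaded and preprocessed one after another, so the per-page cost simply adds up. The function name is hypothetical, and the page count in the note afterwards is an illustrative value, not a figure from the paper.

```python
def estimated_seconds(num_pages, per_page=3.82, per_name=4.7,
                      offline_preprocessing=False):
    """Rough end-to-end time for one name query.

    per_page and per_name are the averages reported on the slide.
    With a server-side approach, preprocessing is assumed to have been
    done offline and contributes nothing at query time.
    """
    preprocessing = 0.0 if offline_preprocessing else num_pages * per_page
    return preprocessing + per_name
```

Under these assumptions, a hypothetical query over 100 pages would take roughly 100 × 3.82 + 4.7 ≈ 386.7 seconds in the middleware setting, compared with about 4.7 seconds when preprocessing is done offline.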
Future work • Employ external data sources for disambiguation as well • Use more advanced extraction capabilities • Interpret extracted entities better by taking into account the roles they play with respect to each other • Develop disambiguation algorithms for other people search problems that have different settings • Develop algorithms for generic entity search