Advisors: Prof. 陳彥良, Prof. 許秉瑜; Presenters: 楊詠喬, 龍晶珠

Web People Search via Connection Analysis
Dmitri V. Kalashnikov, Zhaoqi (Stella) Chen, Sharad Mehrotra, Member, IEEE, and Rabia Nuray-Turan
IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 11, November 2008

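Introduction (1/2) • People search accounts for more than 5% of today's web search activity. • Given a person's name as the keyword, search engines such as Google or Yahoo return a long list of web pages about different people who share that name. • Next-generation search engines will use clustering to make people search easier.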

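Introduction (2/2) This paper presents: 1. A new web people search approach that produces high-quality clustering results 2. A thorough empirical evaluation of the approach 3. The impact of the approach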

Outline • Overview of the approach • Generating a graph representation • Disambiguation algorithm • Interpreting clustering results • Related work • Experimental results • Conclusions and future work

Overview of the approach (1/3) • User input • Web page retrieval: retrieves a fixed number (top K) of relevant pages • Preprocessing: -- compute TF/IDF -- extract named entities (NEs) and Web-related information

Overview of the approach (2/3) • Graph creation: the entity-relationship (ER) graph • Clustering • Cluster processing: (1) sketch (2) cluster ranking (3) web page ranking
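The graph-creation step can be pictured with a minimal sketch. It assumes each retrieved page has already been reduced to a set of extracted named entities. The input format, the sample data, and the helper names `build_er_graph` and `shared_entities` are hypothetical and not taken from the paper.

```python
from collections import defaultdict

def build_er_graph(page_entities):
    """Build an undirected graph linking page nodes to entity nodes.

    page_entities: dict mapping a page id to the set of named entities
    extracted from that page (hypothetical input format).
    """
    graph = defaultdict(set)
    for page, entities in page_entities.items():
        page_node = ("page", page)
        graph[page_node]  # touch the key so pages without entities still appear
        for entity in entities:
            entity_node = ("entity", entity)
            graph[page_node].add(entity_node)
            graph[entity_node].add(page_node)
    return graph

def shared_entities(graph, page_a, page_b):
    """Entities that connect two pages through a path of length two."""
    neighbors_a = graph[("page", page_a)]
    neighbors_b = graph[("page", page_b)]
    return {name for _kind, name in neighbors_a & neighbors_b}

# Made-up example data: three pages and the entities found on each.
pages = {
    "p1": {"Example University", "Alice Smith"},
    "p2": {"Example University", "Data Society"},
    "p3": {"Sample City"},
}
er_graph = build_er_graph(pages)
print(shared_entities(er_graph, "p1", "p2"))
```

Pages that share entities are joined by short paths in this graph, which is the kind of connection the later slides exploit.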

Overview of the approach(3/3)

Generating a graph representation(1/2)

Generating a graph representation(2/2)

Disambiguation algorithm (1/5) • CC (Correlation Clustering): the focus is on developing and learning a new, accurate similarity function s(u,v) • Connection strength c(u,v): c(u,v) can help design a better similarity function s(u,v)
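Correlation clustering takes pairwise "+" / "-" labels and looks for a partition that disagrees with as few labels as possible. The sketch below uses a simple greedy pivot heuristic with a fixed pivot order. This is a swapped-in illustration, not the clustering procedure used in the paper, and the function names are hypothetical.

```python
def pivot_correlation_clustering(nodes, positive_pairs):
    """Greedy pivot heuristic for correlation clustering.

    nodes: iterable of node ids, processed in the given order.
    positive_pairs: set of frozensets {u, v} labeled "+" (same person);
    every other pair is treated as labeled "-".
    """
    unassigned = list(nodes)
    clusters = []
    while unassigned:
        pivot = unassigned[0]
        # The pivot absorbs every still-unassigned node it is "+"-linked to.
        cluster = [pivot] + [
            v for v in unassigned[1:] if frozenset((pivot, v)) in positive_pairs
        ]
        clusters.append(cluster)
        unassigned = [v for v in unassigned if v not in cluster]
    return clusters

def disagreements(clusters, positive_pairs):
    """Count pairs whose label conflicts with the clustering."""
    cluster_of = {v: i for i, c in enumerate(clusters) for v in c}
    nodes = list(cluster_of)
    errors = 0
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            same = cluster_of[u] == cluster_of[v]
            positive = frozenset((u, v)) in positive_pairs
            errors += same != positive
    return errors

# Made-up labels for five pages.
positive = {frozenset(("a", "b")), frozenset(("a", "c")), frozenset(("d", "e"))}
result = pivot_correlation_clustering(["a", "b", "c", "d", "e"], positive)
print(result, disagreements(result, positive))
```

In this toy input, pages b and c end up together even though their pair is labeled "-", so the clustering pays one disagreement. That cost is what a more accurate s(u,v) is meant to reduce.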

Disambiguation algorithm (2/5) • Similarity function s(u,v): s(u,v) labels data using the threshold τ and the δ-band approach

Disambiguation algorithm (3/5) • TF/IDF: used to compute the feature-based similarity f(u,v) between two documents u and v
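A minimal sketch of a feature-based similarity f(u,v) built on TF/IDF. The slide does not say which TF/IDF weighting or vector comparison is used, so the choices here (term frequency normalized by document length, a plain logarithmic IDF, cosine similarity) are assumptions, and the token lists are made up.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: dict doc_id -> list of tokens. Returns doc_id -> {term: weight}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        vectors[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors

def cosine(vec_u, vec_v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * vec_v.get(t, 0.0) for t, w in vec_u.items())
    norm_u = math.sqrt(sum(w * w for w in vec_u.values()))
    norm_v = math.sqrt(sum(w * w for w in vec_v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

docs = {
    "u": ["database", "professor", "irvine"],
    "v": ["database", "research", "irvine"],
    "w": ["football", "coach"],
}
vecs = tfidf_vectors(docs)
f_uv = cosine(vecs["u"], vecs["v"])
f_uw = cosine(vecs["u"], vecs["w"])
print(f_uv > f_uw)
```

Pages u and v share two terms and get a positive score, while u and w share none and score zero.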

Disambiguation algorithm (4/5) • For each (u,v) edge, we should require that • Adding slack
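The formulas for this slide did not survive extraction, so the following is only a sketch of how a threshold τ, a δ-band, and slack could fit together. It assumes that an edge is labeled "+" when s(u,v) exceeds τ, that training pairs should clear the threshold by a margin δ (positive pairs at or above τ + δ, negative pairs at or below τ − δ), and that slack measures how far a pair falls short of that margin. These constraints and both function names are assumptions, not the paper's stated formulation.

```python
def label_edge(s_uv, tau):
    """Label an edge "+" (same person) when s(u,v) exceeds tau, else "-"."""
    return "+" if s_uv > tau else "-"

def required_slack(s_uv, is_positive, tau, delta):
    """Slack a training pair needs to satisfy the assumed delta-band constraint.

    Assumed constraints (not copied from the slide):
      positive pair: s(u,v) >= tau + delta
      negative pair: s(u,v) <= tau - delta
    A pair that already satisfies its constraint needs zero slack.
    """
    if is_positive:
        return max(0.0, (tau + delta) - s_uv)
    return max(0.0, s_uv - (tau - delta))

tau, delta = 0.5, 0.1
print(label_edge(0.55, tau))                    # above tau, so labeled "+"
print(required_slack(0.9, True, tau, delta))    # clears the band: no slack
print(required_slack(0.55, True, tau, delta))   # inside the band: needs slack
```

Under these assumptions, a pair can be labeled correctly (0.55 > τ) and still need slack because it sits inside the band. A learner minimizing total slack would be pushed toward similarity functions that separate the two classes with a clear margin.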

Disambiguation algorithm (5/5) • Choosing negative weight: The value of w-() is chosen to be zero when is less than a certain threshold, and it is chosen to be 1 when it is above this threshold. The value of this threshold is itself learned from the data.

Interpreting clustering results • Cluster rank • Cluster sketch • Web page rank: the remaining pages are displayed in order of their affinity to the selected cluster.
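A minimal sketch of the two ranking steps. The slide does not say how clusters are ranked or how affinity is computed, so both are assumptions here: clusters are ordered by the best original search-engine position among their pages, and affinity is supplied by the caller as a hypothetical scoring function.

```python
def rank_clusters(clusters, search_rank):
    """Order clusters by the best (lowest) search-engine rank of their pages.

    Assumption: a cluster inherits the position of its highest-ranked page.
    """
    return sorted(clusters, key=lambda c: min(search_rank[p] for p in c))

def rank_remaining_pages(selected_cluster, all_pages, affinity):
    """Order pages outside the selected cluster by descending affinity to it.

    affinity: caller-supplied function (page, cluster) -> float (hypothetical).
    """
    remaining = [p for p in all_pages if p not in selected_cluster]
    return sorted(
        remaining, key=lambda p: affinity(p, selected_cluster), reverse=True
    )

# Made-up data: original result positions and affinity scores.
search_rank = {"p1": 1, "p2": 2, "p3": 3, "p4": 4, "p5": 5}
clusters = [["p3", "p5"], ["p1", "p4"], ["p2"]]
ordered = rank_clusters(clusters, search_rank)

scores = {"p2": 0.2, "p3": 0.7, "p5": 0.4}
rest = rank_remaining_pages(ordered[0], list(search_rank), lambda p, c: scores[p])
print(ordered, rest)
```

Once the user picks a cluster, pages outside it are not hidden. They are listed after it, most related first.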

Related work (1/2) • Disambiguation

Related work (2/2) • Web people search: 1. server-side setting 2. middleware approach (✓)

Experimental results • 1. Experimental setup • 2. Testing disambiguation quality • 3. Impact on search • 4. Efficiency

Experimental setup • Data sets (leave-one-out cross-validation): 1. WWW 2005 data set 2. WEPS data set 3. Context data set • Quality evaluation measures: B-cubed, Fp • Baseline method: Agglomerative Vector Space Clustering • Statistical significance test: t-test
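The two quality measures can be sketched as follows. The slide only names them, so the definitions used here are the commonly used ones and are assumptions about what the paper computes: B-cubed averages per-page precision and recall, and Fp is taken as the harmonic mean of purity and inverse purity. The sample labels are made up.

```python
def bcubed(predicted, truth):
    """B-cubed precision, recall, and F-measure.

    predicted, truth: dict page -> predicted cluster id / true person id.
    """
    pages = list(truth)
    precision = recall = 0.0
    for p in pages:
        same_cluster = {q for q in pages if predicted[q] == predicted[p]}
        same_person = {q for q in pages if truth[q] == truth[p]}
        correct = len(same_cluster & same_person)
        precision += correct / len(same_cluster)
        recall += correct / len(same_person)
    precision /= len(pages)
    recall /= len(pages)
    return precision, recall, 2 * precision * recall / (precision + recall)

def fp_measure(predicted, truth):
    """Harmonic mean of purity and inverse purity."""
    pages = list(truth)

    def groups(labels):
        out = {}
        for p in pages:
            out.setdefault(labels[p], set()).add(p)
        return list(out.values())

    clusters, classes = groups(predicted), groups(truth)
    purity = sum(max(len(c & k) for k in classes) for c in clusters) / len(pages)
    inverse = sum(max(len(k & c) for c in clusters) for k in classes) / len(pages)
    return 2 * purity * inverse / (purity + inverse)

truth = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
predicted = {"p1": 0, "p2": 0, "p3": 0, "p4": 1}
b_precision, b_recall, b_f = bcubed(predicted, truth)
fp = fp_measure(predicted, truth)
print(round(b_precision, 3), round(b_recall, 3), round(b_f, 3), round(fp, 3))
```

In the toy example, page p3 is wrongly merged into person A's cluster. That lowers B-cubed precision to 2/3 and recall to 3/4, and gives a purity and inverse purity of 3/4 each.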

Testing disambiguation quality: Experiment 1 (disambiguation quality: overall)

Testing disambiguation quality: Experiment 2 (disambiguation quality: group identification)

Testing disambiguation quality: Experiment 3 (disambiguation quality: queries with context)

Testing disambiguation quality: Experiment 4 (quality of generating cluster sketches)

Impact on search: measures (Experiment 5) • First-dominant cluster • Regular cluster

Impact on search: measures (Experiment 5) • Average

Impact on search: with context

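Efficiency (Experiment 6) 1. Because NEs are extracted with a third-party tool (NE extractor, GATE), the initial download and preprocessing take 3.82 seconds per web page. 2. With a server-side approach, preprocessing can be done offline in advance. 3. The clustering algorithm itself takes 4.7 seconds per name on average.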
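The timings on this slide combine into a rough per-query estimate. The sketch below assumes a query retrieves K = 100 pages (K is not given on the slide) and that a server-side deployment moves the entire 3.82-second per-page cost offline. Both are assumptions for illustration only.

```python
PREPROCESS_SECONDS_PER_PAGE = 3.82  # download + preprocessing, from the slide
CLUSTERING_SECONDS_PER_NAME = 4.7   # average clustering time, from the slide

def estimated_query_seconds(num_pages, preprocessed_offline=False):
    """Rough time to answer one name query.

    num_pages: number of retrieved pages (top K); the value is an assumption.
    preprocessed_offline: True models the server-side setting, where the
    per-page cost is assumed to be paid in advance.
    """
    if preprocessed_offline:
        preprocessing = 0.0
    else:
        preprocessing = num_pages * PREPROCESS_SECONDS_PER_PAGE
    return preprocessing + CLUSTERING_SECONDS_PER_NAME

online = estimated_query_seconds(100)
offline = estimated_query_seconds(100, preprocessed_offline=True)
print(round(online, 1), round(offline, 1))
```

Under these assumptions, preprocessing dominates the online cost (about 382 of roughly 387 seconds), which is why doing it offline matters for the server-side setting.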

Future work • Employ external data sources for disambiguation as well • Use more advanced extraction capabilities • Obtain a better interpretation of extracted entities by taking into account the roles they play with respect to each other • Develop disambiguation algorithms for other people search problems that have different settings • Develop algorithms for generic entity search

Thank you for listening
