Disambiguation Algorithm for People Search on the Web
Dmitri V. Kalashnikov, Sharad Mehrotra, Zhaoqi Chen, Rabia Nuray-Turan, Naveen Ashish
Computer Science Department, University of California, Irvine
For questions visit: http://www.ics.uci.edu/~dvk
Entity (People) Search
[Figure: the top-K Webpages for a query, grouped under Person1, Person2, and Person3; the grouping is labeled "Unknown beforehand".]
Overall Algorithm Overview
• User Input. A user submits a query to the middleware via a web-based interface.
• Web page Retrieval. The middleware queries a search engine's API and gets the top-K Web pages.
• Preprocessing. The retrieved Web pages are preprocessed:
  • TF/IDF. Preprocessing steps for computing TF/IDF are carried out.
  • Ontology. Ontologies are used to enrich the Web page content.
  • Extraction. Named entities and Web-related information are extracted from the Web pages.
• Graph Creation. The Entity-Relationship Graph is generated.
• Enhanced TF/IDF. Ontology-enhanced TF/IDF values are computed.
• Clustering. Correlation clustering is applied.
• Cluster Processing. Each resulting cluster is then processed as follows:
  • Sketches. A set of keywords that represents the Web pages within a cluster is computed for each cluster. The goal is that the user should be able to find the person of interest by looking at the sketch.
  • Cluster Ranking. All clusters are ranked by a chosen criterion, so that they are presented to the user in a certain order.
  • Web page Ranking. Once the user homes in on a particular cluster, the Web pages in this cluster are presented in a certain order, which is computed in this step.
• Visualization of Results. The results are presented to the user as clusters (and their sketches) that correspond to namesakes and can be explored further.
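The slides do not say how TF/IDF values or cluster sketches are computed. As a rough illustration only, the sketch below assumes plain (not ontology-enhanced) TF/IDF over tokenized pages and picks a cluster's sketch as its highest-weighted terms. The function names `tf_idf` and `cluster_sketch`, the tie-breaking rule, and the toy pages are all hypothetical.

```python
import math
from collections import Counter

def tf_idf(pages):
    """pages: list of non-empty token lists. Returns one {term: weight} dict per page.

    Assumption: tf = count / page length, idf = log(N / document frequency).
    """
    n = len(pages)
    df = Counter(term for page in pages for term in set(page))
    vectors = []
    for page in pages:
        tf = Counter(page)
        vectors.append({t: (c / len(page)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors

def cluster_sketch(cluster, vectors, top_n=3):
    """Top-n terms by summed TF/IDF weight over the page indices in `cluster`.

    Ties are broken alphabetically so the result is deterministic.
    """
    totals = Counter()
    for i in cluster:
        totals.update(vectors[i])  # adds the weights term by term
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]

# Toy data (hypothetical): three pages that all mention the queried name "smith".
pages = [
    ["smith", "professor", "databases"],
    ["smith", "professor", "irvine"],
    ["smith", "guitarist", "band"],
]
vectors = tf_idf(pages)
academic_sketch = cluster_sketch([0, 1], vectors)        # pages 0 and 1 in one cluster
musician_sketch = cluster_sketch([2], vectors, top_n=2)  # page 2 alone
```

Because the queried name occurs on every page, its IDF is zero and it never appears in a sketch, which matches the goal of showing only keywords that help tell namesakes apart.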
Correlation Clustering
• In CC, each pair of nodes (u,v) is labeled with a “+” or “-” edge
  • labeling is done according to a similarity function s(u,v)
• Similarity function s(u,v)
  • if s(u,v) indicates that u and v are similar, then label “+”
  • else label “-”
  • s(u,v) is typically trained from past data
• Clustering
  • looks at edges
  • tries to minimize disagreement
  • disagreement for an element x placed in cluster C is the number of “-” edges that connect x to other elements in C
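The labeling and the disagreement count described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the threshold of 0.5 that turns s(u,v) into a "+" or "-" label, the function names, and the toy similarity values are assumptions. It also follows the slide's definition only, counting "-" edges inside a cluster and ignoring "+" edges that cross clusters.

```python
from itertools import combinations

def label_edges(nodes, s, threshold=0.5):
    """Label each unordered pair '+' if s(u, v) >= threshold, else '-'.

    The threshold is an assumption; the slides only say that s decides similarity.
    """
    return {frozenset((u, v)): "+" if s(u, v) >= threshold else "-"
            for u, v in combinations(nodes, 2)}

def disagreement(x, cluster, labels):
    """Number of '-' edges between x and the other elements of `cluster`."""
    return sum(1 for y in cluster
               if y != x and labels[frozenset((x, y))] == "-")

def total_disagreement(clusters, labels):
    """Sum of per-element disagreements; each in-cluster '-' edge is counted
    twice, once for each endpoint."""
    return sum(disagreement(x, c, labels) for c in clusters for x in c)

# Toy similarity values (hypothetical) for four Web pages.
sim = {("p1", "p2"): 0.9, ("p1", "p3"): 0.2, ("p1", "p4"): 0.1,
       ("p2", "p3"): 0.3, ("p2", "p4"): 0.2, ("p3", "p4"): 0.8}

def s(u, v):
    return sim[(u, v)] if (u, v) in sim else sim[(v, u)]

labels = label_edges(["p1", "p2", "p3", "p4"], s)
good = total_disagreement([{"p1", "p2"}, {"p3", "p4"}], labels)  # 0
bad = total_disagreement([{"p1", "p2", "p3", "p4"}], labels)     # 8
```

A clustering procedure would search over partitions for one with low total disagreement; here the two-cluster partition has none, while lumping all four pages together puts four "-" edges inside one cluster.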
Similarity Function
• Connection strength between u and v: $c(u,v) = \sum_k c_k w_k$
  • where $c_k$ is the number of u-v paths of type k
  • and $w_k$ is the weight of u-v paths of type k
• Similarity s(u,v) is a combination
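Reading the connection strength as a weighted sum over path types (an assumption based on how c_k and w_k are defined on this slide), the computation is a single dot product. The path-type names and weight values below are hypothetical, and the slide does not say how the weights are obtained.

```python
def connection_strength(path_counts, weights):
    """Weighted sum over path types k of (number of u-v paths of type k) * (weight of type k).

    path_counts: {path_type: c_k}; weights: {path_type: w_k}.
    Path types without a weight contribute nothing (an assumption).
    """
    return sum(count * weights.get(k, 0.0) for k, count in path_counts.items())

# Hypothetical path types in the entity-relationship graph between pages u and v.
path_counts = {"page-person-page": 2, "page-organization-page": 1}
weights = {"page-person-page": 0.5, "page-organization-page": 0.25}

strength = connection_strength(path_counts, weights)  # 2 * 0.5 + 1 * 0.25 = 1.25
```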
Experiments: Quality of Disambiguation
• By Artiles et al. in SIGIR’05
• By Bekkerman & McCallum in WWW’05