The Google Similarity Distance. Presenter : Chien-Hsing Chen Author: Rudi L. Cilibrasi Paul M.B. Vitanyi. 2007,TKDE. Outline. Motivation Objective NGD Experiments Conclusions Personal Opinion. Motivation.
The Google Similarity Distance • Presenter: Chien-Hsing Chen • Authors: Rudi L. Cilibrasi, Paul M.B. Vitanyi • 2007, TKDE
Outline • Motivation • Objective • NGD • Experiments • Conclusions • Personal Opinion
Motivation • Designing structures capable of manipulating knowledge is very costly • High-quality content has to be entered into these structures by knowledgeable human experts • Such efforts are long-running • and large-scale
Objective • The authors develop a method that uses only the name of an object to obtain knowledge about the similarity of objects • By contrast, regular formal concept analysis (FCA), as used in ontology work, derives similarity from objects and their attributes
The Google Similarity Distance Kolmogorov complexity
The Google Similarity Distance • NGD(horse, rider) = 0.443 • "horse": 46,700,000 pages • "rider": 12,200,000 pages • "horse, rider": 2,630,000 pages • N = 8,058,044,651 indexed pages • Further values listed on the slide: NGD(pensi, cola) = 0.797 • NGD(賓拉登 [bin Laden], 攻擊 [attack]) = 0.64 • NGD(horse, rider) = 0.898 • NGD(book, drink) = 0.694 • NGD(web, network) = 0.2768
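The 0.443 figure for horse and rider can be checked by hand. The sketch below assumes the NGD definition from the Cilibrasi and Vitanyi paper, NGD(x, y) = (max{log f(x), log f(y)} − log f(x, y)) / (log N − min{log f(x), log f(y)}), where f(x) is the number of pages containing x, f(x, y) the number containing both terms, and N the number of indexed pages. The formula is not written out on the slide, and the function name `ngd` is only illustrative.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google Distance computed from raw page counts.

    fx, fy : page counts for each term alone
    fxy    : page count for pages containing both terms
    n      : total number of indexed pages
    The base of the logarithm cancels out in the ratio.
    """
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    numerator = max(log_fx, log_fy) - log_fxy
    denominator = math.log(n) - min(log_fx, log_fy)
    return numerator / denominator


# Page counts taken from the slide above.
horse_rider = ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651)
print(round(horse_rider, 3))  # prints 0.443
```

Page counts change over time, so rerunning the same queries later gives different values.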
Applications and Experiments • Hierarchical Clustering • Given a set of objects in a space equipped with a distance measure, the distance matrix has the pairwise distances between the objects as its entries.
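The distance matrix described above can be sketched as follows. This is a minimal illustration, not the authors' clustering code: `distance_matrix` is a hypothetical helper, and `toy_distance` is a made-up stand-in so the example runs without querying a search engine. In practice an NGD function would take its place.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def distance_matrix(
    items: Sequence[str],
    dist: Callable[[str, str], float],
) -> List[List[float]]:
    """Build the symmetric matrix of pairwise distances between items."""
    n = len(items)
    matrix = [[0.0] * n for _ in range(n)]  # diagonal stays at 0
    for i, j in combinations(range(n), 2):
        d = dist(items[i], items[j])
        matrix[i][j] = d
        matrix[j][i] = d  # the matrix is filled symmetrically
    return matrix


def toy_distance(a: str, b: str) -> float:
    """Stand-in distance (NOT NGD): difference in name length."""
    return float(abs(len(a) - len(b)))


names = ["ab", "abcd", "abcdefg"]  # placeholder object names
matrix = distance_matrix(names, toy_distance)
for row in matrix:
    print(row)
```

A clustering method then works on this matrix alone, so the objects themselves are never needed beyond their names.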
Applications and Experiments • Hierarchical Clustering • Dataset: 17th-century painters
Applications and Experiments • SVM-NGD Learning • The authors use six anchor words a1, …, a6 to convert each of the 40 training words w1, …, w40 into a 6-dimensional training vector v1, …, v40. • The entry vj,i of vj = (vj,1, …, vj,6) is defined as vj,i = NGD(wj, ai) (1 ≤ j ≤ 40, 1 ≤ i ≤ 6)
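The conversion of words into training vectors can be sketched as below. It is only an illustration under stated assumptions: `ngd_feature_vectors` is a hypothetical helper, the word and anchor lists are placeholders rather than the ones used in the paper, and `stand_in_ngd` replaces real NGD lookups so the example runs offline.

```python
from typing import Callable, List, Sequence


def ngd_feature_vectors(
    words: Sequence[str],
    anchors: Sequence[str],
    ngd: Callable[[str, str], float],
) -> List[List[float]]:
    """Return one vector per word, with entry i equal to ngd(word, anchors[i])."""
    return [[ngd(word, anchor) for anchor in anchors] for word in words]


def stand_in_ngd(word: str, anchor: str) -> float:
    """Placeholder distance (NOT real NGD): 0 for identical strings, else 1."""
    return 0.0 if word == anchor else 1.0


# Placeholder inputs: six anchors give 6-dimensional vectors.
anchors = ["a1", "a2", "a3", "a4", "a5", "a6"]
words = ["w1", "a2", "w3"]

vectors = ngd_feature_vectors(words, anchors, stand_in_ngd)
print(len(vectors), len(vectors[0]))  # prints 3 6
```

The resulting vectors, together with their class labels, are what would be handed to an SVM trainer.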
Comparison to WordNet semantics • 100 semantic categories are randomly selected from the WordNet database • For each category, an SVM is trained on 50 labeled training samples; the positive examples come from WordNet, the others from a dictionary • Each experiment uses six anchors in total: three from WordNet and three from the dictionary • The test set consists of 20 new examples • 100 experiments are run in total • The authors ignore false negatives
Conclusion • The WordNet knowledge base used for comparison was created over the course of decades by paid human experts. • Google has already indexed more than 8 billion pages and shows no signs of slowing down. • The estimate of 8 billion indexed pages dates from 2004.
Opinion • Advantage • The Google search engine has recently gained recognition as a basis for similarity measurement. • Drawback • How the anchors are determined, and the accuracy measure (false negatives are ignored) • NGD is less a novel idea than a straightforward demonstration • Application