The Google Similarity Distance


Presentation Transcript


  1. The Google Similarity Distance Presenter: Chien-Hsing Chen Authors: Rudi L. Cilibrasi, Paul M.B. Vitanyi 2007, TKDE

  2. Outline • Motivation • Objective • NGD • Experiments • Conclusions • Personal Opinion

  3. Motivation • The great cost of designing structures capable of manipulating knowledge • Entering high-quality content into these structures requires knowledgeable human experts • The effort is long-running and large-scale

  4. Objective • The authors develop a method that uses only the name of an object to obtain knowledge about the similarity of objects • By contrast, a regular FCA (formal concept analysis), as used in ontology building, acquires the similarity between objects and attributes

  5. The Google Similarity Distance • Derived from the normalized information distance, which is based on Kolmogorov complexity
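
The formula itself does not survive in this transcript; for reference, the definition given in the paper is:

    \[
    \mathrm{NGD}(x, y) \;=\;
    \frac{\max\{\log f(x),\ \log f(y)\} - \log f(x, y)}
         {\log N - \min\{\log f(x),\ \log f(y)\}}
    \]

where f(x) and f(y) are the number of pages containing x and y respectively, f(x, y) is the number of pages containing both terms, and N is the total number of pages indexed by Google.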

  6. The Google Similarity Distance • NGD(horse, rider) = 0.443 • "horse": 46,700,000 pages • "rider": 12,200,000 pages • "horse" and "rider" together: 2,630,000 pages • N = 8,058,044,651 indexed pages • Further examples: NGD(pepsi, cola) = 0.797, NGD(賓拉登 [Bin Laden], 攻擊 [attack]) = 0.64, NGD(horse, rider) = 0.898, NGD(book, drink) = 0.694, NGD(web, network) = 0.2768
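
The values above follow directly from the page counts. Below is a minimal Python sketch of that computation; the function name ngd and the hard-coded counts (taken from the slide) are for illustration only, and querying a live search API for the counts is out of scope here.

    import math

    def ngd(f_x, f_y, f_xy, n):
        """Normalized Google Distance from raw page counts.

        f_x, f_y -- pages containing each term alone
        f_xy     -- pages containing both terms
        n        -- total number of indexed pages
        """
        lx, ly, lxy = math.log(f_x), math.log(f_y), math.log(f_xy)
        return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

    # Page counts quoted on the slide for "horse" and "rider" (2004 index size).
    print(ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651))  # ~0.443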

  7. Applications and Experiments • Hierarchical Clustering • Given a set of objects in a space equipped with a distance measure, the distance matrix has as its entries the pairwise distances between the objects (a code sketch follows the next slide)

  8. Applications and Experiments • Hierarchical Clustering • Dataset: 17th Century painters
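
To make the clustering step concrete, here is a small Python sketch that builds a dendrogram from a precomputed NGD distance matrix with SciPy's agglomerative clustering; this standard linkage approach stands in for the paper's own tree-building procedure and is only an illustration. The painter names follow the slide's dataset, but the distance values are placeholders, not measured NGD scores.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram
    from scipy.spatial.distance import squareform

    # A few 17th-century painters as labels; the distances below are
    # placeholders standing in for NGD values computed from page counts.
    names = ["Rembrandt", "Vermeer", "Rubens", "Steen"]
    dist = np.array([
        [0.00, 0.30, 0.55, 0.40],
        [0.30, 0.00, 0.60, 0.45],
        [0.55, 0.60, 0.00, 0.65],
        [0.40, 0.45, 0.65, 0.00],
    ])

    # Convert the symmetric matrix to condensed form and cluster.
    tree = linkage(squareform(dist), method="average")
    dendrogram(tree, labels=names)
    plt.show()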

  9. Applications and Experiments • SVM-NGD Learning • The authors use six anchor words to convert each of the 40 training words w1, …, w40 into a 6-dimensional training vector v1, …, v40 • The entry vj,i of vj = (vj,1, …, vj,6) is defined as vj,i = NGD(wj, ai) (1 ≤ j ≤ 40, 1 ≤ i ≤ 6)
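
A rough Python sketch of this feature construction is below, using scikit-learn's SVC. The anchor and training words are made-up stand-ins (the slide's setup uses 6 anchors and 40 words), and ngd() is a placeholder so the example runs end to end; in the real setup it would apply the NGD formula to search-engine page counts.

    import numpy as np
    from sklearn.svm import SVC

    anchors = ["colour", "paint", "sound", "music", "shape", "tone"]  # 6 anchors (made up)
    train_words = ["red", "blue", "violin", "trumpet"]                # stand-ins for the 40 training words
    labels = [1, 1, 0, 0]                                             # 1 = word belongs to the target category

    def ngd(x, y):
        # Placeholder distance so the sketch is self-contained; the real
        # implementation applies the NGD formula to page counts for x, y, and (x, y).
        return abs(len(x) - len(y)) / max(len(x), len(y))

    def features(word):
        # v_{j,i} = NGD(w_j, a_i): one NGD score against each anchor word.
        return [ngd(word, a) for a in anchors]

    X = np.array([features(w) for w in train_words])
    clf = SVC(kernel="rbf").fit(X, labels)

    # A new word is classified from its NGD scores against the same anchors.
    print(clf.predict([features("green")]))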

  10. NGD Translation

  11. Comparison to WordNet semantics • 100 semantic categories were randomly selected from the WordNet database • For each category, an SVM is trained on 50 labeled training samples; positive examples come from WordNet, the others from a dictionary • Each experiment uses a total of six anchors, 3 from WordNet and 3 from the dictionary • The test set consists of 20 new examples • 100 experiments are run in total • The authors ignore false negatives

  12. Conclusion • This knowledge base (WordNet) was created over the course of decades by paid human experts • Google has already indexed more than 8 billion pages and shows no signs of slowing down • The 8-billion-page figure is an estimate from 2004

  13. Opinion • Advantage • Using the Google search engine as a basis for similarity measurement has recently gained recognition • Drawbacks • How to determine the anchors; the accuracy measure ignores false negatives • NGD is not especially novel, but rather a straightforward demonstration • Application
