Duplicate Detection
Exercise 1. Use an extended key to perform entity identification [1].
Tables R and S are shown below: Table R, Table S
Suppose the extended key is {name, city, homeaddress} and the following ILFDs hold:
• (E.HomeAddress = "Myskviksvägen 8") -> (E.City = "INGARÖ")
• (E.HomeAddress = "Myrvägen 2") -> (E.City = "INGARÖ")
• (E.HomeAddress = "Pilgatan 9") -> (E.City = "STOCKHOLM")
• (E.HomeAddress = "Nyängsvägen 39A") -> (E.City = "TULLINGE")
Please construct the integrated table.
-----------------------------------------------------
[1] Lim, Jaideep Srivastava, Satya Prabhakar, James Richardson, Entity Identification in Database Integration, Proceedings of the Ninth International Conference on Data Engineering, pp. 294-301, April 19-23, 1993
Answer to Exercise 1 • Integrated table
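As a rough illustration of how the ILFDs and the extended key work together, the sketch below first derives a missing city from the home address and then merges rows that agree on the extended-key attributes they both carry. The sample rows, the extra `phone` and `email` attributes, and the helper names `apply_ilfds`, `compatible` and `integrate` are hypothetical and are not taken from Tables R and S; the matching rule is a simplifying assumption, not necessarily the exact procedure of [1].

```python
# ILFDs from the exercise: a known home address determines the city.
ILFDS = {
    "Myskviksvägen 8": "INGARÖ",
    "Myrvägen 2": "INGARÖ",
    "Pilgatan 9": "STOCKHOLM",
    "Nyängsvägen 39A": "TULLINGE",
}
EXTENDED_KEY = ("name", "city", "homeaddress")


def apply_ilfds(row):
    """Return a copy of row whose missing city is derived from homeaddress."""
    row = dict(row)
    if not row.get("city") and row.get("homeaddress") in ILFDS:
        row["city"] = ILFDS[row["homeaddress"]]
    return row


def compatible(a, b):
    """Assumed rule: rows match if they agree on every key attribute both have."""
    shared = [k for k in EXTENDED_KEY if a.get(k) and b.get(k)]
    return bool(shared) and all(a[k] == b[k] for k in shared)


def integrate(r_rows, s_rows):
    """Merge matching rows from R and S; keep unmatched rows as they are."""
    r_rows = [apply_ilfds(r) for r in r_rows]
    s_rows = [apply_ilfds(s) for s in s_rows]
    result, used = [], set()
    for r in r_rows:
        merged = dict(r)
        for i, s in enumerate(s_rows):
            if i not in used and compatible(r, s):
                merged = {**s, **r}  # union of attributes from both rows
                used.add(i)
                break
        result.append(merged)
    result.extend(s for i, s in enumerate(s_rows) if i not in used)
    return result


# Hypothetical sample rows (NOT the rows of Tables R and S).
R = [
    {"name": "Anna", "city": "INGARÖ", "phone": "111"},
    {"name": "Anna", "city": "STOCKHOLM", "phone": "222"},
]
S = [
    {"name": "Anna", "homeaddress": "Pilgatan 9", "email": "a@example.com"},
    {"name": "Bo", "homeaddress": "Myrvägen 2", "email": "b@example.com"},
]
integrated = integrate(R, S)
```

In this made-up data the ILFD for "Pilgatan 9" is what distinguishes the two rows named Anna: only the STOCKHOLM row is merged with the S row.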
Table R, which is already sorted according to an application-specific key: Similarities between tuples • Given the conditions below, please use the priority queue algorithm [2] to find the duplicate clusters in it.
How to compute the matching score: for a given cluster, the matching score of a tuple is the average of the tuple's similarities to all of the cluster's representatives.
• Condition for declaring a new cluster: matching score < 0.5
• Condition for declaring a representative: 0.5 < matching score < 0.8
• Size of the priority queue: 2
-----------------------------------------------------
[2] A.E. Monge and C.P. Elkan, "An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records," Proc. ACM-SIGMOD Workshop Research Issues on Knowledge Discovery and Data Mining, 1997
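The scoring rule and the two thresholds can be written down in a few lines. This is only a sketch: the function names `matching_score` and `decide` are made up for illustration, and treating a score of exactly 0.5 or exactly 0.8 as "member only" is an assumption that follows from reading the inequalities strictly.

```python
def matching_score(similarities):
    """Average similarity between a tuple and a cluster's representatives."""
    return sum(similarities) / len(similarities)


def decide(score, new_below=0.5, rep_below=0.8):
    """Classify a tuple against one cluster using the exercise's thresholds."""
    if score < new_below:
        return "new cluster"                # does not belong to this cluster
    if new_below < score < rep_below:
        return "member and representative"  # joins and becomes a representative
    return "member only"                    # joins, but adds no new information
```

For example, `decide(matching_score([0.3, 0.4]))` gives "new cluster", since the average 0.35 is below 0.5.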
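Answer
Record 1: Queue {1}
Record 2: 2:1 = 0.6, which is > 0.5 and < 0.8, so record 2 joins the cluster as a representative. Queue {1, 2}
Record 3: 3:1 = 0.1, 3:2 = 0.2, matching score = (0.1 + 0.2) / 2 = 0.15 < 0.5, so record 3 starts a new cluster. Queue {3} {1, 2}
Record 4: 4:1 = 0.3, 4:2 = 0.4, matching score = (0.3 + 0.4) / 2 = 0.35 < 0.5; 4:3 = 0.9, which is > 0.5 and > 0.8, so record 4 joins cluster {3} but is not a representative. Queue {3, 4} {1, 2}
Record 5: 5:1 = 0.5, 5:2 = 0.4, matching score = (0.5 + 0.4) / 2 = 0.45 < 0.5; 5:3 = 0.4, matching score = 0.4 < 0.5, so record 5 starts a new cluster. Because the queue holds only 2 clusters, {1, 2} is no longer compared from here on. Queue {5} {3, 4} {1, 2}
Record 6: 6:3 = 0.6, matching score = 0.6, which is > 0.5 and < 0.8, so record 6 joins cluster {3, 4} as a representative; 6:5 = 0.4 < 0.5. Queue {3, 4, 6} {5} {1, 2}
Record 7: 7:3 = 0.5, 7:6 = 0.4, matching score = (0.5 + 0.4) / 2 = 0.45 < 0.5; 7:5 = 0.8 > 0.5, so record 7 joins cluster {5}. Queue {5, 7} {3, 4, 6} {1, 2}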
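The trace above can be reproduced with a small simulation. This is a minimal sketch under stated assumptions, not the implementation from [2]: the `SIM` values are the pairwise similarities quoted in the answer, the queue is scanned from the head and the record joins the first cluster whose matching score is at least 0.5, a matched or new cluster moves to the head, and the tail cluster is dropped from the queue when it grows beyond 2. The names `SIM` and `cluster_records` are hypothetical.

```python
# Pairwise similarities quoted in the answer: (record, representative) -> score.
SIM = {
    (2, 1): 0.6,
    (3, 1): 0.1, (3, 2): 0.2,
    (4, 1): 0.3, (4, 2): 0.4, (4, 3): 0.9,
    (5, 1): 0.5, (5, 2): 0.4, (5, 3): 0.4,
    (6, 3): 0.6, (6, 5): 0.4,
    (7, 3): 0.5, (7, 6): 0.4, (7, 5): 0.8,
}
NEW_BELOW, REP_BELOW, QUEUE_SIZE = 0.5, 0.8, 2


def cluster_records(records):
    """Return clusters as member lists: queue order first, then evicted ones."""
    queue, evicted = [], []  # queue[0] is the highest-priority cluster
    for rec in records:
        for cluster in queue:
            reps = cluster["reps"]
            score = sum(SIM[(rec, r)] for r in reps) / len(reps)
            if score >= NEW_BELOW:
                cluster["members"].append(rec)
                if NEW_BELOW < score < REP_BELOW:
                    reps.append(rec)  # similar, but different enough to add
                queue.remove(cluster)
                queue.insert(0, cluster)  # matched cluster moves to the head
                break
        else:
            # No cluster in the queue matched: start a new one at the head.
            queue.insert(0, {"members": [rec], "reps": [rec]})
            if len(queue) > QUEUE_SIZE:
                evicted.insert(0, queue.pop())  # drop the tail cluster
    return [c["members"] for c in queue + evicted]


clusters = cluster_records(range(1, 8))
```

With these inputs `clusters` comes out as `[[5, 7], [3, 4, 6], [1, 2]]`, matching the final line of the answer.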