Entity Resolution Tool ‘sdlink’ - Darshana Pathak - Dr. Hye-Chung Kum
Index: • Overview • Entity resolution process • About Framework • Configuration file • Class Details • How to … • Future Work • Questions?
Overview: • A framework for developing an entity resolution tool, named ‘sdlink’ • The idea is to provide a ‘lab’ • For whom? • Research assistants and students • Why? • To contribute to research
Entity Resolution Process: • Configure: Define link variable • Compare: Similarity metrics, find distance • Decide: Supervised/unsupervised decision model • Search: Reduce space (blocking) • Evaluate: Assess the linked data • Analyze: Error propagation • Refine: Relationships and deduplication • Data management
Various Tools: • Searching Methods • Blocking • Sorting • Hashing • Sorted Neighborhood • Comparison Functions • Hamming Distance • Edit Distance • Jaro’s Algorithm • N-grams • Soundex Code
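Two of the comparison functions listed above can be sketched in Java as follows. This is a minimal illustration, not code from sdlink: the class name `ComparisonSketch` and both method names are assumptions made for the example.

```java
// Hypothetical helper class, not part of sdlink.
final class ComparisonSketch {

    /** Hamming distance: number of positions at which two equal-length strings differ. */
    static int hamming(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("Hamming distance needs equal-length strings");
        }
        int differences = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) != b.charAt(i)) {
                differences++;
            }
        }
        return differences;
    }

    /** Edit (Levenshtein) distance: fewest single-character inserts, deletes and substitutions. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int substitution = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + substitution);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }
}
```

Hamming distance only applies to values of equal length, such as fixed-width codes, while edit distance also handles names of different lengths.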
Various Tools: • Decision Models • Probabilistic Model • Induction Model • Clustering Model • Hybrid Model • Measurement Tools • Reduction Ratio • Pairs Completeness • Accuracy • Completeness
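The first two measurement tools can be written down directly. The sketch below uses the definitions commonly given in the record linkage literature; it assumes two files are being linked, so the unblocked comparison space is rowsA × rowsB. The class and method names are hypothetical.

```java
// Hypothetical helper class, not part of sdlink.
final class BlockingMetricsSketch {

    /** Fraction of the full comparison space (rowsA x rowsB) that blocking removed. */
    static double reductionRatio(long candidatePairs, long rowsA, long rowsB) {
        return 1.0 - (double) candidatePairs / ((double) rowsA * rowsB);
    }

    /** Fraction of all true matches that still appear among the candidate pairs. */
    static double pairsCompleteness(long trueMatchesInCandidates, long totalTrueMatches) {
        return (double) trueMatchesInCandidates / totalTrueMatches;
    }
}
```

A good blocking method pushes the reduction ratio close to 1 while keeping pairs completeness high; the two usually trade off against each other.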
About Framework: • Basic framework includes: • Configuration file: configure.xml • Main class: SDLink.java • ConfigFile and ConfigReader • CSVFile, CSVReader and CSVWriter • BlockingModel.java • DistanceCalculator.java Each of these is explained in the following slides.
Configuration File: • Name: configure.xml • Specifies: • 2 CSV Files to be linked • List of attributes • Blocking method • Weight for each attribute • Clustering method
Java Class Details: • SDLink.java – Initializes all classes to • Read configuration file • Read 2 CSV Files • Perform blocking • Calculate distances • Perform clustering • Write output to output files
Java Class Details: • ConfigFile.java and ConfigReader.java • Read configure.xml • Hold the details of the CSV files, attributes, blocking methods and clustering method. • Store all this information in an instance of ConfigFile.java so that other classes can readily access it whenever required.
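Reading such a configuration could look roughly like the sketch below. The slides do not show the layout of configure.xml, so every element and attribute name here (`config`, `file`, `attribute`, `name`, `weight`, `blocking`) is an assumption, as is the class `ConfigReaderSketch`. Only standard JDK XML classes are used.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader, not the actual ConfigReader.java.
final class ConfigReaderSketch {

    // Assumed layout; the real configure.xml schema is not given in the slides.
    static final String SAMPLE =
          "<config>"
        + "<file>a.csv</file><file>b.csv</file>"
        + "<attribute name=\"last_name\" weight=\"0.6\"/>"
        + "<attribute name=\"birth_year\" weight=\"0.4\"/>"
        + "<blocking>birth_year</blocking>"
        + "</config>";

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration", e);
        }
    }

    /** Returns attribute name -> weight, in the order the attributes are listed. */
    static Map<String, Double> readWeights(String xml) {
        Map<String, Double> weights = new LinkedHashMap<>();
        NodeList attributes = parse(xml).getElementsByTagName("attribute");
        for (int i = 0; i < attributes.getLength(); i++) {
            Element attribute = (Element) attributes.item(i);
            weights.put(attribute.getAttribute("name"),
                        Double.parseDouble(attribute.getAttribute("weight")));
        }
        return weights;
    }

    /** Returns the text of the first <blocking> element. */
    static String readBlockingAttribute(String xml) {
        return parse(xml).getElementsByTagName("blocking").item(0).getTextContent();
    }
}
```

Parsing once and keeping the result in a single object matches the idea on this slide: other classes ask the configuration object instead of re-reading the file.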
Java Class Details: • CSVFile.java, CSVReader.java & CSVWriter.java • Read both CSV files • Combine the two files into one • Form a 2-D matrix of all attributes in the CSV files • Store all the data from the CSV files in an instance of CSVFile.java
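As an illustration of combining two files into one 2-D matrix, here is a minimal sketch. It makes several assumptions not stated in the slides: each file starts with a header line, fields contain no quoted commas, and a source tag ("A" or "B") is stored in column 0 so that rows can be traced back to their file. `CsvCombineSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the actual CSVReader.java.
final class CsvCombineSketch {

    /** Stacks the data rows of both files into one matrix; column 0 holds the source tag. */
    static String[][] combine(List<String> linesA, List<String> linesB) {
        List<String[]> rows = new ArrayList<>();
        addRows(rows, linesA, "A");
        addRows(rows, linesB, "B");
        return rows.toArray(new String[0][]);
    }

    private static void addRows(List<String[]> rows, List<String> lines, String source) {
        for (int i = 1; i < lines.size(); i++) {            // line 0 is assumed to be the header
            String[] fields = lines.get(i).split(",", -1);  // -1 keeps trailing empty fields
            String[] row = new String[fields.length + 1];
            row[0] = source;
            System.arraycopy(fields, 0, row, 1, fields.length);
            rows.add(row);
        }
    }
}
```

A naive split on commas is enough for a sketch; real data with quoted fields would need a proper CSV parser.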
Java Class Details: • BlockingModel.java • Performs blocking on the 2-D matrix of data • Knows how to partition rows from configure.xml • Important step as further clustering is done on each block. • Necessary to handle large data.
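The simplest form of blocking groups rows that share the same value in the blocking attribute, so that later comparisons stay inside each group. The sketch below shows that idea only; it is an assumption about the approach, not the code in BlockingModel.java, and `BlockingSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not the actual BlockingModel.java.
final class BlockingSketch {

    /** Partitions rows by the exact value found in keyColumn. */
    static Map<String, List<String[]>> block(String[][] rows, int keyColumn) {
        Map<String, List<String[]>> blocks = new LinkedHashMap<>();
        for (String[] row : rows) {
            blocks.computeIfAbsent(row[keyColumn], key -> new ArrayList<>()).add(row);
        }
        return blocks;
    }
}
```

With n rows split into blocks of roughly equal size b, the number of comparisons drops from about n² to about n·b, which is why this step matters for large data.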
Java Class Details: • DistanceCalculator.java • Performs operations on each block (formed in blocking step) separately. • Calculates distance between two attributes • Compares distances and calculates densities iteratively • Forms many tiny clusters as the process runs for multiple iterations • Process runs until no clusters can be formed.
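The slides do not spell out how attribute distances are combined or how densities are computed. One plausible way to use the per-attribute weights from configure.xml is a weighted average of attribute distances, sketched below. The class `RecordDistanceSketch`, its methods, and the 0/1 `mismatch` comparison are all assumptions for illustration; any comparison function returning a value in [0, 1] could be passed instead.

```java
import java.util.function.ToDoubleBiFunction;

// Hypothetical helper, not the actual DistanceCalculator.java.
final class RecordDistanceSketch {

    /** Weighted average of per-attribute distances between two records. */
    static double distance(String[] a, String[] b, double[] weights,
                           ToDoubleBiFunction<String, String> attributeDistance) {
        double total = 0.0;
        double weightSum = 0.0;
        for (int i = 0; i < weights.length; i++) {
            total += weights[i] * attributeDistance.applyAsDouble(a[i], b[i]);
            weightSum += weights[i];
        }
        return weightSum == 0.0 ? 0.0 : total / weightSum;
    }

    /** Simplest possible attribute distance: 0 if equal ignoring case, otherwise 1. */
    static double mismatch(String x, String y) {
        return x.equalsIgnoreCase(y) ? 0.0 : 1.0;
    }
}
```

For example, `distance(a, b, new double[] {0.6, 0.4}, RecordDistanceSketch::mismatch)` treats a disagreement on the first attribute as more serious than one on the second.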
Java Class Details: • Everything runs in a big LOOP… • There can be multiple blocking attributes. • The whole process of blocking and clustering runs for each blocking attribute. • The output of every iteration is an input to the next iteration. • Be careful: It should not be an infinitely long process!
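The loop described above can be sketched generically: one pass per blocking attribute, each pass repeated until nothing changes, with a hard cap on the number of rounds so that the process cannot run forever. The names `LinkLoopSketch`, `run` and `maxRoundsPerPass` are assumptions, and a pass is modelled as any function from the current state to the next state.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical driver loop, not the actual SDLink.java.
final class LinkLoopSketch {

    /** Feeds the output of each round into the next, stopping each pass on no change or at the cap. */
    static <T> T run(T initial, List<UnaryOperator<T>> passes, int maxRoundsPerPass) {
        T state = initial;
        for (UnaryOperator<T> pass : passes) {              // one pass per blocking attribute
            for (int round = 0; round < maxRoundsPerPass; round++) {
                T next = pass.apply(state);
                if (next.equals(state)) {
                    break;                                   // no new clusters were formed
                }
                state = next;
            }
        }
        return state;
    }
}
```

The cap is the guard against an infinitely long process: even if a pass never converges, it is abandoned after `maxRoundsPerPass` rounds.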
How to… : • Using this basic framework, you can implement your own ideas • E.g. A new clustering algorithm – • Write the code and just plug it into distance calculator class • Make sure not to disturb existing functionality • Be purely object oriented • Check the new algorithm’s output
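One way to add a clustering algorithm without disturbing existing functionality is to hide it behind a small interface. The sketch below is an assumption about how such a plug-in point could look; the slides do not say sdlink defines one. `ClusteringAlgorithmSketch` and `ThresholdClusteringSketch` are hypothetical names, and the example algorithm is plain threshold-based single-linkage clustering: rows end up in the same cluster if they are connected by a chain of pairs whose distance is at most the threshold.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plug-in point, not an interface defined by sdlink.
interface ClusteringAlgorithmSketch {
    /** Groups the row indices 0..n-1 of one block, given their pairwise distances. */
    List<List<Integer>> cluster(double[][] distances);
}

// Example plug-in: single-linkage clustering with a fixed distance threshold.
final class ThresholdClusteringSketch implements ClusteringAlgorithmSketch {
    private final double threshold;

    ThresholdClusteringSketch(double threshold) {
        this.threshold = threshold;
    }

    @Override
    public List<List<Integer>> cluster(double[][] distances) {
        int n = distances.length;
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) {
            parent[i] = i;                                   // every row starts in its own cluster
        }
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (distances[i][j] <= threshold) {
                    parent[find(parent, i)] = find(parent, j); // merge the two clusters
                }
            }
        }
        Map<Integer, List<Integer>> groups = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            groups.computeIfAbsent(find(parent, i), root -> new ArrayList<>()).add(i);
        }
        return new ArrayList<>(groups.values());
    }

    /** Union-find root lookup with path halving. */
    private static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }
}
```

A new algorithm then only needs its own class implementing the interface, and its output can be checked on a small distance matrix before it is run on real blocks.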
How to… : • This code is available on Macbeth (but with no version control so far…) • We will have a version control system such as SVN, where multiple developers can check out and check in code… • To reduce risk, we can add separate methods and classes without touching existing code.
Future Work: • Version Control System • Generate proper output files • Implement and test various clustering algorithms • Develop graphical user interface • And much more…
References: • TAILOR: A Record Linkage Toolbox (2002). Mohamed Elfeky, Vassilios Verykios, Ahmed Elmagarmid. • A Glass Box Approach for Linking Administrative Records. PI: Gale Boyd, Co-PI: Wayne Gray and Hye-Chung Kum.