
Guided Data Repair

Guided Data Repair. Mohamed Yakout#, Ahmed K. Elmagarmid*, Jennifer Neville#, Mourad Ouzzani*, Ihab F. Ilyas*. #Purdue University, West Lafayette, Indiana, USA. *Qatar Computing Research Institute, Qatar Foundation, Doha, Qatar. Machine Learning Seminar, Fall 2011.





Presentation Transcript


  1. Guided Data Repair

    Mohamed Yakout#, Ahmed K. Elmagarmid*, Jennifer Neville#, Mourad Ouzzani*, Ihab F. Ilyas*. #Purdue University, West Lafayette, Indiana, USA. *Qatar Computing Research Institute, Qatar Foundation, Doha, Qatar. Machine Learning Seminar, Fall 2011
  2. Data Quality Real data has problems: entry errors, incomplete information, information extraction from text, data integration from heterogeneous sources, etc.
  3. Data Quality Problems manifest themselves in: duplicate records, violations of integrity constraints (e.g., zip code determines city), records with missing values, misalignment of attribute values, etc.
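The constraint above (zip code determines city) is a functional dependency; a minimal sketch of how such a violation could be detected (field names and sample values are illustrative, not from the paper's datasets):

```python
from collections import defaultdict

def fd_violations(records, lhs, rhs):
    """Group records by the left-hand-side attribute and report any group
    whose members disagree on the right-hand-side attribute."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec[lhs]].add(rec[rhs])
    return {key: vals for key, vals in seen.items() if len(vals) > 1}

# Hypothetical records: the same zip appears with two city spellings.
records = [
    {"zip": "46360", "city": "MICHIGAN CITY"},
    {"zip": "46360", "city": "MICHIGAN CTY"},
    {"zip": "46825", "city": "FORT WAYNE"},
]
print(fd_violations(records, "zip", "city"))
```

Each reported group is a set of conflicting values for one zip code; any repair must make the group agree on a single city.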
  4. Data Quality Problems - Example (sample records annotated with inconsistent values, duplicate records, and missing values)
  5. Data Repair The Process of “Correcting” Data Problems An Essential Step in the Traditional ETL Process: the “T” Has Been the Focus of Multiple Research Communities: Statistics Machine Learning Database Theory Business Intelligence
  6. Approaches: Automatic Repair (example figure: inconsistent city values repaired to MICHIGAN CITY, a duplicate record deleted, and missing values filled automatically)
  7. Approaches: Involve an Expert The expert identifies all problems and corrects them in a consistent way. Interactive systems help explore the problems and let the expert manually specify transformations (e.g., AJAX, Potter's Wheel). Time consuming; not scalable for large datasets.
  8. Outline GDR framework. Updates generation. Ranking groups of updates according to benefit estimation. Active learning from user feedback. Experimental results. Conclusion.
  9. Guided Data Repair A novel approach that combines the scalability of automatic approaches with the fidelity of expert-based approaches. It uses automatic techniques for problem identification and for suggesting cleaning updates, benefit estimation to prioritize quality problems and form user questions, and an iterative cleaning process that converges to the clean data state. Active learning piggybacks learning on the expert interaction.
  10. GDR – System Architecture Demo in SIGMOD 2010
  11. GDR – Updates Generation
  12. GDR – Updates Generation 1 : 2 : Suggested Update: replace City “FORT WAYNE” with “Westville” in t5 Automatic techniques rely on heuristics to decide RISKY Suggested Update: replace Zip “46391” with “46825” in t5
  13. GDR – Updates Generation Contextual grouping for the suggested updates. Update group g1: the city should be "Michigan City" for {t2, t3, t4}. Update group g2: the zip should be "46825" for {t5, t8}. …
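A minimal sketch of this contextual grouping, keying each suggested update by its target attribute and suggested value so that one user answer can verify several tuples at once (the tuple representation is illustrative, not the paper's exact data structure):

```python
from collections import defaultdict

def group_updates(updates):
    """Group suggested updates (tuple_id, attribute, suggested_value)
    that share the same attribute and suggested value."""
    groups = defaultdict(list)
    for tid, attr, value in updates:
        groups[(attr, value)].append(tid)
    return dict(groups)

updates = [
    ("t2", "city", "MICHIGAN CITY"),
    ("t3", "city", "MICHIGAN CITY"),
    ("t4", "city", "MICHIGAN CITY"),
    ("t5", "zip", "46825"),
    ("t8", "zip", "46825"),
]
print(group_updates(updates))
```

The two groups produced here correspond to g1 and g2 on the slide.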
  14. GDR – Ranking Updates
  15. GDR – Ranking Updates Contextual grouping for the suggested updates. Seeking user feedback for g1 is more beneficial to the data quality: it contains updates that are more likely to be correct, and a higher number of correct updates allows fast convergence to better quality. Update group g1: the city should be "Michigan City" for {t2, t3, t4}. Update group g2: the zip should be "46825" for {t5, t8}. …
  16. GDR – Ranking Updates In decision theory, VOI (Value of Information) is a means of quantifying the potential benefit of determining the true value of some unknown (here, user feedback): define a loss (or utility) function for actions, then compare the loss before and after an action to help make decisions.
  17. GDR – Ranking Updates We need to define a DQ loss function L. Given a group of updates c = {r1, r2, …}, where pj is the probability that rj is correct, the DQ benefit from verifying c is the expected reduction in quality loss when the correct updates in c are applied; the quality loss is measured w.r.t. the rules and the clean database Dopt. Challenges: we know neither pj nor Dopt.
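One way to read this benefit score, assuming each update rj in a group comes with a correctness probability pj and a known loss reduction if applied (all numbers below are made up for illustration):

```python
def group_benefit(probs, loss_deltas):
    """Expected quality gain from verifying a group of updates: the sum of
    each update's correctness probability times the loss reduction its
    application would yield. A VOI-style sketch; the paper's actual loss
    function is defined w.r.t. the rules and the clean database Dopt."""
    return sum(p * d for p, d in zip(probs, loss_deltas))

# g1: three likely-correct updates; g2: two uncertain ones.
g1 = group_benefit([0.9, 0.85, 0.8], [1.0, 1.0, 1.0])
g2 = group_benefit([0.5, 0.4], [1.0, 1.0])
assert g1 > g2  # ask the user about g1 first
```

Groups whose updates are both numerous and likely correct get the highest score, matching the intuition on slide 15.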
  18. GDR – Ranking Updates We estimate pj using the prediction probability obtained from the learning component.
  19. GDR – Ranking Updates
  20. GDR – Active Learning Learning from user feedback: there can be correlations between the attribute values and the correct updates. Example: when SRC = H2, the CT attribute is incorrect most of the time; this could help reject any suggested updates for the ZIP and consider only updates for the CT. Modeling these correlations with a machine learning algorithm can help minimize user involvement.
  21. GDR – Active Learning We learn a classifier for each attribute. A training example is a tuple containing the original record values, the suggested replacement, the distance between the original and suggested values, and a label indicating whether the suggested value is correct. For example, a training example for the city attribute: <Tom, H2, REDWOOD DR, WESTVILLE, IN, 46360, MICHIGAN CITY, 0.7, CORRECT> (original record values, suggested city, distance, label).
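A sketch of how such a training example might be encoded for the city classifier (the helper name and field layout are illustrative assumptions, not the paper's exact encoding):

```python
def make_example(record, suggestion, similarity, verdict):
    """Flatten one piece of user feedback into a feature vector and a
    binary label for the attribute's classifier: the original record's
    values, the suggested replacement, a string-similarity score, and
    the user's verdict as the label."""
    features = list(record.values()) + [suggestion, similarity]
    label = 1 if verdict == "CORRECT" else 0
    return features, label

record = {"name": "Tom", "src": "H2", "street": "REDWOOD DR",
          "city": "WESTVILLE", "state": "IN", "zip": "46360"}
x, y = make_example(record, "MICHIGAN CITY", 0.7, "CORRECT")
print(x, y)
```

This reproduces the <Tom, H2, REDWOOD DR, WESTVILLE, IN, 46360, MICHIGAN CITY, 0.7, CORRECT> example from the slide as a (features, label) pair.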
  22. GDR – Active Learning Active learning is used when unlabeled instances are plentiful but labeling examples for training is costly: acquire feedback for the instances that would most strengthen the learned model, i.e., rank updates by uncertainty. We used a Random Forest model, a set of decision trees forming a committee; the uncertainty can be quantified as the entropy of the fractions of predicted labels. Example with five trees T1–T5: for update r1 the committee splits 1-vs-4, for r2 it splits 2-vs-3. Uncertainty(r1) = -1/5 log(1/5) - 4/5 log(4/5) = 0.72; Uncertainty(r2) = -2/5 log(2/5) - 3/5 log(3/5) = 0.97 (base-2 logs), so r2 is the more uncertain update.
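The committee-entropy numbers on this slide can be reproduced directly (base-2 logarithm; the vote lists encode the 1-vs-4 and 2-vs-3 splits from the example):

```python
import math

def committee_uncertainty(votes):
    """Entropy of the fractions of labels predicted by a committee of
    decision trees; higher entropy means a more uncertain update."""
    n = len(votes)
    return -sum((votes.count(v) / n) * math.log2(votes.count(v) / n)
                for v in set(votes))

u1 = committee_uncertainty([1, 0, 0, 0, 0])  # 1-vs-4 split
u2 = committee_uncertainty([1, 1, 0, 0, 0])  # 2-vs-3 split
print(round(u1, 2), round(u2, 2))  # 0.72 0.97
assert u2 > u1  # r2 would be ranked first for feedback
```

An update on which the trees disagree most (a near-even split) maximizes this entropy and is therefore the most valuable one to show the user.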
  23. Experiments Data and ground truth: Dataset 1: 20,000 records of patients' personal and address information, repaired manually using address-lookup web sites to obtain a ground truth. Dataset 2: the Adult dataset from UCI, 23,000 records. Rules (CFDs): for Dataset 1, the rules specified during the manual cleaning process; for Dataset 2, we implemented a CFD discovery technique. User simulation: by consulting the ground-truth dataset. Data quality metric: the improvement in data quality, measured as the reduction in the loss L(D) (here we know Dopt).
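Since Dopt is known in the experiments, the loss can be measured against it; a minimal cell-wise version (the paper's L is defined w.r.t. the rules, so this is a simplified sketch with made-up sample data):

```python
def cell_loss(db, ground_truth):
    """Count the attribute values in db that differ from the manually
    repaired ground truth Dopt; the reported quality improvement is the
    reduction in this loss as user-verified updates are applied."""
    return sum(1 for rec, gold in zip(db, ground_truth)
               for attr in gold if rec[attr] != gold[attr])

dirty = [{"city": "FORT WAYNE", "zip": "46391"},
         {"city": "WESTVILLE", "zip": "46391"}]
clean = [{"city": "WESTVILLE", "zip": "46391"},
         {"city": "WESTVILLE", "zip": "46391"}]
print(cell_loss(dirty, clean))  # 1
```

Applying the confirmed city repair to the first record would drive this loss to zero, i.e., a 100% quality improvement on this toy instance.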
  24. Evaluating: VOI Ranking Amount of feedback, as a percentage of the maximum number of updates required by a technique.
  25. Overall Evaluation Amount of feedback, as a percentage of the number of initially identified dirty records.
  26. Overall Evaluation
  27. Conclusion GDR guides the user to focus their effort on inspecting the updates that would improve quality fastest, while the user guides the system to automatically repair the data. It is a novel combination of decision theory and active learning as a new application for data repair. We presented GDR as an end-to-end framework for interactive data cleaning that provides fast convergence to better DQ. We are currently studying better ways to model the dependencies between the suggested updates, and how to leverage the uncertainty of the user input.
  28. Thank you

  29. Experiments: Results