Overcoming the Quality Curse

This presentation covers improving data quality through better entity resolution and data cleaning. It explains why the standard feature-based similarity approach yields poor results (the "Quality Curse") and describes two ways past it: exploiting relationships among entities already present in the data, and exploiting external evidence such as Web search engine statistics. Applications include people search on the Web and person identification in surveillance video. It closes with the cost of these techniques and two directions for reducing it: pay-as-you-go and query-driven data cleaning.

Presentation Transcript


  1. Overcoming the Quality Curse. Sharad Mehrotra, University of California, Irvine. Collaborators/Students (current): Dmitri Kalashnikov, Yasser Altowim, Hotham Altwaijry, Jeffrey Xu, Liyan Zhang. Alumni: Stella Zhaoqi Chen, Rabia Nuray-Turan, Virag Kothari.

  2. Beyond DASFAA 2003 paper. Directions since DASFAA 2003: Improving Quality; Improving Efficiency; New Domains. Data types: video data, image data, speech data, sensor data. Search tasks: entity search, people search, location search.

  3. Data Cleaning: a vital component of the Enterprise Data Processing Workflow. Workflow stages: Data Sources (OLTP, point of sale, organizational customer data), ETL with Data Cleaning, Data, Analysis/Mining, Decisions. Quality levels: Quality of Data, Quality of Analysis, Quality of Decisions. • Historical data analyses • Trends, patterns, rules, models, .. • Long term strategies • Business decisions. Quality(Data) → Quality(Decisions)

  4. Entity Resolution Problem Real World Digital World

  5. Standard Approach to Entity Resolution. Deciding whether two references u and v co-refer by analyzing their features. Example: u = (J. Smith, Feature 2, Feature 3, js@google.com), v = (John Smith, Feature 2, Feature 3, sm@yahoo.com). s(u,v) = f(u,v): “similarity function”, “feature-based similarity” (if s(u,v) > t then u and v are declared to co-refer)
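
The thresholded similarity on slide 5 can be sketched as follows. This is a minimal illustration, not the method from the talk: the feature names (`name`, `email`), the weights, the threshold value, and the use of `difflib.SequenceMatcher` as the per-feature string similarity are all assumptions introduced for the example.

```python
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Per-feature similarity in [0, 1] (assumed choice: difflib ratio)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity(u: dict, v: dict, weights: dict) -> float:
    """s(u, v) = f(u, v): weighted average of per-feature similarities."""
    total = sum(weights.values())
    return sum(w * string_sim(u[k], v[k]) for k, w in weights.items()) / total


def co_refer(u: dict, v: dict, weights: dict, t: float) -> bool:
    """Declare u and v co-referent when s(u, v) exceeds the threshold t."""
    return similarity(u, v, weights) > t


# Hypothetical records modeled on the slide's example values.
u = {"name": "J. Smith", "email": "js@google.com"}
v = {"name": "John Smith", "email": "sm@yahoo.com"}
weights = {"name": 0.6, "email": 0.4}  # illustrative weights, not from the talk

print(similarity(u, v, weights), co_refer(u, v, weights, t=0.8))
```

The decision depends entirely on the chosen features, weights, and t, which is the weakness that slide 7 goes on to illustrate.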

  6. Measuring Quality of Entity Resolution • Entity dispersion: for an entity, the number of clusters its representations are spread across; ideal is 1 • Cluster diversity: for a cluster, the number of distinct entities it contains; ideal is 1 • Measures: F-Measure; B-Cubed F-Measure; Variation of Information (VI); Generalized Merge Distance (GMD); …
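
As a rough sketch of the quantities on slide 6, the code below computes entity dispersion, cluster diversity, and the B-Cubed F-measure from a ground-truth assignment and a predicted clustering. The function names and the toy data are assumptions for illustration; B-Cubed is implemented in its usual per-reference form (average precision and recall over references, combined by harmonic mean).

```python
from collections import defaultdict


def dispersion_and_diversity(truth: dict, pred: dict):
    """truth maps reference -> entity id; pred maps reference -> cluster id."""
    clusters_of_entity = defaultdict(set)
    entities_of_cluster = defaultdict(set)
    for ref, entity in truth.items():
        clusters_of_entity[entity].add(pred[ref])
        entities_of_cluster[pred[ref]].add(entity)
    dispersion = {e: len(cs) for e, cs in clusters_of_entity.items()}
    diversity = {c: len(es) for c, es in entities_of_cluster.items()}
    return dispersion, diversity


def b_cubed_f(truth: dict, pred: dict) -> float:
    """B-Cubed F-measure: harmonic mean of averaged per-reference P and R."""
    refs_of_entity = defaultdict(set)
    refs_of_cluster = defaultdict(set)
    for ref in truth:
        refs_of_entity[truth[ref]].add(ref)
        refs_of_cluster[pred[ref]].add(ref)
    precision = recall = 0.0
    for ref in truth:
        same_cluster = refs_of_cluster[pred[ref]]
        same_entity = refs_of_entity[truth[ref]]
        correct = len(same_cluster & same_entity)  # always >= 1 (ref itself)
        precision += correct / len(same_cluster)
        recall += correct / len(same_entity)
    precision /= len(truth)
    recall /= len(truth)
    return 2 * precision * recall / (precision + recall)


# Toy example: entity E1 is split over two clusters, cluster C2 mixes two entities.
truth = {"a": "E1", "b": "E1", "c": "E2"}
pred = {"a": "C1", "b": "C2", "c": "C2"}
dispersion, diversity = dispersion_and_diversity(truth, pred)
print(dispersion, diversity, b_cubed_f(truth, pred))
```

A perfect clustering gives dispersion 1 for every entity, diversity 1 for every cluster, and a B-Cubed F-measure of 1.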

  7. The Quality Curse: Why the Standard “Feature-based” Approach Leads to Poor Results. Example references: (a) “S Mehrotra has joined the faculty at University of Illinois. He received his PhD from UT, Austin. He got his bachelors from IIT, Kanpur in India.” (b) “Photo Collection of Sharad Mehrotra from Beijing, China, June 2007 SIGMOD Trip.” (c) “S. Mehrotra, PhD from University of Illinois is visiting UT, Austin to give a talk on prefetching on multiprocessor machines. He received his bachelors from India.” (d) “Sharad Mehrotra, research interests: data management, Professor, UC Irvine.” Result: significant entity dispersion; significant cluster diversity.

  8. Overcoming the Quality Curse (1): Look more carefully at the data for additional evidence

  9. Exploiting Relationships among Entities
     Author table (clean):
       A1, ‘Dave White’, ‘Intel’
       A2, ‘Don White’, ‘CMU’
       A3, ‘Susan Grey’, ‘MIT’
       A4, ‘John Black’, ‘MIT’
       A5, ‘Joe Brown’, unknown
       A6, ‘Liz Pink’, unknown
     Publication table (to be cleaned):
       P1, ‘Databases . . . ’, ‘John Black’, ‘Don White’
       P2, ‘Multimedia . . . ’, ‘Sue Grey’, ‘D. White’
       P3, ‘Title3 . . .’, ‘Dave White’
       P4, ‘Title5 . . .’, ‘Don White’, ‘Joe Brown’
       P5, ‘Title6 . . .’, ‘Joe Brown’, ‘Liz Pink’
       P6, ‘Title7 . . . ’, ‘Liz Pink’, ‘D. White’
     • Context Attraction Principle (CAP): Nodes that are more connected have a higher chance of co-referring to the same entity
     (Figure: ER graph)
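
The Context Attraction Principle on slide 9 can be illustrated with the slide's own tables. The sketch below is a deliberate simplification and not the connection-strength model from the cited papers: it assumes an undirected graph over authors, affiliations, and publications, treats ‘Sue Grey’ as already resolved to A3, leaves ‘unknown’ affiliations out of the graph, and scores each candidate by the inverse of its shortest-path distance to the publication containing the ambiguous ‘D. White’.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Undirected adjacency sets from a list of (node, node) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def distance(graph, src, dst):
    """Shortest-path length by breadth-first search, or None if unreachable."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None


def resolve(graph, context, candidates):
    """Pick the candidate most strongly connected to the context node.

    Connection strength here is 1 / shortest distance (an assumed stand-in).
    """
    scores = {}
    for cand in candidates:
        d = distance(graph, context, cand)
        scores[cand] = 1.0 / d if d else 0.0
    return max(scores, key=scores.get), scores


# Edges taken from the slide; the unresolved 'D. White' references are omitted.
edges = [
    ("A1", "Intel"), ("A2", "CMU"), ("A3", "MIT"), ("A4", "MIT"),
    ("P1", "A4"), ("P1", "A2"), ("P2", "A3"), ("P3", "A1"),
    ("P4", "A2"), ("P4", "A5"), ("P5", "A5"), ("P5", "A6"), ("P6", "A6"),
]
graph = build_graph(edges)

best_p2, scores_p2 = resolve(graph, "P2", ["A1", "A2"])
best_p6, scores_p6 = resolve(graph, "P6", ["A1", "A2"])
print(best_p2, scores_p2)
print(best_p6, scores_p6)
```

In this toy graph, P2 reaches Don White (A2) through Susan Grey, MIT, John Black, and P1, and P6 reaches him through Liz Pink, P5, Joe Brown, and P4, while Dave White (A1) is not connected to either publication.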

  10. Exploiting Relationships for ER Ph.D. Thesis, Stella Chen • Formalizing the CAP principle [SDM 05, IQIS 05] • Scaling to large graphs [TODS 06] • Self-Tuning [DASFAA 07, JCDL 07, Journal IQ 11] • Not all relationships are equal • E.g., mutual interest in Bruce Lee movies possibly not as important as being colleagues at a university for predicting co-authorship. • Merging relationship evidence with other evidences [SIGMOD ‘09] • Applying to People search on Web [ICDE ‘07, TDKE 08, ICDE 09 (demo)]

  11. Effectiveness of Exploiting Relationships (figure panels: WEPS, Multimedia)

  12. Smart Video Surveillance. Camera array in the CS Building at UC Irvine to track human activities. Components: Video collection; Surveillance Video Database; Semantic Extraction; Event Database; Query/Analysis.

  13. Event Model. Event model: when, what, who, where, other property. Query examples: Who was the last visitor to Mike Carey’s office yesterday? Who spends more time in labs: database students or embedded computing students? Event extraction steps: temporal placement, activity recognition, face recognition, localization. Components: Surveillance Video Database; Semantic Extraction; Event Database; Query/Analysis.

  14. Person Identification Challenge. Event model: when, what, who, where, other property. Event extraction steps: temporal placement, activity recognition, face recognition, localization, other. Person identification: who? (e.g., Bob? Alice?)

  15. Traditional Approach: Face Detection, then Face Recognition. Poor performance: detects 70 faces / 1000 images; 2~3 images / person.

  16. Rationale for Poor Performance. Poor quality of data: no faces, small faces, low resolution, low temporal resolution (1 frame/sec). Chart panels: resolution; sampling rate. At original resolution and 1 frame/sec: original performance. At 1/2 original resolution and 1/2 frame/sec: drops to 53% and 70%. At 1/3 original resolution and 1/3 frame/sec: drops to 30% and 35%.

  17. Effectiveness of Exploiting Relationships (figure panels: WEPS, Multimedia) [IQ2S PERCOM 2011]

  18. Results on Face Clustering [ACM ICMR 2013 Best Paper Award]

  19. Results. High precision, 662 clusters: 31 real persons, 631 merges. High precision, 203 clusters: 31 real persons, 172 merges. (4 times)

  20. Overcoming the Quality Curse (2): Look outside the box

  21. Exploiting Search Engine Statistics. Google search results of “Andrew McCallum”. • Correlations amongst context entities provide an additional source of information to resolve entities. Context entities: Sebastian Thrun, Machine Learning, Text Retrieval; Tom Mitchell, CRF, UAI 2003. Search engine queries to learn correlations amongst contexts: Sebastian Thrun AND Tom Mitchell; Andrew McCallum AND Sebastian Thrun AND Tom Mitchell; (Machine Learning OR Text Retrieval) AND (CRF OR UAI 2003); Andrew McCallum AND (Machine Learning OR Text Retrieval) AND (CRF OR UAI 2003); Andrew McCallum AND Sebastian Thrun AND (CRF OR UAI 2003)
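
The query shapes on slide 21 can be generated mechanically from the contexts of two result pages. The sketch below only builds query strings and does not call any search engine. The split of each context into `people` and `concepts` is an assumption read off the slide's example, and of the eight combinations it enumerates, only five appear on the slide; the other three are filled in by symmetry and should be treated as hypothetical.

```python
def or_group(terms):
    """Join alternatives with OR, parenthesized when there is more than one."""
    if len(terms) == 1:
        return terms[0]
    return "(" + " OR ".join(terms) + ")"


def build_queries(name, ctx_a, ctx_b):
    """Build co-occurrence queries for two page contexts of the same name.

    For every (kind in page A, kind in page B) pair, emit the query with and
    without the ambiguous name, so hit counts can be compared.
    """
    queries = []
    for kind_a in ("people", "concepts"):
        for kind_b in ("people", "concepts"):
            pair = or_group(ctx_a[kind_a]) + " AND " + or_group(ctx_b[kind_b])
            queries.append(pair)
            queries.append(name + " AND " + pair)
    return queries


# Context entities as they appear on the slide; the grouping is assumed.
page_a = {"people": ["Sebastian Thrun"],
          "concepts": ["Machine Learning", "Text Retrieval"]}
page_b = {"people": ["Tom Mitchell"],
          "concepts": ["CRF", "UAI 2003"]}

queries = build_queries("Andrew McCallum", page_a, page_b)
for q in queries:
    print(q)
```

Comparing the hit counts of a query with and without the name is one way such correlations could be estimated; how the talk's method actually combines them is not stated on the slide.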

  22. Exploiting Web Search Engine Statistics. Ph.D. Thesis, Rabia Nuray. • Web queries to learn correlations [SIGIR 08] • Application to Web People Search [WePS 09] • Cluster refinement to overcome the singleton cluster problem [TODS 11-a] • Making Web querying robust to server-side fluctuations [tech. report] • Scaling up the Web query technique [TODS 11-a]

  23. Comparing with the State-of-the-art on WEPS-2 Dataset

  24. Observation/Conclusion… • Additional evidence can be exploited to improve data quality • BUT… it is expensive!! • Example: Web Queries Approach • Number of queries: 4K^2 (~40K for K = 100 results) • Too many to submit to a search engine and expect real-time results • ~6-8 minutes (network costs, search engine load) • Solutions: • Local caching of the Web • Ask only important queries • Reduces to 1-2 min. without degrading quality much

  25. (Near) Future: Addressing the Efficiency Curse… (Directions since DASFAA 2003: Improving Quality; Improving Efficiency; New Domains.) Two complementary approaches: • Pay-as-you-go data cleaning: progressive algorithm to obtain the best quality given a budget constraint • Query-driven data cleaning: perform minimal cleaning to answer the query/analysis task; avoid cleaning unnecessary data.
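
One way to picture the pay-as-you-go idea on slide 25 is a greedy loop that spends a fixed budget of pairwise resolutions on the most promising candidate pairs first. This is a minimal sketch under stated assumptions and not the progressive algorithm from the talk: the per-pair benefit estimates, the unit cost per resolution, the `resolve` callback, and the union-find merging are all introduced here for illustration.

```python
from collections import defaultdict


def progressive_resolve(pairs, budget, resolve):
    """Resolve candidate pairs in decreasing estimated benefit until the
    budget (number of resolve calls) is spent.

    pairs:   list of (estimated_benefit, u, v) tuples (hypothetical estimates)
    resolve: callable(u, v) -> bool, the expensive co-reference decision
    Returns (clusters over the references touched, number of calls spent).
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    spent = 0
    for _, u, v in sorted(pairs, key=lambda p: p[0], reverse=True):
        if spent >= budget:
            break
        if find(u) == find(v):
            continue  # already merged transitively, no cost incurred
        spent += 1
        if resolve(u, v):
            parent[find(u)] = find(v)

    clusters = defaultdict(set)
    for x in list(parent):
        clusters[find(x)].add(x)
    return list(clusters.values()), spent


# Toy data: a, b, c are the same entity; d is a different one.
pairs = [(0.9, "a", "b"), (0.8, "b", "c"), (0.7, "a", "c"), (0.1, "c", "d")]
same = lambda u, v: {u, v} <= {"a", "b", "c"}

small_clusters, small_spent = progressive_resolve(pairs, budget=2, resolve=same)
full_clusters, full_spent = progressive_resolve(pairs, budget=10, resolve=same)
print(small_clusters, small_spent)
print(full_clusters, full_spent)
```

With a budget of two calls the loop already recovers the {a, b, c} cluster, since the highest-benefit pairs are resolved first and the third pair is implied by transitivity.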
