This talk examines how to improve data quality through better entity resolution and data cleaning, and how data quality in turn determines the quality of analysis and decision making. It surveys approaches that exploit additional evidence, relationships among entities, and web search-engine statistics, with applications to video, image, speech, and sensor data as well as people search. It also characterizes the Quality Curse, explains why the standard feature-based approach yields poor results, and outlines techniques for overcoming it, together with the efficiency challenges those techniques raise.
Overcoming the Quality Curse
Sharad Mehrotra, University of California, Irvine
Collaborators/Students (current): Dmitri Kalashnikov, Yasser Altowim, Hotham Altwaijry, Jeffrey Xu, Liyan Zhang
Alumni: Stella Zhaoqi Chen, Rabia Nuray-Turan, Virag Kothari
Beyond the DASFAA 2003 paper: three directions
• Improving quality
• Improving efficiency
• New domains: video data, image data, speech data, sensor data; entity search (people search, location search)
Data Cleaning: a vital component of the enterprise data processing workflow
• Pipeline: data sources (OLTP, point of sale, organizational customer data) → ETL → data → analysis/mining → decisions
• Historical data analyses yield trends, patterns, rules, models, ...; these drive long-term strategies and business decisions
• The quality of data determines the quality of analysis, which in turn determines the quality of decisions: Quality(Data) → Quality(Decisions)
• Data cleaning raises Quality(Data)
The Entity Resolution Problem: matching references in the digital world to the real-world entities they denote.
Standard Approach to Entity Resolution
Deciding whether two references u and v co-refer (e.g., "J. Smith" with js@google.com vs. "John Smith" with sm@yahoo.com) by analyzing their features:
s(u, v) = f(u, v), a feature-based similarity function; if s(u, v) > t, then u and v are declared to co-refer. A small sketch of this decision rule follows.
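Below is a minimal, hedged Python sketch of this rule: a weighted feature-based similarity s(u, v) compared against a threshold t. The feature names, weights, string-similarity choice (difflib's SequenceMatcher), and threshold value are illustrative assumptions, not the specific function used in the talk.

```python
# Feature-based entity resolution sketch: s(u, v) = f(features of u and v);
# declare co-reference when s(u, v) > t. All concrete choices below are illustrative.
from difflib import SequenceMatcher


def string_sim(a: str, b: str) -> float:
    """Simple string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def similarity(u: dict, v: dict, weights: dict) -> float:
    """Weighted average of per-feature string similarities."""
    total = sum(weights.values())
    return sum(w * string_sim(u.get(f, ""), v.get(f, "")) for f, w in weights.items()) / total


u = {"name": "J. Smith", "email": "js@google.com"}
v = {"name": "John Smith", "email": "sm@yahoo.com"}
weights = {"name": 0.7, "email": 0.3}  # hypothetical feature weights

t = 0.8  # decision threshold
s = similarity(u, v, weights)
print(f"s(u, v) = {s:.2f} -> {'co-refer' if s > t else 'not co-refer'}")
```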
Measuring Quality of Entity Resolution
• Entity dispersion: for an entity, into how many clusters its references are scattered (ideal is 1)
• Cluster diversity: for a cluster, how many distinct entities it contains (ideal is 1)
• Standard measures: F-measure, B-Cubed F-measure, Variation of Information (VI), Generalized Merge Distance (GMD), ...
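As a quick illustration of the first two notions, here is a hedged Python sketch that computes entity dispersion and cluster diversity from (true entity, assigned cluster) pairs; the sample data is made up.

```python
# Entity dispersion: number of clusters an entity's references are spread over (ideal 1).
# Cluster diversity: number of distinct entities mixed into one cluster (ideal 1).
from collections import defaultdict


def dispersion_and_diversity(assignments):
    """assignments: iterable of (true_entity, cluster_id) pairs, one per reference."""
    entity_to_clusters = defaultdict(set)
    cluster_to_entities = defaultdict(set)
    for entity, cluster in assignments:
        entity_to_clusters[entity].add(cluster)
        cluster_to_entities[cluster].add(entity)
    dispersion = {e: len(cs) for e, cs in entity_to_clusters.items()}
    diversity = {c: len(es) for c, es in cluster_to_entities.items()}
    return dispersion, diversity


refs = [("Sharad Mehrotra", "c1"), ("Sharad Mehrotra", "c2"),  # one entity split over two clusters
        ("John Smith", "c3"), ("Jane Smith", "c3")]            # one cluster mixing two entities
print(dispersion_and_diversity(refs))
```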
The Quality Curse: why the standard "feature-based" approach leads to poor results
Example: four web references to the same person:
• "S. Mehrotra has joined the faculty at University of Illinois. He received his PhD from UT, Austin. He got his bachelors from IIT, Kanpur in India."
• "Photo collection of Sharad Mehrotra from Beijing, China. June 2007 SIGMOD trip."
• "S. Mehrotra, PhD from University of Illinois, is visiting UT, Austin to give a talk on prefetching on multiprocessor machines. He received his bachelors from India."
• "Sharad Mehrotra, research interests: data management. Professor, UC Irvine."
Feature-based similarity over such references yields significant entity dispersion and significant cluster diversity.
Overcoming the Quality Curse (1): look more carefully at the data for additional evidence
Exploiting Relationships among Entities
Author table (clean):
• A1, 'Dave White', 'Intel'
• A2, 'Don White', 'CMU'
• A3, 'Susan Grey', 'MIT'
• A4, 'John Black', 'MIT'
• A5, 'Joe Brown', unknown
• A6, 'Liz Pink', unknown
Publication table (to be cleaned):
• P1, 'Databases ...', 'John Black', 'Don White'
• P2, 'Multimedia ...', 'Sue Grey', 'D. White'
• P3, 'Title3 ...', 'Dave White'
• P4, 'Title5 ...', 'Don White', 'Joe Brown'
• P5, 'Title6 ...', 'Joe Brown', 'Liz Pink'
• P6, 'Title7 ...', 'Liz Pink', 'D. White'
References and their relationships form an ER graph.
• Context Attraction Principle (CAP): nodes that are more connected have a higher chance of co-referring to the same entity (a sketch follows).
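A minimal sketch of the CAP on the tables above: build the ER graph and compare how strongly the ambiguous 'D. White' reference in P6 is connected to Don White (A2) versus Dave White (A1). Using shortest-path distance as a proxy for connection strength is an illustrative simplification; the actual measures appear in the papers cited on the next slide.

```python
# CAP sketch: the ambiguous 'D. White' in P6 is graph-connected to Don White (A2)
# via Liz Pink -> Joe Brown -> P4, but not to Dave White (A1), so CAP favors Don White.
# Shortest-path distance is a crude stand-in for connection strength.
from collections import defaultdict, deque


def build_graph(edges):
    g = defaultdict(set)
    for a, b in edges:
        g[a].add(b)
        g[b].add(a)
    return g


def distance(g, src, dst):
    """BFS shortest-path length; smaller distance = stronger attraction."""
    dist, frontier = {src: 0}, deque([src])
    while frontier:
        node = frontier.popleft()
        if node == dst:
            return dist[node]
        for nxt in g[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                frontier.append(nxt)
    return float("inf")


edges = [("P1", "A4"), ("P1", "A2"), ("P2", "A3"), ("P2", "DWhite@P2"),
         ("P3", "A1"), ("P4", "A2"), ("P4", "A5"), ("P5", "A5"),
         ("P5", "A6"), ("P6", "A6"), ("P6", "DWhite@P6")]
g = build_graph(edges)
print(distance(g, "DWhite@P6", "A2"))  # finite: connected through P5/P4 co-authors
print(distance(g, "DWhite@P6", "A1"))  # inf: no path links this reference to Dave White
```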
Exploiting Relationships for ER (Ph.D. thesis, Stella Chen)
• Formalizing the CAP principle [SDM 05, IQIS 05]
• Scaling to large graphs [TODS 06]
• Self-tuning [DASFAA 07, JCDL 07, Journal IQ 11]: not all relationships are equal; e.g., a mutual interest in Bruce Lee movies is likely less important than being colleagues at a university for predicting co-authorship
• Merging relationship evidence with other evidence [SIGMOD 09]
• Applying to people search on the Web [ICDE 07, TKDE 08, ICDE 09 (demo)]
Effectiveness of Exploiting Relationships (results on the WePS and multimedia datasets)
Smart Video Surveillance (CS building at UC Irvine)
A camera array tracks human activities; the surveillance video database feeds semantic extraction, which populates an event database used for query and analysis.
Event Model
Semantic extraction over the surveillance video database populates an event database in which each event records:
• who: face recognition
• what: activity recognition
• when: temporal placement
• where: localization
• other properties
Query examples:
• Who was the last visitor to Mike Carey's office yesterday?
• Who spends more time in labs: database students or embedded-computing students?
A sketch of such an event record and query follows.
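The sketch below illustrates this event model with toy event records and the first example query; all field values, timestamps, and activity labels are made up for illustration.

```python
# Toy event model: who / what / when / where plus other properties,
# and a query for the last visitor to a given office on a given day.
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class Event:
    who: str        # identity from face recognition (possibly uncertain)
    what: str       # recognized activity
    when: datetime  # temporal placement
    where: str      # localization, e.g. a room
    props: dict = field(default_factory=dict)  # other extracted properties


events = [
    Event("Alice", "enter-office", datetime(2011, 4, 21, 14, 5), "Mike Carey's office"),
    Event("Bob", "enter-office", datetime(2011, 4, 21, 16, 40), "Mike Carey's office"),
    Event("Alice", "enter-lab", datetime(2011, 4, 21, 17, 10), "DB lab"),
]


def last_visitor(events, place, day):
    """Return who last appeared at `place` on day `day`, if anyone did."""
    visits = [e for e in events if e.where == place and e.when.date() == day]
    return max(visits, key=lambda e: e.when).who if visits else None


print(last_visitor(events, "Mike Carey's office", datetime(2011, 4, 21).date()))  # -> Bob
```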
Person Identification Challenge
Event extraction fills in what (activity recognition), when (temporal placement), where (localization), and other properties, but the "who" field remains uncertain: is the observed person Alice or Bob? Person identification is the open question.
Traditional Approach
Face detection followed by face recognition performs poorly on surveillance video: only about 70 faces are detected per 1000 images, with only 2-3 usable images per person.
Rationale for Poor Performance
The underlying data is of poor quality: many frames contain no faces or only small, low-resolution faces, and the temporal resolution is low (about 1 frame/sec). Recognition performance falls sharply as resolution and sampling rate decrease: at 1/2 the original rate it drops to roughly 53-70% of the original performance, and at 1/3 the rate to roughly 30-35%.
Effectiveness of Exploiting Relationships (results on the WePS and multimedia datasets) [IQ2S PerCom 2011]
Results
• Face recognition alone (high precision): 662 clusters for 31 real persons, 631 merges needed.
• With relationship evidence (high precision): 203 clusters for 31 real persons, 172 merges needed (roughly a 4-fold improvement).
Overcoming the Quality Curse (2): look outside the box
Exploiting Search Engine Statistics
Google search results for "Andrew McCallum" contain context entities such as Sebastian Thrun, Tom Mitchell, Machine Learning, Text Retrieval, CRF, and UAI 2003. Correlations among these context entities provide an additional source of information for resolving entities. Search-engine queries are issued to learn correlations among contexts, e.g.:
• Sebastian Thrun AND Tom Mitchell
• Andrew McCallum AND Sebastian Thrun AND Tom Mitchell
• (Machine Learning OR Text Retrieval) AND (CRF OR UAI 2003)
• Andrew McCallum AND (Machine Learning OR Text Retrieval) AND (CRF OR UAI 2003)
• Andrew McCallum AND Sebastian Thrun AND (CRF OR UAI 2003)
A small sketch of turning such co-occurrence statistics into a correlation score follows.
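The sketch below shows one hedged way such co-occurrence statistics could be turned into a correlation score: a hypothetical hit_count() stands in for an actual search-engine call, and the PMI-style formula and canned counts are illustrative assumptions, not the exact measure from the thesis.

```python
# Correlation of two context entities from (stubbed) search-engine hit counts.
# hit_count() is a placeholder for a real search API; the counts are made up.
import math


def hit_count(query: str) -> int:
    """Placeholder for the number of search results returned for `query`."""
    canned = {
        '"Andrew McCallum" AND "Sebastian Thrun"': 12_000,
        '"Andrew McCallum"': 150_000,
        '"Sebastian Thrun"': 400_000,
    }
    return canned.get(query, 1)


def correlation(term_a: str, term_b: str, total_docs: float = 1e10) -> float:
    """Pointwise-mutual-information style score: high when terms co-occur more than chance."""
    p_joint = hit_count(f'"{term_a}" AND "{term_b}"') / total_docs
    p_a = hit_count(f'"{term_a}"') / total_docs
    p_b = hit_count(f'"{term_b}"') / total_docs
    return math.log(p_joint / (p_a * p_b))


print(correlation("Andrew McCallum", "Sebastian Thrun"))  # strongly positive => correlated contexts
```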
Exploiting Web Search Engine Statistics (Ph.D. thesis, Rabia Nuray-Turan)
• Web queries to learn correlations [SIGIR 08]
• Application to Web People Search [WePS 09]
• Cluster refinement to overcome the singleton-cluster problem [TODS 11-a]
• Making Web querying robust to server-side fluctuations [tech. report]
• Scaling up the Web query technique [TODS 11-a]
Observation/Conclusion
• Additional evidence can be exploited to improve data quality, BUT it is expensive!
• Example, the web-queries approach: the number of queries is 4K² (about 40K queries for K = 100 results), far too many to submit to a search engine and still expect real-time results; it takes roughly 6-8 minutes (network costs, search-engine load).
• Solutions: local caching of the Web, and asking only the important queries; this reduces the time to 1-2 minutes without degrading quality much (see the sketch below).
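A hedged sketch of those two cost reductions: memoizing query results as a stand-in for local web caching, and ranking candidate queries so only a budgeted subset is issued. The ranking heuristic and the stubbed hit count are illustrative assumptions.

```python
# Two cost reductions for the web-query approach: (1) cache query results locally,
# (2) issue only the most promising queries under a budget. Both pieces are stand-ins.
from functools import lru_cache


@lru_cache(maxsize=None)
def cached_hit_count(query: str) -> int:
    """Repeated queries are served from the local cache instead of the search engine."""
    return 1  # placeholder for an actual (expensive) search-engine call


def select_queries(candidate_queries, importance, budget):
    """Keep only the `budget` highest-importance queries instead of all ~4K^2."""
    return sorted(candidate_queries, key=importance, reverse=True)[:budget]


queries = ["Sebastian Thrun AND Tom Mitchell",
           "Andrew McCallum AND Sebastian Thrun",
           "Andrew McCallum AND (CRF OR UAI 2003)"]
top = select_queries(queries, importance=lambda q: len(q), budget=2)  # toy importance score
print([cached_hit_count(q) for q in top])
```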
(Near) Future: Addressing the Efficiency Curse
(Roadmap: improving quality, improving efficiency, and new domains beyond DASFAA 2003.) Two complementary approaches:
• Pay-as-you-go data cleaning: a progressive algorithm that obtains the best quality achievable under a budget constraint (sketched below).
• Query-driven data cleaning: perform the minimal cleaning needed to answer the query or analysis task, preventing the need to clean unnecessary data.
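As a rough illustration of the pay-as-you-go idea, here is a hedged sketch of a progressive cleaning loop that spends a fixed budget on the candidate pairs with the highest estimated benefit; the benefit estimate and resolve step are placeholders, not the actual algorithms under development.

```python
# Pay-as-you-go sketch: resolve candidate pairs in decreasing order of estimated
# benefit until the budget is exhausted, so limited effort yields the best quality gain.
import heapq


def progressive_clean(candidate_pairs, estimated_benefit, resolve, budget):
    """Resolve the most promising pairs first, stopping when the budget runs out."""
    heap = [(-estimated_benefit(p), p) for p in candidate_pairs]
    heapq.heapify(heap)
    decisions, spent = {}, 0
    while heap and spent < budget:
        _, pair = heapq.heappop(heap)
        decisions[pair] = resolve(pair)  # the expensive comparison (e.g., web queries)
        spent += 1
    return decisions


pairs = [("J. Smith", "John Smith"), ("D. White", "Don White"), ("D. White", "Dave White")]
result = progressive_clean(pairs,
                           estimated_benefit=lambda p: len(set(p[0]) & set(p[1])),  # toy score
                           resolve=lambda p: True,                                   # toy decision
                           budget=2)
print(result)
```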