350 likes | 449 Views
Data Quality Aware Query Systems. N. Khodabandehloo Yeganeh, S. Sadiq, K. Deng, X. Xue School of Information Technology and Electrical Engineering The University of Queensland. Understanding Data Quality. Garbage-in Garbage-out !. ???.
E N D
Data Quality Aware Query Systems N. Khodabandehloo Yeganeh, S. Sadiq, K. Deng, X. Xue School of Information Technology and Electrical Engineering The University of Queensland
Understanding Data Quality Garbage-in Garbage-out ! ??? I am sorry Mrs Smith, but our records show that you are dead! "A One-House, $400 Million Bubble Goes Pop," New York Times, February 15, 2006 by Gretchen Ruethling. A house valued at $121,900 skyrocketed to $400 million after someone "most likely hit a wrong key and changed the figure in the county's computer system." The article goes on, "The error probably happened when someone tried to check the value of a house in a county computer program that showed real estate values, but then accidentally gained access to a different program without authorization, said James Murphy, the county treasurer. The inflated value ...was used to calculate tax rates and led the county to expect $8 million in property taxes that did not exist." www.dataqualitysolutions.com 2
Historical Perspective • Data as Product Data IS the product ... Intention and usage of data is no longer aligned in current large scale information systems Wang, Storey & Firth (1995) A Framework for Data Quality research. IEEE TKDE Vol.7/4, Aug, 1995.
Changing Nature of DQ Data IS the product ... Intention and usage of data is no longer aligned in current large scale information systems No absolute data quality metrics ... Diversity of data usage makes quality definitions application/usage specific and feedback driven Data is shared ... A central control of data quality is not possible
Data Life Cycle Design requirements and usage benefits may not have explicit connections Diversity in data format, scale and usage unprecedented Data IS the product No absolute data quality metrics Data is being integrated, more than it is being designed Data is shared
Related Works • Problems
Related Works Data Quality Dimensions [SMB05] [WSF95] [WSB96] • Accuracy (Erroneous) • Postcode “4107” is typed “4017” • Consistency (Inconsistent) • ITEE Vs. Information Technology and Electrical Engineering • Completeness (Missing) • Students don’t have to declare a major till graduation, so major is missing in most enrolments • Currency (Obsolete) • Old phone numbers • Accessibility (Unavailable) • Server down, privacy concerns • Reliability & Trust (Uncertainty)
Related Works Related Computational DQ Problems • Record Linkage. Different records representing a single real-life. • Approximate matching [GIJ01], record linkage [JIK03] and entity resolution techniques [BGS08] • Constraints. Maintain the semantic integrity of data. • Semantic Integrity Constraints [CFG07] [BWG07] • Uncertainty. How complete or certain is the data. • Probabilistic data bases [LLR97] • Provenance or lineage of data is to identify the pedigree of a data object. • Annotation and non annotation approaches [SPG05]
Data Quality Aware Query Systems • Architecture • Quality Aware Components: • Quality Aware Queries (Source Selection
Quality Aware Source Selection • Example: Virtual Shop Google product search returns 91345 results for “Cannon Powershot”. Is user going to check all results to find the best source of information that matches his own requirements? How can we help user to find what he really wants?
Quality Aware Source Selection • Multiple Sources of Information
Quality Aware Source Selection • Multiple Sources of Information • In a virtual shop user query can be answered from various sources of information (virtual shops). • Each source is accessed through a common schema as below: • Item title • Item Description • Numbers Available • Price • Tax • User Comments
Quality Aware Source Selection • Multiple Sources of Information • Phases to answer the query: • Select k best sources. • Run the query against selected sources and generate the results. • If more results are required, continue step 2 with next k’ sources. • Our approaches uses source quality and user preferences on data quality for phase 1.
Quality Aware Source Selection • User Requirements – Querying for different reasons • Learn about different items (products) - such a user may not care about he “Numbers Available” and “Tax” columns. “Price” is somewhat important to the user although obsoleteness and inaccuracy in price values can be tolerated. However, consistency of “Item Title" and completeness within the populations of “User Comments" in the query results, is of highest importance. • Compare prices - thus user is sure about the item to purchase but is searching for the best price. Obviously “Price” and “Tax” fields have the greatest importance in this case. They should be current and accurate. "Numbers Available" is also important although slight inaccuracies in this column are acceptable as any number more than 1 will be sufficient. Which source should be selected? Selection of a good source for data is subjected to what does the term “good" mean to the user
Quality Aware Source Selection • Following Questions Should be Answered • How to obtain information about the generic data quality of data sources? • How to model and capture user specific data quality requirements? • How to conduct the quality aware query and rank results based on the quality aware query? Which source should be selected? Selection of a good source for data is subjected to what does the term “good" mean to the user
Quality Aware Queries Communication Network DQ Profiling DQ Profiling DQ Profiling Org n Org 1 Org 2 DB DB DB M M M Quality Aware Metadata • Schema + Profiling
Quality Aware Metadata • Schema + Profiling
Quality Aware SQL • For purpose of simple illustration we allow using quality-aware metadata, to query any metric as [Column Name.Metric Name] as part of the SQL query formulation. • SELECT Title, Price FROM ShopItem WHERE Title.Completeness>0.8 • SELECT Title, Title.Accuracy, Price FROM ShopItem ORDER BY Price.Accuracy • ORDER BY Price.Accuracy models a one dimensional preference that indicates sources with higher price accuracy are preferred, a two dimensional preference can not be intuitively achieved.
Quality Aware SQL To prioritize preferences: E.g. from the sources with highest data quality, sources with high currency of price are prioritized over sources with high completeness of price. • Hierarchy Clause SELECT Title AS t, Price AS p, [User Comments] AS u FROM ShopItem WHERE ... HIERARCHY(ShopItem) p OVER (t,u) 7, u OVER (t) 3 HIERARCHY(ShopItem.p) p.Currency OVER (p.Completeness) 3 Generally: • HIERARCHY(a) a.x OVER (a.x',...) n Why Hierarchy? Intuitively human defines preferences as partial orders (pairs). E.g. I prefer cofee to tea.
Related Works • Preferences: • Preference theory [Wri63], logic [Man92] and reasoning [WD91] • Multi criteria decision making [Saa22] • Preferences in database [LL87]: • Preference queries [GJM00] • Skyline operator [BKS01] • Formulation [Kie02]
Processing Quality Aware SQL • AHP (Analytical Hierarchy Processes) • AHP Tree for buying a car • Define a hierarchy of objectives starting from the Goal. • Each level of hierarchy defines lower lever objectives. • Continue to the lowest level of hierarchy.
Processing Quality Aware SQL • AHP (Analytical Hierarchy Processes) • AHP Tree for the following query • SELECT Title AS t, Price AS p, [User Comments] AS u • FROM ShopItem WHERE ... • HIERARCHY(ShopItem) p OVER (t,u) 7, u OVER (t) 3 • HIERARCHY(ShopItem.p) p.Currency OVER (p.Completeness) 3 .
Processing Quality Aware SQL • AHP (Analytical Hierarchy Processes) • AHP weights for the following query • SELECT Title AS t, Price AS p, [User Comments] AS u • FROM ShopItem WHERE ... • HIERARCHY(ShopItem) p OVER (t,u) 7, u OVER (t) 3 • HIERARCHY(ShopItem.p) p.Currency OVER (p.Completeness) 3 .
Processing Quality Aware SQL • AHP (Analytical Hierarchy Processes) • Ranking Sources Using the Calculated Weights: • Simple Additive Weight Method
Processing Quality Aware SQL • Generating AHP Tree from Quality Aware SQL
Evaluation • Three different scenarios: • Making a list of products: price is not important. SELECT Title AS t, Price AS p, [User Comments] AS u FROM ShopItem AS s HIERARCHY(s) t OVER (p) 9, u OVER (p) 3 • Selecting a product to buy: price is very important. SELECT Title AS t, Price AS p, [User Comments] AS u FROM ShopItem AS s HIERARCHY(s) p OVER (t,u) 9 • Reviewing user comments: User comments are important. SELECT Title AS t, Price AS p, [User Comments] AS u FROM ShopItem AS s HIERARCHY(s) u OVER (p) 9, t OVER (p) 3
Evaluation • Three different source ranking methods: * F. Naumann, J.C. Freytag, and M. piliopoulou. Qualitydriven source selection using Data Envelopment Analysis. Proc. of the 3rd Conference on InformationQuality (IQ), Cambridge, MA, 1998.
Future Works • Remember the research challenges? • How to obtain information about the generic data quality of data sources? • How to model and capture user specific data quality requirements? • How to conduct the quality aware query?
Future Works • How to obtain information about the generic data quality of data sources? • Smart ProfilingMethods to update data profiling (e.g. view updating and data warehouse updating approaches) and generate smarter profiles using complex business rules. • How to model and capture user specific data quality requirements? • How to conduct the quality aware query?
Future Works • How to obtain information about the generic data quality of data sources? • How to model and capture user specific data quality requirements? • Model Analysis And VerificationHierarchy Clause should handle inconsistencies in user requirements : Check the consistency of user preferences and repair it if possible. • How to conduct the quality aware query?
Future Works • How to obtain information about the generic data quality of data sources? • How to model and capture user specific data quality requirements? • How to conduct the quality aware query? • Improve Source Selection and RankingResulting quality of columns for a joined table is different from the column quality of individual tables. Query plans should be ranked instead of sources.
Questions? End of presentation…
References [SMB05] M. Scannapieco, P. Missier, and C. Batini. Data quality at a glance. Datenbank-Spektrum, 14:6-14, 2005. [WSF95] RY Wang, VC Storey, and CP Firth. A framework for analysis of data quality research. Knowledge and Data Engineering, IEEE Transactions on, 7(4):623-640,1995. [WSB96] R.Y. Wang and D.M. Strong. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems, 12(4):5-33, 1996. [CFG07] G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: consistency and accuracy. Proceedings of the 33rd international conference on Very large data bases, pages 315-326, 2007. [BW07] P. Bohannon, F. Wenfei, F. GEERTS, J. Xibei, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. ICDE, 2007. [GIJ01] L. Gravano, P.G. Ipeirotis, H.V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate String Joins in a Database (Almost) for Free. Proceedings of the international conference on very large data bases, pages 491-500,2001. [GIK03] L. Gravano, P.G. Ipeirotis, N. Koudas, and D. Srivastava. Text Joins for Data Cleansing and Integration in an RDBMS. Proc. of Int. Conf. on Data Engineering (ICDE), 2003. [BGS08] O. Benjelloun, H. Garcia-Molina, Q. Su, and J. Widom. Swoosh: A generic approach to entity resolution. VLDB Journal, 2008.
References [SPG05] Y.L. Simmhan, B. Plale, and D. Gannon. A Survey of Data Provenance in eScience. SIGMOD RECORD, 34(3):31, 2005. [LLR97] L.V.S. Lakshmanan, N. Leone, R. Ross, and VS Subrahmanian. ProbView: aexible probabilistic database system. ACM Transactions on Database Systems (TODS), 22(3):419-469, 1997. [Wri63] G.H. von Wright. The Logic of Preference. Edinburgh University Press, 1963. [Man92] Suryanarayana Murthy Mantha. First-order preference theories and their applications. PhD thesis, Mathematics, Salt Lake City, UT, USA, 1992. [WD91] M.P. Wellman and J. Doyle. Preferential semantics for goals. In Proceedings of the National Conference on Artificial Intelligence, pages 698-703, 1991. [Saa96] T.L. Saaty. Multicriteria Decision Making: The Analytic Hierarchy Process: Planning, Priority Setting, Resource Allocation. RWS Publications, 1996. [IL87] M. Lacroix and P. Lavency. Preferences: Putting More Knowledge into Queries. In Proceedings of the 13th International Conference on Very Large Data Bases, pages 217{225. Morgan Kaufmann Publishers Inc. San Francisco, CA, USA, 1987. [GJM00] K. Govindarajan, B. Jayaraman, and S. Mantha. Preference Queries in Deductive Databases. New Generation Computing, 19(1):57-86, 2000. [BKS01] S. Borzsonyi, D. Kossmann, and K. Stocker. The skyline operator. In Proc. Of ICDE, pages 421-430, 2001. [Kie02] W. Kiebling. Foundations of preferences in database systems. In Proceedings of the 28th international conference on Very Large Data Bases, pages 311{322.VLDB Endowment, 2002.