EFFICIENT PROFILING FOR ESTIMATION OF QUERY RESULT QUALITY
Naiem K. Yeganeh, University of Queensland, naiem@itee.uq.edu.au
Shazia Sadiq, University of Queensland, shazia@itee.uq.edu.au
Mohamed A. Sharaf, University of Queensland, m.sharaf@uq.edu.au
Ke Deng, University of Queensland, k.deng@uq.edu.au
Executive Summary/Abstract: Data quality (DQ) profiles consist of statistical measurements about the quality of data sets. Query systems can use DQ profiles as a form of metadata to estimate the quality of a query result set. Traditional DQ profiling provides an estimate of the overall quality of a data set or data source, but the quality of a query result can differ remarkably from the overall quality of the data set, because the conditions within a query typically select a subset of the data. In this paper we propose an efficient conditional DQ profiling method that can estimate the quality of the result set of a given query with a guaranteed, user-definable level of accuracy.
Objectives of this presentation
• Study the need to profile the quality of data sets.
• Discuss data quality profiling and its position in the literature.
• Propose conditional data quality profiling as a tool for estimating the quality of query results.
• Propose methods for generating a conditional data quality profile more efficiently.
Scenario
What is the quality of the data below?
To answer this question consistently, we need to be more specific:
• Which attribute's quality do we want to measure?
• What do we mean by quality, that is, which aspect of quality do we want to measure?
A data quality metric is a measurement of a specific data quality dimension for a specific part of the data (e.g., a specific attribute such as Price).
Definition: Data Quality (DQ) Metric and Data Quality Profile
• Data Quality Metric: a statistical measurement of a data quality dimension for a specific attribute over a data set. For example: completeness of Price.
• Data Quality Profile: a data set (metadata) that contains statistical information about the quality of another data set. It usually contains values (or aggregated values) for different metrics.
[Figure: Data Quality Services assumption; labels: Data, Data Quality Service, DQ Metric]
Scenario
What is the quality of the data below? More specifically: what is the completeness of all attributes for the data below?
A simple definition of the completeness metric:
Completeness(x) = if x is null then return 0 else return 1
[Tables: Data Set “ShoppingItems”; DQ Profile for the Data Set “ShoppingItems”]
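The completeness metric and a traditional (whole-data-set) profile can be sketched in a few lines of Python. This is only an illustration: the rows below are hypothetical toy data, not the contents of the original “ShoppingItems” table, and the function names are made up for this sketch.

```python
from typing import Optional

# Hypothetical toy rows; the original "ShoppingItems" table is not reproduced here.
ROWS = [
    {"Brand": "Cannon", "Model": "SLR", "Price": 900, "Image": "c1.jpg"},
    {"Brand": "Cannon", "Model": "Normal", "Price": None, "Image": None},
    {"Brand": "Sony", "Model": "SLR", "Price": 1200, "Image": None},
    {"Brand": "Sony", "Model": None, "Price": 300, "Image": "s2.jpg"},
]


def completeness(x: Optional[object]) -> int:
    """Return 0 if the value is null (None), otherwise 1."""
    return 0 if x is None else 1


def traditional_profile(rows):
    """Average completeness of each attribute over the whole data set."""
    attributes = rows[0].keys()
    return {
        a: sum(completeness(r[a]) for r in rows) / len(rows)
        for a in attributes
    }


profile = traditional_profile(ROWS)
print(profile)
```

Each attribute gets a single number for the whole data set, which is exactly what makes this kind of profile too coarse for individual queries.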
Scenario
Be more specific, or: how can the quality of query results be estimated?
• What is the quality of the data below?
• What is the completeness of all attributes for the data below?
• What is the completeness of Image for Sony cameras?
A DQ profile of this kind cannot estimate the quality of query results. We call it a traditional DQ profile and propose the conditional DQ profile in contrast.
[Tables: Data Set “ShoppingItems”; DQ Profile for the Data Set “ShoppingItems”]
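A small sketch of why the data-set-wide figure can mislead for a specific query. The rows are hypothetical and chosen only to make the gap visible; they are not taken from the original table.

```python
# Hypothetical (Brand, Image) pairs; None marks a missing image.
ROWS = [
    ("Sony", "img1.jpg"), ("Sony", "img2.jpg"), ("Sony", "img3.jpg"),
    ("Cannon", None), ("Cannon", None), ("Cannon", "img4.jpg"),
]


def image_completeness(rows):
    """Fraction of rows whose Image value is not null."""
    return sum(1 for _, image in rows if image is not None) / len(rows)


# Traditional profile: one number for the whole data set (4 of 6 images present).
overall = image_completeness(ROWS)

# Ground truth for the query "completeness of Image for Sony cameras".
sony_only = image_completeness([r for r in ROWS if r[0] == "Sony"])

print(f"overall={overall:.2f} sony={sony_only:.2f}")
```

Here the overall completeness is about 0.67 while the Sony subset is fully complete, so answering the Sony query from the traditional profile would be off by a third.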
Scenario: Conditional DQ Profile
• A DQ profile that can be used to estimate metric results for every query with conjunctive selection conditions.
• One conditional DQ profile for each metric.
Legend: attributes B = Brand, M = Model, P = Price, I = Image; Brand values C = Cannon, S = Sony; Model values S = SLR, N = Normal; Price values H = High, L = Low.
[Tables: Data Set (T); one conditional DQ profile for Completeness of I (Image) for T]
Scenario: Sample Queries
• What is the Completeness of I where B=C and M=S?
• What is the Completeness of I where P=H?
Method: Create Conditional DQ Profile (Brute Force)
Search every possible conjunctive selection condition.
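The brute-force step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the table is hypothetical toy data, "_" is used as the "any value" marker, and only conditions matched by at least one row are recorded.

```python
from collections import defaultdict
from itertools import product

WILDCARD = "_"  # stands for "no condition on this attribute"

# Hypothetical toy table T: (Brand, Model, Price, Image); None = missing image.
ROWS = [
    ("C", "S", "H", "a.jpg"),
    ("C", "S", "H", "b.jpg"),
    ("C", "S", "L", None),
    ("C", "N", "L", None),
    ("S", "S", "H", "c.jpg"),
    ("S", "N", "L", "d.jpg"),
]


def brute_force_profile(rows):
    """Map every conjunctive condition matched by at least one row to
    (number of matching rows, average completeness of Image)."""
    totals = defaultdict(lambda: [0, 0])  # pattern -> [rows, rows with image]
    for b, m, p, image in rows:
        # Each row contributes to all 2^3 generalizations of its own values.
        for keep in product((True, False), repeat=3):
            pattern = tuple(
                v if k else WILDCARD for v, k in zip((b, m, p), keep)
            )
            totals[pattern][0] += 1
            totals[pattern][1] += 0 if image is None else 1
    return {pat: (n, done / n) for pat, (n, done) in totals.items()}


profile = brute_force_profile(ROWS)
print(profile[("C", "S", "_")])  # condition B=C and M=S
print(profile[("C", "_", "_")])  # condition B=C
print(len(profile))
```

Even for this six-row toy table the profile holds 23 records, more than the table itself, because every row is counted under all eight generalizations of its attribute values.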
Issue
The conditional DQ profile may become bigger than the data set itself! Hence, its size should be reduced.
Remove a record from the DQ profile if its DQ metric value lies within a specific range of the value of its superset; this range is called the certainty threshold.
• B=C and M=S -> completeness of I = 0.66
• B=C -> completeness of I = 0.50
SLR Cannon cameras have about the same completeness of Image as all Cannon cameras.
Analogy: an image reduced to only 16 colors and a lower resolution is much smaller but still conveys enough correct information.
Error distribution is not always random. Different subsets of the data set may have different error distributions.
Issue
The conditional DQ profile may become bigger than the data set itself! Hence, its size should be reduced.
Epsilon (ε, certainty threshold) and Tau (τ, minimum-set threshold)
A reduced conditional DQ profile may lose some information, i.e., it may not be able to estimate some queries correctly.
[Table: Reduced conditional DQ profile with thresholds τ=2 and ε=0.2]
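One way to read the reduction step is sketched below. The slides do not spell out the exact rule, so two assumptions are made here and flagged in the code: a record is dropped when fewer than τ rows support it, and a record counts as redundant when its value lies within ε of at least one immediate generalization (the same condition with one attribute relaxed). The profile values are hypothetical, apart from the 0.66 and 0.50 figures quoted on the slides.

```python
WILDCARD = "_"


def parents(pattern):
    """Immediate generalizations: one more attribute replaced by the wildcard."""
    for i, value in enumerate(pattern):
        if value != WILDCARD:
            yield pattern[:i] + (WILDCARD,) + pattern[i + 1:]


def reduce_profile(profile, epsilon, tau):
    """Drop records with little support or little information (assumed rule)."""
    reduced = {}
    for pattern, (count, quality) in profile.items():
        if count < tau:
            continue  # assumption: tau is the minimum number of matching rows
        redundant = any(
            p in profile and abs(quality - profile[p][1]) <= epsilon
            for p in parents(pattern)
        )  # assumption: within epsilon of any immediate superset
        if not redundant:
            reduced[pattern] = (count, quality)
    return reduced


# Hypothetical profile over (Brand, Model, Price): pattern -> (count, quality).
profile = {
    ("_", "_", "_"): (6, 0.67),
    ("C", "_", "_"): (4, 0.50),
    ("S", "_", "_"): (2, 1.00),
    ("C", "S", "_"): (3, 0.66),
    ("C", "N", "_"): (1, 0.00),
}
reduced = reduce_profile(profile, epsilon=0.2, tau=2)
print(sorted(reduced))
```

With τ=2 and ε=0.2 only the all-wildcard record and the Sony record survive: the Cannon SLR record is within 0.2 of the Cannon record, the Cannon record is within 0.2 of the overall value, and the Cannon Normal record has a single supporting row.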
Method: Generation and Optimization of Conditional DQ Profile
• Create a conditional DQ profile.
• Reduce the size of the profile using the certainty and minimum-set thresholds.
• Return to the DQ profile those records that cannot be estimated from the reduced conditional DQ profile.
Method: Querying Conditional DQ Profile
The query
    SELECT * FROM D WHERE Brand = 'Cannon' AND Model = 'SLR'
translates to
    SELECT TOP 1 #, Q FROM T
    WHERE (Brand = 'Cannon' OR Brand = '_')
      AND (Model = 'SLR' OR Model = '_')
      AND (Price = '_')
    ORDER BY Brand, Model, Price
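The lookup that this translated query performs can be mimicked in Python as a rough sketch. Assumptions: the profile is held as a dictionary from (Brand, Model, Price) patterns to a (count, quality) pair, standing in for the `#` and `Q` columns; "_" marks an unconstrained attribute; and concrete values are preferred over "_" attribute by attribute, which is what `ORDER BY Brand, Model, Price` yields only if the collation sorts concrete values before "_". The profile contents are hypothetical.

```python
WILDCARD = "_"
ATTRIBUTES = ("Brand", "Model", "Price")


def estimate(profile, query):
    """Return (count, quality) of the first matching profile record, or None."""
    # Attributes the query does not constrain must be wildcards in the record.
    wanted = tuple(query.get(a, WILDCARD) for a in ATTRIBUTES)
    matches = [
        pattern for pattern in profile
        if all(v == w or v == WILDCARD for v, w in zip(pattern, wanted))
    ]
    if not matches:
        return None
    # Prefer concrete values over wildcards, attribute by attribute.
    best = min(matches, key=lambda pat: tuple(v == WILDCARD for v in pat))
    return profile[best]


# Hypothetical reduced profile: pattern -> (count, quality).
profile = {
    ("_", "_", "_"): (6, 0.67),
    ("Cannon", "_", "_"): (4, 0.50),
    ("Sony", "_", "_"): (2, 1.00),
}

print(estimate(profile, {"Brand": "Cannon", "Model": "SLR"}))
print(estimate(profile, {"Price": "High"}))
```

The first query falls back from the missing Cannon SLR record to the Cannon record; the second has no price-specific record at all and falls back to the all-wildcard record.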
Evaluation: Effectiveness of DQ Estimation
[Figure: Comparison of the average estimation error for the traditional DQ profile (DQP) and the conditional DQ profile with different certainty thresholds (ε=0.05 to ε=0.45), for different variations in the distribution of dirty data d.]
Evaluation: Effect of the distribution of dirty data
[Figure: (a) Effect of variation in the distribution of dirty data d on profile size. (b) Effect of error distribution on profile generation time.]
Evaluation: Effect of certainty and minimum-set thresholds
[Figure: (a) Effect of the certainty threshold ε on the size of the DQ profile for different minimum-set thresholds τ. (b) Effect of τ on the percentage of correct estimations made using the DQ profile.]
Evaluation: Effect of input data size
[Figure: Scalability graphs. (a) Generated profile size versus number of records in the data set. (b) Profile generation time versus database size.]
Next Steps: Further Work
• Maintenance of the conditional data quality profile: data quality changes over time, and the DQ profile should remain valid.
• Scalability: the scalability of conditional DQ profiles can be improved further. For example, sampling techniques can help make conditional DQ profiles more scalable.