A Black-Box Approach to Query Cardinality Estimation

A Black-Box Approach to Query Cardinality Estimation Tanu Malik, Randal Burns The Johns Hopkins University Nitesh V. Chawla Notre Dame University

The Black Box Approach • Estimate query result sizes without knowledge of • Underlying data distributions • Query execution plan • Machine learning techniques • Group queries into syntactic families (templates) • Learn in a high-dimension, complex input space • Attributes, operators, function arguments, aggregates • Partition input space • Learn regression functions in each partition • Self-tuning, self-correcting models • When compared with bottom-up estimation • Produces accurate, highly compact, and fast models • Lose ability to evaluate sub-plans

Are new techniques needed? • Working with federated and remote data sources • No access to data (privacy and performance concerns) • Many data sources (can’t keep estimates for all) • Our motivation: caching in federations • Ask the DB optimizer? • Other applications • Replica maintenance • Grid workflow • Distributed query schedulers

Astronomy Example • Typical query • User-defined functions • Mathematical expressions • Sample bottom-up plan • Many sub-estimates

The Spatial Function • Executed at the backend database • Data distribution and queries in attribute domains • Function computes a range query

Workload Observed at Cache • Point queries in 3-dimensional space • 2-d projection on attributes shown • Query result-size (log cardinality)

Learning • Query yields are k-means clustered into classes • Two-shown, typically 4-8

Learning • Query yields are k-means clustered into classes • Class boundaries and regression functions • Learning techniques: model trees, classification and regression, and locally-weighted regression

Virtues of the Black Box • No errors from modeling assumptions, because it makes no assumptions • Conditional independence • Join distributions • Accurate estimates for complex queries • User-defined functions • High-dimensional queries • Multi-way joins • Point queries • Performance (later)

Drawbacks of the Black Box • Semantic losses • Does not use indexes, uniqueness, constraints • When available, treat as exceptions • Not integrated with query execution plans • No sub-plan estimates • No what-if scenarios can be explored • Parallel execution • Operator re-ordering • Not naturally suited to the database optimization • It’s a middleware technique

Overview of Results • How many trees? • How accurate?

Space and Time • How big? • How fast?

A Black-Box Approach to Query Cardinality Estimation Tanu Malik, Randal Burns The Johns Hopkins University Nitesh V. Chawla Notre Dame University

Quick Comparison • Self-tuning histograms, e.g. STHoles, STGrid, others • Machine learning, self-tuning, based on observed workload • Produce an estimated data distribution • Histograms limited to range queries • Costing User-Defined Functions [He et al. 2005] • Estimate based on weighted nearest k-neighbors • Restricted to function arguments • Does not build a model • The Black Box approach • Data independent in both inputs and estimation • Rich input space: enumerated domains, operators, and aggregates • Compact models, summary data structures

A Black-Box Approach to Query Cardinality Estimation

A Black-Box Approach to Query Cardinality Estimation

Presentation Transcript

Black Box Checking

Black Box Testing

Black Box Testing

BLACK BOX TESTING

Black box testing

Black-box (oracle)

Black Box Testing

Black Box Electronics

Cardinality Estimation for Large-scale RFID Systems

Black Box

A Polyhedral Approach to Cardinality Constrained Optimization

Black Box Electronics

Cardinality

Matter as a Black Box

Income black box

Black-Box Testing

Black Box Testing

Black-box (oracle)