Interactive Data Exploration using Constraints

Alexander Kalinin, Ugur Cetintemel, Stan Zdonik

CP + DBMS for Data-Intensive Exploration

Interactive Data Exploration (IDE)
Where’s Horrible Gelatinous Blob? Where’s Waldo?
Searching for the “interesting” within big data
• Exploratory analysis: ad-hoc and repetitive
• Questions are not well defined
• “Interesting” can be complex
• Human-in-the-loop operation
• Fast, online results
• Query refinement

Exploratory Queries: Some Examples
• First-order
  • “Celestial 3-5° by 5-7° regions with brightness > 0.8”
• Higher-order
  • “Pairs of 2° by 2° celestial regions with similarity > 0.5”
• Optimized
  • “Celestial 3° by 7° region with maximum brightness”
Sloan Digital Sky Survey (SDSS)

“Celestial 3-5° by 5-7° regions with average brightness > 0.8” in SQL
• Divide the data into cells
• Enumerate all regions
• Final filtering (> 0.8)
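The three steps above can be sketched outside of SQL to show why exhaustive enumeration is expensive. This is only an illustration, not the query from the slide: it assumes the data has already been divided into a grid of equally weighted cells holding a brightness value, and the names `prefix_sums` and `bright_regions` are hypothetical.

```python
from itertools import product

def prefix_sums(grid):
    """2-D prefix sums, so the total of any rectangle is a constant-time lookup."""
    rows, cols = len(grid), len(grid[0])
    ps = [[0.0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            ps[r + 1][c + 1] = grid[r][c] + ps[r][c + 1] + ps[r + 1][c] - ps[r][c]
    return ps

def bright_regions(grid, heights, widths, threshold):
    """Enumerate every h-by-w region and keep those whose average exceeds threshold."""
    rows, cols = len(grid), len(grid[0])
    ps = prefix_sums(grid)          # step 1: the grid is the cell decomposition
    hits = []
    for h, w in product(heights, widths):       # step 2: enumerate all shapes...
        for r in range(rows - h + 1):           # ...at all positions
            for c in range(cols - w + 1):
                total = ps[r + h][c + w] - ps[r][c + w] - ps[r + h][c] + ps[r][c]
                avg = total / (h * w)
                if avg > threshold:             # step 3: final filtering
                    hits.append((r, c, h, w, avg))
    return hits
```

For the query above one would call it with `heights=range(3, 6)`, `widths=range(5, 8)` and `threshold=0.8`. Every candidate region is visited even when most of them cannot qualify, which is the behaviour the rest of the talk tries to avoid.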

DBMSs for IDE?
• No native support for exploratory constructs
  • No power set
  • No user-defined objective functions
• No support for interactivity
  • No online results
  • No notion of a “query session”

Data Exploration as a CP Problem
“Celestial 3-5° by 5-7° regions with average brightness > 0.8”
• Decision variables: left-most corner, lengths
• Constraints:
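The formulas from this slide did not survive in the transcript. One plausible way to write the formulation, offered only as an assumption, uses hypothetical symbols $x, y$ for the left-most corner, $l_x, l_y$ for the lengths, and $\operatorname{avg\_br}$ for the average brightness of the region:

```latex
\begin{align*}
\text{Decision variables:}\quad & x,\ y && \text{(left-most corner)} \\
                                & l_x,\ l_y && \text{(lengths)} \\
\text{Constraints:}\quad & 3 \le l_x \le 5, \qquad 5 \le l_y \le 7 \\
                         & \operatorname{avg\_br}(x, y, l_x, l_y) > 0.8
\end{align*}
```

Each assignment of the four variables identifies one candidate region, so the solver's search space is exactly the set of regions the query ranges over.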

CP Solvers
• Large variety of methods for exploring a search space
  • Branch-and-Cut
  • Large Neighborhood Search (LNS)
  • Randomized search with restarts
• Highly extensible: important for ad-hoc exploration!
  • New constraints/functions
  • New search heuristics
• But, compared with DBMSs:
  • In-memory data (CP) vs. efficient disk data handling (DBMS)
  • No I/O cost-awareness (CP) vs. cost-based query planning (DBMS)

SearchLight
• A fusion of CP solvers and DBMSs
  • The DBMS stores and maintains data
  • The CP solver explores the constrained search space
• SearchLight is a mediator
  • Extends CP solvers
  • Provides buffering, prefetching
  • Distributes the search
  • Makes CP solvers cost-aware
[Figure: SearchLight architecture. Components: Exploration Query; SearchLight (Metadata, Buffering); CP Solver (OR-tools, Gecode); DBMS (PostgreSQL, SciDB). Edge labels: Constraints/Functions; Search Heuristics; Data, schema info; Requests, Solutions; Data, estimates, decisions; Data requests, constraints.]

Research Issues
• A cost model for data-intensive CP
  • Each search decision has an I/O cost
• Mediation of data access
  • Metadata for guiding and optimizing search (annotated trees, samples, etc.)
  • Prefetching
• Distributed search
  • Multi-node parallel branch processing
• CP/DBMS integrated query planning
  • Propagating CP/schema constraints

Semantic Windows (SW)
• First step towards constraint-based exploration
• Supports first-order queries
• Exploration via multi-dimensional “windows of interest”
  • Shape-based constraints (“a 3-5° by 5-7° region”)
  • Content-based constraints (“avg_br() > 0.8”)
• Custom distributed cost-aware solver

SQL/CP Extensions for Data Exploration
SELECT lb(ra), rb(ra), lb(dec), rb(dec), avg(brightness)
FROM sdss
GRID BY ra BETWEEN 100 AND 300 STEP 1
        dec BETWEEN 5 AND 40 STEP 1
HAVING avg(brightness) > 0.8 AND
       size(ra) = 5 AND
       size(dec) >= 5 AND size(dec) <= 7

Cost-aware Solver
• Best-first search based on the utility
• Utility = f(benefit, cost)
• Benefit: how close a window is to satisfying the constraints
  • A distance between the constraint’s value and the estimated value
• Cost: how expensive it is to read a window from disk
  • Measured in the number of cells we have to read
  • Adjustments are made for skewed data
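A minimal sketch of utility-driven best-first search follows. The slide does not give the form of f, so the `utility` function here (benefit discounted by the fraction of cells to read) is purely an assumption, as are the window dictionaries and the `read_exact` callback standing in for a disk read.

```python
import heapq

def benefit(estimate, threshold):
    """1.0 when the estimate already passes; otherwise shrinks with the distance below."""
    return 1.0 if estimate > threshold else max(0.0, 1.0 - (threshold - estimate))

def utility(estimate, threshold, cells_to_read, total_cells):
    """Hypothetical f(benefit, cost): benefit discounted by the share of cells to read."""
    cost = cells_to_read / total_cells
    return benefit(estimate, threshold) * (1.0 - cost)

def best_first(windows, threshold, total_cells, read_exact):
    """Visit windows in decreasing utility; yield names of those whose exact value passes."""
    # heapq is a min-heap, so utilities are negated to pop the best window first.
    heap = [(-utility(w["estimate"], threshold, w["cells"], total_cells), i)
            for i, w in enumerate(windows)]
    heapq.heapify(heap)
    while heap:
        _, i = heapq.heappop(heap)
        window = windows[i]
        if read_exact(window) > threshold:   # the expensive read happens only here
            yield window["name"]
```

Because results are yielded as soon as they are confirmed, promising and cheap windows reach the user first. The sketch computes utilities once up front and does not model re-ranking during the search or the adjustments for skewed data.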

Optimizations
• Cost and benefit are estimated by sampling
• Objective function values are cached in a cell cache
  • Dynamic utility updates
  • Avoiding re-reads of the same cells
• Constraint-based pruning during the search
• Distributed search
  • Multiple nodes work in parallel
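The cell cache idea can be illustrated with a small sketch. It assumes per-cell values can be combined into a window average with equal weights; the class `CellCache` and the `read_cell` callback (standing in for a disk read) are hypothetical names, not part of the described system.

```python
class CellCache:
    """Remembers per-cell values so overlapping windows never re-read a cell."""

    def __init__(self, read_cell):
        self._read_cell = read_cell   # hypothetical callback that performs the disk read
        self._values = {}
        self.disk_reads = 0

    def get(self, cell):
        """Return the value of a cell, reading it from disk only on first access."""
        if cell not in self._values:
            self._values[cell] = self._read_cell(cell)
            self.disk_reads += 1
        return self._values[cell]

    def window_avg(self, r, c, h, w):
        """Average over the h-by-w window whose corner is at (r, c)."""
        cells = [(i, j) for i in range(r, r + h) for j in range(c, c + w)]
        return sum(self.get(cell) for cell in cells) / len(cells)

    def uncached_cells(self, r, c, h, w):
        """Cells of the window still on disk: its remaining read cost."""
        return sum((i, j) not in self._values
                   for i in range(r, r + h) for j in range(c, c + w))
```

A window that overlaps previously read windows has a lower remaining cost, which is one way a solver could update that window's utility dynamically as the cache fills.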

Adaptive Prefetching
• Dispersed reads hurt total performance
• Prefetching: read the neighborhood with every window
• Progress-driven prefetching: how much?
  • Finding new results? Prefetch a small amount
  • No new results? Increase the prefetch exponentially
[Figure: two panels, “No prefetching” (labels 1, 3, 4, 2) and “With prefetching” (labels 3, 1, 2, 4)]
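The progress-driven policy can be sketched as a small update rule. The concrete numbers are assumptions: the slide says only "a small amount" and "exponentially", so the `base`, `factor`, and `limit` parameters, and the choice to reset to `base` whenever new results appear, are illustrative.

```python
def next_prefetch(current, found_new_results, base=1, factor=2, limit=64):
    """Return the next prefetch size, in cells around each window."""
    if found_new_results:
        return base                        # still productive: prefetch a small amount
    return min(current * factor, limit)    # no progress: grow exponentially, capped

def prefetch_schedule(progress, base=1, factor=2, limit=64):
    """Replay a sequence of 'found new results?' flags and record each prefetch size."""
    size = base
    sizes = []
    for found in progress:
        size = next_prefetch(size, found, base, factor, limit)
        sizes.append(size)
    return sizes
```

For example, `prefetch_schedule([True, False, False, False, True, False])` gives `[1, 2, 4, 8, 1, 2]`: the prefetch grows while the search is unproductive and shrinks back once results start arriving again.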

Online vs. Total Performance Results
• 35 GB data set (part of the SDSS)
• 4 GB total memory (1 GB shared buffer)
• First results in 10-20 seconds

Conclusions
• Integrate CP and DBMS technologies
• SearchLight: Data-Intensive CP Engine
• Initial implementation: Semantic Windows
  • Cost-aware solver
  • Mediating disk access (sampling, prefetching)
  • Distributed search
• Current work:
  • OR-Tools as the CP solver
  • SciDB as the DBMS

Questions?
