Adaptive Loading for Efficient Query Processing in Flat File Systems

Here are my Data Files. Here are my Queries. Where are my Results?
Stratos Idreos*, Ioannis Alagiannis‡, Ryan Johnson§, Anastasia Ailamaki‡
§University of Toronto, ‡École Polytechnique Fédérale de Lausanne, *CWI, Amsterdam

CERN ($20B physics experiment)
• Last year: 35 PB!
• Experiments, simulation, user data…
• All stored in flat files
• Database only stores metadata
• Custom solutions & scripts
• Almost never a DBMS
Why???

Why don't people use a DBMS?
• Requirements analysis
• Define a schema
• Load the data
• Iterate to convergence
• Tune the system
Evolving requirements => no convergence

Data import & tuning
• Flat Files: massage data, load tuples into the Database
• DBMS owns the data now
• Why wait? Why a complete load?
• Which format? Hire a DB expert?
Not worth the startup cost

Avoiding up-front overheads
• Flat files are an integral part of the system
• Query over flat files
• Adaptive loads
• Tuning in background
(Diagram labels: Flat File, Hot data)
DBMS actions driven by workload

Adaptive loading
• Column Load (metadata: Loaded Columns: a2, a3)
• Partial Load (metadata: Loaded Parts: a2, a3)
• Full Load
(Diagram labels: Flat File, Metadata, Storage)
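
The column-load idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the system described in the talk: the class name `AdaptiveTable`, the comma-separated integer file layout, and the `file_scans` counter are all hypothetical. The sketch keeps a record of which columns are already loaded and touches the flat file only for columns a query needs and that are still missing.

```python
import csv
import io


class AdaptiveTable:
    """Hypothetical sketch: load a column from the flat file only when a query needs it."""

    def __init__(self, open_file, column_names):
        self._open_file = open_file      # callable returning a fresh text stream over the flat file
        self._names = list(column_names)
        self.loaded = {}                 # metadata and storage: column name -> list of values
        self.file_scans = 0              # number of times the flat file was read

    def columns(self, needed):
        """Return the requested columns, loading only those not yet in storage."""
        missing = [c for c in needed if c not in self.loaded]
        if missing:
            self._load(missing)
        return {c: self.loaded[c] for c in needed}

    def _load(self, missing):
        positions = {c: self._names.index(c) for c in missing}
        fresh = {c: [] for c in missing}
        with self._open_file() as f:
            for row in csv.reader(f):
                # Convert only the fields that belong to missing columns.
                for c, pos in positions.items():
                    fresh[c].append(int(row[pos]))
        self.loaded.update(fresh)
        self.file_scans += 1


data = "1,10,100\n2,20,200\n3,30,300\n"
table = AdaptiveTable(lambda: io.StringIO(data), ["a1", "a2", "a3"])
table.columns(["a2", "a3"])   # first query: scans the file, loads a2 and a3
table.columns(["a3"])         # second query: answered from loaded columns, no scan
```

After the two calls, `table.loaded` holds only `a2` and `a3`, mirroring the "Loaded Columns: a2, a3" metadata on the slide; `a1` stays in the flat file until some query asks for it.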

Dynamic file adaptation
a) Parse only needed columns
b) New flat file per attribute
• Analyze non-tokenized attributes
(Diagram labels: Original Flat File, New Flat Files)
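
Steps a) and b) might look roughly like the following Python sketch. The function name `adapt_file`, the `.col` file suffix, and the comma-separated layout are assumptions made for illustration; the slides do not specify them. The sketch tokenizes each line only as far as the last needed field and writes every needed attribute to its own new flat file.

```python
import os
import tempfile


def adapt_file(path, names, needed, out_dir):
    """Write one new flat file per needed attribute; return {attribute: new path}."""
    positions = {c: names.index(c) for c in needed}
    last = max(positions.values())
    paths = {c: os.path.join(out_dir, c + ".col") for c in needed}
    outputs = {c: open(p, "w") for c, p in paths.items()}
    try:
        with open(path) as src:
            for line in src:
                # Split only up to the last needed field; the tail of the line stays untokenized.
                fields = line.rstrip("\n").split(",", last + 1)
                for c, pos in positions.items():
                    outputs[c].write(fields[pos] + "\n")
    finally:
        for f in outputs.values():
            f.close()
    return paths


# Usage: split attribute a2 out of a three-attribute file.
work_dir = tempfile.mkdtemp()
original = os.path.join(work_dir, "R.csv")
with open(original, "w") as f:
    f.write("1,10,100\n2,20,200\n")
new_files = adapt_file(original, ["a1", "a2", "a3"], ["a2"], work_dir)
```

A later query on `a2` could then read the small `a2.col` file instead of re-parsing the original.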

Adaptive loading in practice
Query: select sum(a1), avg(a2) from R where a1<v1 and a2<v2
• Q1: Loading Cost + First Query
• Constant performance for all queries
• Q11: load from FF
• Filtering on-the-fly
• Q1: half the cost
• On-the-fly load
• Cache data
Amortize loading cost over the query sequence
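
To make "on-the-fly load" and "cache data" concrete, here is a small Python sketch of the query above. It is an illustration under assumptions, not the measured system: the function name `run_query`, the in-memory `cache` dictionary, and the choice to cache a1 and a2 in full during the first scan are hypothetical. The first call parses the flat-file lines and evaluates the filter in the same pass; later calls read the cached columns, so the parsing cost is paid once and spread over the query sequence.

```python
def run_query(lines, cache, v1, v2):
    """select sum(a1), avg(a2) from R where a1 < v1 and a2 < v2"""
    if cache:
        # Later queries: read the cached columns, no file access.
        pairs = zip(cache["a1"], cache["a2"])
    else:
        # First query: parse, cache, and filter in a single pass.
        cache["a1"], cache["a2"] = [], []

        def scan():
            for line in lines:
                fields = line.split(",")
                x, y = int(fields[0]), int(fields[1])
                cache["a1"].append(x)
                cache["a2"].append(y)
                yield x, y

        pairs = scan()

    sum_a1, sum_a2, n = 0, 0, 0
    for x, y in pairs:
        if x < v1 and y < v2:
            sum_a1 += x
            sum_a2 += y
            n += 1
    # SQL aggregates over an empty set are NULL; None stands in for that here.
    return (sum_a1, sum_a2 / n) if n else (None, None)


flat_file = ["1,10,100", "2,20,200", "3,30,300"]
cache = {}
first = run_query(flat_file, cache, 3, 25)   # scans the file and fills the cache
second = run_query([], cache, 2, 100)        # served from the cache alone
```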

Towards a fully autonomous system
• Give me your queries
• Give me your data as is
• Get your results!
• Adaptive Load, Adaptive Data Store, Adaptive Kernel
• Invisible DBMS (supports SQL + your tools, e.g. grep, awk)
Challenge: make this invisible
