Indexing HDFS Data in PDW: Splitting the data from index

Indexing HDFS Data in PDW: Splitting the data from index Vinitha Gankidi#, Nikhil Teletia*, Jignesh M. Patel#, Alan Halverson*, David J. DeWitt* #University of Wisconsin-Madison * Microsoft Jim Gray Systems Lab

Motivation SQL SQL Server PDW with Polybase Result Hot Data Cold Data RDBMS HDFS Data lives in two worlds Is it possible to run highly-selective queries on HDFS-resident cold data with both low latency and minimal system changes? Hybrid SQL-On-Hadoop solutions (Microsoft PolyBase, Teradata QueryGrid, IBM Big SQL etc.) Familiar SQL interface Load first, schema later Decades of research and optimization Cheap and scalable data store

Query Execution over External Data • Can we execute queries - • without entirely scanning all the HDFS files • without running a Map job? SELECT * FROM hdfs_Employee WHERE DeptID = 1 IMPORT PATH The HDFS files have to be entirely imported 1 Import HDFS files into PDW 2 Run the rest of the query inside PDW YES. How? By using an index PUSH-DOWN PATH Significant startup overhead for MAP task Run a Map job to filter 1 All the HDFS files are scanned entirely Import the result of the Map job into PDW 2 HDFS Run the rest of the query inside PDW 3

What is a Split-Index? • Index is stored in RDBMS, while the data is in HDFS • Index is stored as a RDBMS table • Hash-partitioned across multiple node • Each partition has clustered B+ tree Index HDFS RDBMS Split-Index is similar to a materialized view (with an external pointer) Split-Index can be out-of-sync with the data

Query Execution using Split-Index 4 Return the result SELECT name FROM hdfs_Employee WHERE DeptID = 1 1 RDBMS 2 SELECT [HDFS File Name], [HDFS Offset], [Rec Len] FROM index_Employee WHERE DeptID = 1 Using index, we can answer queries without having to sequentially scan each HDFS file. Qualifying Tuples Index_Emp (Index on DeptID) HDFS 3 Retrieve qualifying tuples from HDFS files.

Incremental Index Update • Given the append-only property of the HDFS data, index can be updated incrementally • A new HDFS file is added • Append the rows of the new file to the existing index • An HDFS file is deleted • Delete the rows of the deleted file from the existing index

Hybrid Scan • A stale Split-Index can still be used during query execution • Examples: • An HDFS file is added • Scan the new file using non-index approach • Process existing files using index • An HDFS file is deleted • When probing the index, remove the rows associated with the deleted file

Experiments

Split-Index Performance • Cluster • 9 Node SQL Server PDW cluster (8 compute nodes + 1 control node) • 29 Node Windows HDP 2.0 cluster (28 data nodes + 1 name node) • Data Set • 10 TB Scale Lineitem table • Compare Push-Down approach with Split-Index approach Map Cost RID Materialize Cost COST Data Import Cost Data Import Cost Push-Down Approach Split-Index Approach

Split-Index Performance SELECT * FROM lineitemWHERE l_orderkey <= [Variable] Split-Index on l_orderkey Data Size:~800GB Index Size:~80GB Map Cost RID Materialize Cost Index performance is sensitive to the access pattern. Data Import Cost Data Import Cost Split-Index Approach Push-Down Approach

Space vs. Time Trade off • Cost of storing the data in RDBMS is higher compared to HDFS • Split-Index can be used as a covering index • Quantify the performance and space trade-off as we move columns from HDFS to PDW • Experiment Setup • 1 TB Scale Lineitem • Modified Query 6 SELECT SUM(l_extendedprice*l_discount) AS REVENUE FROM lineitem WHERE l_shipdate >= '1994-01-01' AND l_shipdate < dateadd(mm, 1, cast('1994-01-01' as date)) AND l_discount BETWEEN .06 - 0.01 AND .06 + 0.01 AND l_quantity < 24

Space vs. Time Trade off Split-Index can be used to balance the query execution time and the PDW disk footprint The Lineitem table is in PDW. No index. The Lineitem table is in HDFS Split-Index on l_shipdate, l_discount, l_quantity, l_extendedprice The Lineitem table is in HDFS. No index. (Push-Down) The Lineitem table is in HDFS Split-Index on l_shipdate

Conclusions and Future Work

Indexing HDFS Data in PDW: Splitting the data from index

Indexing HDFS Data in PDW: Splitting the data from index

Presentation Transcript

Introduction to Hive

Data Profiling The Path to More Accurate Data

K-12 Statewide Longitudinal Data System (SLDS)

13.9 Spin-Spin Splitting

Overview of the VHA Corporate Data Warehouse (CDW), the VSSC Portal and Importance of Accurate Data

Programs and Data

Data Analysis State Accountability

Data Archiving @ SAP

Data Minin g and Knowledge Acquizition — Chapter 2 — — Data Preprocessing —

Data Mining Toon Calders

Introduction to Spring Data

Geographic Data and Relationships

Data Mining: Introduction

Data Mining: Data

DBMS Storage and Indexing

FLASH CARDS

Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration

Data Structures