Enhancing Hive with a Cost-Effective Multi-dimensional Range Index for Smart Grid Data

DGFIndex for Smart Grid: Enhancing Hive with a Cost Effective Multi-dimensional Range Index Yue Liu, Songlin Hu*, Tilmann Rabl, Wantao Liu, Hans-Arno Jacobsen, Kaifeng Wu, Jian Chen, Jintao Li MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG 国网电力科学研究院 STATE GRID ELECTRIC POWER RESEARCH INSTITUTE 国网浙江省电力公司 STATE GRID ZHEJIANG ELECTRIC POWER COMPANY

Outline • Big Data challenges in Smart Grid • DGFIndex design • Experiments on smart grid data • Conclusions

Big Data Challenges in Smart Grid Smart Meter • Example • 22 million smart meters in Zhejiang province, as required by China State Grid, it should be 96 measurements/day • Will be 2.1 billion records in a single table The Electricity ConsumptionInformation Collection System of the Grid Collector RDBMS Figure 1 Data flow in State Grid

Features of Smart Meter Data • Smart meter data features: • Time stamp field • Append only • Schema is static Table 1 An example format of smart meter data • Queries features: • Multi-dimensional range query • Lots of aggregation query • Query examples: • What is the average power consumption of user ids in the range 100 to 1000 and dates in the ranges “2013-01-01” to “2013-02-01”?

Why Migrate to Hadoop/Hive • Limitations of RDBMS • Low write throughput • Weak scalability • High license cost 16 times faster with only 1/10 cost Figure 2 Write throughput comparison of RDBMS and HDFS Hadoop/Hive is a good choice for solving smart meter big data problem • Hadoop and Hive • High write throughput • Flexible scalability • Low budget and cost effective

Indexes in Hive • Index • Compact Index, Aggregate Index and Bitmap Index • Store all combinations of index dimensions and location • Data partition • Each partition is a directory, reorganize data into different directory Table 2 3-dimensional compact index • Disadvantages on multi-dimensional range queries when large number of distinct value in index dimensions • Index: large index table size • Data partition: lots of directories and small files

Limitations of Indexing in Hive • Limitations: • Storing combination of index dimensions leads to extremely large index table • High selectivity leads to large temporary file, which may overflow the JobTracker’s memory • Poor filter effect when the value of index dimension scattered evenly in data file Temporary File File:Offset 2 4 1 Predicate 3 JobTracker InputFormat.getSplits 5 Chosen Splits 6 MR Job Scan Index Table Figure 3 Query with current index

Recall Data Features • Smart meter data features: • Time stamp field • Append only • Schema is static • Queries features: • Multi-dimensional range query • Lots of aggregation query

DGFIndex Design data file on HDFS • Using grid file to split logical data space into units(GFU) • Data in same GFU is stored together in the file of HDFS, named Slice • GFU is stored as a GFUKey/GFUValue pair in key/value store • GFUKey is the left lower coordinate of GFU in the data space • GFUValue consists of header and location • Header contains some pre-computed aggregation values • Location is the start and end offset of corresponding data segment in file of HDFS 2-dimensional DGFIndex aggregation values Slice dimension Y GFUKey GFUValue header location GFU dimension X Figure 4 DGFIndex architecture

DGFIndex Construction Input Map Phase Reduce Phase Output Splitting Policy DGFIndex Table test test after reorganization A:9|B:14 7_13 Reducer Mapper Input:{36,9|14|0.8} Input:{7_13,<9|14|0.8,8|13|0.2>} Output:{null,<9|14|0.8,8|13|0.2>} Output:{7_13,9|14|0.8} Figure 6 DGFIndex construction

DGFIndex Query SELECT SUM(C) FROM test WHERE A>=5 AND A<12 AND B>=12 AND B<16 1.0 Step 1 (Hive) overlapped? test:0 test:0 2.2 combine Step 2 (InputFormat.getSplits) chosen splits all splits Step 3 (Reader.next) 1.2 test Figure 7 DGFIndex query

Advantages of DGFIndex • Smaller index size • High index read speed, selective • Pre-computation Figure 8 DGFIndex architecture

Experiments • Comparison System • Hive with Compact Index, HadoopDB • Environment • Hardware • 29 virtual nodes, each has 8 cores, 8GB RAM, 300GB disk • Software • CentOS 6.5 b4bit, Jdk 1.6.0_45 64bit, Hadoop-1.2.1, HBase-0.94.13 • DGFIndex is implemented in Hive-0.10.0 • Replication factor is 2 in HDFS, mapred.task.io.sort.mb=512MB • PostgreSQL 8.4.20 for HadoopDB • Data Set and Query • Real meter data (1TB in TextFile and 890GB in RCFile, no compression) and ad-hoc queries from Zhejiang Grid • Lineitem table (518GB in TextFile and 468GB in RCFile, no compression) and Q6 from TPC-H

Index Size and Construction Time DGFIndex construction costs more time, but has smaller size Table 3 The number of distinct value in index dimensions Table 4 The number of intervals in UserId dimension Table 5 Index size and construction time

Aggregation Query SELECTSUM(powerConsumed) FROM meterdata WHERE regionId>r1 AND regionId<r2 AND userId>u1 AND userId<u2 AND time>t1 AND time<t2 For aggregation query, DGFIndex is 2-50 faster than Compact Index and HadoopDB Figure 9 Point query Figure 10 5% selectivity Figure 11 12% selectivity Listing 5 Aggregation query Table 6 number of records needed to read after being filtered by index

GroupyBy Query SELECT time,SUM(powerConsumed) FROM meterdata WHERE regionId>r1 AND regionId<r2 AND userId>u1 AND userId<u2 AND time>t1 AND time<t2 GROUP BY time For non-aggregation query, DGFIndex is 2-5 faster than Compact Index and HadoopDB, only need to read 1/40-1/5 data of Compact Index. Figure 12 Point query Figure 13 5% selectivity Figure 14 12% selectivity Listing 6 GroupBy query Table 7 number of records needed to read after being filtered by index

TPC-H Data Set and Q6 SELECTSUM(l_extendedprice*l_discount) as revenue FROM lineitem WHERE l_shipdate>=date’[DATE]’ AND l_shipdate<date’[DATE]’+interval ‘1’ year AND l_discount between [DISCOUNT]-0.01 and [DISCOUNT]+0.01 AND l_quantity<[QUANTITY] DGFIndex is also efficient for general case data Listing 8 Q6 from TPC-H Figure 16 Q6 from TPC-H cost time Table 10 number of records needed to read after being filtered by index Table 8 Index size and construction time

Conclusions • Multi-dimensional range index is essential for Hive-based smart meter data processing • We propose a cost effective multi-dimensional range index for Hadoop/Hive • Experimental results show the efficiency of our DGFIndex

Thanks Questions?

Enhancing Hive with a Cost-Effective Multi-dimensional Range Index for Smart Grid Data

Enhancing Hive with a Cost-Effective Multi-dimensional Range Index for Smart Grid Data

Presentation Transcript

ACTA-10-014R1 Smart Grid Communications Overview

State Policies on Smart Grid

Pacific Northwest Smart Grid Demonstration Project

Smart Grid: an Ontario Perspective

Smart Grid Overview

Smart Grid Implications for Gas Turbine Technology

Smart Grid Regulatory Framework

India Smart Grid Summit - Towards Sustainable Energy Solution-

Smart Grid implications for cable operators

“We’ll fund a better, smarter electricity grid and train workers to build it… ”

ICS 624 Spring 2011 Multi-Dimensional Clustering

Smart Grid Architecture Committee Semantic Working Party Update

Multi-dimensional Range Query Processing on the GPU

Multi-dimensional measures of empowerment

An Overview of the Smart Grid

Smart Grid Question

SKOS-2-HIVE

Bitmap Indices for Data Warehouse

Smart Grid Overview US Chamber of Commerce Kieran McLoughlin Smart Grid Solution Leader

Chapter 12 Multi-dimensional Arrays

Rethinking Choices for Multi-dimensional Point Indexing

State Policies on Smart Grid