190 likes | 203 Views
This paper presents DGFIndex, a cost-effective multi-dimensional range index for Smart Grid data in Hive. It overcomes limitations of traditional indexing methods and improves query performance for large-scale smart meter data. Experimental results demonstrate its effectiveness and cost efficiency.
E N D
DGFIndex for Smart Grid: Enhancing Hive with a Cost Effective Multi-dimensional Range Index Yue Liu, Songlin Hu*, Tilmann Rabl, Wantao Liu, Hans-Arno Jacobsen, Kaifeng Wu, Jian Chen, Jintao Li MIDDLEWARE SYSTEMS RESEARCH GROUP MSRG.ORG 国网电力科学研究院 STATE GRID ELECTRIC POWER RESEARCH INSTITUTE 国网浙江省电力公司 STATE GRID ZHEJIANG ELECTRIC POWER COMPANY
Outline • Big Data challenges in Smart Grid • DGFIndex design • Experiments on smart grid data • Conclusions
Big Data Challenges in Smart Grid Smart Meter • Example • 22 million smart meters in Zhejiang province, as required by China State Grid, it should be 96 measurements/day • Will be 2.1 billion records in a single table The Electricity ConsumptionInformation Collection System of the Grid Collector RDBMS Figure 1 Data flow in State Grid
Features of Smart Meter Data • Smart meter data features: • Time stamp field • Append only • Schema is static Table 1 An example format of smart meter data • Queries features: • Multi-dimensional range query • Lots of aggregation query • Query examples: • What is the average power consumption of user ids in the range 100 to 1000 and dates in the ranges “2013-01-01” to “2013-02-01”?
Why Migrate to Hadoop/Hive • Limitations of RDBMS • Low write throughput • Weak scalability • High license cost 16 times faster with only 1/10 cost Figure 2 Write throughput comparison of RDBMS and HDFS Hadoop/Hive is a good choice for solving smart meter big data problem • Hadoop and Hive • High write throughput • Flexible scalability • Low budget and cost effective
Indexes in Hive • Index • Compact Index, Aggregate Index and Bitmap Index • Store all combinations of index dimensions and location • Data partition • Each partition is a directory, reorganize data into different directory Table 2 3-dimensional compact index • Disadvantages on multi-dimensional range queries when large number of distinct value in index dimensions • Index: large index table size • Data partition: lots of directories and small files
Limitations of Indexing in Hive • Limitations: • Storing combination of index dimensions leads to extremely large index table • High selectivity leads to large temporary file, which may overflow the JobTracker’s memory • Poor filter effect when the value of index dimension scattered evenly in data file Temporary File File:Offset 2 4 1 Predicate 3 JobTracker InputFormat.getSplits 5 Chosen Splits 6 MR Job Scan Index Table Figure 3 Query with current index
Recall Data Features • Smart meter data features: • Time stamp field • Append only • Schema is static • Queries features: • Multi-dimensional range query • Lots of aggregation query
DGFIndex Design data file on HDFS • Using grid file to split logical data space into units(GFU) • Data in same GFU is stored together in the file of HDFS, named Slice • GFU is stored as a GFUKey/GFUValue pair in key/value store • GFUKey is the left lower coordinate of GFU in the data space • GFUValue consists of header and location • Header contains some pre-computed aggregation values • Location is the start and end offset of corresponding data segment in file of HDFS 2-dimensional DGFIndex aggregation values Slice dimension Y GFUKey GFUValue header location GFU dimension X Figure 4 DGFIndex architecture
DGFIndex Construction Input Map Phase Reduce Phase Output Splitting Policy DGFIndex Table test test after reorganization A:9|B:14 7_13 Reducer Mapper Input:{36,9|14|0.8} Input:{7_13,<9|14|0.8,8|13|0.2>} Output:{null,<9|14|0.8,8|13|0.2>} Output:{7_13,9|14|0.8} Figure 6 DGFIndex construction
DGFIndex Query SELECT SUM(C) FROM test WHERE A>=5 AND A<12 AND B>=12 AND B<16 1.0 Step 1 (Hive) overlapped? test:0 test:0 2.2 combine Step 2 (InputFormat.getSplits) chosen splits all splits Step 3 (Reader.next) 1.2 test Figure 7 DGFIndex query
Advantages of DGFIndex • Smaller index size • High index read speed, selective • Pre-computation Figure 8 DGFIndex architecture
Experiments • Comparison System • Hive with Compact Index, HadoopDB • Environment • Hardware • 29 virtual nodes, each has 8 cores, 8GB RAM, 300GB disk • Software • CentOS 6.5 b4bit, Jdk 1.6.0_45 64bit, Hadoop-1.2.1, HBase-0.94.13 • DGFIndex is implemented in Hive-0.10.0 • Replication factor is 2 in HDFS, mapred.task.io.sort.mb=512MB • PostgreSQL 8.4.20 for HadoopDB • Data Set and Query • Real meter data (1TB in TextFile and 890GB in RCFile, no compression) and ad-hoc queries from Zhejiang Grid • Lineitem table (518GB in TextFile and 468GB in RCFile, no compression) and Q6 from TPC-H
Index Size and Construction Time DGFIndex construction costs more time, but has smaller size Table 3 The number of distinct value in index dimensions Table 4 The number of intervals in UserId dimension Table 5 Index size and construction time
Aggregation Query SELECTSUM(powerConsumed) FROM meterdata WHERE regionId>r1 AND regionId<r2 AND userId>u1 AND userId<u2 AND time>t1 AND time<t2 For aggregation query, DGFIndex is 2-50 faster than Compact Index and HadoopDB Figure 9 Point query Figure 10 5% selectivity Figure 11 12% selectivity Listing 5 Aggregation query Table 6 number of records needed to read after being filtered by index
GroupyBy Query SELECT time,SUM(powerConsumed) FROM meterdata WHERE regionId>r1 AND regionId<r2 AND userId>u1 AND userId<u2 AND time>t1 AND time<t2 GROUP BY time For non-aggregation query, DGFIndex is 2-5 faster than Compact Index and HadoopDB, only need to read 1/40-1/5 data of Compact Index. Figure 12 Point query Figure 13 5% selectivity Figure 14 12% selectivity Listing 6 GroupBy query Table 7 number of records needed to read after being filtered by index
TPC-H Data Set and Q6 SELECTSUM(l_extendedprice*l_discount) as revenue FROM lineitem WHERE l_shipdate>=date’[DATE]’ AND l_shipdate<date’[DATE]’+interval ‘1’ year AND l_discount between [DISCOUNT]-0.01 and [DISCOUNT]+0.01 AND l_quantity<[QUANTITY] DGFIndex is also efficient for general case data Listing 8 Q6 from TPC-H Figure 16 Q6 from TPC-H cost time Table 10 number of records needed to read after being filtered by index Table 8 Index size and construction time
Conclusions • Multi-dimensional range index is essential for Hive-based smart meter data processing • We propose a cost effective multi-dimensional range index for Hadoop/Hive • Experimental results show the efficiency of our DGFIndex
Thanks Questions?