BigDataBench: a Big Data Benchmark Suite from Internet Services
Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Gang Lu, Kent Zhang, Xiaona Li, and Bizhu Qiu
HPCA 2014
Why Big Data Benchmarking? Measuring big data systems and architectures quantitatively
What is BigDataBench?
• An open source big data benchmarking project
• http://prof.ict.ac.cn/BigDataBench/
• 6 real-world data sets, used to generate (4V) big data
• 19 workloads: OLTP, Cloud OLTP, OLAP, and offline analytics
• Same workloads, different implementations
Executive summary
• Big data benchmarks: do we know enough about big data benchmarking?
• Big data workload characterization: what are the differences from traditional workloads?
• Exploring the best big data architectures: brawny-core, wimpy multi-core, or wimpy many-core?
Outline
• Benchmarking Methodology and Decision
• Big Data Workload Characterization
• Evaluating Hardware Systems with Big Data
• Conclusion
Methodology
[Diagram: BigDataBench is refined iteratively against system and architecture characteristics and the 4V of big data]
Methodology (Cont'd)
• Investigate typical application domains
• Diverse data sets → big data sets preserving the 4V, generated with BDGS (big data generation tools)
• Data types: structured, semi-structured, unstructured
• Data sources: text data, graph data, table data, extended…
• Diverse workloads → big data workloads covering basic & important operations and algorithms on representative software stacks, extended…
• Application types: OLTP, Cloud OLTP, OLAP, offline analytics
• The data sets and workloads together constitute BigDataBench
Top Sites on the Web
• Search engines, social networks, and e-commerce account for 80% of the page views of all Internet services
• More details at http://www.alexa.com/topsites/global;0
BigDataBench Summary
• Application domains: search engine, social network, e-commerce
• Workload types: OLTP, Cloud OLTP, OLAP, and offline analytics (19 workloads)
• Six real-world data sets: Amazon Movie Reviews, Wikipedia Entries, Facebook Social Network, Google Web Graph, ProfSearch person resumes, and e-commerce transaction data
• BDGS (Big Data Generator Suite) for scalable data
• Software stacks: NoSQL, Impala, Shark, MPI, …
Outline
• Benchmarking Methodology and Decision
• Big Data Workload Characterization
• Evaluating Hardware Systems with Big Data
• Conclusion
Big Data Workloads Analyzed
• Input data sizes vary from 32 GB to 1 TB
Other Benchmarks Compared
• HPCC: representative HPC benchmark suite, 7 benchmarks
• PARSEC: CMP (multi-threaded) benchmark suite, 12 benchmarks
• SPEC CPU: SPECFP and SPECINT
Metrics
• User-perceivable metrics
• OLTP services: requests per second (RPS)
• Cloud OLTP: operations per second (OPS)
• OLAP and offline analytics: data processed per second (DPS)
• Micro-architecture characteristics, collected via hardware performance counters
Experimental Configurations
• Testbed configurations
• Fifteen nodes: 1 master + 14 slaves
• Data input size: 32 GB ~ 1 TB
• Each node: 2 × Xeon E5645, 16 GB memory, 8 TB disk
• Network: 1 Gb Ethernet
• Software configurations
• OS: CentOS 5.5 with Linux kernel 2.6.34
• Stacks: Hadoop 1.0.2, HBase 0.94.5, Hive 0.9, MPICH2 1.5, Nutch 1.1, and RUBiS 5.0
Instruction Breakdown (services vs. data analytics)
• FP instructions: X87 + SSE (X87, SSE_Pack_Float, SSE_Pack_Double, SSE_Scalar_Float, and SSE_Scalar_Double)
• Integer instructions: Total_Ins - FP_Ins - Branch_Ins - Store_Ins - Load_Ins
• More integer instructions (fewer floating point instructions): the average ratio of integer to floating point instructions is 75 (see the sketch below)
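To make the counter arithmetic concrete, here is a minimal sketch of the breakdown above; the counter names mirror the slide's labels, and the dictionary input is an assumption (real totals would come from hardware performance counters, e.g., collected with perf):

```python
# Minimal sketch of the instruction breakdown described above.
# The input dictionary of raw counter totals is an assumption.

def instruction_mix(c):
    """Derive FP and integer instruction counts from raw counter totals."""
    fp = (c["X87"] + c["SSE_Pack_Float"] + c["SSE_Pack_Double"]
          + c["SSE_Scalar_Float"] + c["SSE_Scalar_Double"])
    integer = (c["Total_Ins"] - fp - c["Branch_Ins"]
               - c["Store_Ins"] - c["Load_Ins"])
    return {"fp": fp, "integer": integer, "int_to_fp": integer / fp}
```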
Floating Point Operation Intensity (E5310)
• Definition: the total number of floating point instructions divided by the total number of memory access bytes in a run of a workload
[Charts: data analytics and service workloads]
• Very low floating point operation intensity: two orders of magnitude lower than in the traditional workloads
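Restated as a formula (a direct transcription of the definition above):

```latex
\text{FP operation intensity} \;=\;
  \frac{\#\,\text{floating point instructions}}{\#\,\text{memory access bytes}}
```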
Floating Point Operation Intensity
[Charts: data analytics and service workloads]
• Floating point operation intensity on the E5645 is higher than that on the E5310
Integer Operation Intensity
[Charts: data analytics and service workloads]
• Integer operation intensity is of the same order as in the traditional workloads
• Integer operation intensity on the E5645 is higher than that on the E5310
• The L3 cache is effective & bandwidth is improved
Possible reasons (Xeon E5645 vs. Xeon E5310)
Technique improvements of the Xeon E5645:
• More cores in one processor: six cores in the Xeon E5645 vs. four cores in the Xeon E5310
• Deeper cache hierarchy: L1~L3 vs. L1~L2; the L3 cache is effective in decreasing memory access traffic for big data workloads
• Larger interconnect bandwidth: the Xeon E5645 adopts Intel QuickPath Interconnect (QPI) to eliminate the Front Side Bus bottleneck [ASPLOS 2012]
• Hyperthreading technology: hyperthreading can improve performance by factors of 1.3~1.6 for scale-out workloads
Cache Behaviors
[Charts: data analytics and service workloads]
• Higher L1I cache misses than the traditional workloads
• Data analytics workloads have better L2 cache behaviors than service workloads, with the exception of BFS
• Good L3 cache behaviors
TLB Behaviors
[Charts: service and data analytics workloads]
• Higher ITLB misses than the traditional workloads
Computation Intensity (integer operations)
• X axis: (total number of integer instructions) / (total memory access bytes); higher means more integer operations executed between two memory accesses
• Y axis: (total number of integer instructions) / (total bytes received from the network); higher means more integer operations executed per received byte
[Chart axes: integer operations per byte of memory accesses vs. integer operations per byte received from the network; see the sketch below]
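A minimal sketch of the two ratios, assuming raw per-run totals are already available; the function name and the example numbers are illustrative, not values from the paper:

```python
# Minimal sketch of the two computation-intensity ratios plotted above.

def computation_intensity(int_instructions, memory_access_bytes,
                          network_bytes_received):
    """Return (integer ops per memory byte, integer ops per network byte)."""
    x = int_instructions / memory_access_bytes     # X axis of the plot
    y = int_instructions / network_bytes_received  # Y axis of the plot
    return x, y

# Example: 8e11 integer instructions, 2e11 bytes of memory traffic,
# 1e10 bytes received from the network.
x, y = computation_intensity(8e11, 2e11, 1e10)
print(f"x = {x:.1f} int ops/memory byte, y = {y:.1f} int ops/network byte")
```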
Big Data Workload Characterization Summary
• Data-movement-dominated computing: low computation intensity
• Cache behaviors (Xeon E5645): very high L1I MPKI; the L3 cache is effective
• Diverse workload behaviors: computation/communication vs. computation/memory accesses
Outline
• Benchmarking Methodology and Decision
• Big Data Workload Characterization
• Evaluating Hardware Systems with Big Data
• Y. Shi, S. A. McKee et al. Performance and Energy Efficiency Implications from Evaluating Four Big Data Systems, submitted to IEEE Micro.
• Conclusion
State-of-the-art Big Data System Architectures
• Trends: brawny-core processors, wimpy multi-core processors, wimpy many-core processors
• Hardware designers: what are the best big data systems and architectures in terms of both performance and energy efficiency?
• Data center administrators: how to choose appropriate hardware for big data applications?
Evaluated Platforms
• Scale-up: Xeon E5310 (brawny-core) → Xeon E5645 (brawny-core)
• Scale-out: Atom D510 (wimpy multi-core) → TileGx36 (wimpy many-core)
[Tables: basic information and architectural characteristics]
Experimental Configurations
• Software stack: Hadoop 1.0.2
• Cluster configuration:
• Xeon & Atom-based systems: 1 master + 4 slaves
• Tilera system: 1 master + 2 slaves
• Data sizes: 500 MB, 2 GB, 8 GB, 32 GB, 64 GB, 128 GB
• Apples-to-apples comparison:
• Deploy the systems with the same network and disk configurations
• Provide about 1 GB of memory for each hardware thread / core
• Adjust the Hadoop parameters to optimize performance
Metrics
• Performance: data processed per second (DPS), where DPS = Data Input Size / Running Time
• Energy efficiency: data processed per joule (DPJ), where DPJ = Data Input Size / Energy Consumption
• Report DPS and DPJ per processor
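A minimal sketch of the two metrics; the names and example numbers are illustrative, not from the paper:

```python
# Minimal sketch of the two user-perceivable metrics defined above.

def dps(data_input_bytes, running_time_s):
    """Data processed per second (DPS)."""
    return data_input_bytes / running_time_s

def dpj(data_input_bytes, energy_joules):
    """Data processed per joule (DPJ)."""
    return data_input_bytes / energy_joules

# Example: a 32 GB input processed in 400 s on a machine drawing 300 W
# consumes 400 s * 300 W = 120,000 J.
size = 32 * 2**30
print(f"DPS = {dps(size, 400) / 2**20:.1f} MB/s")
print(f"DPJ = {dpj(size, 120_000) / 2**20:.2f} MB/J")
```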
General Observations
[Charts: average DPS and DPJ comparisons]
• I/O-intensive workload (Sort): the many-core TileGx36 achieves the best performance and energy efficiency; the brawny-core processors do not provide performance advantages.
• CPU-intensive and floating-point-dominated workloads (Bayes & K-means): brawny-core processors show obvious performance advantages with energy efficiency close to the wimpy-core processors.
• Other workloads: no platform consistently wins in terms of both performance and energy efficiency.
• Averages are reported only for data sizes bigger than 8 GB (the systems are not fully utilized on smaller data sizes).
Improvements from Scaling-out the Wimpy Core (TileGx36 vs. Atom D510)
• The core of the TileGx36 is wimpier than that of the Atom D510:
• Adopts a MIPS-derived VLIW instruction set
• Does not support hyperthreading
• Has fewer pipeline stages
• Does not have dedicated floating point units
• The TileGx36 integrates more cores on the NoC (Network on Chip): 36 cores in the TileGx36 vs. 4 cores in the Atom D510
Improvements from Scaling-out the Wimpy Core (TileGx36 vs. Atom D510)
[Charts: DPS and DPJ comparisons]
• I/O-intensive workload (Sort): the TileGx36 shows a 4.1x performance improvement and a 1.01x energy improvement (on average).
• CPU-intensive and floating-point-dominated workloads (Bayes & K-means): the TileGx36 shows a 2.5x performance advantage but only 0.7x energy efficiency (on average).
• Other workloads: the TileGx36 shows a 2.5x performance improvement and a 1.03x energy improvement (on average).
Improvements from Scaling-out the Wimpy Core (TileGx36 vs. Atom D510)
• Scaling out the wimpy core can bring a performance advantage by improving execution parallelism.
• Simplifying the wimpy cores and integrating more cores on the NoC is an option for big data workloads.
Scale-up the Brawny Core (Xeon E5645) vs. Scale-out the Wimpy Core (TileGx36)
[Charts: DPS and DPJ comparisons]
• I/O-intensive workload (Sort): the TileGx36 shows a 1.2x performance improvement and a 1.9x energy improvement (on average).
• CPU-intensive and floating-point-dominated workloads (Bayes & K-means): the E5645 shows a 4.2x performance improvement and a 2.0x energy improvement (on average).
• Other workloads: the E5645 shows a performance advantage, but with no consistent energy improvement.
Hardware Evaluation Summary
• No one-size-fits-all solution: none of the microprocessors consistently wins in terms of both performance and energy efficiency for all of our big data workloads
• One-size-fits-a-bunch solution: there are different classes of big data workloads, and each class achieves better performance and energy efficiency on a different architecture
Outline
• Benchmarking Methodology and Decision
• Big Data Workload Characterization
• Evaluating Hardware Systems with Big Data
• Conclusion
Conclusion
• An open source big data benchmark suite with a data-centric benchmarking methodology: http://prof.ict.ac.cn/BigDataBench
• Big data workload characterization: data-movement-dominated computing; diverse behaviors
• Benchmarks must include diversity of data and workloads
• Eschew one-size-fits-all solutions: tailor system designs to specific workload requirements