Benchmarking “No One Size Fits All” Big Data Analytics


Presentation Transcript


  1. Benchmarking “No One Size Fits All” Big Data Analytics BigFrame Team The Hong Kong Polytechnic University Duke University HP Labs

  2. Analytics System Landscape • MPP DB • Greenplum, SQL Server PDW, Teradata, etc. • Columnar • Vertica, Redshift, Vectorwise, etc. • MapReduce • Hadoop, Hive, HadoopDB, Tenzing, etc. • Streaming • Storm, StreamBase, etc. • Graph • Pregel, GraphLab, etc. • Multi-tenancy • Mesos, YARN, etc.

  3. Analytics System Landscape • MPP DB • Greenplum, SQL Server PDW, Teradata, etc. • Columnar • Vertica, Redshift, Vectorwise, etc. • MapReduce • Hadoop, Hive, HadoopDB, Tenzing, etc. • Streaming • Storm, StreamBase, etc. • Graph • Pregel, GraphLab, etc. • Multi-tenancy • Mesos, YARN, etc. • What does this mean for Big Data Practitioners?

  4. Gives them a lot of power!

  5. Even the mighty may need a little help

  6. Challenges for Practitioners • Which system to use for the app that I am developing? • Features (e.g., graph data) • Performance (e.g., claims like "System A is 50x faster than B") • Resource efficiency • Growth and scalability • Multi-tenancy • App Developers, Data Scientists

  7. Challenges for Practitioners • Which system to use for the app that I am developing? • Different parts of my app have different requirements • App Developers, Data Scientists • Compose "best of breed" systems, or use a "one size fits all" system?

  8. Challenges for Practitioners • Which system to use for the app that I am developing? • Different parts of my app have different requirements • App Developers, Data Scientists • Total Cost of Ownership (TCO)? • Managing many systems is hard! • CIO, System Admins

  9. Need Benchmarks

  10. One Approach • Categorize systems • Develop a benchmark per system category

  11. Useful, But ... • MPP DB, Columnar • TPC-H/TPC-DS, Berkeley Big Data Benchmark, etc. • MapReduce • TeraSort, DFSIO, GridMix, HiBench, etc. • Streaming • Linear Road, etc. • Graph • Graph 500, PageRank, etc. • ...

  12. Problem: May miss the Big Picture

  13. Problem: May miss the Big Picture • Cannot capture the complexities and end-to-end behavior of big data applications and deployments: • Bottlenecks • Data conversion, transfer, & loading overheads • Storage costs & other parts of the data life-cycle • Resource management challenges • Total Cost of Ownership (TCO)

  14. A Better Approach: BigBench or Deep Analytics Pipeline • Application-driven • Involves multiple types of data: • Structured • Semi-structured • Unstructured • Involves multiple types of operators: • Relational operators: join, group by • Text analytics: sentiment analysis • Machine learning

  15. Problem: Benchmark X • "Give a man a fish and you will feed him for a day. Give him fishing gear and you will feed him for life." --Anonymous • X Benchmark Generator

  16. BigFrame A Benchmark Generator for Big Data Analytics

  17. How a user uses BigFrame • [Diagram] bigif (benchmark input format) • BigFrame Interface • Benchmark Generator • bigspec (benchmark specification) • Benchmark Driver for System Under Test • run the benchmark • System Under Test: Hive, MapReduce, HBase • result

  18. bigspec: Benchmark Specification Hive MapReduce HBase

  19. What should be captured by the benchmark input format The 3Vs Volume Velocity Variety

  20. bigif: BigFrame's InputFormat

  21. Benchmark Generation • bigif (benchmark input format) • bigspec (benchmark specification) • Benchmark Generator • bigif describes points in a discrete space of {Data, Query} × {Variety, Volume, Velocity} • Initial data to load • Data refresh pattern • Query streams • Evaluation metrics • Benchmark generation can be addressed as a search problem within a rich application domain

  22. Application Domain Modeled Currently • E-commerce sales, promotions, recommendations • Social media sentiment & influence • Benchmark generation can be addressed as a search problem within a rich application domain
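
The "discrete space" and "search problem" framing on slide 21 can be illustrated with a small sketch. This is not BigFrame's code: the slides name only the axes ({Data, Query} and {Variety, Volume, Velocity}), so every setting below is a hypothetical placeholder and the function names are invented for illustration.

```python
from itertools import product

SUBJECTS = ("data", "query")
ASPECTS = ("variety", "volume", "velocity")

# The six dimensions implied by {Data, Query} x {Variety, Volume, Velocity}.
DIMENSIONS = list(product(SUBJECTS, ASPECTS))

# Hypothetical discrete settings per dimension (assumed, not from BigFrame).
SETTINGS = {
    ("data", "variety"): ("relational", "relational+text"),
    ("data", "volume"): ("small", "large"),
    ("data", "velocity"): ("static", "fast"),
    ("query", "variety"): ("micro", "macro", "continuous"),
    ("query", "volume"): ("single", "concurrent"),
    ("query", "velocity"): ("low", "high"),
}


def enumerate_points():
    """Yield every point in the space as a dict mapping dimension -> setting."""
    for combo in product(*(SETTINGS[d] for d in DIMENSIONS)):
        yield dict(zip(DIMENSIONS, combo))


def search(predicate):
    """Return all points that satisfy a user-supplied requirement."""
    return [p for p in enumerate_points() if predicate(p)]


points = list(enumerate_points())
fast_data = search(lambda p: p[("data", "velocity")] == "fast")
print(len(DIMENSIONS), len(points), len(fast_data))
```

Exhaustive enumeration is only workable because this toy space is tiny; a real generator would presumably prune or guide the search using the application domain.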

  23. Application Domain Modeled Currently

  24. Application Domain Modeled Currently Web_sales Promotion Item

  25. Application Domain Modeled Currently

  26. Use Case 1: Exploratory BI • Large volumes of relational data • Mostly aggregations and few joins • Can Spark's performance match that of an MPP DB? • Data Variety = {Relational} • Query Variety = {Micro} • BigFrame will generate a benchmark specification containing relational data and (SQL-ish) queries

  27. Use Case 2: Complex BI • Large volumes of relational data • Even larger volumes of text data • Combined analytics • Data Variety = {Relational, Text} • Query Variety = {Macro} (application-focused instead of micro-benchmark) • BigFrame will generate a benchmark specification that includes sentiment analysis tasks over tweets

  28. Use Case 3: Dashboards • Large volume and velocity of relational and text data • Continuously updated dashboards • Data Velocity = Fast • Query Variety = Continuous (as opposed to Exploratory) • BigFrame will generate a benchmark specification that includes data refresh as well as continuous queries whose results change upon data refresh
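
The three use cases on slides 26 to 28 each map an input description to the contents of a generated specification. A minimal sketch of that mapping follows, assuming a hypothetical `BigifSketch` record and `spec_contents` function; neither name nor the field layout comes from BigFrame, and the "static" default velocity is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BigifSketch:
    """Hypothetical stand-in for a bigif; field names are not BigFrame's."""
    data_variety: frozenset
    query_variety: str
    data_velocity: str = "static"  # assumed default, not stated on the slides


def spec_contents(b):
    """List what the generated specification would contain for this input."""
    contents = []
    if "relational" in b.data_variety:
        contents.append("relational data")
    if "text" in b.data_variety:
        contents.append("text data")
    if b.query_variety == "micro":
        contents.append("SQL-ish queries")
    if b.query_variety == "macro" and "text" in b.data_variety:
        contents.append("sentiment analysis tasks over tweets")
    if b.data_velocity == "fast":
        contents.append("data refresh")
    if b.query_variety == "continuous":
        contents.append("continuous queries")
    return contents


exploratory_bi = BigifSketch(frozenset({"relational"}), "micro")
complex_bi = BigifSketch(frozenset({"relational", "text"}), "macro")
dashboards = BigifSketch(frozenset({"relational", "text"}), "continuous", "fast")

for name, case in [("Exploratory BI", exploratory_bi),
                   ("Complex BI", complex_bi),
                   ("Dashboards", dashboards)]:
    print(name, "->", spec_contents(case))
```

The rules are hard-coded here only to mirror the slides; they are not a claim about how the generator decides what to emit.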

  29. Working with the community • First release of BigFrame planned for August 2013 • Open source with extensibility APIs • Benchmark Drivers for more systems • Utilities (accessed through the Benchmark Driver to drill down into system behavior during benchmarking) • Instantiate the BigFrame pipeline for more app domains

  30. Take Away • "Benchmarks shape a field (for better or worse); they are how we determine the value of change." --David Patterson, University of California, Berkeley, 1994 • Benchmarks meet different needs for different people • End customers, application developers, system designers, system administrators, researchers, CIOs • BigFrame helps users generate benchmarks that best meet their needs