Visualizing Big Data David Schmittdiel CSC 9010-003 9/16/2014
Outline • Me • Big Data review and background • Problem statement • Case study: StubHub
Intro • I don’t have a Computer Science background (but I really, really regret it) • MATLAB, PHP, MySQL, Oracle • Manager of Business Intelligence Development at StubHub • Bringing actionable data to the masses • Self-service, on-demand, exploratory BI • Data discovery through visualization • Automation
Big Data, Big Ruse? • Stephen Few: “What the hell is Big Data anyway?” • BI vendor-driven responses: • Increased data volume AND velocity • New data sources (unstructured) • Fundamental question: Do you really need Big Data? • “Until you’ve figured out how to use the data that you already have, collecting more will only distract you from the real task. Time spent collecting more data is time that could be better spent weaving it into something meaningful.” • Stephen Few, Perceptual Edge - July/August/September 2012, “Big Data, Big Ruse” • http://www.perceptualedge.com/articles/visual_business_intelligence/big_data_big_ruse.pdf
The Real Task • Transforming raw data into meaningful, useful, actionable information • Leveraging the past to guide future endeavors • Finding the signals amidst the noise • Driving forces: • Scientific research • Business (ecommerce) • Government • Stephen Few: “The success of BI … [is] measured in our increased ability to understand data and then make better decisions based on that understanding.”
Visualizing Small Data • MS Excel • Ease of use for tasks involving smaller data sets, limited interactivity • Stephen Few: “building applications on top of Excel can be arduous and painful” • Stephen Few, Perceptual Edge – September/October 2009, “Fundamental Differences in Analytical Tools” • http://www.perceptualedge.com/articles/visual_business_intelligence/differences_in_analytical_tools.pdf
Visualizing Small Data • Static dashboards: “custom analytics” • Time-consuming to build but relatively easy to maintain • “Remove … functionality that isn’t relevant to the analytical objective of its users”
Unique Challenges • Juliana Freire: “Visualization: Big Data Considerations” • Interactivity is key, but challenging for Big Data • Need better integration between data management and visualization components • Phil Simon describing Netflix’s data mindset: • Data should be accessible, easy to discover, and easy to process for everyone • The longer you take to find the data, the less valuable it becomes • Whether a dataset is large or small, being able to visualize it makes it easier to explain • Juliana Freire, DIMACS 2013, “Big Data Analysis and Integration” • http://dimacs.rutgers.edu/Workshops/BigData/Slides/2013-dimacs.pdf • Phil Simon, HBR Webinar, “The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions” • http://www.scribd.com/doc/232032215/HBR-Webinar-Summary-The-Visual-Organization
Case Study: StubHub • Using SAP Business Objects (BO) since at least 2008 on top of Oracle 11g DW • Included in the “Leaders” quadrant of the 2014 Gartner report • BO “delivers a broad range of BI and analytic capabilities through a semantic layer best suited for large IT-managed deployments that require robust governance and administrative capabilities” • Customers use it “primarily for reporting; the number that use it for interactive discovery or visualization was well below the average” • Gartner, Magic Quadrant for Business Intelligence and Analytics Platforms • www.gartner.com/technology/reprints.do?id=1-1QLGACN&ct=140210&st=sb
Case Study: StubHub • Feedback from business users was universally poor • Hard to use • Limited number of (inadequate) visualizations available • Not interactive • Supported by Tech org only • Reporting Team within Analytics org formed in January 2013 • Innovative • Responsive • Promote self-service • Objective vs. subjective use of data
Case Study: StubHub • General concept: aggregate any metrics by any breakdown, over any time period, filtered for anything • Supports “exploratory analytics”: pursue each question as it arises • Settle instead for a collection of dashboards categorized by business use case
Case Study: StubHub • First iteration: Dynamic SQL • Complicated rules for commenting based on front-end selections
    select
      -- DATE: sp.src_created_dttm_sale
      g.genre_cat_final as "GCF", -- DISPLAY CATEGORY: GCF
      g.genre_descr as "Genre", -- DISPLAY CATEGORY: Genre
      sum(sp.ticket_cost) as "GTS", -- DATA METRIC: GTS
      count(distinct transaction_id) as "# Orders", -- DATA METRIC: # Orders
    from owbruntarget_dw.dw_sales_pipeline_fact sp
    join owbruntarget_dw.dw_genre_dim g on sp.genre_dw_id = g.genre_dw_id -- DISPLAY CATEGORY or FILTER: GCF, Genre
    where 1=1
      -- FILTER: g.genre_cat_final for GCF
      -- FILTER: g.genre_descr for Genre
      AND trunc(src_created_dttm_sale) between :startdate and :enddate
    group by
      g.genre_cat_final, -- DISPLAY CATEGORY: GCF
      g.genre_descr, -- DISPLAY CATEGORY: Genre
      -- DATEG: sp.src_created_dttm_sale ''
• Proved unworkable because of long query execution times, even after incorporating bind variables
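The idea behind the dynamic SQL iteration can be sketched in Python. This is not the template engine described on the slide, which commented lines in and out; it is a minimal sketch that composes the same kind of statement from front-end selections. The table names, column expressions, and the `:startdate` / `:enddate` binds come from the slide; the `CATEGORIES` and `METRICS` lookups, the `build_query` function, and the `:f0`-style filter bind names are hypothetical.

```python
# Hypothetical lookup tables; the column expressions are copied from the slide.
CATEGORIES = {
    "GCF": "g.genre_cat_final",
    "Genre": "g.genre_descr",
}
METRICS = {
    "GTS": "sum(sp.ticket_cost)",
    "# Orders": "count(distinct transaction_id)",
}


def build_query(categories, metrics, filters=None):
    """Return (sql, binds) for the selected categories, metrics, and filters.

    Only whitelisted names reach the SQL text; filter values travel as binds.
    The caller still supplies values for :startdate and :enddate.
    """
    filters = filters or {}
    select_items = [f'{CATEGORIES[c]} as "{c}"' for c in categories]
    select_items += [f'{METRICS[m]} as "{m}"' for m in metrics]

    where = ["trunc(src_created_dttm_sale) between :startdate and :enddate"]
    binds = {}
    for i, (name, value) in enumerate(filters.items()):
        bind = f"f{i}"  # hypothetical bind naming scheme
        where.append(f"{CATEGORIES[name]} = :{bind}")
        binds[bind] = value

    # Simplification: the genre join is always included in this sketch.
    sql = (
        "select " + ", ".join(select_items)
        + " from owbruntarget_dw.dw_sales_pipeline_fact sp"
        + " join owbruntarget_dw.dw_genre_dim g on sp.genre_dw_id = g.genre_dw_id"
        + " where " + " and ".join(where)
    )
    if categories:
        sql += " group by " + ", ".join(CATEGORIES[c] for c in categories)
    return sql, binds
```

Looking selections up in fixed dictionaries keeps user input out of the SQL text, but each combination of selections still sends a new aggregate query to the warehouse, which is where the slide reports the approach broke down.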
Case Study: StubHub • Next iteration: “pandas” dataframes • Open source Python library for data manipulation and analysis • Fast and efficient DataFrame object for data manipulation with integrated indexing • Tools for reading and writing data between in-memory data structures and different formats (e.g. CSV) • For each dashboard, one static query • Tuning + Oracle query optimizer • Retrieve comprehensive data set needed to power the dashboard • Store data in CSV files on network • “Jukebox” functionality: only files needed are loaded into memory for processing • Pandas: http://pandas.pydata.org/pandas-docs/stable/index.html
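The “jukebox” behavior might look roughly like the following. This is a sketch under stated assumptions, not StubHub's implementation: the `Jukebox` class, the one-CSV-per-dashboard layout, and the `<dashboard>.csv` naming are all hypothetical. Only `pandas.read_csv` is taken from the library itself.

```python
from pathlib import Path

import pandas as pd


class Jukebox:
    """Load dashboard extracts on demand and keep them in memory.

    Hypothetical helper: assumes one CSV extract per dashboard,
    stored as <extract_dir>/<dashboard>.csv.
    """

    def __init__(self, extract_dir):
        self.extract_dir = Path(extract_dir)
        self._loaded = {}  # dashboard name -> DataFrame

    def get(self, dashboard):
        # Read the file only the first time this dashboard is requested.
        if dashboard not in self._loaded:
            path = self.extract_dir / f"{dashboard}.csv"
            self._loaded[dashboard] = pd.read_csv(path)
        return self._loaded[dashboard]
```

Extracts that nobody asks for are never read, so memory use follows the dashboards actually in use rather than the total number of files on the network share.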
Case Study: StubHub • Results: • Huge decrease in dashboard run times • Corresponding increase in adoption rate
Case Study: StubHub • Where does the interactivity necessary for data discovery come from? • Template-based front end built with PHP + HTML + CSS + jQuery • Provide different levels of granularity • Decreases amount of time needed to create a new dashboard (vs. Tableau) • Menus control requests for: • Categories → group by • Metrics → aggregate functions • Filters → where clause • Date range • Chart types, date aggregation
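The mapping from menu selections to group by, aggregate functions, and where clause can be illustrated against an in-memory pandas DataFrame. This is a minimal sketch: the `run_dashboard` function, its parameters, and the sample column names (`sale_date`, `genre`, `ticket_cost`, `transaction_id`) are assumptions made for illustration, not names from the deck.

```python
import pandas as pd


def run_dashboard(df, categories, metrics, filters, start, end, date_col="sale_date"):
    """Apply menu selections to an in-memory extract.

    categories: list of columns to group by
    metrics:    dict of output name -> (column, aggregate function name)
    filters:    dict of column -> list of allowed values
    start, end: inclusive date range
    """
    dates = pd.to_datetime(df[date_col])
    mask = dates.between(pd.Timestamp(start), pd.Timestamp(end))  # date range
    for column, allowed in filters.items():                      # "where clause"
        mask &= df[column].isin(allowed)
    aggregations = {
        name: pd.NamedAgg(column=column, aggfunc=func)
        for name, (column, func) in metrics.items()
    }
    return df[mask].groupby(categories).agg(**aggregations).reset_index()


# Hypothetical sample extract and one set of menu selections.
sales = pd.DataFrame({
    "sale_date": ["2014-01-01", "2014-01-02", "2014-02-01", "2014-01-03"],
    "genre": ["Sports", "Sports", "Concerts", "Concerts"],
    "ticket_cost": [100.0, 50.0, 80.0, 20.0],
    "transaction_id": [1, 2, 3, 4],
})
result = run_dashboard(
    sales,
    categories=["genre"],
    metrics={"gts": ("ticket_cost", "sum"), "orders": ("transaction_id", "nunique")},
    filters={"genre": ["Sports", "Concerts"]},
    start="2014-01-01",
    end="2014-01-31",
)
```

Because every selection is just an argument, a new dashboard needs a new extract and a menu definition rather than new aggregation code.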
Case Study: StubHub • How to provide integration between back-end data management and front-end visualization components? • Solution is Data-Driven Documents (D3.js) • JavaScript library to drive the creation and control of dynamic and interactive graphical forms which run in web browsers • W3C-compliant, making use of the widely implemented Scalable Vector Graphics (SVG), JavaScript, HTML5, and Cascading Style Sheets (CSS3) standards • Large data sets can be easily bound to SVG objects using JSON and simple D3 functions to generate charts and diagrams • D3: http://d3js.org/
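On the back-end side, handing data to D3 can be as small as serializing the aggregated result as a JSON array of row objects, a shape that D3's data binding accepts directly. The sketch below assumes the back end holds a pandas DataFrame; the `to_d3_payload` name and the sample columns are hypothetical.

```python
import pandas as pd


def to_d3_payload(df):
    """Serialize a DataFrame as a JSON array with one object per row."""
    return df.to_json(orient="records")


# Hypothetical aggregated result for one chart.
chart_data = pd.DataFrame({"genre": ["Concerts", "Sports"], "gts": [20.0, 150.0]})
payload = to_d3_payload(chart_data)
```

Each object in the array then corresponds to one SVG element (a bar, a point, a slice) once the front end binds it.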
Case Study: StubHub • Summary of approach • Create a collection of BI dashboards that are: • Fast • Customizable • Interactive • Highly visual • On-demand • Scalable • Consistent • Custom build EVERYTHING as needed • Leverage open source technologies whenever possible • Data source agnostic to accommodate new data stores as they become available • Output from MapReduce jobs in CSV format
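Treating MapReduce output as just another CSV source might look like the sketch below. It assumes the job writes headerless CSV part files named `part-*` into a single directory, which is a common Hadoop convention but is not stated in the deck; the `load_job_output` function and its parameters are hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_job_output(output_dir, columns, pattern="part-*"):
    """Combine a job's CSV part files into a single DataFrame.

    Assumes the part files have no header row, so column names are
    supplied by the caller. Raises ValueError if no files match.
    """
    paths = sorted(Path(output_dir).glob(pattern))
    frames = [pd.read_csv(path, header=None, names=columns) for path in paths]
    return pd.concat(frames, ignore_index=True)
```

Once the output is a DataFrame, the dashboards can treat it like any other extract, regardless of which data store produced it.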