440 likes | 654 Views
Building Scalable OLAP Applications with Mondrian and MySQL. Julian Hyde: Chief Architect, OLAP, at Pentaho and Mondrian Project Founder Tuesday, April 24th 2007. Agenda. Pentaho Introduction What is OLAP? Key features and architecture Getting started Schemas & queries
E N D
Building Scalable OLAP Applications with Mondrian and MySQL Julian Hyde: Chief Architect, OLAP, at Pentaho and Mondrian Project Founder Tuesday, April 24th 2007
Agenda • Pentaho Introduction • What is OLAP? • Key features and architecture • Getting started • Schemas & queries • Tuning & scalability • Active Data Warehousing • Case Studies • Business Intelligence suite • Q & A
Pentaho Introduction • World’s most popular enterprise open source BI Suite • 2 million lifetime downloads, averaging 100K / month • Founded in 2004: Pioneer in professional open source BI • Management - proven BI and open source veterans • from Business Objects, Cognos, Hyperion, JBoss, Oracle, Red Hat, SAS • Board of Directors – deep expertise and proven success in open source • Larry Augustin - founder, VA Software, helped coin the phrase “open source” • New Enterprise Associates – investors in SugarCRM, Xensource, others • Index Ventures – investors in MySQL, Zend, others • Widely recognized as the leader in open source BI • Distributed worldwide by Red Hat via the Red Hat Exchange • Embedded in next release of OpenOffice (40 million users worldwide)
What is OLAP? • View data “dimensionally” • i.e. Sales by region, by channel, by time period • Navigate and explore • Ad Hoc analysis • “Drill-down” from year to quarter • Pivot • Select specific members for analysis • Interact with high performance • Technology optimized for rapid interactive response
A brief history of OLAP 2010 Pentaho BEE Open-source BI suites JRubik OpenI JPivot Palo Open-source OLAP Mondrian MySQL 2000 Microsoft OLAP Services Oracle partitions,bitmap indexes RDBMS support for DW Informix MetaCube Codd’s “12Rules of OLAP” Sybase IQ Business Objects 1990 Desktop OLAP MicroStrategy Essbase Cognos PowerPlay Spreadsheets 1980 Comshare Mainframe OLAP Express 1970 APL
Key Features • On-Line Analytical Processing (OLAP) cubes • automated aggregation • speed-of-thought response times • Open Architecture • 100% Java • J2EE • Supports any JDBC data source • MDX and XML/A • Analysis Viewers • Enables ad-hoc, interactive data exploration • Ability to slice-and-dice, drill-down, and pivot • Provides insights into problems or successes
MySQL Provides Data storage SQL query execution Heavy-duty sorting, correlation, aggregation Integration point for all BI tools Mondrian Provides Dimensional view of data MDX parsing SQL generation Caching Higher-level calculations Aggregate awareness How Mondrian Extends MySQL for OLAP Applications
Microsoft Excel(via Spreadsheet Services) Open Architecture Viewers • Open Standards (Java, XML, MDX, XML/A, SQL) • Cross Platform (Windows & Unix/Linux) • J2EE Architecture • Server Clustering • Fault Tolerance • Data Sources • JDBC • JNDI Web Server JPivot servlet J2EE Application Server File or RDBMS Repository JPivot servlet XML/A servlet Mondrian Cube Schema XML Cube Schema XML Cube Schema XML cube cube cube JDBC JDBC JDBC RDBMS RDBMS
Getting started with Mondrian • Download mondrian from http://mondrian.pentaho.org and unzip • Copy mondrian.war to TOMCAT_HOME/webapps. • Create a database: $mysqladmin create foodmart$mysqlmysql>grant all privileges on *.* to 'foodmart'@'localhost' identified by 'foodmart';Query OK, 0 rows affected (0.00 sec)mysql>quitBye • Install demo dataset (300,000 rows): $ java -cp "/mondrian/lib/mondrian.jar:/mondrian/lib/log4j-1.2.9.jar:/mondrian/lib/eigenbase-xom.jar:/mondrian/lib/eigenbase-resgen.jar:/mondrian/lib/eigenbase-properties.jar:/usr/local/mysql/mysql-connector-java-3.1.12-bin.jar" mondrian.test.loader.MondrianFoodMartLoader -verbose -tables -data -indexes -jdbcDrivers=com.mysql.jdbc.Driver -inputFile=/mondrian/demo/FoodMartCreateData.sql -outputJdbcURL= "jdbc:mysql://localhost/foodmart?user=foodmart&password=foodmart“ • Start tomcat, and hit http://localhost:8080/mondrian
A Mondrian schema consists of… A dimensional model (logical) • Cubes & virtual cubes • Shared & private dimensions • Calculated measures in cube and in query language • Parent-child hierarchies … mapped onto a star/snowflake schema (physical) • Fact table • Dimension tables • Joined by foreign key relationships
Writing a Mondrian Schema • Regular cubes, dimensions, hierarchies • Shared dimensions • Virtual cubes • Parent-child hierarchies • Custom readers • Access-control <!-- Shared dimensions --> <Dimension name="Region"> <Hierarchy hasAll="true“ allMemberName="All Regions"> <Table name="QUADRANT_ACTUALS"/> <Level name="Region" column="REGION“ uniqueMembers="true"/> </Hierarchy> </Dimension> <Dimension name="Department"> <Hierarchy hasAll="true“ allMemberName="All Departments"> <Table name="QUADRANT_ACTUALS"/> <Level name="Department“ column="DEPARTMENT“ uniqueMembers="true"/> </Hierarchy> </Dimension> • (Refer to http://mondrian.pentaho.org/documentation/schema.php )
Tools • Schema Workbench • Pentaho cube designer • cmdrunner
MDX – Multi-Dimensional Expressions • A language for multidimensional queries • Plays the same role in Mondrian’s API as SQL does in JDBC • SQL-like syntax • … but un-SQL-like semantics SELECT {[Measures].[Unit Sales]} ON COLUMNS, {[Store].[USA], [Store].[USA].[CA]} ON ROWS FROM [Sales] WHERE [Time].[1997].[Q1] • (Refer to http://mondrian.pentaho.org/documentation/mdx.php )
Tuning is a Process Some techniques: • Test performance on a realistic data set! • Use tracing to identify long-running SQL statements • mondrian.trace.level=2 • Tune SQL queries • Run the same query several times, to examine cache effectiveness • When Mondrian starts, prime the cache by running queries • Get to know Mondrian’s system properties
Tuning MySQL for Mondrian • Fact table: • Index foreign keys • Try indexing multiple combinations of foreign keys • Use MyISAM or MERGE (also known as MRG_MyISAM) • Partitioned tables • Dimension tables: • Index primary and foreign keys • Use MyISAM for large dimension tables, MEMORY for smaller dimension tables • Create aggregate tables: • Contain pre-computed summaries • Prevent full table scan followed by massive GROUP BY • Parent-child hierarchies • Create closure tables
Scaling to large datasets • Tune MySQL • Partitions allow parallelism • Aggregate tables allow you to do the hard work to load-time, not runtime • Increase cache size
Viewers Viewers Viewers Web Server Web Server JPivot servlet JPivot servlet Web Server Web Server Web Server Web Server Web Server Web Server Web Server Web Server JPivot servlet JPivot servlet JPivot servlet JPivot servlet JPivot servlet JPivot servlet JPivot servlet JPivot servlet J2EE Application Server J2EE Application Server J2EE Application Server J2EE Application Server XML/A servlet JPivot servlet J2EE Application Server J2EE Application Server XML/A servlet XML/A servlet Mondrian Mondrian XML/A servlet XML/A servlet cube cube Mondrian Mondrian Mondrian Mondrian cube cube cube cube JDBC JDBC JDBC JDBC JDBC JDBC RDBMS RDBMS RDBMS RDBMS RDBMS Scaling to large numbers of concurrent users • Tune MySQL • Increase JVM memory • Make sure that connections are using the same cache • Multiple web servers • Multiple mondrian instances • Multiple DBMS instances with read-only copies of the data
Active Data Warehousing An Active Data Warehouse: • Extends the traditional DW to support operational processing • Provides a single view of the business • Is updated in near real-time Cost/benefit: • High cost (technical challenges) • High reward (time is money)
Technical challenges of Active Data Warehousing • DBMS has mixed work-load • Short, fast queries to query specific records • Longer queries to look at changing patterns • Emphasis on recent data • Background DML to keep database up to date • Continuous, incremental ETL • Continuous or near real-time • Incremental • Consider message-driven ETL • OLAP cache • OLAP servers use cache and/or MOLAP database for performance • Cache consistency
Mondrian and Active Data Warehouse • “ROLAP with caching” • No MOLAP data store to maintain • Disable aggregate tables • Unless your ETL process can maintain these incrementally • Either turn off cache • mondrian.rolap.star.disableCaching=true • “Pure ROLAP” mode • Results are cached for the lifetime of a single MDX statement • Or use new CacheControl API to selectively flush mondrian’s cache
Cache control • In an Active Warehouse, the most recent data are modified much more often than historic data • Cache control API allows the application to selectively flush the cache • (Refer to http://mondrian.pentaho.org/documentation/cache_control.php )
Cache control: A simple example • Let’s run a simple MDX query: SELECT{[Time].[1997], [Time].[1997].[Q1], [Time].[1997].[Q2]} ON COLUMNS,{[Customers].[USA], [Customers].[USA].[OR], [Customers].[USA].[WA]} ON ROWSFROM [Sales]
Cache contents after running query • year=1997, nation=USA, measure=unit_sales • year=1997, nation=USA, state={OR, WA}, measure=unit_sales • year=1997, quarter=any, nation=USA, measure=unit_sales • year=1997, quarter=*, nation=USA,state={OR, WA}, measure=unit_sales
Flushing part of a segment Cachemanager • year=1997, quarter=*, nation=USA,state={OR, WA}, measure=unit_sales • year=1997, quarter=*, nation=USA,state={OR, WA}, measure=unit_sales;exclude: state=WA and quarter=Q2 • year=1997, quarter=*, nation=USA,state={CA, WA}, measure=unit_sales
Code using CacheControl to flush part of a segment Connection connection; CacheControl cacheControl = connection.getCacheControl(null); Cube salesCube = connection.getSchema().lookupCube("Sales", true); SchemaReader schemaReader = salesCube.getSchemaReader(null); // Create region containing [Customer].[USA].[OR]. Member memberOregon = schemaReader.getMemberByUniqueName( new String[]{“Customer", “USA", “OR"}, true); CacheControl.CellRegion regionOregon = cacheControl.createMemberRegion(memberOregon, true); // Create region containing [Customer].[USA].[OR]. Member memberTimeQ2 = schemaReader.getMemberByUniqueName( new String[]{“Time", “1997", “Q2"}, true); CacheControl.CellRegion regionTimeQ2 = cacheControl.createMemberRegion(memberTimeQ2, true); // Create region containing ([Customer].[USA].[OR], [Time].[1997].[Q2]) CacheControl.CellRegion regionOregonQ2 = cacheControl.createCrossjoinRegion(regionOregon, regionTimeQ2); cacheControl.flush(regionOregonQ2);
Case Study: Frontier Airlines Frontier Airlines Key Challenges • Understanding and optimizing fares to ensure • Maximum occupancy (no empty seats) • Maximum profitability (revenue per seat) Pentaho Solution • Pentaho Analysis (Mondrian) • Chose Open Source RDBMS and Mondrian over Oracle • 500 GB of data, 6 server cluster Results • Comprehensive, integrated analysis to set strategic pricing • Improved per-seat profitability (amount not disclosed) Why Pentaho • Rich analytical and MDX functionality • Cost of ownership “The competition is intense in the airline industry and Frontier is committed to staying ahead of the curve by leveraging technology that will help us offer the best prices and the best flight experience…. [the application] fits right in with our philosophy of providing world-class performance at a low price.”
Case Study: U.S.-based, Public Software Company OEM Example Key Requirements • Embedding analysis within a pricing planning application Incumbent Vendor • MicroStrategy Decision Criteria • Embeddability • Extensibility • Cost of ownership Best Practices • Evaluate replacement costs holistically • Treat migrations as an opportunity to improve a deployment, not just move it • Understand and prioritize requirements – functionality, usability, architecture, etc. “The old solution (MicroStrategy) was built to be a standalone application. It expected to own the browser session, to have separate security, and its own user interface standards. We were able to make Mondrian a seamless part of the application, and it was much easier to do so.”
Business Intelligence Suite • Mondrian OLAP • Analysis tools: • Pivot table • Charting • Dashboards • ETL (extract/transform/load) • Integration with operational reporting • Integration with data mining • Actions on operational data • Design/tuning tools
Pentaho Open Source BI Offerings All available in a Free Open Source license
A Sample of Joint MySQL-Pentaho Users “Pentaho provided a robust, open source platform for our sales reporting application, and the ongoing support we needed. The experts at OpenBI provided outstanding services and training, and allowed us to deploy and start generating results very quickly.” “We selected Pentaho for its ease-of-use. Pentaho addressed many of our requirements -- from reporting and analysis to dashboards, OLAP and ETL, and offered our business users the Excel-based access that they wanted.”
Next Steps and Resources • Contact Information • Julian Hyde, Chief Architect, jhyde@pentaho.org • More information http://www.pentaho.org and http://mondrian.pentaho.org • Pentaho Community Forum http://community.pentaho.org • Go to Developer Zone • Discussions • Pentaho BI Platform including Mondrian http://www.pentaho.org/download/latest • Mondrian OLAP Library only http://sourceforge.net/project/showfiles.php?group_id=35302
Birds of a Feather session:MySQL Data Warehousing and BI Speakers: • James Dixon, Senior Architect and Chief Technology (a.k.a. "Chief Geek"), Pentaho Corporation • Roland Bouman, Certification Developer, MySQL A.B. • Matt Casters, Data Integration Architect and Kettle Project Founder, Pentaho • Julian Hyde, OLAP Architect and Mondrian Project Founder, Pentaho • Brian Miezejewski, Principal Consultant, MySQL inc. Track: Data Warehousing and Business Intelligence Date: TONIGHT!! Time: 7:30pm - 9:00pm Location: Camino Real