Learn how to exploit MySQL 5.1 for advanced business intelligence applications, including data integration, partitioning, clustering, and metadata.
Exploiting MySQL 5.1 For Advanced Business Intelligence Applications Matt Casters: Chief Architect, Data Integration and Kettle Project Founder MySQL User Conference, Tuesday April 24, 2007
Agenda • Great news • Pentaho Data Integration introduction • Pentaho Data Integration MySQL support • Table partitioning • Database partitioning • Clustering • Metadata to the rescue • Pentaho company overview & platform demo • Questions & closing
Great news • The San Francisco Giants beat the Arizona Diamondbacks last Saturday • Barry Zito pitched a great game • The hot dogs were great
Pentaho news • Mondrian 2.3 was released a few weeks ago • Detailed cache control (clearing) • Sun Microsystems & Pentaho announced that Pentaho Reporting will be used to drive Business Intelligence requirements in Open Office 2.3, to be released in the second half of this year. • Tomorrow we’ll make another announcement.
Pentaho Data Integration introduction • From source systems … • to the data warehouse … • to reports … • to analyses … • to dashboard reports … • to better information
Data Transformation and Integration Examples • Data filtering • Is not null, greater than, less than, includes • Field manipulation • Trimming, padding, upper- and lowercase conversion • Data calculations • + - × /, average, absolute value, arctangent, natural logarithm • Date manipulation • First day of month, last day of month, add months, week of year, day of year • Data type conversion • String to number, number to string, date to number • Merging fields & splitting fields • Looking up data • Look up in a database, in a text file, an Excel sheet, …
Pentaho Data Integration (Kettle) Components • Spoon • Connect to data sources • Define transformation rules and design target schema(s) • Graphical job execution workflow engine for defining multi-stage and conditional transformation jobs • Pan • Command-line execution of single, pre-defined transformations • Kitchen • Command-line execution and scheduling of multi-stage jobs • Carte • Remote execution “slave” server with embedded web server • Pentaho BI Platform • Integrated scheduling of transformations or jobs • Ability to call real-time transformations and use output in reports and dashboards
Data Transformation and Integration Examples • Demo time: load 1M rows into a database table
Case Study: Pentaho Data Integration • Organization: Flemish Government Traffic Centre • Use case: Monitoring the state of the road network • Application requirement: Integrate minute-by-minute data from 570 highway locations for analysis (1550 lanes) • Technical challenges: Large volume of data, more than 2.5 billion rows • Business Usage: Users can now compare traffic speeds based on weather conditions, time of day, date, season • Best practices: • Clearly understand business user requirements first • There are often multiple ways to solve data integration problems, so consider the long-term need when choosing the right way
Summary and Resources • Pentaho and MySQL can help you manage your data infrastructure • Extraction, Transformation and Loading for Data Warehousing and Data Migration • kettle.pentaho.org • Kettle project homepage • kettle.javaforge.com • Kettle community website: forum, source, documentation, tech tips, samples, … • www.pentaho.org/download/ • All Pentaho modules, pre-configured with sample data • Developer forums, documentation • Ventana Research Open Source BI Survey • www.mysql.com • White paper - http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.html • Kettle Webinar - http://www.mysql.com/news-and-events/on-demand-webinars/pentaho-2006-09-19.php • Roland Bouman blog on Pentaho Data Integration and MySQL • http://rpbouman.blogspot.com/2006/06/pentaho-data-integration-kettle-turns.html
Pentaho Data Integration MySQL support • Reading: MySQL Result Streaming (cursor emulation) support • Writing: MySQL dialects for data types • Job entry: Bulk Loader of text files for MySQL • Job entry: Bulk writer to a text file for MySQL
Table partitioning • Challenges • More data is being gathered all the time • Data is coming from more sources than ever • Faster access to stored information is becoming more important • More people require concurrent access to the data • The old solution • Split up the data into different tables • SALES_2003 • SALES_2004 • … • SALES_2006 • Hardcode reports to use one or the other table • Lots of database management issues • High report maintenance costs • Nothing is automatic, everything is manual
Table partitioning • The new solution: • MySQL 5.1 partitioned tables • Automatically splits the data into smaller chunks • An SQL engine that automatically pulls results back together
Table partitioning [Diagram: one logical Sales table split into Year 2003, Year 2004, Year 2005 and Year 2006 partitions]
Table partitioning • Advantages • Reduces query time by reducing the amount of data to “plough” through. • Increases performance by • “Pruning” the list of partitions to search automatically. This is done by the MySQL query optimizer based on the query being issued. • Massive reduction in I/O • Smaller partitioned indexes leading to faster index tree traversal • Allowing parallel access to the different partitions
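A quick way to see pruning in action is EXPLAIN PARTITIONS, available from MySQL 5.1. A minimal sketch, assuming the sales table defined on the next slide:

EXPLAIN PARTITIONS
SELECT COUNT(*)
FROM sales
WHERE sales_date >= '2001-01-01'
  AND sales_date <  '2002-01-01';
-- The "partitions" column of the output should list only P2001:
-- the optimizer has pruned every other partition away.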
Table partitioning • How do I create a partitioned table?

CREATE TABLE sales (
  sales_date   TIMESTAMP,
  ordernr      INT,
  customernr   INT,
  productnr    INT,
  discount_pct TINYINT
)
PARTITION BY RANGE( YEAR(sales_date) ) (
  PARTITION P2000 VALUES LESS THAN (2001),
  PARTITION P2001 VALUES LESS THAN (2002)
);
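Once the table exists, rows route themselves to the right partition; no application logic is needed. A minimal sketch (the values are made up for illustration):

INSERT INTO sales VALUES ('2000-06-15 10:30:00', 1001, 42, 7, 5);
-- YEAR(sales_date) = 2000, which is less than 2001, so MySQL stores
-- this row in partition P2000 automatically.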
Table partitioning • How do I add another partition to the table?

ALTER TABLE sales ADD PARTITION (
  PARTITION P2002 VALUES LESS THAN (2003)
);
Table partitioning • How do I drop a partition from the table?

ALTER TABLE sales DROP PARTITION P2002;
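Note that with RANGE partitioning, ADD PARTITION can only append after the last existing range. If the table was created with a catch-all partition (a hypothetical PMAX here, declared VALUES LESS THAN MAXVALUE), you split the catch-all instead. A sketch:

ALTER TABLE sales REORGANIZE PARTITION PMAX INTO (
  PARTITION P2003 VALUES LESS THAN (2004),
  PARTITION PMAX  VALUES LESS THAN MAXVALUE
);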
Table partitioning • Retrieving partition information mysql> desc information_schema.partitions; +-------------------------------+--------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-------------------------------+--------------+------+-----+---------+-------+ | TABLE_CATALOG | varchar(512) | YES | | NULL | | | TABLE_SCHEMA | varchar(64) | NO | | | | | TABLE_NAME | varchar(64) | NO | | | | | PARTITION_NAME | varchar(64) | YES | | NULL | | | SUBPARTITION_NAME | varchar(64) | YES | | NULL | | | PARTITION_ORDINAL_POSITION | bigint(21) | YES | | NULL | | | SUBPARTITION_ORDINAL_POSITION | bigint(21) | YES | | NULL | | | PARTITION_METHOD | varchar(12) | YES | | NULL | | | SUBPARTITION_METHOD | varchar(12) | YES | | NULL | | | PARTITION_EXPRESSION | longtext | YES | | NULL | | | SUBPARTITION_EXPRESSION | longtext | YES | | NULL | | | PARTITION_DESCRIPTION | longtext | YES | | NULL | | | TABLE_ROWS | bigint(21) | NO | | 0 | | | AVG_ROW_LENGTH | bigint(21) | NO | | 0 | | | DATA_LENGTH | bigint(21) | NO | | 0 | | | MAX_DATA_LENGTH | bigint(21) | YES | | NULL | | | INDEX_LENGTH | bigint(21) | NO | | 0 | | | DATA_FREE | bigint(21) | NO | | 0 | | | CREATE_TIME | datetime | YES | | NULL | | | UPDATE_TIME | datetime | YES | | NULL | | | CHECK_TIME | datetime | YES | | NULL | | | CHECKSUM | bigint(21) | YES | | NULL | | | PARTITION_COMMENT | varchar(80) | NO | | | | | NODEGROUP | varchar(12) | NO | | | | | TABLESPACE_NAME | varchar(64) | YES | | NULL | | +-------------------------------+--------------+------+-----+---------+-------+
Table partitioning • Retrieving partition information

SELECT partition_name
FROM information_schema.partitions
WHERE table_name = 'sales';

+----------------+ | partition_name | +----------------+ | P2000 | | P2001 | | P2002 | | P2003 | | P2004 | | P2005 | | P2006 | | P2007 | | P2008 | | P2009 | | P2010 | | P2011 | +----------------+
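The same system view also tells you how the data is spread out, which is handy when checking whether the partitions are balanced. A minimal sketch:

SELECT partition_name, table_rows, data_length
FROM information_schema.partitions
WHERE table_schema = DATABASE()
  AND table_name = 'sales';
-- One row per partition, with the (approximate) row count and the
-- on-disk data size of each partition.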
Table partitioning • Use case: Flemish Traffic Centre • Medium-sized data set: 514M rows • Partitioned by week • PARTITION BY RANGE ( YEAR(mtime)*100 + WEEK(mtime) ) • 80GB of data in hundreds of partitions • 1 row of data: speed, number of vehicles (5 classes), etc. • Per minute • Per lane of traffic • For 1550 detectors • 815M rows of data per year • A test query aggregating speed & counts per road position per minute • Returned 642,319 rows • 9 seconds for MySQL to get the result • 29 seconds to pass the data to the client over JDBC
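For illustration, the week-based scheme could look like the sketch below. The table and column names (traffic, mtime, detector_id, lane, speed, vehicle_count) are assumptions, not the Traffic Centre's actual schema:

CREATE TABLE traffic (
  mtime         DATETIME NOT NULL,  -- measurement minute
  detector_id   INT NOT NULL,       -- one of the 1550 detectors
  lane          TINYINT NOT NULL,
  speed         SMALLINT,
  vehicle_count SMALLINT
)
PARTITION BY RANGE ( YEAR(mtime)*100 + WEEK(mtime) ) (
  PARTITION P200701 VALUES LESS THAN (200702),
  PARTITION P200702 VALUES LESS THAN (200703)
  -- ... and so on, one partition per week
);

-- A test query in the spirit of the one above: aggregate speed and
-- vehicle counts per detector per minute.
SELECT detector_id, mtime, AVG(speed), SUM(vehicle_count)
FROM traffic
GROUP BY detector_id, mtime;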
Table partitioning • Should we do table partitioning? Yes! • However… • Even though you can do sub-partitioning, you must partition the data in one particular way: by range, by list, … • The entry point for optimization is always going to be that partitioning method • Pruning is always going to be based on that partitioning method • Queries across the whole data set don’t always benefit (though they aren’t penalized either) • More detailed information: Partitioning in MySQL 5.1 and Onwards Mikael Ronström, Senior Software Architect, MySQL AB Tomorrow 11:50am – 12:35pm, Ballroom A • How can we improve upon this?
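The flip side of that limitation, as a sketch against the sales table from earlier: a filter that does not touch the partitioning expression cannot be pruned.

EXPLAIN PARTITIONS
SELECT COUNT(*)
FROM sales
WHERE customernr = 42;
-- The "partitions" column lists every partition: customernr is not
-- part of the partitioning expression, so all partitions are scanned.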
Database partitioning • Challenge: • Dealing with even greater data sets • The need to present results even quicker • The old solution • Large boxes • Huge amounts of RAM • Very fast disk systems (Fibre attached, SAN, …) • Multi-CPU monsters $$$ Very expensive and complicated systems $$$ • The new solution: • 2 or more MySQL servers • Basic, off-the-shelf hardware • Standard software • Simple configurations • Pentaho Data Integration & MySQL 5.1
Database partitioning [Diagram: the Sales table split across four databases, DB1–DB4; each database holds its own slice of the Year 2003–2006 partitions]
Database partitioning • Split the data over several databases • Pick a splitting mechanism in advance • Divide the load over the databases • Make the databases smaller • Increase the query performance • Increase the load performance
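The idea in SQL terms, as a sketch: every server holds an identical table, and the splitting rule decides which server gets which row. The MOD-on-detector rule and server names (db0, db1, db2) below are assumptions for illustration; in practice the routing is done by Kettle's partitioning schema, not by SQL:

-- On each of the 3 servers (db0, db1, db2), create the identical table:
CREATE TABLE traffic_part (
  mtime       DATETIME NOT NULL,
  detector_id INT NOT NULL,
  speed       SMALLINT
);
-- Kettle then sends each row to server number MOD(detector_id, 3).
-- A query for a single detector touches exactly one server, while a
-- full aggregation runs on all three servers in parallel.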
Database partitioning • How do you set it up? • First create a partitioned / clustered database connection • Then create a Partitioning Schema
Database partitioning • Demo time: • Creating the partitioned tables • Loading data partitioned • Reading back data partitioned • Reading back data partitioned and ordered
Database partitioning • Use case: Flemish Traffic Centre • Medium-sized data set: 514M rows • Same partitioning as before • Same table layout, engine and configuration • Partitioned over 3 databases on detector (per lane) • A test query aggregating speed & counts per road position per minute • Returned 642,319 rows • 3 seconds for MySQL to get the result • 10 seconds to pass the data to the client over JDBC • Demonstrating almost linear scalability!
Database partitioning • Should we do database partitioning? Yes! • However… • The scalability is limited to the raw CPU power that is available on a single system. • This poses a scalability problem • when dealing with larger numbers of partitions • when reading back large volumes of data Enter Pentaho Data Integration: Clustering
Pentaho Data Integration : Clustering [Diagram: a single server X reading the Year 2003–2006 partitions of the Sales table from database DBx]
Pentaho Data Integration : Clustering • The challenge is to eliminate CPU bottlenecks • Pentaho Data Integration “Carte” comes to the rescue • Can be started on any system, turning it into a “slave server” • The combination of database partitioning and PDI clustering: • Will split the database partitions among the slave servers • Will split the total CPU load over the slave servers • Will split the total I/O load over the database partitions (MySQL instances)
Pentaho Data Integration : Clustering • Demo time • Start up 2 slave servers • Run a step across the 2 servers • Monitor
Pentaho Metadata to the rescue • The problem: • Reporting becomes harder on a database-partitioned system • Instead of 1 database, you read from a bunch of them • This is where our new project, Pentaho Metadata, comes in • Due in a few months • Auto-generates SQL now • Will auto-generate PDI transformations in the near future • Will auto-deliver the data to your report in the platform • Will make it very easy for your users to create drag-and-drop reports
Pentaho Introduction • World’s most popular enterprise open source BI Suite • 2 million lifetime downloads, averaging 100K / month • Founded in 2004: Pioneer in professional open source BI • Key Projects • JFreeReport Reporting • Kettle Data Integration • Mondrian OLAP • Pentaho BI Platform • Weka Data Mining • Management and Board • Proven BI veterans from Business Objects, Cognos, Hyperion, SAS, Oracle • Open source leaders - Larry Augustin, New Enterprise Associates, Index Ventures • MySQL Gold Partner
Pentaho Introduction [Diagram: BI pyramid from operational sources (Sales, Marketing, Inventory, Production, Financial) up through departmental reports, aggregates and analysis to strategic scorecards]
Pentaho Introduction • Demo and overview
Birds of a Feather • MySQL Data Warehousing and BI • Tonight at 7:30 • Camino Real • With: • Chief Geek James Dixon: Senior architect & CTO, Pentaho • Julian Hyde, OLAP Architect and Mondrian Project Founder, Pentaho • Brian Miezejewski, Principal Consultant, MySQL • Roland Bouman, Certification Developer, MySQL • MySELF™ • A few bottles of beer