Addressing Data Chaos: Using MySQL and Kettle to Deliver World-Class Data Warehouses Matt Casters: Chief Architect, Data Integration and Kettle Project Founder MySQL User Conference, Wednesday April 25, 2007
Agenda • Big News • Data Integration challenges and open source BI adoption • Pentaho company overview • Pentaho Data Integration Fundamentals • Schema design • Kettle basics • Demonstration • Resources and links
Announcing Pentaho Data Integration 2.5.0 • Again we offer big improvements over smash-hit version 2.4.0 • Advanced error handling • Tight Apache VFS integration: lets us load and save files directly from any location: file systems, web servers, FTP sites, ZIP files, tar files, etc. • Dimension key caching, dramatically improving speed • A slew of new job entries and steps (including MySQL bulk operations; see the example below) • Hundreds of bug fixes
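For context on the MySQL bulk operations mentioned above: the bulk-loader steps build on MySQL statements of the following shape. This is a generic sketch; the file path, table name, and delimiters are illustrative assumptions, not taken from the slides.

```sql
-- MySQL bulk load: the kind of statement the new bulk-loader steps build on.
-- File path, target table, and delimiters are illustrative assumptions.
LOAD DATA INFILE '/tmp/customers.csv'
INTO TABLE dim_customer
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';
```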
Managing Data Chaos: Data Integration Challenges • [Diagram: Customer Order History, Customer Service History, Marketing Campaigns, Acquired System, and XML sources feeding the Data Warehouse] • Data is everywhere • Customer order information in one system, customer service information in another • Data is inconsistent • The record of the customer is different in each system • Performance is an issue • Running queries to summarize 3 years of data in the operational system takes forever • AND it brings the operational system to its knees • The data is never ALL in the data warehouse • Acquisitions, Excel spreadsheets, new applications
How Pentaho Extends MySQL with ETL • MySQL provides: data storage • SQL query execution • heavy-duty sorting, correlation, aggregation • integration point for all BI tools • Kettle provides: data extraction, transformation, and loading • dimensional modeling • SQL generation • aggregate creation • data enrichment / calculations • data migration
Sample Companies that Use MySQL and Kettle from Pentaho • “With professional support and world-class ETL from Pentaho, we've been able to simplify our IT environment and lower our costs. We were also surprised at how much faster Pentaho Data Integration was than our prior solution.” • “We selected Pentaho for its ease of use. Pentaho addressed many of our requirements -- from reporting and analysis to dashboards, OLAP and ETL, and offered our business users the Excel-based access that they wanted.”
Other Kettle Users • “We chose Pentaho because it has a full range of functionality, exceptional flexibility, and a low total cost of ownership because of its open source business model. We can start delivering value to our business users quickly with embedded, web-based reporting, while integrating our disparate data sources for more strategic benefits down the road.” • And thousands more…
Pentaho Introduction • World's most popular enterprise open source BI suite • 2 million lifetime downloads, averaging 100K / month • Founded in 2004: pioneer in professional open source BI • Key projects: JFreeReport (reporting), Kettle (data integration), Mondrian (OLAP), Pentaho BI Platform, Weka (data mining) • Management and board: proven BI veterans from Business Objects, Cognos, Hyperion, SAS, Oracle • Open source leaders - Larry Augustin, New Enterprise Associates, Index Ventures • MySQL Gold Partner
Overview: Data Warehouse Data Flow • From source systems … • to the data warehouse … • to reports … • to analyses … • to dashboard reports … • to better information
Pentaho Introduction • [Diagram: BI pyramid from operational data (Sales, Marketing, Inventory, Production, Financial) through departmental reports, aggregates, and analysis up to strategic scorecards]
The star schema: a new data model is needed • Because data from various sources is “mixed”, we need to design a new data model: a star schema. • A star schema is designed based on the requirements and populated by the ETL engine. • During modeling we split the requirements into facts and dimensions.
The star schema: a new data model is needed • After grouping the dimension attributes by subject we get our data model, for example: [Diagram: an Order Line fact table surrounded by Date, Customer, Product, and Order dimensions]
Overview: A new data model is needed • The fact table contains ONLY facts and dimension technical keys
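As a minimal MySQL sketch of the Order Line model above (table and column names are illustrative assumptions, not output from the slides or from Kettle), the fact table might look like this:

```sql
-- A minimal sketch of the Order Line fact table: only technical keys
-- into the dimensions, plus the facts themselves. Names are assumptions.
CREATE TABLE fact_order_line (
  date_tk     INT NOT NULL,     -- technical key into the date dimension
  customer_tk INT NOT NULL,     -- technical key into the customer dimension
  product_tk  INT NOT NULL,     -- technical key into the product dimension
  order_tk    INT NOT NULL,     -- technical key into the order dimension
  quantity    INT,              -- fact: units ordered
  revenue     DECIMAL(12,2)     -- fact: order line revenue
);
```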
Overview: A new data model is needed • The dimensions contain technical fields, typically like in this customer dimension entry for customer_id = 100. • If the address changes (at time T1), we get a new entry in the dimension. This is called a Ralph Kimball type II dimension update. (NAL = Name, Address & Location)
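A hedged sketch of what a type II update looks like in SQL, assuming a dim_customer table with validity dates and a technical key (all names and values here are hypothetical): the current row is closed at T1 and a new version is inserted, so facts recorded before T1 keep pointing at the old technical key.

```sql
-- Hypothetical type II update for customer_id = 100 (all names assumed).
-- Close the current row at time T1...
UPDATE dim_customer
   SET date_to = '2007-04-25'
 WHERE customer_id = 100
   AND date_to IS NULL;

-- ...then insert a new version with a fresh technical key.
INSERT INTO dim_customer
  (customer_tk, customer_id, name, address, location, date_from, date_to)
VALUES
  (101, 100, 'J. Doe', '12 New Street', 'Brussels', '2007-04-25', NULL);
```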
Overview: A new data model is needed • If the birth_date changes, we update all entries in the dimension. This is called a Ralph Kimball type I dimension update. (NAL = Name, Address & Location)
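The type I case is simpler: the attribute is overwritten in every version of the customer's rows, so no history is kept for it. A sketch with the same hypothetical table and names:

```sql
-- Hypothetical type I update: overwrite birth_date in every version of
-- the customer's rows; no history is kept for this attribute.
UPDATE dim_customer
   SET birth_date = '1970-01-01'
 WHERE customer_id = 100;
```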
Implications • We are making it easier to create reports by using star schemas • We are shifting work from the reporting side to the ETL • We need a good toolset to do ETL because of the complexities • We need to turn everything upside down • … and this is where Pentaho Data Integration comes in.
Data Transformation and Integration Examples • Data filtering • Is not null, greater than, less than, includes • Field manipulation • Trimming, padding, upper- and lowercase conversion • Data calculations • + - × /, average, absolute value, arctangent, natural logarithm • Date manipulation • First day of month, last day of month, add months, week of year, day of year • Data type conversion • String to number, number to string, date to number • Merging fields & splitting fields • Data lookup • Look up in a database, in a text file, in an Excel sheet, … (a few of these are illustrated below)
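To make the list concrete, here are a few of these transformations written as MySQL expressions. Kettle performs them in its own dedicated steps; this is only an illustration, and the table and column names are assumptions.

```sql
-- A few of the transformations above, as MySQL expressions (names assumed).
SELECT TRIM(customer_name)                AS trimmed_name,       -- field manipulation
       UPPER(country_code)                AS country_upper,      -- case conversion
       LAST_DAY(order_date)               AS last_day_of_month,  -- date manipulation
       CAST(order_total AS CHAR)          AS total_as_string,    -- type conversion
       CONCAT(first_name, ' ', last_name) AS full_name           -- merging fields
  FROM orders;
```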
Pentaho Data Integration (Kettle) Components • Spoon • Connect to data sources • Define transformation rules and design target schema(s) • Graphical job execution workflow engine for defining multi-stage and conditional transformation jobs • Pan • Command-line execution of single, pre-defined transformation jobs • Kitchen • Scheduler for multi-stage jobs • Pentaho BI Platform • Integrated scheduling of transformations or jobs • Ability to call real-time transformations and use output in reports and dashboards
Demonstration • Create a MySQL database and repository • Create dimensions • Create facts • Auditing & incremental loading (sketched below) • Jobs
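A common incremental-loading pattern, given here as a sketch rather than the demo script itself: pull only the rows changed since the previous successful load, using an audit table maintained by the ETL job. The table and column names are assumptions.

```sql
-- Hypothetical incremental extract: only rows modified since the last
-- successful load. etl_audit would be maintained by the ETL job itself.
SELECT *
  FROM source_orders
 WHERE last_modified > (SELECT MAX(load_end_date)
                          FROM etl_audit
                         WHERE status = 'OK');
```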
Case Study: Pentaho Data Integration • Organization: Flemish Government Traffic Centre • Use case: Monitoring the state of the road network • Application requirement: Integrate minute-by-minute data from 570 highway locations for analysis • Technical challenges: Large volume of data, more than 2.5 billion rows • Business Usage: Users can now compare traffic speeds based on weather conditions, time of day, date, season • Best practices: • Clearly understand business user requirements first • There are often multiple ways to solve data integration problems, so consider the long-term need when choosing the right way
Case Study: Replacement of Proprietary Data Integration • Organization: Large, public, North American-based genetics and pharmaceutical research firm • Application requirement: Data warehouse for analysis of patient trials and research spending • Incumbent BI vendor: Oracle (Oracle Warehouse Builder) • Decision criteria: Ease of use, openness, cost of ownership • “It was so much quicker and easier to do the things we wanted to do, and so much easier to maintain when our users’ business requirements change.” • Best practices: • Evaluate replacement costs holistically • Treat migrations as an opportunity to improve a deployment, not just move it • Good deployments are iterative and evolve regularly – if users like what you give them, they will probably ask for more
Summary and Resources • Pentaho and MySQL can help you manage your data infrastructure • Extraction, transformation and loading for data warehousing and data migration • kettle.pentaho.org • Kettle project homepage • kettle.javaforge.com • Kettle community website: forum, source, documentation, tech tips, samples, … • www.pentaho.org/download/ • All Pentaho modules, pre-configured with sample data • Developer forums, documentation • Ventana Research Open Source BI Survey • www.mysql.com • White paper - http://dev.mysql.com/tech-resources/articles/mysql_5.0_pentaho.html • Kettle webinar - http://www.mysql.com/news-and-events/on-demand-webinars/pentaho-2006-09-19.php • Roland Bouman's blog on Pentaho Data Integration and MySQL • http://rpbouman.blogspot.com/2006/06/pentaho-data-integration-kettle-turns.html