1 / 31

MySQL Data Warehousing Survival Guide

Learn about efficiently managing MySQL data warehousing from server setup to ETL design. Find valuable insights, lessons learned, and tips for survival.

mmustafa
Download Presentation

MySQL Data Warehousing Survival Guide

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. MySQL Data Warehousing Survival Guide Marius Moscovici (marius@metricinsights.com) Steffan Mejia (steffan@lindenlab.com)

  2. Topics • The size of the beast • Evolution of a Warehouse • Lessons Learned • Survival Tips • Q&A

  3. Size of the beast • 43 Servers • 36 active • 7 standby spares • 16 TB of data in MySQL • 12 TB archived (pre S3 staging) • 4 TB archived (S3)  • 3.5B rows in main warehouse  • Largest table ~ 500M rows (MySQL)

  4. Warehouse Evolution - First came slaving • Problems: • Reporting slaves easily fall    behind • Reporting limited to one-pass SQL

  5. Warehouse Evolution - Then came temp tables • Problems: • Easy to lock replication with temp table creation  • Slaving becomes fragile

  6. Warehouse Evolution - A Warehouse is Born • Problems: • Warehouse workload limited by what can be performed by a single server

  7. Warehouse Evolution - Workload Distributed • Problems: • No Real-Time Application integration support

  8. Warehouse Evolution - Integrate Real Time Data

  9. Lessons Learned - Warehouse Design Workload exceeds available memory

  10. Lessons Learned - Warehouse Design • Keep joins < available memory • Heavily Denormalize data for effective reporting • Minimize joins between large tables • Aggressively archive historical data 

  11. Lessons Learned - Data Movement • Mysqldump is your friend • Sequence parent/child data loads based on ETL assumptions • Orders without order lines • Order lines without orders • Data Movement Use Cases • Full • Incremental • Upsert (Insert on duplicate key update)

  12. Full Table Loads • Good for small tables • Works for tables with no primary key  • Data is fully replaced on each load

  13. Incremental Loads • Table contains new rows but no updates • Good for insert-only tables • High-water mark level included in Mysqldump where clause

  14. Upsert Loads • Table contains new and updated rows • Table must have primary key • Can be used to update only subset of columns

  15. Lessons Learned - ETL Design • Avoid large joins like the plague • Break out ETL jobs into bite-size-bites • Ensure target data integrity on ETL failure • Use memory staging tables to boost performance 

  16. ETL Design - Sample Problem Build a daily summary of customer event log activity

  17. ETL Design - Sample Solution

  18. ETL Pseudo code - Step 1 1) Create staging table & Find High Water Mark: SELECT IFNULL(MAX(calendar_date),'2000-01-01') INTO @last_loaded_date  FROM user_event_log_summary; set max_heap_table_size = <big enough number to hold several days data> CREATE TEMPORARY TABLE user_event_log_summary_staging (.....) ENGINE = MEMORY; CREATE INDEX user_idx   USING HASH on user_event_log_summary_staging(user_id);

  19. ETL Pseudo code - Step 2 2) Summarize events: INSERT INTO user_event_log_summary_staging ( calendar_date,  user_id,  event_type,  event_count) SELECT  DATE(event_time),  user_id,  event_type,  COUNT(*) FROM event_log WHERE event_time > CONCAT(@last_loaded_date, '23:59:59') GROUP BY 1,2,3;

  20. ETL Pseudo code - Step 3 3) Set denormalized user columns: UPDATE user_event_log_summary_staging log_summary,               user SET log_summary.type = user.type,      log_summary.status = user.status WHERE user.user_id = log_summary.user_id;

  21. ETL Pseudo code - Step 4 3) Insert into Target Table: INSERT INTO user_event_log_summary (...) SELECT ... FROM user_event_log_summary_staging;

  22. Functional Partitioning • Benefits depend on • Partition Execution Times • Data Move Times • Dependencies between functional partitions

  23. Functional Partitioning

  24. Job Management • Run everything single-threaded on a server • Handle dependencies between jobs across servers • Smart re-start key to survival • Implemented 3-level hierarchy of processing • Process (collection of build steps and data moves) • Build Steps (ETL 'units of work') • Data Moves

  25. DW Replication • Similar to other MySQL environments • Commodity hardware  • Master-slave pairs for all databases • Mixed environments can be difficult • Use rsync to create slaves • But not with ssh (on private network) • Monitoring  • Reporting queries need to be monitored • Beware of blocking queries • Only run reporting queries on slave (temp table issues) • Nagios • Ganglia • Custom scripts

  26. Infrastructure Planning • Replication latency • Warehouse slave unable to keep up • Disk utilization > 95% • Required frequent re-sync • Options evaluated •  Higher speed conventional disks •  RAM increase •  Solid-state-disks

  27. Optimization • Check / reset HW RAID settings • Use general query log to track ETL / Queries • Application timing  • isolate poor-performing parts of the build • Optimize data storage - automatic roll-off of older data

  28. Infrastructure Changes • Increased memory 32GB -> 64GB • New servers have 96GB RAM • SSD Solution • 12 & 16 disk configurations • RAID6 vs. RAID10 • 2.0T or 1.6TB formatted capacity • SATA2 HW BBU RAID6 • ~ 8 TB data on SSD

  29. Results • Sometimes it pays to throw hardware at a problem • 15-hour warehouse builds on old system • 6 hours on optimized system • No application changes

  30. Finally...Archive Two-tiered solution • Move data into archive tables in separate DB • Use select to dump data - efficient and fast • Archive server handles migration • Dump data • GPG • Push to S3

  31. Survival Tips • Efforts to scale are non-linear • As you scale, it becomes increasingly difficult to manage • Be prepared to supplement your warehouse strategy • Dedicated appliance • Distributed processing (Hadoop, etc) • You can gain a great deal of headroom by optimizing I/O • Optimize current disk I/O path • Examine SSD / Flash solutions • Be pragmatic about table designs • It's important to stay ahead of the performance curve • Be proactive - monitor growth, scale early • Monitor everything, including your users • Bad queries can bring replication down

More Related