Learn about efficiently managing a MySQL data warehouse, from server setup to ETL design. Find valuable insights, lessons learned, and tips for survival.
MySQL Data Warehousing Survival Guide Marius Moscovici (marius@metricinsights.com) Steffan Mejia (steffan@lindenlab.com)
Topics • The size of the beast • Evolution of a Warehouse • Lessons Learned • Survival Tips • Q&A
Size of the beast • 43 Servers • 36 active • 7 standby spares • 16 TB of data in MySQL • 12 TB archived (pre S3 staging) • 4 TB archived (S3) • 3.5B rows in main warehouse • Largest table ~ 500M rows (MySQL)
Warehouse Evolution - First came slaving • Problems: • Reporting slaves easily fall behind • Reporting limited to one-pass SQL
Warehouse Evolution - Then came temp tables • Problems: • Easy to lock replication with temp table creation • Slaving becomes fragile
Warehouse Evolution - A Warehouse is Born • Problems: • Warehouse workload limited by what can be performed by a single server
Warehouse Evolution - Workload Distributed • Problems: • No Real-Time Application integration support
Lessons Learned - Warehouse Design • Workload exceeds available memory
Lessons Learned - Warehouse Design • Keep join working sets smaller than available memory • Heavily denormalize data for effective reporting • Minimize joins between large tables • Aggressively archive historical data
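As a concrete illustration of the denormalization point, here is a sketch of the daily summary table used in the ETL example later in the deck; the column list is an assumption inferred from that example, since the deck does not show the actual DDL.

-- Sketch of a denormalized daily summary table (columns inferred from the
-- ETL example later in the deck). The user's type and status are copied in
-- at load time so reports can filter on them without joining back to the
-- large user table.
CREATE TABLE user_event_log_summary (
    calendar_date DATE         NOT NULL,
    user_id       INT UNSIGNED NOT NULL,
    type          VARCHAR(32)  NOT NULL,   -- denormalized from user
    status        VARCHAR(32)  NOT NULL,   -- denormalized from user
    event_type    VARCHAR(64)  NOT NULL,
    event_count   INT UNSIGNED NOT NULL,
    PRIMARY KEY (calendar_date, user_id, event_type)
) ENGINE = InnoDB;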
Lessons Learned - Data Movement • mysqldump is your friend • Sequence parent/child data loads based on what the ETL can tolerate • Orders without order lines • Order lines without orders • Data movement use cases • Full • Incremental • Upsert (INSERT ... ON DUPLICATE KEY UPDATE)
Full Table Loads • Good for small tables • Works for tables with no primary key • Data is fully replaced on each load
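The deck does not spell out the replacement mechanism; one common way to fully replace a small table without readers ever seeing a half-loaded copy is to rebuild a side table and swap it in atomically (table names here are illustrative):

-- Hypothetical full-table load: rebuild a side copy, then swap it in.
CREATE TABLE currency_rate_new LIKE currency_rate;
-- ... reload currency_rate_new from the mysqldump of the source table ...
RENAME TABLE currency_rate     TO currency_rate_old,
             currency_rate_new TO currency_rate;   -- atomic swap
DROP TABLE currency_rate_old;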
Incremental Loads • Table contains new rows but no updates • Good for insert-only tables • High-water mark included in the mysqldump WHERE clause
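A minimal sketch of the high-water-mark idea, assuming an insert-only event_log table copied into a warehouse schema (names are illustrative):

-- Find the newest row already loaded into the warehouse copy...
SELECT IFNULL(MAX(event_time), '2000-01-01') INTO @high_water_mark
  FROM warehouse.event_log;
-- ...and extract only newer rows from the source. The same predicate can be
-- supplied to mysqldump via its --where option, per the slide above, e.g.
--   mysqldump source_db event_log --where="event_time > '<high water mark>'"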
Upsert Loads • Table contains new and updated rows • Table must have a primary key • Can be used to update only a subset of columns
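A sketch of an upsert load using MySQL's INSERT ... ON DUPLICATE KEY UPDATE, as named on the Data Movement slide (table and column names are illustrative):

-- New rows are inserted; rows whose primary key already exists have only the
-- listed columns refreshed, which is how a subset of columns can be updated.
INSERT INTO user (user_id, type, status, last_login)
SELECT user_id, type, status, last_login
  FROM user_load_staging
    ON DUPLICATE KEY UPDATE
       type       = VALUES(type),
       status     = VALUES(status),
       last_login = VALUES(last_login);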
Lessons Learned - ETL Design • Avoid large joins like the plague • Break ETL jobs into bite-sized pieces • Ensure target data integrity on ETL failure • Use memory staging tables to boost performance
ETL Design - Sample Problem: build a daily summary of customer event log activity
ETL Pseudo code - Step 1
1) Create staging table & find high-water mark:

SELECT IFNULL(MAX(calendar_date), '2000-01-01')
  INTO @last_loaded_date
  FROM user_event_log_summary;

SET max_heap_table_size = <big enough number to hold several days' data>;

CREATE TEMPORARY TABLE user_event_log_summary_staging (.....) ENGINE = MEMORY;

CREATE INDEX user_idx USING HASH
    ON user_event_log_summary_staging (user_id);
ETL Pseudo code - Step 2
2) Summarize events:

INSERT INTO user_event_log_summary_staging
    (calendar_date, user_id, event_type, event_count)
SELECT DATE(event_time), user_id, event_type, COUNT(*)
  FROM event_log
 WHERE event_time > CONCAT(@last_loaded_date, ' 23:59:59')
 GROUP BY 1, 2, 3;
ETL Pseudo code - Step 3
3) Set denormalized user columns:

UPDATE user_event_log_summary_staging log_summary, user
   SET log_summary.type   = user.type,
       log_summary.status = user.status
 WHERE user.user_id = log_summary.user_id;
ETL Pseudo code - Step 4
4) Insert into target table:

INSERT INTO user_event_log_summary (...)
SELECT ...
  FROM user_event_log_summary_staging;
Functional Partitioning • Benefits depend on • Partition Execution Times • Data Move Times • Dependencies between functional partitions
Job Management • Run everything single-threaded on each server • Handle dependencies between jobs across servers • Smart restart is key to survival • Implemented a 3-level hierarchy of processing (sketched below) • Process (collection of build steps and data moves) • Build Steps (ETL 'units of work') • Data Moves
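The deck does not show the job-metadata schema behind this hierarchy; a minimal sketch of what it implies might look like the following (all table and column names are hypothetical):

-- Hypothetical job-tracking tables. Recording per-step status is what makes
-- smart restart possible: a failed process is rerun and skips every step
-- already marked 'complete'.
CREATE TABLE dw_process (
    process_id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    process_name VARCHAR(64)  NOT NULL,
    status       ENUM('pending','running','complete','failed') NOT NULL,
    started_at   DATETIME,
    finished_at  DATETIME
) ENGINE = InnoDB;

CREATE TABLE dw_step (
    step_id    INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    process_id INT UNSIGNED NOT NULL,                   -- parent process
    step_type  ENUM('build_step','data_move') NOT NULL,
    run_order  INT NOT NULL,                            -- single-threaded per server
    status     ENUM('pending','running','complete','failed') NOT NULL
) ENGINE = InnoDB;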
DW Replication • Similar to other MySQL environments • Commodity hardware • Master-slave pairs for all databases • Mixed environments can be difficult • Use rsync to create slaves • But not over ssh (the warehouse runs on a private network) • Monitoring • Reporting queries need to be monitored • Beware of blocking queries • Only run reporting queries on the slave (temp table issues) • Nagios • Ganglia • Custom scripts
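Two quick checks implied by the monitoring bullets (the actual Nagios, Ganglia, and custom scripts are not shown in the deck):

-- Replication lag on the reporting slave: watch Seconds_Behind_Master.
SHOW SLAVE STATUS\G
-- Long-running or blocking reporting queries on the slave.
SHOW FULL PROCESSLIST;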
Infrastructure Planning • Replication latency • Warehouse slave unable to keep up • Disk utilization > 95% • Required frequent re-sync • Options evaluated • Higher speed conventional disks • RAM increase • Solid-state-disks
Optimization • Check / reset HW RAID settings • Use general query log to track ETL / Queries • Application timing • isolate poor-performing parts of the build • Optimize data storage - automatic roll-off of older data
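Enabling the general query log to trace what the ETL is actually running is a standard MySQL setting (the log file path here is an assumption); it records every statement, so it is best switched on only while investigating:

SET GLOBAL general_log_file = '/var/log/mysql/etl-trace.log';  -- illustrative path
SET GLOBAL general_log      = 'ON';
-- ... run the build step being timed / investigated ...
SET GLOBAL general_log      = 'OFF';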
Infrastructure Changes • Increased memory 32GB -> 64GB • New servers have 96GB RAM • SSD solution • 12 & 16 disk configurations • RAID6 vs. RAID10 • 2.0 TB or 1.6 TB formatted capacity • SATA2 HW BBU RAID6 • ~ 8 TB data on SSD
Results • Sometimes it pays to throw hardware at a problem • 15-hour warehouse builds on old system • 6 hours on optimized system • No application changes
Finally... Archive • Two-tiered solution • Move data into archive tables in a separate DB • Use SELECT to dump data - efficient and fast • Archive server handles migration • Dump data • GPG-encrypt • Push to S3
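A sketch of the two-tiered archive flow described above (database, table, cutoff date, and file path are all illustrative; the GPG and S3 steps happen outside MySQL on the archive server):

-- 1) Roll old rows into an archive table in a separate database.
INSERT INTO archive_db.event_log_2010
SELECT * FROM warehouse.event_log
 WHERE event_time < '2011-01-01';

-- 2) Dump the archived rows to a flat file with a plain SELECT.
SELECT * INTO OUTFILE '/archive/event_log_2010.tsv'
  FROM archive_db.event_log_2010;

-- 3) Remove the rolled-off rows from the active warehouse table.
DELETE FROM warehouse.event_log
 WHERE event_time < '2011-01-01';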
Survival Tips • Efforts to scale are non-linear • As you scale, it becomes increasingly difficult to manage • Be prepared to supplement your warehouse strategy • Dedicated appliance • Distributed processing (Hadoop, etc) • You can gain a great deal of headroom by optimizing I/O • Optimize current disk I/O path • Examine SSD / Flash solutions • Be pragmatic about table designs • It's important to stay ahead of the performance curve • Be proactive - monitor growth, scale early • Monitor everything, including your users • Bad queries can bring replication down