Moving Data in real-time into Amazon Redshift By Matthew Lang, Director of Professional Services May 30, 2019
Welcome! “Continuent, the MySQL Availability Company”
Who are you anyway? Most people call me Matt • Started as a programmer/DBA for the Progress 4GL database • Wrote Pro2MySQL, a real-time replicator between MySQL and the Progress database • Part-time musician • Live in Miami • Just learned how to do animations in PowerPoint
Topics In this session, we will discuss: • What is Amazon Redshift? • Traditional methods of getting data out of MySQL* databases and into data warehouses • Introducing Tungsten Replicator • Extra tricks • Filters * MySQL is understood here in a broad sense, including MySQL, MariaDB, Percona Server, RDS MySQL, RDS Aurora and Google Cloud SQL
Why Amazon Redshift? • High-performance, clustered analytics data warehouse • Easy to start, stop, and expand • PostgreSQL JDBC interface • Easy to access • Column-based storage, parallel query processing • Automated backups • Integrates with S3 • Online query editor
ETL, again What really happens with ETL (Extract) • Need to query tables for changes since the last read • Tables in the query need keys, usually based on a timestamp • Read all necessary columns • On success, record the completion time, or the last primary key position, for each table • If it breaks, find the error, fix it, and rerun the job (a sketch of such a query follows)
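A minimal sketch of such an incremental extract query; the table, columns, and cutoff value (orders, updated_at, the timestamp literal) are all hypothetical, not from the deck:

-- Hypothetical incremental extract: pull rows changed since the last successful run
SELECT id, customer_id, total, updated_at
FROM orders
WHERE updated_at > '2019-05-29 02:00:00'  -- watermark recorded by the previous job
ORDER BY updated_at;

On success, the job would store MAX(updated_at) from this result as the watermark for the next run.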
Your MySQL Database on ETL • The database spends time on large queries • Because of that, you can run the ETL job on a slave • But we need to run the nightly backup on that slave too • Buffer pool pollution • Meaningful OLTP data is evicted from the buffer pool • Replaced with ETL results, reducing cache hits (37%) • Can be controlled with innodb_old_blocks_pct and innodb_old_blocks_time (see the sketch after this list) • Data is stale as soon as the extract process finishes • If it breaks, an admin fixes the error, but it might be too late to rerun a nightly job: too much database load during the day • So you run a "double" job the next night, and today's analytics stay stale • Repeat from step 1
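As a sketch, both InnoDB settings can be changed at runtime; the values below are illustrative, not recommendations:

-- Limit how much of the buffer pool pages brought in by large scans may claim
SET GLOBAL innodb_old_blocks_pct = 20;    -- default is 37; smaller leaves more room for OLTP pages
SET GLOBAL innodb_old_blocks_time = 1000; -- a page must survive 1000 ms in the "old" sublist before promotion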
ETL (Transform) • Second phase of the ETL process • Sometimes combined with the Extract phase, creating yet more load on the database • Filters • Add a column (source DB identifier) • Drop columns, such as blobs or credit card numbers • Combine or denormalize tables • Can be (another) time-consuming task (see the sketch after this list)
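For illustration only, such a transform can often be expressed as a single staging query; every name here (shard_us, orders, customers, cc_number) is hypothetical:

-- Hypothetical transform: tag rows with their source DB, drop sensitive columns, denormalize
SELECT 'shard_us' AS source_db,        -- added column identifying the source database
       o.id, o.customer_id, o.total,   -- blob and cc_number columns deliberately omitted
       c.region                        -- denormalized from the customers table
FROM orders o
JOIN customers c ON c.id = o.customer_id;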
ETL (Load) • Finally! Load into the target (Redshift and/or others) • What about rows that were updated? The target already contains an old version of each newly extracted row • We must remove the old row and insert the new one (sketched below) • Multiple updates to the same row mean multiple deletes and inserts for that key
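Redshift has no native upsert for this pattern, so the usual approach is a delete-then-insert inside one transaction. A minimal sketch, assuming a hypothetical staging table stage_orders that holds one (latest) row per key:

BEGIN;
-- Remove the old versions of any rows present in the new batch
DELETE FROM orders
USING stage_orders
WHERE orders.id = stage_orders.id;
-- Insert the new versions
INSERT INTO orders SELECT * FROM stage_orders;
COMMIT;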
Data Warehouse Integration is Changing • Traditional data warehouse usage was based on dumps from the transactional store loaded into the data warehouse • Analytics were run on the historical data that had been loaded • Data warehouses often use merged data from multiple sources, which is difficult to combine and manage • Data warehouses are now frequently sources as well as targets for data, e.g.: • Export data to the data warehouse • Analyze the data • Feed summary data back to the application to display stats to users
Tungsten Replicator: Data Warehouses Tungsten Replicator is a fast database replication engine • Designed for speed and flexibility • Reads directly from the MySQL binary logs • Replicates to many targets, including Amazon Redshift
MySQL Extraction • Read directly from the MySQL binary logs, or request the binary logs remotely • Write into the THL (Transaction History Log) • Our own format • Contains the transactional data • Contains additional metadata • Extract once [Diagram: MySQL DBMS with binary logging → Master Replicator (Extractor) → THL; THL = Events + Metadata]
Apply into Redshift • Extract data from the THL • Batch incoming rows into CSV files • CSV is loaded into S3 using s3cmd • Redshift imports the CSV files into staging tables • A "materialization" is performed to merge the staged data into the existing production tables (see the COPY sketch below) [Diagram: Slave Replicator (Applier) reads THL → writes CSV → s3cmd copy to S3 → JDBC-driven COPY and merge into Redshift]
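Under the hood this maps onto standard Redshift statements. A hedged sketch of the load step, with a hypothetical bucket, file, and IAM role (the statements Tungsten actually generates may differ):

-- Load a batched CSV file from S3 into the staging table
COPY stage_orders
FROM 's3://my-bucket/tungsten/batch-000123.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load'
CSV;

The materialization step is then the same delete-then-insert merge sketched on the ETL (Load) slide, run from the staging table into the production table.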
Fun stuff available for Redshift with Tungsten Replicator • Long-term CDC collection • Keep the CSV files in S3 even after merging the changes • Provides a complete audit trail of all changes, e.g. tracking every price change for a particular item • CSV file compression • Adds some overhead to the merge • Can save on storage costs • Faster uploads • Fan-in • Merge databases into a single schema • Perfect for sharded datasets • DDL translation • Filters, filters, and more filters!
Replicating into a single Schema [Diagram: Replicators fan in the USA, UK, and Japan databases into one target schema, adding the DB name to each row]
Filters • Over 40 filters included • Add filters in any stage of replication • Want even more? Write your own in JavaScript!
Popular Filters • dbtransform – change the names of extracted schemas or tables • dropcolumn – drop specified columns from the THL • replicate – include or exclude tables to be replicated • rowadddbname – add the database name to each row (just like in the example) • timedelay – delay writing events to the THL; not useful for a data warehouse, but for real-time targets you can specify a delay to purposely keep them behind • ddltranslate – replicate DDL changes
ddlscan
MySQL source table:
CREATE TABLE `mytable` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `stuff` varchar(128) DEFAULT NULL,
  `mytime` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  `units` decimal(8,5) DEFAULT NULL,
  `morestuff` text,
  PRIMARY KEY (`id`)
);
Redshift DDL generated by ddlscan:
CREATE TABLE matt.mytable (
  id BIGINT,
  stuff VARCHAR(512) /* VARCHAR(128) */,
  mytime TIMESTAMP,
  units DECIMAL(8,5),
  morestuff VARCHAR(65535) /* WARN: MySQL TEXT translated to max VARCHAR */,
  PRIMARY KEY (id)
);
Next Steps Sign up for a private demo for your team, set up a POC, email us at sales@continuent.com Learn more at your own pace • Training and webinar library at www.continuent.com/videos/ • White papers at www.continuent.com/white-papers/ • Read the documentation at http://docs.continuent.com/