Learn how to handle fast data processing and storage using Kafka and Kudu, with examples and insights from Cloudera experts Ted Malaska and Jeff Holoman.
Fast Data Made Easy with Kafka and Kudu
Ted Malaska, Cloudera (@tedmalaska) • Jeff Holoman, Cloudera (@jeffholoman)
Bank Ledger
• All txns must be queryable within 5 min
• XML must be parsed and reformatted
• In-process counting
• 100% correct!
Diagram: XML txns (100 insert, 101 insert, 100 update, 100 update, 102 insert) flow from the RDBMS into Hadoop, where they are queried with SQL. Note that txn 100 is inserted and then updated twice.
Distributed Systems
• Things fail
• Systems are designed to tolerate failure
• We must expect failures, and design our code and configure our systems to handle them
Options: Option 1
• All txns must be queryable within 5 min
• XML must be parsed and reformatted
• In-process counting
• 100% correct
Diagram: RDBMS → Sqoop → Hadoop, queried with SQL.
Bank Ledger: Option 2
• All txns must be queryable within 5 min
• XML must be parsed and reformatted
• In-process counting
• 100% correct
Diagram: txns stream one at a time from the RDBMS into Hadoop, queried with SQL.
Drawbacks: compaction, de-duplication, and in-process counting are hard.
“There are only two hard problems in distributed systems:
2. Exactly-once delivery
1. Guaranteed order of messages
2. Exactly-once delivery”
-- Mathias Verraes (@mathiasverraes)
Bank Ledger: Option 3a
• All txns must be queryable within 5 min
• XML must be parsed and reformatted
• In-process counting
• Correct!
Diagram: txns stream one at a time from the RDBMS into HBase, then from HBase into Hadoop/HDFS, queried with SQL plus an app.
Drawbacks: compaction, the HBase-to-HDFS copy step, complexity, slow HBase scans, joins.
Bank Ledger: Option 3b
• All txns must be queryable within 5 min
• XML must be parsed and reformatted
• In-process counting
• Correct!
Diagram: txns stream one at a time from the RDBMS, with an HBase check, into Hadoop, queried with SQL plus an app.
Drawbacks: compaction, complexity.
Bank Ledger: The New Option
• All txns must be queryable within seconds (was: 5 min)
• XML must be parsed and reformatted
• In-process counting
• Correct!
Diagram: txns stream one at a time from the RDBMS into the new datastore (Kudu, introduced on the next slide), queried with SQL or an app.
Benefits: free exactly-once, immediately available, guaranteed ordering, updates.
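Where the "free exactly-once" comes from: Kudu writes are keyed, so a replayed message rewrites the same row instead of appending a duplicate. Below is a minimal sketch of that idea, assuming the Apache Kudu Java client (org.apache.kudu.client); the master address, the txns table, and its columns are hypothetical placeholders, not the demo schema.

import org.apache.kudu.client.KuduClient

object ReplaySafeLedgerWrite {
  def main(args: Array[String]): Unit = {
    // Hypothetical master address; substitute your cluster's Kudu master.
    val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
    try {
      val table = client.openTable("txns")   // hypothetical table
      val session = client.newSession()

      // An upsert is idempotent on the primary key: applying the same
      // operation again after a Kafka replay leaves the row unchanged,
      // so at-least-once delivery yields exactly-once results.
      val upsert = table.newUpsert()
      val row = upsert.getRow
      row.addString("txn_id", "100")         // hypothetical key column
      row.addString("db_action", "insert")   // hypothetical value columns
      row.addInt("amount_dollars", 42)
      session.apply(upsert)
      session.close()
    } finally {
      client.shutdown()
    }
  }
}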
Apache Kudu (Incubating)
• Columnar datastore
• Fast inserts/updates
• Efficient scans
• Complements HDFS and HBase
Diagram: for a table with columns A, B, C, real-time row-based storage lays rows out contiguously (A1 B1 C1, A2 B2 C2, A3 B3 C3), while columnar storage groups values by column (A1 A2 A3, B1 B2 B3, C1 C2 C3), so a scan can read only the columns it needs.
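To make "efficient scans" concrete: with the columnar layout, a scan that projects two columns only reads those two columns off disk. Here is a hedged sketch using the Kudu Java client's scanner; the column names are borrowed from the ledger table on the next slide, and the builder methods are from the current client API, which may differ in older incubating releases.

import org.apache.kudu.client.KuduClient
import scala.collection.JavaConverters._

object ProjectedScan {
  def main(args: Array[String]): Unit = {
    val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
    try {
      val table = client.openTable("ledger")
      // Project two columns; the columnar layout means Kudu reads only
      // these columns' data, not whole rows.
      val scanner = client.newScannerBuilder(table)
        .setProjectedColumnNames(List("transaction_id", "amount_dollars").asJava)
        .build()
      while (scanner.hasMoreRows) {
        val rows = scanner.nextRows()
        while (rows.hasNext) {
          val row = rows.next()
          println(row.getString("transaction_id") + " " + row.getInt("amount_dollars"))
        }
      }
    } finally {
      client.shutdown()
    }
  }
}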
Kudu Ledger Table

create table `ledger` (
  uuid STRING,
  transaction_id STRING,
  customer_id INT,
  source STRING,
  db_action STRING,
  time_utc STRING,
  `date` STRING,
  amount_dollars INT,
  amount_cents INT,
  local_timestamp BIGINT
)
DISTRIBUTE BY HASH(transaction_id) INTO 20 BUCKETS
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'ledger',
  'kudu.master_addresses' = 'jhol-1.vpc.cloudera.com:7051',
  'kudu.key_columns' = 'transaction_id,uuid',
  'kudu.num_tablet_replicas' = '3');
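The same table is also reachable from Spark without going through Impala. A sketch assuming the kudu-spark DataFrame integration; the format string and option keys below are from recent kudu-spark releases, not necessarily the incubating-era package.

import org.apache.spark.sql.SQLContext

// sqlContext is the same SQLContext used in the MLlib example later on.
def ledgerDf(sqlContext: SQLContext) =
  sqlContext.read
    .format("org.apache.kudu.spark.kudu")
    .options(Map(
      "kudu.master" -> "jhol-1.vpc.cloudera.com:7051",  // master from the DDL above
      "kudu.table" -> "ledger"))
    .load()

// Rows upserted by the ingest job are visible to the next query, which is
// what makes the "queryable within seconds" requirement workable:
// ledgerDf(sqlContext).registerTempTable("ledger")
// sqlContext.sql("SELECT transaction_id, amount_dollars FROM ledger").show()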
Kudu Aggregation Demo

CREATE EXTERNAL TABLE `gamer` (
  `gamer_id` STRING,
  `last_time_played` BIGINT,
  `games_played` INT,
  `games_won` INT,
  `oks` INT,
  `deaths` INT,
  `damage_given` INT,
  `damage_taken` INT,
  `max_oks_in_one_game` INT,
  `max_deaths_in_one_game` INT
)
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'gamer',
  'kudu.master_addresses' = 'ip-172-31-43-177.us-west-2.compute.internal:7051',
  'kudu.key_columns' = 'gamer_id'
);
Kudu Aggregation Architecture
Diagram: a Generator writes events to Kafka; Spark Streaming consumes from Kafka and writes to Kudu; Impala, SparkSQL, and Spark MLlib all read from Kudu. A sketch of the streaming leg follows below.
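A rough sketch of the Kafka-to-Kudu leg of that diagram, assuming the Spark 1.x direct Kafka stream and the kudu-spark KuduContext; the topic, broker and master addresses, the CSV payload format, and the trimmed GamerEvent schema are all placeholders.

import kafka.serializer.StringDecoder
import org.apache.kudu.spark.kudu.KuduContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

object GamerStreamingJob {
  // Trimmed placeholder schema; the real gamer table has more columns.
  case class GamerEvent(gamer_id: String, oks: Int, deaths: Int)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("GamerStreamingJob"))
    val ssc = new StreamingContext(sc, Seconds(5))
    val kuduContext = new KuduContext("kudu-master:7051", sc)

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc,
      Map("metadata.broker.list" -> "kafka-broker:9092"),
      Set("gamer-events"))

    stream.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      // Placeholder parse: payload assumed to be "gamer_id,oks,deaths".
      val df = rdd.map { case (_, value) =>
        val p = value.split(",")
        GamerEvent(p(0), p(1).toInt, p(2).toInt)
      }.toDF()
      // Keyed upserts: re-running a failed batch rewrites the same rows.
      kuduContext.upsertRows(df, "gamer")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}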
Kudu Aggregation MLlib

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Read the per-gamer stats straight out of the Kudu-backed table.
val resultDf = sqlContext.sql("SELECT gamer_id, oks, games_won, games_played FROM gamer")

// Turn the three numeric columns into a dense feature vector per gamer.
val parsedData = resultDf.map(r => {
  val array = Array(r.getInt(1).toDouble, r.getInt(2).toDouble, r.getInt(3).toDouble)
  Vectors.dense(array)
})

val dataCount = parsedData.count()
if (dataCount > 0) {
  // k-means with 3 clusters and at most 5 iterations.
  val clusters = KMeans.train(parsedData, 3, 5)
  clusters.clusterCenters.foreach(v => println(" Vector Center: " + v))
}
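Each printed center is one Vector of (oks, games_won, games_played) values, so the three lines of output describe the centroids of the three player clusters the model found.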
Kudu CDC Demo

CREATE EXTERNAL TABLE `gamer_cdc` (
  `gamer_id` STRING,
  `eff_to` STRING,
  `eff_from` STRING,
  `last_time_played` BIGINT,
  `games_played` INT,
  `games_won` INT,
  `oks` INT,
  `deaths` INT,
  `damage_given` INT,
  `damage_taken` INT,
  `max_oks_in_one_game` INT,
  `max_deaths_in_one_game` INT
)
TBLPROPERTIES(
  'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
  'kudu.table_name' = 'gamer_cdc',
  'kudu.master_addresses' = 'ip-172-31-43-177.us-west-2.compute.internal:7051',
  'kudu.key_columns' = 'gamer_id, eff_to'
);
Kudu CDC Architecture
Flow: get the row for the gamer_id with an empty eff_to, then check whether a record was found.
• Yes: put the old record under the gamer_id with the new eff_to, then update the record with the empty eff_to to the new values.
• No: put a new record for the gamer_id with an empty eff_to.
A code sketch of this flow follows below.
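A hedged sketch of that decision flow with the Kudu Java client; the predicate builder is from the current client API, error handling and most columns (including eff_from) are omitted for brevity, and an empty string stands in for the "empty" eff_to.

import org.apache.kudu.client.{KuduClient, KuduPredicate}
import org.apache.kudu.client.KuduPredicate.ComparisonOp
import scala.collection.JavaConverters._

object GamerCdc {
  def applyChange(client: KuduClient, gamerId: String, oks: Int, nowUtc: String): Unit = {
    val table = client.openTable("gamer_cdc")
    val session = client.newSession()
    val schema = table.getSchema

    // 1. Get the current version: key (gamer_id, eff_to = "").
    val scanner = client.newScannerBuilder(table)
      .addPredicate(KuduPredicate.newComparisonPredicate(
        schema.getColumn("gamer_id"), ComparisonOp.EQUAL, gamerId))
      .addPredicate(KuduPredicate.newComparisonPredicate(
        schema.getColumn("eff_to"), ComparisonOp.EQUAL, ""))
      .build()
    val current = if (scanner.hasMoreRows) scanner.nextRows().asScala.toList else Nil

    current.headOption match {
      case Some(old) =>
        // 2a. Record found: put the old values under the new eff_to ...
        val hist = table.newInsert()
        hist.getRow.addString("gamer_id", gamerId)
        hist.getRow.addString("eff_to", nowUtc)   // closes out the old version
        hist.getRow.addInt("oks", old.getInt("oks"))
        session.apply(hist)
        // ... then update the current version (eff_to stays "") in place.
        val upd = table.newUpdate()
        upd.getRow.addString("gamer_id", gamerId)
        upd.getRow.addString("eff_to", "")
        upd.getRow.addInt("oks", oks)
        session.apply(upd)
      case None =>
        // 2b. No record: put a brand-new current version.
        val ins = table.newInsert()
        ins.getRow.addString("gamer_id", gamerId)
        ins.getRow.addString("eff_to", "")
        ins.getRow.addInt("oks", oks)
        session.apply(ins)
    }
    session.close()
  }
}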
Kudu Bitemporality
Diagram, three steps: the starting point (the current row, with an empty eff_to); insert a copy of the row under the new eff_to; update the old record in place to the new values.
Kudu CDC Demo