Oracle In-Database MapReduce: When Hadoop Meets Exadata

Kuassi Mensah, Director, Product Management

The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.

Agenda
• Big Data & In-Database MapReduce
• SQL MapReduce
• In-Database Container for Hadoop
• Oracle’s Big Data Solution

Big Data Concept
[Diagram labels: Any Data; MapReduce Infrastructure; RDBMS; MapReduce (phase I); Data Mining (phase II)]
MapReduce convention: process data locally.

Big Data in Real Life, Today
[Diagram labels: MapReduce Infrastructure; RDBMS; Unstructured Data (HDFS, NoSQL, etc.); MapReduce (phase I); Data Mining (phase II); RDBMS Structured Data]

Problems with Big Data Today
• Shipping data from the RDBMS to the MapReduce infrastructure
• Too big to move
• Operational issues
• Data correctness/loss
• Lack of enterprise-class security on the MapReduce infrastructure
• Breaking the MapReduce convention
• Cost of MapReduce infrastructure or storage
• Lack of MapReduce development skills
• Lack of MapReduce deployment skills

Big Data with In-Database MapReduce
[Diagram labels: Hadoop Cluster; RDBMS; MapReduce; Data Mining; Unstructured Data (HDFS, NoSQL, etc.); In-Database MapReduce; MapReduce; Structured Data (RDBMS); Data Mining]

In-Database MapReduce Trends
• Hybrid platforms: DBMS + MapReduce
• Projects/Products/Initiatives
  • DataStax: Cassandra + Hadoop
  • Hadapt / HadoopDB: PostgreSQL + Hadoop
  • Greenplum HD
  • MongoDB MapReduce: JavaScript
  • Aster Data / Teradata
• Limitations
  • Dependency on a Hadoop infrastructure in addition to the DBMS
  • Source compatibility: Hadoop jobs must be rewritten in a different language

Oracle’s Big Data Strategy
MapReduce APIs across the data infrastructure: Hadoop, R, SQL
[Diagram labels: RDBMS (In-Database MapReduce); Big Data Appliance; Weblogs; Sales Records]

Oracle In-Database MapReduce: Integration with Oracle Big Data Solution

Oracle In-Database MapReduce
Feature of Oracle Database 12c releases
• In-Database Container for Hadoop (currently Beta)
• SQL MapReduce (12.1.0.1)

SQL MapReduce: Declarative MR Analytics
Collection of existing and new features
• SQL analytic functions
• User-defined aggregate functions
• Parallel pipelined table functions
• SQL pattern matching: MATCH_RECOGNIZE (new)

SQL Pattern Matching
Find 10-day periods where a stock price has “double-bottomed”
Find event A (“privilege revoked”) followed by 3 or more occurrences of event B (“attempted login”) within 1 minute
[Chart: stock price by day]
• SQL Pattern Matching provides expressive syntax and fast execution for pattern matching
• New SQL construct: MATCH_RECOGNIZE
• Define patterns using regular expression syntax
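As a rough analogy for defining patterns with regular expression syntax, the sketch below encodes day-to-day price moves as characters and searches them with java.util.regex. It is not how MATCH_RECOGNIZE works internally. The class name DoubleBottomSketch, the W-shape definition (fall, rise, fall, rise), and the rule that a flat day breaks the pattern are assumptions made for illustration; the 10-day window from the slide is not enforced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration: find W-shaped ("double bottom") runs in a price series. */
class DoubleBottomSketch {

    // Assumed definition of a double bottom: one or more falls, rises, falls, rises.
    private static final Pattern W_SHAPE = Pattern.compile("D+U+D+U+");

    /** Returns {firstIndex, lastIndex} (inclusive, into prices) for each non-overlapping W shape. */
    static List<int[]> findWShapes(double[] prices) {
        // Encode each day-to-day move: D = down, U = up, F = flat.
        StringBuilder moves = new StringBuilder();
        for (int i = 1; i < prices.length; i++) {
            moves.append(prices[i] < prices[i - 1] ? 'D'
                       : prices[i] > prices[i - 1] ? 'U' : 'F');
        }

        List<int[]> hits = new ArrayList<>();
        Matcher m = W_SHAPE.matcher(moves);
        while (m.find()) {
            // Move k links prices[k] and prices[k + 1], so a match over
            // moves [start, end) spans prices[start] .. prices[end].
            hits.add(new int[] { m.start(), m.end() });
        }
        return hits;
    }
}
```

In the SQL construct, the same idea is expressed declaratively: the DEFINE clause plays the role of the character encoding and the PATTERN clause plays the role of the regular expression.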

SQL Pattern Matching: Sessionization
    SELECT user_id, session_id, start_time, no_of_events, duration
    FROM Events MATCH_RECOGNIZE (
      PARTITION BY User_ID
      ORDER BY Time_Stamp
      MEASURES match_number() session_id,
               count(*) AS no_of_events,
               first(time_stamp) start_time,
               last(time_stamp) - first(time_stamp) duration
      PATTERN (b s*)
      DEFINE s AS (s.Time_Stamp - prev(Time_Stamp) <= 10)
    )
    ORDER BY user_id, session_id;
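To make the PATTERN (b s*) logic concrete, here is a minimal plain-Java sketch of the same sessionization rule: a row continues the current session when it arrives within the gap of the previous row, otherwise it starts a new one. The class and method names (SessionizeSketch, sessionize) and the use of numeric timestamps are assumptions for illustration; the database evaluates the query quite differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical plain-Java equivalent of the sessionization query. */
class SessionizeSketch {

    /** One output row per session, mirroring the MEASURES clause. */
    static final class Session {
        final String userId;
        final int sessionId;    // match_number(), restarts at 1 for each user
        final long startTime;   // first(time_stamp)
        long lastTime;          // last(time_stamp)
        int noOfEvents;         // count(*)

        Session(String userId, int sessionId, long startTime) {
            this.userId = userId;
            this.sessionId = sessionId;
            this.startTime = startTime;
            this.lastTime = startTime;
            this.noOfEvents = 1;
        }

        long duration() { return lastTime - startTime; }
    }

    static List<Session> sessionize(Map<String, List<Long>> eventsByUser, long maxGap) {
        List<Session> out = new ArrayList<>();
        // TreeMap iteration stands in for PARTITION BY user_id plus the final ORDER BY.
        for (Map.Entry<String, List<Long>> e : new TreeMap<>(eventsByUser).entrySet()) {
            List<Long> times = new ArrayList<>(e.getValue());
            Collections.sort(times);                 // ORDER BY Time_Stamp
            Session current = null;
            int matchNumber = 0;
            for (long t : times) {
                if (current != null && t - current.lastTime <= maxGap) {
                    current.lastTime = t;            // row matches s
                    current.noOfEvents++;
                } else {
                    current = new Session(e.getKey(), ++matchNumber, t);  // row matches b
                    out.add(current);
                }
            }
        }
        return out;
    }
}
```

Calling sessionize(events, 10) corresponds to the threshold of 10 in the DEFINE clause, in whatever time unit the Time_Stamp column uses.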

DEMO

Vanilla Hadoop
[Diagram: Hadoop cluster with physical partitions (DataNodes), mappers, materialization of intermediate data, and reducers]

In-Database Container for Hadoop: Components
• Apache Hadoop
• Task execution: In-Database JVM
• Data partitioning & task scheduling: PQ engine
• Data storage: table, external table, object view
• Data type mapping: TableReader, TableWriter

In-Database Container for Hadoop
[Diagram: RDBMS server with table partitions, mapper processes, pipelining of intermediate data, reducer processes, and parallel DML]

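In-Database Container for Hadoop vs. Vanilla Hadoop
• Platform: RDBMS server vs. Hadoop cluster
• Data partitions: logical (table partitions) vs. physical (DataNodes)
• Intermediate data: pipelining vs. materialization
• Mappers and reducers on both sides; parallel DML on the RDBMS side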
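The difference between materializing and pipelining intermediate data can be illustrated with a small plain-Java analogy. In the pipelined variant the reduce step pulls words directly from a lazy map step; in the materialized variant the full intermediate list is built first. This is only a sketch: the names PipeliningSketch, mapWords, and countWords are hypothetical, and nothing here reflects the container's actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical analogy: pipelined vs. materialized intermediate data. */
class PipeliningSketch {

    /** Map step as a lazy iterator: a line is read only when the consumer needs more words. */
    static Iterator<String> mapWords(final Iterator<String> lines, final AtomicInteger linesRead) {
        return new Iterator<String>() {
            private Iterator<String> words = Collections.emptyIterator();

            @Override public boolean hasNext() {
                while (!words.hasNext() && lines.hasNext()) {
                    String line = lines.next().trim();
                    linesRead.incrementAndGet();
                    words = line.isEmpty()
                            ? Collections.<String>emptyIterator()
                            : Arrays.asList(line.split("\\s+")).iterator();
                }
                return words.hasNext();
            }

            @Override public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                return words.next();
            }
        };
    }

    /** Reduce step: consumes words one at a time and keeps only the running counts. */
    static Map<String, Integer> countWords(Iterator<String> words) {
        Map<String, Integer> counts = new TreeMap<>();
        while (words.hasNext()) counts.merge(words.next(), 1, Integer::sum);
        return counts;
    }

    /** Materializing variant: the whole intermediate list exists before reduce starts. */
    static Map<String, Integer> materialized(List<String> lines) {
        List<String> intermediate = new ArrayList<>();
        Iterator<String> it = mapWords(lines.iterator(), new AtomicInteger());
        while (it.hasNext()) intermediate.add(it.next());
        return countWords(intermediate.iterator());
    }

    /** Pipelined variant: reduce pulls directly from map, with no intermediate list. */
    static Map<String, Integer> pipelined(List<String> lines) {
        return countWords(mapWords(lines.iterator(), new AtomicInteger()));
    }
}
```

Both variants return the same counts; they differ only in whether the intermediate data is held in full between the two steps.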

In-Database Container for Hadoop: Summary
• A “Hadoop container” in the RDBMS engine: no Hadoop cluster required
• Data processing in situ: no need to ship data to a separate infrastructure
• API and source compatibility: accepts Hadoop Mappers and Reducers as-is
• Java interface: invoke Hadoop jobs as in vanilla Hadoop
• SQL interface: Map and Reduce steps in SQL statements

In-Database Container for Hadoop: SQL and Java Interfaces
Java interface:
    public class WordCount {
        public static void main() throws Exception {
            /* Set up the parameters and run the job */
            ……
            job.init();
            job.run();
        }
    }
SQL interface:
    SELECT * FROM TABLE
      (HREDUCE_JP_WORDCOUNT(:ConfKey,
         CURSOR(SELECT * FROM TABLE
           (HMAP_JP_WORDCOUNT(:ConfKey,
              CURSOR(SELECT * FROM InTable))))))

DEMO

Pipelining Hadoop Jobs Through the SQL Interface
Pipelining Hadoop steps without intermediate materialization
    select * from table
      (HREDUCE_JP_JOB2 (:ConfKey2, ....
        (HMAP_JP_JOB2 (:ConfKey2, ....
          (HREDUCE_JP_JOB1 (:ConfKey1, ....
            (HMAP_JP_JOB1 (:ConfKey1, ...), ))));
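The nesting of job steps can be mimicked in plain Java by composing functions over streams, so that each step consumes the previous step's output directly rather than reading it back from a stored intermediate result. The sketch below assumes two hypothetical jobs: job 1 counts words, job 2 counts how many words share each count. All names (ChainedJobsSketch, map1, reduce1, map2, reduce2) are illustrative only. As in any MapReduce, each reduce step still has to see all values for a key before it can emit a total.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Stream;

/** Hypothetical analogy: two chained jobs composed as nested calls. */
class ChainedJobsSketch {

    // Job 1 map: each line -> (word, 1)
    static Stream<Map.Entry<String, Integer>> map1(Stream<String> lines) {
        return lines.flatMap(line -> Arrays.stream(line.trim().split("\\s+")))
                    .filter(word -> !word.isEmpty())
                    .map(ChainedJobsSketch::wordOne);
    }

    // Job 1 reduce: (word, 1)... -> (word, total)
    static Stream<Map.Entry<String, Integer>> reduce1(Stream<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        pairs.forEach(p -> totals.merge(p.getKey(), p.getValue(), Integer::sum));
        return totals.entrySet().stream();
    }

    // Job 2 map: (word, total) -> (total, 1)
    static Stream<Map.Entry<Integer, Integer>> map2(Stream<Map.Entry<String, Integer>> counts) {
        return counts.map(ChainedJobsSketch::totalOne);
    }

    // Job 2 reduce: (total, 1)... -> (total, number of distinct words with that total)
    static Map<Integer, Integer> reduce2(Stream<Map.Entry<Integer, Integer>> pairs) {
        Map<Integer, Integer> histogram = new TreeMap<>();
        pairs.forEach(p -> histogram.merge(p.getKey(), p.getValue(), Integer::sum));
        return histogram;
    }

    private static Map.Entry<String, Integer> wordOne(String word) {
        return new SimpleEntry<>(word, 1);
    }

    private static Map.Entry<Integer, Integer> totalOne(Map.Entry<String, Integer> e) {
        return new SimpleEntry<>(e.getValue(), 1);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a b a", "b c d");
        // Same inside-out shape as the nested SQL: reduce2(map2(reduce1(map1(input)))).
        System.out.println(reduce2(map2(reduce1(map1(lines.stream())))));
    }
}
```

The innermost call corresponds to the innermost table function in the SQL statement, and results flow outward step by step.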

In-Database Container for Hadoop: Projected Features
• Reuse Mappers & Reducers (including R-generated)
• Dynamic data partitioning
• Apache Hadoop API 2.00
• Custom Hadoop Writable types
• Serialized data formats
• InputFormats: HDFS, HBase, others
• Java interface (similar to the vanilla Hadoop driver)
• SQL interface: Hadoop job steps in SQL queries
• Mahout

Develop/Deploy with the In-Database Container for Hadoop
• Reuse existing Mappers & Reducers, or develop Hadoop Mappers & Reducers from scratch
• Create or update the Hadoop job configuration file
• Load all Java code into the RDBMS and create Call Specs
• Invoke the Hadoop job via the Java or SQL interface
• Populate the output table with parallel INSERT

Oracle’s Big Data Solution
[Diagram labels: Oracle Big Data Appliance; Oracle Exadata; Oracle Exalytics; Oracle Endeca Information Discovery; Oracle Real-Time Decisions; InfiniBand interconnects]
Phases: Acquire, Organize, Analyze, Decide

Oracle In-Database MapReduce: Summary
• Declarative analytics (SQL MapReduce)
• Programmatic analytics (complex algorithms, Hadoop)
• MapReduce job steps in SQL queries
• Custom extensions (InputFormats)
• RDBMS QoS (e.g., enterprise-class security)
• Developer- and DBA-friendly
• Seamless integration with Oracle’s Big Data solution
