170 likes | 213 Views
HIVE. Bucharest Big Data Meetup March 10, 2015. w hoami. Developer with SQL Server team since 2001 Apache contributor Hive Hadoop core (security) s tackoverflow user 105929s @ rusanu. What is Hive. Datawarehouse for querying and managing large datasets
E N D
HIVE Bucharest Big Data Meetup March 10, 2015
whoami • Developer with SQL Server team since 2001 • Apache contributor • Hive • Hadoop core (security) • stackoverflow user 105929s • @rusanu
What is Hive • Datawarehouse for querying and managing large datasets • A query engine that use Hadoop MapReduce for execution • A SQL abstraction for creating MapReduce algorithms • SQL interface to HDFS data • Developed at FacebookVLDB 2009: Hive - A Warehousing Solution Over a Map-Reduce Framework • ASF top project since September 2010
How does Hive work Beeswax JDBC Shell ODBC • SQL submitted via CLI or Hiveserver(2) • Metadata describing tables stored in RDBMS • Driver compiles/optimizes execution plan • Plan and execution engine submitted to Hadoop as job • MR invokes Hive execution engine which executes plan HCatalog Hive CLI Hiveserver2 Metastore RDBMS Driver Compiles, Optimizes Hadoop MapReduce HDFS Job Tracker Task Split Task Split
Hive Query execution • Compilation/Optimization results in an AST containing operators eg: • FetchOperator: scans source data (the input split) • SelectOperator: projects column values, computes • GroupByOperator: aggregate functions (SUM, COUNT etc) • JoinOperator:joins • The plan forms a DAG of MR jobs • The plan tree is serialized (Kryo) • Hive Driver dispatches jobs • Multiple stages can result in multiple jobs • Task execution picks up the plan and start iterating the plan • MR emits values (rows) into the topmost operator (Fetch) • Rows propagate down the tree • ReduceSinkOperator emits map output for shuffle • Each operator implements both a map side and a reduce side algorithm • Executes the one appropriate for the current task • MR does the shuffle, many operators rely on it as part of their algorithm • Eg. SortOperator, GroupByOperator • Multi-stage queries create intermediate output and the driver submits new job to continue next stage • TEZ execution: map-reduce-reduce, usually eliminates multiple stages (more later) • Vectorized execution mode emits batches of rows (1024 rows)
Hive features • Data types: • Numeric: tinyint, smallint, int, bigint, float, double, decimal(precision, scale) • Date/Time: timestamp, date • Character types: string, char(size), varchar(size) • Misc. types: Boolean, binary • Complex types: ARRAY<type>, MAP<type, type>, STRUCT<name:type, name:type>, UNIONTYPE<type, type, type> • Storage formats: text, sequencefile, ORC, Parquet, RC, arbitrary SerDe • Data Load: INSERT, LOAD, external tables, dynamic partitioning • Bucketized tables • JOIN optimizations: MapJoin, SMB join • ACID DML (INSERT/UPDATE/DELETE) • Only supported for ORC storage for now • Columnar storage, vectorized execution • Cost based optimizer (new) • HiveQL: SQL dialect, drives toward ANSI-92 compliance • Subqueries, joins, common table expressions • Lateral views (CROSS APPLY) • SELECT … FROM table LATERAL VIEW explode(column) • Windowing and analytical functions • LEAD, LAG, FIRST_VALUE, LAST_VALUE • RANK, ROW_NUMBER, DENSE_RANK, PERCENT_RANK, NTILE • OVER clause for aggregates • PARTITION BY, ORDER BY • WINDOW specification • SELECT SUM(a) OVER (PARTITION BY b ORDER BY c ROWS 3 PRECEDING AND 3 FOLLOWING) • WINDOW clause • SELECT SUM(b) OVER w FROM t WINDOW w AS (PARTITION BY b ORDER BY c ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) • GROUPING SETs, CUBE, ROLLUP • XPath, custom UDF • TRANSFORM: arbitrary map_script, reduce_script
Hive engines • MapReduce • Default, widely available • Complex queries require stages -> stop-and-go • Always on disk shuffle • TEZ • Generalized MRR (DAG) • Pipelining • Memory shuffle • JOINs (bipartite) • Custom sort • HIVE can optimize plans for TEZ • Recommended engine • SPARK • HIVE-7292 • In development • Not to be confused with Shark or Spark-SQL
Hive pros and cons • Capable of handling PB scale • Decent performance • Fairly advanced SQL features • Integrates with Hadoop ecosystem • Share the data (HDFS) • Leverage existing clusters • High Availability • Disaster Recoverability • Partitioning, Clustering, Bucketing • Positive momentum, active development • No licensing costs • Not good for ad-hoc due to high latency (job submit time) • Topic is actively pursued by Hive/Tez/Yarn • ANSI-SQL gaps • Poor toolset (hivecli, ODBC drivers) • In-house RDBMS operating expertize does not translate to Hive • Outperformed by (costly) high-end proprietary solutions
How to use Hive • As a part of the ETL in transforming large data ingress (click-stream, mobile uploads, access log etc) into query able form • Alternative to PIG for those that favor SQL • Run one-off queries to analyze large unstructured data sets • Power of SQL to get insight into ‘collect everything’ • DW/BI • Can also be part of the ETL that loads the DW • Deploy on TEZ not on M/R • Use ORC or as storage Parquet format • Use recent releases (Hive 0.14 or later)
When to avoid Hive • Replace RDBMS/OLTP • Ad-hoc BI (latency still too high, will improve soon) • When the dataset is small • 512 GB RAM is cheap • If it fits in memory, is not Big data • When data changes frequently • If you have an infinite budget
Some alternatives to Hive • Columnar storage in a traditional RDBMS • MySQL: ICE, InfiniDB • PostgreSQL: cstore_fwd (Citus) • SQL Server columnstore • Amazon Red-shift • Azure SQL Database v12 • Impala • Presto • Spark SQL • Note that Impala and Spark SQL can share Hive’s metastore
Links • Hive Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual • Join strategies in Hive: https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf • Hive on Tez: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez • Hive on Spark: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark