170 likes | 213 Views
Join us for a deep dive into Hive - a data warehouse solution built over a Map-Reduce framework, enhancing querying and managing large datasets through SQL abstraction. Explore Hive's various engines, features, and optimization capabilities, and understand how it integrates with Hadoop ecosystem for scalable data processing. Discover its advanced SQL functionalities, such as windowing functions, JOIN optimizations, and more. Learn how organizations leverage Hive for ETL processes, BI analytics, and ad-hoc querying, and explore its pros and cons in handling petabyte-scale data. Whether you're a developer, data analyst, or tech enthusiast, this meetup is your gateway to mastering the power of Hive in the world of Big Data.
E N D
HIVE Bucharest Big Data Meetup March 10, 2015
whoami • Developer with SQL Server team since 2001 • Apache contributor • Hive • Hadoop core (security) • stackoverflow user 105929s • @rusanu
What is Hive • Datawarehouse for querying and managing large datasets • A query engine that use Hadoop MapReduce for execution • A SQL abstraction for creating MapReduce algorithms • SQL interface to HDFS data • Developed at FacebookVLDB 2009: Hive - A Warehousing Solution Over a Map-Reduce Framework • ASF top project since September 2010
How does Hive work Beeswax JDBC Shell ODBC • SQL submitted via CLI or Hiveserver(2) • Metadata describing tables stored in RDBMS • Driver compiles/optimizes execution plan • Plan and execution engine submitted to Hadoop as job • MR invokes Hive execution engine which executes plan HCatalog Hive CLI Hiveserver2 Metastore RDBMS Driver Compiles, Optimizes Hadoop MapReduce HDFS Job Tracker Task Split Task Split
Hive Query execution • Compilation/Optimization results in an AST containing operators eg: • FetchOperator: scans source data (the input split) • SelectOperator: projects column values, computes • GroupByOperator: aggregate functions (SUM, COUNT etc) • JoinOperator:joins • The plan forms a DAG of MR jobs • The plan tree is serialized (Kryo) • Hive Driver dispatches jobs • Multiple stages can result in multiple jobs • Task execution picks up the plan and start iterating the plan • MR emits values (rows) into the topmost operator (Fetch) • Rows propagate down the tree • ReduceSinkOperator emits map output for shuffle • Each operator implements both a map side and a reduce side algorithm • Executes the one appropriate for the current task • MR does the shuffle, many operators rely on it as part of their algorithm • Eg. SortOperator, GroupByOperator • Multi-stage queries create intermediate output and the driver submits new job to continue next stage • TEZ execution: map-reduce-reduce, usually eliminates multiple stages (more later) • Vectorized execution mode emits batches of rows (1024 rows)
Hive features • Data types: • Numeric: tinyint, smallint, int, bigint, float, double, decimal(precision, scale) • Date/Time: timestamp, date • Character types: string, char(size), varchar(size) • Misc. types: Boolean, binary • Complex types: ARRAY<type>, MAP<type, type>, STRUCT<name:type, name:type>, UNIONTYPE<type, type, type> • Storage formats: text, sequencefile, ORC, Parquet, RC, arbitrary SerDe • Data Load: INSERT, LOAD, external tables, dynamic partitioning • Bucketized tables • JOIN optimizations: MapJoin, SMB join • ACID DML (INSERT/UPDATE/DELETE) • Only supported for ORC storage for now • Columnar storage, vectorized execution • Cost based optimizer (new) • HiveQL: SQL dialect, drives toward ANSI-92 compliance • Subqueries, joins, common table expressions • Lateral views (CROSS APPLY) • SELECT … FROM table LATERAL VIEW explode(column) • Windowing and analytical functions • LEAD, LAG, FIRST_VALUE, LAST_VALUE • RANK, ROW_NUMBER, DENSE_RANK, PERCENT_RANK, NTILE • OVER clause for aggregates • PARTITION BY, ORDER BY • WINDOW specification • SELECT SUM(a) OVER (PARTITION BY b ORDER BY c ROWS 3 PRECEDING AND 3 FOLLOWING) • WINDOW clause • SELECT SUM(b) OVER w FROM t WINDOW w AS (PARTITION BY b ORDER BY c ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) • GROUPING SETs, CUBE, ROLLUP • XPath, custom UDF • TRANSFORM: arbitrary map_script, reduce_script
Hive engines • MapReduce • Default, widely available • Complex queries require stages -> stop-and-go • Always on disk shuffle • TEZ • Generalized MRR (DAG) • Pipelining • Memory shuffle • JOINs (bipartite) • Custom sort • HIVE can optimize plans for TEZ • Recommended engine • SPARK • HIVE-7292 • In development • Not to be confused with Shark or Spark-SQL
Hive pros and cons • Capable of handling PB scale • Decent performance • Fairly advanced SQL features • Integrates with Hadoop ecosystem • Share the data (HDFS) • Leverage existing clusters • High Availability • Disaster Recoverability • Partitioning, Clustering, Bucketing • Positive momentum, active development • No licensing costs • Not good for ad-hoc due to high latency (job submit time) • Topic is actively pursued by Hive/Tez/Yarn • ANSI-SQL gaps • Poor toolset (hivecli, ODBC drivers) • In-house RDBMS operating expertize does not translate to Hive • Outperformed by (costly) high-end proprietary solutions
How to use Hive • As a part of the ETL in transforming large data ingress (click-stream, mobile uploads, access log etc) into query able form • Alternative to PIG for those that favor SQL • Run one-off queries to analyze large unstructured data sets • Power of SQL to get insight into ‘collect everything’ • DW/BI • Can also be part of the ETL that loads the DW • Deploy on TEZ not on M/R • Use ORC or as storage Parquet format • Use recent releases (Hive 0.14 or later)
When to avoid Hive • Replace RDBMS/OLTP • Ad-hoc BI (latency still too high, will improve soon) • When the dataset is small • 512 GB RAM is cheap • If it fits in memory, is not Big data • When data changes frequently • If you have an infinite budget
Some alternatives to Hive • Columnar storage in a traditional RDBMS • MySQL: ICE, InfiniDB • PostgreSQL: cstore_fwd (Citus) • SQL Server columnstore • Amazon Red-shift • Azure SQL Database v12 • Impala • Presto • Spark SQL • Note that Impala and Spark SQL can share Hive’s metastore
Links • Hive Language Manual: https://cwiki.apache.org/confluence/display/Hive/LanguageManual • Join strategies in Hive: https://cwiki.apache.org/confluence/download/attachments/27362054/Hive+Summit+2011-join.pdf • Hive on Tez: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Tez • Hive on Spark: https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark