The Elephant in the Room A DBA’s Guide to Hadoop & Big Data
Purpose Rosetta Stone presentation High level overview of Hadoop & Big Data NOT a deep dive NOT a demo session Mostly theory & vocabulary Where to learn more
Caveats Focus is vendor-specific • Hortonworks Hadoop • Microsoft SQL Server Don’t consider myself a Hadoop expert (yet)
About Me Manage DBAs for a financial services company Former Data Architect, DBA, developer Linchpin People TeamMate AtlantaMDF Chapter Leader Infrequent blogger: http://codegumbo.com
About You Assume that • SQL experience • exposure to database admin & architecture • little to no experience with Big Data
Challenges... ...for the SQL Server DBA
Rapid Evolution
SQL Server: new version every 2-4 years
• New functionality; deprecations
Hadoop: “official” release every 6 months
• New functionality; deprecations
• Different components on separate release cycles
DEVELOPERS DBAS
An ecosystem, not a product Open-source; vendors add enhancements Official Hadoop is only four modules: • HDFS • Hadoop MapReduce • Hadoop YARN • Hadoop Common
Hadoop Ecosystem (Hortonworks) Hortonworks
Big Data is like teenage sex... Everyone talks about it, Nobody really knows how to do it, Everyone thinks everyone else is doing it, So everyone claims they are doing it… -Dan Ariely
The Four V’s of Big Data Volume - data is too big to scale up Velocity - decision window is small Variety - multiple formats challenge integration Variability - same data, different interpretations http://goo.gl/6icouZ
RDBMS versus Big Data
RDBMS: Primarily Scale-Up • Strong Typing • Normalization • Default Mutable • Mature
Big Data: Primarily Scale-Out • Schemaless • Default Immutable • Evolving
Foundations “Gentlemen, this is a football…” - Vince Lombardi
Hadoop Scalable, distributed processing framework Official Hadoop is only four modules: • HDFS • Hadoop MapReduce • Hadoop YARN • Hadoop Common
HDFS Hadoop Distributed File System Inspired by the Google File System (2002-2003) Cluster storage of large files across servers Yahoo - 10,000-core Hadoop cluster(s) Facebook - 100 PB+ (June, 2012) http://goo.gl/SpSN
HDFS File permissions and authentication. Rack aware fsck: find missing files or blocks. Scheduled Rebalancing Redundancy & Replication Built around MapReduce
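To make "Redundancy & Replication" concrete, here is a minimal Python sketch of the storage arithmetic. It assumes a 128 MB block size and a replication factor of 3; both are common settings but configurable per cluster, so treat them as assumptions. The function name `hdfs_footprint` is hypothetical and not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128  # assumed block size; configurable per cluster
REPLICATION = 3      # assumed replication factor; configurable per cluster


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (blocks, block replicas, raw MB stored) for one file."""
    # A file is split into fixed-size blocks; the last one may be partial.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is copied to `replication` different nodes.
    replicas = blocks * replication
    # A partial block only uses the space its data needs, so raw usage
    # is simply the file size times the replication factor.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


if __name__ == "__main__":
    print(hdfs_footprint(1000))
```

For a DBA, the practical takeaway is that usable capacity under these assumptions is roughly one third of raw disk, before any compression.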
Hadoop MapReduce “Developed” by Google; paper published in 2004 (patent granted in 2010) Map - filtering and sorting Reduce - summarization Inherently distributed
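The map and reduce roles are easier to see in a toy example. The sketch below is plain single-process Python, not Hadoop code: it imitates the map, shuffle, and reduce steps for a word count. All function names are hypothetical, and a real cluster would run the map and reduce calls in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: summarize all values collected for one key."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


if __name__ == "__main__":
    print(word_count(["big data", "Big elephant"]))
```

In SQL terms, this is roughly a `GROUP BY word` with `COUNT(*)`: the map step produces the rows, the shuffle does the grouping, and the reduce step is the aggregate.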
Hadoop YARN Yet Another Resource Negotiator Splits resource management out of MapReduce Allows for the use of other processing types (e.g., graph, stream).
Hadoop Common Shared libraries for Hadoop components (and vendor enhancements). Security objects are best example • Superusers • Service Level Authorization • HTTP Authentication
But Wait… There’s More! Hortonworks
Sqoop Data connector between RDBMS and HDFS Command line interface JDBC driver; BCP-like syntax Tutorial
Hive HiveQL - SQL-like syntax DDL scripts define tables Queries are transformed into MapReduce jobs Performance improves as the cluster scales out Stinger initiative - Microsoft/Hortonworks
Hive
create external table price_data (
  stock_exchange string,
  symbol string,
  trade_date string,
  open float,
  high float,
  low float,
  close float,
  volume int,
  adj_close float
)
row format delimited fields terminated by ','
stored as textfile
location '/user/hue/nyse/nyse_prices';

select * from price_data where symbol = 'IBM';
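An external table like price_data only describes files that already sit in HDFS; the column definitions are applied when the data is read. The Python sketch below imitates that idea for the query above. It is an illustration, not how Hive is implemented: the sample rows are made up, and `read_table` and `select_where_symbol` are hypothetical names.

```python
# Column names and types copied from the price_data DDL.
SCHEMA = [
    ("stock_exchange", str), ("symbol", str), ("trade_date", str),
    ("open", float), ("high", float), ("low", float), ("close", float),
    ("volume", int), ("adj_close", float),
]

# Made-up sample rows standing in for the files under the table location.
SAMPLE = (
    "NYSE,IBM,2010-02-08,123.15,123.22,121.74,121.88,5718500,121.88\n"
    "NYSE,GE,2010-02-08,15.95,16.02,15.72,15.74,68524800,15.74\n"
)


def read_table(text):
    """Apply the schema at read time: split on ',' and cast each field."""
    for line in text.splitlines():
        fields = line.split(",")
        yield {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}


def select_where_symbol(text, symbol):
    """Rough equivalent of: select * from price_data where symbol = '...'"""
    return [row for row in read_table(text) if row["symbol"] == symbol]


if __name__ == "__main__":
    for row in select_where_symbol(SAMPLE, "IBM"):
        print(row)
```

Because the filter looks at one row at a time, a query of this shape needs no grouping step; each node can scan its own blocks independently.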
HCatalog Tight integration with Hive, but supports all Hadoop data access protocols Define relational view into data (DDL) “Tables” can be reused by Hive, Pig, Storm... Tutorial
Pig Data abstraction language; Yahoo (2006) Based on Java; supports Python & Ruby Procedural (SQL is declarative) Allows for ETL Lazy evaluation
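Lazy evaluation means that defining the steps of a script does no work; processing starts only when output is requested. The sketch below shows the same behaviour with Python generators. It is an analogy rather than Pig code, and the names `load`, `scrub`, `build_pipeline`, and `log` are hypothetical.

```python
log = []  # records when each step actually runs


def load(rows):
    """Read records one at a time, noting each read in the log."""
    for row in rows:
        log.append(f"load {row!r}")
        yield row


def scrub(rows):
    """Trim whitespace and drop blank records."""
    for row in rows:
        cleaned = row.strip()
        if cleaned:
            log.append(f"scrub {cleaned!r}")
            yield cleaned


def build_pipeline(rows):
    # Chaining the steps only describes the work; nothing has run yet.
    return scrub(load(rows))


if __name__ == "__main__":
    pipeline = build_pipeline([" a ", "", "b"])
    print(log)             # still empty: no step has executed
    print(list(pipeline))  # asking for output triggers the whole chain
    print(log)
```

The procedural style is visible here too: each step is named and ordered by the author, whereas a SQL statement leaves that ordering to the optimizer.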
Pig ETL service; useful as “duct tape” Typical scenario: Load data into HDFS Use Pig to scrub data, and Pump to another “db” (e.g., MongoDB) Web service reads from destination
But Wait… There’s Too Much! Hortonworks
Big Data Administration The possession of facts is knowledge, the use of them is wisdom. – Thomas Jefferson
Big Data Use Cases
Massive Size: PB of info • Data Warehouse • Large clusters • Increased Cost
Complex Analytics: Schemaless • Investigational • Single-node • Low Cost
[Chart: RDBMS, performance vs. application growth]
[Chart: Big Data, performance vs. application growth]
[Chart: performance vs. application growth]
Scale-Up Costs (SQL Server)
Single Server: Maximum RAM • SAN
Licenses: Windows • SQL Server • Microsoft Support
Personnel: Developers • DBA • SAN Admin • Network Admin
Facilities: Minimum Footprint
Scale-Out Costs (Hortonworks HDP)
Multiple Servers: Commodity
Licenses: Windows ($$$) • Linux ($0 / support $) • HDP Support
Personnel: Developer • HDP Admin • Network Admin
Facilities: Power • Space • Air
Performance Tuning [Diagram: system and code tuning, RDBMS vs. Hadoop] Performance Tuning Tips
Hadoop Ecosystem (Hortonworks) Hortonworks
Performance Architecture Nathan Marz - Twitter, Storm Lambda Architecture
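The slide only names the Lambda Architecture, so the following is a rough Python sketch of its general idea rather than anything from the talk: a batch layer recomputes views from the full immutable dataset, a speed layer covers events the last batch run has not seen, and queries merge the two. The page-view events and all function names are hypothetical.

```python
from collections import Counter


def batch_view(master_dataset):
    """Batch layer: recompute counts from the full immutable dataset."""
    return Counter(event["page"] for event in master_dataset)


def speed_view(recent_events):
    """Speed layer: counts for events the last batch run has not seen."""
    return Counter(event["page"] for event in recent_events)


def query(page, batch, speed):
    """Serving layer: merge both views at query time."""
    return batch[page] + speed[page]


if __name__ == "__main__":
    history = [{"page": "/home"}, {"page": "/home"}, {"page": "/about"}]
    recent = [{"page": "/home"}]
    print(query("/home", batch_view(history), speed_view(recent)))
```

Once the next batch run absorbs the recent events, the speed view for that period can be discarded, which keeps the fast path small.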