
CS525: Big Data Analytics

CS525: Big Data Analytics. MapReduce Languages. Fall 2013. Elke A. Rundensteiner. Languages for Hadoop. Java: Hadoop’s native language. Pig (Yahoo): query and workflow language for unstructured data. Hive (Facebook): SQL-based language for structured data / data warehousing.


Presentation Transcript


  1. CS525: Big Data Analytics • MapReduce Languages • Fall 2013 • Elke A. Rundensteiner

  2. Languages for Hadoop • Java: Hadoop’s native language • Pig (Yahoo): query and workflow language, for unstructured data • Hive (Facebook): SQL-based language, for structured data / data warehousing

  3. Java is Hadoop’s Native Language • Hadoop itself is written in Java • Provides Java APIs for mappers, reducers, combiners, and partitioners, and for input and output formats • Other languages, e.g., Pig or Hive, compile their queries into Java MapReduce jobs
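
The four roles named on this slide can be sketched in a few lines of plain Python. This is only an in-memory illustration of how a mapper, combiner, partitioner, and reducer fit together on a word count; it does not use Hadoop's Java API, and the function names, the `run_job` driver, and the byte-sum partitioning rule are all assumptions made for the example.

```python
from collections import defaultdict


def mapper(line):
    """Emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def combiner(key, values):
    """Pre-aggregate one map task's output locally to reduce shuffle traffic."""
    yield key, sum(values)


def partitioner(key, num_reducers):
    """Choose the reduce task for a key (a simple stand-in, not Hadoop's hash)."""
    return sum(key.encode("utf-8")) % num_reducers


def reducer(key, values):
    """Sum the partial counts for one key."""
    yield key, sum(values)


def group_by_key(pairs):
    """Collect all values that share a key, as the shuffle phase would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def run_job(splits, num_reducers=2):
    """Hypothetical driver: run map + combine per split, then reduce per bucket."""
    buckets = [[] for _ in range(num_reducers)]
    for split in splits:  # each split is handled by one map task
        mapped = [pair for line in split for pair in mapper(line)]
        for key, values in group_by_key(mapped).items():
            for k, v in combiner(key, values):
                buckets[partitioner(k, num_reducers)].append((k, v))
    result = {}
    for bucket in buckets:  # each bucket is handled by one reduce task
        for key, values in sorted(group_by_key(bucket).items()):
            for k, v in reducer(key, values):
                result[k] = v
    return result


counts = run_job([["pig hive pig"], ["hive java", "pig"]])
```

Because the partitioner sends every occurrence of a key to the same bucket, each reducer sees all partial counts for its keys, which is what makes the per-split combiner safe for a sum.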

  4. Levels of Abstraction (from the more DB-like view at the top to the more map-reduce-like view at the bottom) • HBase: queries against tables • Hive: SQL-like language • Pig: query and workflow language • Java: write map-reduce functions

  5. Apache Pig

  6. What is Pig? • High-level language and associated platform for expressing data analysis programs • Compiles down to MapReduce jobs • Developed by Yahoo, but open source

  7. Pig Components • Two main components: • High-level language (Pig Latin): a set of commands • Two execution modes: • Local: reads/writes to the local file system • Mapreduce: connects to a Hadoop cluster and reads/writes to HDFS • Two modes of use: • Interactive mode: console • Batch mode: submit a script

  8. Why Pig? • Common design patterns as keywords (joins, distinct, counts) • Data flow analysis (one script can map to multiple map-reduce jobs) • Avoid Java-level errors (for non-Java experts) • Interactive mode (issue commands and get results)

  9. Example • Read the file from HDFS (input format: tab-delimited text) and define the run-time schema: raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, id, time, query); • Filter the rows on predicates: clean1 = FILTER raw BY id > 20 AND id < 100; • For each row, do some transformation: clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.sanitze(query) AS query; • Group the records: user_groups = GROUP clean2 BY (user, query); • Compute aggregations for each group: user_query_counts = FOREACH user_groups GENERATE group, COUNT(clean2), MIN(clean2.time), MAX(clean2.time); • Store the output in a file (text, comma-delimited): STORE user_query_counts INTO 'uq_counts.csv' USING PigStorage(',');
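
To make the dataflow concrete, here is a rough plain-Python emulation of the same steps on a tiny in-memory log. It is a sketch, not Pig: the sample rows are invented for illustration, `sanitize` is a hypothetical stand-in for the UDF called in the script (here assumed to trim and lowercase the query), and the final STORE step is left out.

```python
from collections import defaultdict

# Invented sample data standing in for 'excite.log' (tab-delimited).
LOG = (
    "alice\t25\t100\tPig Latin\n"
    "alice\t30\t140\tpig latin \n"
    "bob\t10\t90\thive\n"
    "bob\t50\t120\tHive\n"
)


def sanitize(query):
    """Hypothetical stand-in for the script's UDF: trim and lowercase."""
    return query.strip().lower()


# LOAD ... AS (user, id, time, query): the schema is applied while reading.
fields = ("user", "id", "time", "query")
raw = [dict(zip(fields, line.split("\t"))) for line in LOG.splitlines()]

# FILTER raw BY id > 20 AND id < 100
clean1 = [r for r in raw if 20 < int(r["id"]) < 100]

# FOREACH clean1 GENERATE user, time, sanitize(query) AS query
clean2 = [(r["user"], int(r["time"]), sanitize(r["query"])) for r in clean1]

# GROUP clean2 BY (user, query)
user_groups = defaultdict(list)
for user, time, query in clean2:
    user_groups[(user, query)].append(time)

# FOREACH user_groups GENERATE group, COUNT, MIN(time), MAX(time)
user_query_counts = sorted(
    (group, len(times), min(times), max(times))
    for group, times in user_groups.items()
)
```

Each variable mirrors one alias in the script, which is the point of the dataflow style: every intermediate relation has a name and can be inspected on its own.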

  10. Pig Language • Keywords: Load, Filter, Foreach Generate, Group By, Store, Join, Distinct, Order By, … • Aggregations: Count, Avg, Sum, Max, Min • Schema: defined at query time, not when files are loaded • Extension of logic: UDFs • Data: packages for common input/output formats

  11. A Parameterized Template • The script can take arguments ('$widerow', '$out') • Data are “ctrl-A” delimited; define the types of the columns: A = load '$widerow' using PigStorage('\u0001') as (name: chararray, c0: int, c1: int, c2: int); • Specify the need for 10 parallel tasks: B = group A by name parallel 10; • C = foreach B generate group, SUM(A.c0) as c0, SUM(A.c1) as c1, AVG(A.c2) as c2; • D = filter C by c0 > 100 and c1 > 100 and c2 > 100; • store D into '$out';

  12. Run independent jobs in parallel • D1 = load 'data1' … • D2 = load 'data2' … • D3 = load 'data3' … • C1 = join D1 by a, D2 by b • C2 = join D1 by c, D3 by d • C1 and C2 are two independent jobs that can run in parallel
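
The idea that C1 and C2 do not depend on each other can be illustrated with two joins submitted to a thread pool. This is a plain-Python sketch under stated assumptions: the `join` helper, the sample relations, and the use of `ThreadPoolExecutor` are all illustrative and say nothing about how Pig actually schedules its jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def join(left, left_key, right, right_key):
    """Inner equi-join of two lists of dicts, using a hash index on the right."""
    index = {}
    for row in right:
        index.setdefault(row[right_key], []).append(row)
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in index.get(l_row[left_key], [])
    ]


# Invented sample relations standing in for 'data1', 'data2', 'data3'.
D1 = [{"a": 1, "c": "x"}, {"a": 2, "c": "y"}]
D2 = [{"b": 1, "v2": "p"}, {"b": 3, "v2": "q"}]
D3 = [{"d": "y", "v3": "r"}]

# Neither join reads the other's output, so both can be submitted at once.
with ThreadPoolExecutor(max_workers=2) as pool:
    future_c1 = pool.submit(join, D1, "a", D2, "b")
    future_c2 = pool.submit(join, D1, "c", D3, "d")
    C1, C2 = future_c1.result(), future_c2.result()
```

Both joins only read D1, so sharing it between the two tasks is safe; a job that consumed C1 or C2 would have to wait for the corresponding result.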

  13. Pig Latin vs. SQL • Pig Latin is a dataflow programming model (step-by-step) • SQL is declarative (set-based approach) • [Figure: side-by-side comparison, SQL vs. Pig Latin]

  14. Pig Latin vs. SQL • In Pig Latin: • An execution plan can be explicitly defined (the user gives hints, but there is no clever optimization) • Data can be stored at any point during the pipeline • Schema and data types are lazily defined at run-time • Lazy evaluation (data are not processed prior to the STORE command) • In SQL: • Query plans are decided by the system (powerful optimization) • Data are not stored in the middle (or, at least, not user-accessible) • Schema and data types are defined at creation time
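
The lazy-evaluation point can be mimicked with Python generators: building the pipeline reads nothing, and rows are only pulled through when a final "store" step consumes them. This is an analogy, not Pig's implementation; the function names and the `processed` log are assumptions introduced for the sketch.

```python
processed = []  # records which input rows were actually read


def load(rows):
    """Generator: no row is read until something downstream asks for it."""
    for row in rows:
        processed.append(row)
        yield row


def filter_rows(rows, predicate):
    """Lazily keep only the rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))


def foreach(rows, transform):
    """Lazily apply a transformation to every row."""
    return (transform(row) for row in rows)


def store(rows):
    """Consuming the pipeline is what finally triggers the work."""
    return list(rows)


# Defining the pipeline does not process any data yet.
pipeline = foreach(
    filter_rows(load([1, 5, 12, 40]), lambda x: x > 4),
    lambda x: x * 10,
)
touched_before_store = len(processed)

# Only now are rows pulled through load -> filter -> foreach.
output = store(pipeline)
```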

  15. Logical Plan • A = LOAD 'file1' AS (x, y, z); • B = LOAD 'file2' AS (t, u, v); • C = FILTER A BY y > 0; • D = JOIN C BY x, B BY u; • E = GROUP D BY z; • F = FOREACH E GENERATE group, COUNT(D); • STORE F INTO 'output'; • [Figure: plan graph, LOAD (A) → FILTER → JOIN ← LOAD (B), then JOIN → GROUP → FOREACH → STORE]
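
One way to picture the plan for this script is as a small graph in which each alias lists the aliases it reads from. The sketch below encodes that graph by hand and derives an order in which every operator runs after its inputs. The dictionary layout, the `out` alias for the STORE step, and the `execution_order` helper are assumptions for illustration; Pig's internal plan representation is not shown in the slides.

```python
# alias -> (operator, list of input aliases), transcribed from the script.
plan = {
    "A": ("LOAD", []),
    "B": ("LOAD", []),
    "C": ("FILTER", ["A"]),
    "D": ("JOIN", ["C", "B"]),
    "E": ("GROUP", ["D"]),
    "F": ("FOREACH", ["E"]),
    "out": ("STORE", ["F"]),  # hypothetical alias for the STORE step
}


def execution_order(plan):
    """Return the aliases so that every operator comes after its inputs."""
    order, done = [], set()

    def visit(alias):
        if alias in done:
            return
        for parent in plan[alias][1]:
            visit(parent)  # make sure all inputs are scheduled first
        done.add(alias)
        order.append(alias)

    for alias in plan:
        visit(alias)
    return order


order = execution_order(plan)
operators = [plan[alias][0] for alias in order]
```

The JOIN node is the only one with two inputs, which is why the two LOAD branches merge there before the plan becomes a straight line to STORE.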

  16. Physical Plan • Mostly 1:1 correspondence with the logical plan • Some optimizations available

  17. Hive

  18. Apache Hive (Facebook) • A data warehouse infrastructure built on top of Hadoop for providing data summarization, retrieval, and analysis • Hive provides: • Structure • ETL • Access to different storage (HDFS or HBase) • Query execution via MapReduce • Key principles: • SQL is a familiar language • Extensibility: types, functions, formats, scripts • Performance

  19. Hive Data Model: Structured • 3 levels: Tables → Partitions → Buckets • Table: maps to an HDFS directory (example: table R holds users all over the world) • Partition: maps to sub-directories under the table (example: partition R by country name) • Bucket: maps to files under each partition (a partition is divided into buckets based on a hash function)
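
The three levels can be read as a rule for where a row ends up: table directory, then partition sub-directory, then bucket file. The sketch below computes such a location for the slide's "users by country" example. The `/warehouse` root, the `bucket_NNNNN` file naming, and the use of CRC32 as the hash are assumptions chosen for a deterministic illustration; they are not Hive's actual file names or hash function.

```python
import zlib


def bucket_path(table, partition_col, partition_val, bucket_key, num_buckets):
    """Return an illustrative storage path for one row.

    table         -> directory
    partition     -> sub-directory named <column>=<value>
    bucket        -> file chosen by hashing the bucketing key
    """
    # CRC32 is a stand-in hash so that the result is stable across runs.
    bucket = zlib.crc32(str(bucket_key).encode("utf-8")) % num_buckets
    return f"/warehouse/{table}/{partition_col}={partition_val}/bucket_{bucket:05d}"


# Hypothetical table 'users', partitioned by country, 4 buckets on user id.
path_a = bucket_path("users", "country", "US", 1001, 4)
path_b = bucket_path("users", "country", "US", 1001, 4)
path_c = bucket_path("users", "country", "DE", 1001, 4)
```

Rows with the same bucketing key always land in the same bucket number, while the partition value alone decides the sub-directory.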

  20. Hive DDL Commands • CREATE TABLE sample (foo INT, bar STRING) PARTITIONED BY (ds STRING); • SHOW TABLES '.*s'; • DESCRIBE sample; • ALTER TABLE sample ADD COLUMNS (new_col INT); • DROP TABLE sample; • Schema is known at creation time (like a DB schema) • Partitioned tables have “sub-directories”, one for each partition • Each table in Hive is an HDFS directory in Hadoop

  21. HiveQL: Hive DML Commands • Load data from the local file system (LOCAL) and delete the previous data from that table (OVERWRITE): LOAD DATA LOCAL INPATH './sample.txt' OVERWRITE INTO TABLE sample; • Load data from HDFS and augment the existing data (no OVERWRITE); a specific partition must be defined for partitioned tables: LOAD DATA INPATH '/user/falvariz/hive/sample.txt' INTO TABLE partitioned_sample PARTITION (ds='2012-02-24');

  22. Query Examples • SELECT MAX(foo) FROM sample; • SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds; • FROM sample s INSERT OVERWRITE TABLE bar SELECT s.bar, count(*) WHERE s.foo > 0 GROUP BY s.bar; • SELECT * FROM customer c JOIN order_cust o ON (c.id = o.cus_id);
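
Since HiveQL is SQL-based, the first two queries and the SELECT part of the third can be tried on a toy table with Python's built-in `sqlite3` module. This is only an approximation: SQLite is not Hive, `ds` is an ordinary column here rather than a partition column, the sample rows are invented, the `ORDER BY` clauses are added to make the output order deterministic, and Hive's `FROM ... INSERT OVERWRITE TABLE` form is replaced by a plain SELECT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sample (foo INTEGER, bar TEXT, ds TEXT)")
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?)",
    [
        (1, "a", "2012-02-24"),
        (5, "b", "2012-02-24"),
        (3, "a", "2012-02-25"),
        (0, "b", "2012-02-25"),
    ],
)

# SELECT MAX(foo) FROM sample;
max_foo = conn.execute("SELECT MAX(foo) FROM sample").fetchone()[0]

# SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds;
per_day = conn.execute(
    "SELECT ds, COUNT(*), SUM(foo) FROM sample GROUP BY ds ORDER BY ds"
).fetchall()

# SELECT part of the third query: rows with foo > 0, counted per bar.
per_bar = conn.execute(
    "SELECT bar, COUNT(*) FROM sample WHERE foo > 0 GROUP BY bar ORDER BY bar"
).fetchall()

conn.close()
```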

  23. User-Defined Functions

  24. Hadoop Streaming Utility • Hadoop streaming is a utility to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer • Supported languages include C, Python, Java, Ruby, C#, Perl, and shell commands • The map and reduce programs can be written in different languages
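
A streaming mapper and reducer are ordinary programs that read lines from standard input and write tab-separated key/value lines to standard output, with the reducer receiving its input sorted by key. Below is a minimal word-count sketch in Python; the function names and the single-file layout with a `map`/`reduce` command-line argument are assumptions for illustration, and the local `sorted` call only imitates the shuffle that Hadoop performs between the two phases.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every word of every input line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Reducer: input is sorted by key, so equal words are adjacent."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(value) for _, value in group)
        yield f"{word}\t{total}"


# Local simulation of map -> sort (shuffle) -> reduce on invented input.
mapped = list(map_lines(["pig hive", "pig"]))
reduced = list(reduce_lines(sorted(mapped)))

# When run as a script with an argument, act as the mapper or the reducer.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out_line in step(sys.stdin):
        print(out_line)
```

In a real job the same script would be passed to the streaming utility through its `-mapper` and `-reducer` options; the exact command line depends on the installation and is not shown in the slides.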

  25. Summary: Languages • Java: Hadoop’s native language • Pig (Yahoo): query/workflow language, for unstructured data • Hive (Facebook): SQL-based language, for structured data
