Cloud Computing: Other High-Level Parallel Processing Languages
Keke Chen
Outline • Sawzall • Dryad and DryadLINQ (MS, abandoned) • Hive
Sawzall • Simplifies MapReduce programming • Filters + aggregators: the filter plays the mapper role and the aggregators play the reducer role
Example (figure: the mappers convert the input record to float and pass the results to the reducers)
Input
• A Sawzall program works on a single record at a time
• It acts as a filter over the data stream
• Input can be parsed into
  • Values, e.g., float
  • Data structures
• Example: x: float = input; (general form: variable: type = input;)
Aggregators
• Definition
  • table agg_name of data_type/variable
• Examples
  • c: table collection of string;
  • S: table sample(100) of string;
  • S: table sum of {count: int, revenue: float};
• More aggregators
  • maximum, quantile, top, unique
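To make the filter-plus-aggregator split concrete, here is a minimal Python sketch. It is not Sawzall and does not reflect how the real runtime is implemented. The names `SumTable`, `CollectionTable`, `MaxTable`, and `process_record`, as well as the input records, are hypothetical and exist only for illustration.

```python
class SumTable:
    """Mimics `table sum of float`: keeps a running total of emitted values."""
    def __init__(self):
        self.value = 0.0

    def emit(self, x):
        self.value += x


class CollectionTable:
    """Mimics `table collection of string`: keeps every emitted value."""
    def __init__(self):
        self.values = []

    def emit(self, x):
        self.values.append(x)


class MaxTable:
    """Mimics a maximum aggregator: keeps the largest emitted value."""
    def __init__(self):
        self.value = None

    def emit(self, x):
        if self.value is None or x > self.value:
            self.value = x


def process_record(raw, total, seen, largest):
    """Per-record filter: it sees exactly one input record at a time."""
    x = float(raw)      # corresponds to `x: float = input;`
    total.emit(x)
    seen.emit(raw)
    largest.emit(x)


total, seen, largest = SumTable(), CollectionTable(), MaxTable()
for record in ["1.5", "4.0", "2.5"]:    # made-up input records
    process_record(record, total, seen, largest)

print(total.value, largest.value, seen.values)
```

The per-record function never looks at other records; all cross-record state lives in the aggregator objects, which is the property that lets the filter phase run in parallel.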
Indexed aggregators
• Similar to “group by”; the index is the group id
• Example
  t1: table sum[country: string] of int;
  country: string = input;
  emit t1[country] <- 1;
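The "group by" behaviour of an indexed aggregator can be imitated in a few lines of Python. This is only a sketch of the semantics. `IndexedSumTable` is a hypothetical name and the country codes are made-up input.

```python
from collections import defaultdict


class IndexedSumTable:
    """Mimics `table sum[key: string] of int`: one running sum per index value."""
    def __init__(self):
        self.sums = defaultdict(int)

    def emit(self, key, value):
        self.sums[key] += value


t1 = IndexedSumTable()
for country in ["US", "CN", "US", "DE", "US"]:   # one record per loop iteration
    t1.emit(country, 1)                          # plays the role of the emit statement

result = dict(t1.sums)
print(result)
```

Each distinct index value gets its own independent sum, so the result is a count of records per country.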
More examples
  proto "querylog.proto"
  queries_per_degree: table sum[lat: int][lon: int] of int;
  log_record: QueryLogProto = input;
  loc: Location = locationinfo(log_record.ip);
  emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1;
Performance • Single-CPU speed: 51 times slower than compiled C++
Dryad and DryadLINQ
• Dryad provides a low-level interface for parallel data flow processing
  • Acyclic data flow graphs
  • Data communication methods include pipes, files, messages, and shared memory
• DryadLINQ
  • A high-level language for application developers
  • It hides the data flow details
Job = Directed Acyclic Graph (figure: inputs feed processing vertices, which are connected by channels (file, pipe, shared memory) and produce the outputs)
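The idea of a job as a DAG of processing vertices can be sketched in Python. This is not Dryad's API. `run_job`, the vertex names, and the sample inputs are hypothetical, and the "channels" here are ordinary in-memory Python values rather than files, pipes, or shared memory.

```python
from collections import deque


def run_job(vertices, edges, inputs):
    """Run each vertex once all of its upstream vertices have finished.

    vertices: name -> function taking a list of upstream results
    edges:    (src, dst) pairs; each pair stands for one channel
    inputs:   name -> input data for vertices with no upstream vertex
    """
    preds = {v: [] for v in vertices}
    succs = {v: [] for v in vertices}
    for src, dst in edges:
        preds[dst].append(src)
        succs[src].append(dst)

    pending = {v: len(preds[v]) for v in vertices}
    ready = deque(v for v in vertices if pending[v] == 0)
    results = {}
    while ready:
        v = ready.popleft()
        upstream = [results[p] for p in preds[v]] if preds[v] else [inputs[v]]
        results[v] = vertices[v](upstream)
        for s in succs[v]:
            pending[s] -= 1
            if pending[s] == 0:
                ready.append(s)

    if len(results) != len(vertices):
        raise ValueError("graph contains a cycle, so it is not a valid job")
    return results


vertices = {
    "read_a": lambda ins: ins[0],                  # pass the input through
    "read_b": lambda ins: ins[0],
    "merge": lambda ins: sorted(ins[0] + ins[1]),  # combine both channels
    "count": lambda ins: len(ins[0]),
}
edges = [("read_a", "merge"), ("read_b", "merge"), ("merge", "count")]
results = run_job(vertices, edges, {"read_a": [3, 1], "read_b": [2]})
print(results["merge"], results["count"])
```

Because the graph is acyclic, a vertex can be scheduled as soon as all of its inputs are available, and vertices with no dependency on each other (here `read_a` and `read_b`) could run in parallel.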
Runtime
• Services
  • Name server
  • Daemon
• Job Manager (JM)
  • Centralized coordinating process
  • User application to construct graph
  • Linked with Dryad libraries for scheduling vertices
• Vertex executable
  • Dryad libraries to communicate with JM
  • User application sees channels in/out
  • Arbitrary application code, can use local FS
Hive • Developed by Facebook (open source) • Mimics the SQL language • Built on Hadoop/MapReduce
Hive data model: tables, partitions, buckets
• Table
  • Similar to a DB table
  • Stored in Hadoop directories
  • Built-in compression, serialization/deserialization
• Partitions
  • Groups in the table
  • Each is a subdirectory in the table directory
• Buckets
  • Files in the partition directory
  • Key (column) based partition
• Path layout: /table/partition/bucket1
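The table/partition/bucket layout can be illustrated with a small Python sketch that maps a row key to a storage path. The function `bucket_path`, the bucket count, and the partition name are hypothetical, and `zlib.crc32` is only a stand-in hash chosen because it is deterministic; it is not the hash function Hive uses.

```python
import zlib


def bucket_path(table, partition, key, num_buckets):
    """Return the /table/partition/bucketN path for a row with the given key."""
    # crc32 is a stand-in for illustration, not Hive's actual hash function.
    bucket = zlib.crc32(str(key).encode("utf-8")) % num_buckets
    return f"/{table}/{partition}/bucket{bucket}"


path_a = bucket_path("status_updates", "ds=2009-03-20", 42, 4)
path_b = bucket_path("status_updates", "ds=2009-03-20", 42, 4)
print(path_a)
```

The point of the sketch is that the partition is chosen by a column value (one directory per value), while the bucket is chosen by hashing a key column, so rows with the same key always land in the same file.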
Hive data model: Column type • integers, floating point numbers, generic strings, dates and booleans • nestable collection types: array and map.
Architecture • The metastore stores the schema of databases • It uses a non-HDFS data store
Query processing
• Steps (similar to DBMS)
  • Parse
  • Semantic analyzer
  • Logical plan generator (algebra tree)
  • Optimizer
  • Physical plan generator (to MapReduce jobs)
Operations: DDL and DML
• HiveQL: SQL-like, with slightly different syntax
• User-defined filtering and aggregation functions
  • Java only
• Map/reduce plugin for stream processing
  • Can be implemented in any language
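A streaming map/reduce plugin is typically an external script that reads rows as tab-separated text and writes tab-separated rows back. The Python sketch below shows the shape of such a map step. The function name `transform`, the column layout (userid, status), and the sample rows are assumptions made for illustration; a real script would pass `sys.stdin` to the function and print each result.

```python
def transform(lines):
    """Map step: read tab-separated (userid, status) rows, emit (userid, word_count)."""
    for line in lines:
        userid, status = line.rstrip("\n").split("\t", 1)
        yield f"{userid}\t{len(status.split())}"


# Made-up rows standing in for the stream the script would receive.
sample = ["1\thello world\n", "2\tat the gym\n"]
output = list(transform(sample))
print("\n".join(output))
```

Because the contract is just lines in and lines out, the same step could be written in any language that can read standard input.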
Example
• Facebook status updates
• Tables
  • status_updates(userid int, status string, ds string)
  • profiles(userid int, school string, gender int)
• Operations
  • Load data
    LOAD DATA LOCAL INPATH '/logs/status_updates' INTO TABLE status_updates PARTITION (ds='2009-03-20')
  • Count status updates by school and by gender
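The counting operation named in the last bullet amounts to a join of the two tables on userid followed by two group-by counts. The Python sketch below shows that logic on in-memory data. It is not HiveQL and not the query from the original example; all rows and the function name `count_updates` are made up.

```python
from collections import Counter

# (userid, status, ds): hypothetical rows of status_updates
status_updates = [
    (1, "hello", "2009-03-20"),
    (2, "at the gym", "2009-03-20"),
    (1, "lunch", "2009-03-20"),
    (3, "old post", "2009-03-19"),
]

# userid -> (school, gender): hypothetical rows of profiles
profiles = {
    1: ("Stanford", 0),
    2: ("MIT", 1),
    3: ("MIT", 0),
}


def count_updates(ds):
    """Join status_updates with profiles on userid, then count per school and per gender."""
    by_school, by_gender = Counter(), Counter()
    for userid, _status, row_ds in status_updates:
        if row_ds != ds:
            continue                    # only read the requested partition
        if userid not in profiles:
            continue                    # inner join: skip users without a profile
        school, gender = profiles[userid]
        by_school[school] += 1
        by_gender[gender] += 1
    return dict(by_school), dict(by_gender)


by_school, by_gender = count_updates("2009-03-20")
print(by_school, by_gender)
```

Filtering on `ds` first mirrors how a partitioned table lets a query touch only the subdirectory for one date.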