Cloud Computing

Cloud Computing: Other High-Level Parallel Processing Languages
Keke Chen

Outline
• Sawzall
• Dryad and DryadLINQ (Microsoft, abandoned)
• Hive

Sawzall
• Simplifies MapReduce programming
• Filters + aggregators
  • The filter plays the role of the mapper
  • The aggregator plays the role of the reducer

Example
• The aggregators act as the reducers
• The filter acts as the mapper; it converts the input record to float

Input
• A Sawzall program works on a single record
• It acts as a filter applied to each record in the data stream
• Input can be parsed into
  • Values, e.g., float
  • Data structures
• Example: x: float = input; (variable: type = input)
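To make the per-record model concrete, here is a minimal Python sketch that imitates it. It is not Sawzall: the function names `process_record` and `run` and the table names `count`, `total`, and `sum_of_squares` are illustrative assumptions. The point is only that the program body sees one record at a time and hands values to sum tables.

```python
from collections import defaultdict

def process_record(record, emit):
    """Handle ONE input record, like the body of a Sawzall program."""
    x = float(record)              # mirrors: x: float = input;
    emit("count", 1)
    emit("total", x)
    emit("sum_of_squares", x * x)

def run(records):
    """Stream records through the filter and add up the emitted values."""
    tables = defaultdict(float)    # every table here behaves like "table sum"

    def emit(table, value):
        tables[table] += value

    for record in records:
        process_record(record, emit)
    return dict(tables)

result = run(["1.0", "2.0", "3.0"])
print(result)
```

Because `process_record` never looks at more than one record, the records can be split across machines and the partial sums combined afterwards.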

Aggregators
• Definition
  • table agg_name of data_type/variable
• Examples
  • c: table collection of string;
  • S: table sample(100) of string;
  • S: table sum of {count: int, revenue: float};
• More aggregators
  • maximum, quantile, top, unique
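The following Python sketch shows what a few of these aggregators do with the values emitted to them. The class names and the `emit`/`result` methods are assumptions made for illustration, and `Top` counts exactly, which is a simplification of what a large-scale aggregator would do.

```python
import random
from collections import Counter

class Collection:
    """Like "table collection": keep every emitted value."""
    def __init__(self):
        self.values = []
    def emit(self, value):
        self.values.append(value)

class Sum:
    """Like "table sum": keep a running total."""
    def __init__(self):
        self.total = 0
    def emit(self, value):
        self.total += value

class Sample:
    """Like "table sample(n)": keep a uniform random sample of n values."""
    def __init__(self, n, seed=0):
        self.n, self.seen, self.values = n, 0, []
        self.rng = random.Random(seed)
    def emit(self, value):
        self.seen += 1
        if len(self.values) < self.n:
            self.values.append(value)
        else:
            # Reservoir sampling: replace a kept value with probability n/seen.
            j = self.rng.randrange(self.seen)
            if j < self.n:
                self.values[j] = value

class Top:
    """Like "table top(n)": report the n most frequent values."""
    def __init__(self, n):
        self.n, self.counts = n, Counter()
    def emit(self, value):
        self.counts[value] += 1
    def result(self):
        return [v for v, _ in self.counts.most_common(self.n)]

c, s, smp, t = Collection(), Sum(), Sample(2), Top(1)
for word in ["a", "b", "a", "c"]:
    c.emit(word)
    s.emit(1)
    smp.emit(word)
    t.emit(word)
print(c.values, s.total, smp.values, t.result())
```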

Indexed aggregators
• Similar to “group by”; the index is the group id
• Example
  t1: table sum[country: string] of int;
  country: string = input;
  emit t1[country] <- 1;
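As a rough Python analogue of the indexed table above, a dictionary keyed by the index can stand in for `table sum[country: string] of int`. The sample records are made up for illustration.

```python
from collections import defaultdict

t1 = defaultdict(int)        # stands in for: t1: table sum[country: string] of int;

def process_record(record):
    country = record         # mirrors: country: string = input;
    t1[country] += 1         # mirrors: emit t1[country] <- 1;

# Made-up input: one country name per record.
for record in ["US", "DE", "US", "JP", "US"]:
    process_record(record)

counts = dict(t1)
print(counts)
```

The result matches what a SQL query of the form `SELECT country, COUNT(*) ... GROUP BY country` would return over the same records.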

Another example
  proto "querylog.proto"
  queries_per_degree: table sum[lat: int][lon: int] of int;
  log_record: QueryLogProto = input;
  loc: Location = locationinfo(log_record.ip);
  emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1;
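The two-level index can be imitated in Python with a counter keyed by a `(lat, lon)` pair. In this sketch the lookup table `FAKE_GEO`, the IP addresses, and the coordinates are invented stand-ins; the `locationinfo` defined here is only a stub with the same name as the function used in the example.

```python
from collections import Counter
from typing import NamedTuple

class Location(NamedTuple):
    lat: float
    lon: float

# Invented lookup data standing in for a real IP-to-location service.
FAKE_GEO = {
    "10.0.0.1": Location(37.4, -122.1),
    "10.0.0.2": Location(37.9, -122.8),
    "10.0.0.3": Location(40.7, -74.0),
}

def locationinfo(ip):
    """Stub: return the location recorded for an IP address."""
    return FAKE_GEO[ip]

queries_per_degree = Counter()   # index is the (lat, lon) pair

def process_record(log_record):
    loc = locationinfo(log_record["ip"])
    # Truncate to whole degrees, then add 1 to that cell.
    queries_per_degree[(int(loc.lat), int(loc.lon))] += 1

for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.3"]:
    process_record({"ip": ip})

print(dict(queries_per_degree))
```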

Performance
• Single-CPU speed: about 51 times slower than compiled C++

Performance

Dryad and DryadLINQ
• Dryad provides a low-level parallel data flow processing interface
  • Acyclic data flow graphs
  • Data communication methods include pipes, files, messages, and shared memory
• DryadLINQ
  • A high-level language for application developers
  • It hides the data flow details

Job = Directed Acyclic Graph
• Inputs
• Processing vertices
• Channels (file, pipe, shared memory)
• Outputs
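A job shaped as a directed acyclic graph can be run by repeatedly picking the vertices whose inputs are all available. The Python sketch below illustrates that idea with the standard-library `graphlib` module (Python 3.9+); it is not Dryad's scheduler, and the vertex names are made up.

```python
from graphlib import TopologicalSorter

# Made-up job: each vertex maps to the set of vertices whose output it reads.
graph = {
    "read_a": set(),
    "read_b": set(),
    "parse_a": {"read_a"},
    "parse_b": {"read_b"},
    "join": {"parse_a", "parse_b"},
    "output": {"join"},
}

def run_job(graph):
    """Return one valid execution order for the vertices of the job."""
    order = []
    sorter = TopologicalSorter(graph)
    sorter.prepare()                        # raises CycleError if the graph is cyclic
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # these could all run in parallel
        for vertex in ready:
            order.append(vertex)            # "run" the vertex
            sorter.done(vertex)             # its outputs are now available
    return order

order = run_job(graph)
print(order)
```

Every batch returned by `get_ready()` contains vertices that do not depend on each other, which is where the parallelism in such a job comes from.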

Runtime
• Services
  • Name server
  • Daemon
• Job Manager (JM)
  • Centralized coordinating process
  • User application to construct graph
  • Linked with Dryad libraries for scheduling vertices
• Vertex executable
  • Dryad libraries to communicate with JM
  • User application sees channels in/out
  • Arbitrary application code, can use local FS

Graph operators

Hive
• Developed by Facebook (open source)
• Mimics the SQL language
• Built on Hadoop/MapReduce

Hive data model: tables, partitions, buckets
• Table
  • Similar to a DB table
  • Stored in Hadoop directories
  • Built-in compression, serialization/deserialization
• Partitions
  • Groups in the table
  • Subdirectory in the table directory
• Buckets
  • Files in the partition directory
  • Key (column) based partition
• Layout: /table/partition/bucket1
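The directory layout above can be sketched as a function that maps a row to its file. This is an illustration only: the hash function (`zlib.crc32`), the `bucket<N>` naming, and the function name `bucket_path` are assumptions, not Hive's actual hashing or file naming.

```python
import zlib

def bucket_path(table, partition, key, num_buckets):
    """Return a /table/partition/bucketN style path for a row's bucketing key."""
    # Assumed hash: CRC32 of the key's text, reduced to a bucket number.
    bucket = zlib.crc32(str(key).encode("utf-8")) % num_buckets
    return f"/{table}/{partition}/bucket{bucket + 1}"

path = bucket_path("status_updates", "ds=2009-03-20", key=42, num_buckets=4)
print(path)
```

Rows with the same key always land in the same bucket file, so a query that filters or joins on that key can read only the matching buckets.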

Hive data model: column types
• Integers, floating point numbers, generic strings, dates, and booleans
• Nestable collection types: array and map

Architecture
• The metastore stores the schema of databases
• It uses a non-HDFS data store

Query processing
• Steps (similar to a DBMS)
  • Parser
  • Semantic analyzer
  • Logical plan generator (algebra tree)
  • Optimizer
  • Physical plan generator (to MapReduce jobs)

Operations: DDL and DML
• HiveQL: SQL-like, with slightly different syntax
• User-defined filtering and aggregation functions
  • Java only
• Map/reduce plugin for streaming processing
  • Can be implemented in any language
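A streaming plugin is an ordinary program that reads rows from standard input and writes rows to standard output, which is why any language works. The Python sketch below assumes tab-separated `userid` and `status` columns; the function names and the word-count logic are made up for illustration.

```python
import sys

def map_lines(lines):
    """Turn each 'userid<TAB>status' row into 'userid<TAB>word_count'."""
    for line in lines:
        userid, status = line.rstrip("\n").split("\t", 1)
        yield f"{userid}\t{len(status.split())}"

def main():
    """Entry point for a streaming job: stdin rows in, stdout rows out."""
    for out in map_lines(sys.stdin):
        print(out)

# Demo on in-memory rows; a streaming job would call main() instead.
rows = ["1\thello world\n", "2\tgood morning to all\n"]
mapped = list(map_lines(rows))
print(mapped)
```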

Example
• Facebook status updates
• Tables
  • status_updates(userid int, status string, ds string)
  • profiles(userid int, school string, gender int)
• Operations
  • Load data
    LOAD DATA LOCAL INPATH '/logs/status_updates' INTO TABLE status_updates PARTITION (ds='2009-03-20')
  • Count status updates by school and by gender
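The last operation amounts to joining the two tables on `userid` and then grouping twice. The Python sketch below spells out that logic on a handful of invented rows; it is not the HiveQL query itself, and the function name `count_updates` is an assumption.

```python
from collections import Counter

# Invented sample rows that follow the two table schemas.
status_updates = [   # (userid, status, ds)
    (1, "hi", "2009-03-20"),
    (1, "lunch", "2009-03-20"),
    (2, "hello", "2009-03-20"),
    (3, "bye", "2009-03-20"),
]
profiles = [         # (userid, school, gender)
    (1, "MIT", 0),
    (2, "MIT", 1),
    (3, "CMU", 1),
]

def count_updates(status_updates, profiles, ds):
    """Join on userid, then count updates per school and per gender."""
    profile_by_user = {uid: (school, gender) for uid, school, gender in profiles}
    by_school, by_gender = Counter(), Counter()
    for uid, _status, row_ds in status_updates:
        if row_ds != ds or uid not in profile_by_user:
            continue                      # wrong partition or no matching profile
        school, gender = profile_by_user[uid]
        by_school[school] += 1
        by_gender[gender] += 1
    return by_school, by_gender

by_school, by_gender = count_updates(status_updates, profiles, "2009-03-20")
print(dict(by_school), dict(by_gender))
```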

More query examples

Query examples

Query examples: using Hadoop streaming
