Wang Lam, Lu Liu, STS Prasad, Anand Rajaraman, Zoheb Vacheri, AnHai Doan @WalmartLabs

Muppet
Scalable MapUpdate data-stream processing
Wang Lam, Lu Liu, STS Prasad, Anand Rajaraman, Zoheb Vacheri, AnHai Doan @WalmartLabs

Road Map
• Motivation
• The MapUpdate framework
• An example data-stream computation
• Muppet implementation

The challenge
• Growing numbers of large, fast data streams
  • 300+ million Twitter status updates daily
  • 5+ million Foursquare checkins daily
  • 3+ billion Facebook Likes and comments daily
• Streams never stop
• Growing numbers of applications for data streams
  • Computations need to scale with the data
  • Applications need to stay up-to-date (“What’s going on now?”)
• Machines fail

The wish list
• Deliver low-latency processing
  • Application stays near real-time with its input stream
  • Computed data can be queried live
• Scale up on commodity hardware with computation and stream rate
• Easy to program
  • Simple model to enable rapid development of many applications
  • Ideally resemble widely adopted MapReduce

Data-stream computation
• Big data: MapReduce (Hadoop)
  • Map and Reduce steps
  • Batch process large input (e.g., from HDFS)
  • Hadoop distributes computation
• Fast data: MapUpdate (Muppet)
  • Map and Update steps
  • Continuously process streaming input (e.g., from network)
  • Muppet maintains computation and manages memory/storage

The MapReduce framework (Hadoop)
• Event
  • A <key, value> pair of data
• Map
  • A function that performs (stateless) computation on incoming events
• Reduce
  • A function that combines all input for a particular key
• Application
  • Map -> Reduce

The MapUpdate framework (Muppet)
• Event
  • A <key, value> pair of data
• Map
  • A function that performs (stateless) computation on incoming events
• Update
  • A function that updates a slate using incoming events
• Application
  • A directed graph of Mappers and Updaters
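The Map and Update definitions above can be sketched as a single-process simulation. This is only an illustration of the model, not Muppet code: `map_event`, `update_event`, `run_flow`, and the `%slates` hash are hypothetical names, and the word-count logic is an assumed example workload.

```perl
use strict;
use warnings;

# Hypothetical single-process stand-in for a MapUpdate flow: one mapper
# feeding one updater. None of these names come from Muppet itself.

my %slates;   # updater state: one slate (a hash) per event key

# Map: stateless; turns one input event into zero or more [key, value] events.
sub map_event {
    my ($key, $value) = @_;
    my @words = grep { length } split /\W+/, $value;
    return map { [ lc($_), 1 ] } @words;
}

# Update: folds one event into the slate that belongs to the event's key.
sub update_event {
    my ($key, $value) = @_;
    my $slate = $slates{$key} //= { count => 0 };
    $slate->{count} += $value;
    return $slate;
}

# Drive the flow: every mapper output is routed to the updater by key.
sub run_flow {
    my @input = @_;
    for my $event (@input) {
        update_event(@$_) for map_event(@$event);
    }
    return \%slates;
}

run_flow([ 1, "fast data" ], [ 2, "Fast streams" ]);
printf "%s=%d\n", $_, $slates{$_}{count} for sort keys %slates;
```

The slate is the only place state lives: the mapper can run anywhere, while all events for one key must reach the same slate.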

A MapUpdate application

An example Muppet application
Checkin counts on Foursquare
• Identify Foursquare checkins at various retailers
• Maintain a live count of retailer checkins
• Enable a display of the current counts at any time

An example Muppet application
Checkin counts on Foursquare
• Source: Read Foursquare stream and create key-value-pair events.
• Map: For each checkin event, identify a retailer and publish if found.
• Update: For each retailer checkin, increment appropriate count.
  • Updater slates hold live retailer check-in counts.

An example Muppet application
• Source: Read Foursquare stream and create key-value-pair events.
Input (excerpt):
    {
      "checkin": {
        "created": 1288052432,
        "venue": { "id": 453407, "name": "Walmart Neighborhood Market" }
      }
    }

An example Muppet application
• Source: Read Foursquare stream and create key-value-pair events.
Output:
    453407,
    {
      "checkin": {
        "created": 1288052432,
        "venue": { "id": 453407, "name": "Walmart Neighborhood Market" }
      }
    }

An example Muppet application
• Map: For each checkin event, identify a retailer and publish if found.
Input:
    453407,
    {
      "checkin": {
        "created": 1288052432,
        "venue": { "id": 453407, "name": "Walmart Neighborhood Market" }
      }
    }

An example Muppet application
• Map: For each checkin event, identify a retailer and publish if found.
Output:
    Walmart.1288052100,
    {
      "checkin": {
        "created": 1288052432,
        "venue": { "id": 453407, "name": "Walmart Neighborhood Market" }
      },
      "kosmix": { "timeslot": 1288052100, "interval": 900, "retailer": "Walmart" }
    }

An example Muppet application
• Update: For each retailer checkin, increment appropriate count.
Input:
    Walmart.1288052100,
    {
      "checkin": {
        "created": 1288052432,
        "venue": { "id": 453407, "name": "Walmart Neighborhood Market" }
      },
      "kosmix": { "timeslot": 1288052100, "interval": 900, "retailer": "Walmart" }
    }

An example Muppet application
• Update: For each retailer checkin, increment appropriate count.
Slate:
    Walmart.1288052100,
    { "retailer": "Walmart", "timeslot": 1288052100, "interval": 900, "count": 1 }
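The event key in the walkthrough above, Walmart.1288052100, combines the retailer name with the checkin time rounded down to a 900-second (15-minute) slot. A minimal sketch of that arithmetic follows; the helper names `timeslot` and `retailer_key` are hypothetical and only restate the rounding used in the example.

```perl
use strict;
use warnings;

# Round a Unix timestamp down to the start of its interval (900 s = 15 min).
sub timeslot {
    my ($created, $interval) = @_;
    return int($created / $interval) * $interval;
}

# Hypothetical helper: build the "<retailer>.<timeslot>" event key.
sub retailer_key {
    my ($retailer, $created, $interval) = @_;
    return $retailer . "." . timeslot($created, $interval);
}

print retailer_key("Walmart", 1288052432, 900), "\n";   # Walmart.1288052100
```

Because every checkin within the same 15-minute window maps to the same key, all of them land on the same slate and increment the same count.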

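The Source (stream receiver)

    # Read one checkin per line from the stream socket and publish it as an event.
    while ($checkin = <$sock>) {
        $checkin =~ s/^[^{]*//;              # drop anything before the JSON object
        next if ($checkin eq "");
        $checkin_count++;
        my $event;
        eval { $event = decode_json($checkin); };
        if ($@ or (!defined($event->{checkin}))) {
            $invalid_count++;                # unparseable, or not a checkin
        } else {
            # Publish the whole event, still wrapped in "checkin" and keyed by
            # venue ID, so that the mapper can read $event->{checkin}.
            my $venue = $event->{checkin}->{venue}->{id};
            $self->publish("FoursquareCheckin", $event, $venue);
        }
    }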

The Map (Foursquare::CheckinMapper)

    sub map {
        my $self = shift;
        my $event = shift;
        my $checkin = $event->{checkin};

        # Round the checkin time down to a 900-second (15-minute) slot.
        my $timeslot = int($checkin->{created} / 900) * 900;
        $event->{kosmix}->{timeslot} = $timeslot;
        $event->{kosmix}->{interval} = 900;

        # Identify the retailer from the venue name; 0 means no match.
        my $venue_name = $checkin->{venue}->{name};
        my $retailer = 0;
        $retailer = 'ToysRUs'  if ($venue_name =~ /toys.*r.*us/i);
        $retailer = 'Walmart'  if ($venue_name =~ /wal.*mart/i);
        $retailer = 'SamsClub' if ($venue_name =~ /sam.*club/i);

        # Publish only retailer checkins, keyed by "<retailer>.<timeslot>".
        if ($retailer) {
            $event->{kosmix}->{retailer} = $retailer;
            $self->publish("FoursquareRetailerCheckin", $event,
                           $retailer.".".$timeslot);
        }
    }

The Update (Foursquare::RetailerUpdater)

    package Foursquare::RetailerUpdater;
    use strict;
    use Muppet::Updater;
    our @ISA = qw( Muppet::Updater );

    sub update {
        my $self = shift;
        my $event = shift;
        my $slate = shift;     # current slate for this event's key
        my $config = shift;
        my $key = shift;

        # Copy the identifying fields into the slate, then count the checkin.
        $slate->{timeslot} = $event->{kosmix}->{timeslot};
        $slate->{interval} = $event->{kosmix}->{interval};
        $slate->{retailer} = $event->{kosmix}->{retailer};
        $slate->{count} += 1;
        return $slate;
    }

    1;
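To see how slates accumulate counts, the update logic can be exercised outside Muppet. The sketch below is an assumption-laden stand-in: `apply_checkin` is a hypothetical plain-function copy of the update body, and the `%slates` hash takes the place of Muppet's slate store, which is not modeled here.

```perl
use strict;
use warnings;

# Same slate logic as the update method above, as a plain function so it
# can run without the Muppet runtime (hypothetical name, not Muppet API).
sub apply_checkin {
    my ($event, $slate) = @_;
    $slate->{$_} = $event->{kosmix}{$_} for qw(timeslot interval retailer);
    $slate->{count} += 1;
    return $slate;
}

my %slates;   # stand-in for the slate store, keyed by event key

# Three Walmart checkins: the first two fall in the same 900-second slot.
for my $created (1288052432, 1288052500, 1288053100) {
    my $slot  = int($created / 900) * 900;
    my $key   = "Walmart.$slot";
    my $event = { kosmix => { timeslot => $slot, interval => 900,
                              retailer => "Walmart" } };
    $slates{$key} = apply_checkin($event, $slates{$key} // {});
}

print "$_ => $slates{$_}{count}\n" for sort keys %slates;
```

Each distinct key gets its own slate, so a new 15-minute slot starts a fresh count without any reset logic in the updater.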

The application configuration (flow graph)

    {
      "performer" : "foursquare_mapper",
      "type" : "perl",
      "class" : "Foursquare::CheckinMapper",
      "muppet_type" : "Mapper",
      "subscribes_to" : [ "FoursquareCheckin" ],
      "publishes_to" : [ "FoursquareRetailerCheckin" ]
    },
    {
      "performer" : "foursquare_retailer",
      "type" : "perl",
      "class" : "Foursquare::RetailerUpdater",
      "muppet_type" : "Updater",
      "workers" : 4,
      "slate_cache_max" : 10000,
      "slate_cache_write_after" : 1,
      "subscribes_to" : [ "FoursquareRetailerCheckin" ]
    }
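The subscribes_to and publishes_to fields above define the edges of the flow graph. As a rough illustration, the sketch below lists the streams that no performer publishes, which therefore have to come from an external source. The enclosing JSON array and the `external_streams` function are assumptions; the slide shows only the two performer objects, and this check is not part of Muppet.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Assumption: the performer entries sit in one JSON array; the slide shows
# only the two objects, so the enclosing brackets here are a guess.
my $config = decode_json(<<'JSON');
[
  { "performer": "foursquare_mapper", "muppet_type": "Mapper",
    "subscribes_to": ["FoursquareCheckin"],
    "publishes_to": ["FoursquareRetailerCheckin"] },
  { "performer": "foursquare_retailer", "muppet_type": "Updater",
    "subscribes_to": ["FoursquareRetailerCheckin"] }
]
JSON

# Hypothetical check: streams that are subscribed to but published by no
# performer must be fed by an external source.
sub external_streams {
    my ($performers) = @_;
    my %published;
    for my $p (@$performers) {
        $published{$_} = 1 for @{ $p->{publishes_to} // [] };
    }
    my %external;
    for my $p (@$performers) {
        for my $stream (@{ $p->{subscribes_to} // [] }) {
            $external{$stream} = 1 unless $published{$stream};
        }
    }
    my @names = sort keys %external;
    return @names;
}

print join(", ", external_streams($config)), "\n";
```

For this configuration the only externally fed stream is FoursquareCheckin, the one the Source publishes to.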

Example results

Implementation

Implementation
• Slate management
  • Slates are cached for performance
  • Cache is sharded by key for load distribution across machines
  • Slates are written to distributed key-value store for durability
• Event flow
  • Event queues buffer transient load spikes within an application
  • Host failover remaps load away from an unresponsive machine
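Sharding the slate cache by key means that every event for a given key must be routed to the same worker. The slides do not say how Muppet picks the worker, so the hash-modulo rule below is purely an assumption for illustration, and `worker_for_key` is a hypothetical name.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5);

# Hypothetical routing rule: hash the event key and take it modulo the
# number of workers, so a given key always reaches the same slate cache.
sub worker_for_key {
    my ($key, $workers) = @_;
    my $hash = unpack("N", md5($key));   # first 4 digest bytes as unsigned int
    return $hash % $workers;
}

# Route eight consecutive 15-minute slots for one retailer to 4 workers.
for my $i (0 .. 7) {
    my $key = "Walmart." . (1288052100 + 900 * $i);
    printf "%s -> worker %d\n", $key, worker_for_key($key, 4);
}
```

A fixed modulo like this reshuffles most keys when the worker count changes; handling failover gracefully would need a different scheme, which this sketch does not attempt.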

Challenges
• Host failover
• Hotspots (uneven load)
• Parallelization
• Slate caching
• Overload stability

Hotspots
• Some key distributions are highly nonuniform (e.g., Zipfian)
  • Keys based on natural-language word usage
  • Keys based on a set of varying popularity
• Mappers: Run any event anywhere.
• Updaters: Popular keys need access to the same slate.
  • Split associative and commutative computations
    • Split computation parallelizes partial results.
    • Propagate partial results to final result.
  • Reduce slate serialization/deserialization overhead
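Splitting an associative and commutative computation can be sketched with a counter: events for one hot key are spread over several partial slates, and the partial counts are added together afterwards. The round-robin split and the names `partial_counts` and `merge_counts` are assumptions for illustration, not Muppet's mechanism.

```perl
use strict;
use warnings;

# Hypothetical sketch: spread a hot key over $splits partial slates, then
# add the partial counts. This works because addition is associative and
# commutative, so the grouping and order of events do not matter.
sub partial_counts {
    my ($splits, @events) = @_;
    my @partial = (0) x $splits;
    my $next = 0;
    for my $event (@events) {
        $partial[ $next++ % $splits ] += $event;   # round-robin split
    }
    return @partial;
}

# Propagate the partial results into the final result.
sub merge_counts {
    my $total = 0;
    $total += $_ for @_;
    return $total;
}

my @partial = partial_counts(4, (1) x 10);
print "partials: @partial, total: ", merge_counts(@partial), "\n";
```

The same split would not be safe for an update whose result depends on event order, which is why the slide restricts it to associative and commutative computations.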

Usage
• Time
  • Running since mid-2010
• Developers
  • More than a dozen developers at WalmartLabs have used Muppet to develop their applications
• Data
  • Billions of events, tens of millions of slates processed

Related work
• MapReduce: work toward incremental batch runs of MapReduce, rather than continuous event processing in a revised framework (e.g., MapUpdate)
  • MapReduce Online (Condie et al.)
  • Nova (Olston et al.)
• Event-flow systems: systems that focus on the dispatch of events, leaving application state and storage (cf. MapUpdate slates) as a problem for the application developer
  • S4 (Neumeyer et al.)
  • Storm (Marz et al.)
• Streaming-query systems: systems that run and optimize queries in a prescribed query language (contrast low-level, general-purpose MapUpdate operators)
  • Aurora (StreamBase Systems) (Zdonik et al.)
  • SPADE for System S (InfoSphere Streams) (Gedik et al.)

Conclusion
Big Data : MapReduce :: Fast Data : MapUpdate
• Create soft-real-time applications on a simple programming model.
• Distributed stream-processing infrastructure scales computation across cores.

Muppet Scalable data-stream processing Big Fast Data @WalmartLabs
