Muppet: Scalable MapUpdate data-stream processing

Wang Lam, Lu Liu, STS Prasad, Anand Rajaraman, Zoheb Vacheri, AnHai Doan @ WalmartLabs


  1. Muppet: Scalable MapUpdate data-stream processing
     Wang Lam, Lu Liu, STS Prasad, Anand Rajaraman, Zoheb Vacheri, AnHai Doan @WalmartLabs

  2. Road Map
     • Motivation
     • The MapUpdate framework
     • An example data-stream computation
     • Muppet implementation

  3. The challenge
     • Growing numbers of large, fast data streams
       • 300+ million Twitter status updates daily
       • 5+ million Foursquare checkins daily
       • 3+ billion Facebook Likes and comments daily
     • Streams never stop
     • Growing numbers of applications for data streams
     • Computations need to scale with the data
     • Applications need to stay up-to-date (“What’s going on now?”)
     • Machines fail

  4. The wish list
     • Deliver low-latency processing
       • Application stays near real-time with its input stream
       • Computed data can be queried live
     • Scale up on commodity hardware with computation and stream rate
     • Easy to program
       • Simple model to enable rapid development of many applications
       • Ideally resemble widely adopted MapReduce

  5. Data-stream computation
     • Big data: MapReduce (Hadoop)
       • Map and Reduce steps
       • Batch process large input (e.g., from HDFS)
       • Hadoop distributes computation
     • Fast data: MapUpdate (Muppet)
       • Map and Update steps
       • Continuously process streaming input (e.g., from network)
       • Muppet maintains computation and manages memory/storage

  6. The MapReduce framework (Hadoop)
     • Event
       • A <key, value> pair of data
     • Map
       • A function that performs (stateless) computation on incoming events
     • Reduce
       • A function that combines all input for a particular key
     • Application
       • Map -> Reduce

  7. The MapUpdate framework (Muppet)
     • Event
       • A <key, value> pair of data
     • Map
       • A function that performs (stateless) computation on incoming events
     • Update
       • A function that updates a slate using incoming events
     • Application
       • A directed graph of Mappers and Updaters
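The definitions on this slide can be made concrete with a minimal single-process sketch in Python. Everything here is an assumption for illustration: the `MapUpdateApp` class, its method names, and the stream names are hypothetical and are not Muppet's API. The sketch only shows the shape of the model: events are <key, value> pairs on named streams, Mappers are stateless, and each Updater gets one slate per key.

```python
from collections import defaultdict, deque


class MapUpdateApp:
    """Toy single-process stand-in for a MapUpdate runtime (hypothetical, not Muppet's API)."""

    def __init__(self):
        self.mappers = defaultdict(list)   # stream -> map functions
        self.updaters = defaultdict(list)  # stream -> (updater name, update function)
        self.slates = {}                   # (updater name, key) -> slate dict
        self.queue = deque()               # pending (stream, key, value) events

    def subscribe_mapper(self, stream, fn):
        self.mappers[stream].append(fn)

    def subscribe_updater(self, stream, name, fn):
        self.updaters[stream].append((name, fn))

    def publish(self, stream, key, value):
        self.queue.append((stream, key, value))

    def run(self):
        # Drain pending events; with a live stream this loop would never finish.
        while self.queue:
            stream, key, value = self.queue.popleft()
            for fn in self.mappers[stream]:
                fn(self, key, value)  # Map: stateless, may publish new events
            for name, fn in self.updaters[stream]:
                slate = self.slates.setdefault((name, key), {})
                fn(self, key, value, slate)  # Update: mutates the slate for this key


def keep_venues(app, key, event):
    # Mapper: drop events without a venue, re-key the rest by venue.
    if "venue" in event:
        app.publish("venue_events", event["venue"], event)


def count_events(app, key, event, slate):
    # Updater: keep a running count per key in the slate.
    slate["count"] = slate.get("count", 0) + 1


app = MapUpdateApp()
app.subscribe_mapper("raw", keep_venues)
app.subscribe_updater("venue_events", "counter", count_events)
for event in [{"venue": "A"}, {"venue": "B"}, {}, {"venue": "A"}]:
    app.publish("raw", None, event)
app.run()
print(app.slates)
```

Because the graph is defined only by which streams each function subscribes to and publishes to, longer chains of Mappers and Updaters can be wired the same way.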

  8. A MapUpdate application

  9. An example Muppet application
     Checkin counts on Foursquare
     • Identify Foursquare checkins at various retailers
     • Maintain a live count of retailer checkins
     • Enable a display of the current counts at any time

  10. An example Muppet application
      Checkin counts on Foursquare
      • Source: Read Foursquare stream and create key-value-pair events.
      • Map: For each checkin event, identify a retailer and publish if found.
      • Update: For each retailer checkin, increment appropriate count.
      Updater slates hold live retailer checkin counts.

  11. An example Muppet application
      • Source: Read Foursquare stream and create key-value-pair events.
      Input (excerpt):
          {
            "checkin": {
              "created": 1288052432,
              "venue": { "id": 453407, "name": "Walmart Neighborhood Market" }
            }
          }

  12. An example Muppet application
      • Source: Read Foursquare stream and create key-value-pair events.
      Output:
          453407,
          {
            "checkin": {
              "created": 1288052432,
              "venue": { "id": 453407, "name": "Walmart Neighborhood Market" }
            }
          }

  13. An example Muppet application
      • Map: For each checkin event, identify a retailer and publish if found.
      Input:
          453407,
          {
            "checkin": {
              "created": 1288052432,
              "venue": { "id": 453407, "name": "Walmart Neighborhood Market" }
            }
          }

  14. An example Muppet application
      • Map: For each checkin event, identify a retailer and publish if found.
      Output:
          Walmart.1288052100,
          {
            "checkin": {
              "created": 1288052432,
              "venue": { "id": 453407, "name": "Walmart Neighborhood Market" }
            },
            "kosmix": { "timeslot": 1288052100, "interval": 900, "retailer": "Walmart" }
          }
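The output key on this slide can be checked by hand: the checkin time 1288052432 rounded down to a multiple of 900 seconds is 1288052100, and the key is the retailer name joined to that timeslot. The following Python sketch reproduces the Map step under stated assumptions: `map_checkin` and `RETAILERS` are hypothetical names, and the patterns are copied from the Foursquare::CheckinMapper slide rather than from any Muppet library.

```python
import re

INTERVAL = 900  # seconds per timeslot (15 minutes), as on the slides

# Same patterns as the Perl Mapper; a later match overrides an earlier one,
# which mirrors the order of the assignments in the original code.
RETAILERS = [
    ("ToysRUs", re.compile(r"toys.*r.*us", re.IGNORECASE)),
    ("Walmart", re.compile(r"wal.*mart", re.IGNORECASE)),
    ("SamsClub", re.compile(r"sam.*club", re.IGNORECASE)),
]


def map_checkin(event):
    """Return (key, enriched event) for a retailer checkin, or None otherwise."""
    checkin = event["checkin"]
    timeslot = checkin["created"] // INTERVAL * INTERVAL  # round down to slot start
    retailer = None
    for name, pattern in RETAILERS:
        if pattern.search(checkin["venue"]["name"]):
            retailer = name
    if retailer is None:
        return None  # not a tracked retailer: publish nothing
    enriched = dict(
        event,
        kosmix={"timeslot": timeslot, "interval": INTERVAL, "retailer": retailer},
    )
    return f"{retailer}.{timeslot}", enriched


example = {
    "checkin": {
        "created": 1288052432,
        "venue": {"id": 453407, "name": "Walmart Neighborhood Market"},
    }
}
key, enriched = map_checkin(example)
print(key)
```

Keying by retailer and timeslot together means each 15-minute window gets its own slate, so old windows stop receiving updates once their interval has passed.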

  15. An example Muppet application
      • Update: For each retailer checkin, increment appropriate count.
      Input:
          Walmart.1288052100,
          {
            "checkin": {
              "created": 1288052432,
              "venue": { "id": 453407, "name": "Walmart Neighborhood Market" }
            },
            "kosmix": { "timeslot": 1288052100, "interval": 900, "retailer": "Walmart" }
          }

  16. An example Muppet application
      • Update: For each retailer checkin, increment appropriate count.
      Slate:
          Walmart.1288052100,
          {
            "retailer": "Walmart",
            "timeslot": 1288052100,
            "interval": 900,
            "count": 1
          }

  17. The Source (stream receiver)
          while ($checkin = <$sock>) {
              $checkin =~ s/^[^{]*//;    # strip anything before the JSON object
              next if ($checkin eq "");
              $checkin_count++;
              my $event;
              # decode_json is exported by the JSON module
              eval { $event = decode_json($checkin); };
              if ($@ or (!defined($event->{checkin}))) {
                  $invalid_count++;
              } else {
                  # Publish the whole event keyed by venue ID, as in the Source
                  # output above; the Mapper reads $event->{checkin}.
                  my $venue = $event->{checkin}->{venue}->{id};
                  $self->publish("FoursquareCheckin", $event, $venue);
              }
          }

  18. The Map (Foursquare::CheckinMapper)
          sub map {
              my $self = shift;
              my $event = shift;
              my $checkin = $event->{checkin};

              # Round the checkin time down to the start of its 900-second timeslot.
              my $timeslot = int($checkin->{created} / 900) * 900;
              $event->{kosmix}->{timeslot} = $timeslot;
              $event->{kosmix}->{interval} = 900;

              # Match the venue name against the tracked retailers.
              my $venue_name = $checkin->{venue}->{name};
              my $retailer = 0;
              $retailer = 'ToysRUs'  if ($venue_name =~ /toys.*r.*us/i);
              $retailer = 'Walmart'  if ($venue_name =~ /wal.*mart/i);
              $retailer = 'SamsClub' if ($venue_name =~ /sam.*club/i);

              # Publish only retailer checkins, keyed by "<retailer>.<timeslot>".
              if ($retailer) {
                  $event->{kosmix}->{retailer} = $retailer;
                  $self->publish("FoursquareRetailerCheckin", $event,
                                 $retailer.".".$timeslot);
              }
          }

  19. The Update (Foursquare::RetailerUpdater)
          use Muppet::Updater;
          package Foursquare::RetailerUpdater;
          @ISA = qw( Muppet::Updater );
          use strict;

          sub update {
              my $self = shift;
              my $event = shift;
              my $slate = shift;
              my $config = shift;
              my $key = shift;
              $slate->{timeslot} = $event->{kosmix}->{timeslot};
              $slate->{interval} = $event->{kosmix}->{interval};
              $slate->{retailer} = $event->{kosmix}->{retailer};
              $slate->{count} += 1;
              return $slate;
          }
          1;

  20. The application configuration (flow graph)
          {
            "performer" : "foursquare_mapper",
            "type" : "perl",
            "class" : "Foursquare::CheckinMapper",
            "muppet_type" : "Mapper",
            "subscribes_to" : [ "FoursquareCheckin" ],
            "publishes_to" : [ "FoursquareRetailerCheckin" ]
          },
          {
            "performer" : "foursquare_retailer",
            "type" : "perl",
            "class" : "Foursquare::RetailerUpdater",
            "muppet_type" : "Updater",
            "workers" : 4,
            "slate_cache_max" : 10000,
            "slate_cache_write_after" : 1,
            "subscribes_to" : [ "FoursquareRetailerCheckin" ]
          }
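To see how this configuration describes a flow graph, the sketch below derives the graph's edges from the `subscribes_to` and `publishes_to` fields. It is an illustration only: wrapping the two performer entries in a JSON array is an assumption (the slide shows a fragment), the entries are abbreviated to the fields used here, and `flow_edges` is a hypothetical helper, not part of Muppet.

```python
import json

# The two performers from the slide, abbreviated and wrapped in a list (assumption).
CONFIG = json.loads("""
[
  {"performer": "foursquare_mapper", "muppet_type": "Mapper",
   "subscribes_to": ["FoursquareCheckin"],
   "publishes_to": ["FoursquareRetailerCheckin"]},
  {"performer": "foursquare_retailer", "muppet_type": "Updater",
   "subscribes_to": ["FoursquareRetailerCheckin"]}
]
""")


def flow_edges(config):
    """Return (publisher, stream, subscriber) edges implied by the config."""
    subscribers = {}
    for performer in config:
        for stream in performer.get("subscribes_to", []):
            subscribers.setdefault(stream, []).append(performer["performer"])
    edges = []
    for performer in config:
        for stream in performer.get("publishes_to", []):
            for subscriber in subscribers.get(stream, []):
                edges.append((performer["performer"], stream, subscriber))
    return edges


edges = flow_edges(CONFIG)
for publisher, stream, subscriber in edges:
    print(f"{publisher} --{stream}--> {subscriber}")
```

The `FoursquareCheckin` stream has no publisher in the configuration because it is fed by the Source, which sits outside this fragment.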

  21. Example results

  22. Implementation

  23. Implementation
      • Slate management
        • Slates are cached for performance
        • Cache is sharded by key for load distribution across machines
        • Slates are written to distributed key-value store for durability
      • Event flow
        • Event queues buffer transient load spikes within an application
        • Host failover remaps load away from an unresponsive machine
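The slate-management bullets can be illustrated with a small Python sketch. It assumes details the slides do not give: a least-recently-used eviction policy, write-through on every update, CRC32 as the sharding hash, and a plain dict standing in for the distributed key-value store. `SlateCache` and `shard_for` are hypothetical names, not Muppet's implementation.

```python
import zlib
from collections import OrderedDict


def shard_for(key, num_workers):
    """Map a key to a worker index with a stable hash (CRC32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_workers


class SlateCache:
    """Bounded slate cache in front of a durable store (a dict stands in for the store)."""

    def __init__(self, store, max_slates):
        self.store = store
        self.max_slates = max_slates     # assumed to be at least 1
        self.cache = OrderedDict()       # key -> slate, least recently used first

    def _evict_if_full(self):
        while len(self.cache) > self.max_slates:
            # Safe to drop: put() already wrote every update through to the store.
            self.cache.popitem(last=False)

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)
        else:
            # Miss: load a copy from the store, or start with an empty slate.
            self.cache[key] = dict(self.store.get(key, {}))
            self._evict_if_full()
        return self.cache[key]

    def put(self, key, slate):
        self.cache[key] = slate
        self.cache.move_to_end(key)
        self.store[key] = dict(slate)    # write-through for durability
        self._evict_if_full()


store = {}
cache = SlateCache(store, max_slates=2)
for key in ["a", "b", "c"]:
    slate = cache.get(key)
    slate["count"] = slate.get("count", 0) + 1
    cache.put(key, slate)
print(sorted(cache.cache), sorted(store))
```

A stable hash matters for sharding because every event for a given key must reach the worker that holds that key's cached slate; Python's built-in `hash` for strings varies between processes, so it is avoided here.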

  24. Challenges
      • Host failover
      • Hotspots (uneven load)
      • Parallelization
      • Slate caching
      • Overload stability

  25. Hotspots
      • Some key distributions are highly nonuniform (e.g., Zipfian)
        • Keys based on natural-language word usage
        • Keys based on a set of varying popularity
      • Mappers: Run any event anywhere.
      • Updaters: Popular keys need access to the same slate.
      • Split associative and commutative computations
        • Split computation parallelizes partial results.
        • Propagate partial results to final result.
      • Reduce slate serialization/deserialization overhead
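Splitting an associative and commutative computation can be sketched for the simplest case, a count. The Python below spreads each hot key over several sub-keys so that partial counts can be kept in separate slates, then sums the partials into the final result. The sub-key format (`key#n`), the random assignment, and the function names are assumptions for illustration; the slides do not describe how Muppet performs the split.

```python
import random
from collections import Counter


def split_key(key, fanout, rng):
    """Pick one of `fanout` sub-keys for this event, e.g. 'Walmart#2'."""
    return f"{key}#{rng.randrange(fanout)}"


def update_partial(partials, subkey):
    # Partial updater: each sub-key has its own count, so updates can run in parallel.
    partials[subkey] += 1


def merge(partials):
    """Propagate partial counts to one final count per original key."""
    totals = Counter()
    for subkey, count in partials.items():
        key, _, _ = subkey.rpartition("#")
        totals[key] += count
    return totals


rng = random.Random(0)
partials = Counter()
events = ["Walmart"] * 1000 + ["ToysRUs"] * 10
for key in events:
    update_partial(partials, split_key(key, fanout=4, rng=rng))
totals = merge(partials)
print(dict(totals))
```

This works because addition is associative and commutative, so the order and grouping of the partial sums do not affect the total; a computation without those properties could not be split this way.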

  26. Usage
      • Time
        • Running since mid-2010
      • Developers
        • More than a dozen developers at WalmartLabs have used Muppet to develop their applications
      • Data
        • Billions of events, tens of millions of slates processed

  27. Related work
      • MapReduce: work toward incremental batch runs of MapReduce, rather than continuous event processing in a revised framework (e.g., MapUpdate)
        • MapReduce Online (Condie et al.)
        • Nova (Olston et al.)
      • Event-flow systems: systems that focus on the dispatch of events, leaving application state and storage (cf. MapUpdate slates) as a problem for the application developer
        • S4 (Neumeyer et al.)
        • Storm (Marz et al.)
      • Streaming-query systems: systems that run and optimize queries in a prescribed query language (contrast low-level, general-purpose MapUpdate operators)
        • Aurora (StreamBase Systems) (Zdonik et al.)
        • SPADE for System S (InfoSphere Streams) (Gedik et al.)

  28. Conclusion
      Big Data : MapReduce :: Fast Data : MapUpdate
      • Create soft-real-time applications on a simple programming model.
      • Distributed stream-processing infrastructure scales computation across cores.

  29. Muppet: Scalable data-stream processing
      Big Fast Data @WalmartLabs
