Scalable stream processing with Storm

Presentation Transcript


  1. Scalable stream processing with Storm Brian Johnson Luke Forehand

  2. Our Ambition: Marketing Decision Platform [Diagram: a map of marketing decision areas. Labels include Brand Health, Choice & Experience, Brand & Category Environment, Budgets, Competitive Positioning, Product Development, Sales Management, Advertising, Relationship Marketing, Digital Marketing & Advertising, Consumer Segmentation & Targeting, Retailer Management, and Pricing.]

  3. Big Data Analytics • What is “Big Data” to Networked Insights? • Almost exclusively social media posts and metadata • Twitter (~67%), forums, blogs, Facebook, etc. • Total index: ~60 billion documents, ~500 TB in production • ~2 billion new documents per month, and increasing • Historical data going back to 2009 [Diagram labels: Data, Information, Thematic Clustering (Doppler)]

  4. Utilizing Social Media Data • We do two things: 1) Filter data; 2) Analyze data • Our filtering technology must accommodate two scenarios • We analyze two types of information: implicit and explicit

  5. Implicit Information Mining Example • Gender classification: list of methods and features • Author name / author ID analysis: compare both fields against a list of first names from the US Census • Twitter summary field analysis • Post content features: analyze the content for clues or common characteristics that one gender shows more than another • Text formality – males tend to write more formally than females • Suffix preferences – many suffixes show up more in female posts than in male posts • Word classes – 23 groups of words reflecting topics or emotions that skew towards one gender • Lexical words & phrases – certain words/phrases that are giveaways, like “my husband” • POS sequences – certain part-of-speech patterns for unigram, bigram, trigram, and quadgram phrases
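The author name / author ID check on slide 5 can be pictured as a simple lookup. The sketch below is only an illustration: the two tiny name sets stand in for a census-derived list, and the class name, method names, and "first name-like token" rule are all assumptions rather than the presenters' classifier.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch only; not the classifier described in the talk. */
class NameGenderSketch {
    // Hypothetical sample data standing in for a census first-name list.
    static final Set<String> FEMALE = new HashSet<>(Arrays.asList("mary", "linda", "susan"));
    static final Set<String> MALE = new HashSet<>(Arrays.asList("james", "john", "robert"));

    /** Checks the author name first, then the author ID; returns "female", "male", or "unknown". */
    static String guess(String authorName, String authorId) {
        String fromName = fromField(authorName);
        return fromName.equals("unknown") ? fromField(authorId) : fromName;
    }

    /** Looks up the first alphabetic token of one field in the name lists. */
    private static String fromField(String field) {
        if (field == null) return "unknown";
        for (String token : field.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) continue;      // skip the empty piece before a leading digit or symbol
            if (FEMALE.contains(token)) return "female";
            if (MALE.contains(token)) return "male";
            break;                              // only the first name-like token is considered
        }
        return "unknown";
    }
}
```

Calling `NameGenderSketch.guess("", "john_doe")` falls through to the ID field and matches on `john`. A real system would presumably weigh this signal together with the content features listed on the slide.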

  6. Lots of data, lots of routing [Diagram labels: Original Documents, Meta Data, Spam Classifiers, Topical Categorization (Timberlake? Bieber? Jay Z?; World War Z? Monsters U? White House Down?; Taco Bell? McDonald’s? Subway?; iPhone? Samsung? BlackBerry?; etc.), Gender Analysis, Age Classification, Sentiment, Reporting Layer, SocialSense Application Layer. Legend: [marker] = Storm]

  7. Storm Agenda • Overview • Architecture • Working Example • Spout API / Reliability • Bolt API / Scalability • Topology Demo • Monitoring

  8. Overview • Storm is a realtime distributed processing system • Think of Hadoop but in realtime • Data can be transformed and grouped in complex ways using simple constructs

  9. Overview • Storm is a realtime distributed processing system • Think of Hadoop but in realtime • Data can be transformed and grouped in complex ways using simple constructs • Storm is reliable and fault tolerant • Message delivery is guaranteed

  10. Overview • Storm is a realtime distributed processing system • Think of Hadoop but in realtime • Data can be transformed and grouped in complex ways using simple constructs • Storm is reliable and fault tolerant • Message delivery is guaranteed • Storm is easy to configure and scale • Each component can be scaled independently

  11. Overview • Storm is a realtime distributed processing system • Think of Hadoop but in realtime • Data can be transformed and grouped in complex ways using simple constructs • Storm is reliable and fault tolerant • Message delivery is guaranteed • Storm is easy to configure and scale • Each component can be scaled independently • Components can be written in any language

  12. Overview • Storm is a realtime distributed processing system • Think of Hadoop but in realtime • Data can be transformed and grouped in complex ways using simple constructs • Storm is reliable and fault tolerant • Message delivery is guaranteed • Storm is easy to configure and scale • Each component can be scaled independently • Components can be written in any language • Written in Clojure (functional language), driven by ZeroMQ

  13. Architecture • Components

  14. Architecture • Nimbus • “Master” • Uses Zookeeper to communicate with Supervisors • Responsible for assigning work to supervisors

  15. Architecture • Nimbus • “Master” • Uses Zookeeper to communicate with Supervisors • Responsible for assigning work to supervisors • Supervisor • Manages a set of workers (JVMs) on each storm node • Receives work assignments from Nimbus

  16. Architecture • Nimbus • “Master” • Uses Zookeeper to communicate with Supervisors • Responsible for assigning work to supervisors • Supervisor • Manages a set of workers (JVMs) on each storm node • Receives work assignments from Nimbus • Worker • Managed by Supervisor • Responsible for receiving, executing, and emitting data inside a storm topology

  17. Working Example

  18. Working Example

  19. Working Example • Topology • Defines the logical components of a data flow

  20. Working Example • Topology • Defines the logical components of a data flow • Composed of Spouts, Bolts, Streams

  21. Working Example • Topology • Defines the logical components of a data flow • Composed of Spouts, Bolts, Streams • Spout is a special component that emits data tuples into a topology

  22. Working Example • Topology • Defines the logical components of a data flow • Composed of Spouts, Bolts, Streams • Spout is a special component that emits data tuples into a topology • Bolt processes tuples emitted from upstream components and produces zero or many output tuples

  23. Working Example • Topology • Defines the logical components of a data flow • Composed of Spouts, Bolts, Streams • Spout is a special component that emits data tuples into a topology • Bolt processes tuples emitted from upstream components and produces zero or many output tuples • Stream is a flow of tuples from one component to another; there can be many

  24. Working Example • Topology • Defines the logical components of a data flow • Composed of Spouts, Bolts, Streams • Spout is a special component that emits data tuples into a topology • Bolt processes tuples emitted from upstream components and produces zero or many output tuples • Stream is a flow of tuples from one component to another; there can be many • Tuple is a single record containing a named list of values
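The vocabulary on slide 24 can be made concrete with a small in-memory model. This is a sketch under stated assumptions: it uses only the JDK and does not touch Storm's API, and `TopologySketch`, `Bolt`, `SPLIT`, `LONG_ONLY`, and `run` are names invented here for illustration. A tuple is modelled as a map of named values, the spout's output as a list, and each hop between components as a stream.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative model of spout -> bolt -> bolt; not Storm's actual API. */
class TopologySketch {
    /** A tuple: a single record holding named values. */
    static Map<String, Object> tuple(String field, Object value) {
        Map<String, Object> t = new LinkedHashMap<>();
        t.put(field, value);
        return t;
    }

    /** A bolt consumes one tuple and emits zero or many output tuples. */
    interface Bolt {
        List<Map<String, Object>> execute(Map<String, Object> input);
    }

    /** Emits one "word" tuple per word of a "sentence" tuple (many outputs). */
    static final Bolt SPLIT = input -> {
        List<Map<String, Object>> out = new ArrayList<>();
        for (String w : ((String) input.get("sentence")).split("\\s+")) {
            if (!w.isEmpty()) out.add(tuple("word", w));
        }
        return out;
    };

    /** Passes words longer than three characters and drops the rest (zero outputs). */
    static final Bolt LONG_ONLY = input -> {
        String word = (String) input.get("word");
        return word.length() > 3
                ? Collections.singletonList(input)
                : Collections.<Map<String, Object>>emptyList();
    };

    /** Feeds the spout's tuples through each bolt in order; every hop is a stream. */
    static List<Map<String, Object>> run(List<Map<String, Object>> spoutOutput, Bolt... bolts) {
        List<Map<String, Object>> stream = spoutOutput;
        for (Bolt bolt : bolts) {
            List<Map<String, Object>> next = new ArrayList<>();
            for (Map<String, Object> t : stream) next.addAll(bolt.execute(t));
            stream = next;
        }
        return stream;
    }
}
```

Running the sentences "storm is fast" and "a b" through `SPLIT` and then `LONG_ONLY` leaves two tuples, for "storm" and "fast". In a real topology the components run in parallel across machines rather than in a single loop.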

  25. Working Example

  26. Spout API • ISpout • void declareOutputFields(OutputFieldsDeclarer declarer) • void open(Map conf, TopologyContext context, SpoutOutputCollector collector) • void nextTuple() • void close() • ISpoutOutputCollector • List<Integer> emit(String streamId, List<Object> tuple, Object messageId)

  27. Reliability • Each Storm component acknowledges that a tuple has been processed

  28. Reliability • Each Storm component acknowledges that a tuple has been processed • An ACK is sent to the upstream component, eventually propagating back to the emitting spout

  29. Reliability • Each Storm component acknowledges that a tuple has been processed • An ACK is sent to the upstream component, eventually propagating back to the emitting spout • The emitting spout will replay the tuple if ACK is not received within a configured timeout

  30. Reliability • Each Storm component acknowledges that a tuple has been processed • An ACK is sent to the upstream component, eventually propagating back to the emitting spout • The emitting spout will replay the tuple if ACK is not received within a configured timeout • Spouts can control the number of “pending” tuples that are in memory in the topology

  31. Reliability • Each Storm component acknowledges that a tuple has been processed • An ACK is sent to the upstream component, eventually propagating back to the emitting spout • The emitting spout will replay the tuple if ACK is not received within a configured timeout • Spouts can control the number of “pending” tuples that are in memory in the topology • Spouts need to transact properly with an upstream data source when a tuple is fully acknowledged

  32. Reliability • ISpout • void ack(Object msgId) • void fail(Object msgId)
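The ack/replay bookkeeping described on slides 27 to 32 might look roughly like the sketch below. It is a JDK-only illustration, not Storm code: in Storm the framework tracks the timeout and notifies the spout through `fail(msgId)`, whereas this hypothetical `ReplaySketch` class folds both roles together and takes the clock as a parameter so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative pending-tuple tracker; names and structure are assumptions. */
class ReplaySketch {
    private final long timeoutMs;
    // msgId -> time of the most recent emit, for tuples not yet acked
    private final Map<Object, Long> pendingSince = new LinkedHashMap<>();

    ReplaySketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Records that a tuple with this message ID entered the topology. */
    void emit(Object msgId, long nowMs) {
        pendingSince.put(msgId, nowMs);
    }

    /** The tuple was fully processed downstream; stop tracking it. */
    void ack(Object msgId) {
        pendingSince.remove(msgId);
    }

    /** Returns the IDs whose ack did not arrive in time and restarts their timers. */
    List<Object> replayTimedOut(long nowMs) {
        List<Object> replayed = new ArrayList<>();
        for (Map.Entry<Object, Long> e : pendingSince.entrySet()) {
            if (nowMs - e.getValue() >= timeoutMs) {
                replayed.add(e.getKey());
                e.setValue(nowMs); // treated as re-emitted now
            }
        }
        return replayed;
    }

    int pendingCount() {
        return pendingSince.size();
    }
}
```

The `ack` method is also the natural place for a spout to commit to its upstream source (for example, deleting a message from a queue), which is the "transact properly" point made on slide 31.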

  33. Reliability • MAX_SPOUT_PENDING is the parameter to control how many pending tuples a spout can emit into a topology • Be careful not to artificially decrease throughput!

  34. Reliability • MAX_SPOUT_PENDING is the parameter to control how many pending tuples a spout can emit into a topology • Be careful not to artificially decrease throughput! • Batching operations with reliability turned on can also create issues

  35. Reliability • If max_spout_pending is smaller than the batch size, the topology will stall: the batch never fills, so its tuples are never acked • If the tuple flow is interrupted, a batch may never fill
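Both failure modes on slide 35 show up in a few lines of simulation. The model below is an assumption made for illustration: a single spout feeds a single batching bolt that acks its tuples only when a full batch has been buffered, and the hypothetical `simulate` method reports how many tuples end up acked.

```java
/** Illustrative simulation of a pending cap interacting with size-based batching. */
class PendingVsBatchSketch {
    /** Returns how many of `total` tuples get acked before the spout can no longer emit. */
    static int simulate(int total, int maxPending, int batchSize) {
        int emitted = 0;   // tuples the spout has sent so far
        int pending = 0;   // sent but not yet acked
        int buffered = 0;  // held by the bolt, waiting for a full batch
        int acked = 0;

        // The spout may emit only while it is under the pending cap.
        while (emitted < total && pending < maxPending) {
            emitted++;
            pending++;
            buffered++;
            // The bolt commits and acks only once a full batch has arrived.
            if (buffered == batchSize) {
                acked += buffered;
                pending -= buffered;
                buffered = 0;
            }
        }
        return acked;
    }
}
```

With `simulate(100, 10, 5)` every tuple is acked. With `simulate(100, 4, 5)` the spout hits the cap after four tuples, the batch of five never fills, and nothing is acked. With `simulate(7, 10, 5)` the flow dries up with two tuples stranded in a partial batch.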

  36. Reliability • Solution: time based batching with TickTuple • TickTuple exercises the component to prompt a batch commit on a specified interval
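A minimal sketch of the time-based batching idea on slide 36, assuming a bolt that buffers string values. It does not use Storm's API: in Storm the tick interval is set through topology configuration and the tick arrives as a special tuple, while here `onTick` is simply a method the caller invokes on a schedule. `TickBatchSketch` and its method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative batcher that commits on size or on a periodic tick, whichever comes first. */
class TickBatchSketch {
    private final int batchSize;
    private final List<String> buffer = new ArrayList<>();
    private final List<List<String>> committed = new ArrayList<>();

    TickBatchSketch(int batchSize) {
        this.batchSize = batchSize;
    }

    /** Buffers a value and commits as soon as the batch is full. */
    void onTuple(String value) {
        buffer.add(value);
        if (buffer.size() >= batchSize) flush();
    }

    /** Called on a fixed interval; commits whatever has accumulated, even a partial batch. */
    void onTick() {
        flush();
    }

    private void flush() {
        if (buffer.isEmpty()) return; // nothing to commit on an idle tick
        committed.add(new ArrayList<>(buffer));
        buffer.clear();
    }

    List<List<String>> committed() {
        return committed;
    }
}
```

Because a tick always drains the buffer, a partial batch waits at most one interval, so its tuples can be acked even when the flow slows down or the pending cap is below the batch size.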

  37. Reliability • Questions?

  38. Bolt API

  39. Bolt API • Stream groupings define how bolts receive streams as input. We’ll talk about the two basic types.

  40. Bolt API • Stream groupings define how bolts receive streams as input. We’ll talk about the two basic types. • Shuffle grouping – tuples are randomly distributed across the instances of a bolt

  41. Bolt API • Stream groupings define how bolts receive streams as input. We’ll talk about the two basic types. • Shuffle grouping – tuples are randomly distributed across the instances of a bolt • Fields grouping – stream is partitioned by fields specified in the grouping, so tuples with a particular named value will always flow to the same bolt instance
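The difference between the two groupings comes down to how a target bolt instance is chosen for each tuple. The sketch below illustrates the idea only; it is not Storm's implementation, and `GroupingSketch`, `shuffle`, and `fields` are names made up for this example.

```java
import java.util.Random;

/** Illustrative routing rules for picking one of `numTasks` bolt instances. */
class GroupingSketch {
    /** Shuffle grouping: any instance may receive the tuple. */
    static int shuffle(Random rng, int numTasks) {
        return rng.nextInt(numTasks);
    }

    /** Fields grouping: the same field value always maps to the same instance. */
    static int fields(Object fieldValue, int numTasks) {
        // floorMod keeps the index non-negative even when hashCode() is negative
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }
}
```

Every tuple whose grouping field is, say, "bieber" lands on the same instance under `fields`, which is what makes per-key counting or aggregation inside a bolt possible. Under `shuffle` the same value can land anywhere, which suits stateless work.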

  42. Bolt API

  43. Bolt API • IBolt • void declareOutputFields(OutputFieldsDeclarer declarer) • void prepare(Map stormConf, TopologyContext context, OutputCollector collector) • void execute(Tuple input) • void cleanup() • IOutputCollector • List<Integer> emit(String streamId, Collection<Tuple> anchors, List<Object> tuple) • void ack(Tuple input) • void fail(Tuple input)

  44. Bolt API • You can also build the components of your topology in other languages: public class MyPythonBolt extends ShellBolt { public MyPythonBolt() { super("python", "mybolt.py"); } ... }

  45. Scalability • The goal should be to scale components accordingly in order to keep up with realtime data flow

  46. Scalability • The goal should be to scale components accordingly in order to keep up with realtime data flow • Scalability is easy and can happen in several ways • Increase the number of executors (threads) that work within a component (bolt or spout)

  47. Scalability • The goal should be to scale components accordingly in order to keep up with realtime data flow • Scalability is easy and can happen in several ways • Increase the number of executors (threads) that work within a component (bolt or spout) • Increase the number of workers assigned to a topology

  48. Scalability • The goal should be to scale components accordingly in order to keep up with realtime data flow • Scalability is easy and can happen in several ways • Increase the number of executors (threads) that work within a component (bolt or spout) • Increase the number of workers assigned to a topology • Increase total workers available in cluster

  49. Scalability • Example topology: increasing the number of executors per component

  50. Scalability • Example topology: increasing the number of workers in the topology • 2 workers, MySpout with 2 executors, MyBolt with 4 executors • 4 workers, MySpout with 2 executors, MyBolt with 4 executors • Work will always be spread evenly across the workers when possible
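The "spread evenly when possible" point on slide 50 can be checked with simple arithmetic. The round-robin rule below is an assumption chosen for illustration and is not Storm's scheduler; `WorkerSpreadSketch.spread` is a hypothetical helper.

```java
/** Illustrative round-robin placement of executors onto workers. */
class WorkerSpreadSketch {
    /** Returns how many executors each worker receives. */
    static int[] spread(int executors, int workers) {
        int[] perWorker = new int[workers];
        for (int e = 0; e < executors; e++) {
            perWorker[e % workers]++; // deal executors out one worker at a time
        }
        return perWorker;
    }
}
```

The slide's topology has six executors in total (two for MySpout, four for MyBolt). With two workers, `spread(6, 2)` gives three executors per worker; with four workers, `spread(6, 4)` gives 2, 2, 1, 1, which is as even as six threads across four JVMs can be.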
