Lsd Distributed Systems Laboratory
StreamCloud: an Elastic Parallel-Distributed Stream Processing Engine
Ph.D. Student: Vincenzo Massimiliano Gulisano
Director: Ricardo Jiménez Peris
Co-director: Patrick Valduriez
December 20, 2012
StreamCloud in a Nutshell
• Sample data streaming application (fraud detection)
• Pioneer Stream Processing Engines (SPEs)
• Contributions
Background - Pioneer SPEs
• Centralized SPE [Figure: centralized SPE, 100% CPU]
• Distributed SPE [Figure: distributed SPE, 100% CPU]
Contributions - StreamCloud
• 1. Parallelization
• 2. Elasticity
• 3. Fault Tolerance
• 4. Integrated Development Environment
Agenda
• Introduction
  • Motivation
  • System Model
• Parallelization
• Elasticity
• Fault tolerance
• Integrated Development Environment
• Conclusions
Motivation
• Financial applications, sensor network monitoring, … require:
  • Continuous processing of data streams
  • Processing in real time
• Store-and-process is not feasible:
  • High-speed networks: nanoseconds to handle a packet
  • ISP router: gigabytes of headers every hour, …
• Data streaming:
  • In memory
  • Bounded resources
  • Efficient one-pass analysis
System Model
• Data stream: unbounded sequence of tuples
• Example: Call Description Record (CDR)
[Figure: tuples along a time axis]
System Model
• Operators (OP):
  • Stateless: 1 input tuple → 1 output tuple
    • Map: convert € to $
    • Filter: only calls > 10$
  • Stateful: 1+ input tuple(s) → 1 output tuple
    • Aggregate (Agg): count calls made by each caller number
• Continuous query: graph of operators and streams
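The three example operators can be sketched as a small continuous query. This is only an illustration: the CDR field names (`caller`, `price_eur`, `price_usd`), the exchange rate and the function names are assumptions, since the slides do not give a tuple schema. The aggregate here counts over a finite list; the windows that bound its state on a real stream are left out.

```python
from collections import Counter

# Hypothetical exchange rate; the slides only say "convert € to $".
EUR_TO_USD = 1.3

def map_to_usd(stream):
    """Stateless Map: each input tuple yields one output tuple in dollars."""
    for cdr in stream:
        yield {"caller": cdr["caller"], "price_usd": cdr["price_eur"] * EUR_TO_USD}

def filter_expensive(stream, threshold=10.0):
    """Stateless Filter: forward only tuples whose price exceeds the threshold."""
    for cdr in stream:
        if cdr["price_usd"] > threshold:
            yield cdr

def count_by_caller(stream):
    """Stateful Aggregate: keeps one counter per caller across tuples."""
    counts = Counter()
    for cdr in stream:
        counts[cdr["caller"]] += 1
    return dict(counts)

# Hypothetical CDRs (assumed schema: caller id and price in euros).
cdrs = [
    {"caller": "A", "price_eur": 10.0},
    {"caller": "B", "price_eur": 2.0},
    {"caller": "A", "price_eur": 20.0},
    {"caller": "C", "price_eur": 8.0},
]

# The query is a chain of operators connected by streams: Map -> Filter -> Agg.
result = count_by_caller(filter_expensive(map_to_usd(cdrs)))
```

Map and Filter look at one tuple at a time and keep nothing between tuples, while the aggregate must remember a counter per caller. That difference is what makes stateful operators harder to parallelize.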
System Model
• Infinite sequence of tuples / bounded memory → windows
• Example: 1-hour windows [8:00,9:00), [8:20,9:20), [8:40,9:40)
  • Window [8:00,9:00): the tuples at 8:05, 8:15, 8:22 and 8:45 raise the counter to 1, 2, 3 and 4
  • The tuple at 9:05 falls outside the window: output 4
  • Window [8:20,9:20): the counter is 3 (tuples at 8:22, 8:45 and 9:05)
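The window example can be replayed with a minimal sketch of a time-based sliding counter. The 20-minute advance is inferred from the window boundaries in the example; the class name, the minute-based timestamps and the eviction policy are assumptions made for illustration, not the engine's actual implementation.

```python
from collections import deque

def hhmm(text):
    """Convert 'H:MM' into minutes since midnight."""
    hours, minutes = text.split(":")
    return int(hours) * 60 + int(minutes)

class SlidingCounter:
    """Counts tuples in windows [start, start + size) that slide by `advance`.

    Assumes timestamps arrive in non-decreasing order. In this sketch an
    empty window is still reported, with a count of 0.
    """

    def __init__(self, start, size=60, advance=20):
        self.start, self.size, self.advance = start, size, advance
        self.window = deque()  # timestamps currently inside the window

    def insert(self, ts):
        """Add a tuple; return (window_start, count) for each window it closes."""
        outputs = []
        while ts >= self.start + self.size:
            # The new tuple lies beyond the window: emit the count and slide.
            outputs.append((self.start, len(self.window)))
            self.start += self.advance
            while self.window and self.window[0] < self.start:
                self.window.popleft()  # drop tuples older than the new start
        self.window.append(ts)
        return outputs

counter = SlidingCounter(start=hhmm("8:00"))
outputs = []
for t in ["8:05", "8:15", "8:22", "8:45", "9:05"]:
    outputs.extend(counter.insert(hhmm(t)))
```

After the last tuple, `outputs` holds the single result for window [8:00,9:00) and the counter's window has slid to [8:20,9:20). Only the timestamps inside the current window are kept, so memory stays bounded.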
StreamCloud - Parallelization
• Building blocks:
  • Parallelization of data streaming operators
  • Parallelization and distribution strategy
StreamCloud - Parallelization
• General approach:
  • Operators OPA and OPB are each deployed on their own subcluster: Subcluster A (Node 1 … Node m) and Subcluster B (Node 1 … Node n)
  • Every node runs an instance of its subcluster's operator together with an Input Merger (IM) and a Load Balancer (LB)
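A rough sketch of how a load balancer and an input merger could cooperate between two subclusters. The slides only name the two components, so the round-robin policy, the timestamp-ordered merge and the `(timestamp, payload)` tuple layout used here are all assumptions for illustration.

```python
import heapq
from itertools import cycle

def load_balance(stream, n_instances):
    """LB sketch (assumed round-robin policy): split one stream into
    n_instances sub-streams, one per downstream operator instance."""
    outputs = [[] for _ in range(n_instances)]
    for target, tup in zip(cycle(range(n_instances)), stream):
        outputs[target].append(tup)
    return outputs

def input_merge(streams):
    """IM sketch (assumed behavior): merge several timestamp-ordered
    sub-streams into a single timestamp-ordered stream."""
    return list(heapq.merge(*streams, key=lambda tup: tup[0]))

# Hypothetical tuples: (timestamp, payload).
stream = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")]

parts = load_balance(stream, 2)   # what two downstream instances would receive
merged = input_merge(parts)       # what one IM would rebuild from two senders
```

Round-robin is adequate only for stateless operators, where any instance can process any tuple. Stateful operators need routing that respects their semantics.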
StreamCloud - Parallelization
• Stateful operators: semantic awareness
• Aggregate: count within last hour, group-by caller number
[Figure: routing of Caller A's tuples from the LBs of the previous subcluster to the IMs of Agg1, Agg2 and Agg3]
StreamCloud - Parallelization
• Depending on the stateful operator's semantics:
  • Partition the input stream into buckets
  • Each bucket is processed by 1 node
  • # buckets >> # nodes
[Figure: keys domain split into buckets A to F, assigned to Agg1, Agg2 and Agg3]
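The bucket scheme can be sketched as a two-step lookup, key → bucket → node. The number of buckets, the CRC32 hash and the even initial assignment are assumptions chosen for illustration; the slides only state that each bucket is processed by one node and that there are many more buckets than nodes.

```python
import zlib

N_BUCKETS = 60                      # hypothetical; only "# buckets >> # nodes" is given
NODES = ["Agg1", "Agg2", "Agg3"]    # instance names taken from the figure

def bucket_of(key, n_buckets=N_BUCKETS):
    """Map a group-by key to a bucket with a stable hash (assumed: CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % n_buckets

# Assumed initial assignment: buckets spread evenly over the nodes.
bucket_to_node = {b: NODES[b % len(NODES)] for b in range(N_BUCKETS)}

def route(key):
    """Return the node that processes every tuple carrying this key."""
    return bucket_to_node[bucket_of(key)]

first = route("caller-A")
second = route("caller-A")          # same key, hence same bucket and same node
```

Because the key decides the bucket and each bucket belongs to exactly one node, all tuples of one caller reach the same aggregate instance, which keeps the per-caller count correct. Having many more buckets than nodes presumably allows load to be reassigned in small units by changing single entries of `bucket_to_node`.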