
Modeling Stream Processing Applications for Dependability Evaluation

Gabriela Jacques-Silva†‡, Zbigniew Kalbarczyk†, Bugra Gedik‡, Henrique Andrade‡, Kun-Lung Wu‡, Ravishankar K. Iyer†. †University of Illinois at Urbana-Champaign, ‡IBM Research – T. J. Watson Research Center.


Presentation Transcript


  1. Modeling Stream Processing Applications for Dependability Evaluation Gabriela Jacques-Silva†‡, Zbigniew Kalbarczyk†, Bugra Gedik‡, Henrique Andrade‡, Kun-Lung Wu‡, Ravishankar K. Iyer† †University of Illinois at Urbana-Champaign ‡IBM Research – T. J. Watson Research Center

  2. Outline • Streaming applications • Modeling a streaming application • Stream operators, stream connections, and tuples • Representation of faults and error propagation • Extending the model to include fault tolerance techniques • Evaluation

  3. Extract knowledge from live data streams on the fly. [Figure: example application graph of data streams, stream operators, and tuples computing the percentage of positive feedback (9.57%, 5.42%, 3.16%, 2.52%, 1.28%)]

  4. Different approaches to fault tolerance have different resource consumption and performance impact. • Some techniques aim at providing no-data-loss and no-data-duplication guarantees [Figure: exact results (9.57%, 5.42%, 3.16%, 2.52%, 1.28%)]

  5. Different approaches to fault tolerance have different resource consumption and performance impact. partial fault tolerance • Decreases time to achieve stable output compared to no recovery • Achieves approximate results, which are tolerable for some streaming applications [Figure: approximate results under partial fault tolerance (9.32%, 5.11%, 2.84%, 2.27%, 1.09%)]

  6. An evaluation framework helps to understand the relative merits of different techniques. • Previous approaches focus on performance evaluation • Fault injection may be expensive, especially when evaluating the application under different setups and parameters [Figure: checkpoint and partial graph replication configurations]

  7. Summary of contributions • Modeling framework for evaluating streaming applications under faults that lead to data loss and data corruption • Considers consequences of error propagation • Based on generic models specified via Stochastic Activity Networks (SAN) • Abstractions for stream operators, stream connections, and tuples • Modeled three fault tolerance techniques • Checkpointing, partial replication, and full replication

  8. Modeling framework uses the Stochastic Activity Network formalism. • SANs can express the non-deterministic behavior and parallel execution of streaming applications • Nomenclature • Place → container for a natural number of tokens • Activity → transition between places • Token → item in a place • Input gate → enforces a condition on an activity • Output gate → executes a function after an activity fires
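The SAN nomenclature above can be sketched in code. A minimal, hypothetical simulation of places, tokens, activities, and gates (names and structure are illustrative only, not the API of any SAN tool such as Möbius):

```python
# Minimal sketch of SAN primitives (hypothetical names and semantics).
class Place:
    """Container for a natural number of tokens."""
    def __init__(self, tokens=0):
        self.tokens = tokens

class Activity:
    """Transition between places: input gates enforce enabling
    conditions, output gates update state after firing."""
    def __init__(self, name, input_gates, output_gates):
        self.name = name
        self.input_gates = input_gates    # predicates over places
        self.output_gates = output_gates  # state-update functions

    def enabled(self):
        return all(gate() for gate in self.input_gates)

    def fire(self):
        for gate in self.output_gates:
            gate()

def move_token(src, dst):
    """Output-gate function: move one token from src to dst."""
    src.tokens -= 1
    dst.tokens += 1

# Example: an activity that drains p_in into p_out, one token at a time.
p_in, p_out = Place(tokens=3), Place()
move = Activity(
    "move",
    input_gates=[lambda: p_in.tokens > 0],
    output_gates=[lambda: move_token(p_in, p_out)],
)
while move.enabled():
    move.fire()
print(p_in.tokens, p_out.tokens)  # 0 3
```

In a real SAN, activities also carry stochastic firing-time distributions; this sketch keeps only the marking/gate mechanics.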

  9. Framework is based on the abstraction of three key components of a SPA. • Stream operator → state transition model • Captures arity, selectivity, and processing time [Figure: operator SAN in the "Waiting for input" state; input gate IG1 guards function F1 (e.g., int < 9)]

  10.–14. (slide builds of the operator diagram) The model is assembled incrementally: a "Processing tuple" state, the input stream connections, a "Sending output" state with output gate OG1 and an output buffer, and finally the output stream connections (OG2).

  15.–16. Framework is based on the abstraction of three key components of a SPA. • Stream connections → state sharing between output and input streams (shown as a slide build)

  17. Framework is based on the abstraction of three key components of a SPA. • Tuples → tokens flowing through input and output streams • Tuple sizes are represented, but not attribute values
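The three abstractions above (operator, connections, tuples) can be sketched together. A hypothetical toy model in which an operator with a given arity and selectivity consumes sized tuples from its input connections; names and semantics are illustrative, not the paper's actual SAN model:

```python
import random

class Tuple:
    """Tuples carry a size (modeled) but no attribute values (not modeled)."""
    def __init__(self, size, tainted=False):
        self.size = size
        self.tainted = tainted

class StreamOperator:
    def __init__(self, arity, selectivity, processing_time):
        self.arity = arity                      # number of input streams
        self.selectivity = selectivity          # fraction of inputs yielding output
        self.processing_time = processing_time  # would parameterize the SAN
                                                # activity rate; not simulated here
        self.inputs = [[] for _ in range(arity)]  # input stream connections
        self.output = []                          # output stream connection

    def step(self, rng):
        # "Waiting for input" -> "Processing tuple" -> "Sending output"
        for conn in self.inputs:
            if conn:
                t = conn.pop(0)
                if rng.random() < self.selectivity:
                    self.output.append(Tuple(t.size, t.tainted))

rng = random.Random(42)
op = StreamOperator(arity=1, selectivity=0.5, processing_time=1.0)
op.inputs[0] = [Tuple(size=64) for _ in range(1000)]
for _ in range(1000):
    op.step(rng)
print(len(op.output))  # roughly 500 for selectivity 0.5
```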

  18. Stream operator failure model considers crashes and SDCs. • Crash → data loss for partial fault tolerance techniques [Figure: output with data loss (9.32%, 5.11%, 2.84%, 2.27%, 1.09%)]

  19. Stream operator failure model considers crashes and SDCs. • Crash → data loss for partial fault tolerance techniques • Silent data corruption → corruption of attribute values [Figure: output with corrupted values (9.53%, 5.42%, 3.14%, 2.52%, 1.28%)]

  20. Base model is augmented to represent error propagation. • Once a fault occurs, operators may generate inaccurate data • Represented via tainted tuples and tainted stream connections [Figure: base operator model with input/output stream connections and "Waiting for input" / "Processing tuple" / "Sending output" states]

  21. Base model is augmented to represent error propagation. • Once a fault occurs, operators may generate inaccurate data • Represented via tainted tuples and tainted stream connections [Figure: augmented model adding a "Processing tainted tuple" state and tainted input/output stream connections]
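The taint-propagation rule described above reduces to a simple predicate. A minimal sketch, assuming (as illustration) that an output tuple is tainted when the operator's internal state is corrupted or any consumed input tuple is tainted:

```python
# Assumed taint-propagation semantics (illustrative, not the paper's
# exact gate functions): taint flows from inputs or corrupted state
# to every tuple the operator produces.
def propagate_taint(input_tuples_tainted, state_tainted):
    """Return whether the produced output tuple is tainted."""
    return state_tainted or any(input_tuples_tainted)

# Clean inputs, clean state -> clean output.
print(propagate_taint([False, False], state_tainted=False))  # False
# One tainted input is enough to taint the output.
print(propagate_taint([False, True], state_tainted=False))   # True
# Corrupted operator state taints output even with clean inputs.
print(propagate_taint([False, False], state_tainted=True))   # True
```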

  22. Stateless operators do not generate tainted tuples after crash and restore. [Figure: filter F1 (int < 9) with and without a crash; tuples arriving during the crash are lost] • Once the operator recovers, the data is accurate

  23. Stateful operators generate tainted tuples after crash and restore. [Figure: stateful operator F1 with and without a crash; outputs after restore are tainted] • After recovery, the operator produces tainted tuples until its internal state refreshes

  24. Stateful operators generate tainted data upon crash of any operator in the upstream set. [Figure: no-crash case; the upstream filter (int < 9) changes the downstream operator's internal state]

  25. Stateful operators generate tainted data upon crash of any operator in the upstream set. [Figure: crash of the upstream filter (int < 9); the downstream operator's internal state is unchanged] • After the crashed operator recovers, the downstream operator produces tainted tuples until its internal state refreshes
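The stateless/stateful contrast in slides 22–25 can be illustrated with a toy windowed aggregate. A hypothetical sketch, assuming a crash clears the operator's window so its output stays tainted until the window refills with post-crash tuples:

```python
from collections import deque

class WindowedSum:
    """Toy stateful operator: sums the last `window` values."""
    def __init__(self, window):
        self.buf = deque(maxlen=window)
        self.lost_state = 0  # pre-crash items still missing from the window

    def crash_and_restart(self):
        # Crash wipes internal state; remember how much was lost.
        self.lost_state = len(self.buf)
        self.buf.clear()

    def process(self, value):
        self.buf.append(value)
        if self.lost_state > 0:
            self.lost_state -= 1
        # Output is tainted until every lost item has been replaced
        # by a fresh post-crash tuple.
        return sum(self.buf), self.lost_state > 0

agg = WindowedSum(window=3)
for v in (1, 2, 3):
    agg.process(v)
agg.crash_and_restart()
results = [agg.process(v) for v in (4, 5, 6, 7)]
print([tainted for _, tainted in results])  # [True, True, False, False]
```

A stateless filter, by contrast, has no window to refill: tuples in flight during the crash are lost, but every tuple processed after restart is accurate immediately.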

  26. Checkpoint of Operator State • Model is parameterized to capture how long it takes to produce good results after a failure [Figure: stateful operator with and without a crash, restored from a checkpoint] G. Jacques-Silva et al. "Language Level Checkpointing Support for Stream Processing Applications". DSN 2009.
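The checkpointing technique above can be sketched as periodic state snapshots plus restore-on-crash. A toy illustration (not the DSN 2009 implementation): the tainted period after a crash covers only the tuples processed since the last checkpoint.

```python
import copy

class CheckpointedOperator:
    """Toy stateful operator with snapshot/restore of internal state."""
    def __init__(self):
        self.state = {"count": 0, "total": 0}
        self.snapshot = copy.deepcopy(self.state)

    def process(self, value):
        self.state["count"] += 1
        self.state["total"] += value

    def checkpoint(self):
        self.snapshot = copy.deepcopy(self.state)

    def crash_and_restore(self):
        # Crash loses in-memory state; restore from the last snapshot.
        self.state = copy.deepcopy(self.snapshot)

op = CheckpointedOperator()
for v in (1, 2, 3):
    op.process(v)
op.checkpoint()
for v in (4, 5):
    op.process(v)      # these updates happened after the checkpoint
op.crash_and_restore() # ...so they are lost in the crash
print(op.state)        # {'count': 3, 'total': 6}
```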

  27. Partial Graph Replication • Replicated operators and stream connections in the composed application model • Extra logic in replicated operators to perform replica failover [Figure: active replica op1,A fails over to backup op1,B; the backup is deactivated while the active replica is healthy] G. Jacques-Silva et al. "Language Level Checkpointing Support for Stream Processing Applications". DSN 2009.

  28. Full graph replication • Extra logic for operators to perform de-duplication of tuples coming from redundant streams • Aims at no tuple loss and no duplicate delivery J.-H. Hwang et al. "Fast and highly-available stream processing over wide area networks". ICDE 2008.
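The de-duplication logic above can be sketched under one common assumption (illustrative, not necessarily the ICDE 2008 scheme): each tuple carries a unique sequence number, and the consumer delivers each sequence number at most once, from whichever replica produces it first.

```python
def deduplicate(streams):
    """Merge redundant streams, delivering each sequence number once."""
    seen = set()
    delivered = []
    for stream in streams:
        for seq, payload in stream:
            if seq not in seen:
                seen.add(seq)
                delivered.append((seq, payload))
    return sorted(delivered)

# Replica A crashed before producing tuple 4; replica B covers the gap.
replica_a = [(1, "a"), (2, "b"), (3, "c")]
replica_b = [(2, "b"), (3, "c"), (4, "d")]
print(deduplicate([replica_a, replica_b]))
# [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]
```

Note how an SDC undermines this: if corruption changes a tuple's identity, the two replicas' copies no longer match and the duplicate slips through, which is the effect slide 36 measures.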

  29. Checkpoint vs. Partial Replication Under Crashes • Target → Bargain Discovery application • Stateless operators: source, sink, 4 filters • Stateful operators: aggregate and join • Operator MTTF: 30, 50, 70, and 90 min • Model parameters taken from the application executing in IBM System S [Figure: application graph under checkpointing, and with a partially replicated segment (replicas 1 and 2) plus checkpointing]

  30. Evaluation Metrics • Availability → all operators are alive and not producing tainted data • Total number of tainted tuples → total number of tainted tuples stored by the sink operator • Percentage of tainted tuples → fraction of tainted tuples stored by the sink over the total number of tuples produced by the golden run
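The metrics above reduce to simple ratios. A sketch with illustrative numbers (not results from the paper):

```python
def availability(up_and_untainted_time, total_time):
    """Fraction of time all operators are alive and not producing
    tainted data."""
    return up_and_untainted_time / total_time

def percent_tainted(tainted_at_sink, golden_run_total):
    """Tainted tuples stored by the sink, as a percentage of the
    total tuples produced by the golden (fault-free) run."""
    return 100.0 * tainted_at_sink / golden_run_total

# Hypothetical run: 54 of 60 minutes fully healthy, 25 tainted tuples
# at the sink out of 1000 golden-run tuples.
print(availability(54.0, 60.0))   # 0.9
print(percent_tainted(25, 1000))  # 2.5
```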

  31. Partial replication provides better availability than checkpoint.

  32. Partial replication produces fewer tainted tuples than checkpoint.

  33. Impact of SDC on Full Replication Technique • Target → Bargain Discovery application • Operator MTTF: 120 min [Figure: fully replicated application graph (replicas 1 and 2)]

  34. Impact of SDC on application availability is small (operator MTTF of 120 min).

  35. Percentage of tainted tuples is small when compared to the golden run (operator MTTF of 120 min).

  36. SDC breaks the non-duplication guarantee promised by the full replication technique: tainted tuples + non-tainted tuples > non-tainted tuples of the golden run + confidence interval (operator MTTF of 120 min)

  37. Summary • Modeling framework to evaluate the dependability provided by different techniques • Assemble applications by composing stream operators, stream connections, and tuples • Demonstrated the framework with three fault tolerance techniques • Validated by comparing model results against real fault injections on the application executing in IBM System S • Future work • Automatic model composition based on application source code and physical deployment

  38. Modeling Stream Processing Applications for Dependability Evaluation Gabriela Jacques-Silva†‡, Zbigniew Kalbarczyk†, Bugra Gedik‡, Henrique Andrade‡, Kun-Lung Wu‡, Ravishankar K. Iyer† †University of Illinois at Urbana-Champaign ‡IBM Research – T. J. Watson Research Center
