Explore the importance of data stream processing on multicore architectures, focusing on scalability and common design aspects of modern DSP systems like Twitter Heron and Apache Storm. Discover challenges and key findings to optimize performance.
Scaling Data Stream Processing Systems on Multicore Architectures Shuhao Zhang Shuhao.zhang@comp.nus.edu.sg
Importance of Data Stream Processing • Data stream processing (DSP) has attracted much attention for real-time analysis applications. • Many DSP systems have been proposed recently, e.g., Twitter Heron, Apache Storm, and Apache Flink.
DSP Systems on Modern Hardware • DSP systems are mostly built for scale-out. • Multicore architectures are an attractive platform for DSP systems. • However, fully exploiting their computation power can be challenging. • Related work addresses parts of the problem, e.g., Saber (SIGMOD'16, CPU+GPU) and StreamBox (ATC'17, out-of-order arrival); this thesis focuses on multicore-awareness.
Agenda • Revisiting the Design of Data Stream Processing Systems on Multi-Core Processors, ICDE'17 • Shuhao Zhang, Bingsheng He, Daniel Dahlmeier, Amelie Chi Zhou, Thomas Heinze
Common Designs of Recent DSP Systems • Existing systems mainly focus on scaling out using a cluster of commodity machines. • Three common design aspects: • Pipelined processing with message passing • On-demand data parallelism • JVM-based implementation • It turns out that the JVM (e.g., GC) introduces only a minor overhead during stream processing.
Design Aspect 1: Pipelined Processing with Message Passing • Word-count application pipeline: Data Source → Split → Count → Sink, with input tuples passed along the pipeline. • Aim: achieve low processing latency.
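To make the pipelined design concrete, here is a minimal, self-contained Java sketch (not the actual Storm/Flink implementation): each operator runs on its own thread and forwards tuples to its successor through a bounded queue, standing in for the systems' internal message queues.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Minimal sketch of pipelined processing with message passing:
// Data Source -> Split -> Count/Sink, one thread per operator.
public class PipelinedWordCount {
    static final String POISON = "__END__";                    // shutdown marker

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> sentences = new ArrayBlockingQueue<>(1024);
        BlockingQueue<String> words = new ArrayBlockingQueue<>(1024);

        Thread source = new Thread(() -> {                     // Data Source: emit input tuples
            try {
                for (String s : new String[]{"a b a", "b c"}) sentences.put(s);
                sentences.put(POISON);
            } catch (InterruptedException ignored) { }
        });

        Thread split = new Thread(() -> {                      // Split: sentence -> words
            try {
                while (true) {
                    String s = sentences.take();
                    if (s.equals(POISON)) break;
                    for (String w : s.split(" ")) words.put(w);
                }
                words.put(POISON);
            } catch (InterruptedException ignored) { }
        });

        Thread count = new Thread(() -> {                      // Count + Sink: aggregate, then print
            Map<String, Integer> counts = new HashMap<>();
            try {
                while (true) {
                    String w = words.take();
                    if (w.equals(POISON)) break;
                    counts.merge(w, 1, Integer::sum);
                }
            } catch (InterruptedException ignored) { }
            System.out.println(counts);                        // {a=2, b=2, c=1}
        });

        source.start(); split.start(); count.start();
        source.join();  split.join();  count.join();
    }
}
```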
Design Aspect 2: On-demand Data Parallelism • Modern DSP systems such as Storm and Flink are also designed to support data parallelism: each operator of the word-count pipeline (Data Source, Split, Count, Sink) can run as multiple parallel instances. • Aim: achieve high throughput.
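The sketch below (again plain Java, not a system's real API) illustrates the on-demand part: the Count operator is replicated to a chosen degree of parallelism, and tuples are routed by key, in the spirit of Storm's fields grouping, so that each replica owns a disjoint slice of the word-count state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of on-demand data parallelism: N Count replicas, key-based routing.
public class ParallelCount {
    public static void main(String[] args) throws InterruptedException {
        final int parallelism = 4;                             // requested on demand
        List<BlockingQueue<String>> inputs = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) inputs.add(new ArrayBlockingQueue<>(1024));

        // One Count replica per queue, each with its own private state.
        List<Thread> replicas = new ArrayList<>();
        for (BlockingQueue<String> in : inputs) {
            Thread t = new Thread(() -> {
                Map<String, Integer> counts = new HashMap<>();
                try {
                    while (true) {
                        String w = in.take();
                        if (w.equals("__END__")) break;
                        counts.merge(w, 1, Integer::sum);
                    }
                } catch (InterruptedException ignored) { }
                System.out.println(Thread.currentThread().getName() + " " + counts);
            });
            replicas.add(t);
            t.start();
        }

        // Upstream Split instance: route each word to the replica owning its key.
        for (String w : "a b a b c d".split(" "))
            inputs.get(Math.floorMod(w.hashCode(), parallelism)).put(w);
        for (BlockingQueue<String> in : inputs) in.put("__END__");
        for (Thread t : replicas) t.join();
    }
}
```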
Can DSP Systems Perform Well on a Scale-up Architecture? • Scale-up: a single large machine with 100s or 1000s of cores (image credit: http://www.tweaktown.com/news/41273/sgi-demonstrates-30-million-iops-beast-with-intel-p3700-s-at-sc14/index.html). • Compared with: a cluster of commodity machines.
Scale-up Architecture is Complex • Non-uniform memory access (NUMA) brings performance issues: in the pictured example, each socket (8 cores with 128 GB of local DRAM) reaches its local memory at 51.2 GB/s but remote memory only through a 16 GB/s interconnect. • The memory subsystem and execution pipelines are complex (front end: instruction fetch units, instruction length decoder, instruction queue, decoders, instruction decode queue; back end: renamer, scheduler, execution core, retirement; caches: L1-I/L1-D, ITLB/DTLB, L2, LLC). • There is a lack of detailed studies profiling the aforementioned common design aspects of DSP systems on scale-up architectures.
Benchmark Design • There has been no standard benchmark for DSP systems, especially on scale-up architectures. • We design our benchmark according to the four criteria proposed by Jim Gray [1]. [1] J. Gray, Benchmark Handbook: For Database and Transaction Processing Systems. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1992.
Scalability on Varying Number of Cores/Sockets • Figure: throughput of (a) Storm and (b) Flink when scaling from two to four sockets. • Both systems scale well on a single socket but poorly across multiple sockets: the added overhead outweighs the benefit of the additional resources.
Are there any Problems when Running on a Single Socket? • Figure: processor stalls for (a) Storm and (b) Flink. • About 70% of the execution time is spent in processor stalls.
Are there any Problems when Running on a Single Socket? (cont.) • Figure: front-end stalls for (a) Storm and (b) Flink. • Front-end stalls are a major bottleneck.
Instruction Footprint • Cache sizes for reference: L1-ICache 32 KB, L2 cache 256 KB. • (i) The common range of the instruction footprints is between 1 KB and 10 MB and between 1 KB and 1 MB for the two systems. (ii) 30~50% and 20~40% of the instruction footprints, respectively, are larger than the L1-ICache. • This is mainly caused by the pipelined-processing design.
Are there any Problems when Running on Multiple Sockets? • Operators may be scheduled on different CPU sockets. • Table: LLC miss stalls when running Storm with four CPU sockets. • Up to 24% of the total execution time is wasted on remote memory accesses. • This is mainly caused by the message-passing design.
Key Findings (recap) • Unmanaged massive pipelined processing leads to a large instruction footprint between two consecutive invocations of the same function, and hence significant L1-ICache misses. • The NUMA-oblivious message-passing design causes further performance degradation due to significant remote memory access overhead.
Agenda • BriskStream: Scaling Data Stream Processing on Shared-Memory Multicore Architectures, SIGMOD'19 • Shuhao Zhang*, Jiong He, Amelie Chi Zhou, Bingsheng He • *Work done while a research trainee at SAP Singapore.
Outline • Motivation • Performance Model • Algorithm Design • Experimental Results
NUMA Servers • (a) Server A: HUAWEI KunLun, cores (without HT) @ 1.2 GHz. • (b) Server B: HP ProLiant DL980, cores (without HT) @ 2.27 GHz. • The two servers have different NUMA topologies.
Word Count (WC) as an Example • Each operator can be scaled to multiple instances (called replicas). • Each replica can be independently scheduled. • Figures: (a) logical view of WC; (b) one example execution plan of WC using three CPU sockets. • Given a NUMA machine with limited hardware resources, what is the optimal deployment plan? This talk focuses on placement optimization.
Zoom into the System Design • Figure: a producer (e.g., Splitter) passes a tuple reference through a queue to a consumer (e.g., Counter); the consumer then fetches the tuple value from memory, which may reside on a different socket (e.g., socket 0 vs. socket x). • The relative location of producer and consumer therefore affects the processing behavior of the consumer. • Each operator (except the Spout) is a consumer of its upstream operator.
Outline • Motivation • Performance Model • Algorithm Design • Experimental Results
The Performance Model • The modeled topology is a chain of bolts ending in a Sink. • Input rate of an operator: depends on the output rates of its upstream operators. • Output rate of an operator: depends on its processing speed and on the output rates of its upstream operators. • The model estimates the application throughput, which is the Sink's input rate.
Estimating the Output Rate of an Operator • The output rate can be estimated as the number of tuples processed divided by the time needed to process them. • Consider an arbitrary observation time t: let N be the total aggregated input tuples arriving during t, and T the aggregated time spent processing those tuples (assuming the operator keeps up). • The average time spent handling each tuple under a given execution plan is denoted w̄ below, so that T = N · w̄.
Estimating the Time Spent on Each Tuple • The per-tuple handling time has two parts: the actual function execution time, assuming the operator already has its input data, and the time required to fetch (locally or remotely) the input data from its producers, estimated as sketched below. • The fetching time varies under different execution plans, which is why the per-tuple handling time, and hence the estimated rates, vary under different plans.
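A hedged reconstruction of the estimation in formulas; the symbols N, t, T, and w̄ are names introduced here because the original symbols were lost in this transcript.

```latex
% N       : input tuples arriving during an observation time t
% T       : aggregated time spent processing those N tuples (assume T <= t)
% \bar{w} : average time spent handling one tuple under the execution plan
\begin{align*}
  \bar{w} &= T_{\mathrm{execute}} + T_{\mathrm{fetch}}, \qquad T = N \cdot \bar{w} \\
  r_o &\approx \frac{N}{T} = \frac{1}{\bar{w}}
\end{align*}
% T_fetch (and hence \bar{w} and the estimated output rate r_o) depends on
% whether the producers' output buffers sit on the same or a remote NUMA
% socket, i.e., on the execution plan.
```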
Outline • Motivation • Performance Model • Algorithm Design • Experimental Results
NUMA-aware Placement is Tricky • Stochasticity is introduced into the problem: the objective value (e.g., throughput) and the weight (e.g., resource demand) of each operator are no longer constant. • Placement decisions may conflict with each other, and ordering is introduced into the problem. • We apply a branch-and-bound technique.
Algorithm Running Example • Allocate four operators onto two sockets (S0 and S1), giving 2^4 = 16 candidate assignments. • Three of the operators cannot be allocated on the same socket due to the resource constraint.
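To make the branch-and-bound idea concrete, here is a minimal, self-contained Java sketch of the running example (four operators, two sockets, a per-socket resource constraint). All numbers and the simplified constant throughput estimates are illustrative assumptions; in BriskStream itself the objective of a partial plan is not constant, which is exactly what the bounding step has to cope with.

```java
import java.util.Arrays;

// Branch-and-bound sketch: place 4 operator replicas onto 2 sockets,
// maximising an estimated throughput under a resource constraint.
public class PlacementBnB {
    static final double[] demand = {2, 2, 2, 1};      // resource demand per operator
    static final double[] capacity = {4, 4};          // capacity of sockets S0, S1
    // est[i][s]: illustrative throughput contribution of operator i on socket s.
    static final double[][] est = {{5, 4}, {5, 4}, {3, 5}, {2, 2}};

    static double best = Double.NEGATIVE_INFINITY;
    static int[] bestPlan;

    public static void main(String[] args) {
        search(new int[demand.length], 0, new double[capacity.length], 0.0);
        System.out.println("best estimate = " + best
                + ", plan (operator -> socket) = " + Arrays.toString(bestPlan));
    }

    static void search(int[] plan, int op, double[] used, double value) {
        if (op == plan.length) {                       // all operators placed
            if (value > best) { best = value; bestPlan = plan.clone(); }
            return;
        }
        // Optimistic bound: assume every remaining operator gets its best socket.
        double bound = value;
        for (int i = op; i < plan.length; i++) {
            double bestForI = est[i][0];
            for (int s = 1; s < capacity.length; s++) bestForI = Math.max(bestForI, est[i][s]);
            bound += bestForI;
        }
        if (bound <= best) return;                     // prune this branch

        for (int s = 0; s < capacity.length; s++) {    // branch on the socket choice
            if (used[s] + demand[op] > capacity[s]) continue;  // resource constraint
            plan[op] = s;
            used[s] += demand[op];
            search(plan, op + 1, used, value + est[op][s]);
            used[s] -= demand[op];                     // backtrack
        }
    }
}
```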
Outline • Motivation • Performance Model • Algorithm Design • Experimental Results (on a machine with 144 cores, without HT)
Experimental Evaluation • Applications: WC (word count); FD (fault detection); SD (spike detection); LR (Linear Road benchmark). • Result (figure): BriskStream achieves much higher throughput.
Evaluation of Scalability (Linear Road Benchmark) • (1) BriskStream scales much better, up to 144 cores. • (2) It is still unable to scale up linearly; stream compression [TerseCades, ATC'18] is a possible remedy to explore.
Recap • BriskStream scales stream computation towards hundreds of cores under the NUMA effect, even without a tedious tuning process. • We demonstrated that relative-location awareness (equivalently, awareness of the varying processing capability) is the key to addressing the NUMA effect when optimizing stream computation on modern multicore architectures.
Agenda • Scaling Consistent Stateful Stream Processing on Shared-Memory Multicore Architectures • Shuhao Zhang, Yingjun Wu, Feng Zhang, Bingsheng He
Outline • Motivation • Our System Designs • Experimental Results
Data Stream Processing Systems • Recent efforts have demonstrated ultra-fast stream processing on large-scale multicore architectures. • However, a potential weakness is the inadequate support for consistent stateful stream processing.
Linear Road Benchmark as an Example • Road Speed and Vehicle Cnt maintain and update the road congestion status. • Toll Notification computes the toll of each vehicle, which depends on the road congestion status. • The road congestion status is application state shared among streaming operators.
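The shared state can be pictured with the following sketch; the class, method names, thresholds, and toll values are illustrative placeholders, not the benchmark implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the shared application state in Linear Road:
// one side maintains per-segment congestion status, the other reads it
// for every toll computation.
public class CongestionState {
    // segment id -> {average speed, vehicle count}
    static final Map<Integer, double[]> congestion = new ConcurrentHashMap<>();

    // Road Speed / Vehicle Cnt side: update congestion status per position report.
    static void updateSegment(int segment, double avgSpeed, int vehicleCount) {
        congestion.put(segment, new double[]{avgSpeed, vehicleCount});
    }

    // Toll Notification side: the toll of a vehicle depends on the shared status.
    static double tollFor(int segment) {
        double[] s = congestion.getOrDefault(segment, new double[]{Double.MAX_VALUE, 0});
        boolean congested = s[0] < 40 && s[1] > 50;     // made-up thresholds
        return congested ? 2.0 : 0.0;                   // made-up toll values
    }
}
```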
The Current System Design • Recall the common design aspects: pipelined processing with message passing, and on-demand data parallelism. • (1) Concurrency control: key-based stream partitioning, or locks.
Key-based Stream Partitioning • Figure: tuples are partitioned by key, e.g., road AYE to instance 1 and road PIE to instance 2, so the instances' state accesses do not conflict. • However, key-based partitioning cannot handle the general case; as the figure notes, an extra penalty (copying state) is paid, for example when the AYE partition is empty.
Lock-based State Sharing • Figure: instance 1 and instance 2 both access the shared AYE and PIE state under a lock. • This yields poor performance.
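A minimal Java sketch of the lock-based alternative, using hypothetical names and a made-up toll rule: every event processed by any instance must acquire the same lock before touching the shared road status, which largely serializes the instances.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of lock-based state sharing: all instances can access the status of
// any road (e.g., AYE and PIE), but every state access grabs the same lock.
public class LockBasedSharing {
    private final Map<String, Double> roadStatus = new HashMap<>();  // shared by all instances
    private final Object lock = new Object();

    // Called concurrently by multiple operator instances, once per event.
    double process(String road, double reportedSpeed) {
        synchronized (lock) {                          // coarse-grained lock per event
            roadStatus.merge(road, reportedSpeed, (old, v) -> (old + v) / 2);
            return roadStatus.get(road) < 40 ? 2.0 : 0.0;   // made-up toll rule
        }
    }
}
```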
The Current System Design (recap) • Pipelined processing with message passing; on-demand data parallelism. • (1) Concurrency control: key-based stream partitioning (cannot handle the general case) or lock-based sharing (poor performance). • (2) Access ordering: buffer and sort, which performs poorly.
Consistent Stateful Stream Processing • Maintain shared-state consistency by employing transactional semantics. • State transaction: the set of state accesses triggered by processing a single input event at one executor. • Consistency property: the state transaction schedule must be conflict equivalent to …
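The state-transaction notion from the slide can be sketched as a simple container of the state accesses triggered by one input event; the class shape below is an illustration, not the system's actual API, and the consistency criterion itself is left elided as on the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: the set of state accesses triggered by processing one input event
// at one executor. A scheduler would later execute these so that the overall
// schedule stays conflict equivalent to the required order.
public class StateTransaction {
    public enum AccessType { READ, WRITE }

    public static final class StateAccess {
        final String key; final AccessType type; final double value;
        StateAccess(String key, AccessType type, double value) {
            this.key = key; this.type = type; this.value = value;
        }
    }

    private final long eventTimestamp;                 // the order the schedule must respect
    private final List<StateAccess> accesses = new ArrayList<>();

    public StateTransaction(long eventTimestamp) { this.eventTimestamp = eventTimestamp; }
    public void read(String key)                 { accesses.add(new StateAccess(key, AccessType.READ, 0)); }
    public void write(String key, double value)  { accesses.add(new StateAccess(key, AccessType.WRITE, value)); }
    public List<StateAccess> accesses()          { return accesses; }
    public long timestamp()                      { return eventTimestamp; }
}
```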
Existing Solutions Revisited • Limited parallelism opportunities. • Can we explore more parallelism opportunities? • Large synchronization overhead for processing every event. • Can we reduce such overhead?
Outline • Motivation • Our System Designs • Experimental Results
System Overview • To explore more parallelism opportunities: a three-step procedure abstraction and punctuation-signal slicing scheduling. • To reduce state-access synchronization overhead: fine-grained parallel state access. • Result: up to 6.8 times higher throughput with similar or even lower processing latency.
Three-step Procedure • Process step • State access step • Post-process step
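One way to picture the abstraction is the interface sketch below; the names are illustrative, not the system's API. The point is that splitting event handling into these three steps lets the state-access step be decoupled from the pre/post computation and scheduled separately (e.g., in batches), which is what opens up the extra parallelism described above.

```java
// Illustrative sketch of the three-step procedure abstraction.
public interface ThreeStepOperator<E> {
    // Step 1 (process): pure computation on the input event; declares which
    // shared state it will need, e.g., as a StateTransaction-like request.
    Object process(E event);

    // Step 2 (state access): the requested reads/writes are executed against
    // shared state, possibly reordered or parallelised as long as the overall
    // schedule stays consistent.
    Object accessState(Object stateRequest);

    // Step 3 (post-process): finish the event using the state-access results,
    // e.g., emit the output tuple downstream.
    void postProcess(E event, Object stateResult);
}
```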