On-the-Fly Pipeline Parallelism
I-Ting Angelina Lee*, Charles E. Leiserson*, Tao B. Schardl*, Jim Sukha†, and Zhunping Zhang*
SPAA, July 24, 2013
MIT CSAIL*, Intel Corporation†
Dedup PARSEC Benchmark [BKS08]

Dedup compresses a stream of data by compressing unique elements and removing duplicates.

  int fd_out = open_output_file();
  bool done = false;
  while (!done) {
    chunk_t *chunk = get_next_chunk();
    if (chunk == NULL) {
      done = true;
    } else {
      chunk->is_dup = deduplicate(chunk);
      if (!chunk->is_dup)
        compress(chunk);
      write_to_file(fd_out, chunk);
    }
  }

Stage 0: While there is more data, read the next chunk from the stream.
Stage 1: Check for duplicates.
Stage 2: Compress a first-seen chunk.
Stage 3: Write to the output file.
Parallelism in Dedup

Let's model Dedup's execution as a pipeline dag:
• A node denotes the execution of a stage in an iteration.
• Edges denote dependencies between nodes.

[Pipeline dag for the loop on the previous slide: one column of nodes per iteration i0, i1, i2, ...; one row per stage 0–3; "cross edges" connect the same serial stage in consecutive iterations.]

Dedup exhibits pipeline parallelism.
Pipeline Parallelism

We can measure parallelism in terms of work and span [CLRS09].

[Example dag whose nodes have weight 1 or weight 8.]

Work T1: the sum of the weights of the nodes in the dag.  Here T1 = 75.
Span T∞: the total weight of a longest path in the dag.  Here T∞ = 20.
Parallelism T1/T∞: the maximum possible speedup.  Here T1/T∞ = 3.75.
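To make these definitions concrete, here is a minimal C sketch (an illustrative addition, not from the original deck) that computes work and span for a weighted dag given in topological order; the node_t representation and the helper names are assumptions made for this example.

  #include <stdio.h>

  #define MAX_NODES 64
  #define MAX_PREDS 4

  // A dag node: its weight and the indices of its predecessors.
  typedef struct {
      int weight;
      int npreds;
      int preds[MAX_PREDS];
  } node_t;

  // Compute work (sum of weights) and span (weight of a longest path).
  // Nodes must be listed in topological order: every predecessor index
  // is smaller than the node's own index.
  void work_and_span(const node_t *dag, int n, long *work, long *span) {
      long path[MAX_NODES];  // path[v] = weight of a longest path ending at v
      *work = 0;
      *span = 0;
      for (int v = 0; v < n; v++) {
          long best = 0;
          for (int p = 0; p < dag[v].npreds; p++)
              if (path[dag[v].preds[p]] > best)
                  best = path[dag[v].preds[p]];
          path[v] = best + dag[v].weight;
          *work += dag[v].weight;
          if (path[v] > *span)
              *span = path[v];
      }
  }

  int main(void) {
      // A tiny 2-iteration, 2-stage pipeline: edges 0->1, 0->2, 1->3, 2->3.
      node_t dag[] = {
          { .weight = 1, .npreds = 0 },
          { .weight = 8, .npreds = 1, .preds = {0} },
          { .weight = 1, .npreds = 1, .preds = {0} },
          { .weight = 8, .npreds = 2, .preds = {1, 2} },
      };
      long work, span;
      work_and_span(dag, 4, &work, &span);
      printf("T1 = %ld, Tinf = %ld\n", work, span);  // T1 = 18, Tinf = 17
      return 0;
  }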
Executing a Parallel Pipeline

To execute Dedup in parallel, we must answer two questions:
• How do we encode the parallelism in Dedup?
• How do we assign work to parallel processors to execute this computation efficiently?

[Same Dedup code and pipeline dag as on the previous slides.]
Construct-and-Run Pipelining

A construct-and-run pipeline specifies the stages and their dependencies a priori, before execution.
Ex: TBB [MRR12], StreamIt [GTA06], GRAMPS [SLY+11]

  tbb::pipeline pipeline;
  GetChunk_Filter    filter1(SERIAL, item);
  Deduplicate_Filter filter2(SERIAL);
  Compress_Filter    filter3(PARALLEL);
  WriteToFile_Filter filter4(SERIAL, out_item);
  pipeline.add_filter(filter1);
  pipeline.add_filter(filter2);
  pipeline.add_filter(filter3);
  pipeline.add_filter(filter4);
  pipeline.run(pipeline_depth);
On-the-Fly Pipelining of X264

[Figure: a stream of video frames, I P P P I P P I P P P, with dependencies running between frames.]

An on-the-fly pipeline is constructed dynamically as the program executes.
Not easily expressible using TBB's pipeline construct [RCJ11].
On-the-Fly Pipeline Parallelism in Cilk-P

We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system, named Cilk-P, which features:
• simple linguistics for specifying on-the-fly pipeline parallelism that are composable with Cilk's existing fork-join primitives; and
• PIPER, a theoretically sound randomized work-stealing scheduler that handles both pipeline and fork-join parallelism.

We hand-compiled 3 applications with pipeline parallelism (ferret, dedup, and x264 from PARSEC [BKS08]) to run on Cilk-P. Empirical results indicate that Cilk-P exhibits low serial overhead and good scalability.
Outline

• On-the-Fly Pipeline Parallelism
• The Pipeline Linguistics in Cilk-P
• The PIPER Scheduler
• Empirical Evaluation
• Concluding Remarks
The Pipeline Linguistics in Cilk-P

  int fd_out = open_output_file();
  bool done = false;
  while (!done) {
    chunk_t *chunk = get_next_chunk();
    if (chunk == NULL) {
      done = true;
    } else {
      chunk->is_dup = deduplicate(chunk);
      if (!chunk->is_dup)
        compress(chunk);
      write_to_file(fd_out, chunk);
    }
  }
The Pipeline Linguistics in Cilk-P

Loop iterations may execute in parallel in a pipelined fashion, where stage 0 executes serially.

  int fd_out = open_output_file();
  bool done = false;
  pipe_while (!done) {
    chunk_t *chunk = get_next_chunk();
    if (chunk == NULL) {
      done = true;
    } else {
      pipe_wait(1);      // end the current stage, advance to stage 1, and
                         // wait for the previous iteration to finish stage 1
      chunk->is_dup = deduplicate(chunk);
      pipe_continue(2);  // end the current stage and advance to stage 2
      if (!chunk->is_dup)
        compress(chunk);
      pipe_wait(3);
      write_to_file(fd_out, chunk);
    }
  }
The Pipeline Linguistics in Cilk-P

These keywords denote the logical parallelism of the computation:
• The pipe_while enforces that stage 0 executes serially.
• The pipe_wait(1) enforces cross edges across stage 1.
• The pipe_wait(3) enforces cross edges across stage 3.

[Pipeline dag as before, with the cross edges of stages 0, 1, and 3 highlighted.]
The Pipeline Linguistics in Cilk-P

These keywords have serial semantics: when they are elided or replaced with their serial counterparts, a legal serial program results, whose semantics is one of the legal interpretations of the parallel code [FLR98].
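To make serial elision concrete (an illustrative sketch, not from the original deck), one can imagine defining the pipeline keywords away, in the spirit of Cilk's own serial elision; the macro definitions below are an assumption about how that would look:

  // Serial elision of the Cilk-P pipeline keywords (sketch).
  // With these definitions, the pipe_while loop above compiles
  // into the ordinary serial while loop it was derived from.
  #define pipe_while(cond)  while (cond)
  #define pipe_wait(s)      /* no-op: stages run back-to-back serially */
  #define pipe_continue(s)  /* no-op */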
On-the-Fly Pipelining of X264

The program controls the execution of pipe_wait and pipe_continue statements, thus supporting on-the-fly pipeline parallelism. Program control can thus:
• skip stages;
• make cross edges data dependent; and
• vary the number of stages across iterations.

We can pipeline the x264 video encoder using Cilk-P (see the sketch below).
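Here is a minimal sketch of what such a data-dependent pipeline might look like in Cilk-P (my illustration; frame_t, get_next_frame(), is_i_frame(), previous_frame(), and encode_frame() are hypothetical helpers, not the deck's actual x264 code):

  frame_t *frame;
  pipe_while ((frame = get_next_frame()) != NULL) {
    if (is_i_frame(frame)) {
      // An I-frame is encoded without reference to earlier frames,
      // so this iteration skips the cross-edge wait entirely.
      encode_frame(frame, /*reference=*/NULL);
    } else {
      // A P-frame needs an earlier frame as a reference, so the cross
      // edge is data dependent: wait for the previous iteration to
      // finish stage 1 before encoding.
      pipe_wait(1);
      encode_frame(frame, previous_frame(frame));
    }
  }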
Pipelining X264 with Pthreads

The scheduling logic is embedded in the application code: the main control thread creates a worker thread per frame with pthread_create and joins them with pthread_join.
Pipelining X264 with Pthreads

The scheduling logic is embedded in the application code.

Signaling thread:
  pthread_mutex_lock(&mutex);
  /* update my_var */
  pthread_cond_broadcast(&cond);
  pthread_mutex_unlock(&mutex);

Waiting thread:
  pthread_mutex_lock(&mutex);
  while (my_var < value)
    pthread_cond_wait(&cond, &mutex);
  pthread_mutex_unlock(&mutex);
Pipelining X264 with Pthreads

The cross-edge dependencies are enforced via data synchronization, using locks and condition variables.
X264 Performance Comparison

[Plot: speedup over serial execution vs. number of processors P.]

Cilk-P achieves performance comparable to Pthreads on x264 without explicit data synchronization.
Outline

• On-the-Fly Pipeline Overview
• The Pipeline Linguistics in Cilk-P
• The PIPER Scheduler
  • A Work-Stealing Scheduler
  • Handling Runaway Pipelines
  • Avoiding Synchronization Overhead
• Concluding Remarks
Guarantees of a Standard Work-Stealing Scheduler [BL99, ABP01]

Definitions:
  TP — execution time on P processors
  T1 — work        T∞ — span        T1/T∞ — parallelism
  SP — stack space on P processors
  S1 — stack space of a serial execution

Given a computation dag with fork-join parallelism, it achieves:
• Time bound: TP ≤ T1/P + O(T∞ + lg P) expected time,
  i.e., linear speedup when P ≪ T1/T∞.
• Space bound: SP ≤ P·S1.

The Work-First Principle [FLR98]: minimize the scheduling overhead borne by the work path (T1) and amortize it against the steal path (T∞).
A Work-Stealing Scheduler (Based on [BL99, ABP01])

[Animation over the pipeline dag: workers P execute nodes across iterations i0–i5 and stages 0–3; nodes are marked done, not done, ready, or executing as the steps play out.]

Each worker maintains its own set of ready nodes. If executing a node enables:
• two nodes: mark one ready and execute the other one;
• one node: execute the enabled node;
• zero nodes: execute a node from its ready set.
If the ready set is empty, steal from a randomly chosen worker.

A node has at most two outgoing edges, so a standard work-stealing scheduler just works ... well, almost.
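As an illustration of this scheduling loop (my sketch of the rules above, not PIPER's actual implementation), the per-worker logic might look like the following, where worker_t, ready_set_*, enabled_by(), execute(), computation_done(), and random_victim() are hypothetical helpers:

  // Per-worker scheduling loop (sketch).
  void worker_loop(worker_t *w) {
      node_t *n = NULL;
      while (!computation_done()) {
          if (n == NULL)
              n = ready_set_pop(w);          // take local ready work
          if (n == NULL) {
              worker_t *victim = random_victim();
              n = ready_set_steal(victim);   // steal if out of local work
              if (n == NULL)
                  continue;                  // try another victim
          }
          execute(n);
          node_t *a, *b;
          int k = enabled_by(n, &a, &b);     // nodes whose last dependency was n
          if (k == 2)
              ready_set_push(w, b);          // mark one ready ...
          n = (k >= 1) ? a : NULL;           // ... and execute the other
      }
  }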
Outline • On-the-Fly Pipeline Overview • The Pipeline Linguistics in Cilk-P • The PIPER Scheduler • A Work-Stealing Scheduler • Handling Runaway Pipeline • Avoiding Synchronization Overhead • Concluding Remarks
Runaway Pipeline

[Dag figure: two workers race ahead along stage 0, starting iterations far to the right of unfinished ones.]

A runaway pipeline: the scheduler allows many new iterations to be started before old ones finish.
Problem: unbounded space usage!

Cilk-P automatically throttles pipelines by inserting a throttling edge between iteration i and iteration i+K, where K is the throttling limit (K = 4 in the figure).
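One way to picture the throttling edge (a conceptual sketch under assumed helper names, not Cilk-P's runtime code): before an iteration starts, it must wait for the iteration K steps back to finish.

  // Throttling, conceptually (sketch; wait_until() and
  // iteration_done() are hypothetical).
  // The throttling edge i -> i+K means iteration i+K cannot start
  // until iteration i completes, bounding live iterations to K.
  void start_iteration(long i, long K) {
      if (i >= K)
          wait_until(iteration_done(i - K));  // the throttling edge
      run_iteration(i);
  }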
Outline • On-the-Fly Pipeline Overview • The Pipeline Linguistics in Cilk-P • The PIPER Scheduler • A Work-Stealing Scheduler • Handling Runaway Pipeline • Avoiding Synchronization Overhead • Concluding Remarks
Synchronization Overhead

If the two predecessors of a node are executed by different workers, synchronization is necessary: whoever finishes last enables the node.

• At pipe_wait(j), iteration i must check left to see whether stage j is done in iteration i-1.
• At the end of a stage, iteration i must check right to see whether it enabled a node in iteration i+1.

Cilk-P implements "lazy enabling" to mitigate the check-right overhead and "dependency folding" to mitigate the check-left overhead.
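A sketch of the check-left test using a per-iteration progress counter (my illustration of the dependency being checked; the type and field names are assumptions):

  #include <stdatomic.h>

  // Each iteration publishes the highest stage it has completed;
  // iteration i may enter stage j only once iteration i-1 has
  // finished stage j.
  typedef struct {
      _Atomic long stage_done;  // highest completed stage
  } iter_state_t;

  void pipe_wait_check_left(iter_state_t *left, long j) {
      while (atomic_load(&left->stage_done) < j) {
          // In PIPER the worker would suspend here and let the
          // scheduler run other ready nodes; a spin stands in
          // for that in this sketch.
      }
  }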
Lazy Enabling

Idea: be really, really lazy about the check-right operation. Punt the responsibility of checking right onto a thief that steals from the worker, or defer it until the worker runs out of nodes to execute in its own iteration.

Lazy enabling is in accordance with the work-first principle [FLR98].
PIPER's Guarantees

Definitions:
  TP — execution time on P processors
  T1 — work        T∞ — span of the throttled dag        T1/T∞ — parallelism
  SP — stack space on P processors
  S1 — stack space of a serial execution
  K — throttling limit        f — maximum frame size
  D — depth of nested pipelines

• Time bound: TP ≤ T1/P + O(T∞ + lg P) expected time,
  i.e., linear speedup when P ≪ T1/T∞.
• Space bound: SP ≤ P(S1 + fDK).
Outline

• On-the-Fly Pipeline Parallelism
• The Pipeline Linguistics in Cilk-P
• The PIPER Scheduler
• Empirical Evaluation
• Concluding Remarks
Experimental Setup

• All experiments were run on an AMD Opteron system with 4 quad-core 2 GHz CPUs and a total of 8 GBytes of memory.
• Code was compiled using GCC (or G++ for TBB) 4.4.5 with -O3 optimization (except for x264, which uses -O4 by default).
• The Pthreaded implementations of ferret and dedup employ the oversubscription method, which creates more than one thread per pipeline stage. We limited the number of cores used by the Pthreaded implementations using taskset, but experimented to find the best configuration.
• All benchmarks are throttled similarly.
• Each data point shown is the average of 10 runs, typically with standard deviation less than a few percent.
Ferret Performance Comparison

[Plot: speedup over serial execution vs. number of processors P; throttling limit = 10P.]

No performance penalty is incurred for using the more general on-the-fly pipeline instead of a construct-and-run pipeline.
Dedup Performance Comparison

[Plot: speedup over serial execution vs. number of processors P; throttling limit = 4P.]

The measured parallelism of the Cilk-P (and TBB) pipeline is merely 7.4. The Pthreaded implementation has more parallelism due to unordered stages.
X264 Performance Comparison

[Plot: speedup over serial execution vs. number of processors P.]

Cilk-P achieves performance comparable to Pthreads on x264 without explicit data synchronization.
On-the-Fly Pipeline Parallelism in Cilk-P

We have incorporated on-the-fly pipeline parallelism into a Cilk-based work-stealing runtime system, named Cilk-P, which features:
• simple linguistics that:
  • are composable with Cilk's fork-join primitives;
  • specify on-the-fly pipelines;
  • have serial semantics; and
  • allow users to synchronize via control constructs;
• and the PIPER scheduler, which:
  • supports both pipeline and fork-join parallelism;
  • is asymptotically efficient;
  • uses bounded space; and
  • empirically demonstrates low serial overhead and good scalability.

Intel has created an experimental branch of its Cilk Plus runtime with support for on-the-fly pipelines based on Cilk-P:
https://intelcilkruntime@bitbucket.org/intelcilkruntime/intel-cilk-runtime.git
Impact of Throttling

We automatically throttle to save space, but the user shouldn't have to worry about throttling hurting performance. How does throttling a pipeline computation affect its performance?

If the dag is regular, then the work and span of the throttled dag asymptotically match those of the unthrottled dag (a back-of-the-envelope check follows).
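As a worked sanity check on the regular case (my own arithmetic for an idealized dag, not a result stated on the slides), consider a rectangular pipeline of $n$ iterations, each with $s$ unit-weight stages:

  \[ T_1 = ns, \qquad T_\infty = n + s - 1 . \]

  A throttling edge runs from the end of iteration $i$ to the start of iteration $i+K$, so any path can use at most $\lceil n/K \rceil$ of them, each followed by at most $s$ fresh stage nodes:

  \[ T_\infty^{(K)} \;\le\; (n + s - 1) + \lceil n/K \rceil \, s . \]

  Choosing $K \ge s$ keeps the extra term $O(n + s)$, so the throttled span remains $O(T_\infty)$ while the work is unchanged.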
Impact of Throttling

[Dag figure: an irregular pipeline whose stage weights vary, e.g. light stages of weight T1^{1/3} + 1 interleaved with heavy stages of weight (T1^{2/3} + T1^{1/3})/2.]

If the dag is irregular, then there are pipelines on which no throttling scheduler can achieve speedup.
Dedup Performance Comparison

[Plot: speedup over serial execution vs. number of processors P; throttling limit = 4P.]

Modified Cilk-P uses a single worker thread for writing the output, like the Pthreaded implementation, which helps performance as well.
The Cilk Programming Model

  int fib(int n) {
    if (n < 2) { return n; }
    int x = cilk_spawn fib(n-1);  // the named child function may execute
                                  // in parallel with the parent caller
    int y = fib(n-2);
    cilk_sync;                    // control cannot pass this point until
                                  // all spawned children have returned
    return (x + y);
  }

Cilk keywords grant permission for parallel execution. They do not command parallel execution.
Pipelining with TBB

Create a pipeline object and execute:

  tbb::parallel_pipeline(
    INNER_PIPELINE_NUM_TOKENS,
    tbb::make_filter<void, one_chunk*>(
      tbb::filter::serial, get_next_chunk)
    & tbb::make_filter<one_chunk*, one_procd_chunk*>(
      tbb::filter::parallel, deduplicate)
    & tbb::make_filter<one_procd_chunk*, one_procd_chunk*>(
      tbb::filter::parallel, compress)
    & tbb::make_filter<one_procd_chunk*, void>(
      tbb::filter::serial, write_to_file));
Pipelining with Pthreads

Encode each stage in its own thread. Assign threads to workers. Execute.

  void *Fragment(void *targs) {
    ...
    chunk = get_next_chunk();
    buf_insert(&send_buf, chunk);
    ...
  }

  void *Deduplicate(void *targs) {
    ...
    chunk = buf_remove(&recv_buf);
    is_dup = deduplicate(chunk);
    if (!is_dup)
      buf_insert(&send_buf_compress, chunk);
    else
      buf_insert(&send_buf_reorder, chunk);
    ...
  }

  void *Compress(void *targs) {
    ...
    chunk = buf_remove(&recv_buf);
    compress(chunk);
    buf_insert(&send_buf, chunk);
    ...
  }

  void *Reorder(void *targs) {
    ...
    chunk = buf_remove(&recv_buf);
    write_or_enqueue(chunk);
    ...
  }
Pipelining X264 with Pthreads

The cross-edge dependencies are enforced via data synchronization, using locks and condition variables.

Encoding a video with 512 frames on 16 processors, total number of invocations in application code:
  pthread_mutex_lock:     202776
  pthread_cond_broadcast:  34816
  pthread_cond_wait:       10068