This article explores the use of multi-core processors in parallel applications, with a focus on the Cell/B.E. architecture. Case studies such as Sweep3D and CellSort are discussed, along with performance optimizations and lessons learned.
Parallel Applications for Multi-core Processors
Ana Lucia Vârbănescu, TU Delft / Vrije Universiteit Amsterdam
With acknowledgements to: the Multicore Solutions Group @ IBM T.J. Watson, NY, USA; Alexander van Amesfoort @ TUD; Rob van Nieuwpoort @ VU/ASTRON
Outline • One introduction • Cell/B.E. case-studies • Sweep3D, Marvel, CellSort • Radioastronomy • An Empirical Performance Checklist • Alternatives • GP-MC, GPUs • Views on parallel applications • … and multiple conclusions
The history: STI Cell/B.E. • Sony: main processor for PS3 • Toshiba: signal processing and video streaming • IBM: high performance computing
The architecture • 1 x PPE 64-bit PowerPC • L1: 32 KB I$+32 KB D$ • L2: 512 KB • 8 x SPE cores: • Local store: 256 KB • 128 x 128 bit vector registers • Hybrid memory model: • PPE: Rd/Wr • SPEs: Async DMA
The Programming • Thread-based model, with push/pull data flow • Thread scheduling by user • Memory transfers are explicit • Five layers of parallelism to be exploited: • Task parallelism (MPMD) • Data parallelism (SPMD) • Data streaming parallelism (DMA double buffering) • Vector parallelism (SIMD – up to 16-ways) • Pipeline parallelism (dual-pipelined SPEs)
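As a concrete illustration of the data-streaming layer, here is a minimal sketch of SPE-side DMA double buffering, assuming the standard spu_mfcio.h interface; process_block() and the chunk size are placeholders, not code from the case studies that follow.

#include <spu_mfcio.h>

#define CHUNK 4096   /* bytes per DMA transfer (multiple of 16, at most 16 KB) */

volatile char buf[2][CHUNK] __attribute__((aligned(128)));

/* hypothetical compute kernel working on one chunk in the local store */
extern void process_block(volatile char *block, unsigned int size);

void stream(unsigned long long ea, unsigned int nchunks)
{
    unsigned int cur = 0, next = 1;

    /* prefetch the first chunk */
    mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);

    for (unsigned int i = 0; i < nchunks; i++) {
        /* start fetching chunk i+1 while chunk i is still being used */
        if (i + 1 < nchunks)
            mfc_get(buf[next], ea + (i + 1) * (unsigned long long)CHUNK,
                    CHUNK, next, 0, 0);

        /* wait only for the current buffer's tag, then compute on it */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process_block(buf[cur], CHUNK);

        /* swap buffers */
        cur ^= 1;
        next ^= 1;
    }
}

While the current chunk is being processed, the DMA for the next one is already in flight, which is what hides the memory latency.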
Sweep3D application • Part of the ASCI benchmark • Solves a three-dimensional particle transport problem • It is a 3D wavefront computation
IPDPS 2007: Fabrizio Petrini, Gordon Fossum, Juan Fernández, Ana Lucia Varbanescu, Michael Kistler, Michael Perrone: Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine.
Sweep3D computation

SUBROUTINE sweep()
  DO iq=1,8                      ! Octant loop
    DO m=1,6/mmi                 ! Angle pipelining loop
      DO k=1,kt/mk               ! K-plane loop
        RECV W/E                 ! Receive W/E I-inflows
        RECV N/S                 ! Receive N/S J-inflows
        ! JK-diagonals with MMI pipelining
        DO jkm=1,jt+mk-1+mmi-1
          ! I-lines on this diagonal
          DO il=1,ndiag
            ! Solve Sn equation
            IF .NOT. do_fixups
              DO i=1,it
              ENDDO
            ! Solve Sn equation with fixups
            ELSE
              DO i=1,it
              ENDDO
            ENDIF
          ENDDO                  ! I-lines on this diagonal
        ENDDO                    ! JK-diagonals with MMI
        SEND W/E                 ! Send W/E I-outflows
        SEND N/S                 ! Send N/S J-outflows
      ENDDO                      ! K-plane pipelining loop
    ENDDO                        ! Angle pipelining loop
  ENDDO                          ! Octant loop
Application parallelization • Process-level parallelism • Inherits the wavefront parallelism implemented in MPI • Thread-level parallelism • Assign "chunks" of I-lines to SPEs • Data streaming parallelism • Threads use double buffering, for both RD and WR • Vector parallelism • SIMD-ize the loops • E.g., 2-ways for double precision, 4-ways for single precision • Pipeline parallelism • SPE dual-pipeline => multiple logical threads of vectorization
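As a hedged sketch of the vector-parallelism step, 4-way single-precision SIMD-ization of a simple multiply-accumulate loop with SPU intrinsics might look as follows; the real Sweep3D inner loops are considerably more involved, and the array names here are illustrative only.

#include <spu_intrinsics.h>

/* c[i] += a[i] * b[i], 4 single-precision elements per iteration;
   assumes 16-byte-aligned arrays and n a multiple of 4 */
void fmadd_simd(float *a, float *b, float *c, int n)
{
    vector float *va = (vector float *) a;
    vector float *vb = (vector float *) b;
    vector float *vc = (vector float *) c;

    for (int i = 0; i < n / 4; i++)
        vc[i] = spu_madd(va[i], vb[i], vc[i]);   /* fused multiply-add */
}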
Experiments • Run on SDK2.0, Q20 (prototype) blade • 2 Cell processors, 16 SPEs available • 3.2GHz, 1GB RAM
Sweep3D lessons: • Essential SPE-level optimizations: • Low-level parallelization • Communication • SIMD-ization • Dual-pipelines • Address alignment • DMA grouping • Aggressive low-level optimizations = Algorithm tuning!!
Generic CellSort • Based on bitonic merge/sort • Works on 2K array elements • Sorts 8-byte patterns from an input string • Keeps track of the original position
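For reference, a plain serial version of the underlying bitonic sort (power-of-two element count assumed); the Cell implementation replaces the inner compare-exchange with the SIMD-ized, branch-free code shown two slides below.

/* serial bitonic sort for n = power of two; ascending order */
void bitonic_sort(unsigned long long *key, int n)
{
    for (int k = 2; k <= n; k <<= 1) {            /* size of bitonic sequences */
        for (int j = k >> 1; j > 0; j >>= 1) {    /* compare-exchange distance */
            for (int i = 0; i < n; i++) {
                int partner = i ^ j;
                if (partner > i) {
                    int ascending = ((i & k) == 0);
                    if ((ascending  && key[i] > key[partner]) ||
                        (!ascending && key[i] < key[partner])) {
                        unsigned long long tmp = key[i];
                        key[i] = key[partner];
                        key[partner] = tmp;
                    }
                }
            }
        }
    }
}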
Data "compression" • Memory limitations: SPE LS = 256KB => 128KB data (16K keys) + 64KB indexes • Avoid branches (sorting is about if's…) • SIMD-ization: the interleaved (KEY, INDEX) vector layout is replaced by separate KEYS and INDEXES vectors, with 2 keys x 8B per 16B vector
Re-implementing the if's • if (A>B) can be replaced with 6 SIMD instructions for comparing:

inline int sKeyCompareGT16(SORT_TYPE A, SORT_TYPE B)
{
    VECTORFORMS temp1, temp2, temp3, temp4;
    temp1.vui = spu_cmpeq( A.vect.vui, B.vect.vui );
    temp2.vui = spu_cmpgt( A.vect.vui, B.vect.vui );
    temp3.vui = spu_slqwbyte( temp2.vui, 4 );
    temp4.vui = spu_and( temp3.vui, temp1.vui );
    temp4.vui = spu_or( spu_or(temp4.vui, temp2.vui), temp1.vui );
    return ( spu_extract(spu_gather(temp4.vui), 0) >= 8 );
}
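For contrast, a branchy scalar version of such a comparison, assuming the key is an 8-byte pattern stored as two 32-bit words (the layout is an assumption here); it is exactly this kind of data-dependent branch that the 6-instruction SPU sequence above avoids.

/* scalar, branchy sketch of an 8-byte key compare (two 32-bit words) */
static inline int keyCompareGT(unsigned int hiA, unsigned int loA,
                               unsigned int hiB, unsigned int loB)
{
    if (hiA != hiB)
        return hiA > hiB;   /* decided by the high word */
    return loA > loB;       /* tie-break on the low word */
}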
The good results • Input data: 256KB string • Running on: one PPE on a Cell blade, the PPE on a PS3, and PPE + 16 SPEs on the same Cell blade • 16 SPEs => speed-up ~46
The bad results • Non-standard key types • A lot of effort for implementing basic operations efficiently • SPE-to-SPE communication wastes memory • A larger local SPE sort was more efficient • The limitation of 2K elements is killing performance • Another basic algorithm may be required • Cache troubles • PPE cache is "polluted" by SPE accesses • Flushing is not trivial
Lessons from CellSort • Some algorithms do not fit the Cell/B.E. • It pays off to look for different solutions at the higher level (i.e., different algorithm) • Hard to know in advance • SPE-to-SPE communication may be expensive • Not only time-wise, but memory-wise too! • SPE memory is *very* limited • Double buffering wastes memory too! • Cell does show cache-effects
Multimedia Analysis & Retrieval MARVEL: • Machine tagging, searching and filtering of images & video • Novel Approach: • Semantic models by analyzing visual, audio & speech modalities • Automatic classification of scenes, objects, events, people, sites, etc. • http://www.research.ibm.com/marvel
MARVEL case-study • Multimedia content retrieval and analysis • Feature extraction: extracts the values for 4 features of interest: ColorHistogram, ColorCorrelogram, Texture, EdgeHistogram • Concept detection: compares the image features with the model features and generates an overall confidence score
MarCell = MARVEL on Cell • Identified 5 kernels to port on the SPEs: • 4 feature extraction algorithms: ColorHistogram (CHExtract), ColorCorrelogram (CCExtract), Texture (TXExtract), EdgeHistogram (EHExtract) • 1 common concept detection, repeated for each feature
MarCell – Porting
1. Detect & isolate kernels to be ported
2. Replace kernels with C++ stubs
3. Implement the data transfers and move kernels on the SPEs
4. Iteratively optimize SPE code
ICPP 2007: A.L. Varbanescu, H.J. Sips, K.A. Ross, Q. Liu, A. Natsev, J.R. Smith, L.-K. Liu, An Effective Strategy for Porting C++ Applications on Cell.
Experiments • Run on a PlayStation 3 • 1 Cell processor, 6 SPEs available • 3.2GHz, 256MB RAM • Double-checked with a Cell blade Q20 • 2 Cell processors, 16 SPEs available • 3.2GHz, 1GB RAM • SDK2.1
Data parallelism – setup • All SPEs execute the same kernel => SPMD • Requires SPE reconfiguration: • Thread re-creation • Overlays • Kernels scale, but the overall application doesn't!!
Combined parallelism – setup • Different kernels span over multiple SPEs • Load balancing • CC and TX ideal candidates • But we verify all possible solutions
Combined parallelism - Cell blade [1/2] CCPE 2008: A.L. Varbanescu, H.J. Sips, K.A. Ross, Q. Liu, A. Natsev, J.R. Smith, L.-K. Liu, Evaluating Application Mapping Scenarios on the Cell/B.E.
MarCell lessons: • Mapping and scheduling: • High-level parallelization • Essential for “seeing” the influence of kernel optimizations • Platform-oriented • MPI-inheritance may not be good enough • Context switches are expensive • Static scheduling can be replaced with dynamic (PPE-based) scheduling
Radioastronomy • Very large radiotelescopes • LOFAR, ASKAP, SKA, etc. • Radioastronomy features • Very large data sets • Off-line (files) and On-line processing (streaming) • Simple computation kernels • Time constraints • Due to streaming • Due to storage capability • Radioastronomy data processing is ongoing research • Multi-core processors are a challenging solution
Getting the sky image • The signal path from the antenna to the sky image • We focus on imaging
Data imaging • Two phases for building a sky image • Imaging: gets measured visibilities and creates dirty image • Deconvolution “cleans” the dirty image into a sky model. • The more iterations, the better the model • But more iterations = more measured visibilities
Gridding/Degridding • The (u,v)-tracks of sampled data (visibilities) V(b(ti)) are mapped onto gridded data (all baselines) by gridding, and back by degridding • V(b(ti)) = data read at time ti on baseline b • Dj(b(ti)) contributes to a certain region in the final grid • Both gridding and degridding are performed by convolution
The code:

forall (j = 0..Nfreq; i = 0..Nsamples-1)        // for all samples
    // the kernel position in C
    compute cindex = C_Offset((u,v,w)[i], freq[j]);
    // the grid region to fill
    compute gindex = G_Offset((u,v,w)[i], freq[j]);
    // for all points in the chosen region
    for (x = 0; x < M; x++)                     // sweep the convolution kernel
        if (gridding)   G[gindex+x] += C[cindex+x] * V[i,j];
        if (degridding) V'[i,j] += G[gindex+x] * C[cindex+x];

• All operations are performed with complex numbers!
The computation • Per sample: read (u,v,w)(t,b) and V(t,b,f) from memory/HDD; then, for k = 1..m x m: compute C_ind, G_ind; read SC[k], SG[k]; compute SG[k] + D x SC[k]; write SG[k] back to G • Computation/iteration: M * (4 ADD + 4 MUL) = 8 * M FLOPs • Memory transfers/iteration: RD: 2 * M * 8B; WR: M * 8B • Arithmetic intensity [FLOPs/byte]: 1/3 => memory-intensive app! • Two consecutive data points "hit" different regions in C/G => dynamic!
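Filling in the numbers from the bullets above gives the stated ratio:

AI = FLOPs / bytes = (8 * M) / (2 * M * 8B + M * 8B) = 8M / 24M ≈ 0.33 FLOP/byte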
The data • Memory footprint: • C: 4MB ~ 100MB • V: 3.5GB for 990 baselines x 1 sample/s x 16 fr.channels • G: 4MB • For each data point: • Convolution kernel: from 15 x 15 up to 129 x 129
Data distribution • "Round-robin" • "Chunks" • Queues
Parallelization • A master-worker model • "Scheduling" decisions on the PPE • SPEs concerned only with computation: each SPE, via DMA, computes C_ind, G_ind; reads SC[k], SG[k]; computes SG[k] + D x SC[k]; writes SG[k] to a localG; finally, localG is added to the finalG
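A minimal, hedged sketch of how the PPE side of such a master-worker scheme is commonly set up with libspe2 and pthreads; gridding_spu, worker_arg and the queue handling are placeholders, not the actual implementation.

#include <libspe2.h>
#include <pthread.h>

extern spe_program_handle_t gridding_spu;   /* hypothetical embedded SPE program */

typedef struct { spe_context_ptr_t ctx; void *queue; } worker_arg;

static void *run_spe(void *p)
{
    worker_arg *w = (worker_arg *) p;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    /* blocks until the SPE program finishes; argp points to this worker's queue */
    spe_context_run(w->ctx, &entry, 0, w->queue, NULL, NULL);
    return NULL;
}

void launch_workers(worker_arg *w, int nspes)   /* assumes nspes <= 16 */
{
    pthread_t tid[16];
    for (int i = 0; i < nspes; i++) {
        w[i].ctx = spe_context_create(0, NULL);
        spe_program_load(w[i].ctx, &gridding_spu);
        pthread_create(&tid[i], NULL, run_spe, &w[i]);
    }
    /* the PPE master now fills the per-SPE work queues and joins later */
}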
Optimizations • Exploit data locality • PPE: fill the queues in a “smart” way • SPEs: avoid unnecessary DMA • Tune queue sizes • Increase queue filling speed • 2 or 4 threads on the PPE • Sort queues • By g_ind and/or c_ind
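One possible way to realize the "sort queues" item is a plain qsort over the queued work items, ordering by g_ind and then c_ind so that consecutive items touch the same grid and kernel regions; the work_item layout below is purely hypothetical.

#include <stdlib.h>

typedef struct { int g_ind; int c_ind; /* sample payload ... */ } work_item;

/* order queue entries by grid region first, then by kernel position,
   so consecutive items reuse the same SG/SC tiles in the local store */
static int cmp_work(const void *a, const void *b)
{
    const work_item *x = a, *y = b;
    if (x->g_ind != y->g_ind)
        return (x->g_ind > y->g_ind) - (x->g_ind < y->g_ind);
    return (x->c_ind > y->c_ind) - (x->c_ind < y->c_ind);
}

/* usage: qsort(queue, nitems, sizeof(work_item), cmp_work); */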
Experiments set-up • Collection of 990 baselines • 1 baseline • Multiple baselines • Run gridding and degridding for: • 5 different support sizes • Different core/thread configurations • Report: • Execution time / operation (i.e., per gridding and per degridding): Texec/op = Texec/(NSamples x NFreqChans x KernelSize x #Cores)
Lessons from Gridding • SPE kernels have to be as regular as possible • Dynamic scheduling works on the PPE side • Investigate data-dependent optimizations • To spare memory accesses • Arithmetic intensity is a very important metric • Aggressively optimizing the computation part only pays off when communication-to-computation is small!! • I/O can limit the Cell/B.E. performance
Performance checklist [1/2] • Low-level • No dynamics on the SPEs • Memory alignment • Cache behavior vs. SPE data contention • Double-buffering • Balance computation optimization with the communication • Expect impact on the algorithmic level
Performance checklist [2/2] • High-level • Task-level parallelization • Symmetrical/asymmetrical • Static mapping if possible; dynamic only on the PPE • Address data locality also on the PPE • Moderate impact on algorithm • Data-dependent optimizations • Enhance data locality
Outline • One introduction • Cell/B.E. case-studies • Sweep3D, Marvel, CellSort • Radioastronomy • An Empirical Performance Checklist • Alternatives • GP-MC, GPUs • Views on parallel applications • … and multiple conclusions
Other platforms • General-purpose MC • Easier to program (SMP machines) • Homogeneous • Complex, traditional cores, multi-threaded • GPUs • Hierarchical cores • Harder to program (more parallelism) • Complex memory architecture • Less predictable
A Comparison • Different strategies are required for each platform • Core-specific optimizations are the most important for GPPs • Dynamic job/data allocation is essential for the Cell/B.E. • Memory management for high data parallelism is critical for GPUs