
Presentation Transcript


  1. Parallel Applications for Multi-core Processors. Ana Lucia Vârbănescu, TU Delft / Vrije Universiteit Amsterdam. With acknowledgements to: The Multicore Solutions Group @ IBM TJ Watson, NY, USA; Alexander van Amesfoort @ TUD; Rob van Nieuwpoort @ VU/ASTRON

  2. Outline • One introduction • Cell/B.E. case-studies • Sweep3D, Marvel, CellSort • Radioastronomy • An Empirical Performance Checklist • Alternatives • GP-MC, GPUs • Views on parallel applications • … and multiple conclusions

  3. One introduction

  4. The history: STI Cell/B.E. • Sony: main processor for PS3 • Toshiba: signal processing and video streaming • IBM: high performance computing

  5. The architecture • 1 x PPE 64-bit PowerPC • L1: 32 KB I$+32 KB D$ • L2: 512 KB • 8 x SPE cores: • Local store: 256 KB • 128 x 128 bit vector registers • Hybrid memory model: • PPE: Rd/Wr • SPEs: Async DMA

  6. The Programming • Thread-based model, with push/pull data flow • Thread scheduling by the user • Memory transfers are explicit • Five layers of parallelism to be exploited: • Task parallelism (MPMD) • Data parallelism (SPMD) • Data streaming parallelism (DMA double buffering) • Vector parallelism (SIMD – up to 16-way) • Pipeline parallelism (dual-pipelined SPEs)
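  The data-streaming layer boils down to overlapping DMA transfers with computation through double buffering. Below is a minimal sketch for one SPE using the standard MFC intrinsics from spu_mfcio.h; the chunk size, the effective-address layout and the process() routine are illustrative assumptions, not code from the case studies that follow.

    #include <spu_mfcio.h>

    #define CHUNK 4096                        /* bytes per DMA transfer (assumed) */

    extern void process(char *data, int n);   /* hypothetical compute kernel */

    void stream(unsigned long long ea, int nchunks)
    {
        static char buf[2][CHUNK] __attribute__((aligned(128)));
        int cur = 0;

        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);           /* prefetch chunk 0 */

        for (int i = 0; i < nchunks; i++) {
            int next = cur ^ 1;
            if (i + 1 < nchunks)                           /* start fetching chunk i+1 */
                mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK,
                        CHUNK, next, 0, 0);

            mfc_write_tag_mask(1 << cur);                  /* wait only for the current buffer */
            mfc_read_tag_status_all();
            process(buf[cur], CHUNK);                      /* compute while the next DMA is in flight */

            cur = next;
        }
    }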

  7. Sweep3D application • Part of the ASCI benchmark suite • Solves a three-dimensional particle transport problem • It is a 3D wavefront computation • IPDPS 2007: Fabrizio Petrini, Gordon Fossum, Juan Fernández, Ana Lucia Varbanescu, Michael Kistler, Michael Perrone: Multicore Surprises: Lessons Learned from Optimizing Sweep3D on the Cell Broadband Engine

  8. Sweep3D computation

    SUBROUTINE sweep()
      DO iq=1,8                       ! Octant loop
        DO m=1,6/mmi                  ! Angle pipelining loop
          DO k=1,kt/mk                ! K-plane loop
            RECV W/E                  ! Receive W/E I-inflows
            RECV N/S                  ! Receive N/S J-inflows
            DO jkm=1,jt+mk-1+mmi-1    ! JK-diagonals with MMI pipelining
              DO il=1,ndiag           ! I-lines on this diagonal
                IF .NOT. do_fixups
                  DO i=1,it           ! Solve Sn equation
                  ENDDO
                ELSE
                  DO i=1,it           ! Solve Sn equation with fixups
                  ENDDO
                ENDIF
              ENDDO                   ! I-lines on this diagonal
            ENDDO                     ! JK-diagonals with MMI
            SEND W/E                  ! Send W/E I-outflows
            SEND N/S                  ! Send N/S J-outflows
          ENDDO                       ! K-plane pipelining loop
        ENDDO                         ! Angle pipelining loop
      ENDDO                           ! Octant loop

  9. Application parallelization • Process-level parallelism • inherits the wavefront parallelism implemented in MPI • Thread-level parallelism • assign “chunks” of I-lines to SPEs • Data streaming parallelism • threads use double buffering, for both RD and WR • Vector parallelism • SIMD-ize the loops, e.g., 2-way for double precision, 4-way for single precision • Pipeline parallelism • the SPE dual pipeline => multiple logical threads of vectorization
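  To make the vector-parallelism step concrete, here is a minimal sketch of a 4-way single-precision SIMD-ization with the SPU intrinsics; the saxpy-style loop is only an illustrative stand-in for the Sn inner loops, and it assumes 16-byte-aligned arrays with n a multiple of 4.

    #include <spu_intrinsics.h>

    /* scalar version: for (i = 0; i < n; i++) y[i] += a * x[i]; */
    void saxpy_simd(float a, const float *x, float *y, int n)
    {
        vec_float4 va = spu_splats(a);                 /* broadcast a into all 4 lanes */
        const vec_float4 *vx = (const vec_float4 *)x;  /* x, y assumed 16-byte aligned */
        vec_float4 *vy = (vec_float4 *)y;

        for (int i = 0; i < n / 4; i++)
            vy[i] = spu_madd(va, vx[i], vy[i]);        /* 4 multiply-adds per instruction */
    }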

  10. Experiments • Run on SDK2.0, Q20 (prototype) blade • 2 Cell processors, 16 SPEs available • 3.2GHz, 1GB RAM

  11. Optimization techniques

  12. Performance comparison

  13. Sweep3D lessons: • Essential SPE-level optimizations: • Low-level parallelization • Communication • SIMD-ization • Dual-pipelines • Address alignment • DMA grouping • Aggressive low-level optimizations = Algorithm tuning!!

  14. Generic CellSort • Based on bitonic merge/sort • Works on arrays of 2K elements • Sorts 8-byte patterns from an input string • Keeps track of the original position of each pattern
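  For reference, the scalar bitonic sorting network underlying CellSort looks as follows; this sketch only illustrates the fixed, data-independent compare-exchange pattern that makes the algorithm easy to SIMD-ize and to split across SPEs, and is not the Cell implementation itself.

    /* In-place ascending bitonic sort; n must be a power of two. */
    void bitonic_sort(int *a, int n)
    {
        for (int k = 2; k <= n; k <<= 1)               /* size of the bitonic sequences */
            for (int j = k >> 1; j > 0; j >>= 1)       /* compare-exchange distance */
                for (int i = 0; i < n; i++) {
                    int l = i ^ j;                     /* partner index */
                    if (l > i) {
                        int up = ((i & k) == 0);       /* sort direction of this block */
                        if ((up && a[i] > a[l]) || (!up && a[i] < a[l])) {
                            int t = a[i]; a[i] = a[l]; a[l] = t;
                        }
                    }
                }
    }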

  15. Data “compression” • Memory limitations: • SPE LS = 256 KB => 128 KB data (16K keys) + 64 KB indexes • Avoid branches (sorting is about if’s …) • SIMD-ization: the (2 keys x 8 B) per 16-B vector layout is replaced by the packed layout in the figure [figure: a KEY | INDEX | X | X vector layout vs. separate KEYS and INDEXES arrays]
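  One plausible reading of the packed layout, stated purely as an assumption for illustration: each 16-byte (vector-sized) record carries one 8-byte key together with the 4-byte original position of the element, so key and index travel together through the compare-exchange network.

    #include <stdint.h>

    /* Assumed record layout, not taken from the CellSort sources. */
    typedef union {
        struct {
            uint64_t key;    /* 8-byte pattern taken from the input string */
            uint32_t index;  /* original position, carried along while sorting */
            uint32_t pad;    /* unused, keeps the record 16 bytes / vector-sized */
        } s;
        uint32_t words[4];   /* on the SPE this union would also hold a 128-bit vector */
    } sort_rec_t;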

  16. Re-implementing the if’s • if (A > B) can be replaced with 6 SIMD instructions for comparing:

    inline int sKeyCompareGT16(SORT_TYPE A, SORT_TYPE B)
    {
        VECTORFORMS temp1, temp2, temp3, temp4;
        temp1.vui = spu_cmpeq( A.vect.vui, B.vect.vui );
        temp2.vui = spu_cmpgt( A.vect.vui, B.vect.vui );
        temp3.vui = spu_slqwbyte( temp2.vui, 4 );
        temp4.vui = spu_and( temp3.vui, temp1.vui );
        temp4.vui = spu_or( spu_or(temp4.vui, temp2.vui), temp1.vui );
        return (spu_extract(spu_gather(temp4.vui), 0) >= 8);
    }

  17. The good results • Input data: 256 KB string • Running on: • one PPE on a Cell blade • the PPE of a PS3 • the PPE + 16 SPEs on the same Cell blade • 16 SPEs => speed-up ~46

  18. The bad results • Non-standard key types • A lot of effort to implement basic operations efficiently • SPE-to-SPE communication wastes memory • A larger local SPE sort was more efficient • The limitation of 2K elements is killing performance • Another basic algorithm may be required • Cache troubles • The PPE cache is “polluted” by SPE accesses • Flushing is not trivial

  19. Lessons from CellSort • Some algorithms do not fit the Cell/B.E. • It pays off to look for different solutions at the higher level (i.e., different algorithm) • Hard to know in advance • SPE-to-SPE communication may be expensive • Not only time-wise, but memory-wise too! • SPE memory is *very* limited • Double buffering wastes memory too! • Cell does show cache-effects

  20. Multimedia Analysis & Retrieval MARVEL: • Machine tagging, searching and filtering of images & video • Novel Approach: • Semantic models by analyzing visual, audio & speech modalities • Automatic classification of scenes, objects, events, people, sites, etc. • http://www.research.ibm.com/marvel

  21. MARVEL case-study • Multimedia content retrieval and analysis • Feature extraction: extracts the values for 4 features of interest: ColorHistogram, ColorCorrelogram, Texture, EdgeHistogram • Concept detection: compares the image features with the model features and generates an overall confidence score

  22. MarCell = MARVEL on Cell • Identified 5 kernels to port on the SPEs: • 4 feature extraction algorithms: ColorHistogram (CHExtract), ColorCorrelogram (CCExtract), Texture (TXExtract), EdgeHistogram (EHExtract) • 1 common concept detection (CD) kernel, repeated for each feature

  23. MarCell – Porting • 1. Detect & isolate kernels to be ported • 2. Replace kernels with C++ stubs • 3. Implement the data transfers and move kernels on SPEs • 4. Iteratively optimize SPE code • ICPP 2007: A.L. Varbanescu, H.J. Sips, K.A. Ross, Q. Liu, A. Natsev, J.R. Smith, L.-K. Liu, An Effective Strategy for Porting C++ Applications on Cell.

  24. Experiments • Run on a PlayStation 3 • 1 Cell processor, 6 SPEs available • 3.2GHz, 256MB RAM • Double-checked with a Cell blade Q20 • 2 Cell processors, 16 SPEs available • 3.2GHz, 1GB RAM • SDK2.1

  25. MarCell – kernels speed-up

  26. Task parallelism – setup

  27. Task parallelism – on Cell blade

  28. Data parallelism – setup • All SPEs execute the same kernel => SPMD • Requires SPE reconfiguration: • thread re-creation • overlays • Kernels scale, the overall application doesn’t!!

  29. Combined parallelism – setup • Different kernels span over multiple SPEs • Load balancing • CC and TX ideal candidates • But we verify all possible solutions

  30. Combined parallelism - Cell blade [1/2] CCPE 2008: A.L. Varbanescu, H.J. Sips, K.A. Ross, Q. Liu, A. Natsev, J.R. Smith, L.-K. Liu, Evaluating Application Mapping Scenarios on the Cell/B.E.

  31. Combined parallelism - Cell blade [2/2]

  32. MarCell lessons: • Mapping and scheduling: • High-level parallelization • Essential for “seeing” the influence of kernel optimizations • Platform-oriented • MPI-inheritance may not be good enough • Context switches are expensive • Static scheduling can be replaced with dynamic (PPE-based) scheduling

  33. Radioastronomy • Very large radiotelescopes • LOFAR, ASKAP, SKA, etc. • Radioastronomy features • Very large data sets • Off-line (files) and On-line processing (streaming) • Simple computation kernels • Time constraints • Due to streaming • Due to storage capability • Radioastronomy data processing is ongoing research • Multi-core processors are a challenging solution

  34. Getting the sky image • The signal path from the antenna to the sky image • We focus on imaging

  35. Data imaging • Two phases for building a sky image • Imaging: gets measured visibilities and creates dirty image • Deconvolution “cleans” the dirty image into a sky model. • The more iterations, the better the model • But more iterations = more measured visibilities

  36. Gridding/Degridding [figure: (u,v)-tracks of sampled data (visibilities) are mapped by gridding onto gridded data covering all baselines, and back by degridding] • Dj(b(ti)) contributes to a certain region in the final grid • V(b(ti)) = data read at time ti on baseline b • Both gridding and degridding are performed by convolution

  37. The code

    forall (j = 0..Nfreq; i = 0..Nsamples-1)        // for all samples
        // the kernel position in C
        compute cindex = C_Offset((u,v,w)[i], freq[j]);
        // the grid region to fill
        compute gindex = G_Offset((u,v,w)[i], freq[j]);
        // for all points in the chosen region
        for (x = 0; x < M; x++)                     // sweep the convolution kernel
            if (gridding)   G[gindex+x] += C[cindex+x] * V[i,j];
            if (degridding) V’[i,j]     += G[gindex+x] * C[cindex+x];

  • All operations are performed with complex numbers!
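  To see where the 4 additions and 4 multiplications per grid point on the next slide come from, the gridding update for a single visibility sample can be written with explicit complex arithmetic; this is a plain C99 illustration with hypothetical array names matching the pseudocode above, not the Cell kernel.

    #include <complex.h>

    /* Gridding update for one visibility sample v: each of the M kernel
       points performs one complex multiply-add, i.e. 4 real multiplications
       and 4 real additions. */
    void grid_sample(float complex *G, const float complex *C,
                     float complex v, int gindex, int cindex, int M)
    {
        for (int x = 0; x < M; x++)
            G[gindex + x] += C[cindex + x] * v;
    }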

  38. The computation [figure: for each sample, read (u,v,w)(t,b) and V(t,b,f) from HDD (samples x baselines x frequency channels), compute C_ind and G_ind, then for k = 1..m x m read SC[k] and SG[k] from memory, compute SG[k] + D x SC[k], and write SG[k] back to G] • Computation/iteration: M * (4 ADD + 4 MUL) = 8 * M FLOPs • Memory transfers/iteration: RD: 2 * M * 8 B; WR: M * 8 B, i.e. 24 * M bytes in total • Arithmetic intensity: 8 * M / (24 * M) = 1/3 FLOPs/byte => memory-intensive app! • Two consecutive data points “hit” different regions in C/G => dynamic!

  39. The data • Memory footprint: • C: 4MB ~ 100MB • V: 3.5GB for 990 baselines x 1 sample/s x 16 fr.channels • G: 4MB • For each data point: • Convolution kernel: from 15 x 15 up to 129 x 129

  40. Data distribution [figure: samples 1..12 assigned to SPEs round-robin, in chunks, or through queues] • “Round-robin” • “Chunks” • Queues
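  As an illustration of the first two schemes (the queue-based scheme adds a shared work queue managed by the PPE on top of this), a minimal sketch of how sample indices could be handed to SPE workers; the worker id w, the sample count and the process_sample() routine are assumed parameters, not values from the experiments.

    extern void process_sample(int i);   /* hypothetical per-sample kernel */

    /* Round-robin: worker w handles samples w, w + NW, w + 2*NW, ... */
    void process_round_robin(int w, int nsamples, int NW)
    {
        for (int i = w; i < nsamples; i += NW)
            process_sample(i);
    }

    /* Chunks: worker w handles one contiguous block of samples. */
    void process_chunks(int w, int nsamples, int NW)
    {
        int begin = (int)((long long)w * nsamples / NW);
        int end   = (int)((long long)(w + 1) * nsamples / NW);
        for (int i = begin; i < end; i++)
            process_sample(i);
    }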

  41. Parallelization • A master-worker model • “Scheduling” decisions on the PPE • SPEs concerned only with computation [figure: the PPE reads (u,v,w)(t,b) and V(t,b,f) from HDD (samples x baselines x frequency channels) and feeds the SPEs via DMA; each SPE, for k = 1..m x m, computes C_ind and G_ind, reads SC[k] and SG[k], computes SG[k] + D x SC[k], writes SG[k] to a local grid localG, and finally localG is added to the final grid finalG]

  42. Optimizations • Exploit data locality • PPE: fill the queues in a “smart” way • SPEs: avoid unnecessary DMA • Tune queue sizes • Increase queue filling speed • 2 or 4 threads on the PPE • Sort queues • By g_ind and/or c_ind

  43. Experiments set-up • Collection of 990 baselines • 1 baseline • Multiple baselines • Run gridding and degridding for: • 5 different support sizes • Different core/thread configurations • Report: • Execution time / operation (i.e., per gridding and per degridding): Texec/op = Texec/(NSamples x NFreqChans x KernelSize x #Cores)

  44. Results – overall evolution

  45. Lessons from Gridding • SPE kernels have to be as regular as possible • Dynamic scheduling works on the PPE side • Investigate data-dependent optimizations • to save memory accesses • Arithmetic intensity is a very important metric • aggressively optimizing the computation only pays off when the communication-to-computation ratio is small!! • I/O can limit the Cell/B.E. performance

  46. Performance checklist [1/2] • Low-level • No dynamics on the SPEs • Memory alignment • Cache behavior vs. SPE data contention • Double buffering • Balance computation optimizations against communication • Expect impact on the algorithmic level

  47. Performance checklist [2/2] • High-level • Task-level parallelization • Symmetrical/asymmetrical • Static mapping if possible; dynamic only on the PPE • Address data locality also on the PPE • Moderate impact on algorithm • Data-dependent optimizations • Enhance data locality

  48. Outline • One introduction • Cell/B.E. case-studies • Sweep3D, Marvel, CellSort • Radioastronomy • An Empirical Performance Checklist • Alternatives • GP-MC, GPUs • Views on parallel applications • … and multiple conclusions

  49. Other platforms • General-purpose multi-cores (GP-MC) • Easier to program (SMP machines) • Homogeneous • Complex, traditional cores, multi-threaded • GPUs • Hierarchical cores • Harder to program (more parallelism) • Complex memory architecture • Less predictable

  50. A Comparison • Different strategies are required for each platform • Core-specific optimizations are the most important for GPP • Dynamic job/data allocation is essential for the Cell/B.E. • Memory management for high data parallelism is critical for GPUs
