
OpenMP Performance Visualization with Paraver

Jesús Labarta, Jordi Caubet, Judit Gimenez, Sergi Girona, Francesc Escale (CEPBA-UPC)


Presentation Transcript


  1. OpenMP Performance Visualization with Paraver
     Jesús Labarta, Jordi Caubet, Judit Gimenez, Sergi Girona, Francesc Escale (CEPBA-UPC)

  2. PARAVER (since 1992)
     • Flexible performance visualization tool
     • Metrics displayed as functions of time
     • Precedence relationships (e.g., communications)
     • Quantitative and comparative analysis
     • Powerful, but not trivial to use
     • The user drives the analysis
     • Supports MPI + OpenMP, system activity, hardware performance counters, …
     • Distributed by CEPBA

  3. Process model: multithreaded + message passing + multiprogramming
     • Objects:
       • Thread
       • Task
       • Ptask (application)

  4. Tracefile
     • Instrumented codes
       • MPI + OpenMP
       • Java
       • Pthreads, shmem
     • Monitoring tools
       • System activity (SCPUs)
       • InfoPerfex
     • Simulators
       • Dimemas
       • SimpleScalar
     • Filters
       • par2Paraver
       • UTE2Paraver
     • Records
       • State(Object, time_start, time_end, state)
       • Events:
         • Flag(Object, time, type, value)
         • Precedence(Object_src, Object_dst, time_src, time_dst, tag, size)
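The record types listed above can be modeled as plain data structures. The sketch below is a minimal illustration, not Paraver's actual trace file format: the class names, the `ObjectId` encoding of the Ptask/Task/Thread hierarchy, integer timestamps, and the `validate` helper are all assumptions.

```python
from dataclasses import dataclass
from typing import NamedTuple, Union


class ObjectId(NamedTuple):
    """Hypothetical address of a thread in the Ptask -> Task -> Thread hierarchy."""
    ptask: int
    task: int
    thread: int


@dataclass(frozen=True)
class State:
    """The object was in `state` between time_start and time_end."""
    obj: ObjectId
    time_start: int
    time_end: int
    state: int


@dataclass(frozen=True)
class Flag:
    """Punctual event of a given type and value at `time`."""
    obj: ObjectId
    time: int
    type: int
    value: int


@dataclass(frozen=True)
class Precedence:
    """Precedence relation (e.g., a message) from a source to a destination object."""
    obj_src: ObjectId
    obj_dst: ObjectId
    time_src: int
    time_dst: int
    tag: int
    size: int


Record = Union[State, Flag, Precedence]


def validate(rec: Record) -> None:
    """Sanity checks, assuming all records share one consistent global clock."""
    if isinstance(rec, State) and rec.time_end < rec.time_start:
        raise ValueError("state ends before it starts")
    if isinstance(rec, Precedence) and rec.time_dst < rec.time_src:
        raise ValueError("destination time precedes source time")
```

For example, `Precedence(ObjectId(1, 1, 1), ObjectId(1, 2, 1), 10, 20, 7, 1024)` would describe a 1024-byte message with tag 7 from task 1 to task 2 of the same application.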

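  5. Structure
     • Tracefile → Filter → Semantics → Representation
     • Representation modules: Visualization, Textual, Analysis
     • The filter can also produce a reduced tracefile
     • Inputs to the representation modules: function of time (semantic value) and events
     • Demand-driven evaluation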
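Demand-driven evaluation can be illustrated with Python generators: each stage pulls records from the previous one only when the display asks for them. This is a hypothetical sketch, assuming time-ordered records of the form (time, thread, value); the stage names are illustrative and are not Paraver's internal API.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[int, int, int]  # (time, thread, value), assumed time-ordered
Point = Tuple[int, int]        # (time, semantic value)


def read_trace(records: Iterable[Record]) -> Iterator[Record]:
    """Source: yields records one at a time instead of loading the whole trace."""
    for rec in records:
        yield rec


def filter_stage(records: Iterator[Record], thread: int) -> Iterator[Record]:
    """Filter: keeps only the records of one thread (lazily)."""
    return (rec for rec in records if rec[1] == thread)


def semantic_stage(records: Iterator[Record]) -> Iterator[Point]:
    """Semantics: maps each record to a (time, value) point of f(t)."""
    for time, _thread, value in records:
        yield time, value


def visible_window(points: Iterator[Point], t0: int, t1: int) -> List[Point]:
    """Representation: pulls points only up to the end of the visible window."""
    shown: List[Point] = []
    for time, value in points:
        if time > t1:
            break  # stop pulling; later records are never read
        if time >= t0:
            shown.append((time, value))
    return shown
```

Because the chain is lazy, drawing the window [0, 10] for thread 1 over an iterator of records stops reading the source at the first record of thread 1 later than time 10; everything after it is left untouched.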

  6. Filter module
     • Events
       • by type
       • by value
     • Communications
       • by tag
       • by size
       • by source / destination
       • logical / physical
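A filter of this kind reduces the trace before the semantic module sees it. The sketch below uses hypothetical `Event` and `Comm` records and illustrative parameter names; it keeps events by type and value, and communications by tag, size range, source and destination, where `None` means "no restriction". The logical/physical distinction is not modeled.

```python
from dataclasses import dataclass
from typing import AbstractSet, Iterable, List, Optional


@dataclass(frozen=True)
class Event:
    thread: int
    time: int
    type: int
    value: int


@dataclass(frozen=True)
class Comm:
    src: int
    dst: int
    time_send: int
    time_recv: int
    tag: int
    size: int


def _ok(x: int, allowed: Optional[AbstractSet[int]]) -> bool:
    """True if no restriction is given or x is in the allowed set."""
    return allowed is None or x in allowed


def filter_events(events: Iterable[Event],
                  types: Optional[AbstractSet[int]] = None,
                  values: Optional[AbstractSet[int]] = None) -> List[Event]:
    """Keep events whose type and value pass the optional sets."""
    return [e for e in events if _ok(e.type, types) and _ok(e.value, values)]


def filter_comms(comms: Iterable[Comm],
                 tags: Optional[AbstractSet[int]] = None,
                 min_size: int = 0,
                 max_size: Optional[int] = None,
                 srcs: Optional[AbstractSet[int]] = None,
                 dsts: Optional[AbstractSet[int]] = None) -> List[Comm]:
    """Keep communications matching tag, size range, source and destination."""
    return [c for c in comms
            if _ok(c.tag, tags)
            and c.size >= min_size
            and (max_size is None or c.size <= max_size)
            and _ok(c.src, srcs)
            and _ok(c.dst, dsts)]
```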

  7. Semantic module
     [Figure: evaluation tree with six fthread leaves (two threads per task), three ftask nodes, one fPtask node and fcomp1 at the root]
     • Semantic value: f(t)
       • $f = f_{\mathrm{comp2}} \circ f_{\mathrm{comp1}} \circ f_{\mathrm{Ptask}} \circ f_{\mathrm{task}} \circ f_{\mathrm{thread}}$
     • Semantic functions
       • fcomp2, fcomp1: sign, mod, div, in range
       • fPtask: add, average, max, select
       • ftask: add, average, max, select
       • fthread: in state, useful, given state, last event value, next event value, average next event value
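The composition can be sketched as nested aggregation: evaluate f_thread for every thread at time t, combine the thread values of each task with f_task, combine the task values with f_Ptask, then apply the compositional functions. The concrete choices below are assumptions for illustration: "useful" for f_thread (with a hypothetical state code for running), "add" for f_task, "average" for f_Ptask, and an "in range" that keeps the value inside [lo, hi] and returns 0 outside.

```python
from typing import Callable, List, Sequence, Tuple

Interval = Tuple[int, int, int]  # (time_start, time_end, state) of one thread
Thread = List[Interval]
Task = List[Thread]
Application = List[Task]         # one Ptask

RUNNING = 1  # hypothetical state code for "running" (useful work)


def thread_useful(intervals: Thread, t: int) -> float:
    """f_thread 'useful': 1.0 if the thread is running at time t, else 0.0."""
    for start, end, state in intervals:
        if start <= t < end:
            return 1.0 if state == RUNNING else 0.0
    return 0.0


def task_add(values: Sequence[float]) -> float:
    """f_task 'add': sum of the thread values of one task."""
    return float(sum(values))


def ptask_average(values: Sequence[float]) -> float:
    """f_Ptask 'average': mean of the task values."""
    return sum(values) / len(values) if values else 0.0


def in_range(lo: float, hi: float) -> Callable[[float], float]:
    """f_comp 'in range' (assumed semantics): keep x if lo <= x <= hi, else 0."""
    return lambda x: x if lo <= x <= hi else 0.0


def identity(x: float) -> float:
    return x


def semantic_value(app: Application, t: int,
                   f_thread: Callable[[Thread, int], float] = thread_useful,
                   f_task: Callable[[Sequence[float]], float] = task_add,
                   f_ptask: Callable[[Sequence[float]], float] = ptask_average,
                   f_comp1: Callable[[float], float] = identity,
                   f_comp2: Callable[[float], float] = identity) -> float:
    """f(t) = f_comp2(f_comp1(f_ptask([f_task([f_thread(th, t) ...]) ...])))."""
    task_values = [f_task([f_thread(th, t) for th in task]) for task in app]
    return f_comp2(f_comp1(f_ptask(task_values)))
```

With one task of two threads and one task of a single idle thread, a time at which only one thread of the first task runs yields 0.5: the first task adds up to 1, the second to 0, and their average is 0.5.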

  8. Visualization
     • Type of window
       • Ptask / Task / Thread: one row per object of the selected type
       • Object selection (scalability)
     • Representation
       • Color encoded / Gradient / Function of time
     • Multiple windows
       • Synchronised
     • Forward/backward animation
     • Precise time measurement
       • Within/between windows
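A gradient representation maps each semantic value onto a color scale. The sketch below linearly interpolates between two endpoint colors and clamps values outside the chosen scale; the function name and the default green-to-blue endpoints are arbitrary assumptions, not Paraver's actual palette. A color-encoded view would instead look up one discrete color per value (for example, per thread state).

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def gradient_color(value: float, vmin: float, vmax: float,
                   low: RGB = (0, 255, 0), high: RGB = (0, 0, 255)) -> RGB:
    """Linear color interpolation; values outside [vmin, vmax] are clamped."""
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    clamped = min(max(value, vmin), vmax)
    frac = (clamped - vmin) / (vmax - vmin)
    r, g, b = (round(lo_c + (hi_c - lo_c) * frac) for lo_c, hi_c in zip(low, high))
    return (r, g, b)
```

Clamping keeps outliers at the ends of the scale, so one extreme value does not wash out the contrast of the rest of the window.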

  9. Textual
     • Textual detail of the area around a point selected in a window
     • Semantic value and duration / flag / communication
     • Numeric values or text translated through the .pcf file

  10. Analysis
     • Time and object range selected by pointing on the window
     • Analysis function applied to the output of the semantic module:
       • Average semantic value
       • Average duration / variance / number of bursts (if within range)
       • Number of events
       • Number of communications
       • ...
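Two of the listed statistics can be sketched for a piecewise-constant semantic value stored as (start, end, value) bursts. The time-weighted average treats uncovered time as 0. Reading "if within range" as "bursts whose value lies in [lo, hi]", and counting only bursts that lie entirely inside the selected time window, are assumptions of this sketch; the function names are hypothetical.

```python
from statistics import pvariance
from typing import Sequence, Tuple

Burst = Tuple[float, float, float]  # (start, end, semantic value)


def average_semantic_value(bursts: Sequence[Burst], t0: float, t1: float) -> float:
    """Time-weighted mean of f(t) over [t0, t1]; time not covered by a burst counts as 0."""
    if t1 <= t0:
        raise ValueError("empty time range")
    total = 0.0
    for start, end, value in bursts:
        overlap = min(end, t1) - max(start, t0)  # length of the burst inside the window
        if overlap > 0:
            total += value * overlap
    return total / (t1 - t0)


def burst_stats(bursts: Sequence[Burst], t0: float, t1: float,
                lo: float, hi: float) -> Tuple[int, float, float]:
    """(count, mean duration, population variance) of bursts inside [t0, t1]
    whose value satisfies lo <= value <= hi."""
    durations = [float(end - start) for start, end, value in bursts
                 if t0 <= start and end <= t1 and lo <= value <= hi]
    if not durations:
        return 0, 0.0, 0.0
    return len(durations), sum(durations) / len(durations), pvariance(durations)
```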

  11. OpenMP instrumentation
     • Compiler instrumentation
       • NANOS compiler
     • Dynamic interception
       • SGI native OpenMP (MP library)
     • Tracing of thread status
       • running
       • idle (busy wait)
       • scheduling
       • blocked

  12. OpenMP analysis
     • Application structure
     • Stamping code

  13. OpenMP analysis
     • Loop scheduling
     • Antenna design

  14. OpenMP analysis

  15. OpenMP analysis How do bees see flowers?

  16. OpenMP analysis

  17. OpenMP analysis

  18. OpenMP analysis: What bees don’t see

      Function            A      B      C      D
      Av. L2 misses/ms    62     52     163    14
      FLOPS/ms            41K    21K    8K     1K
      Loads/ms            57K    52K    18K    100K
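The per-millisecond figures for functions A to D can be combined into per-load ratios, which makes the contrast between functions easier to see. The sketch below only rearranges the slide's numbers (the "K" values are rounded on the slide, so the ratios are approximate); the two derived metrics are illustrative choices, not ones the slide reports.

```python
# Per-millisecond values transcribed from the slide's table ("K" = 1,000).
METRICS = {
    "A": {"l2_misses": 62, "flops": 41_000, "loads": 57_000},
    "B": {"l2_misses": 52, "flops": 21_000, "loads": 52_000},
    "C": {"l2_misses": 163, "flops": 8_000, "loads": 18_000},
    "D": {"l2_misses": 14, "flops": 1_000, "loads": 100_000},
}


def derived(m: dict) -> dict:
    """Rate ratios: both numerator and denominator are per ms, so time cancels."""
    return {
        "flops_per_load": m["flops"] / m["loads"],
        "l2_misses_per_1k_loads": 1000 * m["l2_misses"] / m["loads"],
    }


for name, m in METRICS.items():
    d = derived(m)
    print(f"{name}: {d['flops_per_load']:.2f} FLOPs/load, "
          f"{d['l2_misses_per_1k_loads']:.2f} L2 misses per 1000 loads")
```

This prints about 9.06 L2 misses per 1000 loads for C against roughly 1 for A and B, while D shows the lowest arithmetic intensity (0.01 FLOPs per load) despite the highest load rate.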

  19. Static vs. Dynamic Parallelism

  20. More on hardware counters: fewer misses, more time

  21. More on hardware counters: more memory accesses per second, fewer coherence state changes

  22. MPI + OpenMP: NAS FT
     Quantitative data:
     • MPI collective communication: 18%
     • OpenMP fork/join: 5%
     • Non-parallelized: 32%
     • Average parallel loop duration: 50 ms
     • Number of parallel loops: 38
     • Parallel loops shorter than 5 ms: 6

  23. Other uses
     [Figure annotations: average 33 MFLOPS, peak 60 MFLOPS]
     • System activity
     • InfoPerfex
     • Pthreads

  24. Paraver on IBM
     • DPCL + PAPI:
       • Sequential programs
       • OpenMP
     • UTE:
       • MPI
       • MPI + OpenMP

  25. UTE → Paraver
     • Filter
     • Thread states:
       • Executing application code
       • Executing MPI Receive
       • Executing MPI Send
       • Descheduled
     • Statistics

  26. UTE analysis
     • Communication pattern
       • Exchanges: 1 ↔ 2; 3 ↔ 4
     • Load balance
       • More load on thread 1
     • MPI implementation
       • Busy wait on receives
     • Scheduling
       • Threads 2 and 3 time-share one CPU
       • Thread 4 time-shares one CPU with other processes
       • OS quantum: 10 ms

  27. More information http://www.cepba.upc.es/paraver cepbatools@cepba.upc.es
