
What to do in case of compiler analysis failure?


Presentation Transcript


  1. What to do in case of compiler analysis failure?
     Costin Iancu, Parry Husbands, Paul Hargrove
     Lawrence Berkeley National Laboratory

  2. Motivation
     Hiding communication latency: an important optimization for parallel programs.
     Non-blocking communication: overlap communication with computation or other communication; init_transfer()/sync_transfer() pair.
     Optimization strategies:
       - Schedule independent work between the init/sync pair (coarse-grained overlap).
       - Decompose the transfer and interleave communication and computation on the sub-transfers (fine-grained overlap).
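     As a rough illustration (not runtime code), a minimal UPC-style C sketch of the two strategies, reusing the generic init_read()/sync() names that appear on slide 5; independent_work(), consume() and CHUNK are hypothetical, and N is assumed to be a multiple of CHUNK:

       /* Coarse-grained overlap: schedule independent work between init and sync. */
       h = init_read(dest, src, N);        /* start the bulk transfer               */
       independent_work();                 /* anything that does not touch dest     */
       sync(h);                            /* block only when dest is really needed */
       consume(dest, N);

       /* Fine-grained overlap: decompose the transfer and interleave the
          communication for chunk i+1 with the computation on chunk i.              */
       h = init_read(dest, src, CHUNK);
       for (i = 0; i < N / CHUNK; i++) {
           sync(h);                                           /* chunk i has arrived */
           if (i + 1 < N / CHUNK)
               h = init_read(dest + (i+1)*CHUNK, src + (i+1)*CHUNK, CHUNK);
           consume(dest + i*CHUNK, CHUNK);                    /* compute on chunk i  */
       }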

  3. Optimizations
     The optimizer (compiler or programmer) needs to be able to identify "reschedulable" work.
     Limitations:
       - Static, "off-line" approach: optimization parameters (transfer size, computation) are dynamic, so overlap and access patterns cannot be estimated well.
       - Interleaving is hard: non-trivial program transformations; explicit communication management is cumbersome.
       - Practical: multi-language codes, third-party libraries, whole-program optimization.
     Enough reason to believe that applications contain unexploited overlap.
     What about run-time support?

  4. Run-Time Support
     The run-time is in charge of managing communication: transparently find and use the idle time (overlap) present in applications.
     Finding Neverland? (Not on clusters...)
       - Critical Word First (CWF): data transferred in the order used by the application.
       - Immediate Application Delivery (IAD): data delivered to the CPU as soon as it arrives.
       - Demand Driven Completion (DDS): sync point induced by the application's data usage.
     Building Neverland (mapping to current NIC/OS services):
       - Demand Driven Completion: use virtual memory support.
       - Critical Word First / Immediate Application Delivery: communication decomposition and scheduling.

  5. Demand Driven Completion (DDS)
     User-level implementation using existing OS and NIC mechanisms.
     Ignore explicit sync calls; use virtual memory support for implicit completion.

     Original:
       h = init_read(dest, src, N);
       ...
       sync(h);
       for (i = 0; i < N; i++)
           ... = dest[i] ...;

     With DDS:
       h = init_read(dest, src, N);
       mprotect(dest, N, PROT_NONE);
       ...
       sync(h);                       /* explicit sync ignored by the runtime */
       for (i = 0; i < N; i++)
           ... = dest[i] ...;

       segfault() {
           mprotect(dest, N, PROT_READ | PROT_WRITE);
           sync(h);
       }

     Execution trace:
       original: init -> sync -> compute
       DDS:      init -> mprotect -> start compute -> segfault -> sync -> compute
     DDS adds runtime overhead: mprotect and segfault.
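     For concreteness, a minimal, self-contained sketch of the OS mechanism behind DDS (plain POSIX C); the single-region bookkeeping and the point where the real runtime would call sync(h) are simplifying assumptions, not the actual UPC runtime code:

       #define _POSIX_C_SOURCE 200809L
       #include <signal.h>
       #include <string.h>
       #include <stddef.h>
       #include <sys/mman.h>

       /* One outstanding, protected transfer; prot_base must be page-aligned and
          prot_len a multiple of the page size (mprotect requirement).            */
       static void  *prot_base;
       static size_t prot_len;

       static void segv_handler(int sig, siginfo_t *info, void *ctx) {
           (void)sig; (void)ctx;
           char *fault = (char *)info->si_addr;
           if (fault >= (char *)prot_base && fault < (char *)prot_base + prot_len) {
               /* The application touched undelivered data: complete delivery now. */
               mprotect(prot_base, prot_len, PROT_READ | PROT_WRITE);
               /* ... the real runtime would call sync(h) here ...                 */
               return;
           }
           signal(SIGSEGV, SIG_DFL);       /* not ours: fall back to default       */
       }

       static void install_dds_handler(void) {
           struct sigaction sa;
           memset(&sa, 0, sizeof sa);
           sa.sa_sigaction = segv_handler;
           sa.sa_flags = SA_SIGINFO;
           sigemptyset(&sa.sa_mask);
           sigaction(SIGSEGV, &sa, NULL);
       }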

  6. Application Level Data Delivery
     Mediate between networking-layer semantics and application-level semantics: approximate CWF and IAD using DDS.
     Combine optimizations:
       - Communication decomposition (strip-mining): create opportunities for finer-grained overlap between communication and computation on bulk application-level data transfers.
       - Communication scheduling: maintain a global view of outstanding communication operations and retire operations whenever possible (CPU idle, barriers, ...).
     The implementation opportunistically adds execution-time overhead.
     Performance is determined by hardware and application characteristics.

  7. Hardware Parameters
     Performance parameters:
       - communication initiation overhead (o): ~2 µs to 5 µs
       - inverse bandwidth (G): ~1.2 µs/KB to 4.7 µs/KB
       - network round-trip time (RTT): ~8 µs to 25 µs
       - processor interrupt time (I): ~2 µs to 37 µs
       - mprotect time (1 page): ~1 µs
     Transfer time is the dominant component.

  8. Example
       for (i = 0; i < THREADS; i++)
           upc_memget(dest[i], src[i], N);
       for (i = 0; i < THREADS; i++)
           for (j = 0; j < N; j++)
               ... = dest[i][j] ...;

     1: upc_memget(dest[0], src[0], N)
     2: upc_memget(dest[1], src[1], N)
     decompose? NO; piggyback? YES
     [Diagram: messages m1 | m2 | m3 | m4 per thread, threads p1 ... pT, piggyback retirement across message boundaries]
     segfault on m1: retire m1, try to retire m2, protect the m2 boundary ...
     segfault on m2: retire m2, try to retire m3, retire ahead, protect the p1 boundary ...
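     A hedged sketch of the retirement-with-piggybacking step for this example; the wait_done()/test_done() completion calls, the fixed message count, and the buffer layout are illustrative stand-ins for the runtime's bookkeeping:

       #include <stddef.h>
       #include <sys/mman.h>

       #define NMSGS 4                      /* m1 .. m4 in the diagram above       */
       extern char  *dest_buf[NMSGS];       /* page-aligned destination buffers    */
       extern size_t msg_len;               /* bytes per message (page multiple)   */
       extern int    test_done(int i);      /* non-blocking completion test        */
       extern void   wait_done(int i);      /* blocking completion (sync)          */

       /* Called from the fault handler when message k's data is first touched.    */
       void retire_from(int k) {
           wait_done(k);                                    /* must complete now   */
           mprotect(dest_buf[k], msg_len, PROT_READ | PROT_WRITE);
           /* Piggyback: greedily retire later messages that already completed,
              instead of paying one fault per message.                             */
           while (k + 1 < NMSGS && test_done(k + 1)) {
               k++;
               mprotect(dest_buf[k], msg_len, PROT_READ | PROT_WRITE);
           }
           /* The region of the next unretired message stays protected, so the
              next access to it faults and repeats the process.                    */
       }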

  9. Matching the Application Behavior
     Heuristics to match application behavior: FLOPS/byte, application-level blocking vs. non-blocking, multi-message, multiplexed streams.
     Static vs. dynamic:
       - "Static" performance model for worst-case scenarios.
       - Allow for dynamic behavior: heuristics change based on history or user control.
     Main heuristics:
       - Decomposition strategies (Iancu et al., Message Strip-Mining Heuristics, VECPAR 04):
           static strip size (precomputed tables or generating functions)
           dynamic strip size (multiplicative increase)
       - Message scheduling: multi-message, piggyback distance.
     For optimal performance: an event-ordering contract between application and run-time. Correctness: enforce/change the contract.
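     A minimal sketch of the dynamic (multiplicative-increase) strip-size heuristic named above; the initial strip size and the factor of 2 are illustrative choices, not the tuned values from the strip-mining paper:

       #include <stddef.h>

       /* Decompose an n-byte transfer into strips that grow multiplicatively:
          the first strip arrives quickly (fine-grained overlap), later strips
          amortize the per-message initiation overhead o.                        */
       void decompose(size_t n, void (*issue)(size_t off, size_t len)) {
           size_t off = 0, strip = 4096;            /* illustrative initial strip */
           while (off < n) {
               size_t len = strip < n - off ? strip : n - off;
               issue(off, len);                     /* e.g. init_read on a chunk  */
               off   += len;
               strip *= 2;                          /* multiplicative increase    */
           }
       }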

  10. Implementation
     Runtime for Unified Parallel C (UPC), which uses one-sided communication (http://upc.lbl.gov).
     Preserve application-level data consistency: inter-thread synchronization events need to take the modified behavior into account.
     Performance:
       - Specific vs. generic (lists vs. interval trees).
       - Static vs. dynamic (tune statically, allow for dynamic adjustment).
     Programmability:
       - Expose heuristics with a simple interface (application hints): set_piggy_thresh(n), set_decomp(f), set_multiplex().
       - Tools for application behavior extraction.
     Portability: implementation tuned for a large class of networks (Quadrics, Myrinet, InfiniBand).
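     A sketch of how the hint interface might be used around a communication phase; the three calls are the ones named above, but their arguments and exact semantics are assumptions for illustration:

       /* Hint the runtime about the upcoming gather phase (illustrative only). */
       set_decomp(decomp_dynamic);        /* strip-mine bulk transfers           */
       set_piggy_thresh(4);               /* retire up to 4 messages per fault   */
       set_multiplex();                   /* expect multiplexed streams          */

       for (i = 0; i < THREADS; i++)
           upc_memget(dest[i], src[i], N);   /* runtime decomposes and schedules */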

  11. Evaluation
     CPU/NIC/OS combinations:
       CPU: Opteron / Itanium / Xeon / PPC970 / Alpha
       NIC: Quadrics / InfiniBand / Myrinet
       OS:  Linux / Tru64 / OS X
     Benchmarks:
       - Worst-case-scenario micro-benchmarks, also used for off-line performance tuning.
       - Application kernels: NAS FT/IS/MG, class B (http://upc.gwu.edu).
     Kernels chosen for scenario coverage:
       - MG: point-to-point communication with varying sizes.
       - FT/IS: gather/all-to-all with hard-to-match communication/application access patterns -> scalability?

  12. DDS Overhead
     Benchmark variants:
       Blocking:     upc_memget(); upc_memget()
       Non-blocking: upc_nb_memget(); upc_nb_memget(); sync2()
       DDS:          upc_memget(); upc_memget()
     Interrupt overhead directly influences performance.
     In general, I/RTT + 1 independent messages are needed for DDS to pay off.
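     As a rough illustration with the slide-7 values: on the slow end (I ~ 37 µs, RTT ~ 25 µs), I/RTT + 1 ~ 2.5, so about three independent messages must be outstanding before DDS pays off; on the fast end (I ~ 2 µs, RTT ~ 8 µs), I/RTT + 1 ~ 1.25, so a second outstanding message already covers the interrupt cost.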

  13. Message Decomposition
     Benchmark: upc_memget(dst, src, N); memcpy(local, dst, N);  (worst-case scenario)
     Transfer-size threshold for decomposition payoff: 30 KB to 200 KB.
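     A sketch of the decomposed version of this benchmark, overlapping the copy of one strip with the fetch of the next; the strip size and the init_read()/sync() names (borrowed from slide 5) are illustrative, and pointer arithmetic on the remote source is shown schematically:

       /* Baseline:   upc_memget(dst, src, N);  memcpy(local, dst, N);            */
       /* Decomposed: fetch strip i+1 while memcpy'ing strip i.                   */
       size_t S = 64 * 1024;                 /* illustrative strip size           */
       size_t off, cur, nxt;
       h = init_read(dst, src, S < N ? S : N);
       for (off = 0; off < N; off += cur) {
           cur = S < N - off ? S : N - off;
           sync(h);                          /* strip starting at 'off' arrived   */
           if (off + cur < N) {
               nxt = S < N - (off + cur) ? S : N - (off + cur);
               h = init_read((char *)dst + off + cur, src + off + cur, nxt);
           }
           memcpy((char *)local + off, (char *)dst + off, cur);
       }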

  14. Application (NAS FT)
     Variants:
       FT:    blocking
       FT-nb: non-blocking gather
       FT-md: blocking with message decomposition
       FT-H:  message decomposition & scheduling
     Message scheduling improves scalability.

  15. Discussion
     Do real programs exhibit the "proper" characteristics?
       - Message sizes are bimodal (small, large) (Vetter et al., Communication Characteristics of Large Scale Scientific Applications for Contemporary Cluster Architectures, IPDPS 2002; Oliker et al., Analyzing Ultra-Scale Application Requirements for a Reconfigurable Hybrid Interconnect, SC05).
       - "Sequential" frame of mind of parallel programmers: monotonic access sequences (Paek et al., Simplification of Array Access Patterns for Compiler Optimizations, PLDI 98).
     Generality of implementation and performance tuning: OS and NIC interfaces do not expose enough functionality.
     Open problems:
       - Impact of page granularity and false sharing.
       - Irregular access patterns.
       - Using application-level hints? (decomposition and scheduling, multiplexed streams)
       - Is simple good enough?

  16. Related Work
     Program optimizations:
       - "HPF" communication optimizations: aggregation and coalescing.
       - Communication scheduling: trade-off between favoring overlap and reducing the number of messages and contention; NP-hard (Chakrabarti et al., Global Communication Analysis and Optimization, PLDI 96).
     DSMs and subpages: use VM for enforcing consistency; a generic solution that also addresses false sharing (MultiView/Millipede).
     Intelligent runtimes:
       - Charm: message-driven execution, schedules threads around communication; the interplay between decomposition granularity and the latency hidden is unclear.
       - MPI early release (Ke et al., Tolerating Message Latency Through the Early Release of Blocked Receives, Euro-Par 05).

  17. Conclusion
     Promising performance results; I < RTT is the determining factor and is likely to hold true in next-generation hardware.
     Programmability and portability:
       - Manage communication and synchronization on behalf of the user.
       - Exploit the dynamic behavior of applications.
       - Add non-blocking behavior to legacy applications.
       - Tune the run-time instead of tuning the application.
     Should non-blocking communication be a first-class citizen of programming languages?

  18. The End!

  19. Common idiom: performance = bulk
       - Previously: aggregate (increase communication granularity).
       - Now: aggregation and interleaving.

  20. Minimizing Overhead
     The performance parameter values (RTT, G, o, I, mprotect) are outside the implementation's control and are often expensive operations.
     Need to minimize the additional work (segfaults, mprotects): optimize for the application's data access pattern.
     Dynamic behavior: 50% of operations are dynamic or dynamically analyzable (Faraj et al., Communication Characteristics in the NAS Parallel Benchmarks, PDCS 2002).
     Optimize the implementation:
       - mprotect = segfault: "wait and see" approach.
       - Lazy protection: protect only boundaries, and only for the immediate future.
       - Completion piggybacking: greedy retirement of communication operations.
