What to do in case of compiler analysis failure? Costin Iancu, Parry Husbands, Paul Hargrove Lawrence Berkeley National Laboratory
Motivation
• Hiding communication latency is an important optimization for parallel programs.
• Non-blocking communication: overlap communication with computation or with other communication; init_transfer()/sync_transfer() pair.
• Optimization strategies:
  - Schedule independent work between the init/sync pair (coarse-grained overlap).
  - Decompose the transfer and interleave communication and computation on the sub-transfers (fine-grained overlap); both strategies are sketched below.
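For concreteness, a minimal sketch of the two strategies, using the init_transfer()/sync_transfer() names from the slide; handle_t, STRIP, independent_work() and consume() are illustrative assumptions, not an actual runtime API.

#define STRIP (64 * 1024)                      /* sub-transfer size: an assumption */

void coarse_grained(char *dst, char *src, size_t n) {
    handle_t h = init_transfer(dst, src, n);   /* start non-blocking transfer */
    independent_work();                        /* work that does not touch dst */
    sync_transfer(h);                          /* complete before first use */
    consume(dst, n);
}

void fine_grained(char *dst, char *src, size_t n) {
    for (size_t off = 0; off < n; off += STRIP) {
        size_t len = (n - off < STRIP) ? (n - off) : STRIP;
        handle_t h = init_transfer(dst + off, src + off, len);
        /* ... compute on data that arrived in earlier strips ... */
        sync_transfer(h);
        consume(dst + off, len);               /* use this sub-transfer */
    }
}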
Optimizations
• The optimizer (compiler or programmer) needs to be able to identify "reschedulable" work.
• Limitations:
  - Static "off-line" approach: optimization parameters (transfer size, computation) are dynamic; overlap and access patterns cannot be estimated well.
  - Interleaving is hard: non-trivial program transformations, and explicit communication management is cumbersome.
  - Practical issues: multiple languages, third-party libraries, whole-program optimization.
• Enough reason to believe that applications contain unexploited overlap. What about run-time support?
Run-Time Support
• Run-time in charge of managing communication: transparently find and use the idle time (overlap) present in applications.
• Finding Neverland? (Not on clusters...)
  - Critical Word First (CWF): data transferred in the order it is used by the application.
  - Immediate Application Delivery (IAD): data delivered to the CPU as soon as it arrives.
  - Demand Driven Completion (DDS): sync point induced by the application's data usage.
• Building Neverland (mapping to current NIC/OS services): approximate CWF and IAD through communication decomposition and scheduling; implement DDS using virtual memory support.
Demand Driven Completion (DDS)
• User-level implementation using existing OS and NIC mechanisms.
• Ignore explicit sync calls; use virtual memory support for implicit completion.

Explicit sync:
  h = init_read(dest, src, N);
  ...
  sync(h);
  for(i = 0; i < N; i++) ... = dest[i] ...;
Execution trace: init -> sync -> compute

DDS (implicit completion):
  h = init_read(dest, src, N);
  mprotect(dest, N, PROT_NONE);
  ...
  sync(h);                       /* explicit sync ignored by the run-time */
  for(i = 0; i < N; i++) ... = dest[i] ...;   /* first touch faults */

  segfault() { mprotect(dest, N, PROT_ALL); sync(h); }
Execution trace: init -> mprotect -> start compute -> segfault -> sync -> compute

• DDS adds runtime overhead: mprotect and segfault. A handler-installation sketch follows.
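A minimal, hedged sketch of how such a user-level handler could be installed with sigaction/mprotect; this is not the authors' implementation, and transfer_t, lookup_transfer() and runtime_sync() are hypothetical placeholders. The protected buffers are assumed to be page-aligned.

#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

typedef struct { void *buf; size_t len; int handle; } transfer_t;
transfer_t *lookup_transfer(void *addr);   /* placeholder: faulting address -> transfer */
void runtime_sync(int handle);             /* placeholder: complete the transfer */

static void dds_handler(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    transfer_t *t = lookup_transfer(si->si_addr);
    if (t == NULL) abort();                          /* a real segfault, not DDS */
    runtime_sync(t->handle);                         /* complete the communication */
    mprotect(t->buf, t->len, PROT_READ | PROT_WRITE); /* re-enable access; access is retried */
}

static void dds_install(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = dds_handler;
    sigaction(SIGSEGV, &sa, NULL);
}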
Application Level Data Delivery
• Mediate between networking-layer semantics and application-level semantics: approximate CWF and IAD using DDS.
• Combine optimizations:
  - Communication Decomposition (strip-mining): creates the opportunity for finer-grained overlap between communication and computation on bulk application-level data transfers.
  - Communication Scheduling: maintain a global view of outstanding communication operations and retire operations whenever possible (CPU idle, barriers, ...); a bookkeeping sketch follows below.
• The implementation opportunistically adds execution-time overhead.
• Performance is determined by hardware and application characteristics.
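A hedged sketch of the scheduling bookkeeping described above: a global list of outstanding (sub-)transfers that the run-time tries to retire at convenient points. op_t, try_complete() and the list layout are assumptions for illustration, not the actual UPC runtime data structures; protected buffers are assumed page-aligned.

#include <stdlib.h>
#include <sys/mman.h>

typedef struct op {
    int        handle;     /* network-level handle of the sub-transfer */
    void      *buf;        /* destination region still under protection */
    size_t     len;
    struct op *next;
} op_t;

static op_t *outstanding = NULL;           /* global view of in-flight operations */
int try_complete(int handle);              /* placeholder: non-blocking completion test */

/* Called at idle points and at barriers: retire whatever has completed. */
static void retire_ready(void) {
    op_t **p = &outstanding;
    while (*p) {
        if (try_complete((*p)->handle)) {
            mprotect((*p)->buf, (*p)->len, PROT_READ | PROT_WRITE); /* make data visible */
            op_t *done = *p;
            *p = done->next;
            free(done);
        } else {
            p = &(*p)->next;
        }
    }
}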
Hardware Parameters
Performance parameters:
• communication initiation overhead (o): ~2 µs to 5 µs
• inverse bandwidth (G): ~1.2 µs/KB to 4.7 µs/KB
• network round-trip time (RTT): ~8 µs to 25 µs
• processor interrupt time (I): ~2 µs to 37 µs
• mprotect time (1 page): ~1 µs
Transfer time is the dominant component.
Example

  for(i = 0; i < THREADS; i++)
      upc_memget(dest[i], src[i], N);
  for(i = 0; i < THREADS; i++)
      for(j = 0; j < N; j++)
          ... = dest[i][j] ...;

• The loop issues one bulk get per thread: 1: upc_memget(dest[0], src[0], N); 2: upc_memget(dest[1], src[1], N); ... (messages m1 | m2 | m3 | m4 ... spanning pages p1 ... pT). Decompose? NO; piggyback? YES.
• Segfault on m1: retire m1, try to retire m2, protect the m2 boundary, ...
• Segfault on m2: retire m2, try to retire m3, retire ahead, protect the p1 boundary, ...
Matching the Application Behavior
• Heuristics to match application behavior: FLOPS/byte, application-level blocking vs. non-blocking, multi-message, multiplexed streams.
• Static vs. dynamic:
  - "Static" performance model for worst-case scenarios.
  - Allow for dynamic behavior: heuristics change based on history or user control.
• Main heuristics:
  - Decomposition strategies (Iancu et al., Message Strip-Mining Heuristics, VECPAR 04): static strip size (precomputed tables or generating functions) or dynamic strip size (multiplicative increase); see the sketch below.
  - Message scheduling: multi-message, piggyback distance.
• For optimal performance: an event-ordering contract between application and run-time. Correctness: enforce/change the contract.
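As a rough illustration of the dynamic (multiplicative-increase) strip-size heuristic, a hedged sketch; the initial size, growth factor and cap are assumptions, not the tuned values from the VECPAR 04 paper. init_read()/sync() are the pseudocode names used on the DDS slide.

#define STRIP_MIN   (8 * 1024)      /* first strip: 8 KB (assumption) */
#define STRIP_MAX (256 * 1024)      /* cap on strip size (assumption) */
#define GROWTH      2               /* multiplicative increase factor */

static void strip_mined_get(char *dst, char *src, size_t n) {
    size_t off = 0, strip = STRIP_MIN;
    while (off < n) {
        size_t len = (n - off < strip) ? (n - off) : strip;
        int h = init_read(dst + off, src + off, len);   /* small strip arrives early */
        /* ... overlap: compute, or issue/retire other transfers ... */
        sync(h);
        off  += len;
        strip = (strip * GROWTH > STRIP_MAX) ? STRIP_MAX : strip * GROWTH;
    }
}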
Implementation
• Runtime for Unified Parallel C (UPC), which uses one-sided communication (http://upc.lbl.gov).
• Preserve application-level data consistency: inter-thread synchronization events need to take the modified behavior into account.
• Performance:
  - Specific vs. generic (lists vs. interval trees).
  - Static vs. dynamic (tune statically, allow for dynamic adjustment).
• Programmability:
  - Expose the heuristics through a simple interface (application hints): set_piggy_thresh(n), set_decomp(f), set_multiplex(); usage is sketched below.
  - Tools for application behavior extraction.
• Portability: implementation tuned for a large class of networks (Quadrics, Myrinet, Infiniband).
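A hedged usage sketch of the hint interface named on the slide (set_piggy_thresh, set_decomp, set_multiplex); the argument types, argument meanings and surrounding code are assumptions for illustration only.

void communication_phase(void) {
    set_piggy_thresh(4);         /* assumed meaning: retire up to 4 messages ahead on a fault */
    set_decomp(strip_for_phase); /* hypothetical per-phase decomposition function */
    set_multiplex();             /* hint: multiple streams are multiplexed in this phase */

    /* ... communication-intensive code runs with the hints applied ... */
}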
Evaluation
• CPU/NIC/OS combinations:
  - CPU: Opteron / Itanium / Xeon / PPC970 / Alpha
  - NIC: Quadrics / Infiniband / Myrinet
  - OS: Linux / Tru64 / OSX
• Benchmarks:
  - Worst-case-scenario micro-benchmarks, also used for off-line performance tuning.
  - Application kernels: NAS FT/IS/MG, class B (http://upc.gwu.edu).
• Kernels chosen for scenario coverage:
  - MG: point-to-point communication with varying sizes.
  - FT/IS: gather/all-to-all with hard-to-match communication/application access patterns -> scalability?
DDS Overhead
Benchmark variants:
• Blocking:      upc_memget(); upc_memget();
• Non-blocking:  upc_nb_memget(); upc_nb_memget(); sync2();
• DDS:           upc_memget(); upc_memget();   (completion deferred by the run-time)
Interrupt overhead directly influences performance.
In general: I/RTT + 1 independent messages are needed for DDS to pay off.
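For a rough sense of scale, plugging in the ranges from the Hardware Parameters slide: with the worst-case values I ≈ 37 µs and RTT ≈ 25 µs, I/RTT + 1 ≈ 2.5, so roughly three independent messages are needed before DDS pays off; with I ≈ 2 µs and RTT ≈ 8 µs, I/RTT + 1 ≈ 1.25, so two messages already suffice. These are back-of-the-envelope estimates from the stated parameters, not measured results.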
Message Decomposition
Benchmark (worst-case scenario): upc_memget(dst, src, N); memcpy(local, dst, N);
Transfer-size threshold for decomposition payoff: 30 KB to 200 KB.
Application (NAS FFT)
• FT: blocking
• FT-nb: gather
• FT-md: blocking with message decomposition
• FT-H: message decomposition & scheduling
Message scheduling improves scalability.
Discussion
• Do real programs exhibit the "proper" characteristics?
  - Message sizes are bimodal (small, large) (Vetter et al., Communication Characteristics of Large Scale Scientific Applications for Contemporary Cluster Architectures, IPDPS 2002; Oliker et al., Analyzing Ultra-Scale Application Requirements for a Reconfigurable Hybrid Interconnect, SC05).
  - "Sequential" frame of mind of parallel programmers: monotonic access sequences (Paek et al., Simplification of Array Access Patterns for Compiler Optimizations, PLDI 98).
• Generality of implementation and performance tuning: OS and NIC interfaces do not expose enough functionality.
• Open problems:
  - Impact of page granularity and false sharing.
  - Irregular access patterns.
  - Using application-level hints? (decomposition and scheduling, multiplexed streams) Is simple good enough?
Related Work
• Program optimizations:
  - "HPF" communication optimizations: aggregation and coalescing.
  - Communication scheduling: trade-off between favoring overlap and reducing messages/contention; NP-hard (Chakrabarti et al., Global Communication Analysis and Optimization, PLDI 96).
• DSMs and subpages: use VM to enforce consistency; generic solution; addresses false sharing (Multiview/Millipede).
• Intelligent runtimes:
  - Charm: message-driven execution, schedules threads around communication; the interplay between decomposition granularity and the latency hidden is not clear.
  - MPI early release (Ke et al., Tolerating Message Latency Through the Early Release of Blocked Receives, Euro-Par 05).
Conclusion
• Promising performance results; I < RTT is the determining factor and is likely to hold true in next-generation hardware.
• Programmability and portability:
  - Manage communication and synchronization on behalf of the user.
  - Exploit the dynamic behavior of applications.
  - Add non-blocking behavior to legacy applications.
  - Tune the run-time instead of tuning the application.
• Should non-blocking communication be a first-class citizen of programming languages?
Common idiom: performance = bulk
• Previously: aggregate (increase communication granularity)
• Now: aggregation and interleaving
Minimizing Overhead
• Performance parameter values (RTT, G, o, I, mprotect) are outside the implementation's control and are often expensive operations.
• Need to minimize the additional work (segfaults, mprotects): optimize for the application's data access pattern.
• Dynamic behavior: 50% of operations are dynamic or dynamically analyzable (Faraj et al., Communication Characteristics in the NAS Parallel Benchmarks, PDCS 2002).
• Optimize the implementation: mprotect = segfault, wait-and-see approach.
  - Lazy Protection: protect only boundaries, protect only for the immediate future; see the sketch below.
  - Completion Piggybacking: greedy retirement of communication operations.
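A hedged sketch of the lazy-protection idea: instead of protecting every outstanding buffer in full, protect only the first page (the boundary) of the next region the application is expected to touch, so mprotect/segfault work stays proportional to the regions actually reached. PAGE, region_t and the helper names are illustrative assumptions.

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE 4096u   /* assumed page size; a real runtime would query sysconf(_SC_PAGESIZE) */

typedef struct { void *buf; size_t len; } region_t;

static void *page_of(void *p) {              /* round down to a page boundary */
    return (void *)((uintptr_t)p & ~(uintptr_t)(PAGE - 1));
}

static void protect_boundary(const region_t *r) {
    /* Protect only the first page of the region: the first touch there faults,
     * giving the run-time a chance to retire this transfer and, with
     * piggybacking, greedily retire the following ones as well. */
    mprotect(page_of(r->buf), PAGE, PROT_NONE);
}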