4.1.2. Run time
• Technology drivers
  • Scale; variance (uncertainty in the characterization of applications and of resource availability); heterogeneity of resources; hierarchical structure of systems and applications; latency
• Alternative R&D strategies
  • Flat vs. hierarchical run times
• Recommended research agenda
  • Heterogeneity, data transfer, scheduling
  • Hierarchical (multiple levels) vs. flat? Hybrid (interaction between levels)?
  • Asynchrony: run-time dependence analysis, JIT compilation, interaction with the compiler
  • Scheduling: dynamic, predictive
  • Basic mechanisms for thread handling and communication: reduce overhead and latency (interaction with the architecture)
  • Optimize usage of the communication infrastructure: routes, mapping, overlap of communication and computation (see the sketch after this list)
  • Scheduling for parallel efficiency: computation time, load balance, granularity control, malleability
  • Scheduling for memory efficiency: locality handling, shared address space, memory management
  • Application/area-specific run times
• Crosscutting considerations
  • Resilience: the run time implements fine-grain mechanisms and fires coarse-grain mechanisms
  • Power management: drive hardware knobs, interact with the job scheduler
  • Performance: run-time instrumentation; report suggestions to the user; interact with the job scheduler
  • Programmability
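To make the communication/computation overlap item concrete, here is a minimal sketch (an illustrative assumption, not a design from this roadmap; the function name and 1-D relaxation stencil are made up) of how nonblocking MPI lets a run time hide halo-exchange latency behind independent work:

```c
#include <mpi.h>

/* Hypothetical one-sided halo exchange for a 1-D relaxation sweep. */
void relax_step(double *u, int n, int peer)
{
    double halo;
    MPI_Request req[2];

    /* Start the halo exchange early... */
    MPI_Irecv(&halo, 1, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(&u[n - 1], 1, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req[1]);

    /* ...and compute on interior points that need no remote data,
       hiding communication latency behind useful work. */
    for (int i = 1; i < n - 1; i++)
        u[i] = 0.5 * (u[i - 1] + u[i + 1]);

    /* Only the boundary update has to wait for the exchange. */
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    u[n - 1] = 0.5 * (u[n - 2] + halo);
}
```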
Heterogeneity
• Key challenges
  • Support execution of the same program on different heterogeneous platforms
  • Optimize resource utilization and execution time
  • Different granularities supported by the platforms
• Summary of research direction
  • Unified/transparent accelerator run-time models
  • Address the heterogeneity of nodes and interconnects in a cluster
  • Scheduling for latency tolerance and bandwidth minimization
  • Adaptive granularity
• Potential impact on software components and on usability, capability, and breadth of community
  • Broaden the portability of programs
  • Hide the specificities of accelerators from the programmer
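A minimal sketch of what a unified/transparent accelerator model can look like today, assuming OpenMP target offload as a stand-in for the research direction above (the `saxpy` routine is a made-up example): the same source offloads to an accelerator when one is present and falls back to the host otherwise.

```c
#include <omp.h>

void saxpy(int n, float a, const float *x, float *y)
{
    /* if(target: ...) lets the run time keep the loop on the host when no
       device is present, so one source serves both kinds of platform. */
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n]) \
            if(target: omp_get_num_devices() > 0)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```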
Load balance
• Key challenges
  • Adapt to variability in time and space (across processes) of applications and systems
  • Optimize resource utilization, reduce execution time
• Summary of research direction
  • General-purpose, self-tuned run times: detect imbalance and reallocate resources (cores, storage, DVFS, bandwidth, …) within and across levels
  • Application-specific load-balancing run times
  • Minimize the impact of temporary resource shortages (OS noise, external urgent needs, …)
• Potential impact on software components and on usability, capability, and breadth of community
  • Drastically reduce the effort needed to ensure efficient resource utilization, letting programmers focus on functionality
  • Only use resources that can be profitably used
  • Maximize the ratio of achieved performance to power
• Timeframe: 5 years (self-tuned run times)
• Crosscuts: performance analysis, job scheduling
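The detect-and-reallocate loop of a self-tuned run time might look like the following minimal sketch (purely illustrative; the `rebalance` function, the work-unit abstraction, and the 10% threshold are assumptions, not part of the roadmap): measure per-process iteration time, compare against the global average, and shed work proportionally.

```c
#include <mpi.h>

/* Returns the new number of work units this process should own. */
int rebalance(double t_local, int my_units, MPI_Comm comm)
{
    double t_avg;
    int nprocs;

    MPI_Comm_size(comm, &nprocs);
    MPI_Allreduce(&t_local, &t_avg, 1, MPI_DOUBLE, MPI_SUM, comm);
    t_avg /= nprocs;

    /* More than 10% slower than average: shed work in proportion to the
       measured imbalance; the caller migrates the dropped units. */
    if (t_local > 1.10 * t_avg)
        my_units = (int)(my_units * (t_avg / t_local));

    return my_units;
}
```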
Flat model
• Key challenges
  • Resource requirements (computing power, memory, network) of the run-time implementation itself
  • Overcome limitations deriving from globally synchronizing calls (barriers, collectives, …)
  • Optimize usage of communication resources
• Summary of research direction
  • Keep memory requirements small and constant
  • Thread-based MPI (one rank per thread)
  • Introduction of high levels of asynchrony: MPI collectives, APGAS, data flow, … (see the sketch below)
  • Adapt communication subsystems (routing, mapping, RDMA, …) to application characteristics
  • Improve the performance of basic process-management and synchronization mechanisms
• Potential impact on software components and on usability, capability, and breadth of community
  • Increased scalability
  • MPI: leverage current applications
• Timeframe: 2-5 years
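One concrete instance of the asynchrony this agenda calls for is the nonblocking collectives introduced in MPI-3. A minimal sketch (the `dot_with_overlap` routine and its follow-up work are made-up examples): a reduction proceeds in the background instead of acting as a global synchronization point.

```c
#include <mpi.h>

double dot_with_overlap(const double *x, double *y, int n)
{
    double local = 0.0, global;
    MPI_Request req;

    for (int i = 0; i < n; i++)
        local += x[i] * y[i];

    /* Nonblocking collective: no global synchronization point here. */
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    for (int i = 0; i < n; i++)       /* work that does not need the sum */
        y[i] *= 0.99;

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    return global;
}
```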
Hierarchical/hybrid model
• Key challenges
  • Match between the model semantics at the different levels
  • Match the platform structure; efficient usage of resources
  • Constrain the size of name spaces
• Summary of research direction
  • Hierarchical integration of run times (MPI+PGAS, MPI+threads+accelerator, MPI+accelerator, PGAS+accelerator, …); see the sketch below
  • Modularity, reusability, library compatibility
  • Dimensioning of processes/threads; scheduling, mapping to nodes
  • Memory placement and thread affinity
• Potential impact on software components and on usability, capability, and breadth of community
  • Better match to the hardware (e.g., shared memory within a node)
  • Interaction with load balancing and job scheduling
  • Enable a smooth migration path
  • Improved performance
• Timeframe: 5 years
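A minimal sketch of hierarchical run-time integration in its most common form today, MPI across nodes plus OpenMP threads within a node (an assumed example, not a prescription from the roadmap): the requested thread-support level is the contract that governs how the two run times interact.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank;

    /* MPI_THREAD_FUNNELED: threads exist, but only the master thread
       calls MPI -- the usual contract for MPI+OpenMP codes. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    printf("rank %d: thread %d of %d\n",
           rank, omp_get_thread_num(), omp_get_num_threads());

    MPI_Finalize();
    return 0;
}
```

The "memory placement and thread affinity" item can then be addressed at launch time, for example with the standard OpenMP controls OMP_PLACES=cores and OMP_PROC_BIND=close.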
4.1.2. Run time
[Timeline figure, 2010-2019: roadmap bars for dynamic memory association(?), resilience, scheduling for locality, asynchrony/overlap, hierarchy, load balance, memory efficiency, heterogeneity, power management, and job scheduling.]
4.1.2. Run time: what & why Assume responsibility for matching algorithm characteristics/demands to available resources, optimizing their usage So that we can finally rest Run times performing, dynamic memory association (work arrays, renaming,…), tolerating functional noise, Machines will fail more than we do The alternative will be to use current machines Run times tolerating injection rate of 10 errors/hour Demonstration that automatic locality aware scheduling can get a factor on 5x in highly NUMA memory hierarchies Dynamicity, decoupling algorithm form resources A target if we want to get there General purpose Run time automatically achieving load balance, optimized network usage, power minimization, malleability, tolerance to performance noise, … on heterogeneous system Demonstrate that asynchrony can get for both flat and hybrid systems 3x strong scalability Run the “same” source on 2 different heterogeneoussystems Do it for a couple of kernels and real applications By this time EVERYBODY will be fed up with writing the same application again and again Fighting variance is a lost battle, learn to live with it 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019