Locality Optimizations in cc-NUMA Architectures Using Hardware Counters and Dyninst
Mustafa M. Tikir, Jeffrey K. Hollingsworth
Introduction • Cache-coherent SMPs are widely used • High performance computing • Large-scale applications • Client-server computing • cc-NUMA is the dominant architecture • Allows construction of large servers • Data locality is an important consideration • Faster access to local memory units
Data Placement • Memory-intensive applications on cc-NUMA servers • May have significant non-local memory accesses • Possible optimizations to increase locality • First-touch placement of memory pages • Commonly used in modern systems • May not place pages local to the processors accessing them most • Dynamic page placement/migration • Based on page access frequencies at runtime
Our Page Migration Approach • User-level dynamic page migration • Profiling and page migration during the same run • Application Profiling • Gathers data from hardware counters • Samples the interconnect transactions • Transaction Type + Physical Address + Processor ID • Identifies preferred locations of memory pages • Memory unit local to the processor that accesses the page most • Page Placement • Kernel moves memory pages to their preferred locations • At fixed time intervals • Pages are frozen for a while if recently migrated • Eliminates ping-ponging of memory pages
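The profiling and placement steps above can be sketched as a small simulation. This is a minimal illustration, not the authors' implementation: the page size, the length of the freeze window, and the names `preferred_locations` and `migration_round` are all assumptions, and the real system asks the kernel to move pages rather than updating a dictionary.

```python
from collections import Counter, defaultdict

PAGE_SIZE = 8192       # assumed page size in bytes (not stated on the slide)
FREEZE_INTERVALS = 2   # assumed number of intervals a migrated page stays frozen


def preferred_locations(samples, node_of_processor):
    """Map each sampled page to the memory unit whose processors accessed it most.

    Each sample is (transaction_type, physical_address, processor_id).
    """
    counts = defaultdict(Counter)
    for _txn_type, phys_addr, proc_id in samples:
        counts[phys_addr // PAGE_SIZE][node_of_processor[proc_id]] += 1
    return {page: c.most_common(1)[0][0] for page, c in counts.items()}


def migration_round(samples, node_of_processor, location, frozen_until, interval):
    """Run one fixed-interval migration pass and return the pages that moved."""
    moved = []
    targets = preferred_locations(samples, node_of_processor)
    for page, target in targets.items():
        if frozen_until.get(page, 0) > interval:
            continue  # recently migrated: skip it to avoid ping-ponging
        if location.get(page) != target:
            location[page] = target  # stand-in for the kernel moving the page
            frozen_until[page] = interval + 1 + FREEZE_INTERVALS
            moved.append(page)
    return moved
```

Calling `migration_round` once per interval with the samples gathered in that interval mimics the fixed-interval behavior; a page moved in interval `i` is skipped in the next `FREEZE_INTERVALS` rounds even if its sampled accesses point elsewhere.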
Hardware/Software Components • [Figure: Sun Fire 6800 with System Boards 1 and 2 on the address bus, each with a memory unit, Processors 1-4, and physical pages; Sun Fire Link hardware counters; instrumentation software performing transaction sampling; application threads (Thread 1 ... Thread j) and virtual pages] • Explicit binding (processor_bind) • Virtual to physical mapping (meminfo) • Page migration using move-on-next-touch feature (madvise)
Instrumentation Code Insertion • Instrumentation using Dyninst • Entry point of main • Loads a shared library • Creates two helper threads • One for address transaction sampling • Other for actual migrations of the pages • Exit point(s) of thr_create • Calls processor_bind • Binds new threads to available processors • Helper threads are bound to dedicated processors • Entry point of exithandle • Termination detection • Clean-up hardware counters
Preliminary Experiment • Impractical to record all transactions • Interval sampling • Sampling at every Nth transaction • Continuous sampling • Sampling at the maximum speed of the instrumentation software • Are samples representative of transactions?
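The two sampling modes above can be contrasted with a toy model. This is a sketch under stated assumptions: the function names are hypothetical, and continuous sampling is modeled as a sampler that is busy for a fixed `cost` after each sample, which is an assumption and not a description of the actual instrumentation software.

```python
def interval_sample(transactions, n):
    """Interval sampling: keep every n-th transaction (the n-th, 2n-th, ...)."""
    if n <= 0:
        raise ValueError("n must be positive")
    return [t for i, t in enumerate(transactions, start=1) if i % n == 0]


def continuous_sample(timed_transactions, cost):
    """Continuous sampling: take the next transaction seen whenever the sampler is idle.

    timed_transactions is a list of (timestamp, transaction) pairs in time order;
    after each sample the sampler is assumed to be busy for `cost` time units.
    """
    picked, ready_at = [], 0.0
    for timestamp, txn in timed_transactions:
        if timestamp >= ready_at:
            picked.append(txn)
            ready_at = timestamp + cost
    return picked
```

In this model, interval sampling depends only on the transaction count, while continuous sampling depends on when transactions arrive relative to the sampler's speed.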
Representative Sampling Technique • [Figure labels: S_ALL, P_A, P_S, S_SAMPLE] • Potential sampling error • How much do sampled transactions deviate from all transactions? • Distance between two sets • S_ALL and S_SAMPLE • Ratio of transactions requested by a processor, P
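One way to turn the per-processor ratios into a distance between the two sets is sketched below. The slide text does not preserve the exact distance definition, so the metric used here (half the sum of absolute differences between per-processor request ratios) is an assumption chosen for illustration, and `request_ratios` and `sampling_error` are hypothetical names.

```python
from collections import Counter


def request_ratios(processor_ids):
    """Fraction of transactions requested by each processor."""
    total = len(processor_ids)
    return {p: c / total for p, c in Counter(processor_ids).items()}


def sampling_error(all_ids, sample_ids):
    """Assumed distance between all transactions and the sampled ones.

    Returns 0.0 when the sample reproduces every processor's share exactly
    and 1.0 when the two sets share no processors at all.
    """
    r_all = request_ratios(all_ids)
    r_sample = request_ratios(sample_ids)
    processors = set(r_all) | set(r_sample)
    return 0.5 * sum(
        abs(r_all.get(p, 0.0) - r_sample.get(p, 0.0)) for p in processors
    )
```

Each argument is the list of requesting processor IDs, one entry per transaction, for the full trace and for the sample respectively.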
Sampling Error for CG • Interval sampling is more representative • Interval used also has an impact • Continuous sampling is less representative due to the difference between two rates • Rate at which transaction samples are taken • Rate at which processors request transactions
Page Migration Experiments • Applications • OpenMP C implementation of NAS Parallel Benchmark suite • BT(B), CG(C), EP(C), FT(B), LU(C), MG(B), SP(C) • Optimized to support parallelized code • Platform • 24-processor Sun Fire 6800 • 24 GB main memory • Execution • 12 threads • 2 threads on each system board • Page migration every 5 seconds • Interval sampling every 1K transactions
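The thread layout above can be written out as a small placement plan. This sketch assumes 6 system boards with 4 processors each, which is inferred from 24 processors and 12 threads at 2 threads per board and is not stated directly; `binding_plan` is a hypothetical helper, and the real binding is done with processor_bind.

```python
def binding_plan(num_boards, procs_per_board, threads_per_board):
    """Return one (board, processor) slot per application thread, board by board."""
    if threads_per_board > procs_per_board:
        raise ValueError("not enough processors on a board")
    return [
        (board, proc)
        for board in range(num_boards)
        for proc in range(threads_per_board)
    ]


# Assumed layout for the experiments: 6 boards x 4 processors, 2 threads per board.
plan = binding_plan(num_boards=6, procs_per_board=4, threads_per_board=2)
```

Under these assumptions the plan holds 12 slots and leaves two processors free on every board, which is where dedicated processors for the helper threads could come from.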
SPECjbb2001 Results • Potential improvement? • Migration working at object granularity
Conclusions • Our dynamic page migration approach • Reduced non-local memory accesses by up to 90% • Improved the execution times by up to 16% • Potentially more effective on larger cc-NUMA servers • Sun Fire 15K (latency ratio => 1:1.78) • User-level page migration approach • Relies on the OS kernel to provide the actual migration mechanism