
What we need to be able to count to tune programs


Presentation Transcript


  1. What we need to be able to count to tune programs
  Mustafa M. Tikir, Bryan R. Buck, Jeffrey K. Hollingsworth

  2. Cache Scope
  • Measurement library
    • Uses Dyninst to instrument the program
    • Inserts calls to initialize the library, start measurement, and replace allocation calls (see the sketch after this slide)
  • Uses hardware monitors in the Itanium 2
    • Event Address Registers for L1D cache misses & FP loads
    • Interrupt on overflow
    • Perfmon randomization feature to access the registers
  • Objects grouped into “stat buckets”
    • Each variable assigned its own stat bucket
    • User can name buckets explicitly for heap allocations
  • Results sorted by latency (rather than miss count)
    • View by stat bucket or function
    • Filter by function or stat bucket
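The transcript does not include Cache Scope's instrumentation code, so here is a minimal sketch of the Dyninst pattern the slide describes: create the target process, load a measurement library into it, and insert a call to an initialization routine at the entry of main(). The library name libcachescope.so and the function cachescope_init are hypothetical placeholders, not Cache Scope's actual interface.

```cpp
// Sketch: Dyninst-based instrumentation that calls a (hypothetical)
// measurement-library entry point at the entry of main().
#include "BPatch.h"
#include "BPatch_function.h"
#include "BPatch_point.h"
#include "BPatch_process.h"
#include "BPatch_snippet.h"
#include <vector>

int main(int argc, char *argv[]) {
  if (argc < 2) return 1;
  BPatch bpatch;

  // Create the target process under Dyninst's control.
  const char *args[] = {argv[1], nullptr};
  BPatch_process *proc = bpatch.processCreate(argv[1], args);
  BPatch_image *image = proc->getImage();

  // Load the measurement library into the target (name is an assumption).
  proc->loadLibrary("libcachescope.so");

  // Find main() and the (hypothetical) initialization routine.
  std::vector<BPatch_function *> mains, inits;
  image->findFunction("main", mains);
  image->findFunction("cachescope_init", inits);  // hypothetical entry point

  // Insert a call to cachescope_init() at the entry of main().
  std::vector<BPatch_point *> *entry = mains[0]->findPoint(BPatch_entry);
  std::vector<BPatch_snippet *> noArgs;
  BPatch_funcCallExpr initCall(*inits[0], noArgs);
  proc->insertSnippet(initCall, *entry);

  proc->continueExecution();
  while (!proc->isTerminated())
    bpatch.waitForStatusChange();
  return 0;
}
```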

  3. Dynamic Page Migration
  • User-level dynamic page migration
    • Profiling and page migration during the same run
  • Application profiling
    • Gathers data from Sun Fire Link hardware monitors
    • Samples the interconnect transactions
      • Transaction type + physical address + processor ID
    • Identifies the preferred location of each memory page
      • Memory local to the processor that accesses it most
  • Page placement (see the sketch after this slide)
    • Kernel moves pages to their preferred locations
      • At fixed time intervals
      • Using the madvise system call
    • Pages are frozen for a while if recently migrated
      • Eliminates ping-ponging of memory pages
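A minimal sketch of the page-placement step, assuming Solaris's NUMA madvise advice value MADV_ACCESS_LWP (which asks the kernel to place a page near the next thread that touches it). The freeze-window bookkeeping is illustrative, not the paper's implementation.

```cpp
// Sketch: advise the kernel to migrate a page toward its preferred
// processor, skipping pages migrated recently to avoid ping-ponging.
#include <sys/mman.h>
#include <time.h>
#include <map>

static std::map<void *, time_t> lastMigration;  // page -> last migration time
static const time_t FREEZE_SECONDS = 5;         // illustrative freeze window

bool migratePage(void *page, size_t pageSize) {
  time_t now = time(nullptr);
  auto it = lastMigration.find(page);
  if (it != lastMigration.end() && now - it->second < FREEZE_SECONDS)
    return false;  // recently migrated: leave frozen for a while

  // MADV_ACCESS_LWP is Solaris-specific: "the next LWP to touch this range
  // will access it heavily", letting the kernel migrate the page there.
  if (madvise(static_cast<caddr_t>(page), pageSize, MADV_ACCESS_LWP) != 0)
    return false;

  lastMigration[page] = now;
  return true;
}
```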

  4. NUMA-Aware Java Heaps
  • NUMA-aware generations (see the sketch after this slide)
    • Divided into segments, one per locality group on the system
    • Each segment is local to its locality group
    • NUMA-aware young and old generations for initial object allocation
  • Dynamic object migration
    • Data from hardware monitors
    • Relate profiles to heap objects
    • Identify preferred locations of objects
  • Evaluation using a hybrid execution simulator
    • Underlying memory management libraries
    • Representative parallel workloads
      • From actual runs of server applications
      • Using data from hardware monitors to sample memory accesses
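A sketch of the segmented-generation idea: one heap segment per locality group, with each object initially allocated from the segment local to the allocating thread's group. All names are illustrative; this is not the actual JVM allocator.

```cpp
// Sketch: a heap generation split into per-locality-group segments,
// with bump-pointer allocation from the caller's local segment.
#include <cstddef>
#include <vector>

struct Segment {
  char *base;     // backed by memory local to one locality group
  size_t used;
  size_t size;
};

struct NumaGeneration {
  std::vector<Segment> segments;  // one segment per locality group

  // Allocate from the local segment; a real collector would fall back
  // to other segments or trigger a GC when the local segment fills up.
  void *allocate(size_t bytes, int localityGroup) {
    Segment &seg = segments[localityGroup];
    if (seg.used + bytes > seg.size)
      return nullptr;             // out of local space (GC in a real VM)
    void *obj = seg.base + seg.used;
    seg.used += bytes;
    return obj;
  }
};
```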

  5. What Else Could Monitors Do?
  • Previous uses
    • Information about the hardware components in the processor
    • Hardware designers
    • Hand-tuning of systems
  • Our use of hardware monitors
    • Data-centric measurement of program behavior
    • Automatic tuning for memory access locality
  • To be more useful for automatic tuning, monitors should provide
    • Information on the cause of events
      • Cache eviction information
    • More specialized information
      • Address Translation Counters for dynamic page migration

  6. Cache Eviction Information
  [Figure: CPU with an L1 cache backed by L2 cache and main memory; virtual-to-physical translation feeds the cache's tag/data arrays. Proposed performance monitors record the virtual address of the last miss, a miss count, and the tag of the last evicted line, raising an interrupt on overflow.]
  • Insight into interactions among data
    • Particularly useful for data layout optimizations
  • The physical address of the evicted line is available to the hardware
    • Can be calculated from the tag of the evicted cache line (see the sketch after this slide)
    • Information in the OS can map the physical address back to a virtual address
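A worked sketch of the address calculation the slide mentions: for a physically indexed, physically tagged cache, the tag holds the high-order physical address bits, so the evicted line's physical address can be rebuilt from the tag and the set index. The cache geometry below is illustrative.

```cpp
// Sketch: rebuild the physical address of an evicted line from its tag.
// address = (tag << (indexBits + offsetBits)) | (setIndex << offsetBits)
#include <cstdint>

constexpr unsigned kOffsetBits = 6;  // 64-byte cache lines (assumption)
constexpr unsigned kIndexBits  = 7;  // 128 sets (assumption)

uint64_t evictedLineAddress(uint64_t tag, uint64_t setIndex) {
  return (tag << (kIndexBits + kOffsetBits)) | (setIndex << kOffsetBits);
}
// The OS can then map this physical address back to the virtual address
// of the evicted data via its page tables.
```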

  7. Address Translation Counters (ATC)
  [Figure: a TLB entry E extended with an ATC counter C_E alongside the valid bit, dirty bit, virtual address, and physical address.]
  • Count access frequencies to pages by a processor
  • A counter C_E for each TLB entry E
  • C_E ← 0 when
    • The TLB entry is loaded due to a TLB miss
    • The TLB entry is invalidated
      • Context switch
      • Cache coherency control operation
  • C_E ← C_E + 1 on
    • Each virtual-to-physical address translation
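The counter rule restated as a small software simulation; the real mechanism would live in the TLB hardware.

```cpp
// Sketch: the ATC update rule from the slide, simulated in software.
#include <cstdint>

struct TlbEntry {
  bool valid = false;
  uint64_t virtualPage = 0;
  uint64_t physicalPage = 0;
  uint64_t atc = 0;  // C_E: access count since this entry was (re)loaded
};

// C_E <- 0 when the entry is loaded on a TLB miss or invalidated
// (context switch, cache coherency control operation).
void loadEntry(TlbEntry &e, uint64_t vpage, uint64_t ppage) {
  e.valid = true;
  e.virtualPage = vpage;
  e.physicalPage = ppage;
  e.atc = 0;
}

// C_E <- C_E + 1 on every virtual-to-physical translation through E.
uint64_t translate(TlbEntry &e, uint64_t offset) {
  ++e.atc;
  return (e.physicalPage << 12) | offset;  // assumes 4 KB pages
}
```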

  8. Gathering Information from ATC
  • Sampling TLB contents (see the sketch after this slide)
    • Via system calls, at fixed time intervals
    • Returns the list of valid TLB entries
      • Virtual address + ATC value
  • Low-overhead traps by the OS
    • At TLB entry eviction or invalidation
    • Processor ID + virtual address + ATC value
  • Additional fields in page table entries
    • Page table updated at context switch
    • A counter for each processor for each page
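A sketch of the sampling approach, with a hypothetical read_tlb_snapshot() wrapper standing in for the system-call interface the slide proposes; a stub is included so the sketch compiles.

```cpp
// Sketch: a user-level sampler polling TLB contents at fixed intervals
// through a hypothetical OS interface.
#include <cstdint>
#include <cstdio>
#include <unistd.h>
#include <vector>

struct TlbSample {
  uint64_t virtualAddress;  // page the entry maps
  uint64_t atcValue;        // C_E at sampling time
};

// Stub standing in for the proposed syscall: fills `out` with all valid
// TLB entries and returns how many were written (none in this stub).
size_t read_tlb_snapshot(TlbSample *out, size_t max) {
  (void)out; (void)max;
  return 0;
}

int main() {
  std::vector<TlbSample> buf(1024);
  for (;;) {
    size_t n = read_tlb_snapshot(buf.data(), buf.size());
    for (size_t i = 0; i < n; ++i)
      printf("page %#lx accessed %lu times\n",
             (unsigned long)buf[i].virtualAddress,
             (unsigned long)buf[i].atcValue);
    sleep(1);  // fixed sampling interval
  }
}
```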

  9. Conclusions
  • Current hardware monitors are good at
    • Counting events (the result of a problem)
  • We would like to
    • Count the causes of events (why the problem happened)
    • Gather more specialized information
  • Future hardware monitors
    • For automatic tuning of programs
    • Must be sufficiently simple to get implemented
  • Collaboration needed for future monitors
    • Application developers
    • System software designers
    • Processor architects

  10. References
  • Data Centric Cache Measurement on the Intel Itanium 2 Processor
    • Buck and Hollingsworth, SC'04
  • Using Hardware Counters to Automatically Improve Memory Performance
    • Tikir and Hollingsworth, SC'04
  • NUMA-Aware Java Heaps for Server Applications
    • Tikir and Hollingsworth, IPDPS'05
  • Data Centric Cache Measurement Using Hardware and Software Instrumentation
    • Bryan R. Buck, PhD thesis
  • Using Hardware Monitors to Generate Parallel Workloads
    • Tikir and Hollingsworth, under review for EuroPar'05
