Perseus Design. Platform models define number of cores and cache characteristics. Behavioral “signatures” are extracted from a baseline execution. Configuration plans define application-triggered system control (affinity & power).
• Platform models define the number of cores and cache characteristics
• Behavioral “signatures” are extracted from a baseline execution
• Configuration plans define application-triggered system control (affinity & power)
• The large number of variables creates a huge solution space, well suited to Genetic Algorithm approaches
• The prototype will focus on support for x86 binaries on a Linux platform
• A second phase of instrumentation “hooks” the configuration plan into the application
Architecture
Lockheed Martin and Government Use Only
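To make the Genetic Algorithm idea concrete, here is a minimal sketch of a search over configuration plans. Everything in it is an assumption for illustration: the `Plan` encoding (one core per thread, one frequency step per core), the sizes, the toy `plan_cost` function, and the selection scheme are hypothetical and are not taken from the Perseus engine, which would presumably score plans against the measured signatures and platform models.

```c
#include <stdint.h>
#include <string.h>

#define N_THREADS 4   /* hypothetical application thread count */
#define N_CORES   8   /* hypothetical platform core count */
#define POP_SIZE  16  /* candidate plans per generation */

/* One candidate configuration plan (hypothetical encoding). */
typedef struct {
    uint8_t core_of_thread[N_THREADS]; /* affinity: core chosen for each thread */
    uint8_t freq_step[N_CORES];        /* 0 = full speed, 1 = reduced speed */
} Plan;

/* Small deterministic generator so that runs are repeatable. */
static uint32_t rng_state = 12345u;
static uint32_t next_rand(void) {
    rng_state = rng_state * 1664525u + 1013904223u;
    return rng_state >> 8;
}

/* Toy cost, lower is better. It penalises threads placed on slowed cores,
   threads sharing a core, and idle cores left at full speed. */
int plan_cost(const Plan *p) {
    int load[N_CORES] = {0};
    int cost = 0;
    for (int t = 0; t < N_THREADS; t++) {
        uint8_t core = p->core_of_thread[t];
        load[core]++;
        cost += 4 * p->freq_step[core];
    }
    for (int c = 0; c < N_CORES; c++) {
        if (load[c] > 1) cost += 10 * (load[c] - 1);
        if (load[c] == 0 && p->freq_step[c] == 0) cost += 1;
    }
    return cost;
}

static Plan random_plan(void) {
    Plan p;
    for (int t = 0; t < N_THREADS; t++) p.core_of_thread[t] = (uint8_t)(next_rand() % N_CORES);
    for (int c = 0; c < N_CORES; c++) p.freq_step[c] = (uint8_t)(next_rand() % 2);
    return p;
}

/* Tournament selection: pick two random plans and keep the cheaper one. */
static const Plan *tournament(const Plan *pop) {
    const Plan *a = &pop[next_rand() % POP_SIZE];
    const Plan *b = &pop[next_rand() % POP_SIZE];
    return plan_cost(a) <= plan_cost(b) ? a : b;
}

/* Uniform crossover: each gene comes from one parent at random. */
static Plan crossover(const Plan *a, const Plan *b) {
    Plan child;
    for (int t = 0; t < N_THREADS; t++)
        child.core_of_thread[t] = (next_rand() & 1u) ? a->core_of_thread[t] : b->core_of_thread[t];
    for (int c = 0; c < N_CORES; c++)
        child.freq_step[c] = (next_rand() & 1u) ? a->freq_step[c] : b->freq_step[c];
    return child;
}

/* Mutation: move one thread to a random core, or flip one core's frequency step. */
static void mutate(Plan *p) {
    if (next_rand() & 1u) {
        uint32_t t = next_rand() % N_THREADS;
        p->core_of_thread[t] = (uint8_t)(next_rand() % N_CORES);
    } else {
        p->freq_step[next_rand() % N_CORES] ^= 1u;
    }
}

/* Runs the search and returns the cheapest plan seen. The best plan is
   always carried into the next generation (elitism). */
Plan evolve(int generations, int *initial_best_cost) {
    Plan pop[POP_SIZE], next[POP_SIZE];
    for (int i = 0; i < POP_SIZE; i++) pop[i] = random_plan();
    Plan best = pop[0];
    for (int i = 1; i < POP_SIZE; i++)
        if (plan_cost(&pop[i]) < plan_cost(&best)) best = pop[i];
    if (initial_best_cost) *initial_best_cost = plan_cost(&best);
    for (int g = 0; g < generations; g++) {
        next[0] = best;
        for (int i = 1; i < POP_SIZE; i++) {
            next[i] = crossover(tournament(pop), tournament(pop));
            mutate(&next[i]);
            if (plan_cost(&next[i]) < plan_cost(&best)) best = next[i];
        }
        memcpy(pop, next, sizeof pop);
    }
    return best;
}
```

Calling `evolve(50, &start_cost)` returns a plan whose cost is never worse than the best plan of the initial random population. In a real engine the cost function is where the TEG/TMAM signatures and platform measurements would enter.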
Behavioral Analysis Sub-system
• Behavioral analysis is split into two parts
  • TEG (Temporal Execution Graph)
  • TMAM (Temporal Memory Access Map)
• Precise data is collected on a per-thread, per-call site basis
• Binary instrumentation is facilitated by Dyninst (University of Wisconsin-Madison)
• Accurate counting (e.g., processor cycles) and timing are facilitated through PAPI (University of Tennessee)
TEG Collection
• TEG collects information about how much time the application spends executing its different functions. Both cycle counts and timestamps are collected so that potential “slow-downs” can be identified
• Per-thread, per-call site timing and cycle count information is collected for selected function calls
• Results provide timing distributions for functions, as opposed to the averages and counts reported by tools such as gprof and callgrind
• Overhead depends on the density of instrumentation (i.e., number of functions + calls); in most cases it is negligible
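The difference between a timing distribution and an average can be sketched as follows. This is an illustration only: the `TegSite` record, the power-of-two bucket layout, and the function names are hypothetical and do not describe the actual TEG data format.

```c
#include <stdint.h>

#define TEG_BUCKETS 32

/* Hypothetical record kept per thread and per call site. */
typedef struct {
    uint64_t count;                /* number of calls observed */
    uint64_t total_cycles;         /* sum of cycles, enough for an average */
    uint64_t buckets[TEG_BUCKETS]; /* histogram of per-call cycle counts */
} TegSite;

/* Bucket 0 holds 0 and 1 cycles; bucket i (i >= 1) holds [2^i, 2^(i+1)).
   The last bucket also absorbs anything larger. */
int teg_bucket(uint64_t cycles) {
    int b = 0;
    while (cycles > 1 && b < TEG_BUCKETS - 1) {
        cycles >>= 1;
        b++;
    }
    return b;
}

/* Record one call's cycle count in the site's histogram. */
void teg_record(TegSite *site, uint64_t cycles) {
    site->count++;
    site->total_cycles += cycles;
    site->buckets[teg_bucket(cycles)]++;
}
```

If a call site is recorded three times at 1,000 cycles and once at 1,000,000 cycles, the average (250,750 cycles) matches none of the calls, while the histogram shows three entries in one bucket and a single slow outlier far away from them.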
TMAM Collection
• All application reads and writes to memory are captured via probes instrumented at the binary level. This data is essential for identifying cache false sharing
• Data is collected via a shared memory logger
• Overhead is very high: execution is on the order of 100x slower
• At these levels we have to be careful not to affect normal behavior. Dynamic probe placement and sampling could be used to alleviate this problem
• Massive volumes of data result (e.g., a 20-second program can generate 100 Gb+)
• Two modes of operation: off-line analysis and real-time analysis
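A rough sketch of how false sharing might be spotted in such an access log is shown below. The `MemAccess` record layout, the 64-byte line size, and the pairwise scan are all assumptions made for illustration; the real TMAM log format and analysis are not described in these slides. The sketch also ignores access width and timing, both of which a real analysis would need.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u /* assumed line size; platform models would supply this */

/* Hypothetical log record for one memory access. */
typedef struct {
    uint32_t  thread_id;
    uintptr_t address;
    bool      is_write;
} MemAccess;

static uintptr_t cache_line_of(uintptr_t address) {
    return address / CACHE_LINE_BYTES;
}

/* Counts pairs of accesses where two different threads touch different
   addresses on the same cache line and at least one access is a write.
   Quadratic in the log length, so suitable only as an illustration. */
size_t count_false_sharing_pairs(const MemAccess *log, size_t n) {
    size_t pairs = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            bool different_threads = log[i].thread_id != log[j].thread_id;
            bool same_line = cache_line_of(log[i].address) == cache_line_of(log[j].address);
            bool different_address = log[i].address != log[j].address;
            bool any_write = log[i].is_write || log[j].is_write;
            if (different_threads && same_line && different_address && any_write)
                pairs++;
        }
    }
    return pairs;
}
```

Two threads writing to `0x1000` and `0x1008` count as a false-sharing pair, whereas two threads touching the same address (true sharing), two readers of one line, or writers on separate lines do not.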
Platform Analysis
• Micro-benchmarks, implemented as part of the current solution, empirically measure:
  • Number of processors, and the number (and values) of frequency steppings
  • Cost of thread migration (i.e., affinity change)
  • Ratios of power-to-cycles at different frequencies
  • Cost (in cycles) of frequency modulation
  • Core topology
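One way to read the power-to-cycles ratio is as energy per cycle: power in watts divided by frequency in cycles per second gives joules per cycle. The sketch below assumes that interpretation; the `FreqStep` type and function names are hypothetical, and any numbers fed to it would have to come from the platform measurements.

```c
#include <stddef.h>

/* Hypothetical measurement for one frequency step. */
typedef struct {
    double frequency_hz; /* cycles per second at this step */
    double power_watts;  /* measured power draw at this step */
} FreqStep;

/* Watts are joules per second, so dividing by cycles per second
   yields joules per cycle. */
double joules_per_cycle(FreqStep step) {
    return step.power_watts / step.frequency_hz;
}

/* Returns the index of the step with the lowest energy per cycle.
   Assumes n >= 1. */
size_t most_efficient_step(const FreqStep *steps, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++) {
        if (joules_per_cycle(steps[i]) < joules_per_cycle(steps[best]))
            best = i;
    }
    return best;
}
```

With made-up values of 80 W at 2 GHz and 50 W at 1 GHz, the faster step costs less energy per cycle (40 nJ against 50 nJ), which shows why the lowest frequency is not automatically the most efficient choice.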
Example Platform Information
• Example data collected empirically through fine-grained on-chip timing and a micro-benchmark program
Data collected from a dual-processor, quad-core Xeon system running Debian Linux. Each matrix element is shaded according to the measured latency of the migration (darker is slower).
Design Optimization Engine
Example Deployment Data
• Deployment results consist of trigger locations and auto-generated trigger source code

    #include <pthread.h>
    #include "affinity.h"
    #include "fvctrl.h"
    #include "triggeraux.h"

    /* Sets the initial frequency configuration for CPUs 0-7. */
    void Init_Frequency()
    {
        modulate_cpu(0, 1, 0);
        modulate_cpu(1, 1, 0);
        modulate_cpu(2, 1, 0);
        modulate_cpu(3, 0, 0);
        modulate_cpu(4, 0, 0);
        modulate_cpu(5, 0, 0);
        modulate_cpu(6, 0, 0);
        modulate_cpu(7, 1, 0);
    }

    /* Trigger for call site 8048D92: sets the affinity of the calling
       thread according to its thread instance id. */
    void Before_CS_8048D92()
    {
        switch (GetThreadInstanceId()) {
        case 1: { affinize_thread(0, pthread_self()); break; }
        case 2: { affinize_thread(3, pthread_self()); break; }
        case 3: { affinize_thread(1, pthread_self()); break; }
        case 4: { affinize_thread(1, pthread_self()); break; }
        }
    }

    libControl.so
    8048C07,Before_CS_8048C07
    8048C98,Before_CS_8048C98
    8048D92,Before_CS_8048D92
    8048DB0,Before_CS_8048DB0
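The trigger location entries above appear to pair a call-site address with the name of the trigger function to run there. A minimal sketch of reading such entries follows; the format (hexadecimal address, comma, function name) is inferred from the listing, and the `TriggerLocation` type and both function names are hypothetical rather than part of the Perseus tooling.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical in-memory form of one trigger location entry. */
typedef struct {
    unsigned long address;     /* call-site address, parsed as hexadecimal */
    char          handler[64]; /* name of the trigger function */
} TriggerLocation;

/* Parses one "ADDRESS,HandlerName" entry. Returns 1 on success, 0 if the
   line does not match the assumed format. */
int parse_trigger_line(const char *line, TriggerLocation *out) {
    return sscanf(line, "%lx,%63s", &out->address, out->handler) == 2;
}

/* Looks up the entry for a call-site address; returns NULL if absent. */
const TriggerLocation *find_trigger(const TriggerLocation *table, size_t n,
                                    unsigned long address) {
    for (size_t i = 0; i < n; i++) {
        if (table[i].address == address)
            return &table[i];
    }
    return NULL;
}
```

Parsing `8048D92,Before_CS_8048D92` with this sketch yields the address `0x8048D92` and the handler name `Before_CS_8048D92`; a header line such as `libControl.so` is rejected because it does not match the pattern.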
Power Measurement
Server-style ATX power feeds two 12V lines to each processor. Data is streamed to a host via USB.