Cross-stack Energy Optimization: Fact or Fiction? Kevin Skadron University of Virginia Dept. of Computer Science
Flavors of X-Stack
• “Up” the stack
  • Circuits → Microarchitecture
  • HW → SW
    • e.g., sensors → throttling
    • Ideally, application itself can adapt (algorithm, precision, QoS, etc.)
  • …
• “Down” the stack
  • Often overlooked, but OS, HW can benefit from application knowledge
  • SW → HW
    • e.g., access patterns, thread priorities, private/shared, etc.
    • GPU example: texture (API → driver → HW)
    • e.g., reconfigurable hardware
Up: Dymaxion: Index Transformation
• SIMD/SIMT: Because SIMD requires contiguous access for efficiency, data layout/traversal needs to be transformed
• User → middleware → (device driver) → (hardware)
• feature’[transform(index)] ← feature[index]
Code Example

Original Version

DEVICE
    __global__ void kmeans_kernel_orig(float *feature_d, ...) {
        int tid = BLOCK_SIZE * blockIdx.x + threadIdx.x;
        /* ... */
        for (int l = 0; l < nclusters; l++) {
            index = point_id * nfeatures + l;
            /* direct access with the original index */
            ...feature_d[index]...
        }
    }

HOST
    cudaMemcpy(feature_d, feature, …);
    kmeans_kernel_orig<<<dimGrid, dimBlock>>>(feature_d, ...);

Dymaxion Version

DEVICE
    __global__ void kmeans_kernel_map(float *feature_remap, ...) {
        int tid = BLOCK_SIZE * blockIdx.x + threadIdx.x;
        /* ... */
        for (int l = 0; l < nclusters; l++) {
            index = point_id * nfeatures + l;
            /* same logical index, routed through the index transform */
            ...feature_remap[transform_row2col(index, npoints, nfeatures)]...
        }
    }

HOST
    map_row2col(feature_remap, feature, …);
    kmeans_kernel_map<<<dimGrid, dimBlock>>>(feature_remap, ...);
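The slide names transform_row2col and map_row2col but does not show their bodies. The following is a minimal host-side sketch in plain C of what a row-major to column-major remap could look like. The function bodies, the size_t types, and the extra npoints/nfeatures arguments to map_row2col are assumptions for illustration, not the Dymaxion implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed semantics: map a row-major index (point * nfeatures + feature)
 * to a column-major index (feature * npoints + point). */
size_t transform_row2col(size_t index, size_t npoints, size_t nfeatures)
{
    size_t point = index / nfeatures;   /* which data point (row) */
    size_t feat  = index % nfeatures;   /* which feature (column) */
    assert(point < npoints);            /* index must lie inside the array */
    return feat * npoints + point;
}

/* Assumed host-side remap:
 * feature_remap[transform(index)] = feature[index] for every element. */
void map_row2col(float *feature_remap, const float *feature,
                 size_t npoints, size_t nfeatures)
{
    for (size_t i = 0; i < npoints * nfeatures; i++)
        feature_remap[transform_row2col(i, npoints, nfeatures)] = feature[i];
}
```

Under this layout, values of the same feature for consecutive points sit next to each other in memory, so threads that handle consecutive points and read the same feature would touch contiguous addresses.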
Down: Lack of Sensors and Actuators
• Feedback control: sensors and actuators
  • Chicken and egg problem
• Lack of sensors is a big problem now
  • Can’t control what we can’t measure
  • Performance monitors not designed for this
    • Too coarse-grained, can’t monitor enough
  • Moving in the right direction
• Need more actuators, too
  • Currently mainly have just DVFS and scheduling/placement
  • Some HDDs offer DRPM
  • Reconfiguration is a form of actuation, too
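To make the sensor/actuator pairing concrete, here is a small sketch in plain C of one feedback-control step: a temperature reading (the sensor) goes in, and a DVFS level (the actuator setting) comes out. The thermal_policy struct, next_dvfs_level, and the thresholds are hypothetical names and values chosen for illustration; they do not correspond to any real driver interface.

```c
#include <assert.h>

/* Hypothetical policy: every field and threshold here is illustrative. */
typedef struct {
    double hot_c;     /* throttle when the sensor reads above this */
    double cool_c;    /* speed up when the sensor reads below this */
    int    max_level; /* highest DVFS level; level 0 is the slowest */
} thermal_policy;

/* One control step: sensor reading in, actuator setting out. */
int next_dvfs_level(const thermal_policy *p, double temp_c, int level)
{
    assert(p->cool_c < p->hot_c);
    assert(level >= 0 && level <= p->max_level);
    if (temp_c > p->hot_c && level > 0)
        return level - 1;   /* too hot: step down one level */
    if (temp_c < p->cool_c && level < p->max_level)
        return level + 1;   /* thermal headroom: step up one level */
    return level;           /* inside the band, or already at a limit: hold */
}
```

The gap between cool_c and hot_c acts as a hysteresis band, so the level does not flip back and forth around a single threshold. The sketch also shows the dependency the slide points out: without a usable temperature reading there is nothing for the actuator to react to.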
Wish List
• Sensors/constraint communication
  • Up: Structure occupancies, interval behavior, fine-grained/instruction-level responsiveness, physical location, etc.
    • Expand perf-counter system, add informing loads (ISCA ~00), allow HW to query microarchitectural state, expose chip/rack/datacenter/geographic location, etc.
  • Down: Access patterns, private/shared, priority/performance expectations, etc.
    • Requires new programming constructs and new (possibly privileged) instructions
• Actuators
  • Many system components hard to control
    • e.g., HDDs, DRAM, power supply
  • Control memory behavior, light sleep modes
    • Ordering/buffering/prefetching/contention
  • More reconfigurability, coarse-grained architectures
    • Why use cache when you can use scratchpad; registers, routed network when you can do direct producer-consumer, etc.?
Summary
• Turn fiction into non-fiction!
• Some good ideas already in papers
  • Revisit: why weren’t they adopted?
• New ideas:
  • Imagine ideal sensing and actuation
  • Show a promising control/adaptation/reconfiguration algorithm
  • Propose plausible sensors/actuators
What is “Cross Stack”?
• Layer X adapts based on information in Layer Y
  • Example: OS uses hardware info
    • e.g., temp sensors, structure occupancies, # pending cache misses guide thread co-location
  • Or hardware uses OS info
    • e.g., thread priorities, task deadlines guide hardware DVFS policy
• Important: leverage information across layers to make globally efficient decisions
• Ultimately: break down costly interfaces
  • Unnecessary copies, extra state, redundant computation
• Different from energy optimization happening independently in multiple layers
  • e.g., hardware DVFS (based on instruction flow) + OS DVFS (based on task deadlines)
  • Risky: control loops can fight
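One way to picture the difference between a cross-stack decision and two independent DVFS loops is a single arbiter that sees both layers' information at once. The sketch below, in plain C, combines an OS-level input (work remaining before a task deadline) with a hardware-level input (a frequency cap, for example from a thermal sensor) and returns one frequency. The function name, parameters, and units are hypothetical and serve only as an illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical arbiter. freqs_mhz must be sorted in ascending order.
 * Returns the lowest frequency that meets the deadline without exceeding
 * hw_cap_mhz. If no allowed frequency meets the deadline, returns the
 * highest allowed one. If even the lowest frequency exceeds the cap,
 * returns the lowest frequency. */
unsigned pick_freq_mhz(const unsigned *freqs_mhz, size_t n,
                       double cycles_left, double seconds_left,
                       unsigned hw_cap_mhz)
{
    assert(n > 0 && seconds_left > 0.0);
    double needed_mhz = cycles_left / seconds_left / 1e6; /* OS-side demand */
    unsigned choice = freqs_mhz[0];
    for (size_t i = 0; i < n; i++) {
        if (freqs_mhz[i] > hw_cap_mhz)
            break;                      /* hardware-side limit reached */
        choice = freqs_mhz[i];
        if ((double)freqs_mhz[i] >= needed_mhz)
            break;                      /* slowest setting that still meets the deadline */
    }
    return choice;
}
```

Because one function resolves both constraints, there is a single decision per step. This is the kind of arrangement that avoids the situation on the slide where a hardware loop and an OS loop each adjust frequency and undo each other's choices.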
Fact or Fiction
• Should be fact!
• But mostly fiction
  • Can’t measure power/energy effectively in many systems and components
  • Control options are typically high-overhead
    • DVFS, task migration, etc.
  • Most solutions are single-layer
• Baby steps
  • Cluster/datacenter front end monitors per-node activity and temperature, and schedules accordingly
  • Autotuning
  • Reducing copies
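The front-end scheduling "baby step" can be sketched as a placement function that uses the per-node readings it monitors. The plain C example below is a hypothetical illustration: the node_status fields, the pick_node name, and the policy (skip nodes at or above a temperature limit, then prefer the least-utilized node) are all assumptions, not a description of any particular cluster manager.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-node report collected by the front end. */
typedef struct {
    double temp_c;       /* reported node temperature */
    double utilization;  /* 0.0 (idle) to 1.0 (fully busy) */
} node_status;

/* Return the index of the least-utilized node whose temperature is below
 * max_temp_c, or -1 if every node is at or above the limit. */
int pick_node(const node_status *nodes, size_t n, double max_temp_c)
{
    assert(nodes != NULL || n == 0);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (nodes[i].temp_c >= max_temp_c)
            continue;   /* too hot: do not place new work here */
        if (best < 0 || nodes[i].utilization < nodes[(size_t)best].utilization)
            best = (int)i;
    }
    return best;
}
```

A caller that gets -1 back would need its own fallback, such as queueing the job until a node cools down.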