
Cross-stack Energy Optimization: Fact or Fiction?

Kevin Skadron, University of Virginia, Dept. of Computer Science


Presentation Transcript


  1. Cross-stack Energy Optimization: Fact or Fiction? Kevin Skadron University of Virginia Dept. of Computer Science

  2. Flavors of X-Stack • “Up” the stack • Circuits → Microarchitecture • HW → SW • e.g., sensors → throttling • Ideally, application itself can adapt (algorithm, precision, QoS, etc.) • … • “Down” the stack • Often overlooked, but OS, HW can benefit from application knowledge • SW → HW • e.g., access patterns, thread priorities, private/shared, etc. • GPU example: texture (API → driver → HW) • e.g., reconfigurable hardware

  3. Up: Dymaxion: Index Transformation • SIMD/SIMT: Because SIMD requires contiguous access for efficiency, data layout/traversal needs to be transformed • User → middleware → (device driver) → (hardware) • feature’[transform(index)] ← feature[index]

  4. Code Example

     Original Version

     DEVICE
         __global__ void kmeans_kernel_orig(float *feature_d, ...) {
             int tid = BLOCK_SIZE * blockIdx.x + threadIdx.x;
             /* ... */
             for (int l = 0; l < nclusters; l++) {
                 index = point_id * nfeatures + l;
                 ...feature_d[index]...
             }
         }

     HOST
         cudaMemcpy(feature_d, feature, …);
         kmeans_kernel_orig<<<dimGrid,dimBlock>>>(feature_d, ...);

     Dymaxion Version

     DEVICE
         __global__ void kmeans_kernel_map(float *feature_remap, ...) {
             int tid = BLOCK_SIZE * blockIdx.x + threadIdx.x;
             /* ... */
             for (int l = 0; l < nclusters; l++) {
                 index = point_id * nfeatures + l;
                 /* same logical index, remapped to the transformed layout */
                 ...feature_remap[transform_row2col(index, npoints, nfeatures)]...
             }
         }

     HOST
         map_row2col(feature_remap, feature, …);
         kmeans_kernel_map<<<dimGrid,dimBlock>>>(feature_remap, ...);
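
The slide uses transform_row2col and map_row2col without showing their bodies. The following is a minimal host-side C++ sketch of one plausible reading, assuming the transform maps a row-major index (point_id * nfeatures + l) to a column-major index (l * npoints + point_id). The function body and the helper remap_row2col_host are illustrative assumptions, not the Dymaxion implementation, which performs the remapping as part of the host-to-device transfer.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed semantics (not taken from the slides): convert a row-major index
// into the matching column-major index for an npoints x nfeatures array.
inline std::size_t transform_row2col(std::size_t index,
                                     std::size_t npoints,
                                     std::size_t nfeatures) {
    assert(nfeatures > 0 && index < npoints * nfeatures);
    std::size_t point_id = index / nfeatures;  // row in the original layout
    std::size_t l = index % nfeatures;         // column in the original layout
    return l * npoints + point_id;
}

// Hypothetical host-only stand-in for the remapping step:
// feature_remap[transform(index)] = feature[index] for every element.
std::vector<float> remap_row2col_host(const std::vector<float>& feature,
                                      std::size_t npoints,
                                      std::size_t nfeatures) {
    assert(feature.size() == npoints * nfeatures);
    std::vector<float> feature_remap(feature.size());
    for (std::size_t i = 0; i < feature.size(); ++i) {
        feature_remap[transform_row2col(i, npoints, nfeatures)] = feature[i];
    }
    return feature_remap;
}
```

With this layout, consecutive threads (consecutive point_id values) reading the same feature l touch adjacent addresses, which is the contiguous access pattern the previous slide asks for.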

  5. Down: Lack of Sensors and Actuators • Feedback control: sensors and actuators • Chicken and egg problem • Lack of sensors is a big problem now • Can’t control what we can’t measure • Performance monitors not designed for this • Too coarse-grained, can’t monitor enough • Moving in the right direction • Need more actuators, too • Currently mainly have just DVFS and scheduling/placement • Some HDDs offer DRPM • Reconfiguration is a form of actuation, too
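
The sensor/actuator pairing on this slide can be illustrated with a very small feedback loop. The sketch below assumes a single temperature sensor and a DVFS level as the only actuator. The struct name, the thresholds, and the one-level-per-step policy are all hypothetical choices made for illustration, not something the slides specify.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical threshold controller: one sensor (temperature) in,
// one actuator setting (DVFS level index, 0 = slowest) out.
struct DvfsController {
    std::size_t level;      // current DVFS level
    std::size_t max_level;  // fastest available level
    double hot_c;           // step down at or above this temperature
    double cool_c;          // step up at or below this temperature

    // One control step. The gap between cool_c and hot_c acts as
    // hysteresis so the level does not oscillate on small changes.
    std::size_t step(double temp_c) {
        assert(cool_c < hot_c && level <= max_level);
        if (temp_c >= hot_c && level > 0) {
            --level;
        } else if (temp_c <= cool_c && level < max_level) {
            ++level;
        }
        return level;
    }
};
```

Even a loop this simple depends on a sensor that can be read often enough and an actuator that responds quickly, which is the slide's point about coarse-grained monitors and the small set of available actuators.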

  6. Wish List • Sensors/constraint communication • Up: Structure occupancies, interval behavior, fine-grained/instruction-level responsiveness, physical location, etc. • Expand perf-counter system, add informing loads (ISCA ~00), allow HW to query microarchitectural state, expose chip/rack/datacenter/geographic location, etc. • Down: Access patterns, private/shared, priority/performance expectations, etc. • Requires new programming constructs and new (possibly privileged) instructions • Actuators • Many system components hard to control • e.g., HDDs, DRAM, power supply • Control memory behavior, light sleep modes • Ordering/buffering/prefetching/contention • More reconfigurability, coarse-grained architectures • Why use cache when you can use scratchpad; registers, routed network when you can do direct producer-consumer, etc.?

  7. Summary • Turn fiction into non-fiction! • Some good ideas already in papers • Revisit: why weren’t they adopted? • New ideas: • Imagine ideal sensing and actuation • Show a promising control/adaptation/reconfiguration algorithm • Propose plausible sensors/actuators

  8. Backup

  9. What is “Cross Stack”? • Layer X adapts based on information in Layer Y • Example: OS uses hardware info • e.g., temp sensors, structure occupancies, # pending cache misses guide thread co-location • Or hardware uses OS info • e.g., thread priorities, task deadlines guide hardware DVFS policy • Important: leverage information across layers to make globally efficient decisions • Ultimately: break down costly interfaces • Unnecessary copies, extra state, redundant computation • Different from energy optimization happening independently in multiple layers • e.g., hardware DVFS (based on instruction flow) + OS DVFS (based on task deadlines) • Risky: control loops can fight
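
As a concrete instance of "hardware uses OS info", a single DVFS decision can take the task deadline (OS knowledge) and the remaining work estimate (hardware or runtime knowledge) as joint inputs instead of running two independent loops. The function below is a minimal sketch under that assumption. The name pick_frequency, its parameters, and the lowest-sufficient-frequency policy are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cross-layer policy: return the index of the lowest
// frequency (freqs_hz sorted ascending) that still finishes
// remaining_cycles within time_left_s. If none can meet the deadline,
// fall back to the highest frequency.
std::size_t pick_frequency(const std::vector<double>& freqs_hz,
                           double remaining_cycles,
                           double time_left_s) {
    assert(!freqs_hz.empty());
    for (std::size_t i = 0; i < freqs_hz.size(); ++i) {
        if (time_left_s > 0.0 &&
            remaining_cycles / freqs_hz[i] <= time_left_s) {
            return i;
        }
    }
    return freqs_hz.size() - 1;
}
```

Because one decision point sees both inputs, there is no second control loop to fight with, which is the risk the slide attributes to independent per-layer DVFS.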

  10. Fact or Fiction • Should be fact! • But mostly fiction • Can’t measure power/energy effectively in many systems and components • Control options are typically high-overhead • DVFS, task migration, etc. • Most solutions are single-layer • Baby steps • Cluster/datacenter front end monitors per-node activity, temperature—schedules accordingly • Autotuning • Reducing copies
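
The "baby step" of a front end that monitors per-node activity and temperature could look roughly like the sketch below. It assumes each node reports a utilization fraction and a temperature, and the front end places new work on the coolest node that still has headroom. NodeStatus, pick_node, and the selection rule are illustrative assumptions, not a description of any particular scheduler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-node report seen by the cluster front end.
struct NodeStatus {
    double utilization;  // fraction of capacity in use, 0.0 to 1.0
    double temp_c;       // reported temperature in degrees Celsius
};

// Return the index of the coolest node whose utilization is below
// max_util, or -1 if no node has headroom.
int pick_node(const std::vector<NodeStatus>& nodes, double max_util) {
    assert(max_util >= 0.0 && max_util <= 1.0);
    int best = -1;
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        if (nodes[i].utilization >= max_util) continue;  // no headroom
        if (best < 0 || nodes[i].temp_c < nodes[best].temp_c) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```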
