Design of a Custom VEE Core in a Chip Multiprocessor Dan Upton Masters Presentation Oct. 23, 2007
Why a VEE Core? • VEEs have become more common • One source of overhead is sharing execution resources (cycles, physical structures) → Move VEE onto a separate core
Why A Heterogeneous Core? • VEE different from other applications • Smaller hardware structures can save on power consumption • Better efficiency from CMPs with application-specific cores
What’s To Come • VEE characterization • Performance counter-based • SimpleScalar-based • Power study • SimpleScalar/Wattch-based • Design space
Background: CMPs • Multiple cores on a single die • Shared resources? • Homogeneous vs. heterogeneous [Diagram: four CPUs, each with private L1 I- and D-caches; each pair of cores shares an L2 cache]
Background: VEE Types • System VEEs: the VEE runs on the hardware, beneath an entire OS and its applications • Process VEEs: the VEE runs above the OS, hosting a single application [Diagram: side-by-side software stacks for system and process VEEs]
System VEEs • Whole system (OS + apps) • Xen, VMWare, Transmeta, … • Hardware support: Intel VT, AMD SVM [Diagram: applications and OS running above the VEE, which runs directly on hardware]
Process VEEs • Single application per instance • Finer-grained policy selection • Pin, Strata, Dynamo, DynamoRIO, … • Hardware support: this work (based on Pin) [Diagram: the application runs above the VEE, which runs above the OS and hardware]
An Overview Of Process VEEs • Injection: the VEE takes control of the application • On each branch: is the target already in the code cache? • Yes: context switch into the code cache and execute • No: compile and instrument the code, add it to the code cache, then context switch (OVERHEAD! — parallelize with the running app?) • Once translations are linked in the cache, the context switch back to the VEE is no longer necessary
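The loop above can be sketched in a few lines. This is an illustrative model, not Pin's actual API: a dictionary stands in for the code cache, and a counter records how often the "compile and instrument" path (the overhead) is taken.

```python
# Minimal sketch of a process-VEE dispatch loop. The names below
# (code_cache, translate_block, dispatch) are hypothetical.
code_cache = {}     # guest branch target -> "translated" block
translations = 0    # number of compile-and-instrument events (the overhead)

def translate_block(pc):
    """'Compile and instrument' a guest block; here we just record it."""
    global translations
    code_cache[pc] = f"translated@{pc:#x}"
    translations += 1

def dispatch(pc):
    """One iteration of the VEE main loop for a guest branch target."""
    if pc not in code_cache:     # miss: pay the compile overhead
        translate_block(pc)
    return code_cache[pc]        # hit: "context switch" into the cache

# Re-executing hot targets touches the compiler only once per target.
for pc in [0x400, 0x404, 0x400, 0x404, 0x400]:
    dispatch(pc)
print(translations)  # → 2
```

The point the slide makes is visible here: after warm-up, every branch hits the cache, so all remaining overhead comes from the first visit to each target.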
Environment • Two environments for data collection • Hardware performance-monitoring: fast, gives real data, but can’t modify hardware • Architectural simulation: can modify the simulated hardware, but slow
Environment • Hardware performance counters • perfctr 2.6.25, PAPI 3.5, papiex 0.99rc9 • Xeon with HT, PIII • Modified Pin source to start/stop counters on VEE entry/exit
Environment • Architectural simulation • SimpleScalar-x86 • Allows for modifying architectural characteristics • A tiny (8-instruction) guest application means the collected data is dominated by, and thus representative of, the VEE
Characterization • Based on architectural units that are commonly considered for removal, resizing, or sharing: • Floating-point pipeline • Cache hierarchy • Branch prediction hardware
Characterization: Floating Point • At most 0.1% of executed instructions are floating-point • VEE core probably doesn’t need dedicated FP hardware • Could use conjoined-core approach and share FP with another nearby core on the die
Characterization: Recap • Floating point: low utilization, so the VEE core can share with another core • L1 caches: smaller caches are sufficient • L2 caches: generally shared between cores • Branch predictor: smaller history table is sufficient
Power Consumption • Smaller structures can lead to a decrease in power consumption • Compare power between modern core and our VEE core design using Wattch
Power Consumption: Summary • Specialized design saves up to 14% power per cycle • Saves up to 5% over the total execution • but it can lead to higher consumption in some cases
Chip-level Design • Stand-alone VEE core: several general-purpose CPUs plus one shared VEE core (possibly alongside other specialized cores) • Conjoined VEE core: each general-purpose CPU paired with its own VEE core [Diagram: two floorplans illustrating the stand-alone and conjoined arrangements]
Support Structures • Communication channel between application and VEE • Support for speculative compilation by the VEE • Channels to peek at application core structures • For instance, branch history, for easy profiling of hot paths
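The application-to-VEE channel can be pictured as a bounded producer/consumer queue: the application core posts events (say, hot-branch PCs) without blocking, and the VEE core drains them at its own pace. The class and message names below are hypothetical — a software stand-in for what would be a hardware queue between cores.

```python
# Hedged sketch of an app-core -> VEE-core communication channel as a
# bounded single-producer/single-consumer buffer (illustrative only).
from collections import deque

class Channel:
    def __init__(self, capacity):
        self.buf = deque()
        self.capacity = capacity

    def send(self, msg):
        """App core posts a message if room remains."""
        if len(self.buf) >= self.capacity:
            return False          # full: the app drops or retries, never blocks
        self.buf.append(msg)
        return True

    def recv(self):
        """VEE core drains one message, or None if the channel is empty."""
        return self.buf.popleft() if self.buf else None

ch = Channel(capacity=2)
ch.send(("branch", 0x400))
ch.send(("branch", 0x404))
print(ch.send(("branch", 0x408)))  # → False (channel full)
```

A bounded, non-blocking channel matters here because the application core should never stall waiting on the VEE; dropped profile events only cost profiling precision, not correctness.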
Related Work • Hardware support for VEEs • Trident • Codesigned VMs • Transmeta, DAISY, Kim & Smith • Java in hardware • picoJava, JOP
Future Work • Multicore simulation to measure interaction between multiple VEE instances • Requires multicore sim framework, multithreaded VEE • Investigate other opportunities arising from separating VEE and application
Conclusions • VEE differs from benchmark applications • VEE-specific core design can save power • Potential for reducing overhead by not sharing execution resources, or parallelizing compilation and execution