Tensilica-based simulator for Smart Memories Alex Solomatnikov Smart Memories meeting 6/17/03
Outline • Introduction to Tensilica system • XTMP API • Simulator architecture • Simulation of “special” memory operations • Current simulator status • What’s next • Summary
Introduction • The Tensilica system is a reconfigurable, extensible embedded processor architecture: • variety of interfaces • a number of cache or local memory configurations • interrupt and memory mapping options • ISA is extensible with the TIE language • complete hardware generation/simulation/software development framework • The next generation of the system has even more flexibility
New Xtensa Architecture • 5- or 7-stage RISC pipeline • 1 or 2 cycles for memory access • Base RISC ISA: • 16/24-bit instructions (16 GPR) • optional 64-bit instructions (VLIW) • Predefined ISA extensions: • FP co-processor • multiplier and/or MAC • SIMD/VLIW unit (Vectra II) • multiple load/store units (but only one connected to the cache) • User-designed TIE extensions: • user-defined state (co-processors and register files) • user-defined instructions • user-defined interfaces (TIEWire)
System with Xtensa core • [Block diagram: the Xtensa core (compute core with instruction/data caches, instruction/data RAM and ROM interfaces, and an XLMI port to local memories, shared memories, FIFOs, and peripherals) connects through its processor interface and an external bus interface to standard or system-specific bus(es) shared with other cores, peripherals, and memories]
Simulation view • [Same block diagram from the simulator's perspective: an Xtensa core object with the same cache, RAM/ROM, and XLMI interfaces, connected to models of the local/shared memories, FIFOs, peripherals, and bus(es)]
XTMP API • Allows simulation of multiple Xtensa cores: • each core is simulated by an independent thread • caches are simulated inside the core • Cores can be connected to pre-defined or custom devices: • pre-defined: memories • custom: user-designed models • We need to develop a custom memory model, including caches that support cache coherence and other desired features
Custom devices in new XTMP • Custom devices communicate with Xtensa cores through callback functions • For every transaction (read or write), the core calls the device's post callback function: • when the transaction is handled, the device model calls XTMP_post • the callback cannot stall the core thread, i.e. it cannot call XTMP_wait • A device can have a ticker function, which is called every simulation cycle to update device state • A device can stall a core by calling either XTMP_setPortBusy or XTMP_setGlobalStall • Additional callback functions are soon to be defined • Users can define new threads
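A minimal sketch of how such a custom device could be organized, assuming a C model in the XTMP style: the request structure, queue, and function names below are hypothetical, and the exact XTMP_post signature is deliberately not reproduced; only the post-callback/ticker split and the rule that the callback must not block come from the slides.

    /* Hypothetical custom-device skeleton: a post callback that queues
       incoming transactions and a ticker that completes them later.
       Names are illustrative, not the real XTMP API. */

    typedef struct {
        unsigned addr;
        unsigned data;
        int      is_write;
        void    *xtmp_txn;          /* opaque handle used to reply via XTMP_post */
    } PortRequest;

    #define QUEUE_SIZE 16
    static PortRequest queue[QUEUE_SIZE];
    static int q_head = 0, q_tail = 0;

    /* Post callback: invoked by the core for every read/write transaction.
       It must not stall the core thread (no XTMP_wait), so it only queues. */
    void device_post_callback(void *device, PortRequest req)
    {
        queue[q_tail] = req;
        q_tail = (q_tail + 1) % QUEUE_SIZE;
    }

    /* Ticker: called once per simulated cycle to update device state and
       complete pending transactions (conceptually by calling XTMP_post). */
    void device_ticker(void *device)
    {
        while (q_head != q_tail) {
            PortRequest *r = &queue[q_head];
            (void) r;   /* real code would read r->addr, access the memory
                           model, and reply with XTMP_post(r->xtmp_txn, ...) */
            q_head = (q_head + 1) % QUEUE_SIZE;
        }
    }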
Possible approaches • Multiple threads, e.g. a separate thread per tile or cache controller • A single thread that simulates all memory system components • Multiple ticker functions, e.g. per mat or tile • A combination of ticker(s) and thread(s) • A single ticker responsible for the whole memory system: • simplest and probably fastest
Proposed simulator architecture • [Diagram: a quad of four tiles, each containing two processors; every processor runs in its own processor thread and connects through a port device to a single memory ticker (running in the ticker thread) that models the quad's memory system]
Memory system model • Can be “functional”: • caches, SRAMs, FIFOs • Or can be “structural”: • memory mats • Or can be a combination of both • The trade-off is between speed and accuracy
“Functional” model • [Diagram: chip containing quads of tiles; each tile holds two processors plus functional models of instruction/data caches, instruction/data RAMs, and FIFOs, connected through a cache controller/network interface to a memory controller/directory and an I/O controller, and from there to off-chip memory and I/O devices]
“Structural” model • [Diagram: same chip/quad/tile hierarchy, but each tile's local storage is modeled as individual memory mats rather than functional caches/RAMs/FIFOs; the cache controller/network interface, memory controller/directory, I/O controller, off-chip memory, and I/O devices are unchanged]
Processor Model • Single-issue, in-order, 7-stage pipeline • 2 cycles to access local memory mats • Each processor can have N contexts and switch between them on a cache miss: • N instances of the core, only one of which is unstalled each cycle • Connected to the memory system through “port devices”: • the post callback function receives processor requests
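A hedged sketch of the context-switch-on-miss idea described above: keep N core instances per processor, unstall only the active one each cycle, and rotate when the active context misses in the cache. Every name below is illustrative; the real mechanism would use the XTMP stall calls mentioned earlier.

    /* Illustrative scheduler for N simulated contexts per processor. */

    #define NUM_CONTEXTS 4

    typedef struct {
        void *core;          /* handle to this context's Xtensa core instance */
        int   miss_pending;  /* set by the memory model on a cache miss */
    } Context;

    static Context ctx[NUM_CONTEXTS];
    static int active = 0;

    /* Called once per cycle from the memory-system ticker. */
    void schedule_contexts(void)
    {
        if (ctx[active].miss_pending) {
            /* rotate to the next context that is not waiting on a miss */
            int next = (active + 1) % NUM_CONTEXTS;
            while (next != active && ctx[next].miss_pending)
                next = (next + 1) % NUM_CONTEXTS;
            active = next;
        }
        /* here the real model would unstall ctx[active].core and keep all
           other core instances stalled (e.g. via the XTMP stall calls) */
    }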
Simulator hack • Normally the core model simulates all memory ports and the PIF in great detail: • ports are inconvenient to use for our purposes: • need to divide the address space • need to create a separate “device” for each port • the PIF buffers a transaction until the instruction commits and simulates PIF contention • Hack: • disable “memory modeling” in the PIF and route all transactions through the PIF • should become a simulator “feature” soon
Pipeline commit point • The store buffer is still simulated even with “memory modeling” disabled: • stores are posted later than loads • [Pipeline diagram: stages F1 F2 R E M1 M2 W, marking where post() is called for fetches, loads, and stores]
“Special” memory operations • Memory operations with side effects • Currently have: • synch_load – check F/E (full/empty) bit: • if empty stall until set to full by another processor • when full perform load and set F/E bit to empty • synch_store – check F/E bit: • if full stall until set to empty by another processor • when empty perform store and set F/E bit to full • future_load – check F/E (full/empty) bit: • if empty stall until set to full by another processor • when full perform load
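As a sketch only, the memory-model side of these operations might look like the following; the SyncWord structure and handler are hypothetical and simply mirror the F/E-bit semantics listed above (a return value of 0 means the requesting core must stall).

    /* Hypothetical memory-model handler for F/E-bit operations. */

    enum MemOp { SYNCH_LOAD_OP, SYNCH_STORE_OP, FUTURE_LOAD_OP };

    typedef struct {
        unsigned data;
        int      full;   /* F/E bit: 1 = full, 0 = empty */
    } SyncWord;

    /* Returns 1 if the request completes this cycle, 0 if the core must stall. */
    int handle_sync_op(SyncWord *w, enum MemOp op,
                       unsigned store_data, unsigned *load_data)
    {
        switch (op) {
        case SYNCH_LOAD_OP:              /* wait for full, load, set empty */
            if (!w->full) return 0;
            *load_data = w->data;
            w->full = 0;
            return 1;
        case SYNCH_STORE_OP:             /* wait for empty, store, set full */
            if (w->full) return 0;
            w->data = store_data;
            w->full = 1;
            return 1;
        case FUTURE_LOAD_OP:             /* wait for full, load, leave F/E bit unchanged */
            if (!w->full) return 0;
            *load_data = w->data;
            return 1;
        }
        return 0;
    }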
“Special” memory operations • Other operations useful for initialization/release of locks: • reset_load: • perform load and set F/E bit to empty w/o stall • set_store: • perform store and set F/E bit to full w/o stall • Might want to have other operations: • load_fe_bit • store_fe_bit
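For example, a simple lock could be built from these primitives, following the SYNCH_LOAD/SET_STORE macro style of the barrier code shown later in this deck; the Lock type and the use of the F/E bit as the lock token are assumptions for illustration.

    /* Sketch of a lock built on F/E bits, using the special-op macros from
       the barrier example below.  The F/E bit of the lock word acts as the
       token: full = free, empty = held. */

    typedef struct { int word; } Lock;

    void lock_init(Lock *l)
    {
        SET_STORE(1, &(l->word), 0);       /* mark full (free) without stalling */
    }

    void lock_acquire(Lock *l)
    {
        (void) SYNCH_LOAD(&(l->word), 0);  /* stalls until full, then sets empty */
    }

    void lock_release(Lock *l)
    {
        SET_STORE(1, &(l->word), 0);       /* set full again, releasing a waiter */
    }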
“Special” memory operations • Might want to have more general/flexible definition of memory operations?: • synchronization • speculation • transactional memory support • …
How to simulate • Can define new loads and stores in TIE • Need to tell the memory system that an operation is special: • use address bits: • change address generation for “special” instructions • limits the address space available to the application • use TIEWire to define new interface wires: • every new TIEWire interface is simulated through a separate callback function
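A small sketch of the address-bit option, assuming the top address bits are reserved for the operation type; the exact bit positions and operation codes here are illustrative only.

    /* Illustrative address-bit encoding of special memory operations: the
       top bits of the effective address carry the operation type, which
       shrinks the address space usable by the application. */

    #define OP_SHIFT  28u
    #define OP_MASK   (0xFu << OP_SHIFT)
    #define OP_NORMAL 0x0u
    #define OP_SYNCH  0x1u
    #define OP_FUTURE 0x2u

    static unsigned decode_op(unsigned addr)   { return (addr & OP_MASK) >> OP_SHIFT; }
    static unsigned decode_addr(unsigned addr) { return addr & ~OP_MASK; }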
Store buffer • Stores are buffered in the store buffer and issued only after the commit point: • a following synch_load is posted and completed before the synch_store is even posted • Need to get around the store buffer • [Pipeline diagram: synch_store followed by synch_load, both through stages F1 F2 D E M1 M2 W; the store is posted only after it commits, while the load is posted and completed earlier]
Exceptions • A load may be issued and later squashed due to an exception: • e.g. register window overflow/underflow • an exception in a previous instruction • Must change visible state only after the instruction commits • But must stall when the instruction is issued
Synchronized load • [Pipeline diagram, stages F1 F2 D E M1 M2 W: in the E stage TIEWire1 carries addr, type, isStore, and tag; the memory model checks the F/E bit and decides whether to stall; the address is posted and XTMP_post returns the load data; in the W stage TIEWire2 carries the tag and the memory state (F/E bit) is changed]
Synchronized store • [Pipeline diagram, stages F1 F2 D E M1 M2 W: in the E stage TIEWire1 carries addr, type, isStore, and tag; the memory model checks the F/E bit and decides whether to stall; the address and data are posted; in the W stage TIEWire2 carries the tag and the memory state (data and F/E bit) is changed]
Transaction tag • Defined a new 3-bit tag register in TIE • Every special memory op: • increments the tag register • sends the tag value in both the E and W stages • The tag is used as an address in the circular transaction buffer • [Diagram: 8-entry circular buffer (tags 0–7); each entry tracks when the transaction is issued (post/TIEWire), committed, received by the ticker, and completed]
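A minimal sketch of that circular transaction buffer; the entry fields follow the labels on the slide (issued, committed, received, completed) and everything else, including the struct layout, is assumed for illustration.

    /* Circular transaction buffer indexed by the 3-bit TIE tag register. */

    #define TAG_ENTRIES 8     /* 3-bit tag -> 8 slots, reused circularly */

    typedef struct {
        unsigned addr;
        unsigned data;
        int      is_store;
        int      issued;      /* E-stage TIEWire / post seen */
        int      committed;   /* W-stage TIEWire seen */
        int      received;    /* ticker has picked the transaction up */
        int      completed;   /* memory access finished */
    } Transaction;

    static Transaction txn[TAG_ENTRIES];

    static Transaction *txn_for_tag(unsigned tag)
    {
        return &txn[tag & (TAG_ENTRIES - 1)];
    }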
Problem • Special memory operations are not atomic: • they stall in the E stage • they commit the memory state change in the W stage • Solution – a block bit with every word: • set the block bit to 1 in the E stage • any other transaction stalls if the block bit is 1 • reset the block bit to 0 in the W stage when the transaction is complete
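The block bit could be modeled per word roughly as below; this is a sketch only, with a hypothetical MemWord structure holding the data, the F/E bit, and the block bit.

    /* Sketch of per-word block-bit handling for non-atomic special ops. */

    typedef struct {
        unsigned data;
        int      full;      /* F/E bit */
        int      blocked;   /* 1 while a special op owns the word */
    } MemWord;

    /* E stage: try to reserve the word; returns 0 if the requester must stall. */
    int word_reserve(MemWord *w)
    {
        if (w->blocked) return 0;
        w->blocked = 1;
        return 1;
    }

    /* W stage: commit the state change and release the word. */
    void word_commit(MemWord *w, int new_full)
    {
        w->full    = new_full;
        w->blocked = 0;
    }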
Deadlock problem • The simulator can easily deadlock if there are back-to-back special ops • Common for –O2 optimized programs • Single-processor example (pipeline stage per cycle):

    cycle:                  1    2    3    4    5    6    7    8
    synch_store a2, addr    F1   F2   D    E    M1   M2   W
    synch_load  a3, addr         F1   F2   D    E    M1   M2   W

• the block bit is set for addr in cycle 4 • the synch_load is stalled in cycle 5 • the synch_store is also stalled because the pipeline is not elastic • the simulator deadlocks because the synch_store never commits
Multiprocessor example

    void MSlaveBarrier_SlaveEnter(MasterSlaveBarrier *bar, int numProcs)
    {
        int arrived;   /* updated number of threads arrived at barrier */

        arrived = SYNCH_LOAD(&(bar->entered), 0);
        arrived++;
        SYNCH_STORE(arrived, &(bar->entered), 0);

        /* signal master if all slaves arrived */
        if (arrived == numProcs) {
            SET_STORE(numProcs, &(bar->masterWait), 0);
        }

        /* block until master releases barrier */
        SYNCH_LOAD(&(bar->slaveWait), 0);
    }

    MSlaveBarrier_SlaveEnter:
            .frame      a1, 32
            entry       a1, 32
            synch_load  a8, a2, 0
            addi.n      a10, a2, 4
            addi.n      a11, a2, 8
            addi.n      a8, a8, 1
            synch_store a8, a2, 0
            bne         a3, a8, .Lt_0_1
            set_store   a3, a11, 0
            synch_load  a9, a10, 0
            retw.n
    .Lt_0_1:
            addi.n      a13, a2, 4
            synch_load  a12, a13, 0
            retw.n
Current simulator status • Multi-core simulator with special memory ops: • running simple test programs • running radix benchmark from SPLASH-2: • hacked runtime/ANL from Vicky • no compiler optimizations (xt-xcc –g) • radix deadlocks when compiled with –O2 • because of back-to-back special memory operations
Current simulator status • Multiple-context simulator with special memory ops and fast context switch: • running simple test programs • radix benchmark is still being debugged • Implemented debugging features: • instruction tracing (using Tensilica client) • memory transaction tracing
What’s next • Desired simulator features: • caches and cache coherence • on-chip network and off-chip memory • speculation support (???) • transactional memory (???) • streaming/SIMD/vector-like mode (???) • Need to decide how to divide simulator functionality
Layers of functionality • Processor cores (Tensilica) + TIE extensions • XTMP API • Processor interface layer (callback functions, ticker, processor stalls, fast context switches, etc.) • ??? (interface yet to be defined) • Memory model (caches/cache coherence, on-chip network, off-chip memory, etc.)
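Purely as a strawman for the undefined “???” layer, the interface between the processor interface layer and the memory model might look something like this C header; every name and signature below is hypothetical.

    /* Strawman interface between the processor-interface layer and the
       memory model; all names are hypothetical. */

    typedef struct MemModel MemModel;   /* opaque: caches, network, DRAM, ... */

    typedef struct {
        unsigned addr;
        unsigned data;
        int      is_store;
        int      special_op;  /* e.g. synch_load / synch_store / future_load */
        unsigned tag;         /* 3-bit transaction tag */
    } MemRequest;

    /* Issue a request on behalf of a processor; 0 means "stall this cycle". */
    int  mem_issue(MemModel *m, int proc_id, const MemRequest *req);

    /* Advance the whole memory system by one cycle (single-ticker approach). */
    void mem_tick(MemModel *m);

    /* Notify the model that an instruction committed or was squashed. */
    void mem_commit(MemModel *m, int proc_id, unsigned tag, int squashed);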
Layers of functionality • Advantages: • more modular and manageable simulator • can divide simulator development between people • Disadvantages: • interface imposes limitations • hard to define interface for yet undefined functionality: • speculation • streaming mode
To do list • Tensilica: • add “no memory modeling” feature (soon) • add callback functions for instruction commit/squash (soon) • fix memory leak in the simulator (soon) • fix the FPU for the 7-stage pipeline (end of June) • double-precision floating point is not supported in hardware
To do list • Debug the multiple-context version (Alex) • Install the first version of the simulator at Stanford (Alex) • Runtime/pthreads/ANL (Vicky): • need to set up the stack properly in _start • Run more pthread/ANL applications (Alex, Vicky, Amin, John, …) • Start design of the memory system model
Summary • Next generation of Tensilica system • Simulator infrastructure • Special memory operation issues • Current simulator status • Plans