RAMP Gold : An FPGA-based Architecture Simulator for Multiprocessors

Zhangxi Tan, Andrew Waterman, David Patterson, Krste Asanovic • Parallel Computing Lab, EECS, UC Berkeley • March 2010

Outline • Overview • RAMP Gold HW Architecture and Implementation • RAMP Gold Software Infrastructure • Usage Case and Live Demo • Future work

Overview • Purpose of RAMP Gold • An FPGA-based simulator of a shared-memory multicore target for Parlab • Use cases: architecture, OS and applications • Highlights of RAMP Gold • Works on a $750 Xilinx XUP V5 board • Written in SystemVerilog, no special CAD tools required, works with standard FPGA CAD flows (Synplify/ISE/Modelsim) • Two orders of magnitude faster than Simics+GEMS • Runtime-configurable parameters without resynthesis • Full RTL verification environment and software infrastructure • BSD and GNU licenses

Simulation Jargon • Target vs. Host • Target: the system/architecture being simulated, e.g. a SPARC v8 CMP • Host: the platform on which the simulator runs, e.g. FPGAs • Functional model and timing model • Functional: computes the instruction result • Timing: determines how long the instruction takes on the target

RAMP Gold Overall Setup • Both functional and timing models run on the FPGA • App server: controls the simulation and services syscalls/I/O

Target Machine Template • 64-core SPARC v8 shared-memory machine • Configurable two-level cache + multichannel DRAM

RAMP Gold Performance vs. Simics • PARSEC parallel benchmarks running on a research OS • >250x faster than a full-system simulator for a 64-core multiprocessor target

RAMP Gold Model Key Concepts [Figure: functional model pipeline (arch state) and timing model pipeline (timing state)] • Decoupled functional/timing model, both in hardware • Enables many FPGA-fabric-friendly optimizations • Increases modeling efficiency and module reuse • Host multithreading of both functional and timing models • Hides emulation latencies and improves resource utilization • Time-multiplexing effects are patched up by the timing model
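A minimal software sketch of the decoupled functional/timing idea, assuming a toy three-instruction ISA ("addi", "store", "load") and a fixed memory latency. All class and function names here are illustrative and are not taken from the RAMP Gold sources. The point it tries to show is that the functional model alone decides results, while the timing model alone decides target cycles.

```python
from dataclasses import dataclass


@dataclass
class RetiredInst:
    """What the functional model reports about one retired instruction."""
    is_mem: bool   # True for a load/store
    addr: int = 0  # memory address (only meaningful when is_mem is True)


class FunctionalModel:
    """Computes instruction results only; has no notion of target time."""

    def __init__(self, program):
        self.program = program  # list of (opcode, argument) pairs
        self.pc = 0
        self.acc = 0            # single accumulator register (toy ISA)
        self.mem = {}

    def step(self):
        op, arg = self.program[self.pc]
        self.pc += 1
        if op == "addi":
            self.acc += arg
            return RetiredInst(is_mem=False)
        if op == "store":
            self.mem[arg] = self.acc
            return RetiredInst(is_mem=True, addr=arg)
        if op == "load":
            self.acc = self.mem.get(arg, 0)
            return RetiredInst(is_mem=True, addr=arg)
        raise ValueError(f"unknown opcode: {op}")


class TimingModel:
    """Charges target cycles; never touches architectural state."""

    def __init__(self, mem_latency):
        self.mem_latency = mem_latency
        self.cycles = 0

    def account(self, inst):
        # Assumed toy policy: 1 cycle per instruction, except loads/stores.
        self.cycles += self.mem_latency if inst.is_mem else 1


def run(program, mem_latency):
    """Run a straight-line program; return (final accumulator, target cycles)."""
    fm, tm = FunctionalModel(program), TimingModel(mem_latency)
    for _ in program:
        tm.account(fm.step())
    return fm.acc, tm.cycles
```

Running the same program with two different `mem_latency` values gives the same architectural result and different cycle counts, which is the property that lets timing parameters change without touching the functional side.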

Host multithreading • Example: simulating four independent CPUs [Figure: target model with CPU0 to CPU3, mapped onto a single functional CPU model on FPGA with per-thread PC and GPR copies, thread select, I$, decode, ALU and D$]
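Host multithreading can be illustrated with a small sketch: one host pipeline holds a private PC and register file per target CPU and serves one target CPU per host cycle. The round-robin thread select, the toy "add immediate" instruction format, and every name below are assumptions made for illustration only.

```python
class HostPipeline:
    """One host pipeline time-multiplexed across several target CPUs."""

    def __init__(self, programs, nregs=4):
        self.programs = programs  # one instruction list per target CPU
        n = len(programs)
        self.pc = [0] * n                          # per-thread PC copies
        self.gpr = [[0] * nregs for _ in range(n)]  # per-thread register files
        self.next_thread = 0                       # thread-select state
        self.trace = []                            # which thread used each host cycle

    def host_cycle(self):
        # Thread select: pick the next target CPU in round-robin order.
        tid = self.next_thread
        self.next_thread = (tid + 1) % len(self.programs)
        prog = self.programs[tid]
        if self.pc[tid] < len(prog):
            rd, imm = prog[self.pc[tid]]  # toy instruction: gpr[rd] += imm
            self.gpr[tid][rd] += imm
            self.pc[tid] += 1
            self.trace.append(tid)


# Four independent target CPUs, each running a two-instruction program.
programs = [[(0, tid + 1), (1, 10)] for tid in range(4)]
host = HostPipeline(programs)
for _ in range(8):
    host.host_cycle()
```

After eight host cycles each of the four target CPUs has advanced by two instructions, and no target CPU ever sees another's registers.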

Functional Model • Full SPARC v8 support (FP, MMU, I/Os) • Passes the SPARC v8 certification test • Runs Linux and a research OS

Timing Model • Simple CPU timing model but detailed memory timing model (i.e. every instruction takes 1 cycle except LD/ST) • Cache models: only tags are stored in BRAMs • Runtime-configurable parameters: associativity, size, line size, number of banks, latency, etc. • Models the 3 Cs (compulsory, capacity and conflict misses) but not the 4th C, coherence (coherence support soon) • DRAM model: bandwidth-delay pipe with optional QoS
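As a rough illustration of a tag-only cache timing model with runtime-configurable geometry, the sketch below keeps only tags (no data) and returns a latency per access. LRU replacement, the latency values, and all names are assumptions for the example; the slides do not state which replacement policy RAMP Gold uses.

```python
class TagOnlyCache:
    """Timing-only cache model: stores tags, never data."""

    def __init__(self, size_bytes, line_bytes, ways, hit_latency, miss_latency):
        self.line_bytes = line_bytes
        self.ways = ways
        self.num_sets = size_bytes // (line_bytes * ways)
        self.hit_latency = hit_latency
        self.miss_latency = miss_latency
        # One tag list per set, ordered from least to most recently used.
        self.sets = [[] for _ in range(self.num_sets)]

    def access(self, paddr):
        """Return the modeled latency (in target cycles) for one access."""
        line = paddr // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        tags = self.sets[index]
        if tag in tags:
            tags.remove(tag)  # hit: move tag to most-recently-used position
            tags.append(tag)
            return self.hit_latency
        if len(tags) == self.ways:
            tags.pop(0)       # miss in a full set: evict the LRU tag (assumed policy)
        tags.append(tag)
        return self.miss_latency


# 256-byte, 2-way cache with 32-byte lines: 4 sets.
cache = TagOnlyCache(size_bytes=256, line_bytes=32, ways=2,
                     hit_latency=1, miss_latency=20)
latencies = [cache.access(a) for a in (0, 4, 128, 256, 0)]
```

Because the geometry is passed in as plain parameters, a different configuration only needs a new object, which mirrors the "no resynthesis" property at a very small scale.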

Debugging and Simulation Configuration • Frontend app server • Reliable Gigabit Ethernet connection to the FPGA • Periodically polls the simulator to serve I/O requests • Transparent to the target (no side effect on simulated timing) • 64-bit hardware performance counters to collect runtime stats • 657 counters in the timing model + 10 host counters • Can be read by either target apps or the app server • Ring interconnect for counters (easy to add and remove)

Host Performance • Timing synchronization is the largest overhead • Tiny host $/TLBs are not on the performance critical path • Host DRAM bandwidth is not a problem (<15% utilization)

Implementation • Single FPGA: 64-core @ 90 MHz, 2 GB DDR2 SODIMM • ~2 hours CAD turnaround time on a mid-range workstation • BRAM bound, but has logic resources to fit more pipelines

Software Tools • SPARC cross compiler with binutils/gcc/glibc • Supports most POSIX programs • Static & dynamic linking support • Built from GNU GCC (4.3.2) • Full software and HW debugging suite • Low-cost XUP boards sometimes do not work out of the box • FPGA CAD tools are very bad

Target Software • Proxy Kernel: single-protection-domain application host • Runs programs statically linked against glibc • Forwards I/O system calls to the x86/Linux host PC • Presents a simple “hard-threads” API for multithreaded programs • Very easy to modify • ROS: UCB’s manycore research OS • Provides multiprogramming support • Sufficiently POSIX compliant to run many programs • Much easier to modify than Linux • Runs on more than 64 cores

Infrastructure

Case studies • Parallel application studies for software programmers • Parallel OS for system researchers • Adding hardware performance counters for advanced debugging • Micro-architecture studies: adding features and modifying existing timing models • Adding new instructions: changing the functional model

Appserver 101 • Appserver command-line options:
    Usage: sparc_app [-f<conf>] [-p<nprocs>] [-s] <htif> <kernel> [binary] [args]
• Platform memory tests: • App server memory test:
    sparc_app -p64 hw memtest none
• Proxykernel memory test (stress test):
    sparc_app -p64 hw path/kernel.ramp path/memtest

For application programmers • Main usage scenario: use the runtime-configurable timing model without any FPGA hardware change • Use ‘hard-threads’ to write a parallel ‘hello world’ program running on the proxykernel • Compile the program using the cross toolchain:
    sparc-ros-gcc -o hello hello.cpp -lhart
• Measure performance using performance counters:
    sparc_app -s1 -p64 hw kernel.ramp hello
• Change the target machine configuration on the fly and rerun the experiment: edit file ‘appserver.conf’

For OS Developers • Usage model similar to that for application programmers • The proxykernel is a good starting point for learning the bootstrapping process • ROS is a fully functional kernel • Demo: boot the ROS kernel using the appserver:
    sparc_app -p64 -fappserver_ros.conf hw your_kernel none

Adding Hardware Performance Counters • Two types of counter interface • Global counter: <EN> • Local (per core) counter: <TID, EN> • Modify the Verilog file to add more counters on the ring:
    perfctr_io #(.NLOCAL(num_of_local), .NGLOBAL(num_of_global))
      gen_tm_counter(.gclk, .rst,
                     .bus_out(io_out), .bus_in(io_in), .bus_sel(),   // IO bus interface
                     .global_inc(global_counter_inc),
                     .local_inc(local_counter_inc),
                     .local_tid(local_counter_tid));
• Modify the app server to support more counters: add your counter definition in ‘TestAppServer/perfcnt.h’
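The two counter interfaces can be read as follows: a global counter needs only an enable, while a local counter needs an enable plus the thread ID that selects which per-core count to bump. The sketch below is a behavioral guess at that semantics, not a model of the `perfctr_io` module itself. The class, method names and the per-clock calling convention are hypothetical, and the 64-bit default width comes from the counter width quoted on the debugging slide.

```python
class PerfCounters:
    """Behavioral sketch of global <EN> and local <TID, EN> counters."""

    def __init__(self, nlocal, nglobal, nthreads, width=64):
        self.mask = (1 << width) - 1  # counters wrap at 2**width
        self.global_counts = [0] * nglobal
        # Each local counter keeps one count per target thread (core).
        self.local_counts = [[0] * nthreads for _ in range(nlocal)]

    def clock(self, global_inc, local_inc, local_tid):
        """Apply one cycle's worth of increment signals."""
        for i, en in enumerate(global_inc):
            if en:
                self.global_counts[i] = (self.global_counts[i] + 1) & self.mask
        for i, en in enumerate(local_inc):
            if en:
                tid = local_tid[i]  # TID selects which per-core count to bump
                self.local_counts[i][tid] = (self.local_counts[i][tid] + 1) & self.mask


counters = PerfCounters(nlocal=2, nglobal=1, nthreads=4)
counters.clock(global_inc=[1], local_inc=[1, 0], local_tid=[3, 0])
counters.clock(global_inc=[0], local_inc=[1, 1], local_tid=[3, 2])
```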

Adding Features to Timing Models • Timing models are much simpler than functional models • ~1,000 LoC vs. 35,000 LoC • Example 1: Changing the cache replacement policy • Example 2: Adding memory QoS • Lee et al. “Globally-Synchronized Frames for Guaranteed Quality-of-Service in On-Chip Networks”, ISCA’08 • ~100 lines of code added in the timing model • A new DRAM model • Several memory-mapped registers added on the functional I/O bus for configuration purposes

Adding New Instructions • Adding instructions to a feed-through pipeline is straightforward • FPU instructions were added as “new” instructions within a week • Including: new register file, decode, exception/commit and microcode • Example: Adding new atomic instructions through microcode • 4 global scratchpad registers (not visible to the programmer) in the main integer register file for temporary storage • Two write ports to support scratchpad register updates along with architectural register changes

Steps of Adding Instructions • Add proper decoding logic in function “decode_dsp_add_logic” of “regacc_dma.sv” • Update the writeback/exception stage in file “exception_dma.sv” to trap to microcode • Edit function “decode_microcode_mode” to trap to microcode • Edit function “rd_gen” to write the address to scratch register 0 and load the data into scratch register 1 • Edit microcode ROM ‘Microcode.sv’:
    //----------SWAP*-------
    9: begin
      uco.uend    = '0;
      uco.cwp_rs1 = '0;
      uco.cwp_rd  = '0;
      uco.inst    = {LDST, 5'b0, ST, REGADDR_SCRATCH_0 | UCI_MASK, 1'b1, 13'b0};
    end
    10: begin
      uco.uend    = '1;
      uco.cwp_rs1 = '0;
      uco.cwp_rd  = '0;
      uco.inst    = {FMT3, 5'b0, IADD, REGADDR_SCRATCH_1 | UCI_MASK, 1'b1, 13'b0};
    end
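One plausible reading of the SWAP sequence, sketched behaviorally: the first pass through the pipeline saves the address in scratch register 0 and the loaded data in scratch register 1, micro-op 9 stores the destination register to the saved address, and micro-op 10 adds zero to scratch register 1 to move the old memory value into the destination register. This interpretation of the microcode fields is an assumption, as are the function name and the dictionary-based register and memory models.

```python
def swap_via_microcode(regs, mem, rd, addr):
    """Exchange regs[rd] with mem[addr] using two scratch registers.

    Behavioral sketch only; it does not model pipeline timing or traps.
    """
    scratch = [0, 0]

    # First pass (assumed "rd_gen" behavior): save the address and the loaded data.
    scratch[0] = addr
    scratch[1] = mem.get(addr, 0)

    # Micro-op 9 (ST): store the destination register to the saved address.
    mem[scratch[0]] = regs[rd]

    # Micro-op 10 (IADD with immediate 0): move the old memory value into rd.
    regs[rd] = scratch[1] + 0


regs = {1: 7}
mem = {64: 99}
swap_via_microcode(regs, mem, rd=1, addr=64)
```

The scratch registers matter because the store in micro-op 9 would otherwise overwrite the only copy of the old memory value before it reaches the destination register.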

Future work • Cache coherence models (soon) • Realistic interconnect model (soon) • Better CPU core model (next major version) • Support for other ISAs (next major version)

Further References • Research papers • Usage case: A Case for FAME: FPGA Architecture Model Execution, ISCA’10 • RAMP Gold design: RAMP Gold: An FPGA-based Architecture Simulator for Multiprocessors, DAC’10 • Beta release http://sites.google.com/site/rampgold

Backup Slides

Functional/Timing Model Interface
    // FM -> TM
    typedef struct {
      bit        valid;    // timing token between FM and TM
      bit [5:0]  tid;      // thread ID
      bit        run;      // cpu states
      bit        replay;   // this instruction needs to be replayed by FM
      bit        retired;  // retiring an instruction
      bit [31:0] inst;     // the instruction that was retired
      bit [31:0] paddr;    // load/store physical address
      bit [31:0] npc;      // PC of next fetched insn
    } tm_cpu_ctrl_token_type;

    // TM -> FM
    typedef struct {
      bit        valid;    // timing token between FM and TM
      bit [5:0]  tid;      // thread ID
      bit        run;      // run bit
    } tm2cpu_token_type;
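To make the size and shape of the FM -> TM token concrete, the sketch below mirrors the field names and widths from the struct above and packs them into one integer. The packing order (first field in the most significant bits) and the helper names are assumptions for illustration; the struct on the slide is not declared packed, so the hardware may lay the fields out differently.

```python
# (name, width in bits), in the order the fields appear in tm_cpu_ctrl_token_type.
FM2TM_FIELDS = [
    ("valid", 1), ("tid", 6), ("run", 1), ("replay", 1),
    ("retired", 1), ("inst", 32), ("paddr", 32), ("npc", 32),
]

TOKEN_BITS = sum(width for _, width in FM2TM_FIELDS)


def pack(token, fields=FM2TM_FIELDS):
    """Pack a dict of field values into one integer, first field in the MSBs."""
    word = 0
    for name, width in fields:
        value = token[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word, fields=FM2TM_FIELDS):
    """Inverse of pack(): split an integer back into named fields."""
    token = {}
    for name, width in reversed(fields):
        token[name] = word & ((1 << width) - 1)
        word >>= width
    return token


example = {"valid": 1, "tid": 63, "run": 1, "replay": 0, "retired": 1,
           "inst": 0x01000000, "paddr": 0x00001000, "npc": 0x00000104}
round_trip = unpack(pack(example))
```

The 6-bit `tid` field is what ties the token format to the 64-core target: it can name exactly 64 threads.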
