GPU Functional Simulator Yi Yang yangyi@eecs.ucf.edu CDA 6938 term project, Orlando, April 20, 2008
Outline • Motivation and background • Software design • Implementation • Test cases • Future work
Motivation and background • Motivation • Better understanding of the GPU • Exploring improvements to the GPU architecture • Background • Two GPU manufacturers: NVIDIA and ATI • Similar programming models: block vs. group, shared memory vs. LDS • ATI uses a VLIW architecture • We want to support both
Software design • Programming Model Layer (PML) • Platform independent • Defines the abstract parts of the ISA and the registers • Implements the resources shared by both platforms: group, wavefront, … • Hardware Implementation Layer (HIL) • Implements the abstract parts of the PML for each platform • ATI • NVIDIA
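A minimal sketch of how this two-layer split might look in C++; the class names (Instruction, AtiAddInt, Thread) are illustrative rather than the project's actual types. The PML declares abstract interfaces, and a HIL back end such as ATI subclasses them.

```cpp
// Sketch only: hypothetical names illustrating the PML/HIL split.
class Thread;  // platform-independent per-thread resource (PML)

// Programming Model Layer: abstract, platform-independent ISA interface.
class Instruction {
public:
    virtual ~Instruction() = default;
    virtual void execute(Thread&) const = 0;  // each back end defines the semantics
};

// Hardware Implementation Layer: an ATI-specific instruction subclasses the
// PML interface and implements its own execution behavior.
class AtiAddInt : public Instruction {
public:
    void execute(Thread&) const override { /* e.g. R0.x = R1.x + R2.x */ }
};
```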
Programming Model Layer • The code parser builds the instruction list • Resources are allocated from the configuration file: groups, threads, shared memory, memory, wavefront schedule • The input stream is loaded from a txt file • The wavefront schedule executes the instruction list on the wavefronts • When an instruction executes on a thread, it updates the resources: the thread's registers, the group's shared memory, the GPUProgram's texture (global) memory • The output memory is saved to a txt file
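The overall flow above could be driven by something like the following sketch; the method names and file names are assumptions, and the bodies are stubbed out.

```cpp
// Sketch of the PML driver flow (illustrative names, stubbed bodies).
#include <string>

class GPUProgram {
public:
    void parse(const std::string&)      { /* code parser builds the instruction list */ }
    void configure(const std::string&)  { /* allocate groups, threads, shared memory, scheduler */ }
    void loadInput(const std::string&)  { /* fill texture/global memory from a txt file */ }
    void run()                          { /* wavefront schedule walks the instruction list */ }
    void saveOutput(const std::string&) { /* dump output memory back to a txt file */ }
};

int main() {
    GPUProgram prog;
    prog.parse("kernel.il");
    prog.configure("sim.cfg");
    prog.loadInput("input.txt");
    prog.run();
    prog.saveOutput("output.txt");
}
```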
Code Parser (HIL) • Reads the assembly and parses it into instructions • Instruction label number: unique # of the instruction • Stream core label: one of x, y, z, w, t • Opcode • Operands
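A rough sketch of parsing one assembly line (such as 3 t: F_TO_I ____, R0.x) into its label number, stream core, opcode, and operand strings; the structure and function names are hypothetical.

```cpp
// Sketch of tokenizing one ATI assembly line into its fields.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct ParsedInst {
    int number;                        // unique instruction label number
    std::string core;                  // stream core: x, y, z, w or t
    std::string opcode;                // e.g. F_TO_I
    std::vector<std::string> operands; // raw operand strings
};

ParsedInst parseLine(const std::string& line) {
    ParsedInst inst;
    std::istringstream in(line);
    in >> inst.number >> inst.core >> inst.opcode;
    if (!inst.core.empty() && inst.core.back() == ':') inst.core.pop_back();
    std::string tok;
    while (in >> tok) {                        // remaining tokens are operands
        if (tok.back() == ',') tok.pop_back(); // strip trailing commas
        inst.operands.push_back(tok);
    }
    return inst;
}

int main() {
    ParsedInst p = parseLine("3 t: F_TO_I ____, R0.x");
    std::cout << p.opcode << " has " << p.operands.size() << " operands\n";
}
```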
Operand (HIL) • General Purpose Register (GPR): 0 y: ADD ____, R0.x, -0.5 • Previous Vector (PV: x, y, z, w) and Previous Scalar (PS: t): 3 t: F_TO_I ____, R0.x 4 t: MULLO_UINT R1.z, 1, PS3 • Temporary Register: 3 t: RCP_UINT T0.x, R1.x • Constant Register: 1 z: AND_INT ____, R0.x, (0x0000003F, 8.828180325e-44f).x
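One way the operand classes listed above might be modeled; the type and field names below are assumptions, not the simulator's actual definitions.

```cpp
// Sketch of an operand model covering the register classes listed above.
enum class OperandKind {
    GPR,        // general purpose register, e.g. R0.x
    PrevVector, // PV: previous vector result (x, y, z, w)
    PrevScalar, // PS: previous scalar result (t)
    TempReg,    // temporary register, e.g. T0.x
    ConstReg,   // constant register or inline constant
    WriteMask   // "____": result discarded, only PV/PS are updated
};

struct Operand {
    OperandKind kind;
    int index;      // register number (R0 -> 0, T0 -> 0, ...)
    char component; // 'x', 'y', 'z', 'w' or 't'
    bool negate;    // e.g. -0.5 or a negated register source
};
```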
Instruction (HIL) • Format: opcode dst, src1, src2, … • e.g. ADD_INT R0.x, R1.x, R2.x • dst, src1, src2 are Operands • GPUProgram holds the instruction list • Each instruction implements its own execution • It receives a thread as parameter and executes on that thread • For example, ADD_INT R0.x, R1.x, R2.x: • the instruction reads the values of R1.x and R2.x from the thread • and writes the value of R1.x + R2.x back to the thread as R0.x
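A self-contained sketch of executing ADD_INT R0.x, R1.x, R2.x against a thread, assuming a simple name-keyed register interface on Thread (an illustration, not the simulator's actual API).

```cpp
// Sketch: an instruction reads its sources from the thread and writes its
// destination back to the thread.
#include <cstdint>
#include <map>
#include <string>

class Thread {
public:
    int32_t get(const std::string& reg) const {        // e.g. "R1.x"
        auto it = regs_.find(reg);
        return it == regs_.end() ? 0 : it->second;
    }
    void set(const std::string& reg, int32_t v) { regs_[reg] = v; }
private:
    std::map<std::string, int32_t> regs_;
};

// ADD_INT R0.x, R1.x, R2.x
struct AddIntInst {
    std::string dst, src1, src2;
    void execute(Thread& t) const {
        t.set(dst, t.get(src1) + t.get(src2));
    }
};
```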
Memory Handling (HIL) • Texture Memory • 0 SAMPLE R1, R0.xyxx, t0, s0 UNNORM(XYZW) • EXP_DONE: PIX0, R0 • Cache support (future work) • Global Memory • 6 RD_SCATTER R3, DWORD_PTR[0+R2.x], ELEM_SIZE(3) UNCACHED BURST_CNT(0) • 03 MEM_EXPORT_WRITE_IND: DWORD_PTR[0+R1.x].x___, R0, ELEM_SIZE(3) • Coalescing support: handled by the first thread (future work) • Text files are used for input and output
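A minimal sketch of text-file-backed memory as described above, assuming one value per line; the file layout and function names are assumptions.

```cpp
// Sketch of loading/saving simulator memory from/to text files.
#include <fstream>
#include <vector>

std::vector<float> loadMemory(const char* path) {
    std::vector<float> mem;
    std::ifstream in(path);
    float v;
    while (in >> v) mem.push_back(v);      // input stream from a txt file
    return mem;
}

void saveMemory(const char* path, const std::vector<float>& mem) {
    std::ofstream out(path);
    for (float v : mem) out << v << '\n';  // output memory back to a txt file
}
```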
Thread (PML) • Belongs to a Group • Holds Data Units (HIL) • 128 bits (x, y, z, w) + 32 bits (t) • Most resources have 4 components, e.g. registers • One thread processor is five-way and produces 5 outputs (x, y, z, w, t) • Holds the mapping table from registers (GPR, CR, TR) to Data Units
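A sketch of the data unit and register mapping described above, independent of the earlier instruction sketch; the names are illustrative.

```cpp
// Sketch of the per-thread data unit: a 128-bit vector part (x, y, z, w)
// plus a 32-bit scalar part (t).
#include <cstdint>
#include <unordered_map>

struct DataUnit {
    uint32_t x, y, z, w;  // results of the 4 vector lanes
    uint32_t t;           // result of the fifth (t) lane
};

class Thread {
public:
    // Map a register id (GPR/CR/TR number) to its backing DataUnit.
    DataUnit& reg(int id) { return regs_[id]; }
private:
    std::unordered_map<int, DataUnit> regs_;
};
```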
Wavefront (PML) • Holds the program counter • Holds the thread id list • Belongs to a Group
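A minimal sketch of the wavefront state listed above; the field names are assumptions.

```cpp
// Sketch of per-wavefront state.
#include <cstddef>
#include <vector>

struct Wavefront {
    std::size_t pc = 0;          // program counter into the instruction list
    std::vector<int> threadIds;  // threads that belong to this wavefront
    int groupId = -1;            // owning group
};
```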
Group (PML) • Holds threads • Holds wavefronts • Belongs to GPUProgram • Holds shared memory (PML) • Instructions access the shared memory through the Group • Example instructions (HIL): • 12 LOCAL_DS_WRITE (8) R0, STRIDE(16) SIMD_REL • 17 LOCAL_DS_READ R2, R2.xy WATERFALL
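A sketch of a Group owning its threads, wavefronts, and shared memory; LOCAL_DS-style instructions would go through accessors like these (all names are illustrative).

```cpp
// Sketch of the Group resource that backs shared (LDS) memory accesses.
#include <cstddef>
#include <cstdint>
#include <vector>

struct Thread;     // per-thread registers (see earlier sketch)
struct Wavefront;  // program counter + thread id list

class Group {
public:
    explicit Group(std::size_t sharedBytes) : shared_(sharedBytes / 4) {}
    uint32_t ldsRead(std::size_t idx) const     { return shared_[idx]; }  // LOCAL_DS_READ
    void ldsWrite(std::size_t idx, uint32_t v)  { shared_[idx] = v; }     // LOCAL_DS_WRITE
private:
    std::vector<uint32_t> shared_;        // shared (local data store) memory
    std::vector<Thread*> threads_;        // threads in this group
    std::vector<Wavefront*> wavefronts_;  // wavefronts in this group
};
```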
Wavefront Schedule (PML) • Current version (functional simulator) • Picks one instruction and lets all wavefronts execute that operation • For the timing simulator, scheduling will be decided by: • the hardware capacity and software requests • the static instruction list • the execution results
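A sketch of the current functional scheduling policy: each instruction is executed across every wavefront (and its threads) before moving to the next one. The types below are simplified stand-ins.

```cpp
// Sketch of lock-step functional scheduling over the instruction list.
#include <cstddef>
#include <vector>

struct Thread {};
struct Instruction { void execute(Thread&) const { /* update registers/memory */ } };
struct Wavefront { std::size_t pc = 0; std::vector<Thread*> threads; };

void runFunctional(const std::vector<Instruction>& insts,
                   std::vector<Wavefront>& wavefronts) {
    for (std::size_t pc = 0; pc < insts.size(); ++pc) {
        for (Wavefront& wf : wavefronts) {   // every wavefront executes the
            for (Thread* t : wf.threads)     // same instruction in turn
                insts[pc].execute(*t);
            wf.pc = pc + 1;
        }
    }
}
```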
GPUProgram (PML) • The code parser produces its instruction list • Loads the input stream from a txt file into memory • Allocates resources from the configuration file: groups, threads, shared memory, memory, wavefront schedule • The wavefront schedule executes the instruction list on the wavefronts • When an instruction executes on a thread, it updates the resources: the thread's registers, the group's shared memory, the GPUProgram's texture (global) memory • Saves the output memory to a txt file
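Complementing the earlier flow sketch, this sketch views GPUProgram as the owner of all simulation resources; the member names mirror the responsibilities above but are assumptions.

```cpp
// Sketch of GPUProgram as the top-level owner of simulation resources.
#include <cstdint>
#include <memory>
#include <vector>

struct Instruction {};        // produced by the code parser (see earlier sketch)
struct Group {};              // threads, wavefronts, shared memory
struct WavefrontSchedule {};  // drives execution of the instruction list

class GPUProgram {
private:
    std::vector<std::unique_ptr<Instruction>> instructions_;
    std::vector<std::unique_ptr<Group>> groups_;  // allocated from the config file
    std::vector<uint32_t> textureMemory_;         // loaded from the input txt file
    std::vector<uint32_t> globalMemory_;          // saved to the output txt file
    std::unique_ptr<WavefrontSchedule> schedule_;
};
```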
Test cases • Sum, division, subtraction, multiplication • Texture memory • Different data types (int, float, uint, int1, int4, …) • Fundamental ALU operations (+, -, *, /, shift, and, compare, cast) • domain_sum • Global memory reads and writes • Sum_share_memory • Shared memory reads and writes • Groups and wavefronts • Branch and loop: to be done • Constant buffer • Loop operations
Future work • Currently about 30 of 200 ATI instructions are supported • Support NVIDIA and refine the two-layer design • Timing simulator