Attila Research Group attila.ac.upc.edu Computer Architecture Department Univ Politècnica de Catalunya (UPC)
Attila Project • Started 2003 • Research on GPUs • Focus on the microarchitecture • Use real games as workloads • Analyze bandwidth/latency/threading tradeoffs • Spent large fraction of time developing tools • Currently three PhDs in progress • Funding from • CICYT / Ministry of Education, Spain (2) • Intel (1) • 2 Students spent 6 months with ATI
Attila Team • Faculty • Agustin Fernandez • 3 Ph.D. Students • Victor Moya -- Hired by Intel / VCG ’06 • Carlos González -- 6 months internship at ATI (Jun’07) • Jordi Roca -- 6 months internship at ATI (Jun’07) • Master Thesis • Chema Solis – DX9 Driver Development • Alumni • David Abella – DX9 Player and PIX reader • Christian Perez – Color Compression in Attila • Industrial Advisor • Roger Espasa, Intel VCG
Attila Facts • Simulation time • 1 frame @1280x1024 per hour • Lines of code • Simulator: 142697 lines • Library, driver and trace tools: 217266 lines • ACDL : 37791 lines • OpenGL : 35960 lines • D3D9: 17348 lines
Attila Publications
Conference Papers
• Workload Characterization of 3D Games. Jordi Roca, Victor Moya, Carlos González, Chema Solis, Agustín Fernández and Roger Espasa. IEEE International Symposium on Workload Characterization (IISWC-2006), pp. - , January 2006.
• ATTILA: A Cycle-Level Execution-Driven Simulator for Modern GPU Architectures. Víctor Moya, Carlos González, Jordi Roca, Agustín Fernández and Roger Espasa. IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2006), March 2006.
• Shader Performance Analysis on a Modern GPU Architecture. Víctor Moya, Carlos González, Jordi Roca, Agustín Fernández and Roger Espasa. The 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-38), November 2005.
• A Single (Unified) Shader GPU Microarchitecture for Embedded Systems. Víctor Moya, Carlos González, Jordi Roca, Agustín Fernández and Roger Espasa. 2005 International Conference on High Performance Embedded Architectures & Compilers (HiPEAC 2005), November 2005.
Master Thesis
• Caracterización e implementación de algoritmos de compresión en la GPU ATILA (Text in Spanish). Christian Perez. Master Thesis for the Graduate Studies, January 2008.
• Extensión a Direct3D del driver de un simulador de GPU (Text in Spanish). Chema Solis. Master Thesis for the Graduate Studies, July 2007.
• Librería Direct3D (Text in Catalan). David Abella. Master Thesis for the Graduate Studies, July 2007.
• Shader generation and compilation for a programmable GPU (Text in Spanish). Jordi Roca. Master Thesis for the Graduate Studies, July 2005.
• Support tools for a 3D graphics processor simulation framework (Text in Spanish). Carlos González. Master Thesis for the Graduate Studies, June 2004.
Outline • Attila Tracing Environment • Attila Architecture & Simulator • Current Research • Shaders • Memory Hierarchy • Micropolygons • DX9 Driver Development
Supported workloads Doom 3 UT2004 Quake 4 Riddick Half Life 2 Prey and upcoming D3D games …
Collect Verify Simulate Analyze [Diagram of the tool flow. Components shown: OGL/D3D App; OGL/D3D Capturer or Microsoft PIX Capturer; Trace; API Stats; OGL/D3D Player or Attila PIX Player; Vendor OGL/D3D Driver on ATI R600/NVIDIA G80 (twice); ATTILA OGL/D3D Driver; ATTILA Simulator; µ-Arch Statistics; Signal Traffic; Internal traces (mem, $, …); Signal Trace Visualizer (detailed cycle-to-cycle visualization); three Framebuffers joined by two CHECK steps.]
Collect Verify Simulate Analyze • API Capturers • Capture API calls from a real game • Gather API level statistics
Collect Verify Simulate Analyze • API Players • Trace checking/integrity • Batch-to-batch playing (helps debug)
Collect Verify Simulate Analyze • Simulation • Attila Drivers • AOGL (90%) • AD3D9 (60%) • Attila Simulator • Detailed cycle-to-cycle simulation • 20 Boxes modeling 100-deep pipeline • Execute@Execute: Functionality embedded at each pipeline stage
Collect Verify Simulate Analyze • Simulation output • Micro-architectural statistics • Traffic for cache, mem, … • Signal trace (input for STV tool) • Debug simulation performance
Attila Drivers • DirectX9 driver • About 50 calls supported. • 60% API functionality. • OpenGL driver • 200 API calls supported. • 80% OpenGL 2.0 fixed functionality [Diagram: Attila OpenGL Driver (GLLIB) and Attila DX9 Driver (D3DLIB), HAL, ATTILA Architecture]
Unified Driver Architecture • Currently stalled due to lack of resources • Runs basic traces • Non-textured torus with simple vtx shader. [Diagram: AOGL*, AGL/ES, ADX9*, ADX10, AREY, ACDLX, ACDL, HAL, ATTILA Architecture]
Outline • Attila Tracing Environment • Attila Architecture & Simulator • Current Research • Shaders • Memory Hierarchy • Micropolygons • DX9 Driver Development
Attila Architecture [Diagram blocks: Vertex Fetch, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Scheduler, Distributor, Shader (x4), ROP (x4), Memory Controller (x4)] • Unified shaders, multithreaded … • GDDR4 detailed protocol, selectable memory schedulers…
Attila Simulator Implementation Using Boxes & Signals [Diagram boxes: Command Processor, STREAMER/VERTEX FETCH (Streamer Fetch, Streamer Loader, Streamer Output Cache, Streamer Commit), Primitive Assembly, Clipper, Triangle Setup, Fragment Generator, Hierarchical Z, Z Stencil Test, Interpolator, Fragment FIFO, SHADER (Shader Fetch, Shader Decode Execute), Texture Unit, Color Write, DAC, Memory Controller] • Data-driven & cycle-accurate
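The boxes-and-signals idea can be illustrated with a toy model. This is only a sketch under stated assumptions, not the ATTILA source: the class names `Signal`, `Producer`, `Consumer` and the helper `simulate` are hypothetical, and a signal is modeled as a wire with a fixed latency that carries data from one box to the next while every box is clocked once per cycle.

```python
from collections import deque


class Signal:
    """Fixed-latency wire between two boxes (hypothetical, for illustration)."""

    def __init__(self, name, latency):
        self.name = name
        self.latency = latency
        self._in_flight = deque()  # entries are (arrival_cycle, payload)

    def write(self, cycle, payload):
        # Data written at `cycle` becomes visible `latency` cycles later.
        self._in_flight.append((cycle + self.latency, payload))

    def read(self, cycle):
        # Return every payload whose arrival cycle has been reached.
        ready = []
        while self._in_flight and self._in_flight[0][0] <= cycle:
            ready.append(self._in_flight.popleft()[1])
        return ready


class Producer:
    """Box that emits one pending item per cycle on its output signal."""

    def __init__(self, out, items):
        self.out = out
        self.pending = deque(items)

    def clock(self, cycle):
        if self.pending:
            self.out.write(cycle, self.pending.popleft())


class Consumer:
    """Box that records the cycle at which each item arrives."""

    def __init__(self, inp):
        self.inp = inp
        self.received = []

    def clock(self, cycle):
        for payload in self.inp.read(cycle):
            self.received.append((cycle, payload))


def simulate(boxes, cycles):
    # Cycle-by-cycle loop: every box is clocked once per simulated cycle.
    for cycle in range(cycles):
        for box in boxes:
            box.clock(cycle)


def run_demo():
    wire = Signal("setup_to_fraggen", latency=3)  # hypothetical signal name
    producer = Producer(wire, ["tri0", "tri1"])
    consumer = Consumer(wire)
    simulate([producer, consumer], cycles=6)
    return consumer.received
```

Because boxes communicate only through signals with a latency of at least one cycle, the order in which boxes are clocked within a cycle does not change the result in this toy model.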
Statistics – High Level • API level • µ-arch level • “Workload Characterization of 3D Games”, IEEE International Symposium on Workload Characterization (IISWC 2006)
Statistics – Zooming In [Chart labels: Stencil pass and Shading pass, for Light 0 and Light 1] • Fine-grained stats sampled at configurable intervals, e.g. every 100, 1K, 10K or 100K execution cycles.
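Sampling statistics per fixed window of cycles can be sketched as follows. This is a minimal illustration, not the simulator's statistics code: the function `window_stats`, the event tuple layout, and the counter names are assumptions made for the example.

```python
from collections import defaultdict


def window_stats(events, window):
    """Sum counters per window of `window` cycles.

    `events` is an iterable of (cycle, stat_name, amount) tuples
    (a hypothetical layout chosen for this sketch).
    Returns {window_index: {stat_name: total}} ordered by window index.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for cycle, name, amount in events:
        totals[cycle // window][name] += amount
    return {w: dict(stats) for w, stats in sorted(totals.items())}


# Hypothetical event stream: fragments shaded and texture requests issued.
demo_events = [
    (5, "fragments", 4),
    (120, "fragments", 8),
    (150, "tex_requests", 2),
    (1999, "fragments", 1),
]
demo_by_100 = window_stats(demo_events, window=100)
demo_by_1k = window_stats(demo_events, window=1000)
```

A smaller window exposes short phases, such as a stencil pass followed by a shading pass, that a per-frame total would average away.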
Outline • Attila Tracing Environment • Attila Architecture & Simulator • Current Research • Shaders • Memory Hierarchy • Micropolygons • DX9 Driver Development
GPU Memory Hierarchy Optimizations Carlos González cgonzale@ac.upc.edu
Previous Work • Attila’s initial Boxes & Signals framework • Tracing Framework • GLInterceptor & GLPlayer Tools • OpenGL Driver for Attila • Signal Trace Visualizer tool • New highly detailed Memory Controller for Attila • Internship at ATI (6 months, ’07) • Work mainly focused on the MC block • Analysis of bandwidth and latency by means of simulation techniques • Some contributions to the initial system • Mechanisms to pinpoint sources of latency and analyze bandwidth over time slices
Remarks on today’s GPUs • Tremendous bandwidth available • Core 2: 12 GB/sec vs. NVIDIA G80: > 100 GB/sec • But… • Dozens of clients access memory simultaneously • Imbalanced and inefficient scheduling of memory transactions can lead to poor performance • Workload imbalance: total available BW decreases • Inefficient scheduling: latency increases (DDR protocol overhead) • Overall performance degradation
Thesis Goals • Optimize bank mapping and load balancing among memory channels. • Also, propose multiple separate address spaces (per client) • Propose efficient memory controller scheduling algorithms • Also: • Measure the DRAM chip power consumption of our proposals • Propose new cache hierarchies for ROP and Texture units • Research interconnection topologies
Some experiments… [Chart values: 17.1%, 17.8%, 17.8%, 17.6%, 17.1%, 13.6%]
Channel Interleaving Analysis • Some config. parameters of the experiment • 8 channels of 32 bits • 8 banks per 32-bit IO chip • Bank interleaving fixed to 256 • 4 unified shaders (4x)
Memory Scheduling Analysis • Some config. parameters of the simulation • 4 channels of 64 bits • 8 banks per 32-bit IO chip • Channel interleaving = 256 bytes • Bank interleaving = 1024 bytes • 4 unified shaders (4x) • Texture cache line (L1) = 64 bytes • Texture cache ways (L1) = 16 • Texture cache lines (L1) = 16 • Color and Zstencil caches: • 4 ways • line size = 256 bytes • 16 cache lines
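To make the interleaving parameters concrete, the sketch below maps a byte address to a (channel, bank) pair using the values listed for the memory scheduling experiment (4 channels, 256-byte channel interleaving, 8 banks, 1024-byte bank interleaving). The mapping function itself is an assumption chosen for illustration; the slides do not state the exact address decoding used in the simulator, and `map_address` and `channel_shares` are hypothetical names.

```python
def map_address(addr, channels=4, channel_interleave=256,
                banks=8, bank_interleave=1024):
    """Map a byte address to (channel, bank) with simple modulo interleaving.

    Assumed scheme: consecutive `channel_interleave`-byte chunks rotate
    across channels; inside a channel, consecutive `bank_interleave`-byte
    chunks of the channel-local address rotate across banks.
    """
    chunk = addr // channel_interleave
    channel = chunk % channels
    # Address as seen inside the selected channel.
    local = (chunk // channels) * channel_interleave + addr % channel_interleave
    bank = (local // bank_interleave) % banks
    return channel, bank


def channel_shares(addresses, channels=4, **kwargs):
    """Fraction of accesses that land on each channel (load balance check)."""
    counts = [0] * channels
    for addr in addresses:
        channel, _ = map_address(addr, channels=channels, **kwargs)
        counts[channel] += 1
    total = sum(counts)
    return [count / total for count in counts]


# A sequential stream spreads evenly; a 1024-byte stride hits one channel only.
sequential = channel_shares(range(0, 4096, 64))
strided = channel_shares(range(0, 8192, 1024))
```

Under this assumed mapping, a stride equal to the number of channels times the channel interleaving sends every access to the same channel, which is the kind of workload imbalance that reduces the usable bandwidth.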
Micropolygon Rendering Jordi Roca jroca@ac.upc.edu
Past work • OpenGL Fixed Function to ARB vp/fp 1.0 translator. • Workload Characterization of 3D Games (IISWC’06): • Extensive analysis of current games in terms of both API call and µarchitectural level stats. • Multi-GPU performance evaluation project (during the 2007 internship at ATI): • Hybrid SFR/AFR modes. • Alternatives for RTT surface synchronization. • Scaling of current PCIe BW. (A related paper has been submitted to IISWC 2008.)
Micropolygon rendering • Understanding and characterizing the pipeline backend imbalance due to very small polygons. • Newer games tend to render outdoor scenes, thus projecting polygons only a few pixels in size. • Synthetic micropolygon test: • Fills the screen with 1-pixel aligned quads: • Raster Input: 1 triangle/clock • Raster Output: 15/16 empty slots/clock (high-end cards).
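The 15/16 figure follows from simple arithmetic, sketched below. The backend width of 16 fragment slots per clock is an assumption inferred from the "15/16 empty slots" number, as is the assumption that each micropolygon triangle covers one pixel; `backend_empty_fraction` is a hypothetical helper.

```python
from fractions import Fraction


def backend_empty_fraction(triangles_per_clock, fragments_per_triangle,
                           slots_per_clock):
    """Fraction of backend fragment slots left empty each clock.

    The rasterizer produces triangles_per_clock * fragments_per_triangle
    fragments per clock, capped by the number of slots the backend accepts.
    """
    produced = triangles_per_clock * fragments_per_triangle
    used = min(produced, slots_per_clock)
    return Fraction(slots_per_clock - used, slots_per_clock)


# Micropolygon case: 1 triangle/clock, 1 covered pixel each, 16 slots (assumed).
micro = backend_empty_fraction(1, 1, 16)
# Large-triangle case for comparison: coverage saturates the backend.
large = backend_empty_fraction(1, 256, 16)
```

With one single-pixel triangle per clock the backend uses 1 of its 16 assumed slots, so the setup rate rather than the fragment rate limits throughput.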
Research on: • Proposal #1: µpolygon grid traversal scheme: • An alternative rasterization path to detect and efficiently traverse grids of adjacent pixel-size primitives: • Fill backend slots by combining fragments of different primitives. • Reuse triangle setup and traversal computations for pixel-proximate primitives. • Proposal #2: Dynamic balancing of rasterization workload: • Assign & schedule shader threads for rasterization.
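The benefit of combining fragments from different primitives can be shown with a toy occupancy count. This is not the proposed traversal scheme itself, only an illustration under assumptions: the backend is assumed to process fragments in 2x2 pixel groups, fragments are (primitive_id, x, y) tuples, and the function names are hypothetical.

```python
def groups_per_primitive(fragments):
    """Count 2x2 groups when each group may hold only one primitive."""
    return len({(prim, x // 2, y // 2) for prim, x, y in fragments})


def groups_merged(fragments):
    """Count 2x2 groups when fragments of different primitives may share one."""
    return len({(x // 2, y // 2) for _, x, y in fragments})


def occupancy(fragments, group_count):
    """Fraction of the 4 slots per group that carry a real fragment."""
    return len(fragments) / (4 * group_count)


# A 4x4 grid of pixel-size primitives, one fragment per primitive.
grid = [(y * 4 + x, x, y) for y in range(4) for x in range(4)]
separate = occupancy(grid, groups_per_primitive(grid))
merged = occupancy(grid, groups_merged(grid))
```

In this toy case each primitive alone fills one slot in four, while merged groups are fully occupied.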
DX9 Driver Development Chema Solís csolis@ac.upc.edu
Project target • The goal is to use D3D9 games as workloads for the ATTILA GPU simulator. • Two main tasks: • Trace the D3D9 calls executed by the games. • Build a D3D9 driver on top of the GPU simulator. [Diagram: D3D application, D3D9 Trace, Microsoft D3D9, ATTILA D3D9 driver]
PixRun Player • Executes traces of D3D9 calls captured by Microsoft PIX. • Analyzes how the game uses D3D9.
D3D9 Driver • D3D9 functionality is being added progressively. • The driver is close to supporting commercial games.
Unified Shader Architecture Victor Moya vmoya@ac.upc.edu
Unified Shader Architecture • Evaluated the performance of a unified vertex and fragment shader architecture on legacy applications • Evaluated area vs. performance • Evaluated the performance of implementing Triangle Setup on the shader for embedded GPU architectures • Evaluated the bottlenecks of GPU architectures with a high shader ALU to texture ratio.
Current Research • Evaluate thread and resource scheduling in a unified shader architecture • Implement blending on the shader