Graphics on Key
by Eyal Sarfati and Eran Gilat
Supervised by Prof. Shmuel Wimer, Amnon Stanislavsky and Mike Sumszyk

Overview
• Motivation
• Algorithm
• Improvements
• Software simulation
• GPU VLSI Design
• GoK system design
• Challenges and contributions
• Summary
• Demo
Motivation
• The GPU (Graphics Processing Unit) is the key to high performance in graphics applications (games, flight simulations, virtual worlds, etc.)
• Mobile systems (e.g. cellphones, handheld devices) lack a suitable GPU
• GoK: an external GPU with a standard interface can significantly enhance the graphics performance of systems with limited computing resources
Project Goal
Develop a low-cost prototype which performs 3D animation and displays it on a 2D RGB screen.
• Standard interface for data input/output
• Provides real-time graphics processing to systems with limited computing resources
[Diagram labels: Host, USB, GoK, VGA]
Project Stages
• Software Design
  • Implementing algorithm in Matlab
  • Simulation and analysis
  • Adaptation of algorithm to hardware
• ASIC Design
  • Architectural design
  • Implementation in VHDL
  • Synthesis and layout
• System Design
  • Implementation of system blocks including SW and HW interfaces
  • System integration
  • System performance enhancement
Graphic Animation
• 3D Data Representation
  • Series of triangles
  • Each triangle is represented by:
    • 3 vertices
    • 3 RGB vectors
    • 1 normal vector
• Elementary operations:
  • Translation
  • Rotation
  • Scaling
[Figure labels: α, β, γ]
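The three elementary operations can be illustrated with 4x4 homogeneous matrices. This is a minimal sketch, not the project's Matlab or VHDL code: the function names are hypothetical, and only rotation about the z axis is shown as a representative of the three rotation angles.

```python
import math

def translation(tx, ty, tz):
    """4x4 homogeneous translation matrix."""
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

def scaling(sx, sy, sz):
    """4x4 homogeneous scaling matrix."""
    return [[sx, 0, 0, 0], [0, sy, 0, 0], [0, 0, sz, 0], [0, 0, 0, 1]]

def rotation_z(angle):
    """4x4 rotation about the z axis (angle in radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def matmul(a, b):
    """Product of two 4x4 matrices; b is applied first, then a."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def apply(m, point):
    """Apply a 4x4 matrix to a 3D point (w is assumed to be 1)."""
    v = (point[0], point[1], point[2], 1.0)
    out = [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]
    return tuple(out[:3])
```

Composing the matrices once per frame and applying the product to every vertex keeps the per-vertex cost at a single matrix multiplication.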
Rendering Algorithm Stages [Wimer]
1. Projection of triangles on the viewing plane
  • Composed of 2 stages:
    • Transformation from 3D to 2D (projection)
    • Transformation from real coordinates to screen coordinates
2. Determine potential triangle visibility
  • Hidden triangles are discarded on the basis of their normal direction
  • This detection reduces the processed data by 50%
Elementary transformations
• Four transformations are executed for every triangle:
  • Three matrix multiplications for vertex coordinates
  • One matrix multiplication for normal vector
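The two stages above can be sketched as follows. The slides do not state the projection type or the viewing convention, so this sketch assumes an orthographic projection and a viewer looking along the negative z axis; all function and parameter names are hypothetical.

```python
def is_potentially_visible(normal, view_dir=(0.0, 0.0, -1.0)):
    """Back-face test: keep a triangle whose normal opposes the view direction.

    Assumption: view_dir points from the viewer into the scene.
    """
    dot = sum(n * v for n, v in zip(normal, view_dir))
    return dot < 0.0

def project_orthographic(vertex):
    """3D to 2D: drop the depth coordinate (orthographic projection assumed)."""
    x, y, _z = vertex
    return (x, y)

def to_screen(point, world_min, world_max, width=160, height=120):
    """Map real (world) 2D coordinates to integer pixel coordinates.

    The screen y axis is assumed to grow downwards.
    """
    x, y = point
    sx = (x - world_min[0]) / (world_max[0] - world_min[0]) * (width - 1)
    sy = (world_max[1] - y) / (world_max[1] - world_min[1]) * (height - 1)
    return (round(sx), round(sy))
```

For a closed object roughly half of the triangles face away from the viewer, which is consistent with the 50% reduction quoted on the slide.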
Algorithm Details
I. Determine projected triangle's visibility
  • Scan all points and compare their depth with the depth of previously saved points
  • Scan in 3D space using inverse transformation
  • To increase efficiency:
    • Split triangles
    • Increase parallelism
II. Color of visible points
  • Compute pixel color from the RGB vector and the current lighting vector
  • Use the mathematical average for all the pixels inside a triangle rather than linear interpolation
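The color step can be sketched as one averaged color per triangle, scaled by a lighting term. The slides do not give the lighting formula, so the Lambertian (cosine) term below is an assumption, and `flat_triangle_color` is a hypothetical name.

```python
import math

def flat_triangle_color(vertex_colors, normal, light_dir):
    """One color for the whole triangle: average of the three vertex RGB
    vectors, scaled by an assumed Lambertian term between the normal and
    the lighting vector."""
    avg = [sum(c[i] for c in vertex_colors) / 3.0 for i in range(3)]
    n_len = math.sqrt(sum(n * n for n in normal))
    l_len = math.sqrt(sum(l * l for l in light_dir))
    dot = sum(n * l for n, l in zip(normal, light_dir))
    intensity = max(0.0, dot / (n_len * l_len))  # clamp light from behind
    return tuple(min(255, round(a * intensity)) for a in avg)
```

Because the color is computed once per triangle, the rasterization loop does not need per-pixel interpolation arithmetic.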
MATLAB Simulation
• Run time on Matlab-based software: 1 hour
• Run time on ARM-based processor: 16 seconds
[Figure: Matlab implementation of rendering algorithm [Wimer]]
System Overview
[Diagram: Concept and Prototype views. Labels: Host, GoK, USB, VGA]
GPU Architecture Design Principles
Design goal: maximize throughput
• Use parallel architecture to overcome bottlenecks
• Minimize expensive memory accesses
• Optimize accuracy for fast calculations
GPU Architecture
[Block diagram. Blocks shown: Triangles; Prefetch & Visibility Detection Unit; 3D Transformation Unit; Triangle pre-processor; FIFO task queue; Scheduler Unit; Rasterization 0, Rasterization 1, Rasterization 10; Z-Buffer Arbiter Snooping Cache; RGB Arbiter Snooping Cache; Z-Buffer; RGB Frame]
Transformation and Pre-processor
[Block diagram of the 3D Transformation Unit and the Triangle pre-processor feeding the FIFO. Blocks shown: Vertex / Normal Transform; Project Triangle; Sort coordinates according to y axis; Triangle slopes calculation; D calculation; -1 / C; Create 2 half triangles; RGB Color Set]
Note: Early elimination of invisible triangles reduces load by 50%!
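The pre-processing blocks named above can be sketched in software. This is an interpretation, not the VHDL design: it assumes "D" and "-1 / C" refer to the plane equation A·x + B·y + C·z + D = 0, from which depth follows as z = (-1/C)·(A·x + B·y + D), and that a triangle is cut at its middle vertex into two halves with a horizontal edge. Function names are hypothetical.

```python
def plane_coefficients(v0, v1, v2):
    """Coefficients (A, B, C, D) of the plane through three vertices."""
    ux, uy, uz = (v1[i] - v0[i] for i in range(3))
    wx, wy, wz = (v2[i] - v0[i] for i in range(3))
    a = uy * wz - uz * wy          # cross product u x w gives the normal
    b = uz * wx - ux * wz
    c = ux * wy - uy * wx
    d = -(a * v0[0] + b * v0[1] + c * v0[2])
    return a, b, c, d

def split_half_triangles(v0, v1, v2):
    """Sort vertices by y and cut at the middle vertex's y coordinate.

    Returns up to two triangles, each with one horizontal edge.
    """
    top, mid, bot = sorted((v0, v1, v2), key=lambda v: v[1])
    if top[1] == bot[1]:
        return []                  # degenerate: all vertices on one scanline
    t = (mid[1] - top[1]) / (bot[1] - top[1])
    cut = tuple(top[i] + t * (bot[i] - top[i]) for i in range(3))
    halves = []
    if mid[1] != top[1]:
        halves.append((top, mid, cut))
    if mid[1] != bot[1]:
        halves.append((mid, cut, bot))
    return halves
```

Precomputing -1/C once per triangle replaces a per-pixel division with a multiplication, which is one plausible reason for a dedicated block.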
FIFO Task Queue
[Diagram: Triangle pre-processor, FIFO task queue, Scheduler Unit, Rasterization 0, Rasterization 1, Rasterization 10]
• The input stream is stalled to prevent overflow by means of a backward communication protocol
• The backward communication propagates through to the Prefetch and Visibility Detection Unit
Scheduler Unit
• Target: maximize throughput
• Minimize idle time of rasterization units
• Immediately issue the next half triangle for processing upon completion of processing the previous one
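The queue and scheduler behaviour can be modelled in a few lines. This is a behavioural sketch under assumptions: the class and function names are hypothetical, the queue capacity is arbitrary, and a refused push stands in for the hardware stall signal.

```python
from collections import deque

class TaskQueue:
    """Bounded FIFO; a refused push models the stall sent upstream."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def try_push(self, task):
        if len(self.items) >= self.capacity:
            return False           # full: the producer must stall and retry
        self.items.append(task)
        return True

    def pop(self):
        return self.items.popleft() if self.items else None

def dispatch(queue, busy):
    """Issue queued half triangles to every idle rasterization unit.

    busy is a list of booleans, one per unit; it is updated in place.
    Returns the (unit index, task) pairs issued in this step.
    """
    issued = []
    for unit, is_busy in enumerate(busy):
        if is_busy:
            continue
        task = queue.pop()
        if task is None:
            break                  # nothing left to issue
        busy[unit] = True
        issued.append((unit, task))
    return issued
```

Calling `dispatch` whenever a unit finishes keeps idle time low, and each pop frees a slot so the stalled producer can resume.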
Rasterization Units
• For each point of each half triangle:
  • Calculate the new Z value
  • Read the stored Z value and compare it with the calculated one
  • Update both the Z-Buffer and RGB Frame Buffer accordingly
[Diagram: Scheduler Unit; Rasterization 0, Rasterization 1, Rasterization 10; Z-Buffer Arbiter Snooping Cache; RGB Arbiter Snooping Cache]
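The per-point loop above is a depth-buffer test. The sketch below assumes that a smaller Z means closer to the viewer and that buffers are indexed as `buffer[y][x]`; the function name is hypothetical.

```python
def rasterize_points(points, z_buffer, frame_buffer, color):
    """Depth-test each (x, y, z) point and update both buffers on a win.

    Assumption: smaller z is closer to the viewer.
    Returns the number of pixels actually written.
    """
    written = 0
    for x, y, z in points:
        if z < z_buffer[y][x]:         # compare with the stored depth
            z_buffer[y][x] = z         # update the Z-buffer
            frame_buffer[y][x] = color # update the RGB frame buffer
            written += 1
    return written
```

Initializing every Z-buffer entry to infinity at the start of a frame guarantees that the first point drawn at each pixel always passes the test.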
Multi Core Architecture Problem
• Multi-core architecture with shared memory must cope with:
  • Efficient management of multiple requests to the shared memory
  • Guaranteeing data coherency
• Solution: Arbiter Snooping Multi Cache
[Diagram: Rasterization 0, Rasterization 1, Rasterization 10; Z-Buffer; RGB Frame]
Arbiter Snooping Multi Cache (ASMC)
• Reduce memory access time: cache memory
• Simultaneous multiple memory access requests: arbiter for efficient memory access management
• Data coherency: add snooping mechanism to cache to guarantee data coherency
• Deadlock:
  • Using snooping mechanism
  • Using watchdog mechanism
[Diagram: Rasterization 0, Rasterization 1, Rasterization 10; Snooping Multi-Cache; Arbiter; Shared Memory]
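The coherency idea can be illustrated with a toy model. The slides do not specify the protocol, so this sketch assumes a write-through, write-invalidate scheme; the class name is hypothetical, and sequential Python calls stand in for the arbiter serializing memory requests.

```python
class SnoopingCache:
    """Toy per-unit cache that invalidates its copy when a peer writes."""

    def __init__(self, memory, bus):
        self.memory = memory       # shared memory, modelled as a dict
        self.lines = {}            # locally cached address -> value
        self.bus = bus             # list of all caches on the shared bus
        bus.append(self)

    def read(self, addr):
        if addr not in self.lines:             # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.memory[addr] = value              # write-through (assumed)
        self.lines[addr] = value
        for other in self.bus:                 # peers snoop the write
            if other is not self:
                other.lines.pop(addr, None)    # invalidate stale copies
```

Without the invalidation loop, a unit that cached a Z value earlier would keep comparing against a stale depth after another unit updated the same pixel.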
GPU ASIC Implementation
• Technology: 65 nm CMOS, 8LM
• Clock frequency: 300 MHz
• Core area: 2.25 mm²
• Power consumption: approx. 130 mW @ 300 MHz
• USB host can supply up to 400 mW
GoK System Requirements
Input: the data is sent by the host to the GoK in two stages:
• Initialization: a list of triangles is sent to the GoK
• Animation: a transformation for all triangles is sent to the GoK every 40 ms (25 FPS)
Output: real-time object animation at:
• 160x120 pixel resolution
• 120,000 triangles/sec
• 25 frames/sec
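The requirements above imply a per-frame budget, which the following arithmetic sketch makes explicit. The function name and dictionary keys are hypothetical; the input figures are the ones stated on the slide.

```python
def frame_budget(triangles_per_sec, fps, width, height):
    """Derive per-frame quantities from the stated throughput targets."""
    return {
        "frame_period_ms": 1000.0 / fps,
        "triangles_per_frame": triangles_per_sec // fps,
        "pixels_per_frame": width * height,
        "pixels_per_sec": width * height * fps,
    }

budget = frame_budget(120_000, 25, 160, 120)
```

With the slide's numbers this gives a 40 ms frame period, 4,800 triangles per frame and 19,200 pixels per frame.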
System Overview - SoPC
[Block diagram. Blocks shown: Host; USB; FPGA; GPU (GPU Processor, ASMC); System Controller; VGA Controller; USB Controller; Memory Controller; Communication Bus]
Summary
Challenges
• Matlab implementation and simulation for detailed investigation and evaluation of the algorithm
• VLSI design and implementation of an efficient architecture (with maximum parallelism) for the GPU algorithm
• Real-time embedded system design on FPGA
  • NIOS II, USB 1.1, DDR2, VGA, Avalon Bus, software drivers and code
  • GPU integration in the system
• Modification of the USB 1.1 driver for acceptable reliability of data transfer
• Modification of the standard VGA interface core to enable the 100 MHz GPU core to interface with the 50 MHz VGA unit
Main Contributions
• Enhancement of algorithm for increased performance
  • Early elimination of invisible triangles: 50% computation reduction
  • Splitting of triangles to reduce computation complexity and increase parallelism
  • Simplification of pixel color computation
• Pre-processing of the triangle data for fast rasterization computation
• Efficient scheduling of half triangles to rasterization units
• Design and implementation of arbiter snooping multi cache
  • Shared memory management, cache memory, data coherency
• Double memory buffer for continuous motion of animation
The Bottom Line
• Implementation of a "Graphics on Key" that enhances the graphics performance of low-power, low-cost gadgets
• The device performs the required computations and displays the animation on screen
• Project required specifications: 120,000 triangles/sec @ 160x120 resolution
• Achieved performance: 1,000,000 triangles/sec @ 640x480 resolution, approx. 25 mW @ 50 MHz