Brook+ Matthew Caylor
Introduction
• AMD Stream Computing is a first step in harnessing the tremendous processing power of the GPU (Stream Processor) for high-performance, data-parallel computing in a wide range of business, scientific, and consumer applications.
• AMD’s Stream Computing platform lets organizations and individuals integrate accelerated computing into existing IT infrastructure, enabling improved decision making, accelerated workflows, and reduced time-to-discovery.
Introduction
• Brook+ is a special-purpose language designed to operate on top of AMD CAL.
• Brook is an extension of standard ANSI C.
• It is designed to incorporate the ideas of data-parallel computing and arithmetic intensity into a familiar, efficient language.
• Its general computational model is referred to as streaming.
• Streaming provides two main benefits over conventional languages:
  • Data parallelism: the programmer specifies how to perform the same operations in parallel on different data.
  • Arithmetic intensity: programmers are encouraged to specify operations on data that minimize global communication and maximize localized computation.
Introduction
• AMD, along with the open source community, is working to mask the GPU's graphics programming heritage.
• The open source Brook compiler plus AMD enhancements are geared directly at non-graphics stream computing.
• CAL provides high-level language access to the various parts of the GPU as needed.
CAL
• The AMD Compute Abstraction Layer (CAL) is a device-driver library that provides a forward-compatible interface to AMD’s Stream Processors (devices).
• CAL allows software developers to interact with the processing cores at the lowest level, if needed, for optimized performance, while maintaining forward compatibility.
CAL
CAL provides the following main functions:
• Device-specific code generation
• Device management
• Resource management
• Kernel loading and execution
• Multi-device support
• Interoperability with 3D graphics APIs
CAL
• The CAL SDK includes a small set of C routines and data types that allow higher-level software tools to directly interact with and control hardware memory buffers (device-level streams) and GPU programs (device-level kernels).
• The CAL runtime accepts kernels written in AMD IL and generates optimized code for the target architecture.
CAL
• A typical CAL program has two parts:
  • The application: a program running on the host CPU (written in C or C++).
  • The kernel: a program running on the stream processor (written in CAL IL, for example).
• A CAL system comprises one or more stream processors connected to one or more CPUs by a high-speed bus.
• The CPU runs CAL and controls the stream processor by sending commands through the CAL API.
• The stream processor runs the kernel specified by the application. The stream processor device driver program (CAL) runs on the host CPU.
Streams
• Streams provide connectivity between processing stages.
• A stream is a reference to an N-dimensional array of identically typed primitive elements.
• It has more restricted access semantics than conventional arrays.
• These restrictions permit optimization of both storage requirements and computation locality, providing higher performance for the algorithms that this model can accommodate.
Streams
• The syntax for declaring a stream is similar to other C variable or type declarations.
• Angle brackets mark the type or variable as a stream and delineate the stream dimensions.
• For example:
  • float a<>; A one-dimensional stream of unspecified length.
  • float b<10>; A one-dimensional stream of length 10.
  • float c<10,10>; A two-dimensional stream of size ten by ten.
  • float d<,10>; A two-dimensional stream of unspecified length by width 10.
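The shape information carried by these declarations can be modeled on the host in plain C. The following is only a sketch: `stream_shape`, `STREAM_MAX_RANK`, and `stream_shape_count` are hypothetical names invented for illustration and are not part of Brook+ or CAL. An extent of 0 stands in for an unspecified dimension such as the first one in `float d<,10>`.

```c
#include <stddef.h>

/* Hypothetical host-side description of a stream's shape (not a Brook+ type). */
#define STREAM_MAX_RANK 4

typedef struct {
    size_t rank;                   /* number of dimensions, e.g. 2 for float c<10,10> */
    size_t dims[STREAM_MAX_RANK];  /* extent of each dimension; 0 means unspecified   */
} stream_shape;

/* Total number of elements, or 0 if the rank is invalid or any
 * dimension is still unspecified. */
size_t stream_shape_count(const stream_shape *s)
{
    size_t count = 1;
    if (s->rank == 0 || s->rank > STREAM_MAX_RANK) {
        return 0;
    }
    for (size_t i = 0; i < s->rank; ++i) {
        if (s->dims[i] == 0) {
            return 0;
        }
        count *= s->dims[i];
    }
    return count;
}
```

With this model, `float b<10>` corresponds to `{1, {10}}` and `float c<10,10>` to `{2, {10, 10}}`.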
Stream Operators
• Streams are accessed by use of a stream operator.
• A stream operator looks like a function call.
• When a stream is read, the data is copied twice:
  • first, from host (CPU) memory to PCIe memory,
  • then to local (stream processor) memory.
• The commands are:
  • streamRead(destinationStream, sourceArray)
  • streamWrite(sourceStream, destinationArray)
• The stream and the array must match in number of dimensions, size, and element type; otherwise, the behavior is undefined.
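The direction of each operator is easy to mix up, so here is a CPU-only sketch of the two copies in plain C. It is not the Brook+ runtime: `stream1d`, `stream_read_ref`, and `stream_write_ref` are hypothetical names, the PCIe staging copy is left out, and the size check returns an error code where Brook+ leaves a mismatch undefined.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a 1-D float stream held in "device" memory. */
typedef struct {
    float *data;    /* storage owned by the caller */
    size_t length;  /* number of elements          */
} stream1d;

/* Mirrors streamRead(destinationStream, sourceArray): host array -> stream.
 * Returns 0 on success, -1 if the sizes differ. */
int stream_read_ref(stream1d *dst, const float *src, size_t src_len)
{
    if (dst->length != src_len) {
        return -1;
    }
    memcpy(dst->data, src, src_len * sizeof *src);
    return 0;
}

/* Mirrors streamWrite(sourceStream, destinationArray): stream -> host array.
 * Returns 0 on success, -1 if the sizes differ. */
int stream_write_ref(const stream1d *src, float *dst, size_t dst_len)
{
    if (src->length != dst_len) {
        return -1;
    }
    memcpy(dst, src->data, dst_len * sizeof *dst);
    return 0;
}
```

A round trip through `stream_read_ref` and `stream_write_ref` should return the original array unchanged, which is a handy sanity check before any kernel is involved.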
Kernels
• Kernels are where the work of stream processing takes place.
• There are two types of kernels:
  • Basic
  • Reduction
• A basic kernel looks like the following:
    kernel void mad(float a<>, float b<>, float c, out float d<>) { d = a * b + c; }
• Its serial C counterpart, which must loop over every element explicitly, has the signature:
    void mad_slow(float a[], float b[], float c, float d[]);
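To make the contrast with serial C concrete, the following sketch shows what a CPU reference for the mad kernel could look like. The name `mad_ref` and the explicit length parameter `n` are assumptions added for illustration; in Brook+ the element count comes from the stream itself and the loop is implicit.

```c
#include <stddef.h>

/* Serial CPU reference for the mad kernel: d[i] = a[i] * b[i] + c.
 * The explicit length n is an assumption; a Brook+ kernel has no such
 * parameter because the runtime applies the body to every stream element. */
void mad_ref(const float *a, const float *b, float c, float *d, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        /* Each iteration touches only element i, so iterations are
         * independent and could run in parallel. */
        d[i] = a[i] * b[i] + c;
    }
}
```

The independence of the loop iterations is exactly what the kernel form expresses: the body describes one element, and the hardware decides how many elements to process at once.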
Kernels
• Reductions are kernels that decrease the dimensionality of a stream by folding along one axis using an associative and commutative binary operation.
• Because the operation is associative and commutative, the result is independent of evaluation order.
• An example of a reduction kernel:
    reduce void sum(float a<>, reduce float b) { b += a; }
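The order-independence requirement can be illustrated on the CPU by folding the same data in two different orders. This is a sketch in plain C with hypothetical names (`sum_forward`, `sum_pairwise`); it does not reflect how the Brook+ runtime actually schedules a reduction.

```c
#include <stddef.h>

/* Left-to-right fold, the order a serial loop would use. */
float sum_forward(const float *a, size_t n)
{
    float b = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        b += a[i];
    }
    return b;
}

/* Pairwise (tree-shaped) fold, closer to how a parallel device might
 * combine partial results. */
float sum_pairwise(const float *a, size_t n)
{
    if (n == 0) {
        return 0.0f;
    }
    if (n == 1) {
        return a[0];
    }
    size_t half = n / 2;
    return sum_pairwise(a, half) + sum_pairwise(a + half, n - half);
}
```

For values that are exactly representable, both orders give the same sum. Floating-point addition is not strictly associative, so with arbitrary data the two results can differ in the last bits.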
Kernels
• The stream a is folded and reduced to b.
• The output is a single value.
• In a reduction kernel, the reduction variable is:
  • specified as part of the kernel's parameter list
  • operated on using any of the C assignment operators that satisfy the associativity and commutativity requirements.
• Another example of a reduction kernel:
    reduce void max_reduce(double a, reduce double b) { if (a > b) b = a; }
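A CPU equivalent of the max reduction might look like the sketch below. The name `max_reduce_ref` is hypothetical, and seeding the result with the first element is an assumption; the slides do not say how Brook+ initializes a reduction variable.

```c
#include <stddef.h>

/* Serial reference for max_reduce. The caller must pass n >= 1.
 * Seeding b with a[0] is an assumption made for this sketch. */
double max_reduce_ref(const double *a, size_t n)
{
    double b = a[0];
    for (size_t i = 1; i < n; ++i) {
        if (a[i] > b) {
            b = a[i];  /* same update rule as the kernel body */
        }
    }
    return b;
}
```

Taking the maximum is associative and commutative, so the elements can be visited in any order without changing the result.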
Kernels
• A partial reduction is possible if the target stream has the same number of dimensions as the source stream.
• The size of each source dimension must be an integer multiple of the corresponding dimension of the target.
• For example, given float s<100,100> and float t<100,50>, the call sum(s,t) partially reduces s so that its size matches t.
• Conversely, given float s<100,100> and float t<100,200>, the call sum(s,t) expands s to match the size of t.
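The integer-multiple rule can be made concrete with a CPU sketch of a two-dimensional partial sum. Several things here are assumptions rather than documented Brook+ behavior: the name `partial_sum_ref`, the row-major layout, and the choice that each target element folds a contiguous block of neighboring source elements.

```c
#include <stddef.h>

/* Sums blocks of a src_rows x src_cols row-major array into a
 * dst_rows x dst_cols array. Returns -1 if a source extent is not an
 * integer multiple of the matching target extent, 0 otherwise.
 * Folding contiguous blocks is an assumption of this sketch. */
int partial_sum_ref(const float *src, size_t src_rows, size_t src_cols,
                    float *dst, size_t dst_rows, size_t dst_cols)
{
    if (dst_rows == 0 || dst_cols == 0) {
        return -1;
    }
    if (src_rows % dst_rows != 0 || src_cols % dst_cols != 0) {
        return -1;
    }
    size_t block_rows = src_rows / dst_rows;
    size_t block_cols = src_cols / dst_cols;

    for (size_t r = 0; r < dst_rows; ++r) {
        for (size_t c = 0; c < dst_cols; ++c) {
            float acc = 0.0f;
            for (size_t i = 0; i < block_rows; ++i) {
                for (size_t j = 0; j < block_cols; ++j) {
                    size_t row = r * block_rows + i;
                    size_t col = c * block_cols + j;
                    acc += src[row * src_cols + col];
                }
            }
            dst[r * dst_cols + c] = acc;
        }
    }
    return 0;
}
```

Reducing a 2 by 4 source to a 2 by 2 target, for instance, folds each pair of horizontal neighbors into one output element, which is a small-scale analogue of reducing s<100,100> to t<100,50>.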
Kernels
• Kernels can call other functions defined in the same .br file or in any file it includes.
• However, there are restrictions:
  • A top-level kernel must have a return type of void to be callable from host code.
  • Subkernels can return data of any non-stream type.
  • A subkernel can also be bound to streams propagated from its parent kernel.
  • Subkernels are logically expanded inline, so recursion is not permitted.
  • Kernels cannot call stream operators.
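The relationship between a top-level kernel and a subkernel resembles a C function calling a small inline helper. The sketch below is a plain C analogy with hypothetical names (`lerp_ref`, `blend_ref`); it only mirrors the rules listed above: the helper returns a non-stream value, the outer function returns void, and nothing recurses.

```c
#include <stddef.h>

/* Stand-in for a subkernel: returns a plain float and never recurses,
 * so a compiler can expand it inline at the call site. */
static inline float lerp_ref(float x, float y, float t)
{
    return x + t * (y - x);
}

/* Stand-in for a top-level kernel: returns void and writes its result
 * through an output array. The length n is an assumption of the sketch. */
void blend_ref(const float *a, const float *b, float t, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = lerp_ref(a[i], b[i], t);
    }
}
```

Because the helper is expanded inline, a call chain of this kind must have a fixed depth, which is why recursion is ruled out.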
Kernels
• Kernels can use both stream and non-stream parameters as inputs.
• Generally, only streams can be used as outputs.
• Within a kernel definition, the following restrictions apply:
  • The goto, volatile, and static keywords are prohibited.
  • Pointers are not supported.
  • Recursion is not allowed.
• Any pointers passed into Brook+ code are required not to alias each other.
Conclusion
• Brook+ is almost as readable and writable as C and C++.
• It is an environment that a C or C++ programmer can work in.
• The programmer does not have to worry about the CAL layer that Brook+ is built on.
• Some abstract concepts around kernels make it slightly harder to write a function for the GPU than for the CPU, but that is the trade-off of this method of computation.
• Since Brook+ is still very new, it is not as reliable as C or C++.
• Support is not yet readily available, but a small community is growing around the use of ATI graphics cards for desktop supercomputing.
Conclusion
Brook+ is simply a language that acts as an interface between the programmer and the graphics device.