Scalable Vector Processors for Embedded Systems

Scalable Vector Processors for Embedded Systems • Kozyrakis, Patterson • Presentation by: Mohamed Abuobaida Mohamed • For COE502: Parallel Processing Architectures

Outline • Introduction • Instruction Set • Compiler • The Design • Evaluation • Clustered Processor • Conclusion

Introduction • Embedded processors require low power and low complexity • Performance and scalability are primary goals • Superscalar and VLIW processors exploit instruction-level parallelism (ILP) • Superscalar requires complex hardware to detect dependences • VLIW requires a very thorough compiler • Scaling either approach is difficult

Introduction • Multimedia and telecommunications applications have data-level parallelism (DLP) • Revisit the vector architecture used in supercomputers • Introduce Vector IRAM (VIRAM)

Instruction Set • Coprocessor extension to MIPS • Vector Register File (VRF) • 32 Registers • Integer and floating point • Flag register • Vector operations • Arithmetic: integer and floating point • Logical operations • Other functions e.g. population count
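
As a rough illustration of the register model on this slide, the Python sketch below models a vector add that executes under a flag-register mask, plus a population count. The names `masked_vadd` and `popcount` are hypothetical and are not VIRAM mnemonics. Plain lists stand in for vector registers, and the choice that masked-off elements keep their old value is an assumption.

```python
def masked_vadd(dst, a, b, flags):
    """Element-wise add under a mask.

    Assumption: elements whose flag bit is 0 keep the old destination value.
    """
    return [x + y if f else d for d, x, y, f in zip(dst, a, b, flags)]


def popcount(value):
    """Number of set bits in a non-negative integer."""
    return bin(value).count("1")


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
flags = [1, 0, 1, 0]                      # stand-in for one flag register
result = masked_vadd([0, 0, 0, 0], a, b, flags)
bits = popcount(0b1011)
```

Here only elements 0 and 2 are updated, which is how a flag register lets a vectorized loop contain a conditional.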

Instruction Set • Supports three common access patterns and virtual addressing • Elements can be 64, 32 or 16 bits wide • The 64-bit datapath can execute multiple narrow elements at once • Element permutation instructions are limited to those needed for dot products and fast Fourier transforms • Supports speculative execution using the flag register
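
The slide does not name the three access patterns. In vector architectures they are usually unit-stride, strided and indexed (gather/scatter), and the sketch below assumes that reading. The function names are hypothetical, `mem` is a flat list standing in for memory, and `vl` is the vector length.

```python
def load_unit_stride(mem, base, vl):
    """Consecutive elements starting at base."""
    return [mem[base + i] for i in range(vl)]


def load_strided(mem, base, stride, vl):
    """Every stride-th element starting at base."""
    return [mem[base + i * stride] for i in range(vl)]


def load_indexed(mem, base, indices):
    """Gather: one element per entry of an index vector."""
    return [mem[base + idx] for idx in indices]


mem = list(range(100, 132))               # 32 words holding 100..131
unit = load_unit_stride(mem, 4, 4)
strided = load_strided(mem, 0, 8, 4)
gathered = load_indexed(mem, 0, [3, 17, 5, 30])
```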

The Compiler • Based on the PDGCS compilation system for Cray supercomputers • Extensive vectorization techniques: • Outer-loop vectorization • Handling partially vectorizable constructs • Does not require special functions or custom libraries • Requires pragmas for irregular scatter/gather patterns
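
To make outer-loop vectorization concrete, here is a minimal sketch using a matrix-vector product. It is an illustrative example and not taken from the presentation. The function name is hypothetical, and each list comprehension stands in for one vector instruction. The idea is that when the inner loop is short, the compiler vectorizes across the longer outer loop (the rows) instead.

```python
def matvec_outer_vectorized(matrix, x):
    """Compute y = matrix @ x with the row (outer) loop as the vector dimension."""
    rows = len(matrix)
    y = [0] * rows                        # one accumulator element per row
    for j in range(len(x)):               # short inner loop stays scalar
        # Strided load: column j of a row-major matrix.
        column = [matrix[i][j] for i in range(rows)]
        # One vector multiply-add across all rows.
        y = [acc + c * x[j] for acc, c in zip(y, column)]
    return y


y = matvec_outer_vectorized([[1, 2], [3, 4], [5, 6]], [10, 1])
```

The vector length here equals the number of rows, so long vectors are available even though each row has only two elements.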

The Compiler • Selects operation and element width • Recognizes reductions
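
A reduction such as a dot product can be sketched as one vector multiply followed by repeated halving, where each step adds the upper half of the vector to the lower half. This is one common way to lower a reduction and is an assumption here, not a description of the actual compiler output. The function name is hypothetical and the input length is assumed to be a power of two.

```python
def dot_product(a, b):
    """Dot product via vector multiply plus log2(n) halving steps.

    Assumption: len(a) == len(b) and the length is a power of two.
    """
    partial = [x * y for x, y in zip(a, b)]   # one vector multiply
    n = len(partial)
    while n > 1:
        half = n // 2
        # One permutation (move upper half down) plus one vector add.
        partial = [partial[i] + partial[i + half] for i in range(half)]
        n = half
    return partial[0]


total = dot_product([1, 2, 3, 4], [5, 6, 7, 8])
```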

The Design • Coprocessor to a 64-bit MIPS core • VRF capacity is 8 KB • Each vector register can hold 32 64-bit, 64 32-bit or 128 16-bit elements • A lane has two 64-bit ALUs and a vector load/store unit • On-chip 13 MB DRAM organized as 8 banks • The scalar core is a single-issue, in-order MIPS
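
The element counts follow directly from the 8 KB capacity and the 32 vector registers given on the instruction set slide. A short worked calculation (the variable names are only for illustration):

```python
VRF_BYTES = 8 * 1024                      # 8 KB vector register file
NUM_REGISTERS = 32                        # from the instruction set slide

bytes_per_register = VRF_BYTES // NUM_REGISTERS
bits_per_register = bytes_per_register * 8

# Elements per register for each supported element width (in bits).
elements = {width: bits_per_register // width for width in (64, 32, 16)}
```

Each register holds 256 bytes, so halving the element width doubles the number of elements per register.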

The Design • Operates at 200 MHz with 2 W power consumption

Evaluation

Clustered Processor • VIRAM has a complex VRF • Approx. 3 ports per FU • Proposed: replace the centralized VRF with a clustered VRF • A cluster has a datapath for one FU and a few vector registers • Each cluster has access to an intercluster network • Area, power and latency per cluster are constant

Clustered Processor • Renaming is used to exploit the clustered configuration • It is done using a renaming table that identifies where each source and destination register resides • It can be used to implement more than 32 registers • Clustering improves scaling
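
A minimal sketch of how such a renaming table might work, under stated assumptions: the class `RenameTable`, its methods and the free-list policy are all hypothetical, and the old mapping is released immediately on overwrite, which real hardware would delay until pending readers finish. Each architectural register maps to a (cluster, local register) pair, and a source that lives in another cluster is flagged as needing an intercluster transfer.

```python
class RenameTable:
    """Maps architectural vector registers to (cluster, local register)."""

    def __init__(self, num_clusters, regs_per_cluster):
        self.free = {c: list(range(regs_per_cluster)) for c in range(num_clusters)}
        self.mapping = {}                 # architectural reg -> (cluster, local reg)

    def rename_dest(self, arch_reg, cluster):
        """Allocate a local register in the cluster that executes the write."""
        old = self.mapping.get(arch_reg)
        local = self.free[cluster].pop(0)  # IndexError if the cluster is full
        self.mapping[arch_reg] = (cluster, local)
        if old is not None:
            # Simplification: free the previous copy right away.
            self.free[old[0]].append(old[1])
        return self.mapping[arch_reg]

    def lookup_source(self, arch_reg, cluster):
        """Return (cluster, local reg, needs_intercluster_transfer)."""
        src_cluster, local = self.mapping[arch_reg]
        return src_cluster, local, src_cluster != cluster


table = RenameTable(num_clusters=4, regs_per_cluster=8)
first = table.rename_dest(1, cluster=0)
second = table.rename_dest(2, cluster=1)
remote = table.lookup_source(1, cluster=1)
local = table.lookup_source(2, cluster=1)
moved = table.rename_dest(1, cluster=2)
```

Because the table decouples architectural names from physical locations, the total number of physical registers across clusters need not equal the number of architectural registers.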

Clustered Processor: Evaluation

Conclusion • Designed for embedded systems • Area, power and performance • Exploits DLP • Instruction set and VRF • Vectorizing compiler • Evaluation • Clustered configuration
