Scalable Vector Processors for Embedded Systems Kozyrakis, Patterson Presentation by: Mohamed Abuobaida Mohamed For COE502 : Parallel Processing Architectures
Outline • Introduction • Instruction Set • Compiler • The Design • Evaluation • Clustered Processor • Conclusion
Introduction • Embedded processors require low power and low complexity • Performance and scalability are primary goals • Superscalar and VLIW designs exploit instruction-level parallelism (ILP) • Superscalar requires complex hardware to detect dependences • VLIW requires a very thorough compiler • Scaling either approach is difficult
Introduction • Multimedia and telecommunications workloads have data-level parallelism (DLP) • Revisit the vector architectures used in supercomputers • Introduce Vector IRAM (VIRAM)
Instruction Set • Coprocessor extension to MIPS • Vector Register File (VRF) • 32 Registers • Integer and floating point • Flag register • Vector operations • Arithmetic: integer and floating point • Logical operations • Other functions e.g. population count
Instruction Set • Supports three common memory access patterns and virtual addressing • Elements can be 64, 32, or 16 bits wide • The 64-bit datapath can execute multiple narrow elements in parallel • Element permutation is limited to the forms needed for dot products and fast Fourier transforms • Supports speculative execution using the flag register
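The slide does not name the three access patterns. A minimal sketch, assuming they are the usual vector ones (unit-stride, strided, and indexed), modeled in Python over a flat list standing in for memory. The function names and the `mem` layout are illustrative only and are not VIRAM instructions.

```python
# Illustrative model of three vector memory access patterns.
# Assumption: the patterns are unit-stride, strided, and indexed (gather).
# All names here are hypothetical; none are taken from the VIRAM instruction set.

def load_unit_stride(mem, base, vl):
    """Load vl consecutive elements starting at base."""
    return [mem[base + i] for i in range(vl)]

def load_strided(mem, base, stride, vl):
    """Load vl elements separated by a constant stride."""
    return [mem[base + i * stride] for i in range(vl)]

def load_indexed(mem, base, indices):
    """Gather: load one element per offset held in an index vector."""
    return [mem[base + j] for j in indices]

mem = list(range(100, 132))               # 32 words holding 100..131
unit = load_unit_stride(mem, 4, 4)        # consecutive words
strided = load_strided(mem, 0, 8, 4)      # every 8th word
gathered = load_indexed(mem, 0, [3, 1, 7])  # arbitrary offsets
```

Unit-stride loads suit contiguous arrays, strided loads suit columns of a row-major matrix, and indexed loads cover irregular, pointer-like access.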
The Compiler • Based on the PDGCS compilation system for Cray supercomputers • Extensive vectorization techniques: • Outer-loop vectorization • Handling partially vectorizable constructs • Requires neither special functions nor custom libraries • Requires pragmas for irregular scatter/gather patterns
The Compiler • Selects the operation and element width • Recognizes reductions
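To illustrate what recognizing a reduction involves, the sketch below contrasts a scalar dot-product loop with one way a vectorizer might restructure it: element-wise partial sums over strips of the input, followed by a pairwise combine. This is an assumption-laden illustration, not the PDGCS compiler's actual transformation; `mvl` (maximum vector length) is a hypothetical parameter and must be a power of two here.

```python
# Sketch of a reduction (dot product) before and after vectorization.
# Assumption: the vectorized form keeps mvl partial sums and combines them
# by repeated halving; the real compiler's code generation may differ.

def dot_scalar(a, b):
    """Plain loop with a loop-carried dependence on total."""
    total = 0
    for x, y in zip(a, b):
        total += x * y
    return total

def dot_vectorized(a, b, mvl=8):
    """Strip-mined form: each strip is one element-wise multiply-add."""
    partial = [0] * mvl
    for start in range(0, len(a), mvl):
        vl = min(mvl, len(a) - start)     # last strip may be shorter
        for i in range(vl):               # one vector operation in hardware
            partial[i] += a[start + i] * b[start + i]
    width = mvl
    while width > 1:                      # fold the upper half onto the lower half
        width //= 2
        for i in range(width):
            partial[i] += partial[i + width]
    return partial[0]

a = list(range(1, 21))
b = list(range(20, 0, -1))
```

The dependence on `total` in the scalar loop is what blocks naive vectorization. Keeping independent partial sums removes it, at the cost of a short combine step at the end.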
The Design • Coprocessor to a 64-bit MIPS core • VRF capacity is 8 KB • Each vector register can hold 32 64-bit, 64 32-bit, or 128 16-bit elements • Each lane has two 64-bit ALUs and a vector load/store unit • On-chip 13 MB DRAM organized as 8 banks • The scalar core is a single-issue, in-order MIPS
The Design • Operates at 200 MHz with 2 W power consumption
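The element counts follow from the register file size. A small check, using the 8 KB capacity from this slide and the 32 registers from the Instruction Set slide (variable names are illustrative):

```python
# Derive elements per vector register from the figures on the slides.
VRF_BYTES = 8 * 1024      # 8 KB vector register file
NUM_REGS = 32             # 32 vector registers

bits_per_reg = VRF_BYTES * 8 // NUM_REGS                   # bits in one register
elements = {w: bits_per_reg // w for w in (64, 32, 16)}    # width -> element count

for width, count in elements.items():
    print(f"{width}-bit elements per register: {count}")
```

Each register holds 2048 bits, so halving the element width doubles the number of elements per register.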
Clustered Processor • VIRAM has a complex VRF • Approx. 3 ports per FU • Proposed: replace the centralized VRF with a clustered VRF • A cluster has a datapath for one FU and a few vector registers • Each cluster has access to the intercluster network • Area, power and latency per cluster are constant
Clustered Processor • Renaming is used to take advantage of the clustered configuration • It is done with a renaming table that identifies the source and destination • It can be used to implement more than 32 registers • Clustering improves scaling
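A minimal sketch of how such a renaming table could work, assuming it maps each architectural vector register to a (cluster, local register) pair. The class, its method names, and the immediate release of the old mapping are simplifications made for illustration; they are not taken from the paper.

```python
# Hypothetical renaming table for a clustered vector register file.
# Assumption: each architectural register maps to (cluster, local register).
# Simplification: the old physical register is freed immediately, whereas
# real hardware would wait until pending readers have finished.

class RenameTable:
    def __init__(self, num_clusters, regs_per_cluster):
        self.free = {c: list(range(regs_per_cluster)) for c in range(num_clusters)}
        self.mapping = {}                 # architectural reg -> (cluster, local reg)

    def source(self, arch_reg):
        """Return where a source operand currently lives."""
        return self.mapping[arch_reg]

    def destination(self, arch_reg, cluster):
        """Allocate a local register in the cluster that produces the result."""
        if not self.free[cluster]:
            raise RuntimeError(f"no free vector register in cluster {cluster}")
        local = self.free[cluster].pop(0)
        old = self.mapping.get(arch_reg)
        if old is not None:
            self.free[old[0]].append(old[1])   # release the previous copy
        self.mapping[arch_reg] = (cluster, local)
        return cluster, local

table = RenameTable(num_clusters=4, regs_per_cluster=16)  # 64 physical registers
table.destination(1, cluster=0)           # v1 is written by the FU in cluster 0
table.destination(2, cluster=2)           # v2 is written by the FU in cluster 2
# Operands in different clusters imply a transfer over the intercluster network.
needs_transfer = table.source(1)[0] != table.source(2)[0]
```

With 4 clusters of 16 registers, the table manages 64 physical registers behind 32 architectural names, which is one way to read the "more than 32 registers" point.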
Conclusion • Designed for embedded systems • Area, power and performance • Exploits DLP • Instruction set VRF • Vectorizing compiler • Evaluation • Clustered configuration