
Andes Embedded Processor AndesCore TM N1213-S


aimee-cobb

Presentation Transcript


  1. Andes Embedded Processor AndesCore™ N1213-S www.andestech.com

  2. Agenda • Computer architecture • AndesCore™ • Pipeline • Cache • MMU • DMA • BIU • Interruption • AICE

  3. Computer architecture taxonomy • von Neumann architecture

  4. Computer architecture taxonomy (1/3) • von Neumann architecture • Features: • Execution in multiple cycles • Serial fetch of instructions & data • Single memory structure • Data and program can get mixed • Data and instructions are the same size • Examples, von Neumann: PCs (Intel 80x86/Pentium), Motorola 68000, Motorola 68xx microcontroller families

  5. Computer architecture taxonomy (2/3) • Harvard architecture • [Diagram: the CPU (with its PC) connects to a data memory and to a separate program memory, each over its own address and data lines]

  6. Computer architecture taxonomy (3/3) • Harvard architecture • Features: • Execution in 1 cycle • Parallel fetch of instructions & data • More complex H/W • Instructions and data always separate • Different code/data path widths • Examples, Harvard: 8051, Microchip PIC families, Atmel AVR, AndesCore

  7. Architectures: CISC vs. RISC (1/2) • CISC - Complex Instruction Set Computers • Emphasis on hardware • Includes multi-clock complex instructions • Memory-to-memory • Sophisticated arithmetic (multiply, divide, trigonometry etc.). • Special instructions are added to optimize performance with particular compilers.

  8. Architectures: CISC vs. RISC (2/2) • RISC - Reduced Instruction Set Computers • A very small set of primitive instructions • Fixed instruction format • Emphasis on software • All instructions execute in one cycle (Fast!). • Register to register (except Load/Store instructions) • Pipeline architecture

  9. Single-, Dual-, Multi-, Many- Cores • Single-core: • Most popular today. • Dual-core, multi-core, many-core: • Forms of multiprocessors in a single chip • Small-scale multiprocessors (2-4 cores): • Utilize task-level parallelism. • Task example: audio decode, video decode, display control, network packet handling. • Large-scale multiprocessors (>32 cores): • nVidia’s graphics chip: >128 core • Sun’s server chips: 64 threads

  10. AndesCoreTM

  11. AndesCore™ N1213-S • CPU Core • 32-bit CPU • Single issue with 8-stage pipeline • AndeStar™ ISA with 16-/32-bit intermixable instructions to reduce code size • Dynamic branch prediction to reduce branch penalties • 32/64/128/256-entry BTB • Configurability for customers • Configuration options for power, performance and area requirements

  12. AndesCore™ N1213-S • MMU • Fully-associative iTLB/dTLB: 4 or 8 entries • 4-way set-associative main TLB: 32/64/128 entries • Two groups of page sizes supported: (4K, 1M) and (8K, 1M) • Locking support for TLB • I & D cache • Virtual index and physical tag (for faster context switching) • Cache size: 8KB/16KB/32KB/64KB • Cache line size: 16B/32B • 2/4-way set associative • I-cache locking support

  13. AndesCore™ N1213-S • I & D local memory • Wide size range for internal/external local memory: 4KB to 1024KB • Fixed access latencies for internal local memory • Double buffer mode for D local memory • Optional external local memory interface • Bus • Synchronous/asynchronous AHB • 1 or 2 port configuration • Synchronous HSMP • AXI-like • 1 or 2 port configuration

  14. AndesCore™ N1213-S • For performance • Improved memory accesses: • 1D/2D DMA, load/store multiple • Efficient synchronization without locking the whole bus • Load lock, store conditional instructions • Vectored interrupt to improve real-time performance • 6 interrupt signals • MMU • Optional HW page table walker • TLB management instructions • For flexibility • Memory-mapped IO space • PC-relative jumps for position independent code • JTAG-based debug support • Optional embedded program trace interface • Performance monitors for performance tuning • Bi-endian modes to support flexible data input

  15. Pipeline

  16. AndesCore 8-stage pipeline

  17. Instruction Fetch Stage • F1 – Instruction Fetch First • Instruction Tag/Data Arrays • ITLB Address Translation • Branch Target Buffer Prediction • F2 – Instruction Fetch Second • Instruction Cache Hit Detection • Cache Way Selection • Instruction Alignment • [Pipeline diagram, stage labels: IF1, IF2, ID, RF, AG, DA1, DA2, WB, EX, MAC1, MAC2]

  18. Instruction Issue Stage • I1 – Instruction Issue First / Instruction Decode • 32/16-Bit Instruction Decode • Return Address Stack Prediction • I2 – Instruction Issue Second / Register File Access • Instruction Issue Logic • Register File Access

  19. Execution Stage • E1 – Instruction Execute First / Address Generation / MAC First • Data Access Address Generation • Multiply Operation (if MAC is present) • E2 – Instruction Execute Second / Data Access First / MAC Second / ALU Execute • ALU • Branch/Jump/Return Resolution • Data Tag/Data Arrays • DTLB Address Translation • Accumulation Operation (if MAC is present) • E3 – Instruction Execute Third / Data Access Second • Data Cache Hit Detection • Cache Way Selection • Data Alignment

  20. Write Back Stage • E4 – Instruction Execute Fourth / Write Back • Interruption Resolution • Instruction Retire • Register File Write Back

  21. Branch Prediction Overview • Why is branch prediction required? • A deep pipeline is required for high speed • Why dynamic branch prediction? • Static branch prediction • Dynamic branch prediction

  22. Branch Prediction Unit • Branch Target Buffer (BTB) • 128 entries of 2-bit saturating counters • 128 entries, 32-bit predicted PC and 26-bit address tag • Return Address Stack (RAS) • Four entries • BTB and RAS updated by committing branches/jumps
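
The 2-bit saturating counters named on this slide can be sketched as below. This is a minimal illustration, not the N1213-S implementation: the class names, the initial counter state, and the way the PC is hashed into the 128 entries are all assumptions made for the example.

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=1):
        self.state = state  # assumed initial state: weakly not taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(self.state + 1, 3)
        else:
            self.state = max(self.state - 1, 0)


class BranchPredictor:
    """Table of counters indexed by PC (hypothetical indexing scheme)."""

    def __init__(self, entries=128):
        self.entries = entries
        self.counters = [TwoBitCounter() for _ in range(entries)]

    def _index(self, pc):
        # Assumption: drop bit 0 (16-bit instruction alignment), then wrap.
        return (pc >> 1) % self.entries

    def predict(self, pc):
        return self.counters[self._index(pc)].predict()

    def update(self, pc, taken):
        self.counters[self._index(pc)].update(taken)


def count_mispredictions(outcomes, pc=0x1000):
    """Replay a sequence of taken/not-taken outcomes for one branch."""
    predictor = BranchPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)  # updated when the branch commits
    return misses
```

For a loop branch that is taken nine times and then falls through, this sketch mispredicts only the first iteration and the exit, which is the behavior the second counter bit is meant to buy.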

  23. BTB Instruction Prediction • BTB predictions are based on the previous PC rather than on actual instruction decoding information, so the BTB may make two kinds of mistakes: • Wrongly predicting a non-branch/jump instruction as a branch/jump instruction • Wrongly predicting the instruction boundary (32-bit -> 16-bit) • If either case is detected, the IFU triggers a BTB instruction misprediction in the I1 stage and restarts the program sequence from the recovered PC, which introduces a 2-cycle penalty.

  24. RAS Prediction • When return instructions are present in the instruction sequence, RAS predictions are performed and the fetch sequence is changed to the predicted PC. • Since the RAS prediction is performed in the I1 stage, there is a 2-cycle penalty for return instructions, because the sequential fetches in between are not used.
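
A return address stack of the kind described here can be sketched as follows. The four-entry depth comes from the Branch Prediction Unit slide; the class name and the overflow policy (dropping the oldest entry) are assumptions for illustration only.

```python
class ReturnAddressStack:
    """Small stack that predicts the target of return instructions."""

    def __init__(self, depth=4):
        self.depth = depth
        self.stack = []

    def push(self, return_pc):
        # On a call: record where the matching return should land.
        # Assumed overflow policy: discard the oldest entry.
        if len(self.stack) == self.depth:
            self.stack.pop(0)
        self.stack.append(return_pc)

    def pop(self):
        # On a return: predict the most recently pushed address,
        # or None when the stack holds no prediction.
        return self.stack.pop() if self.stack else None
```

With nested calls deeper than the stack, the innermost returns are still predicted correctly and only the outermost ones lose their prediction.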

  25. Branch Misprediction • In the N12 processor core, branch/return instructions are resolved by the ALU in the E2 stage, and the result is used by the IFU in the next (F1) stage. The misprediction penalty is therefore 5 cycles.
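
To see what a 5-cycle penalty means for throughput, a rough estimate of the extra cycles per instruction can be sketched as below. The function name and the example branch fraction and misprediction rate are hypothetical; only the 5-cycle penalty comes from the slide.

```python
def branch_stall_cpi(branch_fraction, mispredict_rate, penalty_cycles=5):
    """Average stall cycles added per instruction by mispredicted branches."""
    return branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% branches, 10% of them mispredicted.
extra_cpi = branch_stall_cpi(0.20, 0.10)
print(f"extra CPI from mispredictions: {extra_cpi:.2f}")
```

Under these assumed numbers the core loses about a tenth of a cycle per instruction, which is why the deck treats prediction accuracy as a first-order concern for a deep pipeline.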

  26. Cache

  27. N1213-S Block diagram

  28. Cache and CPU • [Diagram: the CPU sends addresses to a cache controller, which sits between the cache and main memory; data moves between the CPU, the cache, and main memory]

  29. Multiple levels of cache • [Diagram: CPU, L1 cache, L2 cache]

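  30. Cache data flow • [Diagram: the CPU issues instruction fetches to the I-Cache and loads & stores to the D-Cache; external memory supplies I-Cache refills, D-Cache refills, and uncached instructions/data; uncached writes, write-throughs, and write-backs go to external memory]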

  31. Cache operation • Many main memory locations are mapped onto one cache entry. • May have caches for: • instructions; • data; • data + instructions (unified).
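
The statement that many memory locations map onto one cache entry can be made concrete by splitting an address into tag, set index, and line offset. The sketch below assumes one of the configurations listed earlier (16KB, 4-way, 32B lines) and ignores the virtual-index/physical-tag distinction by using a single address; the function name is made up for the example.

```python
def split_address(addr, cache_bytes=16 * 1024, ways=4, line_bytes=32):
    """Split an address into (tag, set index, byte offset within the line)."""
    sets = cache_bytes // (ways * line_bytes)  # 128 sets for the defaults
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset


# Two addresses one "way size" apart (sets * line_bytes = 4096 bytes)
# land in the same set and differ only in their tag.
a = split_address(0x12345)
b = split_address(0x12345 + 4096)
print(a, b)
```

Any two addresses that differ by a multiple of 4096 bytes compete for the same set in this configuration, which is where associativity and the replacement policy come in.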

  32. Replacement policy • Replacement policy: strategy for choosing which cache entry to throw out to make room for a new memory location. • Two popular strategies: • Random. • Least-recently used (LRU).
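
A least-recently-used policy for a single cache set can be sketched as below. This is a software model for illustration, not how the hardware tracks recency; the class name and the use of `OrderedDict` are choices made for the example.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with LRU replacement; stores tags only."""

    def __init__(self, ways=4):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag):
        """Return True on a hit, False on a miss (the line is then filled)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark as most recently used
            return True
        if len(self.lines) == self.ways:
            self.lines.popitem(last=False)  # evict the least recently used
        self.lines[tag] = None
        return False
```

A random policy would replace the `popitem(last=False)` call with the removal of an arbitrarily chosen tag, trading some hit rate for simpler bookkeeping.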

  33. Write operations • Write-through: immediately copy write to main memory. • Write-back: write to main memory only when location is removed from cache.
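
The difference between the two write policies shows up in how often main memory is written. The sketch below models a cache with a single line and counts memory writes; it is deliberately simplified, and the class and attribute names are hypothetical.

```python
class OneLineCache:
    """Single-line cache that counts writes reaching main memory."""

    def __init__(self, write_back):
        self.write_back = write_back
        self.addr = None       # address currently held in the line
        self.dirty = False     # only meaningful for write-back
        self.memory_writes = 0

    def write(self, addr):
        if self.addr != addr:
            self._evict()
            self.addr = addr
        if self.write_back:
            self.dirty = True        # defer the memory update
        else:
            self.memory_writes += 1  # write-through: update memory now

    def _evict(self):
        # Write-back: memory is updated only when a dirty line leaves.
        if self.dirty:
            self.memory_writes += 1
        self.dirty = False


def memory_writes(addresses, write_back):
    cache = OneLineCache(write_back)
    for addr in addresses:
        cache.write(addr)
    return cache.memory_writes
```

Three writes to one location followed by a write to another cost four memory writes with write-through but only one with write-back in this model, since the last line is still sitting dirty in the cache.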

  34. Improving Cache Performance • Goal: reduce the Average Memory Access Time (AMAT) • AMAT = Hit Time + Miss Rate * Miss Penalty • Approaches • Reduce Hit Time • Reduce Miss Penalty • Reduce Miss Rate • Notes • There may be conflicting goals • Keep track of clock cycle time, area, and power consumption
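
The AMAT formula can be evaluated directly to compare design options. The numbers below (a 1-cycle hit, 5% miss rate, 20-cycle miss penalty) are hypothetical values chosen for the example, not N1213-S figures.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time: hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


baseline = amat(hit_time=1, miss_rate=0.05, miss_penalty=20)
# Halving the miss rate versus halving the miss penalty:
fewer_misses = amat(hit_time=1, miss_rate=0.025, miss_penalty=20)
cheaper_misses = amat(hit_time=1, miss_rate=0.05, miss_penalty=10)
print(baseline, fewer_misses, cheaper_misses)
```

In this example both changes give the same AMAT, so the choice between them would come down to the cycle time, area, and power notes on the slide.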

  35. Tuning Cache Parameters • Size: • Must be large enough to fit the working set (temporal locality) • If too big, then hit time degrades • Associativity • Needs to be large to avoid conflicts, but 4-8 way is as good as fully associative (FA) • If too big, then hit time degrades • Block • Needs to be large to exploit spatial locality & reduce tag overhead • If too large, few blocks ⇒ higher misses & miss penalty • A configurable architecture allows designers to make the best performance/cost trade-offs

  36. Memory Management Units (MMU)

  37. N1213-S Block diagram

  38. MMU Functionality • Memory management unit (MMU) translates addresses • [Diagram: the CPU sends a logical address to the memory management unit, which outputs a physical address]

  39. MMU Architecture • [Diagram: the IFU and LSU access a 4/8-entry I-uTLB and a 4/8-entry D-uTLB; an M-TLB arbiter feeds a 32x4 M-TLB (N = 32 sets, k = 4 ways, 128 entries) with M-TLB tag and M-TLB data arrays; the HPTWK connects to the bus interface unit] • M-TLB entry index: way number in bits Log2(N*K)-1 to Log2(N) (bits 6 to 5), set number in bits Log2(N)-1 to 0 (bits 4 to 0)
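
Reading the bit labels on this slide as "set number in bits Log2(N)-1 to 0, way number in bits Log2(N*K)-1 to Log2(N)" (an interpretation of the diagram residue, not a confirmed register layout), the M-TLB entry index can be packed and unpacked as sketched below. The function names are made up for the example.

```python
def mtlb_entry_index(set_number, way_number, sets=32, ways=4):
    """Pack a set number and a way number into one M-TLB entry index."""
    if not (0 <= set_number < sets and 0 <= way_number < ways):
        raise ValueError("set or way out of range")
    set_bits = sets.bit_length() - 1  # log2(sets) for a power of two
    return (way_number << set_bits) | set_number


def split_mtlb_index(index, sets=32, ways=4):
    """Inverse of mtlb_entry_index: return (set_number, way_number)."""
    if not 0 <= index < sets * ways:
        raise ValueError("index out of range")
    set_bits = sets.bit_length() - 1
    return index & (sets - 1), index >> set_bits
```

With 32 sets and 4 ways the index is 7 bits wide and covers all 128 entries, matching the N x k figure in the diagram.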

  40. MMU Functionality • Virtual memory addressing • Better memory allocation, less fragmentation • Allows shared memory • Dynamic loading • Memory protection (read/write/execute) • Different permission flags for kernel/user mode • OS typically runs in kernel mode • Applications run in user mode • Cache control (cached/uncached) • Accesses to peripherals and other processors need to be uncached.
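
Address translation with separate kernel/user permission flags can be sketched as below. The page table is modeled as a plain dictionary with a 4K page size; the entry layout, field names, and exception type are assumptions for illustration and do not reflect the AndesCore page table format.

```python
PAGE_SIZE = 4096  # 4K pages, one of the sizes listed for the N1213-S MMU


class PageFault(Exception):
    """Raised when a translation is missing or the access is not permitted."""


def translate(page_table, vaddr, access, user_mode):
    """Translate a virtual address; access is one of 'r', 'w', 'x'."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    entry = page_table.get(vpn)
    if entry is None:
        raise PageFault("no mapping for virtual page")
    allowed = entry["user"] if user_mode else entry["kernel"]
    if access not in allowed:
        raise PageFault("permission denied")
    return entry["pfn"] * PAGE_SIZE + offset


# Hypothetical mapping: kernel may read and write, user may only read.
page_table = {0x10: {"pfn": 0x80, "kernel": "rw", "user": "r"}}
print(hex(translate(page_table, 0x10234, "r", user_mode=True)))
```

The same virtual page yields a physical address for a user-mode read but a fault for a user-mode write, which is the protection split between OS and applications that the slide describes.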

  41. Direct Memory Access (DMA)

  42. N1213-S Block diagram

  43. DMA overview • Two channels • One active channel • Programmed using physical addressing • For both instruction and data local memory • External address can be incremented with stride • Optional 2-D Element Transfer (2DET) feature which provides an easy way to transfer two-dimensional blocks from external memory. • [Diagram: Local Memory, DMA Controller, Ext. Memory]
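
The idea of a strided, two-dimensional transfer can be sketched in software as below: a rectangular block is gathered from a flat external buffer into contiguous local memory. This only illustrates the access pattern; the function and parameter names are hypothetical and do not correspond to the actual DMA setup registers.

```python
def dma_2d_copy(ext_mem, start, width, height, stride):
    """Gather a width x height block whose rows are `stride` elements apart."""
    local = []
    for row in range(height):
        base = start + row * stride          # external address steps by stride
        local.extend(ext_mem[base:base + width])
    return local


# A 4x4 "image" stored row by row; fetch the 2x2 block starting at row 1, col 1.
image = list(range(16))
print(dma_2d_copy(image, start=5, width=2, height=2, stride=4))
```

Without the stride, software would have to program one transfer per row; with it, a single descriptor covers the whole block.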

  44. LMDMA Double Buffer Mode • [Diagram: the core pipeline performs computation on one local memory bank (Bank 0 / Bank 1) while the DMA engine performs data movement between the other bank and external memory; the banks are switched between the core and the DMA engine] • Width byte stride (in DMA Setup register) = 1
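
The bank-switching scheme can be sketched as a ping-pong loop: while the core works on one bank, the DMA engine fills the other, and the roles swap each iteration. In hardware the two activities overlap in time; this sequential sketch only shows the ordering, and the function name is an assumption.

```python
def process_stream(chunks, compute):
    """Process chunks with two banks alternating between DMA and core."""
    banks = [None, None]
    results = []
    dma_bank = 0
    if chunks:
        banks[dma_bank] = chunks[0]          # initial fill before compute starts
    for i in range(len(chunks)):
        core_bank = dma_bank                 # core takes the bank just filled
        dma_bank ^= 1                        # bank switch
        if i + 1 < len(chunks):
            banks[dma_bank] = chunks[i + 1]  # DMA prefetches the next chunk
        results.append(compute(banks[core_bank]))
    return results


print(process_stream([[1, 2], [3, 4], [5]], sum))
```

Because the next chunk is already in place when the core finishes, the data movement time is hidden behind computation as long as the transfer is not slower than the compute step.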

  45. Bus Interface Unit (BIU)

  46. N1213-S Block diagram

  47. N1213-S Bus • AMBA 2.0 AHB bus • 1 port • 2 port • ICU/MMU (read only) for port 1 • LSU/DMA/EDM (read/write) for port 2 • HSMP • High speed memory port • Same frequency as the CPU core • AMBA 3.0 (AXI) protocol compliant, but with reduced I/O requirements • 1 or 2 port configuration

  48. BIU introduction • The bus interface unit is responsible for off-CPU memory access, which includes • System memory access • Instruction/data local memory access • Memory-mapped register access in devices.

  49. Bus Interface • Compliant with AHB/AHB-Lite/APB • High Speed Memory Port • Andes Memory Interface • External LM Interface

  50. HSMP – High speed memory port • N12 also provides a high speed memory port interface which has higher bus protocol efficiency and can run at a higher frequency to connect to a memory controller. • The high speed memory port will be AMBA3.0 (AXI) protocol compliant, but with reduced I/O requirements.
