
ARM Architecture



Presentation Transcript


  1. ARM Architecture Charles Bock

  2. ARM Flavors • Cortex-A: Application (fully featured) • Android/iPhone • Windows RT tablets • Cortex-R: Real time (RTOS) • Cars • Routers • Infrastructure • Cortex-M: Embedded (minimal) • Automation • Appliances • ULP (ultra-low-power) devices • I will focus on the Cortex-A15 • Most featured • Most complex • Most interesting

  3. Cortex-A15 Overview • Supports the 32-bit ARM and 16-bit Thumb ISAs • Superscalar, variable-length, out-of-order pipeline • Dynamic branch prediction with a Branch Target Buffer (BTB) and Global History Buffer (GHB) • Two separate 32-entry fully associative Level 1 (L1) Translation Lookaside Buffers (TLBs) • 4-way set-associative 512-entry Level 2 (L2) TLB in each processor • Fixed 32KB L1 instruction and data caches • Shared L2 cache of up to 4MB • 40-bit physical addressing (1TB)

  4. Instruction Set • RISC (ARM: Advanced RISC Machine) • Fixed instruction width of 32 bits for easy decoding and pipelining, at the cost of decreased code density • Additional modes or states allow additional instruction sets • Thumb (16-bit) • Thumb-2 (16- and 32-bit) • Jazelle (Java bytecode) • Trade-off: 32-bit ARM vs. 16-bit Thumb

  5. Thumb • Thumb is a 16-bit instruction set • Improved code density, which can improve performance when instruction fetch bandwidth is limited; more operands are implied rather than encoded • Subset of the functionality of the ARM instruction set

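  6. Instruction Encoding [Figure: two encodings of an ADD instruction compared. Field labels on the first: Always, ADD, Op 1, Destination, Op 2, with one field marked "Wasted!". Field labels on the second: ADD, Operands, Destination.]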

  7. Jazelle • Jazelle DBX (Direct Bytecode eXecution) technology for direct Java bytecode execution • Direct interpretation of bytecode to machine code

  8. General Layout • Fetch • Decode • Dispatch • Execute • Load/Store • WriteBack

  9. Block Diagram 1 [Figure: pipeline block diagram with annotations] • Integer pipeline depth: 15 (same as recent Intel cores) • FP / SIMD pipeline depth: 18-24 • Instructions are broken down into sub-operations at decode (this is genius) • Register renaming • SIMD

  10. Block Diagram 2

  11. Fetch • Up to 128 bits per fetch depending on alignment • ARM Set: 4 Instructions (32 bit) • Thumb Set: 8 Instructions (16 bit) • Only 3 can be dispatched per cycle. • Support for unaligned fetch address. • Branch prediction begins in parallel with fetch.
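
The fetch figures above follow from simple division. The sketch below assumes an aligned 128-bit fetch window; the constant and function names are illustrative and not taken from ARM documentation.

```python
FETCH_WIDTH_BITS = 128  # from the slide: up to 128 bits per fetch
DISPATCH_WIDTH = 3      # from the slide: at most 3 instructions dispatched per cycle


def instructions_per_fetch(instr_bits: int) -> int:
    """Upper bound on whole instructions contained in one aligned fetch."""
    return FETCH_WIDTH_BITS // instr_bits


arm_per_fetch = instructions_per_fetch(32)    # 32-bit ARM encoding
thumb_per_fetch = instructions_per_fetch(16)  # 16-bit Thumb encoding

# In both cases fetch can supply more instructions than dispatch can consume.
fetch_outpaces_dispatch = min(arm_per_fetch, thumb_per_fetch) > DISPATCH_WIDTH
```

Because a fetch can deliver 4 or 8 instructions while only 3 are dispatched per cycle, the front end can run ahead of dispatch when fetches are well aligned.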

  12. Branch prediction - Global History Buffer • Global History Buffer • 3 arrays: Taken array, Not taken array, and Selector

  13. Branch Prediction - microBTB • microBTB • Reduces the bubble on taken branches • 64-entry, fully associative, for fast-turnaround prediction • Caches taken branches only • Overruled by the main predictor if they disagree

  14. Branch Prediction - Indirect • Indirect predictor • 256-entry BTB indexed by the XOR of branch history and branch address • The XOR allows multiple target addresses to be stored per branch
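
The XOR indexing can be sketched as follows. This is a minimal model that assumes the branch address is XORed with a small history register built from recent targets; the history width, the way targets are folded into it, and the `IndirectPredictor` name are hypothetical choices, not the A15's actual hashing.

```python
class IndirectPredictor:
    """256-entry target table indexed by (branch address XOR history)."""

    ENTRIES = 256  # from the slide

    def __init__(self):
        self.targets = [None] * self.ENTRIES
        self.history = 0  # hypothetical 8-bit history of recent targets

    def _index(self, pc: int) -> int:
        return ((pc >> 2) ^ self.history) % self.ENTRIES

    def predict(self, pc: int):
        """Return the predicted target, or None if nothing is stored."""
        return self.targets[self._index(pc)]

    def update(self, pc: int, actual_target: int) -> None:
        self.targets[self._index(pc)] = actual_target
        # Fold low target bits into the history so that the same branch
        # maps to different entries depending on where it went recently.
        self.history = ((self.history << 2) ^ (actual_target >> 2)) & 0xFF


# Usage: one branch that alternates between two targets.
predictor = IndirectPredictor()
branch, target_a, target_b = 0x4000, 0x8004, 0x9008
for t in [target_a, target_b] * 10:      # warm-up
    predictor.update(branch, t)
hits = 0
for t in [target_a, target_b] * 5:       # measure
    hits += predictor.predict(branch) == t
    predictor.update(branch, t)
```

With address-only indexing, a branch that alternates targets would be mispredicted every time; mixing in history gives each target its own entry.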

  15. Branch Prediction - Return Stack • Return address stack • 8-32 entries deep • Most indirect jumps (about 85%) are returns from functions • Push on call • Pop on return
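
The push-on-call, pop-on-return behaviour can be sketched as a small bounded stack. The default depth of 8 is the low end of the range on the slide; the overflow policy (drop the oldest entry) and the class name are assumptions made for illustration.

```python
class ReturnAddressStack:
    """Bounded stack of predicted return addresses."""

    def __init__(self, depth: int = 8):
        self.depth = depth
        self.stack = []

    def on_call(self, return_address: int) -> None:
        """Push the address of the instruction following the call."""
        if len(self.stack) == self.depth:
            self.stack.pop(0)  # assumed policy: discard the oldest entry when full
        self.stack.append(return_address)

    def on_return(self):
        """Pop the predicted return target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


# Usage: nested calls return in reverse order.
ras = ReturnAddressStack()
ras.on_call(0x100)   # outer call
ras.on_call(0x200)   # inner call
inner = ras.on_return()
outer = ras.on_return()
```

When call nesting exceeds the stack depth, the oldest return addresses are lost and those returns fall back to another predictor or mispredict.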

  16. Branch Prediction - Misc • Deeper pipeline = larger mispredict penalty • Static predictor: always predicts taken when the branch is not yet known

  17. Decode / Out of order Issue • Instructions are Decoded into discrete sub operations • Multiple Issue Queues (8) • Instructions dispatched 3 per cycle to the appropriate issue queue • The instruction dispatch unit controls when the decoded instructions can be dispatched to the execution pipelines and when the returned results can be retired

  18. Register Renaming • RRT (Register Rename Table) • Maps each register named by an instruction to an available physical register • Rename loop • Queue which stores available registers for use • Registers are removed from the queue when in use • Registers are re-added when retired from use • 13 general-purpose registers, R0-R12 • R13 = Stack Pointer • R14 = Return Address (function calls) • R15 = Program Counter
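
The rename table and free queue described above can be sketched as below. The physical register count (32), the method names, and the rule that a physical register is freed when a later writer of the same register retires are illustrative assumptions, not details taken from the A15 documentation.

```python
from collections import deque


class RenameTable:
    """Maps architectural registers to physical registers via a free queue."""

    def __init__(self, num_arch: int = 16, num_phys: int = 32):
        # Initially architectural register Ri lives in physical register i.
        self.map = {f"R{i}": i for i in range(num_arch)}
        self.free = deque(range(num_arch, num_phys))  # available physical registers

    def rename(self, dest: str, srcs: list):
        """Rename one instruction; returns (new_dest, phys_srcs, old_dest)."""
        phys_srcs = [self.map[s] for s in srcs]  # read sources before remapping dest
        if not self.free:
            raise RuntimeError("no free physical register: dispatch must stall")
        old = self.map[dest]
        new = self.free.popleft()  # removed from the queue while in use
        self.map[dest] = new
        return new, phys_srcs, old

    def retire(self, old_phys: int) -> None:
        """Return a superseded physical register to the free queue."""
        self.free.append(old_phys)


# Usage: two back-to-back writes to R1 get different physical registers.
table = RenameTable()
first = table.rename("R1", ["R2", "R3"])   # R1 = R2 op R3
second = table.rename("R1", ["R1"])        # R1 = op R1
```

Because each write to R1 receives its own physical register, the second instruction only depends on the first through a true data dependence, and write-after-write hazards on R1 disappear.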

  19. Loop Buffer / Loop Cache • 32 Entries Long • Can contain up to two “forward” and one “backward” branch • Completely shuts down fetch and large parts of decode stages. • Why? Saves power, Saves time. • Smart!

  20. Execution Lanes • Integer lane • Single-cycle integer operations • 2 ALUs, 2 shifters • FPU / SIMD (NEON) lane • Asymmetric, varying length, 2-10 cycles • Branch lane • Any operation that targets the PC for writeback, usually 1 cycle • Mult / Div lane • All multiply/divide operations, 4 cycles • Load / Store lane • Cache / memory access, 4 cycles • Cache maintenance • 1 load and 1 store per cycle • A load cannot bypass a store; a store cannot bypass a store

  21. Load Store Pipeline • Issue queue 16 deep • Out of order but cannot bypass stores (safe) • Stores in order but only require address to issue • Pipeline • AGU Address generation Unit / TLB Lookup • Address and Tag Setup • Data / Tag Access • Data selection and forwarding

  22. L1 Instruction / Data Caches • 32KB 2-way set-associative cache • 64-byte blocks, so 256 sets × 2 ways × 64 bytes = 32KB • Physically indexed and physically tagged (PIPT) • Strictly enforced write-through (important for cache consistency!)
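
The geometry above determines how a physical address splits into tag, set index, and block offset. The sketch below works that out from the slide's figures (32KB, 2-way, 64-byte blocks); the function and constant names are illustrative.

```python
CACHE_BYTES = 32 * 1024  # from the slide
LINE_BYTES = 64          # from the slide
WAYS = 2                 # from the slide

SETS = CACHE_BYTES // (LINE_BYTES * WAYS)   # number of sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1   # log2(64)
INDEX_BITS = SETS.bit_length() - 1          # log2(number of sets)


def split_address(paddr: int):
    """Split a physical address into (tag, set index, block offset)."""
    offset = paddr & (LINE_BYTES - 1)
    index = (paddr >> OFFSET_BITS) & (SETS - 1)
    tag = paddr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


tag, index, offset = split_address(0x12345678)
```

With 6 offset bits and 8 index bits, the remaining upper bits of the 40-bit physical address form the tag that is compared in both ways of the selected set.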

  23. L2 Shared Cache • 16 Way Set Assoc, 4MB • 4 tag banks to handle parallel requests • All Snooping is done at this level to keep caches consistent. • If a core is powered down its L1 cache can be restored from L2. • Any “Read Clean” Requests on the bus can be serviced by L2. • Supports Automatic Prefetching for Streaming Data Loads

  24. Dual-Layer TLB Structure • Layer one: • Two separate 32-entry fully associative L1 TLBs for the data load and store pipelines • Layer two: • 4-way set-associative 512-entry L2 TLB in each processor • In general: • The TLB entries contain a global indicator or an Address Space Identifier (ASID) to permit context switches without TLB flushes • The TLB entries contain a Virtual Machine Identifier (VMID) to permit virtual machine switches without TLB flushes • Miss: • Trade-off: add more hardware for faster TLB miss handling, or let the OS handle it in software? • The CPU includes a full hardware table-walk machine in case of a TLB miss; no OS involvement is required
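
The role of the ASID, VMID, and global indicator can be sketched with a dictionary-backed model. It ignores capacity, associativity, and the two TLB levels entirely, and the `TaggedTLB` class and its methods are hypothetical names used only for illustration.

```python
class TaggedTLB:
    """TLB model whose entries are tagged with VMID and ASID (or marked global)."""

    def __init__(self):
        # Key: (vmid, asid, virtual page number); asid is None for global entries.
        self.entries = {}

    def insert(self, vmid: int, asid: int, vpn: int, ppn: int,
               is_global: bool = False) -> None:
        key = (vmid, None if is_global else asid, vpn)
        self.entries[key] = ppn

    def lookup(self, vmid: int, asid: int, vpn: int):
        """Return the physical page number, or None on a miss."""
        for key in ((vmid, asid, vpn), (vmid, None, vpn)):
            if key in self.entries:
                return self.entries[key]
        return None  # a miss would trigger the hardware table walk


# Usage: two processes map the same virtual page to different physical pages.
tlb = TaggedTLB()
tlb.insert(vmid=0, asid=1, vpn=0x10, ppn=0xA)
tlb.insert(vmid=0, asid=2, vpn=0x10, ppn=0xB)
tlb.insert(vmid=0, asid=1, vpn=0x20, ppn=0xC, is_global=True)
```

A context switch only changes the current ASID (or VMID) used for lookups, so entries belonging to other processes or virtual machines can stay resident without being matched.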

  25. big.LITTLE • Combines the Cortex-A15 with the Cortex-A7 • The interconnect sits below the shared L2 cache

