Explore a comprehensive approach for early design phase evaluation & debugging of System on Chip (SoC), addressing issues like bandwidth allocation, traffic management, and interconnect optimization. Learn about proposed solutions, alternative methods, limitations, and the importance of accurate performance simulation for efficient SoC development.
Unified Approach for Performance Evaluation and Debug of System on Chip at Early Design Phase
Nishit Gupta, Scientist, Ministry of Electronics & Information Technology (MeitY), Government of India
Agenda
• Introduction
• Problem statement
• Alternate solutions and limitations
• Proposed solutions
• Performance and Debug Approaches
• Conclusions
Introduction – Problem Statement (Then & Now)
• ICs during the 1990s to 2000s: bandwidth (BW) was not an issue
• ICs now: BW is a major issue
  • 4 GB/s is required for a typical digital TV SoC
  • Available BW is below 4 GB/s because multiple IPs access memory in parallel
  • There is a limit on the available BW
[Figure: IP blocks (IP1 to IP10) connected to MEMORY, contrasting ICs of the 1990s to 2000s with ICs now]
Introduction – Problem Statement (Typical SoC)
Introduction – Problem Statement (Typical Interconnect)
Introduction – Problem Statement (Areas of Concern)
• Design a good interconnect and tune it to give a "FAIR" BW share to every IP
• IP traffic classes:
  • Real time (video IPs)
  • Latency sensitive (processors)
  • High bandwidth (video decoder)
• Tune the DDR accesses to maximize efficiency
[Figure: IP 1 to IP 10 connected through a MIXER to MEMORY]
Introduction – Problem Statement (Alternate Solutions & Limitations)
• Spreadsheet analysis
  • Low accuracy when traffic from different IPs gets mixed
  • Tuning the Mixer for best DDR efficiency is not feasible
• High-level C/C++ model of the whole system (including IPs, interconnect, DDR subsystem)
  • High-level C/C++ models for complex SoCs are not available and need a lot of effort to create and maintain for each SoC
  • Low accuracy and poor correlation with post-silicon results
• Performance simulation with SoC RTL
  • Stable SoC RTL is available very late; accurate but very slow
  • Software drivers for all IPs need to be available
• Emulation platforms
  • Faster than RTL simulation, but results come very late in the SoC cycle and having s/w drivers available on time is difficult
  • RTL changes are very difficult without major impact on schedule
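Note: the spreadsheet approach listed above amounts to summing average demands and comparing the total with an assumed usable bandwidth. The sketch below illustrates that arithmetic only; every IP name and number in it is hypothetical and not taken from the slides. It also shows why the method is fragile: the verdict flips with the assumed DDR efficiency, which in practice depends on how the traffic mixes.

```python
# Spreadsheet-style bandwidth budget. All names and numbers are hypothetical.
demand_mb_s = {
    "video_decoder": 1200,
    "display": 900,
    "cpu": 600,
    "gpu": 1000,
    "dma": 300,
}

def budget(demand, peak_mb_s, ddr_efficiency):
    """Return (total demand, usable bandwidth, margin), all in MB/s."""
    total = sum(demand.values())
    usable = peak_mb_s * ddr_efficiency
    return total, usable, usable - total

# Assumed DDR interface with a 6400 MB/s theoretical peak.
for eff in (0.8, 0.6):
    total, usable, margin = budget(demand_mb_s, 6400, eff)
    verdict = "fits" if margin >= 0 else "does NOT fit"
    print(f"efficiency {eff:.0%}: demand {total} MB/s, usable {usable:.0f} MB/s: {verdict}")
```

The same demand table passes at one efficiency and fails at the other, and a spreadsheet has no way to say which efficiency the real Mixer and DDR controller will reach.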
Requirements
• Model of the System on Chip (SoC) at an abstraction level that:
  • Is available at an early design phase
  • Simulates fast enough
  • Is accurate
  • Is able to exercise various scenarios
Proposed Solution
• TLM: virtual SoC (VSOC) available in months; low accuracy, high simulation speed
• Proposed solution: SoC model available in weeks; moderate accuracy and simulation speed
• RTL: SoC available in years; high accuracy, low simulation speed
• Uses listed across these levels: embedded s/w development, golden reference models for functional verification, rough power estimation, performance analysis, power analysis, timing analysis, architecture analysis
• Solution: use reconfigurable components at the abstraction level of Transaction Level Model (TLM) + Bus Cycle Accurate (BCA)
Proposed Solution (Overview)
• Create Bus Master (IP traffic generator) models
  • Model the periodic peak traffic accurately
  • For "real time" IPs, model the internal FIFOs accurately
• Create applications with a set of Bus Masters for each "use case"
  • A "use case" is one of the several modes in which the chip can operate
• Create a platform with a BCA+TLM model of the interconnect and actual RTL for the DDR memory subsystem
• Run simulations
  • Check bandwidth and latency values
  • Check FIFO levels for real-time IPs
• Run SysPerf on the VCD
• Tune the programmable parameters
  • Soft tuning: interconnect parameters, Mixer + DRAM controller parameters, IP parameters
  • Hard tuning (triggers RTL changes): interconnect topology, interconnect data path widths, interconnect frequencies, DRAM interface frequency, IP FIFO sizes
[Figure: several Bus Masters, Interconnect (SystemC model), SystemC Wrapper, Mixer + Memory Ctrl RTL, Memory Model]
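Note: the "check FIFO levels for real-time IPs" step can be pictured with a toy cycle-based model. This is a minimal sketch under simplifying assumptions (constant drain, one outstanding burst, fixed memory latency); the function name and all parameters are hypothetical and do not describe the actual Bus Master models.

```python
def min_fifo_level(fifo_depth, burst, latency, cycles, drain_per_cycle=1):
    """Simulate the read FIFO of a real-time IP; return (lowest level, underflows).

    The IP drains `drain_per_cycle` words every cycle. Whenever there is room
    for a whole burst and no request is pending, it requests `burst` words,
    which arrive `latency` cycles later. The FIFO starts full.
    """
    level = fifo_depth
    lowest = level
    underflows = 0
    arrival = None  # cycle at which the pending burst lands, or None
    for cycle in range(cycles):
        if arrival is not None and cycle == arrival:
            level += burst
            arrival = None
        if arrival is None and fifo_depth - level >= burst:
            arrival = cycle + latency
        if level >= drain_per_cycle:
            level -= drain_per_cycle
        else:
            underflows += 1  # the IP wanted data but the FIFO was empty
        lowest = min(lowest, level)
    return lowest, underflows

# Same FIFO and burst size, two assumed memory latencies.
for latency in (10, 20):
    lowest, underflows = min_fifo_level(64, 16, latency, 1000)
    print(f"latency {latency}: lowest level {lowest}, underflow cycles {underflows}")
```

With the longer latency the refill rate (16 words per 20 cycles) falls below the drain rate, so the FIFO eventually runs dry. That is the kind of failure the FIFO-level check is meant to expose before RTL is frozen.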
What needs to be tuned for good traffic management?
• IP parameters
• FIFO size and store & forward (annotated at three points in the figure)
• Arbitration scheme and bandwidth limiters
• Mixer programming for good flow regulation and best DRAM efficiency
[Figure: IP, freq/size converter, node with arbiter, Mixer with arbiter, DRAM controller, DRAM]
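Note: as an illustration of what an "arbitration scheme" parameter controls, here is a generic weighted round-robin arbiter. It is a textbook scheme chosen for the sketch, not the arbitration implemented in the interconnect described here; the master names and weights are hypothetical.

```python
from collections import Counter

def weighted_round_robin(weights, requests_per_cycle):
    """Yield the granted master (or None) for each cycle.

    weights: dict of master -> credits per round.
    requests_per_cycle: iterable of sets of masters requesting in that cycle.
    """
    credits = dict(weights)
    order = list(weights)
    for requesting in requests_per_cycle:
        grant = None
        for _ in range(2):  # second pass runs only after a credit refill
            for master in order:
                if master in requesting and credits[master] > 0:
                    grant = master
                    break
            if grant is not None or not requesting:
                break
            credits = dict(weights)  # nobody eligible: start a new round
        if grant is not None:
            credits[grant] -= 1
            order.remove(grant)   # granted master moves to the back
            order.append(grant)
        yield grant

weights = {"video": 3, "cpu": 2, "dma": 1}
grants = Counter(weighted_round_robin(weights, [set(weights)] * 600))
print(grants)
```

When every master requests in every cycle, the grant counts follow the weight ratio, which is one way to give each IP a tunable share of the bandwidth.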
Not able to meet bandwidth, latency requirements?
• Look for bottlenecks in the interconnect; the performance simulation results have this information
• Interconnect topology change may help. Did it work?
• IP FIFO size increase unavoidable? How much increase is optimal?
• Frequency increase unavoidable? How much increase is optimal?
• Run performance simulations to find out
[Figure: IP, freq/size converter, node with arbiters, Mixer with arbiter, DRAM controller, DRAM]
Performance Evaluation Approach
• Flow
  • Transaction extraction from simulation and simulation database recording
  • Different abstraction levels displayed, with visualization of transactions along with signals
• Features
  • Debug: transaction recording and protocol checking
  • Performance: analyzers module (latencies, BD, …)
[Figure: HW simulation and .vcd files at simulation time; CFG, SysPerf and result files at analysis time]
Performance Evaluation Approach (Transaction Recording Module)
Performance Evaluation Approach (Performance Recording Module)
• Example readout: bandwidth 865 MB/s
• Latency (#cycles): wvalid2wready, bvalid2bready
[Figure: request packet from Initiator1 and response packet to Initiator1]
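Note: a valid-to-ready latency such as wvalid2wready can be derived from per-cycle signal samples, and bandwidth from bytes moved per unit time. The sketch below assumes the signals have already been sampled once per clock cycle into lists of 0/1; the function names are hypothetical and this is not SysPerf code.

```python
def handshake_latencies(valid, ready):
    """Cycles from valid first being high until valid and ready are both high."""
    latencies = []
    start = None
    for cycle, (v, r) in enumerate(zip(valid, ready)):
        if v and start is None:
            start = cycle          # valid asserted, waiting for ready
        if v and r:
            latencies.append(cycle - start)
            start = None           # handshake done
    return latencies

def bandwidth_mb_s(bytes_transferred, cycles, clock_mhz):
    """Average bandwidth in MB/s (1 MB = 10**6 bytes).

    bytes/cycle times cycles/microsecond gives bytes/microsecond, i.e. MB/s.
    """
    return bytes_transferred * clock_mhz / cycles

# Hypothetical 7-cycle trace with two handshakes.
wvalid = [0, 1, 1, 1, 0, 1, 0]
wready = [0, 0, 0, 1, 0, 1, 0]
print(handshake_latencies(wvalid, wready))
print(bandwidth_mb_s(bytes_transferred=16, cycles=7, clock_mhz=200))
```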
Performance Evaluation Approach (Performance Recording Module)
Modeling Approach: Extraction of Memory References
• Memory references are extracted by running TxE scripts over the Value Change Dump (.vcd) files produced by SoC simulation
• They are further tuned for timing parameters and synchronization, then fed to Bus Masters with embedded cache simulation models

  $MASTER_NAME FDMA
  $MASTER_PROCESS_SEQ {1,2,3}*10
  # Process 1: IP doing no operation
  $PROCESS_NAME Nop_Operation
  $PROCESS_SEQ NOP 500 END
  # Process 2: FDMA produces traffic noise of 50MB/s on memory area
  $PROCESS_NAME Mem_Write_Access
  $PROCESS_BANDWIDTH 50
  $PROCESS_DATA_LENGTH 1024
  $PROCESS_OPCODE {WRITE32;WRITE16;WRITE8}
  $PROCESS_ADDRESS {(0x0~0x2ff)=20;(0x3ff~0x4ff)=80}
  # Process 3: IP does read accesses on control register area
  $PROCESS_NAME Read_To_CtrReg
  $PROCESS_SEQ START READ16 $addr++ 0xffff REPEAT 1000
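Note: one plausible reading of the $PROCESS_ADDRESS line is that each address range carries a percentage weight (20% and 80% here). That interpretation is an assumption, since the slides do not define the syntax. Under that assumption, a traffic generator could draw addresses as sketched below; the helper names are hypothetical.

```python
import random
import re

# Matches entries such as "(0x0~0x2ff)=20".
RANGE_RE = re.compile(r"\((0x[0-9a-fA-F]+)~(0x[0-9a-fA-F]+)\)=(\d+)")

def parse_address_spec(spec):
    """Parse '{(lo~hi)=weight;...}' into a list of (lo, hi, weight) tuples."""
    return [(int(lo, 16), int(hi, 16), int(w)) for lo, hi, w in RANGE_RE.findall(spec)]

def pick_address(ranges, rng):
    """Choose a range according to its weight, then a uniform address inside it."""
    lo, hi, _ = rng.choices(ranges, weights=[w for _, _, w in ranges])[0]
    return rng.randint(lo, hi)

ranges = parse_address_spec("{(0x0~0x2ff)=20;(0x3ff~0x4ff)=80}")
rng = random.Random(0)  # fixed seed so runs are repeatable
sample = [pick_address(ranges, rng) for _ in range(1000)]
low_share = sum(addr <= 0x2FF for addr in sample) / len(sample)
print(ranges)
print(f"share of accesses in the first range: {low_share:.2f}")
```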
Cache Memory Policies/Configurations
• Replacement policy
  • Least Recently Used
  • First-In-First-Out
  • Random
• Fetch policy
  • Demand fetch
  • Pre-fetch
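Note: the difference between the Least Recently Used and First-In-First-Out replacement policies shows up even on a tiny trace. The sketch below assumes a fully associative cache with demand fetch; it is a generic illustration with a hypothetical function name, not the cache model embedded in the Bus Masters.

```python
from collections import OrderedDict

def count_hits(addresses, num_lines, policy="LRU"):
    """Count hits in a fully associative cache holding `num_lines` lines."""
    cache = OrderedDict()  # oldest entry first
    hits = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
            if policy == "LRU":
                cache.move_to_end(addr)  # a hit refreshes recency under LRU only
        else:
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict the front entry
            cache[addr] = True  # demand fetch: load the line on a miss
    return hits

trace = [1, 2, 3, 1, 4, 1, 2]
print("LRU hits:", count_hits(trace, 3, "LRU"))    # 2
print("FIFO hits:", count_hits(trace, 3, "FIFO"))  # 1
```

On this trace LRU keeps line 1 alive because it was touched recently, while FIFO evicts it as the oldest insertion, so the two policies produce different memory traffic from the same references.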
Functional Verification Approach
[Figure labels: IP Verification, In-system Verif., Testbench, TBMaster, Memory, TLM DUT, RTL DUT, TLM IPs, Adapters, C/C++ abstract interconnect, SystemC Timing Randomizer (NTRP)]
Nonintrusive Timing Randomization Probes (NTRP)
• Based on the SystemC Verification Library
• Configured through an input cfg file
• Introduces timing randomization at the communication interface: random, constrained, fixed, timed/cycle delay
• Selectively adds delay to transactions by ID, address, opcode, etc.
• Re-orders transactions (introduces out-of-order responses)
NTRP configuration file

  <timing:set>
    <timing:id_number>default</timing:id_number>
    <timing:signal_scope>AWREADY</timing:signal_scope>
    <timing:conditions> </timing:conditions>
    <timing:cycle_delay_model>RANDOM</timing:cycle_delay_model>
    <timing:cycle_delay timing:min="0" timing:max="0" timing:percentage="50"/>
    <timing:cycle_delay timing:min="20" timing:max="40" timing:percentage="50"/>
  </timing:set>
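Note: read literally, the two cycle_delay entries say that half of the AWREADY handshakes get no extra delay and half get 20 to 40 cycles. The sketch below samples delays under that reading; treating "percentage" as a selection probability and drawing uniformly inside each range are assumptions, and the names are hypothetical.

```python
import random

# (min_cycles, max_cycles, percentage), mirroring the two cycle_delay entries.
CYCLE_DELAYS = [(0, 0, 50), (20, 40, 50)]

def sample_delay(delays, rng):
    """Pick a delay bucket by its percentage, then a uniform delay inside it."""
    lo, hi, _ = rng.choices(delays, weights=[p for _, _, p in delays])[0]
    return rng.randint(lo, hi)

rng = random.Random(42)  # fixed seed so runs are repeatable
delays = [sample_delay(CYCLE_DELAYS, rng) for _ in range(10000)]
zero_share = delays.count(0) / len(delays)
print(f"no-delay share: {zero_share:.2f}, max delay: {max(delays)} cycles")
```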
NTRP: AXI3 Timing Randomizer (few scenarios)
• BVALID delay: disordering disabled, different BID value
• BVALID delay: disordering enabled, different BID value
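Note: AXI allows responses with different IDs to return out of order, while responses sharing an ID must keep their order. The sketch below shows one way a randomizer could reorder under that rule, with disordering switchable as in the two scenarios above. The function and its arguments are hypothetical and not the NTRP implementation.

```python
import random

def reorder_responses(responses, rng, disordering=True):
    """Shuffle responses across IDs while keeping the order within each ID.

    responses: list of (id, payload) tuples in their original order.
    """
    if not disordering:
        return list(responses)
    queues = {}  # one in-order queue per ID
    for rid, payload in responses:
        queues.setdefault(rid, []).append((rid, payload))
    out = []
    while queues:
        rid = rng.choice(list(queues))   # pick any ID that still has responses
        out.append(queues[rid].pop(0))   # always take that ID's oldest response
        if not queues[rid]:
            del queues[rid]
    return out

original = [(0, "a0"), (1, "b0"), (0, "a1"), (2, "c0"), (1, "b1")]
print(reorder_responses(original, random.Random(1)))
print(reorder_responses(original, random.Random(1), disordering=False))
```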
Conclusion (Comparison of available solutions)
THANK YOU Nishit Gupta Scientist, R&D in Electronics Group, Ministry of Electronics & Information Technology(MeitY), Government of India