Giga-Scale System-On-A-Chip: International Center on System-on-a-Chip (ICSOC) Jason Cong, University of California, Los Angeles Tel: 310-206-2775, Email: cong@cs.ucla.edu (Other participants are listed inside)
Project Summary • Develop new design methodology to enable efficient giga-scale integration for system-on-a-chip (SOC) designs • Project includes three major components • SOC synthesis tools and methodologies • SOC verification, test, and diagnosis • SOC design driver – network processor
Research Team by Institutions • US • UCLA: Jason Cong • UC Santa Barbara: Tim Cheng • Taiwan • NTHU: Shi-Yu Huang, Tingting Hwang, J. K. Lee, Youn-Long Lin, C. L. Liu, Cheng-Wen Wu, Allen Wu • NCTU: Jing-Yang Jou • China • Tsinghua Univ.: Jinian Bian, Xianlong Hong, Zeyi Wang, Hongxi Xue • Peking Univ.: Xu Cheng • Zhejiang Univ.: Xiaolang Yan
Current Research Team • US • UCLA: Jason Cong • UC Santa Barbara: Tim Cheng • Taiwan • NTHU: Shi-Yu Huang, Tingting Hwang, J. K. Lee, Youn-Long Lin, C. L. Liu, Cheng-Wen Wu, Allen Wu • NCTU: Jing-Yang Jou • China • Tsinghua Univ.: Jinian Bian, Xianlong Hong, Zeyi Wang, Hongxi Xue • Peking Univ.: Xu Cheng • Zhejiang Univ.: Xiaolang Yan • Several new faculty members in the 7 institutions • Guest members from National University of Singapore, Purdue Univ., and UCLA (EE Dept)
Thrust 1 -- SOC Synthesis Environment/Methodology (Led by Jason Cong) [Figure: design flow diagram. Labels: Design Spec; VHDL/C Co-Simulation; VHDL/C Design Partitioning; ASIC Synthesis; Interconnect-Driven High-level Synthesis; Synthesis for IP Reuse; Physical Synthesis for Full-Chip Assembly; Code Generation for Retargetable Compiler and Assembler Generator; DSP Synthesis and Optimization; FPGA Synthesis and Technology Mapping; Embedded Processors; DSPs; Embedded FPGAs; Customized Logic]
Interconnect Bottleneck in Nanometer Designs • 2nd challenge: Single-cycle full-chip synchronization is no longer possible • Not supported by the current CAD toolset • About to happen soon • ITRS’01 0.07 μm technology • 5.63 GHz across-chip clock • 800 mm² die (28.3 mm x 28.3 mm) • IPEM BIWS estimations • Buffer size: 100x • Driver/receiver size: 100x • On semi-global layer (tier 3): • Can travel up to 11.4 mm in one cycle • Need 5 clock cycles from corner to corner [Figure: die map with axis ticks at 0, 11.4, 22.8, and 28.3 mm and regions labeled 1 clock to 5 clock]
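The 5-cycle figure on this slide can be reproduced with a short calculation. The sketch below assumes Manhattan (horizontal plus vertical) routing, so the corner-to-corner wire length is twice the die side; the function name is illustrative and not from the slides.

```python
import math


def cycles_corner_to_corner(die_side_mm: float, reach_mm_per_cycle: float) -> int:
    """Clock cycles needed for a signal to cross a square die corner to corner.

    Assumption: Manhattan routing, so the wire length is 2 * die_side_mm.
    """
    distance_mm = 2 * die_side_mm
    return math.ceil(distance_mm / reach_mm_per_cycle)


# Figures from the slide: 28.3 mm x 28.3 mm die, 11.4 mm reach per cycle.
cycles = cycles_corner_to_corner(28.3, 11.4)
print(cycles)  # 56.6 mm / 11.4 mm per cycle is just under 5, so 5 cycles
```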
Regular Distributed Register Architecture (2) • Use register banks: • Registers in each island are partitioned into k banks for 1-cycle, 2-cycle, …, k-cycle interconnect communication in each island • Highly regular [Figure: grid of islands of width Wi and height Hi connected by global interconnect. Each island holds a Local Computational Cluster (LCC) with an area constraint (functional units such as MUL, ADD, and MUX), a register file split into 1-cycle, 2-cycle, …, k-cycle banks, and an FSM]
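One way to picture the bank partitioning is as a lookup from a source and destination island to a cycle count. The sketch below is a hypothetical model, not the authors' method: it assumes wire length is the Manhattan distance between island centers, that a signal covers a fixed reach per cycle, and that traffic inside one island takes a single cycle. All names and parameters are illustrative.

```python
import math


def bank_for_transfer(src, dst, island_w_mm, island_h_mm, reach_mm_per_cycle, k):
    """Return the register bank (1..k) used for a transfer between two islands.

    src and dst are (column, row) island indices.
    Assumptions (not from the slides): Manhattan distance between island
    centers, fixed wire reach per clock cycle, 1 cycle inside an island.
    """
    dx_mm = abs(src[0] - dst[0]) * island_w_mm
    dy_mm = abs(src[1] - dst[1]) * island_h_mm
    cycles = max(1, math.ceil((dx_mm + dy_mm) / reach_mm_per_cycle))
    if cycles > k:
        raise ValueError("transfer needs more cycles than there are banks")
    return cycles


# Hypothetical 5 mm x 5 mm islands, 10 mm reach per cycle, k = 3 banks.
print(bank_for_transfer((0, 0), (0, 0), 5.0, 5.0, 10.0, 3))  # same island
print(bank_for_transfer((0, 0), (2, 1), 5.0, 5.0, 10.0, 3))  # 15 mm apart
```

Keeping the cycle count a pure function of island coordinates is what makes the architecture regular: a synthesis tool can decide multi-cycle communication from placement alone.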
MCAS: Placement-Driven Architectural Synthesis Using RDR Architecture [Figure: synthesis flow. Inputs: C / VHDL, resource constraints, RDR architecture specification, target clock period. Steps: CDFG generation • Resource allocation • Functional unit binding, producing the Interconnected Component Graph (ICG) • Scheduling-driven placement, producing location information • Placement-driven rebinding & scheduling • Register and port binding • Datapath & FSM generation. Outputs: RTL VHDL files, floorplan constraints, multi-cycle path constraints. Side panels show a CDFG with operations 1 to 12 scheduled over Cycle1 to Cycle7 and example bindings of those operations to Mul1, Mul2, Alu1, and Alu2 placed among register files]
Experimental Results (3) • MCAS basic flow vs. Synopsys’ Behavioral Compiler (on Virtex-II) • Synopsys Behavioral Compiler setting: default (optimizing latency) • Average latency ratio of MCAS vs. BC: 69% [Figures: latency comparison chart and resource comparison chart]
Optimality Study of Large-Scale Circuit Placement • Construction of Placement Example with Known Optimal (PEKO) [C. Chang et al, 2003] • Construct instances with known optimal solutions using the characteristics of the original problem • First quantitative evaluation of the optimality of solutions to the circuit placement problem • Existing placement algorithms can be 70% to 150% away from the optimal
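The "away from the optimal" figure can be read as a relative wirelength gap. The sketch below assumes half-perimeter wirelength (HPWL) as the cost metric, which is a common choice but is not stated on the slide; the tiny netlist and both placements are made up for illustration. The first placement is built so that every net reaches its smallest possible bounding box on a unit grid, which is the idea behind an instance with a known optimum.

```python
def hpwl(placement, nets):
    """Total half-perimeter wirelength of all nets under a placement."""
    total = 0
    for net in nets:
        xs = [placement[cell][0] for cell in net]
        ys = [placement[cell][1] for cell in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def gap_from_optimal(wirelength, optimal_wirelength):
    """Relative distance from the optimum, e.g. 0.7 means 70% away."""
    return (wirelength - optimal_wirelength) / optimal_wirelength


# Hypothetical netlist: one 2-pin net and one 3-pin net.
nets = [["a", "b"], ["b", "c", "d"]]

# Every net attains its minimum bounding box here, so this is optimal.
known_optimal = {"a": (0, 0), "b": (1, 0), "c": (1, 1), "d": (2, 0)}

# A placement as some tool might return it (made up).
tool_result = {"a": (0, 0), "b": (3, 0), "c": (1, 2), "d": (2, 0)}

optimal_wl = hpwl(known_optimal, nets)
gap = gap_from_optimal(hpwl(tool_result, nets), optimal_wl)
print(f"{gap:.0%} away from the optimal")
```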
High Interest in the Community • Coverage in two EE Times articles • Placement tools criticized for hampering IC designs [Feb’03] • IC placement benchmarks needed, researchers say [April’03] • More than 60 downloads from our website • Cadence, IBM, Intel, Magma, Mentor Graphics, Synopsys, etc. • CMU, SUNY, UCB, UCSB, UCSD, UIC, UMichigan, UWaterloo, etc. • Used in every placement since its publication http://ballade.cs.ucla.edu/~pubbench
1. Synthesis & Verification • Hardware/Software Partition: • Propose an SSS-based H/S partition algorithm (ASICON2003) • Better solutions than simulated annealing (SA) and less runtime than tabu search • High-level Synthesis: • Re-synthesis algorithm after floorplanning for timing optimization (ASICON2003) • Floorplan based on the initial schedule • After floorplanning, re-schedule and re-allocate using a force-balance method • Controller Synthesis: • A Heuristic State Minimization Algorithm for Incompletely Specified Finite State Machines (ASICON2003, JCST)
2. Floorplanning & Interconnect Planning • Building on the previously proposed Corner Block List (CBL) representation, propose several extended Corner Block List representations (ECBL, CCBL, and SUB-CBL) to speed up floorplanning and handle more complicated L/T-shaped and rectilinear blocks. • Propose floorplanning algorithms with geometric constraints such as boundary, abutment, and L/T-shaped blocks. • Propose integrated floorplanning and buffer planning algorithms that take congestion into account. • Using research results from UCLA on interconnect planning • About 30 papers published in DAC, ICCAD, ISPD, ASPDAC, ISCAS and Transactions.
3. P/G Network Analysis & Optimization • Propose an area minimization method for power distribution networks using efficient nonlinear programming techniques (ICCAD2001, accepted by IEEE Trans. on CAD) • Propose a decoupling capacitance optimization algorithm for robust on-chip power delivery (ASPDAC2004, ASICON2003)
4. Global Routing & Special Routing • Propose several global routing algorithms that optimize congestion, timing, or both • Papers were published in ASPDAC, ISCAS, and IEEE Transactions.
5. Parasitic R/L/C Extraction • 3-D R/C Extraction using Boundary Element Method (BEM) • Quasi-Multiple Medium (QMM) BEM algorithms • Hierarchical Block BEM (HBBEM) technique • Fast 3-D Inductance Extraction (FIE) • Papers were published in ASPDAC, ASICON and IEEE Transactions on MTT
Thrust 2 -- SOC Verification, Test, and Diagnosis (Led by Tim Cheng) Verification and Testing • Enabling techniques for semi-formal functional verification • Testing and diagnosis for heterogeneous SOC • Self-testing using on-chip programmable components • Self-testing for on-chip analog/mixed-signal components • Automatic/semi-automatic functional vector generation from HDL code • Scalable constraint-solving techniques • Integrated framework for simulation, vector generation and model checking • New test techniques for deep-submicron embedded memories
Key Results - Verification • Developed and released ATPG-based SAT solvers for circuits (Univ. of California, Santa Barbara) • Integrating structural ATPG and SAT techniques with new conflict learning • CSAT: Fast combinational solver (released in March 2003) • Demonstrated 10-100X speedup over state-of-the-art SAT solvers on industrial test cases (reported by Intel and Calypto) • Has been integrated into Intel’s FV verification system and a startup’s verification engine • Publications: DATE2003 and DAC2003 • Satori2: Fast sequential solver (released in Dec. 2003) • Demonstrated 10X-200X speedup over a commercial, sequential ATPG engine on public benchmark circuits • Publications: ICCAD2003, HLDVT2003 and ASPDAC2004
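For readers new to circuit satisfiability, the sketch below shows the problem these solvers address, not how CSAT or Satori2 work. It encodes a small gate-level circuit as CNF clauses (the standard Tseitin encoding) and checks satisfiability by brute force; the solvers on the slide instead use structural ATPG techniques and conflict learning. The circuit and all function names are made up for illustration.

```python
from itertools import product


def and_gate(out, a, b):
    """CNF clauses for out <-> (a AND b); variables are positive integers."""
    return [[-out, a], [-out, b], [out, -a, -b]]


def or_gate(out, a, b):
    """CNF clauses for out <-> (a OR b)."""
    return [[out, -a], [out, -b], [-out, a, b]]


def not_gate(out, a):
    """CNF clauses for out <-> (NOT a)."""
    return [[-out, -a], [out, a]]


def satisfy(clauses, num_vars):
    """Brute-force search; returns a satisfying assignment or None."""
    for bits in product([False, True], repeat=num_vars):
        def holds(literal):
            value = bits[abs(literal) - 1]
            return value if literal > 0 else not value

        if all(any(holds(lit) for lit in clause) for clause in clauses):
            return bits
    return None


# Hypothetical circuit: x3 = x1 AND x2, x4 = NOT x1, x5 = x3 OR x4.
circuit = and_gate(3, 1, 2) + not_gate(4, 1) + or_gate(5, 3, 4)

# Query 1: can the output x5 be 0? Add the unit clause (NOT x5).
witness = satisfy(circuit + [[-5]], 5)

# Query 2: can x5 be 0 while x2 is 1? No assignment exists.
no_witness = satisfy(circuit + [[-5], [2]], 5)
print(witness, no_witness)
```

Brute force is exponential in the number of signals, which is why practical circuit solvers rely on learning and structural pruning.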
Key Results - Testing A new Statistical Delay Testing and Diagnosis framework consisting of five major components (UCSB): • Statistical timing analysis • Statistical critical path selection [DAC’02, ICCAD’02] • Selecting statistical long & true paths whose tests maximize detection of parametric failures • Path coverage metric [ASPDAC’03] • Estimating the quality of a path set • Selection/Generation of high quality tests for target paths [ITC’01][DATE 2004] • Identifying tests that activate longer delay along the target path • Delay fault diagnosis based on statistical timing model [DATE’03, VTS’03, DAC’03] • Ref: Krstic, Wang, Cheng, & Abadir, DATE’03 (Best Paper Award in Test) [Figure: framework block diagram. Labels: Statistical Timing Analysis Framework (cell-based characterization); Static Timing Analysis; Dynamic Timing Simulator; Critical Path Selection; Path Filtering; ATPG/Pattern Selection; Defect Injection & Simulation; Diagnosis]
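A small calculation illustrates why path selection changes once delays are treated statistically. The sketch below assumes each gate delay is an independent normal random variable, so a path delay is normal with summed means and variances; that independence is a simplifying assumption and not a claim about the UCSB framework. Gate values and names are made up.

```python
import math


def path_delay_stats(gates):
    """Mean and standard deviation of a path delay.

    gates is a list of (mean, sigma) pairs, assumed independent and normal.
    """
    mean = sum(m for m, _ in gates)
    sigma = math.sqrt(sum(s * s for _, s in gates))
    return mean, sigma


def prob_exceeds(gates, clock_period):
    """Probability that the path delay is longer than the clock period."""
    mean, sigma = path_delay_stats(gates)
    z = (clock_period - mean) / (sigma * math.sqrt(2))
    return 0.5 * math.erfc(z)


# Hypothetical paths, delays in ns.
path_a = [(0.50, 0.01), (0.50, 0.01)]  # longer nominal delay, low spread
path_b = [(0.45, 0.08), (0.50, 0.06)]  # shorter nominal delay, high spread

clock_period = 1.1
risk_a = prob_exceeds(path_a, clock_period)
risk_b = prob_exceeds(path_b, clock_period)
print(risk_a < risk_b)  # the nominally shorter path is the riskier one
```

In this made-up case a static analysis would rank path A as more critical, while the statistical view flags path B, which is the motivation for selecting statistically long paths.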
Key Results - Testing • On-Chip Jitter Extraction for Bit-Error-Rate (BER) Testing of Multi-GHz Signal (UCSB) • Using on-chip, single-shot measurement unit to sample signal periods for spectral analysis • Demonstrated, through simulation, accurate extraction of multiple sinusoids and random jitter components for a 3GHz signal • Publications: ASPDAC2004 and DATE2004
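The idea of extracting a sinusoidal jitter component from sampled signal periods can be sketched with a plain discrete Fourier transform. This is only an illustration under simplifying assumptions: the period samples are synthetic, random jitter is left out so the result is deterministic, and the function names are hypothetical. It does not model the on-chip measurement unit.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Single-sided amplitude spectrum of a real sequence (bins 0..n/2)."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        magnitudes.append(abs(acc) * 2 / n)
    return magnitudes


# Synthetic measurement: 3 GHz signal with one sinusoidal jitter tone.
nominal_period_ps = 1000.0 / 3.0
num_samples = 64
jitter_bin = 5             # tone completes 5 cycles over the record
jitter_amplitude_ps = 2.0  # made-up amplitude

periods = [nominal_period_ps
           + jitter_amplitude_ps * math.sin(2 * math.pi * jitter_bin * i / num_samples)
           for i in range(num_samples)]

# Remove the average period, then look for the dominant spectral line.
average = sum(periods) / num_samples
deviations = [p - average for p in periods]
spectrum = dft_magnitudes(deviations)
peak_bin = max(range(1, len(spectrum)), key=spectrum.__getitem__)
print(peak_bin, round(spectrum[peak_bin], 3))
```

With real measurements, random jitter would add a noise floor across all bins, and sinusoidal components would show up as lines above it.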
Thrust 3 – Design Driver: Network Security Processor (Led by Prof. C. W. Wu) • Applications: IPSec, SSL, VPN, etc. • Functionalities: • Public key: RSA, ECC • Secret key: AES • Hashing (Message authentication): HMAC (SHA-1/MD5) • True random number generator (FIPS 140-1, 140-2 compliant) • Target technology: 0.18 μm or below • Clock rate: 200 MHz or higher (internal) • 32-bit data and instruction word • Data rate: 10 Gbps (OC-192) • Power: 1 to 10 mW/MHz at 3 V (LP to HP) • Die size: 50 mm² • On-chip bus: AMBA (Advanced Microcontroller Bus Architecture)
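As a software reference point for the message authentication function listed above, the sketch below computes HMAC tags with SHA-1 and MD5 using Python's standard `hmac` module. It shows what the hardware block computes, not how the processor implements it; the key, message, and helper names are made up.

```python
import hmac


def message_tag(key: bytes, message: bytes, algorithm: str) -> bytes:
    """Return the HMAC tag of message under key; algorithm is 'sha1' or 'md5'."""
    return hmac.new(key, message, digestmod=algorithm).digest()


def verify(key: bytes, message: bytes, tag: bytes, algorithm: str) -> bool:
    """Check a tag in constant time."""
    return hmac.compare_digest(message_tag(key, message, algorithm), tag)


key = b"hypothetical shared key"
packet = b"payload protected by an IPSec-style integrity check"

sha1_tag = message_tag(key, packet, "sha1")  # 160-bit tag
md5_tag = message_tag(key, packet, "md5")    # 128-bit tag
print(len(sha1_tag), len(md5_tag))
```

A receiver holding the same key recomputes the tag and rejects the packet if `verify` returns `False`.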
Encryption Modules (PKEM) • Public key encryption module • Operations: • 32-bit word-based modular multiplication • Multiplication over GF(p) and GF(2^m) • An RSA cryptography engine with small area overhead and high speed • Scalable word-width • TSMC 0.35 μm • 34K gates (1.7 × 1.8 mm²) • 100 MHz clock • Scalable key length • Throughput • 512-bit key: 1.79 Kbps/MHz • 1024-bit key: 470 bps/MHz
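The slide does not say which algorithm the module uses for word-based modular multiplication. As one common approach, the sketch below shows Montgomery multiplication over GF(p) that consumes one 32-bit word of the first operand per iteration; treat the choice of Montgomery's method as an assumption. It requires an odd modulus and Python 3.8 or later (for `pow` with a negative exponent). GF(2^m) arithmetic is not covered.

```python
def montgomery_multiply(a, b, n, word_bits=32):
    """Return a * b * R^-1 mod n, with R = 2**(word_bits * num_words).

    Requires odd n and 0 <= a, b < n. One word of a is consumed per step.
    """
    if n % 2 == 0:
        raise ValueError("Montgomery multiplication needs an odd modulus")
    mask = (1 << word_bits) - 1
    num_words = -(-n.bit_length() // word_bits)       # ceiling division
    n_prime = (-pow(n, -1, 1 << word_bits)) & mask    # -n^-1 mod 2^w
    t = 0
    for i in range(num_words):
        a_word = (a >> (i * word_bits)) & mask
        t += a_word * b
        m = ((t & mask) * n_prime) & mask  # makes the low word of t + m*n zero
        t = (t + m * n) >> word_bits       # exact division by 2^w
    return t - n if t >= n else t


def modular_multiply(a, b, n, word_bits=32):
    """Plain a * b mod n built from two Montgomery multiplications."""
    num_words = -(-n.bit_length() // word_bits)
    r_squared = pow(1 << (word_bits * num_words), 2, n)
    return montgomery_multiply(montgomery_multiply(a, b, n, word_bits),
                               r_squared, n, word_bits)


# Hypothetical operands with a 127-bit odd modulus.
n = (1 << 127) - 1
a = 0x1234567890ABCDEF1234567890ABCDEF % n
b = 0x0FEDCBA9876543210FEDCBA987654321 % n
print(modular_multiply(a, b, n) == (a * b) % n)
```

Processing one word per iteration is what lets the word width and key length scale independently: a longer key only adds loop iterations.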
Encryption Modules (SKEM) • Secret key encryption module • Operations: • Matrix operations, manipulation • AES cryptography • 32-bit external interface • 58K gates • Over 200 MHz clock • Throughput: 2 Gbps • Supports key lengths of 128/192/256 bits
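The throughput and clock figures imply a per-block cycle budget, which a short calculation makes explicit. The sketch below assumes the 2 Gbps figure is quoted at a 200 MHz clock; the slide only says "over 200 MHz", so the result is an estimate. The 128-bit block size is fixed by the AES standard. Function names are illustrative.

```python
AES_BLOCK_BITS = 128  # fixed by the AES standard for all key lengths


def bits_per_cycle(throughput_bps: float, clock_hz: float) -> float:
    """Average number of data bits processed per clock cycle."""
    return throughput_bps / clock_hz


def cycles_per_block(throughput_bps: float, clock_hz: float) -> float:
    """Average clock cycles available to process one AES block."""
    return AES_BLOCK_BITS / bits_per_cycle(throughput_bps, clock_hz)


# Slide figures, assuming the throughput is measured at 200 MHz.
rate = bits_per_cycle(2e9, 200e6)
budget = cycles_per_block(2e9, 200e6)
print(rate, budget)
```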
Journal Publications • C.-T. Huang and C.-W. Wu, ``High-speed easily testable Galois-field inverter'', IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing, vol. 47, no. 9, pp. 909-918, Sept. 2000. • S.-A. Hwang and C.-W. Wu, ``Unified VLSI systolic array design for LZ data compression'', IEEE Trans. VLSI Systems, vol. 9, no. 4, pp. 489-499, Aug. 2001. • C.-H. Wu, J.-H. Hong, and C.-W. Wu, ``VLSI design of RSA cryptosystem based on the Chinese Remainder Theorem'', J. Inform. Science and Engineering, vol. 17, no. 6, pp. 967-979, Nov. 2001. • J.-H. Hong and C.-W. Wu, ``Cellular array modular multiplier for the RSA public-key cryptosystem based on modified Booth's algorithm'', IEEE Trans. VLSI Systems, vol. 11, no. 3, pp. 474-484, June 2003. • C.-P. Su, T.-F. Lin, C.-T. Huang, and C.-W. Wu, ``A high-throughput low-cost AES processor'', IEEE Communications Magazine, vol. 41, no. 12, pp. 86-91, Dec. 2003.
Conference Publications • J.-H. Hong and C.-W. Wu, ``Radix-4 modular multiplication and exponentiation algorithms for the RSA public-key cryptosystem'', in Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC), Yokohama, Jan. 2000, pp. 565-570. • J.-H. Hong, P.-Y. Tsai, and C.-W. Wu, ``Interleaving schemes for a systolic RSA public-key cryptosystem based on an improved Montgomery's algorithm'', in Proc. 11th VLSI Design/CAD Symp., Pingtung, Aug. 2000, pp. 163-166. • C.-H. Wu, J.-H. Hong, and C.-W. Wu, ``An RSA cryptosystem based on the Chinese Remainder Theorem'', in Proc. 11th VLSI Design/CAD Symp., Pingtung, Aug. 2000, pp. 167-170. • C.-H. Wu, J.-H. Hong, and C.-W. Wu, ``RSA cryptosystem design based on the Chinese Remainder Theorem'', in Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC), Yokohama, Jan. 2001, pp. 391-395. • Y.-C. Lin, C.-P. Su, C.-W. Wang, and C.-W. Wu, ``A word-based RSA public-key crypto-processor core'', in Proc. 12th VLSI Design/CAD Symp., Hsinchu, Aug. 2001. • T.-F. Lin, C.-P. Su, C.-T. Huang, and C.-W. Wu, ``A high-throughput low-cost AES cipher chip'', in Proc. 3rd IEEE Asia-Pacific Conf. ASIC, Taipei, Aug. 2002, pp. 85-88. • Y.-T. Lin, C.-P. Su, C.-T. Huang, C.-W. Wu, S.-Y. Huang, and T.-Y. Chang, ``Low-power embedded memory architecture design for SOC'', in Proc. 13th VLSI Design/CAD Symp., Taitung, Aug. 2002, pp. 306-309. • M.-C. Sun, C.-P. Su, C.-T. Huang, and C.-W. Wu, ``Design of a scalable RSA and ECC crypto-processor'', in Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC), Kitakyushu, Jan. 2003, pp. 495-498, (Best Paper Award). • C.-P. Su, T.-F. Lin, C.-T. Huang, and C.-W. Wu, ``A highly efficient AES cipher chip'', in Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC), Kitakyushu, Jan. 2003, pp. 561-562, (Design Contest Special Feature Award). • J.-H. Hong, C.-L. Liu, B.-Y. Tsai, and C.-W. Wu, ``A radix-4 modular multiplier for fast RSA public-key cryptosystem'', in Proc. 14th VLSI Design/CAD Symp., Hualien, Aug. 2003, pp. 553-556. • M.-Y. Wang, C.-P. Su, C.-T. Huang, and C.-W. Wu, ``An HMAC processor with integrated SHA-1 and MD5 algorithms'', in Proc. Asia and South Pacific Design Automation Conf. (ASP-DAC), Yokohama, Jan. 2004 (to appear).