Efficient Real-Time Multicore Image Processing on TI C66x midterm presentation. Yaron Doweck Yael Einziger Supervisor: Mike Sumszyk Spring 2011 Semester Project. Project Goals Development Tools Learning Steps What’s next. Contents.
Efficient Real-Time Multicore Image Processing on TI C66x: midterm presentation. Yaron Doweck, Yael Einziger. Supervisor: Mike Sumszyk. Spring 2011 Semester Project
Project Goals • Development Tools • Learning Steps • What’s next Contents
Learn to use the new TI C66x platform and exploit its capabilities and advantages. • Implement a real-time computer vision algorithm using multi-core programming. Project Goal
Hardware: TMS320C6678 Multicore Fixed and Floating-Point Digital Signal Processor • Software: Code Composer Studio v5 with BIOS MCSDK 2.0 Development tools
8 C66x CorePac DSPs • Based on TI’s KeyStone Multicore Architecture • 320 GMAC / 160 GFLOP @ 1.25 GHz • 32 KB L1P, 32 KB L1D, 512 KB L2 per core • 4 MB shared L2 • 64-bit DDR3 interface (DDR3-1600) TMS320C6678
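The headline throughput figures follow from cores × operations per cycle × clock rate. The sketch below reproduces that arithmetic; the per-core rates (32 fixed-point MACs and 16 single-precision FLOPs per cycle) are back-derived from the slide's totals and should be read as assumptions, and the function names are illustrative only.

```c
#include <stdio.h>

/* Peak rate in giga-operations per second:
 * cores x operations per cycle per core x clock in GHz. */
double peak_giga_ops(int cores, int ops_per_cycle, double clock_ghz)
{
    return cores * ops_per_cycle * clock_ghz;
}

/* Print the two headline figures for an 8-core device at 1.25 GHz.
 * Assumed per-core rates: 32 MACs and 16 FLOPs per cycle. */
void print_peak_rates(void)
{
    printf("%.0f GMAC, %.0f GFLOP\n",
           peak_giga_ops(8, 32, 1.25),
           peak_giga_ops(8, 16, 1.25));
}
```

These are theoretical peaks; sustained rates depend on how well the code keeps all functional units and the memory system busy.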
CCS Simulator and Profiler • Cache configuration • DMA data transfer • Interrupts • Fixed and Floating point libraries (DSPlib, IMGlib, Vlib,…) • SYS/BIOS • Multi-core programming Learning steps
CCS v5 can simulate the C6678 processor and some of its peripherals. • The profiler reports execution time and statistics per function and per line of code. Step 1: CCS Simulator and Profiler
Graph viewer – displays data from memory in the time or frequency domain. • Image Analyzer – displays an image stored in memory or in a file. Supports grayscale, RGB and YUV color formats. Step 1: CCS Simulator and Profiler
32 KB L1P cache. L1P is read-allocate and direct-mapped. • 32 KB L1D cache. L1D is read-allocate, write-back and 2-way set-associative. • Each L1 cache can be configured as 0, 4, 8, 16 or 32 KB of cache. • 512 KB L2 cache. L2 is read- and write-allocate and 4-way set-associative. • L2 can be configured as 0, 32, 64, 128, 256 or 512 KB of cache. • All configurations can be changed at run time. Step 2: Cache
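To make the associativity figures concrete, here is a minimal sketch of how cache size, number of ways and line size determine the number of sets and which set an address falls into. The line sizes used in the note below (64 bytes for L1D, 32 bytes for L1P, 128 bytes for L2) are assumptions not stated on the slide, and the function names are illustrative.

```c
#include <stddef.h>

/* Number of sets in a cache: capacity / (ways * line size). */
size_t cache_num_sets(size_t size_bytes, size_t ways, size_t line_bytes)
{
    return size_bytes / (ways * line_bytes);
}

/* Set selected by an address: drop the line-offset bits, then wrap
 * around the number of sets. */
size_t cache_set_index(size_t addr, size_t size_bytes, size_t ways,
                       size_t line_bytes)
{
    return (addr / line_bytes) % cache_num_sets(size_bytes, ways, line_bytes);
}
```

Under these assumptions a 32 KB 2-way L1D has 256 sets, so two buffers whose addresses differ by 16 KB land in the same set. A 2-way cache can hold both, while a third buffer at the same offset would force an eviction. This is one reason to control where variables are placed.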
Achievements: • Configuring different L1 and L2 cache sizes before or during run time. • Using L1 and L2 as SRAM (fully SRAM, or part SRAM and part cache). • Controlling variable placement (L1, L2 or DDR3 memory). Step 2: Cache
C66x processors have 3 EDMA3 controllers, each with 64 DMA channels and 8 QDMA channels. • EDMA3 supports data transfers to/from cache, shared memory or external memory. • EDMA3 supports hardware interrupts. • In addition, each core has a faster IDMA controller for internal transfers. Step 3: DMA
Achievements: • Using IDMA to transfer data inside a core (L2↔L1). • Using EDMA3 to transfer data to/from L1, L2 and DDR3. Step 3: DMA
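For image processing, a typical DMA job is moving a rectangular tile out of a larger image in external memory into a small on-chip buffer. The sketch below is a host-side stand-in for such a 2D transfer, written in portable C with `memcpy`. The `Transfer2D` descriptor and its field names are hypothetical and only loosely modeled on count/stride style DMA parameters; this is not the EDMA3 driver API.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for a 2D transfer:
 * bcnt rows of acnt contiguous bytes, with separate row strides. */
typedef struct {
    const unsigned char *src;
    unsigned char *dst;
    size_t acnt;      /* bytes per row */
    size_t bcnt;      /* number of rows */
    size_t src_bidx;  /* bytes between row starts in the source */
    size_t dst_bidx;  /* bytes between row starts in the destination */
} Transfer2D;

/* Host-side stand-in for the DMA engine: copies the block row by row. */
void transfer2d_run(const Transfer2D *t)
{
    size_t row;
    for (row = 0; row < t->bcnt; ++row) {
        memcpy(t->dst + row * t->dst_bidx,
               t->src + row * t->src_bidx,
               t->acnt);
    }
}
```

For a tile taken from an image of width W, `src_bidx` would be W and `dst_bidx` the tile width, so the tile ends up packed contiguously in the destination buffer.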
The interrupt controller supports up to 128 system events. They consist of both internally-generated events (within the C66x CorePac) and chip-level events. Step 4: Interrupts
The interrupt controller outputs 15 signals to the core from the event inputs: • One maskable hardware exception • 12 maskable hardware interrupts • One non-maskable signal • One reset signal Step 4: Interrupts
Achievements: • Configuring manually triggered events. • Configuring EDMA transfer completion routine using EDMA system event. Step 4: Interrupts
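The routing of many system events onto a few maskable interrupts can be pictured as a small lookup table with one handler per interrupt. The following host-side sketch models that idea only; the function names are hypothetical, the numbering of the maskable interrupts as 4 to 15 is an assumption not stated on the slides, and none of this is the real interrupt controller or SYS/BIOS API.

```c
#include <stddef.h>

#define NUM_EVENTS 128  /* system events, per the slide */
#define FIRST_INT  4    /* assumption: maskable interrupts are numbered 4..15 */
#define NUM_INTS   12   /* maskable interrupts, per the slide */

typedef void (*Isr)(int event);

static int event_for_int[NUM_INTS];  /* event routed to each interrupt, -1 if none */
static Isr isr_for_int[NUM_INTS];    /* routine attached to each interrupt */

/* Clear all routings. */
void intc_init(void)
{
    int i;
    for (i = 0; i < NUM_INTS; ++i) {
        event_for_int[i] = -1;
        isr_for_int[i] = NULL;
    }
}

/* Route one event to one maskable interrupt; returns 0 on success, -1 on bad input. */
int intc_map(int event, int int_num, Isr isr)
{
    if (event < 0 || event >= NUM_EVENTS) return -1;
    if (int_num < FIRST_INT || int_num >= FIRST_INT + NUM_INTS) return -1;
    event_for_int[int_num - FIRST_INT] = event;
    isr_for_int[int_num - FIRST_INT] = isr;
    return 0;
}

/* Simulate an event firing; returns how many routines were called. */
int intc_raise(int event)
{
    int i, called = 0;
    for (i = 0; i < NUM_INTS; ++i) {
        if (event_for_int[i] == event && isr_for_int[i] != NULL) {
            isr_for_int[i](event);
            ++called;
        }
    }
    return called;
}

/* Example completion routine: counts finished transfers. */
int transfer_done_count = 0;
void on_transfer_done(int event)
{
    (void)event;
    ++transfer_done_count;
}
```

A transfer-completion routine would be attached with something like `intc_map(30, 5, on_transfer_done)`, where the event number 30 is arbitrary. A manually triggered event then corresponds to calling `intc_raise` directly.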
DSPLib – an optimized DSP function library that includes general-purpose signal-processing routines for real-time applications. [Figure: low-pass filter (LPF) example] Step 5: Libraries
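As a reference point for what a library low-pass filter computes, here is a plain C FIR filter on floating-point arrays. It is a straightforward reference sketch, not the DSPLib implementation, and the function name and argument order are illustrative only.

```c
/* Reference FIR filter:
 *   y[i] = sum over k of h[k] * x[i + nh - 1 - k],  for i in [0, ny).
 * The input x must hold ny + nh - 1 samples. */
void fir_reference(const float *x, const float *h, float *y, int nh, int ny)
{
    int i, k;
    for (i = 0; i < ny; ++i) {
        float acc = 0.0f;
        for (k = 0; k < nh; ++k) {
            acc += h[k] * x[i + nh - 1 - k];
        }
        y[i] = acc;
    }
}
```

With four taps of 0.25 this is a moving average, the simplest low-pass filter. A plain version like this is also a useful baseline when comparing measured running time and output against an optimized library routine.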
IMGLib – an optimized image/video processing function library that includes general-purpose image/video processing routines for real-time applications. [Figures: histogram, derivative and edge detection examples] Step 5: Libraries
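Two of the operations pictured, the histogram and the derivative, are simple enough to sketch in plain C. The versions below are unoptimized references with illustrative names, not the IMGLib routines; the derivative is a horizontal central difference, chosen here as one common building block for edge detection.

```c
#include <stddef.h>

/* 256-bin histogram of an 8-bit grayscale image with n pixels. */
void histogram_8bit(const unsigned char *image, size_t n, unsigned int hist[256])
{
    size_t i;
    for (i = 0; i < 256; ++i) hist[i] = 0;
    for (i = 0; i < n; ++i) hist[image[i]]++;
}

/* Absolute horizontal derivative of one image row:
 *   out[i] = |in[i + 1] - in[i - 1]|, with the two border pixels set to 0. */
void derivative_row(const unsigned char *in, unsigned char *out, size_t n)
{
    size_t i;
    if (n == 0) return;
    out[0] = 0;
    out[n - 1] = 0;
    for (i = 1; i + 1 < n; ++i) {
        int d = (int)in[i + 1] - (int)in[i - 1];
        out[i] = (unsigned char)(d < 0 ? -d : d);
    }
}
```

Large values in the derivative output mark intensity jumps, which is the basic signal an edge detector thresholds.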
Some more libraries • VLib – a collection of computer vision algorithms optimized for TI DSPs. • IQMath – a collection of highly optimized fixed-point arithmetic, trigonometric and mathematical functions, typically used in real-time applications. • fastMath – optimized arithmetic and trigonometric functions for floating-point devices. Step 5: Libraries
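Fixed-point libraries store fractional values as scaled integers. The sketch below shows the idea with a 16-bit Q15 format (value = raw / 32768) and a rounded, saturating multiply. The `q15_t` type and function names are hypothetical and are not the IQMath API. The code also assumes that right-shifting a negative integer is an arithmetic shift, which is the usual compiler behavior but is implementation-defined in C.

```c
#include <stdint.h>

typedef int16_t q15_t;  /* hypothetical Q15 type: value = raw / 32768 */

/* Multiply two Q15 numbers with rounding and saturation. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;  /* Q30 product */
    p = (p + (1 << 14)) >> 15;            /* round and rescale to Q15 */
    if (p > INT16_MAX) p = INT16_MAX;     /* only -1 * -1 can overflow */
    return (q15_t)p;
}

/* Convert a Q15 number back to floating point for inspection. */
double q15_to_double(q15_t v)
{
    return (double)v / 32768.0;
}
```

The saturation step is where fixed and floating point visibly differ: in Q15, -1 × -1 yields the largest representable value (just under 1.0) rather than exactly 1.0. This kind of effect is what an accuracy comparison between the two has to account for.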
Achievements: • Using DSPLib for a simple signal-processing application with floating-point arrays. • Using IMGLib for a simple image-processing application. Still left: • Studying the VLib, IQMath and fastMath libraries. • Comparing actual running times with those specified in the User Guide. Step 5: Libraries
SYS/BIOS is a real-time operating system for applications that require real-time scheduling and synchronization. • SYS/BIOS provides preemptive multi-threading, hardware abstraction, real-time analysis, and configuration tools. • SYS/BIOS is designed to minimize memory and CPU requirements on the target. Step 6: SYS/BIOS
Achievements: • Using SYS/BIOS modules to configure the DSP’s memory (cache sizes, memory sections, heap and stack sizes). • Running a multi-threaded program with protected shared variables. Still left: • Using SYS/BIOS modules to configure DSP peripherals (LAN, SRIO, PCIe). Step 6: SYS/BIOS
CCS Simulator and Profiler - done • Cache configuration - done • DMA data transfer - done • Interrupts - done • Fixed and Floating point libraries (DSPlib, IMGlib, Vlib,…) - in progress • SYS/BIOS - in progress • Multi-core programming Learning steps
Implementing a bidirectional data flow between DDR3 and L1, possibly through L2. (3 weeks) • Performance analysis (throughput, latency and accuracy) of floating-point versus fixed-point libraries. (2 weeks) • Using hardware semaphores for parallel data access and Multicore Navigator for message passing between cores. (4 weeks) What’s next