This article discusses the communication interface of the Tile processor, including its on-chip interconnection architecture, hardware and software interfaces, and a performance comparison of its different communication mechanisms. The commonalities and differences between DaCS and MCAPI are also discussed.
2008-12-23 progress Wen-Long Yang
Outline • An example of many-core communication – the Tile processor • The interface of the micro-kernel • Commonalities & differences between DaCS & MCAPI
Overview • On-chip Interconnection Architecture of the Tile Processor • IEEE Micro, 2007 • Tile processor overview (TILE64 launched in 2007) • Developed by Tilera and inspired by MIT's Raw processor • Consists of a 2D grid of homogeneous, general-purpose compute elements, called tiles • The tiles are connected by 5 mesh networks that provide massive on-chip communication bandwidth • Supports DMA between the cores and between main memory & the cores • Each link consists of two 32-bit-wide unidirectional links • Each tile combines a processor, which implements a 3-way VLIW architecture, and its associated cache hierarchy with a switch • Each tile operates at 1 GHz • 4.8 MB of on-chip cache distributed among the tiles
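A quick back-of-the-envelope check of the figures above: with two 32-bit unidirectional links per connection clocked at 1 GHz, each link carries 4 bytes/cycle per direction. This is a sketch derived from the slide's numbers, not a measured figure:

```c
/* Per-direction link bandwidth implied by the slide:
 * 32 bits (4 bytes) per cycle per unidirectional link. */
unsigned long link_bw_bytes_per_sec(unsigned long clock_hz) {
    return 4UL * clock_hz;  /* 4 bytes/cycle * cycles/sec */
}
/* At 1 GHz: 4 * 1e9 = 4 GB/s per direction, per network. */
```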
Interconnect hardware • 5 networks: • Static network (STN) • Doesn't have a packetized format, but rather allows static configuration of the routing decisions at each switch • Lets applications send streams of data to another tile • User dynamic network (UDN) • I/O dynamic network (IDN) • Memory dynamic network (MDN) • Tile dynamic network (TDN) • Responsible only for tile-to-tile cache requests
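To make the static-routing idea concrete, here is a minimal sketch of how a per-switch routing decision could be derived for a 2D mesh. The function name and the dimension-ordered (X-then-Y) policy are illustrative assumptions, not the TILE64's actual configuration interface:

```c
/* Illustrative only: compute the next-hop port a switch at (x, y)
 * would be statically configured with, to forward toward (dx, dy),
 * using simple dimension-ordered (X-then-Y) routing. */
enum port { LOCAL, EAST, WEST, NORTH, SOUTH };

enum port next_hop(int x, int y, int dx, int dy) {
    if (dx > x) return EAST;    /* first correct the X coordinate */
    if (dx < x) return WEST;
    if (dy > y) return SOUTH;   /* then the Y coordinate (y grows down) */
    if (dy < y) return NORTH;
    return LOCAL;               /* arrived: deliver to this tile */
}
```

With the static network, such decisions are written into each switch ahead of time, so data streams flow without per-packet headers.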
Receive-side hardware demultiplexing • The UDN & IDN implement this functionality • Implemented by having several independent hardware queues with settable tags, which are used to identify packets
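The tag-matching mechanism can be sketched as follows. All names and sizes here are illustrative assumptions; the point is only that software sets a tag on each queue and arriving packets are steered to the queue whose tag matches:

```c
/* Sketch of receive-side demultiplexing: a few independent queues,
 * each with a software-settable tag. Sizes/names are hypothetical. */
#define NUM_QUEUES 4
#define QUEUE_DEPTH 8

struct demux_queue {
    unsigned tag;                 /* settable tag */
    int      count;
    unsigned words[QUEUE_DEPTH];
};

static struct demux_queue queues[NUM_QUEUES];

void set_tag(int q, unsigned tag) { queues[q].tag = tag; }

/* Steer an arriving word to the queue whose tag matches; returns the
 * queue index, or -1 if nothing matched (real hardware would instead
 * fall back to an interrupt or a catch-all queue). */
int demux_deliver(unsigned tag, unsigned word) {
    for (int q = 0; q < NUM_QUEUES; q++) {
        if (queues[q].tag == tag && queues[q].count < QUEUE_DEPTH) {
            queues[q].words[queues[q].count++] = word;
            return q;
        }
    }
    return -1;
}
```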
Software interfaces • They provide a C-based iLib library to support communication via the UDN • Raw channels: low overhead, but only a limited buffer size • Buffered channels: slightly higher overhead, but an effectively unlimited buffer • Socket-like channels • FIFO & point-to-point connections • Message passing • MPI-like • Besides unlimited buffering, it also provides a message key to identify each message • Receivers can process messages out of order • Messages are saved until the receivers are ready • Synchronization of the communication is managed by a messaging engine • Aims to allow more flexible communication mechanisms
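The key-based, out-of-order matching can be modeled in a few lines. This is a hedged sketch of the concept, not the real iLib API (function names and the fixed-size mailbox are assumptions):

```c
/* Sketch of key-based message matching: arriving messages are saved
 * with a key, and the receiver can fetch by key in any order. */
#define MAX_MSGS 16

struct msg { int key; int data; int valid; };
static struct msg mailbox[MAX_MSGS];

/* Save an arriving message until the receiver is ready. */
int msg_post(int key, int data) {
    for (int i = 0; i < MAX_MSGS; i++)
        if (!mailbox[i].valid) {
            mailbox[i].key = key;
            mailbox[i].data = data;
            mailbox[i].valid = 1;
            return 0;
        }
    return -1;  /* mailbox full */
}

/* Fetch the first buffered message with this key; -1 if none. */
int msg_receive(int key, int *data) {
    for (int i = 0; i < MAX_MSGS; i++)
        if (mailbox[i].valid && mailbox[i].key == key) {
            *data = mailbox[i].data;
            mailbox[i].valid = 0;
            return 0;
        }
    return -1;
}
```

Note how the receiver can fetch the second message before the first, which a plain FIFO channel cannot do.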
Implementation • Raw channel • Reserves a demux queue for it directly • Buffered channel • When the demux buffer is full, the demux triggers an interrupt handler that spills the data into main memory • So the receiver's read operations also need to check the buffer in main memory • Message passing • Also depends on interrupts to inform the sender/receiver/messaging engine to do their work
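The buffered-channel spill path can be sketched as follows (sizes and names are illustrative). The essential behavior is that an overflow drains the hardware queue into main memory, and a receive must check the memory buffer first to preserve FIFO order:

```c
/* Sketch of a buffered channel: a small hardware demux queue backed
 * by a spill buffer in main memory. All sizes are hypothetical. */
#define HW_DEPTH  4
#define MEM_DEPTH 64

static int hw[HW_DEPTH];             /* hardware demux queue */
static int hw_n;
static int mem[MEM_DEPTH];           /* spill buffer in main memory */
static int mem_head, mem_tail;

/* Models the interrupt handler: drain the hw queue into memory. */
static void spill_interrupt(void) {
    for (int i = 0; i < hw_n; i++) mem[mem_tail++] = hw[i];
    hw_n = 0;
}

void chan_send(int word) {
    if (hw_n == HW_DEPTH) spill_interrupt();  /* full -> interrupt */
    hw[hw_n++] = word;
}

/* FIFO order: spilled data in memory is older, so drain it first. */
int chan_recv(void) {
    if (mem_head < mem_tail) return mem[mem_head++];
    if (hw_n > 0) {
        int w = hw[0];
        for (int i = 1; i < hw_n; i++) hw[i - 1] = hw[i];
        hw_n--;
        return w;
    }
    return -1;  /* channel empty */
}
```

This also shows why buffered channels cost more than raw channels: every overflow adds an interrupt plus a copy through memory.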
Performance comparison • The UDN hardware provides at most 4 bytes/cycle • For raw channels, the maximum bandwidth is 3.93 bytes/cycle • For buffered channels, the maximum bandwidth is 1.4 bytes/cycle • For messaging, the maximum bandwidth is 0.97 bytes/cycle • Overheads • Buffered channels: interrupts & copies between the on-chip cache and main memory • Messaging: more frequent interrupts, identification of message keys, and copies between the on-chip cache and main memory
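Put differently, the numbers above correspond to roughly 98%, 35%, and 24% of the UDN's peak. A trivial helper makes the comparison explicit (the 4 bytes/cycle peak is taken from the slide):

```c
/* Fraction of the UDN peak (4 bytes/cycle) that a mechanism reaches. */
double efficiency(double bytes_per_cycle) {
    return bytes_per_cycle / 4.0;
}
/* raw:      3.93 / 4 ~ 0.98
 * buffered: 1.40 / 4 = 0.35
 * messaging:0.97 / 4 ~ 0.24 */
```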
Characteristics • Supports MPI-like communication • Sends a block of data to receivers • Supports multiple senders and receivers in a single function call • Allows batch transfers
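The "multiple receivers in one call" characteristic can be sketched like this. The function name, inbox layout, and sizes are illustrative assumptions, not the real API:

```c
/* Sketch: one call delivers a block of data to several receivers. */
#define MAX_RX 8
#define INBOX  32

static int inbox[MAX_RX][INBOX];
static int inbox_n[MAX_RX];

/* Copy the block data[0..n-1] into the inbox of every receiver
 * listed in rx[0..nrx-1]; -1 if any inbox would overflow. */
int send_block(const int *rx, int nrx, const int *data, int n) {
    for (int r = 0; r < nrx; r++)
        if (inbox_n[rx[r]] + n > INBOX) return -1;
    for (int r = 0; r < nrx; r++)
        for (int i = 0; i < n; i++)
            inbox[rx[r]][inbox_n[rx[r]]++] = data[i];
    return 0;
}
```

The up-front capacity check keeps the batch transfer all-or-nothing, which is one reasonable design choice for a multi-receiver call.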
Conclusions • MCAPI is more flexible than DaCS because its communication is identified by endpoints, not only by process ID or physical ID. • In a many-core environment, MCAPI is more suitable.
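To illustrate the endpoint argument: in a simplified model, an MCAPI-style endpoint is a (node, port) pair, so one node can expose several independent communication endpoints, whereas identification by process or physical ID alone gives each process a single address. This struct is a conceptual sketch, not the MCAPI type:

```c
/* Simplified model of an endpoint: a (node, port) pair. */
struct endpoint { int node; int port; };

int same_endpoint(struct endpoint a, struct endpoint b) {
    return a.node == b.node && a.port == b.port;
}
/* Two endpoints on the same node are still distinct channels,
 * which a plain process ID cannot express. */
```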