HAsim On-Chip Network Model Configuration • Michael Adler
The Front End Multiplexed

[Figure: CPU 1 and CPU 2 share one multiplexed front end. The FET stage drives the line predictor, branch predictor, ITLB, and IMEM/I$ lookup; PC Resolve selects among the predicted path and the redirect, fault, misprediction, and training inputs from the back end; fetched instructions (or faults) are enqueued into the Inst Q.]
Problem: On-Chip Network

[Figure: CPUs 0..2 with L1/L2 caches and a memory controller, each attached to a router (r); msg and credit wires run to and from every router.]

• Problem: routing wires to/from each router
• Similar to the "global controller" scheme
• Also, utilization is low
Multiplexing On-Chip Network Routers

[Figure: routers 0..3 collapsed into a single multiplexed Router 0..3. Reorder buffers between simulation passes apply the permutations σ(x) = (x + 1) mod 4, σ(x) = (x + 2) mod 4, and σ(x) = (x + 3) mod 4 so each instance's messages reach the correct neighbor instance.]

• Simulate the network without a network
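The reorder step can be checked in plain software. Below is a minimal C++ sketch (illustrative, not HAsim source) of why a message produced by multiplexed instance x must be delivered to instance σ(x) = (x + k) mod n on the next simulation pass:

    // Simulate n routers on one physical router by time-multiplexing.
    // Between passes, the buffer must permute messages so that the message
    // sent by instance x is seen by its neighbor sigma(x) = (x + k) mod n.
    #include <cstdio>
    #include <vector>

    int main()
    {
        const int n = 4;   // number of multiplexed router instances
        const int k = 1;   // neighbor offset for this port direction

        // Messages produced in one pass, one per instance, in instance order.
        std::vector<int> sent(n);
        for (int x = 0; x < n; x++)
            sent[x] = x;                  // payload = sender's instance ID

        // Apply the permutation: instance y = (x + k) mod n pops x's message.
        std::vector<int> recv(n);
        for (int x = 0; x < n; x++)
            recv[(x + k) % n] = sent[x];

        for (int y = 0; y < n; y++)
            printf("instance %d receives the message from instance %d\n", y, recv[y]);
        return 0;
    }

With n = 4 and k = 1 this prints exactly the rotated columns shown in the figure.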
HAsim’s Network Model is Abstract • In a software model the target network can be built at run-time • Dynamism is expensive in FPGAs and recompilation is slow • Solution: Constrained dynamism • Fixed parameters: Max nodes, max edges per node, max VCs • Dynamic: • Number of active contexts (nodes) • Endpoints of each edge (indirection table) • Routing table • Address mapping of distributed LLC
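As a rough picture of constrained dynamism, the C++ sketch below (names and sizes are hypothetical, not the HAsim API) separates what is frozen at FPGA compile time from what the topology manager fills in at startup:

    #include <array>
    #include <cstdint>

    // Fixed at FPGA compile time: capacities only.
    constexpr int MAX_NODES          = 64;
    constexpr int MAX_EDGES_PER_NODE = 5;    // N, S, E, W, local
    constexpr int MAX_VCS            = 4;

    struct NetworkConfig
    {
        // Dynamic: number of active contexts (nodes), <= MAX_NODES.
        int numActiveNodes;

        // Dynamic: endpoint indirection table.  Edge e of node n connects
        // to node edgeDst[n][e] (-1 for a NULL connection).
        std::array<std::array<int16_t, MAX_EDGES_PER_NODE>, MAX_NODES> edgeDst;

        // Dynamic: routing table, next-hop edge for each (node, dest) pair.
        std::array<std::array<int8_t, MAX_NODES>, MAX_NODES> nextHopEdge;
    };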
Topology Manager • Software – runs once at startup so no need to optimize • HASIM_CHIP_TOPOLOGY_CLASS: • Manages streaming of parameters to the FPGA • Iterates over all software topology mapping classes until convergence • Namespace defined by dictionaries • .dic files are preprocessed by LEAP tools • Hierarchy of enumerated types
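A hedged sketch of the convergence loop (the class and method names below are illustrative stand-ins for the HASIM_CHIP_TOPOLOGY_CLASS interface): mapping classes may depend on each other's results, so the manager keeps re-running them until no class reports a change:

    #include <vector>

    class TopologyMapper
    {
      public:
        // One mapping pass; returns true if any mapping changed.
        virtual bool MapTopology() = 0;
        virtual ~TopologyMapper() {}
    };

    void RunTopologyManager(std::vector<TopologyMapper*>& mappers)
    {
        // Runs once at startup, so speed is not critical.
        bool changed = true;
        while (changed)
        {
            changed = false;
            for (auto* m : mappers)
                changed |= m->MapTopology();
        }
        // After convergence, the final tables are streamed to the FPGA.
    }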
How do I… • Map address ranges to LLC segments? • Map target cores to nodes? • Pick a number of memory controllers and map them to nodes? • Define a target machine network topology? • Manage interleaving for multiplexing the network and cores?
Map Address Ranges to LLC Segments (SW)

• Build a table of n_llc_map_entries, where each entry is an index to a portion of the distributed LLC.
• icn-mesh.cpp:

    for (int addr_idx = 0; addr_idx < n_llc_map_entries; addr_idx++)
    {
        bool is_last = (addr_idx + 1 == n_llc_map_entries);
        topology->SendParam(TOPOLOGY_NET_LLC_ADDR_MAP,
                            &cores_net_pos[addr_idx % num_cores],
                            sizeof(TOPOLOGY_VALUE),
                            is_last);
    }
Map Address Ranges to LLC Segments (FPGA)

• Consume the table that was streamed in from SW
• last-level-cache-no-coherence.bsv:

    // Define a node that will stream in the topology.  This builds a node
    // on a ring.  The node looks for messages tagged TOPOLOGY_NET_MEM_CTRL_MAP
    // and emits associated payloads.
    let ctrlAddrMapInit <- mkTopologyParamStream(`TOPOLOGY_NET_MEM_CTRL_MAP);

    // Allocate a local memory and initialize it with the streamed-in entries.
    LUTRAM#(Bit#(TLog#(TMul#(8, MAX_NUM_MEM_CTRLS))),
            STATION_ID) memCtrlDstForAddr <- mkLUTRAMWithGet(ctrlAddrMapInit);

    // Map an address to a node ID using the table.
    function STATION_ID getMemCtrlDstForAddr(LINE_ADDRESS addr);
        // Use the low bits of the address as the index (resize does this).
        return memCtrlDstForAddr.sub(resize(addr));
    endfunction
Map Address Ranges to LLC Segments (LLC Hub)

    rule ...
        // Incoming request from core
        if (m_reqFromCore matches tagged Valid .req) begin
            // Which instance of the distributed cache is responsible?
            let dst = getLLCDstForAddr(req.physicalAddress);

            if (dst == local_station_id) begin
                // Local cache handles the address.
                if (can_enq_reqToLocalLLC && ! isValid(m_new_reqToLocalLLC)) begin
                    // Port to LLC is available.  Send the local request.
                    did_deq_reqFromCore = True;
                    m_new_reqToLocalLLC = tagged Valid LLC_MEMORY_REQ { src: tagged Invalid,
                                                                        mreq: req };
                    debugLog.record(cpu_iid, $format("1: Core REQ to local LLC, ") + fshow(req));
                end
            end
            else if (can_enq_reqToRemoteLLC && ! isValid(m_new_reqToRemoteLLC)) begin
                // Remote cache instance handles the address and the OCN request
                // port is available.
                //
                // These requests share the OCN request port since only one type
                // of request goes to a given remote station.  Memory stations get
                // memory requests above.  LLC stations get core requests here.
                did_deq_reqFromCore = True;
                m_new_reqToRemoteLLC = tagged Valid tuple2(dst, req);
                debugLog.record(cpu_iid, $format("1: Core REQ to LLC %0d, ", dst) + fshow(req));
            end
        end
        ...
    endrule
Map Cores and Memory Controllers to Nodes

• All computed (currently) in icn-mesh.cpp
• Given number of target cores and number of memory controllers:
• Builds a rectangle of cores as close to square as possible
• Adds a row of memory controllers at the top and bottom
• Topology streamed to FPGA using same mechanism as address mapping
• E.g., 15 cores and 3 memory controllers (C = core, M = memory controller, x = unused node):

    x M M x
    C C C C
    C C C C
    C C C C
    C C C x
    x M x x
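The placement rule reduces to a few lines; the sketch below is illustrative C++, not the actual icn-mesh.cpp code:

    #include <cmath>
    #include <cstdio>

    int main()
    {
        const int nCores = 15;

        // Core rectangle as close to square as possible.
        int width  = (int)ceil(sqrt((double)nCores));
        int height = (nCores + width - 1) / width;   // rows of cores

        // One extra row above and below for memory controllers.
        printf("core block: %d x %d, mesh: %d x %d stations\n",
               width, height, width, height + 2);
        return 0;
    }

For 15 cores this yields a 4 x 4 core block inside a 4 x 6 mesh, matching the diagram above.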
Network Topology: Map Cores/Memory Controllers to Nodes • Multiplexed order of nodes is the same as order of cores • No permutations required for local port • Nodes are connected to: • Core • Memory controller • Nothing • The node doesn’t care what is connected! • Hide indirection in ports
Network Topology: Map Cores/Memory Controllers to Nodes

• In icn-mesh.bsv:

    //
    // Local ports are a dynamic combination of CPUs, memory controllers, and
    // NULL connections.
    //
    // localPortMap indicates, for each multiplexed port instance ID, the type
    // of local port attached (CPU, memory controller, NULL).
    //
    let localPortInit <- mkTopologyParamStream(`TOPOLOGY_NET_LOCAL_PORT_TYPE_MAP);
    LUTRAM#(Bit#(TLog#(TAdd#(TAdd#(MAX_NUM_CPUS, 1), NUM_STATIONS))),
            Bit#(2)) localPortMap <- mkLUTRAMWithGet(localPortInit);

    PORT_SEND_MULTIPLEXED#(MAX_NUM_CPUS, OCN_MSG) enqToCores <-
        mkPortSend_Multiplexed("Core_OCN_Connection_InQ_enq");
    PORT_SEND_MULTIPLEXED#(MAX_NUM_MEM_CTRLS, OCN_MSG) enqToMemCtrl <-
        mkPortSend_Multiplexed("ocn_to_memctrl_enq");
    PORT_SEND_MULTIPLEXED#(NUM_STATIONS, OCN_MSG) enqToNull <-
        mkPortSend_Multiplexed_NULL();

    let enqToLocal <- mkPortSend_Multiplexed_Split3(enqToCores, enqToMemCtrl, enqToNull,
                                                    localPortMap);
Network Topology: Defining Inter-Node Edges

• Each network node has five ports: North, East, South, West, and Local.

[Figure: a node with N, E, S, and W edge ports plus a Local port.]
Network Multiplexing • Logically, there are n nodes in the network. • Each has a local port connected to a core, to a memory controller, or to nothing. • Network connection mapping and routing determine the topology. • The topology manager defines the routing table. • Note: Dateline not yet implemented
Network Topology and Routing • Torus:
Network Topology and Routing • Mesh (connections identical, routing table ignores some edges):
Network Topology and Routing • Bi-directional ring:
Network Topology and Routing • Uni-directional ring:
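Since the physical edges are the same in all four cases, the routing table alone selects the target topology. A minimal sketch (hypothetical helper, not HAsim source) of next-hop selection on an n-node ring: the bi-directional version routes the shorter way around, while the uni-directional version always routes in one direction:

    #include <cstdio>

    enum Dir { EAST, WEST };

    // Next hop from 'node' toward 'dest' on an n-node ring.
    Dir NextHop(int node, int dest, int n, bool biDirectional)
    {
        if (!biDirectional)
            return EAST;                                  // uni-directional ring
        int eastHops = (dest - node + n) % n;
        return (eastHops <= n - eastHops) ? EAST : WEST;  // shorter way around
    }

    int main()
    {
        const int n = 8;
        printf("bi-dir  0 -> 6 goes %s\n", NextHop(0, 6, n, true)  == EAST ? "E" : "W");
        printf("uni-dir 0 -> 6 goes %s\n", NextHop(0, 6, n, false) == EAST ? "E" : "W");
        return 0;
    }

A mesh works the same way: the edges still form a torus, but its routing table simply never uses the wraparound links.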
Final Problem: Multiplexing On-Chip Network Routers

[Figure repeated from "Multiplexing On-Chip Network Routers" above: routers 0..3 multiplexed onto one router, with reorder permutations σ(x) = (x + 1) mod 4, σ(x) = (x + 2) mod 4, and σ(x) = (x + 3) mod 4 between passes.]
Network Topology: Communication Across Multiplexed Nodes • Each node talks to a different multiplexed node instance • Naïve port binding would have each node talk only to itself • A-Ports are already buffered • Bury transformation in A-Ports • Retain simple read next / write next port semantics within models
Network Topology: Communication Across Multiplexed Nodes

• icn-mesh.bsv:

    // Initialization from topology manager
    ReadOnly#(STATION_IID) meshWidth  <- mkTopologyParamReg(`TOPOLOGY_NET_MESH_WIDTH);
    ReadOnly#(STATION_IID) meshHeight <- mkTopologyParamReg(`TOPOLOGY_NET_MESH_HEIGHT);

    // Outbound and inbound ports are loopbacks to the same multiplexed module.
    // Ports connect to logically different nodes but physically to the same
    // simulator object.
    Vector#(NUM_PORTS, PORT_SEND_MULTIPLEXED#(NUM_STATIONS, MESH_MSG)) enqTo   = newVector();
    Vector#(NUM_PORTS, PORT_RECV_MULTIPLEXED#(NUM_STATIONS, MESH_MSG)) enqFrom = newVector();

    // Outbound port is a normal A-Port.  It has no buffering.
    enqTo[portEast] <- mkPortSend_Multiplexed("mesh_interconnect_enq_E");

    // Inbound port provides buffering for multiplexing.  Instead of forwarding
    // messages FIFO it must transform the messages so they cross to the correct
    // multiplexed instance when instances (nodes) are traversed sequentially.
    enqFrom[portWest] <- mkPortRecv_Multiplexed_ReorderLastToFirstEveryN(
        "mesh_interconnect_enq_E", 1, meshWidth, meshHeight);
    . . .
    enqFrom[portEast] <- mkPortRecv_Multiplexed_ReorderFirstToLastEveryN(
        "mesh_interconnect_enq_W", 1, meshWidth, meshHeight);
    enqFrom[portSouth] <- mkPortRecv_Multiplexed_ReorderFirstNToLastN(
        "mesh_interconnect_enq_N", 1, meshWidth);
    enqFrom[portNorth] <- mkPortRecv_Multiplexed_ReorderLastNToFirstN(
        "mesh_interconnect_enq_S", 1, meshWidth);