Early Experience with Out-of-Core Applications on the Cray XMT
Daniel Chavarría-Miranda§, Andrés Márquez§, Jarek Nieplocha§, Kristyn Maschhoff† and Chad Scherrer§
§Pacific Northwest National Laboratory (PNNL) †Cray, Inc.
Introduction • The gap between memory and processor speed keeps widening • This is causing many applications to become memory-bound • Mainstream processors rely on a cache hierarchy • Caches are not effective for highly irregular, data-intensive applications • Multithreaded architectures provide an alternative: switch computation context to hide memory latency • Cray MTA-2 processors and the newer ThreadStorm processors on the Cray XMT use this strategy
Cray XMT • 3rd-generation multithreaded system from Cray • Infrastructure is based on the XT3/4, scalable up to 8192 processors • SeaStar network, torus topology, service and I/O nodes • Compute nodes contain 4 ThreadStorm multithreaded processors instead of 4 AMD Opteron processors • Hybrid execution capabilities: code can run on ThreadStorm processors in collaboration with code running on Opteron processors
Cray XMT (cont.) • ThreadStorm processors run at 500 MHz • 128 hardware thread contexts, each with its own set of 32 registers • No data cache • 128 KB, 4-way associative data buffer on the memory side • Extra bits in each 64-bit memory word: full/empty bits for synchronization • Memory is hashed at a 64-byte granularity, i.e. contiguous logical addresses beyond a 64-byte boundary may map to non-contiguous physical locations • Global shared memory
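As a concrete illustration of the full/empty bits, here is a minimal producer/consumer sketch. It assumes the XMT C compiler's built-in readfe() and writeef() generics (named later in this deck); the exact types and declarations are assumptions.

```c
/* Minimal full/empty-bit sketch.  readfe() and writeef() are built-in
 * generics of the XMT C compiler; exact types/declarations are assumed. */
long slot;    /* one 64-bit word; its full/empty tag bit does the sync */

void produce(long v) {
    writeef(&slot, v);     /* wait until the word is Empty, write v, set Full */
}

long consume(void) {
    return readfe(&slot);  /* wait until the word is Full, read it, set Empty */
}
```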
Cray XMT (cont.) • Lightweight User Communication (LUC) library to coordinate data transfers and hybrid execution between ThreadStorm and Opteron processors • Based on Portals on the Opterons • Based on the Fast I/O API on the ThreadStorms • RPC-style semantics • Service and I/O (SIO) nodes provide Lustre, a high-performance parallel file system • ThreadStorm processors cannot directly access Lustre • LUC-based execution and transfers, combined with Lustre access on the SIO nodes, are an attractive, high-performance alternative for processing very large datasets on the XMT system
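The sketch below illustrates only the RPC-style flow described above: an Opteron-side server on an SIO node reading Lustre-resident data on behalf of ThreadStorm clients. None of these names are the real LUC API; the transport is stubbed with a plain function call so the shape of the exchange is visible.

```c
/* HYPOTHETICAL sketch of the RPC-style flow; nothing here is the real
 * LUC API.  In the real system the call crosses the SeaStar network
 * from a ThreadStorm node to an Opteron-side server on an SIO node. */
#include <stdio.h>
#include <string.h>

typedef void (*rpc_handler)(const void *req, size_t req_len,
                            void *reply, size_t reply_len);

/* Opteron/SIO side: a handler that would read a chunk from Lustre */
static void read_chunk(const void *req, size_t req_len,
                       void *reply, size_t reply_len) {
    long offset;
    memcpy(&offset, req, sizeof offset);
    /* real code would pread() the Lustre-resident file at 'offset' */
    snprintf(reply, reply_len, "data@%ld", offset);
}

/* stand-in for the LUC transport */
static void rpc_call(rpc_handler h, const void *req, size_t req_len,
                     void *reply, size_t reply_len) {
    h(req, req_len, reply, reply_len);
}

int main(void) {
    long offset = 4096;
    char buf[32];
    rpc_call(read_chunk, &offset, sizeof offset, buf, sizeof buf);
    printf("got: %s\n", buf);   /* ThreadStorm side consumes the chunk */
    return 0;
}
```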
Outline • Introduction • Cray XMT • PDTree • Multithreaded implementation • Static & dynamic versions • Experimental setup and Results • Conclusions
PDTree (or Anomaly Detection for Categorical Data) • Originates from cyber security analysis • Detect anomalies in packet headers • Locate and characterize network attacks • The analysis method is more widely applicable • Uses ideas from conditional probability • Multivariate categorical data analysis: for a combination of variables and instances of values for those variables, find out how many times the pattern has occurred • The resulting count table, or contingency table, specifies a joint distribution • Efficient implementation of algorithms using such tables is very important in statistical analysis • The ADTree data structure (Moore & Lee 1998) can be used to store data counts • Stores all combinations of values for variables
PDTree (cont.) • We use an enhancement of the ADTree data structure, called a PDTree, that does not need to store all possible combinations of values • Only a priori specified combinations are stored
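A toy, single-threaded illustration (not from the paper) of a contingency-table count: for an a priori specified pair of categorical variables, tally how often each value combination occurs in the records.

```c
/* Toy contingency-table count for one specified variable pair (v1, v2). */
#include <stdio.h>

#define NRECORDS 6
#define NVALUES  4   /* categorical values are small integers 0..3 */

int main(void) {
    /* each record is an instance of two categorical variables */
    int records[NRECORDS][2] = {
        {0, 1}, {0, 1}, {2, 3}, {0, 1}, {2, 3}, {1, 1}
    };
    int counts[NVALUES][NVALUES] = {{0}};

    for (int r = 0; r < NRECORDS; r++)
        counts[records[r][0]][records[r][1]]++;   /* joint count */

    /* the pattern (v1=0, v2=1) occurred 3 times */
    printf("count(v1=0, v2=1) = %d\n", counts[0][1]);
    return 0;
}
```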
Multithreaded Implementation • PDTree implemented using a multiple-type, recursive tree structure • The root node is an array of ValueNodes (counts for different value instances of the root variables) • Interior and leaf nodes are linked lists of ValueNodes • Inserting a record at the top level involves just incrementing the counter of the corresponding ValueNode • The XMT's int_fetch_add() atomic operation is used to increment counters • Inserting a record at other levels requires traversing a linked list to find the right ValueNode • If the ValueNode does not exist, it must be appended to the end of the list • Inserting at other levels when the node does not exist is tricky: to ensure safety, the end pointer of the list must be locked • We use the readfe() and writeef() MTA operations to create critical sections, taking advantage of the full/empty bits on each memory word • As the data analysis progresses, the probability of conflicts between threads decreases (see the sketch below)
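A minimal sketch of the find-or-insert described above, using the operations the slide names (int_fetch_add(), readfe(), writeef()). It compiles only with the XMT C compiler, where these are built-in generics; the struct layout and function name are assumptions, not the paper's code.

```c
#include <stdlib.h>

typedef struct ValueNode {
    long value;               /* value instance for this variable */
    long count;               /* occurrence count */
    struct ValueNode *next;   /* NULL marks the end of the list */
} ValueNode;

/* Find the ValueNode for 'value' and bump its count, appending a new
 * node if none exists.  Only the end pointer is ever locked. */
void add_value(ValueNode **head, long value) {
    ValueNode **link = head;
    for (;;) {
        ValueNode *node = *link;              /* ordinary, unlocked read */
        if (node == NULL) {
            /* apparent end of list: readfe() waits for Full, reads the
             * link, and leaves it Empty, locking out concurrent appends */
            node = readfe(link);
            if (node == NULL) {
                ValueNode *fresh = malloc(sizeof *fresh);
                fresh->value = value;
                fresh->count = 1;
                fresh->next  = NULL;
                writeef(link, fresh);         /* publish node, set Full */
                return;
            }
            /* lost the race shown in the figure below: another thread
             * already appended, so this is now a non-end pointer --
             * unlock it and keep walking */
            writeef(link, node);
        }
        if (node->value == value) {
            int_fetch_add(&node->count, 1);   /* atomic, lock-free bump */
            return;
        }
        link = &node->next;
    }
}
```

Only the end pointer is ever held in the Empty (locked) state, so threads traversing the interior of the list never block on each other.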
Multithreaded Implementation (cont.) [Figure: two threads racing to append to a linked list of ValueNodes (vi = j, vi = k). T1 and T2 both try to grab the end pointer; T1 succeeds and inserts a new node (vi = m); T2 now holds a lock on a non-end pointer and must release it and resume its traversal.]
Static and Dynamic Versions [Figure: PDTree node layout. An array of RootNodes (count, numCols, columns, values); a RootNode's values point into a hash table of ValueNodes; columns point to an array of ColumnNodes (column = a, b, c, ...; values); deeper ValueNodes (value, count, numCols, columns, nextVN) are chained in linked lists.]
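A reconstruction of the node types suggested by the figure; the field names follow the diagram labels, but the exact layout in the paper's code is an assumption.

```c
/* Node types reconstructed from the figure's labels (assumed layout). */
typedef struct ValueNode  ValueNode;
typedef struct ColumnNode ColumnNode;

struct ColumnNode {
    int        column;   /* which variable this level refines (a, b, c) */
    ValueNode *values;   /* ValueNodes for this column's observed values */
};

struct ValueNode {
    long        value;   /* value instance of the parent column */
    long        count;   /* occurrences of this combination */
    int         numCols;
    ColumnNode *columns; /* array of child ColumnNodes */
    ValueNode  *nextVN;  /* next ValueNode in the linked list */
};

typedef struct RootNode {
    long        count;   /* records seen under this root value */
    int         numCols;
    ColumnNode *columns; /* array of child ColumnNodes */
    ValueNode  *values;  /* top-level ValueNodes (hash table in the figure) */
} RootNode;
```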
Outline • Introduction • Cray XMT • PDTree • Multithreaded implementation • Static & dynamic versions • Experimental setup and Results • Conclusions
Experimental setup and Results • Large dataset to be analyzed by PDTree: 4 GB resident on disk (64M records, 9-column guide tree) • Options: • Direct file I/O from ThreadStorm processors via NFS: not very efficient • Indirect I/O via a LUC server running on Opteron processors on the SIO nodes: the large input file can reside on the high-performance Lustre file system • Simulates the use of PDTree for online network traffic analysis • Need to use the dynamic PDTree • 128K-element hash table
Experimental setup and Results (cont.) [Figure: system diagram. A ThreadStorm compute node accesses the Lustre filesystem indirectly: a LUC RPC crosses the SeaStar interconnect to an Opteron CPU on a service/login node, which has direct access to Lustre.] Note: results obtained on a preproduction XMT with only half of the DIMM slots populated
Experimental setup and Results (cont.) [Chart: in-core, 1M-record execution, static PDTree version.]
Conclusions • Results indicate the value of the XMT hybrid architecture and its improved I/O capabilities • Indirect access to Lustre through the LUC interface • Need to improve the I/O operation implementation to take full advantage of Lustre • Multiple LUC transfers in parallel should improve performance • Scalability of the system is very good for complex, data-dependent irregular accesses in the PDTree application • Future work includes comparisons against parallel cache-based systems