Early Experience with Out-of-Core Applications on the Cray XMT
Daniel Chavarría-Miranda§, Andrés Márquez§, Jarek Nieplocha§, Kristyn Maschhoff† and Chad Scherrer§
§Pacific Northwest National Laboratory (PNNL) †Cray, Inc.
Introduction • The gap between memory and processor speed keeps widening • This is causing many applications to become memory-bound • Mainstream processors rely on a cache hierarchy • Caches are not effective for highly irregular, data-intensive applications • Multithreaded architectures provide an alternative: switch computation context to hide memory latency • Cray MTA-2 processors and the newer ThreadStorm processors on the Cray XMT use this strategy
Cray XMT • 3rd-generation multithreaded system from Cray • Infrastructure is based on the XT3/4, scalable up to 8192 processors • SeaStar network, torus topology, service and I/O nodes • Compute nodes contain 4 ThreadStorm multithreaded processors instead of 4 AMD Opteron processors • Hybrid execution capabilities: code can run on ThreadStorm processors in collaboration with code running on Opteron processors
Cray XMT (cont.) • ThreadStorm processors run at 500 MHz • 128 hardware thread contexts, each with its own set of 32 registers • No data cache • 128 KB, 4-way associative data buffer on the memory side • Extra bits in each 64-bit memory word: full/empty bits for synchronization • Memory is hashed at a 64-byte granularity, i.e. contiguous logical addresses beyond a 64-byte boundary may map to non-contiguous physical locations • Global shared memory
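As a concrete illustration of the full/empty bits, here is a minimal producer/consumer sketch. It assumes the XMT C compiler's built-in readfe() and writeef() generics (named later in this deck); the exact types and declarations are assumptions.

```c
/* Minimal full/empty-bit sketch.  readfe() and writeef() are built-in
 * generics of the XMT C compiler; exact types/declarations are assumed. */
long slot;    /* one 64-bit word; its full/empty tag bit does the sync */

void produce(long v) {
    writeef(&slot, v);     /* wait until the word is Empty, write v, set Full */
}

long consume(void) {
    return readfe(&slot);  /* wait until the word is Full, read it, set Empty */
}
```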
Cray XMT (cont.) • Lightweight User Communication (LUC) library to coordinate data transfers and hybrid execution between ThreadStorm and Opteron processors • Based on Portals on the Opterons • Based on the Fast I/O API on the ThreadStorms • RPC-style semantics • Service and I/O (SIO) nodes provide Lustre, a high-performance parallel file system • ThreadStorm processors cannot directly access Lustre • LUC-based execution and transfers, combined with Lustre access on the SIO nodes, are an attractive, high-performance alternative for processing very large datasets on the XMT system
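The sketch below illustrates only the RPC-style flow described above: an Opteron-side server on an SIO node reading Lustre-resident data on behalf of ThreadStorm clients. None of these names are the real LUC API; the transport is stubbed with a plain function call so the shape of the exchange is visible.

```c
/* HYPOTHETICAL sketch of the RPC-style flow; nothing here is the real
 * LUC API.  In the real system the call crosses the SeaStar network
 * from a ThreadStorm node to an Opteron-side server on an SIO node. */
#include <stdio.h>
#include <string.h>

typedef void (*rpc_handler)(const void *req, size_t req_len,
                            void *reply, size_t reply_len);

/* Opteron/SIO side: a handler that would read a chunk from Lustre */
static void read_chunk(const void *req, size_t req_len,
                       void *reply, size_t reply_len) {
    long offset;
    memcpy(&offset, req, sizeof offset);
    /* real code would pread() the Lustre-resident file at 'offset' */
    snprintf(reply, reply_len, "data@%ld", offset);
}

/* stand-in for the LUC transport */
static void rpc_call(rpc_handler h, const void *req, size_t req_len,
                     void *reply, size_t reply_len) {
    h(req, req_len, reply, reply_len);
}

int main(void) {
    long offset = 4096;
    char buf[32];
    rpc_call(read_chunk, &offset, sizeof offset, buf, sizeof buf);
    printf("got: %s\n", buf);   /* ThreadStorm side consumes the chunk */
    return 0;
}
```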
Outline • Introduction • Cray XMT • PDTree • Multithreaded implementation • Static & dynamic versions • Experimental setup and Results • Conclusions
PDTree (or Anomaly Detection for Categorical Data) • Originates from cyber security analysis • Detect anomalies in packet headers • Locate and characterize network attacks • The analysis method is more widely applicable • Uses ideas from conditional probability • Multivariate categorical data analysis: for a combination of variables and instances of values for those variables, find out how many times the pattern has occurred • The resulting count table, or contingency table, specifies a joint distribution • Efficient implementation of algorithms using such tables is very important in statistical analysis • The ADTree data structure (Moore & Lee 1998) can be used to store data counts • Stores all combinations of values for variables
PDTree (cont.) • We use an enhancement of the ADTree data structure, called a PDTree, that does not need to store all possible combinations of values • Only a priori specified combinations are stored
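A toy, single-threaded illustration (not from the paper) of a contingency-table count: for an a priori specified pair of categorical variables, tally how often each value combination occurs in the records.

```c
/* Toy contingency-table count for one specified variable pair (v1, v2). */
#include <stdio.h>

#define NRECORDS 6
#define NVALUES  4   /* categorical values are small integers 0..3 */

int main(void) {
    /* each record is an instance of two categorical variables */
    int records[NRECORDS][2] = {
        {0, 1}, {0, 1}, {2, 3}, {0, 1}, {2, 3}, {1, 1}
    };
    int counts[NVALUES][NVALUES] = {{0}};

    for (int r = 0; r < NRECORDS; r++)
        counts[records[r][0]][records[r][1]]++;   /* joint count */

    /* the pattern (v1=0, v2=1) occurred 3 times */
    printf("count(v1=0, v2=1) = %d\n", counts[0][1]);
    return 0;
}
```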
Multithreaded Implementation • PDTree implemented using a multiple-type, recursive tree structure • The root node is an array of ValueNodes (counts for different value instances of the root variables) • Interior and leaf nodes are linked lists of ValueNodes • Inserting a record at the top level involves just incrementing the counter of the corresponding ValueNode • The XMT's int_fetch_add() atomic operation is used to increment counters • Inserting a record at other levels requires traversing a linked list to find the right ValueNode • If the ValueNode does not exist, it must be appended to the end of the list • Inserting at other levels when the node does not exist is tricky: to ensure safety, the end pointer of the list must be locked • We use the readfe() and writeef() MTA operations to create critical sections, taking advantage of the full/empty bits on each memory word • As the data analysis progresses, the probability of conflicts between threads decreases (see the sketch below)
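A minimal sketch of the find-or-insert described above, using the operations the slide names (int_fetch_add(), readfe(), writeef()). It compiles only with the XMT C compiler, where these are built-in generics; the struct layout and function name are assumptions, not the paper's code.

```c
#include <stdlib.h>

typedef struct ValueNode {
    long value;               /* value instance for this variable */
    long count;               /* occurrence count */
    struct ValueNode *next;   /* NULL marks the end of the list */
} ValueNode;

/* Find the ValueNode for 'value' and bump its count, appending a new
 * node if none exists.  Only the end pointer is ever locked. */
void add_value(ValueNode **head, long value) {
    ValueNode **link = head;
    for (;;) {
        ValueNode *node = *link;              /* ordinary, unlocked read */
        if (node == NULL) {
            /* apparent end of list: readfe() waits for Full, reads the
             * link, and leaves it Empty, locking out concurrent appends */
            node = readfe(link);
            if (node == NULL) {
                ValueNode *fresh = malloc(sizeof *fresh);
                fresh->value = value;
                fresh->count = 1;
                fresh->next  = NULL;
                writeef(link, fresh);         /* publish node, set Full */
                return;
            }
            /* lost the race shown in the figure below: another thread
             * already appended, so this is now a non-end pointer --
             * unlock it and keep walking */
            writeef(link, node);
        }
        if (node->value == value) {
            int_fetch_add(&node->count, 1);   /* atomic, lock-free bump */
            return;
        }
        link = &node->next;
    }
}
```

Only the end pointer is ever held in the Empty (locked) state, so threads traversing the interior of the list never block on each other.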
Multithreaded Implementation (cont.) [Figure: two threads racing to append to a linked list of ValueNodes (vi = j, vi = k). T1 and T2 both try to grab the end pointer; T1 succeeds and inserts a new node (vi = m); T2 now holds a lock on a non-end pointer and must release it and resume its traversal.]
Static and Dynamic Versions [Figure: PDTree node layout. An array of RootNodes (count, numCols, columns, values); a RootNode's values point into a hash table of ValueNodes; columns point to an array of ColumnNodes (column = a, b, c, ...; values); deeper ValueNodes (value, count, numCols, columns, nextVN) are chained in linked lists.]
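A reconstruction of the node types suggested by the figure; the field names follow the diagram labels, but the exact layout in the paper's code is an assumption.

```c
/* Node types reconstructed from the figure's labels (assumed layout). */
typedef struct ValueNode  ValueNode;
typedef struct ColumnNode ColumnNode;

struct ColumnNode {
    int        column;   /* which variable this level refines (a, b, c) */
    ValueNode *values;   /* ValueNodes for this column's observed values */
};

struct ValueNode {
    long        value;   /* value instance of the parent column */
    long        count;   /* occurrences of this combination */
    int         numCols;
    ColumnNode *columns; /* array of child ColumnNodes */
    ValueNode  *nextVN;  /* next ValueNode in the linked list */
};

typedef struct RootNode {
    long        count;   /* records seen under this root value */
    int         numCols;
    ColumnNode *columns; /* array of child ColumnNodes */
    ValueNode  *values;  /* top-level ValueNodes (hash table in the figure) */
} RootNode;
```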
Outline • Introduction • Cray XMT • PDTree • Multithreaded implementation • Static & dynamic versions • Experimental setup and Results • Conclusions
Experimental setup and Results • Large dataset to be analyzed by PDTree: 4 GB resident on disk (64M records, 9-column guide tree) • Options: • Direct file I/O from ThreadStorm processors via NFS: not very efficient • Indirect I/O via a LUC server running on Opteron processors on the SIO nodes: the large input file can reside on the high-performance Lustre file system • Simulates the use of PDTree for online network traffic analysis • Need to use the dynamic PDTree • 128K-element hash table
Experimental setup and Results (cont.) [Figure: system diagram. A ThreadStorm compute node accesses the Lustre filesystem indirectly: a LUC RPC crosses the SeaStar interconnect to an Opteron CPU on a service/login node, which has direct access to Lustre.] Note: results obtained on a preproduction XMT with only half of the DIMM slots populated
Experimental setup and Results (cont.) [Chart: in-core, 1M-record execution, static PDTree version.]
Conclusions • Results indicate the value of the XMT hybrid architecture and its improved I/O capabilities • Indirect access to Lustre through the LUC interface • Need to improve the I/O operation implementation to take full advantage of Lustre • Multiple LUC transfers in parallel should improve performance • Scalability of the system is very good for complex, data-dependent irregular accesses in the PDTree application • Future work includes comparisons against parallel cache-based systems