Power, Resilience, Capacity, Oh My! I/O-integrated computing with NVRAM



  1. Power, Resilience, Capacity, Oh My! I/O-integrated computing with NVRAM Maya Gokhale, LLNL

  2. Exascale node architectures add NVRAM to the memory hierarchy • Amount of DRAM per core is severely constrained in an exascale node • Many cores per node • Number of nodes • Total power requirements dominated by DRAM • The scale of an exaflop computer makes frequent checkpointing to a global file system infeasible • Peer-to-peer checkpoints (SCR) are harder with less memory per core • NVRAM in the compute fabric is required for energy, capacity, and resilience

  3. Exascale machine incorporates NVRAM at every level (diagram labels: X-Stack I, NoLoSS)

  4. NVRAM bridges the memory/storage latency gap • With this new hardware technology, can we … • directly operate on arrays and tables that are too big to fit in main memory • make permanent updates to arrays and tables without file I/O • traverse large graphs “as if” in memory • save lightweight checkpoints by flushing cache to persistent memory • enable scalable applications to transition to exascale • Memory and storage are treated as different in kind by the entire software stack: direct access to memory, but read and write operations to storage • Emerging persistent memories support efficient random, byte-level read/write access, filling a gap in the memory/storage hierarchy • Disk access latency is roughly 100,000× that of DRAM • PCM read latency is roughly 2× that of DRAM (a minimal mmap sketch follows)
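
To make the "as if in memory" model concrete, here is a minimal sketch (not from the slides; the path and sizes are illustrative) of operating on a large persistent array through mmap, updating it in place, and flushing dirty pages instead of issuing file I/O:

```cpp
// Minimal sketch: memory-map a large array backed by a persistent-memory or
// flash-resident file, update it in place, and make the update durable.
// The path and size below are made up for illustration.
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    const size_t n = 1ull << 30;                 // 1 Gi elements (~8 GiB)
    int fd = open("/mnt/nvram/table.dat", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    // The array appears as ordinary memory; the OS (or a runtime such as
    // DI-MMAP) pages it in and out on demand.
    auto* table = static_cast<uint64_t*>(
        mmap(nullptr, n * sizeof(uint64_t),
             PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (table == MAP_FAILED) { perror("mmap"); return 1; }

    table[123456789] += 1;                       // direct, byte-level update

    // Flush dirty pages so the update persists -- a lightweight checkpoint
    // compared with rewriting the file through read/write I/O.
    msync(table, n * sizeof(uint64_t), MS_SYNC);

    munmap(table, n * sizeof(uint64_t));
    close(fd);
    return 0;
}
```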

  5. Outline • Out-of-core memory map runtime for block device • Customized page cache management for persistent memory data store • Applications • Graph traversal framework • Metagenomic analysis and classification • Active storage • Future generations of byte-addressable NVRAM storage • Compute logic co-located with storage controller

  6. Systems software for data intensive computing: data intensive memory map Brian Van Essen, LLNL

  7. Views of node local NVRAM • I/O bus • Block device/File system • Persistent memory/Extended memory • Memory hardware interface

  8. Optimize system software for application-specific access patterns • Performance measurements of the system mmap were disappointing • We want the abstraction of memory mapping, but with an implementation tuned for data-centric application classes • Data-intensive memory mapping (DI-MMAP) • DRAM acts as a cache for persistent memory • The data-intensive runtime manages the cache in a pre-allocated memory buffer • Differentiates frequently used locations from random accesses

  9. Buffer management applies caching techniques to NVRAM pages Minimize the amount of effort needed to find a page to evict: • In the steady state a page is evicted on each page fault • Track recently evicted pages to maintain temporal reuse • Allow bulk TLB operations to reduce inter-processor interrupts
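
A rough user-space model of this policy (illustrative only, not the DI-MMAP kernel implementation) might look like the following: a fixed pool of page frames is recycled in FIFO order, pages that fault again shortly after eviction are treated as hot, and the evicted page number is handed back to the caller so TLB invalidations can be batched.

```cpp
// Illustrative model of the buffer policy described above (not actual
// DI-MMAP code). A fixed pool of frames is recycled FIFO; pages faulted in
// again soon after eviction are promoted to a "hot" queue.
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_set>

class PageBuffer {
public:
    explicit PageBuffer(size_t frames) : capacity_(frames) {}

    // Called on a page fault; returns the page number to evict,
    // or UINT64_MAX if a free frame is still available.
    uint64_t fault(uint64_t page) {
        bool reused = recently_evicted_.erase(page) > 0;
        (reused ? hot_ : fifo_).push_back(page);

        if (hot_.size() + fifo_.size() <= capacity_) return UINT64_MAX;

        // Steady state: evict exactly one page per fault, preferring the
        // plain FIFO queue so hot pages stay resident longer.
        std::deque<uint64_t>& victim_q = fifo_.empty() ? hot_ : fifo_;
        uint64_t victim = victim_q.front();
        victim_q.pop_front();
        remember_eviction(victim);
        return victim;      // caller unmaps it, batching TLB invalidations
    }

private:
    void remember_eviction(uint64_t page) {
        recently_evicted_.insert(page);
        if (recently_evicted_.size() > capacity_ / 4)   // bounded history
            recently_evicted_.clear();
    }

    size_t capacity_;
    std::deque<uint64_t> fifo_, hot_;
    std::unordered_set<uint64_t> recently_evicted_;
};
```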

  10. DI-MMAP is being integrated with the Livermore Computing software stack • Scalable out-of-core performance provides an opportunity for working with node-local storage in HPC • Catalyst system • 150 teraflop/s: 324 nodes, 7776 cores • 40 TB DRAM + 253 TB flash • Each node: 128 GiB DRAM + 800 GiB flash • DI-MMAP is integrated with SLURM • Can be loaded by the user • Allocated per job • User specifies the total memory used for buffering external flash https://bitbucket.org/vanessen/di-mmap/

  11. Parallel graph traversal framework • High-visibility data-intensive supercomputing application class • Traverse scale-free graphs with trillions of edges • Social networks, the web • Developed a parallel, latency-tolerant asynchronous traversal framework • Uses a visitor abstraction • A visitor is an application-specific kernel that the framework applies to each graph vertex • A visit results in traversal of the vertex's edges, queuing work on the target vertices • Visitors are queued on vertices using a priority queue (see the BFS sketch below) • Scalable: from a single server with NVRAM, to a data-intensive cluster, to BG/P • 15% faster than the best Intrepid result • Demonstrated with multiple traversal algorithms: breadth-first search, single-source shortest path, (strongly) connected components, triangle counting, k-th core • Multiple Graph500 submissions • Code has been open sourced (https://bitbucket.org/PerMA/havoqgt) and will continue to be updated Roger Pearce, LLNL
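
As sketched below (the names are illustrative, not HavoqGT's actual API), a BFS visitor carries a level with it; applying the visitor to a vertex sets the vertex's distance and queues new visitors on its out-edges through a priority queue.

```cpp
// Hedged sketch of the visitor abstraction for BFS: visiting a vertex
// applies a small kernel (set the level) and queues work on its neighbors.
#include <cstdint>
#include <functional>
#include <limits>
#include <queue>
#include <vector>

struct BFSVisitor {
    uint64_t vertex;
    uint32_t level;
    // Lower level == higher priority, keeping the traversal roughly level-ordered.
    bool operator>(const BFSVisitor& o) const { return level > o.level; }
};

void bfs(const std::vector<std::vector<uint64_t>>& adj, uint64_t source,
         std::vector<uint32_t>& dist) {
    dist.assign(adj.size(), std::numeric_limits<uint32_t>::max());
    std::priority_queue<BFSVisitor, std::vector<BFSVisitor>,
                        std::greater<BFSVisitor>> pq;
    pq.push({source, 0});
    while (!pq.empty()) {
        BFSVisitor v = pq.top(); pq.pop();
        if (v.level >= dist[v.vertex]) continue;  // already visited at a lower level
        dist[v.vertex] = v.level;                 // the visitor's application kernel
        for (uint64_t u : adj[v.vertex])          // queue work on target vertices
            pq.push({u, v.level + 1});
    }
}
```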

  12. BFS graph analysis performance shows DI-MMAP advantages • DI-MMAP provides high performance: 7.44× faster than system mmap with 256 threads and 40 GiB of DRAM • Scalable out-of-core performance: with 50% less DRAM (20 GiB vs. 40 GiB), only 23% slower

  13. Application: Livermore Metagenomics Analysis Toolkit • Challenge: given an “anonymous” sample, produce as complete a report on its pathogen contents as possible • Standard method: sequence alignment – align every read in the sample against every known genome • Our approach: use highly concurrent search and large persistent memory to store and access reference data in a “pre-processed” form that supports efficient runtime queries • A 375 GB k-mer database (24.5 billion k-mer keys) is mapped into memory with DI-MMAP, with tens to hundreds of threads querying it Jonathan Allen, LLNL

  14. We’ve designed a two-level index optimized for NVRAM • COTS hash tables aren’t suitable for page-access storage • Solution: split the k-mer key • The prefix is managed with an array • Suffixes are managed in lists that typically span no more than one page (see the lookup sketch below) • Results: up to 74× speedup over the GNU hash map • The split parameter must be determined; it can depend on the input data set and the k-mer length Sasha Ames, LLNL
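
The lookup can be sketched as follows (the split, field names, and in-memory layout are assumptions for illustration, not the LMAT on-NVRAM format): the high bits of the k-mer index the first-level prefix array, and the low bits are binary-searched in that prefix's small, page-friendly sorted suffix list.

```cpp
// Illustrative two-level split-key index: prefix array + sorted suffix lists.
#include <algorithm>
#include <cstdint>
#include <vector>

struct SuffixEntry { uint64_t suffix; uint32_t value; };

class TwoLevelIndex {
public:
    explicit TwoLevelIndex(unsigned prefix_bits)
        : prefix_bits_(prefix_bits), table_(1ull << prefix_bits) {}

    void insert(uint64_t kmer, uint32_t value) {
        auto& list = table_[kmer >> (64 - prefix_bits_)];
        list.push_back({kmer & suffix_mask(), value});
        std::sort(list.begin(), list.end(),
                  [](const SuffixEntry& a, const SuffixEntry& b) {
                      return a.suffix < b.suffix; });
    }

    // Returns a pointer to the stored value, or nullptr if the k-mer is absent.
    const uint32_t* find(uint64_t kmer) const {
        const auto& list = table_[kmer >> (64 - prefix_bits_)];
        uint64_t suf = kmer & suffix_mask();
        auto it = std::lower_bound(list.begin(), list.end(), suf,
            [](const SuffixEntry& e, uint64_t s) { return e.suffix < s; });
        return (it != list.end() && it->suffix == suf) ? &it->value : nullptr;
    }

private:
    uint64_t suffix_mask() const { return (~0ull) >> prefix_bits_; }
    unsigned prefix_bits_;
    std::vector<std::vector<SuffixEntry>> table_;   // first level: prefix array
};
```

The choice of prefix width is exactly the "split parameter" mentioned above: a wider prefix shrinks each suffix list (fewer page touches per query) at the cost of a larger first-level array.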

  15. LMAT + DI-MMAP performance • General experimental setup: 22 GB original reference index containing microbial (with fungi + protozoa) and human sequences; LMAT (version 1.1) runs with 160 threads on a 40-core Intel node with 1.2 TB of FusionIO flash, and DI-MMAP uses a 16 GB buffer • Plots (omitted here) compare: LMAT with DI-MMAP vs. Linux mmap (8 GB first-level size); LMAT on ramdisk and flash vs. other methods (1 GB first-level size); and the two-level index vs. a hash table on the ERR data set, with each index configuration listing its first-level table size

  16. LMAT experiment runs at scale on the Catalyst cluster • Analyze human genomes consisting of samples from 2,948 individuals collected as part of the NIH 1000 Genomes Project, 100 TB+ compressed • LMAT can answer science questions • Can we build a "more complete" human reference for screening samples such as the microbiome or cancer sequences? • Are there any interesting, previously undetected microbial components in the samples? • What is the statistical profile of k-mer uniqueness across individuals? • Replicate the database on node-local flash; distribute the dataset across nodes

  17. Orders of magnitude range in query rate for different samples

  18. Modest database increase with additional human genomes • Database 1 – original LMAT database • Database 2 – addition of the GenBank human reference • Database 3 – (2) plus addition of k-mers from 1000 Genomes Project samples

  19. More accurate DB means more labeled reads

  20. Expanded database with human variants improves classification https://computation-rnd.llnl.gov/lmat/

  21. Active storage with Minerva (UCSD and LLNL) • Prototyped on the BEE-3 system • 4 Xilinx Virtex-5 FPGAs, each with 16 GB memory, ring interconnect • Emulates a PCM controller by inserting appropriate latency on reads and writes, and emulates wear leveling to store blocks • Gains the advantage of fast PCM plus FPGA-based hardware acceleration • Designed for future generations of NVRAM that are byte-addressable with very low read latency • Based on computation close to storage • Data- or I/O-intensive computations move to the storage to minimize data transfer between the host and the storage • Processing scales as storage expands • Power efficiency and performance gain Arup De, UCSD/LLNL

  22. Minerva architecture provides concurrent access, in-storage processing units

  23. Minerva programming model Normal reads and writes bypass the storage processor. A remote-procedure-call-like interface supports computational “reads” and “writes” (a host-side sketch follows). The storage processor generates an interrupt whenever it requires host intervention. The software driver is responsible for setting up, managing, monitoring, and finally releasing the storage processor on behalf of a given application.
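
A hedged host-side sketch of such an RPC-like interface is shown below; the device node, ioctl number, and command layout are hypothetical, but it illustrates how a computational "read" ships a command descriptor to the storage processor and gets back only a small result rather than raw blocks.

```cpp
// Hypothetical host-side interface for a computational "read": a command
// descriptor is sent to the storage processor, which returns a small result.
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdint>
#include <cstring>

struct MinervaCmd {              // hypothetical command descriptor
    uint32_t opcode;             // e.g. OP_GREP, OP_SAXPY
    uint64_t lba;                // region of storage to operate on
    uint64_t length;
    uint64_t result_buf;         // host buffer for the (small) result
    char     arg[64];            // kernel-specific argument, e.g. search string
};

constexpr uint32_t OP_GREP = 1;
#define MINERVA_EXEC _IOW('M', 0x10, MinervaCmd)   // hypothetical ioctl

int64_t grep_in_storage(const char* pattern, uint64_t lba, uint64_t len) {
    int fd = open("/dev/minerva0", O_RDWR);        // hypothetical device node
    if (fd < 0) return -1;

    int64_t match_count = 0;
    MinervaCmd cmd{};
    cmd.opcode = OP_GREP;
    cmd.lba = lba;
    cmd.length = len;
    cmd.result_buf = reinterpret_cast<uint64_t>(&match_count);
    std::strncpy(cmd.arg, pattern, sizeof(cmd.arg) - 1);

    // The driver sets up the storage processor, monitors progress, and
    // releases it when the command completes.
    int rc = ioctl(fd, MINERVA_EXEC, &cmd);
    close(fd);
    return rc == 0 ? match_count : -1;
}
```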

  24. Interaction with the storage processor (block-diagram walkthrough) • The control processor receives a command over the ring interface, loads it into local memory, and reads it • Through the request manager, the DMA engine streams the referenced data into local memory • The compute manager and compute scheduler initiate computation on the compute kernels (K0 … Kn; the example kernel shown is a grep engine with a comparator, reference string, and word counter) • On receiving the result, the storage processor updates the command status and the response generator sends back a completion notification
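
As a software analogue of that kernel dataflow (the real comparator/grep FSM is FPGA logic, so this is purely illustrative), a streaming string-match kernel consumes data chunk by chunk from local memory, carries partial matches across chunk boundaries, and accumulates a count for the response generator:

```cpp
// Illustrative software model of the streaming grep kernel: count occurrences
// of a reference string in data delivered in fixed-size chunks.
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

class StreamingGrep {
public:
    explicit StreamingGrep(std::string_view ref) : ref_(ref) {}

    // Consume one DMA-sized chunk of data from "local memory".
    void consume(const char* chunk, size_t len) {
        carry_.append(chunk, len);
        size_t pos = 0;
        while ((pos = carry_.find(ref_, pos)) != std::string::npos) {
            ++count_;
            ++pos;
        }
        // Keep only the tail that could still start a match across the
        // next chunk boundary.
        if (carry_.size() >= ref_.size())
            carry_.erase(0, carry_.size() - (ref_.size() - 1));
    }

    uint64_t matches() const { return count_; }   // response generator's payload

private:
    std::string ref_;      // reference string loaded by the control processor
    std::string carry_;    // bytes carried across chunk boundaries
    uint64_t count_ = 0;   // word/match counter
};
```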

  25. Minerva Prototype • Built on BEE3 board • PCIe 1.1 x8 host connection • Clock frequency 250MHz • Virtex 5 FPGA implements • Request scheduler • Network • Storage processor • Memory controller • DDR2 DRAM emulates PCM

  26. Minerva FPGA emulator allows us to estimate performance and power • The emulator lets us compare standard I/O with the computational-offload approach, design alternative storage-processor architectures and evaluate trade-offs, run benchmarks and mini-apps in real time, and estimate power draw and energy for different technology nodes • Streaming string search: 8+× performance, 10+× energy efficiency • Saxpy: 4× performance, 4.5× energy efficiency • Key-value store (key lookup): 1.6× performance, 1.7× energy efficiency • B+ tree search: 4+× performance, 5× energy efficiency

  27. Contributors • Further reading • Arup De, Maya Gokhale, Rajesh Gupta, and Steven Swanson, Minerva: Accelerating Data Analysis in Next-Generation SSDs, Proceedings of Field-Programmable Custom Computing Machines (FCCM), April 2013. • Brian Van Essen, Henry Hsieh, Sasha Ames, Roger Pearce, and Maya Gokhale, DI-MMAP: a scalable memory map runtime for out-of-core data-intensive applications, Cluster Computing, Oct. 2013 (open access). • Sasha K. Ames, David A. Hysom, Shea N. Gardner, G. Scott Lloyd, Maya B. Gokhale, and Jonathan E. Allen, Scalable metagenomic taxonomy classification using a reference genome database, Bioinformatics, July 2, 2013 (open access). • Roger Pearce, Maya Gokhale, and Nancy Amato, Scaling Techniques for Massive Scale-Free Graphs in Distributed (External) Memory, IEEE International Parallel and Distributed Processing Symposium, May 2013. • Active storage: Arup De (Lawrence Scholar, UCSD) • Data-intensive memory map: Brian Van Essen • Metagenomic analysis: Sasha Ames; Jonathan Allen – PI of the metagenomics LDRD • Graph algorithms: Roger Pearce • Collaborators • UCSD – active storage • SDSC – data-intensive architectures • UCSC – resistive RAM as content-addressable memory • TAMU – graph algorithms • Argonne, ASCR – data-intensive memory map for exascale

  28. In-cluster persistent memory addresses capacity, power, and latency/bandwidth gaps • In-storage computing can improve performance and energy-delay product (EDP), particularly for low-latency, byte-addressable persistent memory attached to the I/O bus. • Extending the in-memory model to present-day flash-based block devices has demonstrated significant performance gains. • Highly concurrent, throughput-driven algorithms can achieve excellent performance when accessing NVRAM through the optimized software stack.
