Data Intensive Computing at Sandia. September 15, 2010 Andy Wilson Senior Member of Technical Staff Data Analysis and Visualization Sandia National Laboratories.
Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy’s National Nuclear Security Administration under contract DE-AC04-94AL85000.
The Question: What is Data-Intensive Computing?
My Answer: Parallel computing where you design your algorithms and your software around efficient access and traversal of a data set, and where hardware requirements are dictated by data size as much as by desired run times. Usually this means distilling compact results from massive data.
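As a minimal sketch of "distilling compact results from massive data", the Python below makes a single pass over each data partition and keeps only a small, mergeable summary (count, mean, variance). It is an illustration under stated assumptions, not code from the talk: the names `Summary`, `summarize`, and `summarize_partitions` are hypothetical, and the partitions stand in for data that would normally live on separate nodes or disks.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Summary:
    """Compact, mergeable result: count, mean, and sum of squared deviations."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def update(self, x: float) -> None:
        # Welford's one-pass update: every value is read exactly once.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "Summary") -> "Summary":
        # Pairwise combination of two partial summaries, so partitions
        # can be processed independently and reduced afterwards.
        if other.n == 0:
            return Summary(self.n, self.mean, self.m2)
        if self.n == 0:
            return Summary(other.n, other.mean, other.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Summary(n, mean, m2)

    @property
    def variance(self) -> float:
        # Population variance of everything seen so far.
        return self.m2 / self.n if self.n else 0.0


def summarize(chunk: Iterable[float]) -> Summary:
    """Traverse one partition once and return its compact summary."""
    s = Summary()
    for x in chunk:
        s.update(x)
    return s


def summarize_partitions(partitions: List[Iterable[float]]) -> Summary:
    """Reduce per-partition summaries into one global summary."""
    total = Summary()
    for part in partitions:
        total = total.merge(summarize(part))
    return total


if __name__ == "__main__":
    result = summarize_partitions([[1, 2, 3], [4, 5], [6]])
    print(result.n, result.mean, result.variance)
```

The design point is that the summary has a fixed size regardless of how much data is traversed, so the cost is dominated by data access rather than arithmetic.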
Outline
• What is Data-Intensive Computing?
• Data-Intensive Computing at Sandia
  • Physics
  • Informatics
  • Architectures
• Into the Future
Traditional Visualization Workflow
Solver → Full Mesh → Disk Storage → Visualization
Traditional In-Situ Visualization
Solver → Full Mesh → Disk Storage → Visualization
Solver → Visualization → Images → Disk Storage
Coprocessing
Solver → Full Mesh → Disk Storage → Visualization
Solver → Features & Statistics → Salient Data → Disk Storage → Visualization
Solver → Visualization → Images → Disk Storage
Community Detection in Networks
• Find many small groups of vertices and/or edges
• O(n) communities
• Overlaps may be allowed
• Hundreds of papers in physics and computer science
[Lancichinetti, Fortunato, Radicchi 2008]
Analysis of Massive Graphs
• Finding communities: a kernel of social network analysis
• “Dunbar’s number” from sociology: there is a size limit (~150) on stable social group size (from Neolithic farming village to academic sub-discipline)
Image: Twitter social network (|V|≈200M) [Akshay Java, 2007]
Collapsed Dendrograms and Statistical Confidence: wCNM
The wCNM partitioning is much deeper, resolving smaller communities.
The statistically significant variation is visually close, but does not reproduce ground truth as well.
The (much better) wCNM solution also has a statistically significant variation.
Image credit: Titan
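The slides do not define wCNM. Assuming it refers to a weighted variant of the Clauset-Newman-Moore (CNM) greedy heuristic, the quantity such methods try to increase is the modularity of a partition. The sketch below only scores a given partition of a weighted, undirected graph; it is not the wCNM algorithm itself, and the function name `modularity` and the input layout are hypothetical.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman modularity Q of a partition of a weighted undirected graph.

    edges: iterable of (u, v, weight) with u != v (self-loops are not handled)
    community_of: dict mapping each vertex to a community label
    """
    total = 0.0                    # total edge weight W
    internal = defaultdict(float)  # weight of edges with both ends in a community
    degree = defaultdict(float)    # summed weighted degree of each community
    for u, v, w in edges:
        total += w
        cu, cv = community_of[u], community_of[v]
        degree[cu] += w
        degree[cv] += w
        if cu == cv:
            internal[cu] += w
    if total == 0.0:
        return 0.0
    # Q = sum over communities of (internal weight / W) - (degree / 2W)^2
    return sum(internal[c] / total - (degree[c] / (2.0 * total)) ** 2 for c in degree)


if __name__ == "__main__":
    # Two triangles joined by a single edge, unit weights.
    edges = [
        ("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0),
        ("d", "e", 1.0), ("e", "f", 1.0), ("d", "f", 1.0),
        ("c", "d", 1.0),
    ]
    split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
    merged = {v: 0 for v in split}
    print(modularity(edges, split), modularity(edges, merged))
```

Placing each triangle in its own community scores higher than lumping all six vertices together, which is the kind of comparison a greedy merge heuristic makes at each step.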
LSA and LDA from 5 miles up (LDA) Image credit: Dave Robinson
LSA/LDA: Increasing Data Size, Single Processor (Straight Line = Linear Scaling, Lower = Faster)
LSA/LDA: Weak Scaling (Bigger Problem, Same Time) (Flat Lines = Perfect Scaling)
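The two readings given on these slides ("straight line = linear scaling" and "flat lines = perfect scaling") correspond to two simple ratios. The sketch below shows how they might be computed from measured timings; the function names are hypothetical, and the numbers in the usage section are made-up placeholders, not results from the talk.

```python
def time_per_item(times_by_size):
    """Single-processor scaling: seconds per data item at each data size.

    A roughly constant value across sizes means run time grows linearly
    with data size (the straight line on the plot).
    """
    return {size: t / size for size, t in sorted(times_by_size.items())}


def weak_scaling_efficiency(times_by_procs):
    """Weak scaling: run time at the smallest processor count divided by run
    time at each processor count, with the problem size grown in proportion
    to the processor count.

    A value of 1.0 everywhere is the flat line, i.e. perfect weak scaling.
    """
    base_procs = min(times_by_procs)
    base_time = times_by_procs[base_procs]
    return {p: base_time / t for p, t in sorted(times_by_procs.items())}


if __name__ == "__main__":
    # Placeholder timings in seconds, purely illustrative.
    print(time_per_item({1000: 2.0, 2000: 4.0, 4000: 8.0}))
    print(weak_scaling_efficiency({1: 10.0, 2: 10.0, 4: 12.5}))
```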
NGC System Diagram
“This project seeks to bring these two strengths – a solid reputation for excellence in computing, and our niche expertise in specific classes of intelligence analysis – to bear on a thorny problem: developing advanced informatics capabilities that are both usable and useful to analysts who are drowning in data.” (NGC project proposal)
Diagram layers: Architectures, Algorithms, Data, Web Services, Applications (Clients)
• Applications (Clients): Titan, browser
• Trilinos, Algebraic Methods: Clustering, Ranking, High Dimensional Mapping
• MTGL, Graph Methods: Subgraph searches, Connection sg’s, Shortest Path, etc.
• Titan: Analysis Pipelines, Capability Integration, Data Access, Lightweight analysis
Other diagram labels: Specialized Distributed Data Operations; Highly optimized; Iterative, flexible
SQL Service: Enables Remote Access to Data Warehouse Appliances (DWA)
The SQL service enables an HPC application to access a remote DWA.
SQL Service*
• Provides “bridge” between parallel apps and external DWA
• Runs on Red Storm network nodes
• Titan applications communicate with service through Portals
• External resources (Netezza) communicate through standard interfaces (e.g. ODBC over TCP/IP)
Additional Modifications for Multilingual
• Tokenization support on Netezza (goal is to count unique words)
• Developed a custom UTF-8 word splitter for SPU (snippet processing unit)
• Allows parallel tokenization and counting at storage device
Diagram labels: Analyst; HPC System (Red Storm); DWA; Service Nodes (GUI and Database Services); Compute Nodes (Titan Analysis Code); High-Speed Network (Portals); Netezza; LexisNexis; Other ODBC DWA; TCP/IP; SQL; locations: Anywhere, Tech Area 1, CSRI
* Results of SQL access from parallel statistics code presented at CUG’2009.
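The "parallel tokenization and counting at storage device" item can be illustrated with a small stand-in: each data partition is tokenized and counted where it lives, and only the compact per-partition counts are merged. This is a plain-Python sketch, not Netezza SPU code; `count_words` and `count_across_partitions` are hypothetical names, and the word-splitting rule (Unicode word characters, lowercased) is an assumption rather than the splitter described on the slide.

```python
import re
from collections import Counter

# For str patterns in Python 3, \w matches Unicode word characters,
# so accented and non-Latin letters are kept inside tokens.
WORD = re.compile(r"\w+")


def count_words(raw: bytes) -> Counter:
    """Tokenize one UTF-8 encoded partition and count its words."""
    text = raw.decode("utf-8", errors="replace")
    return Counter(token.lower() for token in WORD.findall(text))


def count_across_partitions(partitions) -> Counter:
    """Merge per-partition counts; only the small Counters move, not the text."""
    total = Counter()
    for raw in partitions:
        total.update(count_words(raw))
    return total


if __name__ == "__main__":
    partitions = [
        "Die Daten, die daten".encode("utf-8"),
        "données Daten".encode("utf-8"),
    ]
    counts = count_across_partitions(partitions)
    print(len(counts), counts.most_common())
```

In this sketch the per-partition loop is sequential; the point is only that each partition's count is independent of the others, so the counting step could run in parallel next to the data.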
Into the Future
• I don’t care about flops anymore. I care about mops.
• I want to send more complex requests to the storage system.
• There is no one perfect architecture.