Combinatorial Scientific Computing: A View to the Future • Bruce Hendrickson • Senior Manager for Math & Computer Science • Sandia National Laboratories, Albuquerque, NM • University of New Mexico, Computer Science Dept.
Combinatorial Scientific Computing • The development, application and analysis of combinatorial algorithms to enable scientific and engineering computations • Highlighted areas from a survey talk I composed in 2003 • Sparse matrices (direct & iterative methods) • Optimization & derivatives • Parallel computing • Mesh generation • Statistical physics • Chemistry • Biology
A Brief History of CSC • Grew out of a series of minisymposia at SIAM meetings • Deeper origins in • Sparse direct methods community (1950s and onward) • Statistical physics – graphs and Ising models (1940s & 1950s) • Chemical classification (1800s, Cayley) • Recognition of a common esthetic, techniques and goals among researchers who were far apart in traditional scientific taxonomy • Name selected in 2002 • After lengthy email discussion among ~30 people • Now, ~3000 hits for “combinatorial scientific computing” on Google
Previous Milestones • This is the 4th major CSC workshop • SIAM ’04 (with Parallel Processing) • Organizers J. Gilbert, B. Hendrickson, A. Pothen, H. Simon, S. Toledo • CERFACS ’05 • SIAM ’07 (with Computational Science & Engineering) • Coming soon: • SIAM ’09 (with Applied Linear Algebra) • SIAM ’11 (with Optimization) (?) • Special issue of ETNA in 2004 • Importance recognized by scientific community and funding agencies
Invited Speakers from Earlier CSC Workshops • Richard Brualdi (combinatorial matrix theory) • Dan Gusfield (computational biology) • Shang-Hua Teng (smoothed analysis of algorithms) • Stan Eisenstat (sparse direct methods) • Dan Halperin (geometric algorithms) • Denis Trystram (parallel scheduling) • Iain Duff (sparse direct methods) • Phil Duxbury (statistical physics)
Outline • A look back: • A brief history of a brief history • A look ahead: • New application opportunities: data-centric computing • Graph models of information retrieval • Emerging science of complex networks • Architectural revolution: challenges and promise • Challenges of near-future machines • Potential architectures for discrete problems • Conclusions
Data-Centric Computing • Many science disciplines generate huge data sets • Biology, astronomy, high-energy physics, environmental science, social sciences (internet data), etc. • Important scientific knowledge lurks within this data • What abstractions and algorithms are needed? • Claim: • Combinatorial algorithms have an important role to play • “Combinatorial problems generated by challenges in data mining and related topics are now central to computational science.” • [I. Beichl & F. Sullivan, 2008]
Example I: Information Retrieval • Consider a document corpus • Each document is a “bag of words” • Represent as a non-negative term/document matrix $A$ ($t$ terms by $d$ documents) • $A(i,j)$ encodes the frequency of term $i$ in document $j$ • A set of terms in a query can be thought of as a vector $q$ • Large entries in $A^T q$ identify good matches for retrieval
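The retrieval step on this slide can be sketched in a few lines of plain Python. The toy corpus, the helper names `query_vector` and `match_scores`, and the use of raw counts as frequencies are all assumptions made for illustration; they are not from the talk.

```python
# Hypothetical toy corpus; each document is treated as a bag of words.
docs = [
    "sparse matrix factorization",
    "graph partitioning for parallel computing",
    "sparse graph algorithms",
]
terms = sorted({w for d in docs for w in d.split()})
term_index = {t: i for i, t in enumerate(terms)}

# A[i][j] = frequency of term i in document j (raw counts, an assumption).
A = [[0] * len(docs) for _ in terms]
for j, d in enumerate(docs):
    for w in d.split():
        A[term_index[w]][j] += 1

def query_vector(words):
    """Build q over the same term space; words outside the corpus are ignored."""
    q = [0] * len(terms)
    for w in words:
        if w in term_index:
            q[term_index[w]] += 1
    return q

def match_scores(A, q):
    """Return A^T q: one score per document."""
    n_docs = len(A[0])
    return [sum(A[i][j] * q[i] for i in range(len(A))) for j in range(n_docs)]

q = query_vector(["sparse", "graph"])
scores = match_scores(A, q)
best = max(range(len(scores)), key=scores.__getitem__)
```

Here the third document contains both query terms, so it receives the largest entry of $A^T q$.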
Latent Semantic Analysis • LSA uses truncated SVD for dimension reduction • $A \approx U_k \Sigma_k V_k^T$ • Retrieval query now becomes • $A^T q \approx V_k \Sigma_k U_k^T q$ • Widely used idea to reduce noise and reduce query expense • [Deerwester, et al., 1990] • Basic idea has many applications • Image recognition, machine translation, pattern recognition, etc.
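A minimal NumPy sketch of the truncated-SVD query follows. The function name `lsa_scores` and the small matrix are hypothetical, and a dense SVD is used only for brevity; a large sparse corpus would normally call for an iterative solver.

```python
import numpy as np

def lsa_scores(A, q, k):
    """Approximate A^T q using the rank-k truncated SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T
    # A^T q is approximated by V_k * Sigma_k * U_k^T * q
    return Vk @ (sk * (Uk.T @ q))

# Hypothetical 4-term by 3-document matrix and a two-term query.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
q = np.array([1.0, 1.0, 0.0, 0.0])

approx = lsa_scores(A, q, k=2)   # reduced-rank scores
full = lsa_scores(A, q, k=3)     # k = rank(A) reproduces A^T q
exact = A.T @ q
```

With `k` equal to the rank of `A` the scores match $A^T q$ up to rounding; smaller `k` trades fidelity for a smaller, smoother representation.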
Graph Based Alternative • View the term-document matrix as a bipartite graph • Terms and documents have weighted links if they are related • Embed the graph in a low dimensional space using (for example) Laplacian eigenvectors • Given a query vector, map it to same space and look for nearby documents • Fiedler retrieval [H., 2007] • Algebraically, this involves low eigenvectors of the graph Laplacian $L = \begin{pmatrix} D_1 & -A \\ -A^T & D_2 \end{pmatrix}$, where $D_1$ and $D_2$ are the diagonal matrices of term and document degrees • Note that LSA involves low eigenvectors of $\begin{pmatrix} 0 & -A \\ -A^T & 0 \end{pmatrix}$
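The embedding idea can be illustrated with a small dense sketch. It builds the bipartite graph Laplacian, takes its lowest nontrivial eigenvectors as coordinates, and ranks documents by distance to a query point. Placing the query at the weighted centroid of its terms is a simplifying assumption made here, not necessarily the mapping used in [H., 2007]; the function names and the toy matrix are likewise hypothetical.

```python
import numpy as np

def fiedler_embedding(A, k):
    """Coordinates for terms and documents from the k lowest nontrivial
    eigenvectors of the bipartite graph Laplacian."""
    t, d = A.shape
    W = np.block([[np.zeros((t, t)), A],
                  [A.T, np.zeros((d, d))]])
    L = np.diag(W.sum(axis=1)) - W
    vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    X = vecs[:, 1:k + 1]             # skip the constant vector (graph assumed connected)
    return X[:t], X[t:]

def rank_documents(A, q, k):
    """Return document indices sorted from nearest to farthest from the query."""
    term_xy, doc_xy = fiedler_embedding(A, k)
    query_xy = (q @ term_xy) / q.sum()   # weighted centroid of query terms (assumption)
    dist = np.linalg.norm(doc_xy - query_xy, axis=1)
    return np.argsort(dist)

# Hypothetical corpus: terms 0-1 belong to document 0, terms 2-3 to document 1,
# with one weak link (0.1) keeping the graph connected.
A = np.array([[1.0, 0.0],
              [1.0, 0.1],
              [0.0, 1.0],
              [0.0, 1.0]])
q = np.array([1.0, 0.0, 0.0, 0.0])
ranking = rank_documents(A, q, k=1)
```

Because terms and documents share one coordinate space, the same distance computation could rank terms against a document, or mix both kinds of objects in a single query.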
Advantages of Graph Representation • Terms & Documents live in same space • Principled method for adding doc-doc or term-term similarities • E.g. former from dictionary, latter from citation analysis or hyperlinks • Unified text and link analysis • Supports more complex queries • “similar to these documents and these terms” • Supports extensions to more classes of objects. • E.g., instead of just term-document, could do term-document-author.
Example II: Network Science • Graphs are ideal for representing entities and relationships • Rapidly growing use in social, environmental, and other sciences • The way it was: Zachary’s karate club (|V|=34) • The way it is now: Twitter social network (|V|≈20K)
New Questions • New algorithms • Community detection, centrality, graph generation, etc. • Right set of questions and concepts still emerging. • New issues • Noisy, error-filled data. What can we conclude robustly? • Semantic graphs with edges and vertices of different types. • E.g. people, organizations, events • How should this be exploited algorithmically? • Multilinear instead of linear algebra? • New paradigms: • E.g. graph evolves over time • Temporal analysis, dynamics, streaming algorithms on graphs, etc • Enormous opportunities for combinatorial algorithms
Outline • A look back: • A brief history of a brief history • A look ahead: • New application opportunities: data-centric computing • Graph models of information retrieval • Emerging science of complex networks • Architectural revolution: challenges and promise • Challenges of near-future machines • Potential architectures for discrete problems • Conclusions
A Renaissance in Architecture Research • Good news • Moore’s Law marches on • Real estate on a chip is essentially free • Major paradigm change – huge opportunity for innovation • Bad news • Power considerations limit the improvement in clock speed • Eventual consequences are unclear • Current response: multicore processors • Computation/communication ratio will get worse • Makes life harder for applications
Applications Also Getting More Complex • Leading edge scientific applications increasingly include: • Adaptive, unstructured data structures • Complex, multiphysics simulations • Multiscale computations in space and time • Complex synchronizations (e.g. discrete events) • Significant parallelization challenges on today’s machines • Finite degree of coarse-grained parallelism • Load balancing and memory hierarchy optimization • Dramatically harder on millions of cores • Huge need for new algorithmic ideas – CSC will be critical
Architectural Challenges for Graph Algorithms • Runtime is dominated by latency • Particularly true for data-centric applications • Random accesses to global address space • Perhaps many at once – fine-grained parallelism • Essentially no computation to hide access time • Access pattern is data dependent • Prefetching unlikely to help • Usually only want small part of cache line • Potentially abysmal locality at all levels of memory hierarchy
Locality Challenges • [Figure: locality plot with regions labeled “What we traditionally care about”, “Emerging Codes”, and “What industry cares about”] • From: Murphy and Kogge, On The Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications, IEEE T. on Computers, July 2007
Example: AMD Opteron • [Figure: block diagram of the AMD Opteron, built up over four slides] • Group labels: Memory (Latency Avoidance); Out-of-Order Exec, Load/Store, Mem/Coherency (Latency Tolerance); Memory and I/O Interfaces; COMPUTER • Components shown: L1 D-Cache, L1 I-Cache, L2 Cache, Load/Store Unit, I-Fetch Scan Align, Memory Controller, Bus, DDR, HT, FPU Execution, Int Execution • Thanks to Thomas Sterling
Architectural Wish List for Graphs • Low latency / high bandwidth • For small messages! • Latency tolerant • Light-weight synchronization mechanisms • Global address space • No graph partitioning required • Avoid memory-consuming profusion of ghost-nodes • No local/global numbering conversions • One machine with these properties is the Cray MTA-2 • And successor XMT
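To make the bookkeeping that a global address space avoids more concrete, here is a minimal sketch of a distributed-memory layout. It assumes a 1D block partition of vertices across ranks; the partition scheme and all function names are hypothetical choices for illustration.

```python
def block_size(n, p):
    """Vertices per rank under a 1D block partition (ceiling division)."""
    return -(-n // p)

def owner(v, n, p):
    """Rank that owns global vertex v."""
    return v // block_size(n, p)

def to_local(v, n, p):
    """Local index of global vertex v on its owning rank."""
    return v % block_size(n, p)

def to_global(rank, local, n, p):
    """Global index of a rank's local vertex."""
    return rank * block_size(n, p) + local

def ghost_vertices(edges, rank, n, p):
    """Off-rank vertices that `rank` must mirror because an edge crosses the cut."""
    ghosts = set()
    for u, v in edges:
        if owner(u, n, p) == rank and owner(v, n, p) != rank:
            ghosts.add(v)
        if owner(v, n, p) == rank and owner(u, n, p) != rank:
            ghosts.add(u)
    return ghosts

# Hypothetical 8-vertex graph split across 2 ranks.
n, p = 8, 2
edges = [(0, 1), (3, 4), (3, 7), (5, 6)]
ghosts_rank0 = ghost_vertices(edges, 0, n, p)
ghosts_rank1 = ghost_vertices(edges, 1, n, p)
```

On a poorly partitionable graph most edges cross the cut, so the ghost sets grow toward the size of the graph itself. With a global address space, none of these conversions or copies would be needed.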
How Does the MTA Work? • Latency tolerance via massive multi-threading • Context switch in a single tick • Global address space, hashed to reduce hot-spots • No cache or local memory. • Multiple outstanding loads • Remote memory request doesn’t stall processor • Other streams work while your request gets fulfilled • Light-weight, word-level synchronization • Minimizes conflicts, enables parallelism • Flexible dynamic load balancing • Notes: • 220 MHz clock • Largest machine is 40 processors
Case Study I: MTA-2 vs. BlueGene • With LLNL, implemented S-T shortest paths in MPI • Ran on IBM/LLNL BlueGene/L, world’s fastest computer • Finalist for 2005 Gordon Bell Prize • 4B vertex, 20B edge, Erdős-Rényi random graph • Analysis: touches about 200K vertices • Time: 1.5 seconds on 32K processors • Ran similar problem on MTA-2 • 32 million vertices, 128 million edges • Measured: touches about 23K vertices • Time: 0.7 seconds on one processor, 0.09 seconds on 10 processors • Conclusion: 4 MTA-2 processors = 32K BlueGene/L processors • [Berry, H., Kahan, Konecny, 2007]
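One way to read the slide's conclusion is as a back-of-the-envelope scaling of the quoted figures. The sketch below assumes that MTA-2 runtime grows linearly with the number of touched vertices and that MTA-2 speedup is ideal; both are assumptions made here, not statements from the talk.

```python
# Figures quoted on the slide.
bgl_time_s = 1.5          # BlueGene/L, 32K processors
bgl_touched = 200_000     # vertices touched (analysis)
mta_time_s_1proc = 0.7    # MTA-2, one processor
mta_touched = 23_000      # vertices touched (measured)

# Assumption: MTA-2 time scales linearly with touched vertices.
mta_time_same_work_s = mta_time_s_1proc * bgl_touched / mta_touched

# Assumption: ideal speedup, so processors needed = serial time / target time.
mta_procs_needed = mta_time_same_work_s / bgl_time_s
```

Under these assumptions a single MTA-2 processor would need about 6 seconds for the larger search, so roughly 4 processors would match the 1.5 second BlueGene/L time. The measured 0.09 seconds on 10 processors (about 7.8x speedup) suggests the ideal-speedup assumption is optimistic but not unreasonable at this scale.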
Case Study II: Single Source Shortest Path • Parallel Boost Graph Library (PBGL) • Lumsdaine, et al., on Opteron cluster • Some graph algorithms can scale on some inputs • PBGL - MTA Comparison on SSSP • Erdős-Rényi random graph ($|V| = 2^{28}$) • PBGL SSSP can scale on non-power law graphs • Order of magnitude speed difference • 2 orders of magnitude efficiency difference • Big difference in power consumption • [Figure: SSSP time (s) vs. number of processors, for PBGL and MTA] • [Lumsdaine, Gregor, H., Berry, 2007]
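For readers unfamiliar with the problem being benchmarked, a serial reference for single source shortest path is sketched below using Dijkstra's algorithm with a binary heap. This is only a baseline to define the computation; it is not the PBGL or MTA implementation, and the adjacency-list format and toy graph are assumptions.

```python
import heapq

def sssp(adj, source):
    """Dijkstra's algorithm for non-negative edge weights.

    adj maps each vertex to a list of (neighbor, weight) pairs.
    Returns a dict of shortest distances to every reachable vertex.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path to u was already found
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical 4-vertex directed graph.
adj = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("c", 2.0), ("d", 5.0)],
    "c": [("d", 1.0)],
    "d": [],
}
dist = sssp(adj, "a")
```

Each relaxation reads `dist[v]` for a neighbor chosen by the data, which is the kind of unpredictable, fine-grained memory access that makes the problem latency-bound on large graphs.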
Longer Term Architectural Opportunities • Near future trends • Multithreading to tolerate latencies • XMT-like capability on commodity machines? • Potential big impact on latency-dominated applications (e.g. graphs) • Further out • Application-specific circuitry • E.g. hashing, feature detection, etc. • Reconfigurable hardware? • Adapt circuits to the application at run time • Lots of new combinatorial problems in these alternative computing models
Conclusions • CSC is in robust health • Growing in breadth, depth, impact and visibility • Trends in science play to our strengths • Growing complexity of traditional applications requires more CSC • Unstructured, adaptive meshes; bigger problems; multiphysics; optimization; etc. • New science domains with combinatorial needs are emerging • Social sciences, ecology, structural biology, etc. • Many sciences are becoming more data-rich • Complex computers require new discrete algorithms • We can help applications on multicore nodes, and maybe influence future architectures • Enormous need for new models and algorithmic improvements • It’s a great time to be doing CSC!
Thanks • Cevdet Aykanat, Jon Berry, Rob Bisseling, Erik Boman, Bill Carlson, Ümit Çatalürek, Edmond Chow, Karen Devine, Iain Duff, Danny Dunlavy, Alan Edelman, Jean-Loup Faulon, John Gilbert, Assefaw Gebremedhin, Mike Heath, Paul Hovland, Simon Kahan, Pat Knupp, Tammy Kolda, Gary Kumfert, Fredrik Manne, Mike Merrill, Richard Murphy, Esmond Ng, Ali Pınar, Cindy Phillips, Steve Plimpton, Alex Pothen, Robert Preis, Padma Raghavan, Steve Reinhardt, Suzanne Rountree, Rob Schreiber, Viral Shah, Jonathan Shewchuk, Horst Simon, Dan Spielman, Shang-Hua Teng, Sivan Toledo, Keith Underwood, etc.