Condor Project: Distributed Computing Research at UW

Welcome to CW 2008!!!

The Condor Project (Established ‘85)
Distributed computing research performed by a team of ~35 faculty, full-time staff and students who
• face software/middleware engineering challenges in a UNIX/Linux/Windows/OS X environment,
• are involved in national and international collaborations,
• interact with users in academia and industry,
• maintain and support a distributed production environment (more than 4000 CPUs at UW),
• and educate and train students.

“… Since the early days of mankind the primary motivation for the establishment of communities has been the idea that by being part of an organized group the capabilities of an individual are improved. The great progress in the area of inter-computer communication led to the development of means by which stand-alone processing sub-systems can be integrated into multi-computer ‘communities’. …”
Miron Livny, “Study of Load Balancing Algorithms for Decentralized Distributed Processing Systems”, Ph.D. thesis, July 1983.

Main Threads of Activities
• Distributed Computing Research – develop and evaluate new concepts, frameworks and technologies
• Keep Condor “flight worthy” and support our users
• The Open Science Grid (OSG) – build and operate a national High Throughput Computing infrastructure
• The Grid Laboratory Of Wisconsin (GLOW) – build, maintain and operate a distributed computing and storage infrastructure on the UW campus
• The NSF Middleware Initiative – develop, build and operate a national Build and Test facility powered by Metronome

Future of Grid Computing CHEP 07

The Talmud says in the name of Rabbi Yochanan, “Since the destruction of the Temple, prophecy has been taken from prophets and given to fools and children.” (Baba Batra 12b)

The Grid Computing Movement
I believe that, as a movement, grid computing has run its course.
• No longer an easy source of funding
• No longer an easy way to get the “troops” mobilized
• No longer an easy sell of software tools
• No longer an easy way to get your papers published or your press releases posted

Introduction
“The term “the Grid” was coined in the mid 1990s to denote a proposed distributed computing infrastructure for advanced science and engineering [27]. Considerable progress has since been made on the construction of such an infrastructure (e.g., [10, 14, 36, 47]) but the term “Grid” has also been conflated, at least in popular perception, to embrace everything from advanced networking to artificial intelligence. One might wonder if the term has any real substance and meaning. Is there really a distinct “Grid problem” and hence a need for new “Grid technologies”? If so, what is the nature of these technologies and what is their domain of applicability? While numerous groups have interest in Grid concepts and share, to a significant extent, a common vision of Grid architecture, we do not see consensus on the answers to these questions.”
“The Anatomy of the Grid: Enabling Scalable Virtual Organizations”, Ian Foster, Carl Kesselman and Steven Tuecke, 2001.

Distributed Computing
Distributed computing is here to stay and will continue to evolve as processing, storage and communication resources get more powerful and cheaper
• Big science is inherently distributed
• Most scientific disciplines (and many commercial sectors) depend on High Throughput Computing (HTC) capabilities

Keynote 3: When All Computing Becomes Grid Computing
Speaker: Prof. Daniel A. Reed, Chancellor’s Eminent Professor; Director, Renaissance Computing Institute, University of North Carolina at Chapel Hill
Abstract: Scientific computing is moving rapidly from a world of “reliable, secure parallel systems” to a world of distributed software, virtual organizations and high-performance, though unreliable parallel and distributed systems with few guarantees of availability and quality of service. In addition, a tsunami of new experimental and computational data poses equally vexing problems in analysis, transport, visualization and collaboration. This transformation poses daunting scaling and reliability challenges and necessitates new approaches to collaboration, software development, performance measurement, system reliability and coordination. This talk describes Renaissance approaches to solving some of today’s most challenging scientific and societal problems using Grids and parallel systems, supported by rich tools for performance analysis, reliability assessment and workflow management.

As we return to the fundamentals and stay away from hype and the technologies of the moment, we will advance the state of the art in distributed computing

Our HTC Community is Stronger than Ever

Downloads per month

Fractions per month

Language Weaver Executive Summary
• Incorporated in 2002
• USC/ISI startup that commercializes statistical-based machine translation software
• Continuously improved language pair offering in terms of language pair coverage and translation quality
• More than 50 language pairs
• Center of excellence in Statistical Machine Translation and Natural Language Processing

IT Needs
• The Language Weaver Machine Translation systems are trained automatically on large amounts of parallel data.
• Training/learning processes implement workflows with hundreds of steps, which use thousands of CPU hours and which generate hundreds of gigabytes of data
• Robust/fast workflows are essential for rapid experimentation cycles

Solution: Condor
• Condor-specific workflows adequately manage thousands of atomic computational steps/day.
• Advantages:
  • Robustness – good recovery from failures
  • Well-balanced utilization of existing IT infrastructure
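The idea of a multi-step workflow that recovers from failures can be illustrated with a minimal Python sketch. It is not Condor's or Language Weaver's implementation: the function `run_workflow`, the `max_retries` parameter and the step names (`tokenize`, `align`, `train`) are all hypothetical, and the steps run locally in sequence rather than on a pool of machines. It only shows steps executed in dependency order, with a failed step retried a few times before the workflow gives up.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(steps, deps, max_retries=2):
    """Run steps in dependency order, retrying a failed step a few times.

    steps: dict mapping a step name to a zero-argument callable (hypothetical).
    deps:  dict mapping a step name to the set of steps it depends on.
    Returns the step names in the order they completed.
    """
    completed = []
    # static_order() yields every step after all of its dependencies.
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(max_retries + 1):
            try:
                steps[name]()
                completed.append(name)
                break
            except Exception as err:
                if attempt == max_retries:
                    raise RuntimeError(
                        f"step {name!r} failed after {attempt + 1} attempts"
                    ) from err
    return completed


# Hypothetical three-step training workflow; "align" fails once, then succeeds.
attempts = {"align": 0}


def flaky_align():
    attempts["align"] += 1
    if attempts["align"] < 2:
        raise IOError("transient failure")


steps = {"tokenize": lambda: None, "align": flaky_align, "train": lambda: None}
deps = {"tokenize": set(), "align": {"tokenize"}, "train": {"align"}}
order = run_workflow(steps, deps)
print(order)
```

In this sketch the transient failure in `align` is absorbed by a retry, so the downstream `train` step still runs; a real workflow manager would additionally distribute steps across machines and persist progress between runs.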

The Road Ahead
• Green Computing
• Computing in the Clouds
• “Launch and Leave” Computing
• Turn-on of the LHC
• Broader and larger community of contributors
• More and bigger campus grids
• Fetching work from “other” sources
• Multi-Core nodes
• Low latency and short jobs
• Staging data through Storage Elements

Thank you for building such a wonderful community
