Evaluating Condor for Enterprise Use: A UBS Case Study

Gregg Cooke, IT Technical Council, April 26, 2006

Overview • Context: Why UBS Uses Grids • Tests: What Did We Look At? • Results: Strengths & Limitations

SECTION 1 The Context: Grids in an Investment Bank

Grids at UBS What do we mean by “grid”? • Specifically, when we say “grid” we mean a computational cluster • Condor fits the definition closely • Other terminology:

Grids at UBS Why do we use grids? • Complex, long-running calculations include: • Monte Carlo simulations of risk exposure • Black-Scholes option valuations on portfolios of stock options • Valuation of complicated “exotic” financial instruments • Speed of computation directly correlates to volume of sales • Accuracy of risk exposure calculation directly correlates to reserve cash • Calculations constructed by quantitative analysts (“quants”) • Write code that’s easy to change, not code that’s particularly efficient or parallelized
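As a rough illustration of the calculations named on this slide, the sketch below prices a European call option with the standard Black-Scholes closed form and cross-checks it against a Monte Carlo estimate split into independent batches. Each batch is the kind of self-contained task that could be handed to a separate grid engine; here the batches simply run one after another. The function names, batch sizes, and market parameters are illustrative assumptions, not UBS code.

```python
import math
import random


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, vol, maturity):
    """Closed-form Black-Scholes price of a European call option."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * maturity) / (
        vol * math.sqrt(maturity)
    )
    d2 = d1 - vol * math.sqrt(maturity)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * maturity) * norm_cdf(d2)


def monte_carlo_batch(spot, strike, rate, vol, maturity, paths, seed):
    """One independent task: sum of discounted call payoffs over `paths` paths."""
    rng = random.Random(seed)  # a per-batch seed keeps batches independent
    drift = (rate - 0.5 * vol ** 2) * maturity
    diffusion = vol * math.sqrt(maturity)
    total = 0.0
    for _ in range(paths):
        terminal = spot * math.exp(drift + diffusion * rng.gauss(0.0, 1.0))
        total += max(terminal - strike, 0.0)
    return math.exp(-rate * maturity) * total


def monte_carlo_call(spot, strike, rate, vol, maturity, batches=8, paths_per_batch=20000):
    """Combine the batch sums; on a grid each batch would be a separate task."""
    sums = [
        monte_carlo_batch(spot, strike, rate, vol, maturity, paths_per_batch, seed)
        for seed in range(batches)
    ]
    return sum(sums) / (batches * paths_per_batch)


closed_form = black_scholes_call(100.0, 100.0, 0.05, 0.2, 1.0)
simulated = monte_carlo_call(100.0, 100.0, 0.05, 0.2, 1.0)
print(f"closed form: {closed_form:.4f}, Monte Carlo: {simulated:.4f}")
```

Because the batches share no state, the work is embarrassingly parallel: adding engines shortens the wall-clock time without changing the code of the calculation itself.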

Current Grid Environment at UBS How do we build & run our grids? • 10 separate production grids totaling 3000+ engines • All separate grids…some 60-engine, some 2000-engine • 1 million tasks per day • Wide variety of platforms, languages, architectures • C/C++, C#, Java on Windows or Linux • Service-oriented vs. batch-oriented, embarrassingly parallel vs. workflow • Rarely any greenfield development • Dedicated deployment & operations teams (“GSD”) • Straddle the development / operations worlds • Focused on meeting business SLAs • Strong drivers of what grid platform we use

Typical UBS Grid Environment • Quants • write the calculations • part of the business • GSD • makes app meet SLAs • faces off with business • Dev • builds & tests the application • uses quant code, partners with GSD

SECTION 2 The Tests: Function, not Performance

How to Test Condor? Feasibility Study: is Condor suitable for use within our enterprise? • No performance tests…instead: • Determine the functional limits of Condor • Determine how Condor integrates with existing enterprise systems • Port one or more projects to use Condor and measure: • Porting effort • Opportunities for new functionality (and cost of lost functionality) • Operational impact

The Tests We tested the following aspects of Condor: • Scheduling capabilities • Various combinations of Requirements, Rank, Start, Suspend, etc. rules • Administrative capabilities • Features of command line tools, common admin practices • Interaction model • Integrating Condor with an app: APIs, SOAP interface, command line interface • Robustness and resilience • Failover options, long-term stability, task retry, realtime reconfiguration, etc. • Usability • Impact to the user when a Condor engine is installed on their desktop • And…scheduling latency…
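The scheduling rules mentioned above combine a job-side filter (Requirements), a job-side preference (Rank), and a machine-side policy (Start). The sketch below is a toy Python model of that idea only; it is not Condor's ClassAd language or its actual matchmaking algorithm, and the machine names, attributes, and the 15-minute idle threshold are all hypothetical.

```python
def machine_will_start(machine, job):
    """Machine-side policy: desktops accept work only after 15 idle minutes."""
    return machine["kind"] != "desktop" or machine["keyboard_idle_s"] >= 900


def match(job, machines):
    """Return the eligible machine the job ranks highest, or None."""
    eligible = [
        m for m in machines
        if job["requirements"](m) and machine_will_start(m, job)
    ]
    if not eligible:
        return None
    return max(eligible, key=job["rank"])


# Hypothetical pool: one busy desktop and three dedicated servers.
machines = [
    {"name": "desk-01", "kind": "desktop", "os": "LINUX", "memory_mb": 16384, "keyboard_idle_s": 30},
    {"name": "farm-07", "kind": "server", "os": "LINUX", "memory_mb": 4096, "keyboard_idle_s": 0},
    {"name": "farm-09", "kind": "server", "os": "LINUX", "memory_mb": 8192, "keyboard_idle_s": 0},
    {"name": "farm-12", "kind": "server", "os": "WINDOWS", "memory_mb": 32768, "keyboard_idle_s": 0},
]

# Hypothetical job: needs Linux with at least 2 GB, prefers more memory.
job = {
    "requirements": lambda m: m["os"] == "LINUX" and m["memory_mb"] >= 2048,
    "rank": lambda m: m["memory_mb"],
}

chosen = match(job, machines)
print(chosen["name"])  # farm-09
```

The desktop has the most memory among the Linux machines but is excluded because its owner is active, which is the essence of user-centric cycle scavenging: the machine's own policy can veto a match that the job would otherwise prefer.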

Scheduling Latency Definition: the interval between the initial request and when the first engine starts working on your task • Applications may be designed with a given scheduling latency in mind • We can control how long our code takes…we cannot control the scheduling latency • Redevelopment is often a major undertaking • We were expecting a very short (100 ms) deterministic scheduling latency • Condor’s is much longer (1 min or more) and nondeterministic • Condor does have an alternative (COD) but it changes the expected behavior of the grid • Impact on testing: new set of questions! • “Does Condor’s scheduling latency present a problem for our applications?” • “Do we have applications that were not developed with assumptions about the scheduling latency?” • “Are there other aspects of Condor’s performance that offset the scheduling latency concerns?” • “Can we measure the performance of our applications on Condor without regard to scheduling latency?”
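Under the definition above, scheduling latency can be computed from two timestamps per job: when the request was made and when the first engine began work. The sketch below is one minimal way to do that and to summarize how much the latency varies, since spread matters as much as the mean when predictability is the concern. The timestamps are made-up sample values, not measurements from this study, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def scheduling_latencies(events):
    """events: list of (submit_time_s, first_start_time_s) pairs."""
    return [start - submit for submit, start in events]


def summarize(latencies):
    """Mean, range, and spread; the spread indicates how predictable the latency is."""
    return {
        "mean_s": mean(latencies),
        "min_s": min(latencies),
        "max_s": max(latencies),
        "stdev_s": pstdev(latencies),
    }


# Hypothetical (submit, first start) timestamps in seconds.
sample_events = [(0.0, 60.0), (10.0, 100.0), (20.0, 170.0)]

latencies = scheduling_latencies(sample_events)
summary = summarize(latencies)
print(latencies)
print(summary)
```

Measuring latency separately from task run time in this way is what makes it possible to ask the last question on the slide: how an application performs on the grid once scheduling latency is factored out.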

SECTION 3 The Result: Condor as a Functional Benchmark

What We Love About Condor Too many to list…here are the top four: • Incredibly powerful expression-based scheduling policy • No-impact desktop cycle scavenging • Easy reconfiguration • Anything that can be run from a command line can be a task But, Condor has limits too…

What Condor Needs to Better Support UBS We found issues in four key areas: • Administrative interface • Code deployment • Scheduling latency • Job submission APIs Important: remember that these conclusions are only relevant to UBS! This is only what we found, based on our context…your mileage may vary

Administration Interface Our conclusions: • What we expected: • A nice GUI admin console similar to others our operations personnel are familiar with • What we found: • A rich command-line administration interface, but no GUI • Our conclusion: • At UBS, Condor will not be used by operations teams that cannot accept a command-line admin interface • These are usually Windows teams…Unix teams don’t seem to have as much bias • What this means for the Condor community: • A GUI admin console will make Condor more acceptable to enterprise users • Web-based is best • Doesn’t have to be fancy…just needs to be point & click (and stable, of course) • Work being done at Indiana University on a Condor portal is a start

Code Deployment Our conclusions: • What we expected: • Automatic task code deployment done once and refreshed automatically when the grid system senses a change in a central repository • What we found: • Automatic task code deployment every time a job is submitted • Our conclusion: • At UBS, Condor causes problems for applications with huge (15 MB+) task code and short tasks, because the network transmission time impacts job completion time • What this means for the Condor community: • To make Condor more acceptable to enterprise users, task code should be cached at the engines and refreshed only when it changes • Fortunately, this is being worked on by the Condor Project! • We’ve watched commercial grid vendors implement this…it is not an easy feature!
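The caching behavior asked for on this slide can be sketched as an engine-side store keyed by a digest of the task code, so that the code crosses the network only when its content changes. This is a toy model under stated assumptions, not a description of how the Condor Project or any commercial vendor implements the feature; the class, method names, and payloads are hypothetical.

```python
import hashlib


def digest_of(code_bytes):
    """Content digest used as the cache key; changes whenever the code changes."""
    return hashlib.sha256(code_bytes).hexdigest()


class EngineCodeCache:
    """Toy engine-side cache: fetch task code only on a cache miss."""

    def __init__(self):
        self._store = {}    # digest -> task code bytes
        self.transfers = 0  # number of simulated network transfers

    def ensure(self, digest, fetch):
        """Return the task code for `digest`, calling fetch() only if it is absent."""
        if digest not in self._store:
            self._store[digest] = fetch()
            self.transfers += 1
        return self._store[digest]


# Hypothetical task code payloads standing in for a large binary.
version_1 = b"task code, build 1"
version_2 = b"task code, build 2"

cache = EngineCodeCache()

# Three jobs submitted with the same code: one transfer, two cache hits.
for _ in range(3):
    cache.ensure(digest_of(version_1), lambda: version_1)

# A fourth job with changed code: the new digest forces one more transfer.
cache.ensure(digest_of(version_2), lambda: version_2)

print(cache.transfers)  # 2
```

The sketch leaves out the parts that make the real feature hard, such as eviction, concurrent jobs on one engine, and verifying that a cached copy has not been corrupted.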

Scheduling Latency Our conclusions: • What we expected: • Negligibly small latency that’s deterministic enough for us to predict job completion times • What we found: • Latencies that depend on configuration settings and complexity of classads • Our conclusion: • At UBS, Condor cannot be used for tasks that require less than 3.5 min to complete or where the total job completion time must be easily predictable • However, • Even though our highest-value applications require short deterministic scheduling latencies, there are many more lower-value applications that aren’t sensitive to scheduling latency
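A short calculation shows why a fixed scheduling latency hurts short tasks most. The sketch below computes the fraction of wall-clock time a task spends waiting to be scheduled, and the shortest task that keeps that fraction within a chosen budget. The 60 s latency echoes the "1 min or more" figure quoted earlier in the deck; the overhead budget is an assumption, and the source does not say how its 3.5 min threshold was derived.

```python
def latency_overhead_fraction(task_seconds, latency_seconds):
    """Fraction of total wall-clock time spent waiting for the task to start."""
    return latency_seconds / (latency_seconds + task_seconds)


def min_task_seconds(latency_seconds, max_overhead_fraction):
    """Shortest task for which latency stays within the overhead budget.

    Solves latency / (latency + task) <= max_overhead_fraction for task.
    """
    return latency_seconds * (1.0 - max_overhead_fraction) / max_overhead_fraction


latency = 60.0  # seconds; assumed, based on the "1 min or more" figure

for task in (1.0, 30.0, 210.0, 3600.0):
    share = latency_overhead_fraction(task, latency)
    print(f"task {task:7.1f} s -> {share:.1%} of wall-clock time is scheduling latency")

print(min_task_seconds(latency, 0.5))  # 60.0
```

Under these assumptions a 3.5 min (210 s) task spends roughly a fifth of its wall-clock time waiting, while an hour-long batch task barely notices the latency, which is consistent with the slide's point that many lower-value batch applications are not sensitive to it.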

Application Programmer’s Interface Our conclusions: • What we expected: • Nice, well-designed APIs for all our favorite languages • What we found: • A command line interface and a maturing SOAP interface • Our conclusion: • Once the SOAP interface matures, UBS programmers will be more amenable to using Condor • What this means for the Condor community: • Full speed ahead on the SOAP interface! • Make sure all of the functionality available in the command-line interface is available in the SOAP interface
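Until a language-level API is available, one way an application can integrate with Condor is to generate a submit description and hand it to the `condor_submit` command line tool. The sketch below shows that pattern in Python. It assumes a working Condor installation for the `submit` function, and the executable path, job name, and helper function names are hypothetical; the submit commands shown are ordinary vanilla-universe ones, but they should be checked against the manual for the Condor version in use.

```python
import subprocess
import tempfile


def build_submit_description(executable, arguments, job_name, count=1):
    """Build the text of a Condor submit description for `count` identical tasks."""
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        f"arguments = {arguments}",
        f"output = {job_name}.$(Process).out",  # $(Process) numbers each queued task
        f"error = {job_name}.$(Process).err",
        f"log = {job_name}.log",
        f"queue {count}",
    ]
    return "\n".join(lines) + "\n"


def submit(description):
    """Write the description to a temporary file and pass it to condor_submit.

    Requires condor_submit on the PATH; raises CalledProcessError on failure.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".sub", delete=False) as handle:
        handle.write(description)
        path = handle.name
    return subprocess.run(
        ["condor_submit", path], capture_output=True, text=True, check=True
    )


# Hypothetical job: anything runnable from a command line can be the executable.
description = build_submit_description("/bin/echo", "hello grid", "demo", count=3)
print(description)
```

Wrapping the command line this way works, but the application must parse tool output and log files to track its jobs, which is part of why a complete SOAP or native API matters to application programmers.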

Condor at UBS We will continue to use Condor for: • Teaching new teams how to grid their applications • Condor is an excellent exploration and learning environment • Has already accelerated at least one team • A functional benchmark for all things grid • Condor is a crucible where new and innovative grid ideas get tried and refined • Many of these features will prove valuable for commercial vendors to embrace • Check-pointing & task migration • Expression-based scheduling policy • User-centric cycle scavenging • Non-critical batch-oriented applications with standalone or SOAP-enabled service code, with operations teams that don’t mind a command line administration interface • There are lots and lots of non-critical batch-oriented apps with standalone services • There are not a lot of operations teams that will tolerate a command line interface…
