GENERALLY ACCESSIBLE Evaluating Condor for Enterprise Use: A UBS Case Study Gregg Cooke, IT Technical Council April 26, 2006
Overview • Context: Why UBS Uses Grids • Tests: What Did We Look At? • Results: Strengths & Limitations
SECTION 1 The Context: Grids in an Investment Bank
Grids at UBS What do we mean by “grid”? • Specifically, when we say “grid” we mean a computational cluster • Condor fits the definition closely • Other terminology:
Grids at UBS Why do we use grids? • Complex, long-running calculations include: • Monte Carlo simulations of risk exposure • Black-Scholes option valuations on portfolios of stock options • Valuation of complicated “exotic” financial instruments • Speed of computation directly correlates to volume of sales • Accuracy of risk exposure calculation directly correlates to reserve cash • Calculations constructed by quantitative analysts (“quants”) • Write code that’s easy to change, not code that’s particularly efficient or parallelized
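As a rough illustration of the kind of calculation listed above, the following sketch prices a European call option two ways: with the closed-form Black-Scholes formula and with a Monte Carlo estimate split into independent chunks. The chunking is an assumption made for illustration (each chunk stands in for one grid task), and the function names and parameters are hypothetical, not taken from any UBS code.

```python
import math
import random


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_call(spot, strike, rate, vol, years):
    """Closed-form Black-Scholes price of a European call option."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * vol ** 2) * years) / (
        vol * math.sqrt(years)
    )
    d2 = d1 - vol * math.sqrt(years)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)


def monte_carlo_chunk(spot, strike, rate, vol, years, paths, seed):
    """Discounted mean payoff over `paths` simulated terminal prices.

    Each chunk has its own seed and shares no state with other chunks,
    so (by assumption) it could run as one independent grid task.
    """
    rng = random.Random(seed)
    drift = (rate - 0.5 * vol ** 2) * years
    diffusion = vol * math.sqrt(years)
    total = 0.0
    for _ in range(paths):
        terminal = spot * math.exp(drift + diffusion * rng.gauss(0.0, 1.0))
        total += max(terminal - strike, 0.0)
    return math.exp(-rate * years) * total / paths


def monte_carlo_call(spot, strike, rate, vol, years, chunks, paths_per_chunk):
    """Average the per-chunk estimates (equal chunk sizes, so a plain mean)."""
    estimates = [
        monte_carlo_chunk(spot, strike, rate, vol, years, paths_per_chunk, seed)
        for seed in range(chunks)
    ]
    return sum(estimates) / len(estimates)
```

Because the chunks do not communicate, this is an example of the "embarrassingly parallel" style of workload: adding engines shortens the wall-clock time roughly in proportion, as long as per-task overhead stays small.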
Current Grid Environment at UBS How do we build & run our grids? • 10 separate production grids totaling 3000+ engines • All separate grids…some 60-engine, some 2000-engine • 1 million tasks per day • Wide variety of platforms, languages, architectures • C/C++, C#, Java on Windows or Linux • Service-oriented vs. batch-oriented, embarrassingly parallel vs. workflow • Rarely any greenfield development • Dedicated deployment & operations teams (“GSD”) • Straddle the development / operations worlds • Focused on meeting business SLAs • Strong drivers of what grid platform we use
Typical UBS Grid Environment • Quants • write the calculations • part of the business • GSD • makes app meet SLAs • faces off with business • Dev • builds & tests the application • uses quant code, partners with GSD
SECTION 2 The Tests: Function, not Performance
How to Test Condor? Feasibility Study: is Condor suitable for use within our enterprise? • No performance tests…instead: • Determine the functional limits of Condor • Determine how Condor integrates with existing enterprise systems • Port one or more projects to use Condor and measure: • Porting effort • Opportunities for new functionality (and cost of lost functionality) • Operational impact
The Tests We tested the following aspects of Condor: • Scheduling capabilities • Various combinations of Requirements, Rank, Start, Suspend, etc. rules • Administrative capabilities • Features of command line tools, common admin practices • Interaction model • Integrating Condor with an app: APIs, SOAP interface, command line interface • Robustness and resilience • Failover options, long-term stability, task retry, real-time reconfiguration, etc. • Usability • Impact to the user when a Condor engine is installed on their desktop • And…scheduling latency…
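To make the scheduling rules above more concrete, here is a minimal sketch that renders a Condor submit description containing a Requirements and a Rank expression. The executable name, argument string, and the specific expressions are hypothetical examples, not the ones used in the UBS tests. Start and Suspend are machine-side policy settings in the Condor configuration and are not shown here.

```python
def render_submit_description(executable, arguments, requirements, rank, count):
    """Build the text of a Condor submit description file.

    `requirements` restricts which machines may run the job;
    `rank` orders the machines that satisfy the requirements.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
        f"arguments = {arguments}",
        f"requirements = {requirements}",
        f"rank = {rank}",
        # $(Process) expands to 0..count-1, giving each task its own files.
        "output = task.$(Process).out",
        "error = task.$(Process).err",
        "log = job.log",
        f"queue {count}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical job: 50 pricing tasks on Linux machines with at least
# 1024 MB of memory, preferring machines with faster floating point.
example = render_submit_description(
    executable="price_portfolio",
    arguments="--chunk $(Process)",
    requirements='(OpSys == "LINUX") && (Memory >= 1024)',
    rank="KFlops",
    count=50,
)
```

The resulting text could be written to a file and handed to `condor_submit`; the point of the sketch is only that the scheduling policy is expressed as ordinary expressions over machine attributes.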
Scheduling Latency Definition: the interval between the initial request and when the first engine starts working on your task • Applications may be designed with a given scheduling latency in mind • We can control how long our code takes…we cannot control the scheduling latency • Redevelopment is often a major undertaking • We were expecting a very short (100 ms) deterministic scheduling latency • Condor’s is much longer (1 min or more) and nondeterministic • Condor does have an alternative (COD) but it changes the expected behavior of the grid • Impact on testing: new set of questions! • “Does Condor’s scheduling latency present a problem for our applications?” • “Do we have applications that were not developed with assumptions about the scheduling latency?” • “Are there other aspects of Condor’s performance that offset the scheduling latency concerns?” • “Can we measure the performance of our applications on Condor without regard to scheduling latency?”
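The definition above can be turned into a small back-of-the-envelope model. The sketch below computes scheduling latency from two timestamps and then the fraction of elapsed time a task spends doing useful work. The simple model (one latency charge per task, no queuing effects) and the function names are assumptions made for illustration.

```python
def scheduling_latency(submitted_at: float, first_start_at: float) -> float:
    """Seconds between the initial request and the first engine starting work."""
    if first_start_at < submitted_at:
        raise ValueError("first start cannot precede submission")
    return first_start_at - submitted_at


def useful_fraction(task_seconds: float, latency_seconds: float) -> float:
    """Share of elapsed time spent computing rather than waiting to be scheduled."""
    return task_seconds / (task_seconds + latency_seconds)


# Compare the latency that was expected (100 ms) with the one observed
# (about 1 min) for a hypothetical 5-second task.
expected = useful_fraction(task_seconds=5.0, latency_seconds=0.1)
observed = useful_fraction(task_seconds=5.0, latency_seconds=60.0)
```

Under this model a 5-second task spends about 98% of its elapsed time computing when latency is 100 ms, but under 8% when latency is 60 seconds, which is why applications designed around short tasks are so sensitive to this number.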
SECTION 3 The Result: Condor as a Functional Benchmark
What We Love About Condor Too many to list…here are the top four: • Incredibly powerful expression-based scheduling policy • No-impact desktop cycle scavenging • Easy reconfiguration • Anything that can be run from a command line can be a task But, Condor has limits too…
What Condor Needs to Better Support UBS We found issues in four key areas: • Administrative interface • Code deployment • Scheduling latency • Job submission APIs Important: remember that these conclusions are only relevant to UBS! This is only what we found, based on our context…your mileage may vary
Administration Interface Our conclusions: • What we expected: • A nice GUI admin console similar to others our operations personnel are familiar with • What we found: • A rich command-line administration interface, but no GUI • Our conclusion: • At UBS, Condor will not be used by operations teams that cannot accept a command-line admin interface • These are usually Windows teams…Unix teams don’t seem to have as much bias • What this means for the Condor community: • A GUI admin console will make Condor more acceptable to enterprise users • Web-based is best • Doesn’t have to be fancy…just needs to be point & click (and stable, of course) • Work being done at Indiana University on a Condor portal is a start
Code Deployment Our conclusions: • What we expected: • Automatic task code deployment done once and refreshed automatically when the grid system senses a change in a central repository • What we found: • Automatic task code deployment every time a job is submitted • Our conclusion: • At UBS, Condor causes problems for applications that combine large (15 MB+) task code with short tasks, because the network transmission time impacts job completion time • What this means for the Condor community: • To make Condor more acceptable to enterprise users, task code should be cached at the engines and refreshed only when it changes • Fortunately, this is being worked on by the Condor Project! • We’ve watched commercial grid vendors implement this…it is not an easy feature!
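One way to picture "cached at the engines and refreshed only when it changes" is a content-addressed cache: the job names its task code by digest, and an engine transfers the code only when it does not already hold that digest. The sketch below is an illustrative assumption, not how Condor or any commercial grid product implements the feature; the class, the `fetch` callback, and the demo are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


class TaskCodeCache:
    """Engine-side cache keyed by the SHA-256 digest of the task code."""

    def __init__(self, cache_dir, fetch):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.fetch = fetch  # callable: digest -> bytes (the network transfer)
        self.transfers = 0  # how many times the code actually crossed the network

    def get(self, digest: str) -> Path:
        """Return a local path for the task code, transferring it only on a miss."""
        path = self.cache_dir / digest
        if not path.exists():
            payload = self.fetch(digest)
            if hashlib.sha256(payload).hexdigest() != digest:
                raise ValueError("task code does not match requested digest")
            path.write_bytes(payload)
            self.transfers += 1
        return path


def demo() -> int:
    """Run three jobs against one engine; only changed code is transferred."""
    v1, v2 = b"task code v1", b"task code v2"
    repository = {hashlib.sha256(p).hexdigest(): p for p in (v1, v2)}
    d1, d2 = (hashlib.sha256(p).hexdigest() for p in (v1, v2))
    with tempfile.TemporaryDirectory() as tmp:
        cache = TaskCodeCache(tmp, fetch=repository.__getitem__)
        cache.get(d1)  # first job: transfer
        cache.get(d1)  # same code again: cache hit, no transfer
        cache.get(d2)  # code changed: transfer
        return cache.transfers
```

With this scheme three submissions cost two transfers instead of three. The hard parts that the sketch ignores (eviction, partial transfers, many engines fetching at once) are a large part of why this is not an easy feature.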
Scheduling Latency Our conclusions: • What we expected: • Negligibly small latency that’s deterministic enough for us to predict job completion times • What we found: • Latencies that depend on configuration settings and complexity of classads • Our conclusion: • At UBS, Condor cannot be used for tasks that require less than 3.5 min to complete or where the total job completion time must be easily predictable • However, • Even though our highest-value applications require short deterministic scheduling latencies, there are many more lower-value applications that aren’t sensitive to scheduling latency
Application Programmer’s Interface Our conclusions: • What we expected: • Nice, well-designed APIs for all our favorite languages • What we found: • A command line interface and a maturing SOAP interface • Our conclusion: • Once the SOAP interface matures, UBS programmers will be more amenable to using Condor • What this means for the Condor community: • Full speed ahead on the SOAP interface! • Make sure all of the functionality available in the command-line interface is available in the SOAP interface
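Until a language-level API is available, an application can wrap the command line interface itself. The sketch below shells out to `condor_submit` and extracts the cluster number from its output. The wrapper functions are hypothetical, and the output pattern matched here ("submitted to cluster N") is an assumption about the tool's message format that should be checked against the installed version.

```python
import re
import subprocess

# Assumed shape of the confirmation message printed by condor_submit.
_CLUSTER_RE = re.compile(r"submitted to cluster (\d+)")


def parse_cluster_id(stdout: str) -> int:
    """Extract the cluster number from condor_submit's standard output."""
    match = _CLUSTER_RE.search(stdout)
    if match is None:
        raise ValueError("could not find a cluster id in condor_submit output")
    return int(match.group(1))


def submit(submit_file: str, run=subprocess.run) -> int:
    """Submit a job description file and return the resulting cluster id.

    `run` is injectable so the wrapper can be exercised without a Condor
    installation; by default it is subprocess.run.
    """
    result = run(
        ["condor_submit", submit_file],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return parse_cluster_id(result.stdout)
```

Parsing human-readable output like this is brittle, which is one reason a complete SOAP or native API is preferable to wrapping the command line tools.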
Condor at UBS We will continue to use Condor for: • Teaching new teams how to grid their applications • Condor is an excellent exploration and learning environment • Has already accelerated at least one team • A functional benchmark for all things grid • Condor is a crucible where new and innovative grid ideas get tried and refined • Many of these features will prove valuable for commercial vendors to embrace • Check-pointing & task migration • Expression-based scheduling policy • User-centric cycle scavenging • Non-critical batch-oriented applications with standalone or SOAP-enabled service code, with operations teams that don’t mind a command line administration interface • There are lots and lots of non-critical batch-oriented apps with standalone services • There are not a lot of operations teams that will tolerate a command line interface…