Improving the Automatic Evaluation of Problem Solutions in Programming Contests

Improving the Automatic Evaluation of Problem Solutions in Programming Contests • Pedro Ribeiro and Pedro Guerreiro

Presentation Overview • Automatic Evaluation: Past and Present • The case of IOI • A possible path for improving evaluation • Developing only a function (not a complete program) • Abstract Input/Output • Repeat the same function call (+ clock precision) • No hints on expected complexity • Examine runtime behaviour as tests increase in size • Some preliminary results • Conclusions

Programming Contests • All programming contests need an efficient and fair way of distinguishing submitted solutions: (automatic) evaluation • What do we evaluate? • Correctness: does the program produce correct answers for all instances of the problem? • Efficiency: does it do so fast enough? Does it have the necessary time and memory complexity?

Programming Contests • Classic way of evaluating • Set of pre-defined tests (inputs) • Run the program with the tests and check the output • IOI has been doing this in almost the same way since the beginning, with two major advances: • Manual evaluation -> Automatic evaluation • Individual tests -> Grouped tests • Although IOI has 3 different types of tasks, the core of the event is still batch tasks

IOI Types of Tasks

Programming Contests • Correctness: almost a “black art” • “Program testing can be a very effective way to show the presence of bugs, but it is hopelessly inadequate for showing their absence” (Dijkstra) • Efficiency: • Typically judges create a set of model solutions of different complexities • Tests are designed so that the model solutions achieve the planned number of points • Considerable amount of tuning (environment) • Considerable amount of manpower needed • More difficult to introduce new languages

Ideas: Single function • Solve the problem by writing a specific function (as opposed to a complete program) • Motivation: • Concentrate on the core algorithm (fewer distractions) • Can be used in earlier stages of learning • Opportunities for new ways of testing (more control over submitted code) • It is already done in other types of contests: • TopCoder • Teaching environments (Ribeiro and Guerreiro, 2008)

Ideas: I/O Abstraction • The input and output should be “abstract” and not specific to a language • How to do it: • Input already in memory, passed as function arguments (simple form, no complex data structures) • Output as the function return value(s) • Motivation: • Fewer input-processing details • Less complicated problem statements • We can measure the time spent in the solution (not in I/O) • More balanced performance between languages

Idea: Repeat function calls • In the past we used smaller input sizes; as computers got faster, input sizes increased • Currently we use huge input sizes • Clock resolution is poor: small instances -> measured as instantaneous • Need to distinguish small asymptotic complexities • Historic fact: smaller time limits have been used at IOI: • IOI 2007, problem training: 0.3 seconds • Future? • Ever more speed -> ever bigger input sizes

Idea: Repeat function calls • Problems completely detached from reality: • Ex: IOI 2007 Sails, ship with 100,000 masts

Idea: Repeat function calls • Real world: how can we measure the thickness of a sheet of paper if our standard ruler is not accurate enough? If a stack of 100 sheets measures 1 cm, then each sheet is ~0.1 mm • We can use the same idea on functions! • Running once with a small instance may appear instantaneous, but • Running multiple times takes more than 0.00s!

Idea: Repeat function calls • Run the same function several times and compute the average time • Pros • Input size can be smaller and related to the problem • We can concentrate on the quality of test cases and rely less on randomization to produce big test cases that are impossible to verify manually • Cons • We must be careful with memory persistence between successive function calls
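The repeated-call idea above can be sketched as follows. This is a minimal illustration, not the authors' evaluation system: the name `average_call_time` and its parameters are hypothetical, and it assumes the submitted function does not modify its arguments (otherwise each call would need a fresh copy of the input, which is the memory-persistence concern listed under Cons).

```python
import time


def average_call_time(func, args, min_total=1.0):
    """Call func(*args) repeatedly until at least min_total seconds
    have elapsed, then return (mean seconds per call, number of calls).

    Assumption: func does not mutate args, so reusing the same
    arguments across calls is safe.
    """
    calls = 0
    elapsed = 0.0
    start = time.perf_counter()
    while elapsed < min_total:
        func(*args)
        calls += 1
        elapsed = time.perf_counter() - start
    return elapsed / calls, calls


if __name__ == "__main__":
    data = list(range(100))  # small instance, as in the paper-stack analogy
    mean_time, calls = average_call_time(sum, (data,), min_total=0.2)
    print(f"{calls} calls, about {mean_time:.3e} s per call")
```

The loop overhead and the clock reads are included in the measured time, so for very cheap functions the average is an upper estimate of the cost of a single call.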

Idea: No hints on complexity • When we give limits for the input: • We simplify implementation details and avoid the need for dynamic memory allocation, but • We disclose the complexity required for the problem • Trained students can identify precisely the complexity needed • This has a great impact on the problem-solving aspect: • Different mindset: “I know which complexity I’m looking for and I settle for a solution that achieves it” vs. • Scientific approach, as with a real-world open problem • Ex: is there a polynomial solution for a problem?

Idea: No hints on complexity • Give limits for implementation purposes, but make it clear that they are not related to the sought efficiency • More scientific and open-ended approach • Need to think about how to really solve the problem (and not how to produce a program that passes the test cases) • Do not overemphasize the runtime of a particular language • (“let me make a test with the maximum limits and see if it runs in X seconds on this machine with this language”)

Idea: Runtime behaviour as tests increase • Typically we measure efficiency by creating a set of tests such that different model solutions achieve different numbers of points, but • Not passing does not imply that the required complexity was not achieved (other factors may be involved) • Passing just means that the test case is solved within the constraints • A lot of manpower is needed for model solutions and fine tuning (compiler version, computer speed, language used, etc.)

Idea: Runtime behaviour as tests increase • How can we improve on that? • Pen and paper is not an option for large-scale evaluation • Need for automatic processes • We have different tests and different time measurements, so why don’t we use all this information? • Plot the runtime as the data size increases and do some curve fitting • It is impossible to determine the complexity of all programs, but even a trivial (imperfect) curve can show more information than just knowing which test cases are passed

Some Preliminary Results • As a proof of concept, a simple problem: • Input: sequence of integers • Output: subsequence of consecutive integers with maximum sum • Only ask for a function, with I/O already given • Small input limit (only 100) • Measure time by running multiple times (until the aggregated time reaches 1s) • Use random data for N = 1, 4, 8, 12, …, 64

Some Preliminary Results • Implemented 3 model solutions: • O(N^3): iterate over all possible intervals in O(N^2), plus iterate through each interval to compute its sum in O(N) • O(N^2): iterate over all possible intervals in O(N^2), plus O(1) computation of each sum using accumulated sums • O(N): iterate through the sequence and keep a partial sum; whenever the partial sum is negative, it cannot contribute to the best sum, so “reset” it to zero and continue
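The three model solutions described above could look roughly like this. The slides do not show the authors' code, so this is only a sketch under stated assumptions: the function names are hypothetical, each function returns just the maximum sum (not the subsequence itself), and the empty subsequence with sum 0 is allowed, which is what the "reset to zero" rule of the linear solution implies.

```python
def max_sum_cubic(seq):
    """O(N^3): try every interval and add up its elements one by one."""
    best = 0  # assumption: the empty subsequence (sum 0) is allowed
    n = len(seq)
    for i in range(n):
        for j in range(i, n):
            total = 0
            for k in range(i, j + 1):
                total += seq[k]
            best = max(best, total)
    return best


def max_sum_quadratic(seq):
    """O(N^2): try every interval, get its sum in O(1) from accumulated sums."""
    n = len(seq)
    prefix = [0] * (n + 1)  # prefix[i] = sum of the first i elements
    for i, value in enumerate(seq):
        prefix[i + 1] = prefix[i] + value
    best = 0
    for i in range(n):
        for j in range(i, n):
            best = max(best, prefix[j + 1] - prefix[i])
    return best


def max_sum_linear(seq):
    """O(N): keep a partial sum and reset it to zero when it goes negative."""
    best = 0
    partial = 0
    for value in seq:
        partial += value
        if partial < 0:
            partial = 0  # a negative partial sum cannot help any later interval
        best = max(best, partial)
    return best
```

All three take the input already in memory as an argument and give the answer as the return value, in line with the single-function and abstract I/O ideas from the earlier slides.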

Some Preliminary Results • Plot Time(N) / Time(1) • Simple correlation measure with another function
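One way to make the "simple correlation measure" concrete is sketched below. The slides do not say which measure was used, so the Pearson correlation coefficient here is an assumption, as are the names `pearson`, `CANDIDATES` and `best_fit` and the particular list of candidate functions. The timings in the usage example are synthetic values generated from a known function, not measurements.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical set of growth functions to compare against.
CANDIDATES = {
    "N": lambda n: n,
    "N log N": lambda n: n * math.log2(n),
    "N^2": lambda n: n ** 2,
    "N^3": lambda n: n ** 3,
}


def best_fit(sizes, times):
    """Normalise times to Time(N) / Time(first size) and return the
    candidate with the highest correlation, plus all scores."""
    ratios = [t / times[0] for t in times]
    scores = {
        name: pearson(ratios, [f(n) for n in sizes])
        for name, f in CANDIDATES.items()
    }
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    sizes = [1] + list(range(4, 65, 4))      # 1, 4, 8, 12, ..., 64
    synthetic = [3e-7 * n ** 2 for n in sizes]  # synthetic, exactly quadratic
    name, scores = best_fit(sizes, synthetic)
    print(name, scores)
```

Pearson correlation is unaffected by rescaling, so dividing by Time(1) matters for plotting rather than for the score. Correlations between neighbouring candidates such as N and N log N are very close on small ranges, so on real, noisy timings the ranking is only an indication, as the slides themselves caution.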

Some Preliminary Results • A more detailed mathematical analysis is out of scope here • We could use other statistical measures • We know that it is impossible to automatically compute and prove complexities, but • This simple approach gives meaningful results • The runtime is reasonably consistent and correlated with a certain function, and therefore appears to grow following a pattern that we were able to identify • Ex: Linear -> appears to take twice the time when the data doubles

Some Preliminary Results • What could this do? • More information from the same test cases • Possibility of giving students automatic feedback on runtime behaviour • Possibility of identifying runtime behaviours for which no model solutions were created (less manpower!) • Independent of language-specific details • Ex: Archery Problem, IOI 2009, Day 1 • There were solutions with O(N^2R), O(N^3), O(N^2 log N), O(N^2), O(N log N), … • No need to code them all in all languages and then tune!

Conclusion • 20 years of IOI: computers are much faster, but the style of evaluation is still the same • Setting up test cases is time consuming and requires manpower • Need to think of ways to improve evaluation • Our proposal, geared to more informal contests or teaching environments, can offer: • No distraction with I/O • No large data sets • More natural problem statements • No hint on complexity (open-ended approach) • No need for implementing many model solutions • New languages can be added without changing tests • More work is still needed to obtain a robust system, but we feel these ideas (or some of them) can be used in practice • Future: can evaluation be improved in other ways?

The End • And that’s all! :-) Questions? • Pedro Ribeiro (pribeiro@dcc.fc.up.pt) • Pedro Guerreiro (pjguerreiro@ualg.pt)
