Benchmarking for Object-Oriented Unit Test Generation • Tao Xie • North Carolina State University
Why benchmarks in testing research?
• Before 2000, a testing tool/technique paper without a serious evaluation could still appear in ICSE/FSE/ISSTA/ASE, but not now (a healthy trend)
• Benchmarks are needed to justify the benefits of a proposed technique, often in comparison with existing techniques
Outline
• An (incomplete) survey of the post-2000 OO testing tool literature, focusing on the benchmarks used
• Discussion of history
• Discussion of the future
TestEra [Marinov and Khurshid, ASE 2001]
• Data structures: singly linked lists (mergeSort), java.util.TreeMap.remove
• INS: 3 methods
• Alloy-alpha: 1 method
• Requires Alloy models for class invariants
• Success criterion: # of n-bounded exhaustive tests
Korat [Boyapati et al., ISSTA 2002]
• Data structures: korat.examples.BinaryTree.remove, korat.examples.HeapArray.extractMax, java.util.LinkedList.reverse, java.util.TreeMap.put, java.util.HashSet.add, ins.namespace.AVTree.lookup
• Requires a repOk method for class invariants, plus a finitization (see the sketch below)
• Success criterion: # of n-bounded exhaustive tests
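For readers unfamiliar with Korat's inputs, the following is a minimal, illustrative sketch of what a repOk predicate might look like; the class, invariant, and bounds are assumptions for illustration, not code from the Korat distribution or its benchmark subjects.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Illustrative only: a Korat-style class invariant (repOk) for a small binary
// tree. Korat enumerates candidate structures within the finitization bounds
// and keeps those for which repOk returns true.
public class BinaryTree {
    Node root;

    static class Node {
        Node left, right;
    }

    // repOk: the reachable node graph is a tree (no sharing, no cycles).
    public boolean repOk() {
        if (root == null) return true;
        Set<Node> visited = new HashSet<>();
        Deque<Node> work = new ArrayDeque<>();
        work.push(root);
        while (!work.isEmpty()) {
            Node n = work.pop();
            if (!visited.add(n)) return false;  // node seen twice: sharing or a cycle
            if (n.left != null) work.push(n.left);
            if (n.right != null) work.push(n.right);
        }
        return true;
    }
    // A finitization would additionally bound the number of Node objects
    // (e.g., at most 3) available when enumerating candidate structures.
}
```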
JCrasher [Csallner and Smaragdakis, SP&E 2004]
• Canvas: 6 methods; P1: 16 methods; P1: 15 methods; P1: 16 methods; P1: 18 methods; P1: 15 methods; BSTree: 24 methods; UB-Stack: 11 methods
• Success criterion: # of real bugs among uncaught exceptions (illustrated below)
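As a rough illustration of this success criterion (not JCrasher's actual output format), tools of this kind emit short call sequences and report the ones that end in an undeclared runtime exception; java.util.Stack is used here only so the sketch compiles on its own, standing in for subjects such as UB-Stack.

```java
import java.util.Stack;
import org.junit.Test;

// Illustrative only: a randomly generated call sequence whose uncaught runtime
// exception gets reported as a potential bug.
public class CrashSequenceTest {
    @Test
    public void popOnEmptyStack() {
        Stack<Integer> s = new Stack<>();
        // pop() on an empty stack throws EmptyStackException. The tool flags
        // such uncaught exceptions; a human then decides whether each one is a
        // real bug or an acceptable precondition violation. The count of real
        // bugs is the success criterion above.
        s.pop();
    }
}
```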
Check ’n’ Crash [Csallner and Smaragdakis, ICSE 2005]
• Canvas: 6 methods; P1: 16 methods; P1: 15 methods; P1: 16 methods; P1: 18 methods; P1: 15 methods; BSTree: 24 methods; UB-Stack: 11 methods
• jaba: 17.9 KLOC; jboss.jms: 5.1 KLOC
• Success criterion: # of real bugs among uncaught exceptions
DSD-Crasher [Csallner and Smaragdakis, ISSTA 2006]
• jboss.jms: 5 KLOC; groovy: 34 classes, 2 KLOC
• Success criterion: # of real bugs among uncaught exceptions
eToc [Tonella, ISSTA 2004]
• Data structures: StringTokenizer (6 methods), BitSet (26 methods), HashMap (13 methods), LinkedList (23 methods), Stack (5 methods), TreeSet (15 methods)
• Requires a configuration file (so far manually prepared)
• Success criterion: branch coverage and % of seeded faults found (5 seeded faults per class; see the sketch below)
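Since the success criterion here counts seeded faults detected, a quick sketch of what a seeded fault (mutant) looks like may help; the class and the particular mutation are illustrative assumptions, not the actual faults seeded in the study.

```java
// Illustrative only: a "seeded fault" is a small, deliberate change such as a
// mutated relational operator. A generated test suite is credited with finding
// the fault if at least one test passes on the original class but fails on the
// faulty version.
public class BoundedCounter {
    private int value;
    private final int max;

    public BoundedCounter(int max) {
        this.max = max;
    }

    public void increment() {
        // Original:     if (value < max) value++;
        // Seeded fault: boundary mutation, <= instead of <
        if (value <= max) value++;
    }

    public int value() {
        return value;
    }
}
```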
JPF [Visser et al., ISSTA 2004]
• Data structure: java.util.TreeMap (3 methods: deleteEntry, fixAfterDeletion, fixAfterInsertion)
• Requires a driver to close the environment (see the sketch below)
• Success criterion: branch coverage
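Several of the tools surveyed here (the JPF studies, Symstra, Delta Execution) "require a driver to close the environment," i.e., a small program that bounds the method sequences and argument values the exploration may use. Below is a hedged sketch of such a driver for java.util.TreeMap; the bounds and the plain enumeration loop are assumptions chosen to keep the sketch self-contained, whereas JPF drivers typically use JPF's own nondeterministic-choice API instead.

```java
import java.util.TreeMap;

// Illustrative only: a driver that "closes the environment" by bounding both
// the sequence length and the key range exercised on java.util.TreeMap. An
// exhaustive exploration of all choices within these bounds drives coverage of
// TreeMap's internals (deleteEntry, fixAfterDeletion, fixAfterInsertion),
// which is the success criterion reported above.
public class TreeMapDriver {
    static final int N = 3;  // bound on sequence length and on the key range

    public static void main(String[] args) {
        int ops = 2;  // 0 = put, 1 = remove
        int totalChoices = (int) Math.pow(ops * N, N);
        for (int choice = 0; choice < totalChoices; choice++) {
            TreeMap<Integer, Integer> map = new TreeMap<>();
            int c = choice;
            for (int step = 0; step < N; step++) {
                int op = c % ops;
                c /= ops;
                int key = c % N;
                c /= N;
                if (op == 0) {
                    map.put(key, key);
                } else {
                    map.remove(key);
                }
            }
        }
    }
}
```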
JPF [Visser et al., ASE 2005]
• Data structure: java.util.TreeMap
• Requires a driver to close the environment
• Success criterion: basic block coverage
JPF [Visser et al., ISSTA 2006]
• Data structures: BinTree (154 LOC), BinomialHeap (355 LOC), FibHeap (286 LOC), partial java.util.TreeMap (580 LOC)
• Requires a driver to close the environment
• Success criterion: basic block coverage & predicate coverage
Rostra [Xie et al., ASE 2004]
• Data structures: IntStack (4 methods), UBStack (10 methods), BSet (9 methods), BBag (8 methods), ShoppingCart (7 methods), BankAccount (6 methods), BinarySearchTree (10 methods), LinkedList (10 methods)
• Requires existing tests to provide method arguments (see the sketch below)
• Success criterion: branch coverage
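As a hedged illustration of "existing tests provide method arguments": a seed test such as the one below supplies concrete argument values that a Rostra-style tool can reuse while exploring new method sequences. LinkedList is one of the subjects listed above; the particular test and values are assumptions for illustration.

```java
import static org.junit.Assert.assertEquals;

import java.util.LinkedList;

import org.junit.Test;

// Illustrative only: an existing (possibly hand-written) test supplies concrete
// argument values (7 and 42 below). Rostra-style tools reuse these values while
// systematically exploring new method sequences, pruning sequences that drive
// the receiver object into an already-seen state.
public class LinkedListSeedTest {
    @Test
    public void addThenRemoveFirst() {
        LinkedList<Integer> list = new LinkedList<>();
        list.add(7);
        list.add(42);
        assertEquals(Integer.valueOf(7), list.removeFirst());
    }
}
```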
Symstra [Xie et al., TACAS 2005]
• Data structures: IntStack (push, pop), UBStack (push, pop), BinSearchTree (insert, remove), BinomialHeap (insert, extractMin, delete), LinkedList (add, remove, removeLast), TreeMap (put, remove), HeapArray (insert, extractMax)
• Requires a driver to close the environment
• Success criterion: branch coverage
Symclat [d'Amorim et al., ASE 2006]
• Data structures: UBStack8 (11 methods), UBStack12 (11 methods), UtilMDE (69 methods), BinarySearchTree (9 methods), StackAr (8 methods), StackLi (9 methods), IntegerSetAsHashSet (4 methods), Meter (3 methods), DLList (12 methods), OneWayList (10 methods), SLList (11 methods), OneWayList (12 methods), OneWayNode (10 methods), SLList (12 methods), TwoWayList (9 methods), RatPoly (46 versions, 17 methods)
• Requires initial tests
• Success criterion: # of real bugs
Evacon [Inkumsah and Xie, ASE 2008]
• Data structures: BankAccount (6 methods), BinarySearchTree (16 methods), BinomialHeap (10 methods), BitSet (25 methods), DisjSet (6 methods), FibonacciHeap, HashMap (10 methods), LinkedList (29 methods), ShoppingCart (6 methods), Stack (5 methods), StringTokenizer (5 methods), TreeMap (47 methods), TreeSet (13 methods)
• Requires a configuration file (so far manually prepared)
• Success criterion: branch coverage
Nighthawk [Andrews et al., ASE 2007]
• Data structures: java.util.BitSet (16 methods), java.util.HashMap (8 methods), java.util.TreeMap (9 methods); BinTree, BHeap, FibHeap, TreeMap
• Java 1.5.0 Collection and Map classes: ArrayList, EnumMap, HashMap, HashSet, Hashtable, LinkedList, Pqueue, Properties, Stack, TreeMap, TreeSet, Vector
• Success criterion: block, line, and condition coverage
Random Test Run Length and Effectiveness [Andrews et al., ASE 2008]
• Other system-test subjects: real buffer-overflow bugs
• Data structures: JUnit MoneyBag, TreeMap
• Success criterion: a real bug in MoneyBag, a seeded bug in TreeMap
Delta Execution [d’Amorim et al., ISSTA 2007]
• 9 data structures with manually written drivers: binheap, bst, deque, fibheap, heaparray, queue, stack, treemap, ubstack
• 4 classes in a file system (the Daisy code)
• Requires a driver to close the environment
• Success criterion: # of n-bounded exhaustive tests
Incremental State-Space Exploration [Lauterburg et al., ICSE 2008]
• 9 data structures with manually written drivers: binheap, bst, deque, fibheap, heaparray, queue, stack, treemap, ubstack
• 4 classes in a file system (the Daisy code)
• AODV: a routing protocol for wireless ad hoc networks
• Requires a driver to close the environment
• Success criterion: % time reduction across versions
ARTOO [Ciupa et al., ISSTA 2007]
• Eiffel data structures: STRING (175 methods), PRIMES (75 methods), BOUNDED_STACK (66 methods), HASH_TABLE (135 methods), FRACTION1 (44 methods), FRACTION2 (45 methods), UTILS (32 methods), BANK_ACCOUNT (35 methods)
• Success criterion: # of real bugs
ARTOO [Ciupa et al., ICSE 2008]
• Eiffel data structures: ACTION_SEQUENCE (156 methods), ARRAY (86 methods), ARRAYED_LIST (39 methods), BOUNDED_STACK (62 methods), FIXED_TREE (125 methods), HASH_TABLE (122 methods), LINKED_LIST (106 methods), STRING (171 methods)
• Success criterion: # of real bugs
Randoop [Pacheco et al., ICSE 2007]
• Data structures: BinTree, BHeap, FibHeap, TreeMap
• Java JDK 1.5: java.util (39 KLOC, 204 classes, 1019 methods), javax.xml (14 KLOC, 68 classes, 437 methods)
• Jakarta Commons: chain (8 KLOC, 59 classes, 226 methods), collections (61 KLOC, 402 classes, 2412 methods); see next slide
• Success criterion: # of real bugs
Randoop [Pacheco et al., ICSE 2007] – cont.
• Jakarta Commons (cont.): jelly (14 KLOC, 99 classes, 724 methods), logging (4 KLOC, 9 classes, 140 methods), math (21 KLOC, 111 classes, 910 methods), primitives (6 KLOC, 294 classes, 1908 methods)
• .NET libraries: ZedGraph (33 KLOC, 125 classes, 3096 methods)
• .NET Framework: Mscorlib (185 KLOC, 1439 classes, 17763 methods), System.Data (196 KLOC, 648 classes, 11529 methods), System.Security (9 KLOC, 128 classes, 1175 methods), System.Xml (150 KLOC, 686 classes, 9914 methods), Web.Services (42 KLOC, 304 classes, 2527 methods)
Randoop [Pacheco et al., ISSTA 2008]
• A core component of the .NET Framework (> 100 KLOC)
• Success criterion: # of real bugs
Pex [Tillmann and de Halleux, TAP 2008]
• A core component of the .NET Framework (> 10,000 public methods)
• Selected results presented: 9 classes (> 100 blocks to > 500 blocks)
• Success criterion: block coverage, arc coverage, # of real bugs
MSeqGen [Thummalapenta et al., ESEC/FSE 2009]
• QuickGraph: 165 classes and interfaces, 5 KLOC
• Facebook: 285 classes and interfaces, 40 KLOC
• Success criterion: branch coverage
Dynamic Symbolic Execution Tools
• How about DART, CUTE/jCUTE, CREST, EXE, EGT, KLEE, SAGE, SMART, Splat, Pex, ...?
• Non-OO vs. OO
Summary of Benchmarks
• (Mostly) data-structure (DS) classes only: TestEra [ASE01], Korat [ISSTA02], JCrasher [SP&E04], eToc [ISSTA04], JPF [ISSTA04, ASE05, ISSTA06], Rostra [ASE04], Symstra [TACAS05], Symclat [ASE06], Evacon [ASE08], Nighthawk [ASE07, ASE08], UIUC JPF extensions [ISSTA07, ICSE08], ARTOO [ISSTA07, ICSE08]
• Non-DS classes: Check ’n’ Crash [ICSE05], DSD-Crasher [ISSTA06], Randoop [ICSE07, ISSTA08], Pex [TAP08], MSeqGen [ESEC/FSE09]
Open Questions – history
• Why/how do authors select the benchmarks used in their evaluations? Conversely, why not select other benchmarks? (Your answers here!)
• Reason 1:
• Reason 2:
• ...
Open Questions – history (cont.)
• Are data structures mandatory, like the Siemens programs serve as a sanity check in fault localization?
• Are data structures sufficient? How much would the results generalize to broader types of real-world applications?
• How about libraries (in contrast to applications)? High payoff in terms of testing effort? More logic, more challenging?
Open Questions – future
• Shall we have categorized benchmarks? What categories (general, DS, GUI, DB, Web, network, embedded, string-intensive, pointer-intensive, state-dependent-intensive, ML, ...)?
• What criteria shall we use to include/exclude benchmarks?
• Where to contribute and share? (UNL SIR, a new repository?)
• How to provide cross-language benchmarks?
• How about test oracles (if we care about more than coverage)?
• How about evaluation criteria (structural coverage, seeded faults, uncaught exceptions, ...)?
Open Questions – cont.
• Caveats in benchmarking: tools can be tuned to work best on well-accepted benchmarks yet fail to work or generalize on other applications
• ...