An Empirical Study of Reported Bugs in Server Software with Implications for Automated Bug Diagnosis Swarup Kumar Sahoo, John Criswell, Vikram Adve Department of Computer Science University of Illinois at Urbana-Champaign
Motivation • In-the-field software failures are becoming increasingly common • Software failures result in losses of billions of dollars every year [Charette et al., IEEE Spectrum, 2005] • Increasing the reliability of systems is critical • Off-site analysis of production-run failures is difficult • Difficulty in reproducing failures at the development site • The same bug may generate different faults at multiple production sites • Customers have privacy concerns
Motivation – Production Site Diagnosis • Problem: Failures need to be reproduced quickly, but checkpoint-based replay limits the usefulness of existing tools • Question: Will a simple restart/replay mechanism work? • Problem: Minimal test case generation is too slow • Question: Can knowledge of fault types and the number of inputs help? • To answer these questions, we need to understand the characteristics of software bugs
Application Selection • Server applications are widely used and mission critical • Server applications are challenging for diagnosis • Run for long periods of time (-) • Handle large amounts of data (-) • Concurrent (-) • Inputs are well-structured (+) • We studied 266 randomly selected bug reports and 30 extra concurrency bug reports from 6 servers (Apache, Squid, Tomcat, sshd, SVN, MySQL)* • *A detailed spreadsheet of the bugs can be found at http://sva.cs.illinois.edu/ICSE2010/bug_statistics.xls
Goals and key results of the study • How many inputs are needed to trigger the symptoms? • 77% of the bugs need just one input (12 of 266 bugs need more than 3) • How much time passes from the first fault-triggering input to the symptom? • For 57% of multi-input failures, all inputs are likely to occur within a short time • Time between the first fault-triggering input and the symptom is usually small • Which symptoms appear as a manifestation of bugs? • The majority (63%) of bugs result in incorrect outputs • Two applications have fewer incorrect outputs • What fraction of failures is deterministic? • 82% of bugs showed deterministic behavior • Very few bugs are concurrency bugs; nearly all of those are non-deterministic, need many more inputs, and produce fewer incorrect outputs
Outline • Motivation and Findings • Methodology and Limitations • Definitions and Terminology • Classification of Software Bugs • Analysis of Multiple Input Bugs • Concurrency Bugs • Implications • Conclusions and Future Work
Bug Selection • Selected a recent major version of each application that had been in production use for at least a year • Selected a set of bugs from the bug database using a set of filters (Status field RESOLVED, Resolution field FIXED) • Randomly selected a set of bugs from that list using a seeded rand() function • Result: 472 server bugs
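The seeded random selection step can be illustrated with a short sketch. This is not the authors' script: the function name `sample_bug_ids`, the seed value, and the ID range are hypothetical, and Python's `random` module stands in for the seeded rand() call mentioned on the slide.

```python
import random

def sample_bug_ids(bug_ids, k, seed):
    """Return k bug IDs drawn without replacement, reproducibly for a given seed."""
    rng = random.Random(seed)              # private generator; global random state is untouched
    return rng.sample(sorted(bug_ids), k)  # sort first so the input order cannot change the draw

if __name__ == "__main__":
    candidates = range(1000, 1100)         # hypothetical IDs of RESOLVED/FIXED reports
    print(sample_bug_ids(candidates, 5, seed=2010))
```

Fixing the seed means the same sample can be regenerated later, which makes the selection auditable.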
Bug Selection • Manual Filtering • Removed bugs in development code versions • Removed trivial bugs like build errors, documentation errors etc. • After filtering, 266 bugs remained out of 472 bugs • We analyzed each bug (reports, test cases, patches) • Classified them into different categories based on • Bug symptom • Reproducibility • #inputs
Limitations • Servers only • Studied a subset of server applications • Only two programming languages • 5 were in C/C++, 1 in Java • Reported bugs only • Unreported bugs are likely to be less frequent • Difficult-to-reproduce bugs are possibly less likely to get reported • Fixed bugs only • Bugs unfixed for a long time may have different properties • Human error
Definitions and Terminology • An input is • A logical input from client to server at the application level • Login input, HTTP request, SQL query, command from SSH client • An input is not • A message coming from a source other than the client • File system, back-end databases, DNS queries • An input creating persistent environment • SVN checkout command, create/insert/delete commands in database
Example (MySQL session):
Login
Select Database db1
Set sql_mode = FULL_GROUP_BY
Insert into foo values (1,2)
Select count(*) from foo group by a
Example (HTTP request):
POST /login.jsp HTTP/1.1
Host: www.mysite.com
User-Agent: Mozilla/4.0
Content-Length: 27
Content-Type: application/x-www-form-urlencoded
userid=joe&password=guessme…..
Definitions and Terminology • Symptoms • Incorrect program behavior which is externally visible • Incorrect Output • External program output is different from the correct output without any catastrophic symptom
Definitions and Terminology • Deterministic Bug • Triggers the same symptom each time the application is run with the same set of inputs in the same order on a fixed platform • Timing-Dependent Bug • Timing, in addition to input order, determines whether the symptom is triggered • A special case of non-deterministic bug • Ex: An input arriving before a download input completes crashes the server • Non-deterministic Bug • Symptom may not be triggered each time the same requests are input into the application in the same order
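The deterministic versus non-deterministic distinction can be made concrete with a small sketch. It assumes a hypothetical experiment in which the same inputs are replayed in the same order several times and the observed symptom of each replay is recorded; the function name and labels are illustrative, not taken from the study. A finite number of replays can only suggest determinism, and the sketch does not separate timing-dependent bugs, which would require varying the input timing.

```python
def classify_reproducibility(symptoms):
    """Classify a bug from the symptom seen on each replay of the same inputs in the same order.

    `symptoms` holds one entry per replay; None means no symptom appeared.
    """
    if not symptoms:
        raise ValueError("need at least one replay")
    observed = set(symptoms)
    if observed == {None}:
        return "not reproduced"        # the symptom never appeared
    if len(observed) == 1:
        return "deterministic"         # same symptom on every replay
    return "non-deterministic"         # symptom missing or different on some replays
```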
Bug Symptoms • Most of the bugs (63%) result in incorrect outputs • *Memory errors include segmentation faults, memory leaks, NULL pointer exceptions, etc.
Bug Symptoms • Squid and Tomcat have fewer incorrect outputs • Many more assertion violations (23%-28%)
Bug Symptoms - Implications • Implications • New techniques needed to detect incorrect outputs at run time • Adding assertions or automatically generated program invariants may help in detecting incorrect outputs
Bug Reproducibility • 82% show deterministic behavior (similar to Chandra et al., DSN’02) • Few show timing-dependent or non-deterministic behavior
Bug Reproducibility - Implications • Implications • Tools should be able to reproduce most bugs by replaying inputs • Need new techniques to reproduce small fraction of bugs classified as timing-dependent or non-deterministic • Time Stamping inputs or controlling thread scheduling
Number of Bug Triggering Inputs Excluding Session Setup Inputs • Nearly 77% of the bugs need a single input to trigger • 11% needed more than one input • Apache/SVN need a maximum of 2 inputs, Squid/Tomcat 3 inputs • Only 12 bugs (excluding the unclear cases) need more than 3 inputs • Remaining 11% were unclear from the reports
Number of Bug Triggering Inputs - Implications • Implications • Most of the bugs can be reproduced with just a single input • Nearly all of the bugs can be reproduced with a small number of inputs • A few inputs from the session that triggers the bug are enough • Failure symptom occurs shortly after the last faulty input is received (see paper) • Except hang or time-out bugs
Detailed Analysis • Classification of 22 non-deterministic bugs • Classification of 30 multi-input bugs
Analysis of Multiple Input Bugs • Goal: Time from first fault-triggering input to last input • Classified into three categories • Clustered: input requests must occur within some time bound • Ex: All inputs should occur within socket timeout period • Likely clustered: fault-triggering inputs are likely to occur within a short duration for most cases • Ex: Two successive login requests with wrong passwords • Arbitrary: there is nothing to indicate that inputs must be or are usually clustered within a short duration • Ex: Request a static file, Request the same file again
Analysis of Multiple Input Bugs • Out of 30 multi-input bugs • 8 were Clustered • 9 were likely clustered • 13 were Arbitrary
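As a quick arithmetic check, the clustered and likely clustered categories together appear to account for the 57% figure quoted in the key results. The sketch below only recomputes that share from the three counts on this slide; the function and dictionary names are illustrative.

```python
def short_window_share(counts):
    """Fraction of multi-input bugs whose inputs must or usually do occur close together."""
    total = sum(counts.values())
    short = counts["clustered"] + counts["likely clustered"]
    return short, total, short / total

if __name__ == "__main__":
    counts = {"clustered": 8, "likely clustered": 9, "arbitrary": 13}
    short, total, share = short_window_share(counts)
    print(f"{short}/{total} = {share:.0%}")  # prints "17/30 = 57%"
```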
Analysis of Multiple Input Bugs • Implications • The majority of multi-input bugs will trigger the symptom shortly after the first faulty input • Replay tools need to buffer session inputs and a small suffix of the inputs • Locality of the faulty inputs within an input stream can simplify creation of a reduced test case
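The buffering implication can be sketched as a small data structure: keep every session setup input, plus a bounded window of the most recent other inputs. This is a minimal illustration under stated assumptions, not a description of any existing replay tool; the class name `ReplayBuffer`, the `is_session_setup` flag, and the default window of 3 are hypothetical choices.

```python
from collections import deque

class ReplayBuffer:
    """Keeps all session setup inputs and only the last `suffix_len` other inputs."""

    def __init__(self, suffix_len=3):
        self.session_inputs = []                 # e.g. login, database selection
        self.recent = deque(maxlen=suffix_len)   # oldest entries are dropped automatically

    def record(self, request, is_session_setup=False):
        if is_session_setup:
            self.session_inputs.append(request)
        else:
            self.recent.append(request)

    def replay_sequence(self):
        """Inputs to send to a restarted server: session setup first, then the recent suffix."""
        return self.session_inputs + list(self.recent)
```

A bounded deque keeps memory use constant on a long-running server, which matches the observation that only a small suffix of the input stream is usually needed.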
Study of Concurrency Bugs • Found very few (3) concurrency bugs in our bug set • Perhaps because servers process each input relatively independently • Even for multi-threaded servers (Apache, MySQL, Tomcat) • Separately selected 30 extra concurrency bugs • From 3 server applications (Apache, MySQL, Tomcat) • Searched on keywords like ’race(s),’ ’atomic,’ ’concurrency,’ ’deadlock,’ ’lock(s),’ and ’mutex(s)’ • 23 were data race/atomicity violation bugs, 5 were deadlock bugs, 2 were not clear
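The keyword search used to collect the extra concurrency bugs can be approximated with a regular expression. This is a sketch only: the study does not say how the search was run, so the whole-word matching, the case insensitivity, and the acceptance of both "mutexs" and "mutexes" are assumptions, and `mentions_concurrency` is a hypothetical helper name.

```python
import re

# Keywords from the slide: race(s), atomic, concurrency, deadlock, lock(s), mutex(s).
# \b restricts matches to whole words, so "trace" or "block" do not count.
CONCURRENCY_KEYWORDS = re.compile(
    r"\b(races?|atomic|concurrency|deadlock|locks?|mutex(?:e?s)?)\b",
    re.IGNORECASE,
)

def mentions_concurrency(report_text):
    """Return True if a bug report contains one of the concurrency keywords."""
    return CONCURRENCY_KEYWORDS.search(report_text) is not None
```

A keyword match only nominates a report as a candidate; each hit would still need manual inspection, since words like "lock" also appear in reports unrelated to threading.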
Concurrency Bug Symptom Classification • A much higher fraction of bugs are hangs or crashes • Far fewer incorrect outputs (20% overall, but 45% in MySQL) • Five (17%) of the concurrency bugs produced different symptoms in different executions
Concurrency Bug Reproducibility Most of the bugs (87% overall, and 100% in Apache, Tomcat) show non-deterministic behavior.
Concurrency Bug Input Characteristics • All bugs need multiple inputs (>1) to trigger a symptom (excluding session setup inputs) • Some of the cases need a large number of inputs • Many bugs needed executions with multiple threads and multiple client connections for some time • Most bugs can usually be triggered using 2 or 3 threads or client connections
Implications for Concurrency Bugs • Very few reported bugs are concurrency bugs • Implications for tools targeting concurrency bugs • Need new techniques to reliably reproduce symptoms • Need to buffer a larger number of inputs • Need to use inputs from multiple different client connections • Validation of results for overall reported bugs • Study of concurrency bugs successfully identified non-deterministic behavior and the need for multiple inputs • Similar methodology found a very low occurrence of these behaviors for overall reported bugs
Implications for Automated Tools • Diagnosis tools like DDmin (implements delta debugging) [Zeller et al., TOSE 02] • Test small suffixes of inputs before trying a more general algorithm • One can possibly try subsets of small sizes • From our results, trying subsets of 2 or 3 inputs should work for most bugs • Diagnosis tools like Triage [Tucek et al., SOSP 08] • Can reduce the input stream to a much smaller set • Symptoms can possibly be triggered by restarting the server and replaying a small number of inputs after the session establishment inputs • Alleviates the need for checkpointing
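The suffix-first and small-subset strategy suggested for input minimization can be sketched as follows. This is an illustration of the idea on the slide, not DDmin or Triage themselves. The `fails` callback is a hypothetical oracle that restarts the server, replays the session setup inputs plus the candidate inputs, and reports whether the symptom appears; the limits of 3 are assumptions drawn from the reported input counts.

```python
from itertools import combinations

def find_small_trigger(inputs, fails, max_suffix=3, max_subset=3):
    """Look for a short failing sub-sequence before resorting to a general algorithm.

    Returns the first failing candidate found, or None if the caller should
    fall back to a general minimizer such as delta debugging.
    """
    # Step 1: try the shortest suffixes first, since symptoms usually follow
    # shortly after the last faulty input.
    for n in range(1, min(max_suffix, len(inputs)) + 1):
        candidate = list(inputs[-n:])
        if fails(candidate):
            return candidate
    # Step 2: try all order-preserving subsets of small size.
    for size in range(1, min(max_subset, len(inputs)) + 1):
        for candidate in combinations(inputs, size):
            if fails(list(candidate)):
                return list(candidate)
    return None
```

The number of subsets in step 2 grows polynomially with the stream length for a fixed size limit, so in practice it would be applied to a buffered window of recent inputs and not to the whole history.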
Conclusion and Future Work • We report the results of an empirical study of server bugs • Most of the bugs were deterministic • Most of the bugs (77%) needed a single input • The sets of inputs for multi-input bugs are usually small and clustered • Many bugs produce incorrect outputs • Very few bugs are concurrency bugs • Most of the concurrency bugs need multiple inputs • Future work • Create light-weight detectors for incorrect outputs • Build production-site automated tools • To automatically diagnose root cause at the production site • Reproduce failures • Reduce the input stream to a minimal faulty set