Automatic Software Fault Diagnosis for Complex Enterprise Environments

Presented by: Suman Chander B, Dept. of Computer & Information Sciences, University of Delaware • Automatic Software Fault Diagnosis by Exploiting Application Signatures

Motivation • Application problem diagnosis in complex enterprise environments • Large number of possible causes; most failures are due to runtime interactions with the system environment • Troubleshooting these problems requires extensive experience and time

Overview • Present a black-box approach to diagnosing several application faults • Application signatures • Approach to detecting application faults • Provide detail on tool design and implementation • Evaluate the effectiveness of the tool in diagnosing faulty application behaviour • Case studies to support the approach

Application Behaviour • Factors that help capture application behaviour: • System calls • Signals • Environment variables • Resource limits • Access control • Collecting and keeping history information helps find the root cause of a problem quickly.

Building Signatures... • Choice of attributes: made with a goodness-of-fit test, the Kolmogorov-Smirnov test (KS-test)
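The slides do not say how the KS-test is applied to attribute values, so the following is only a minimal sketch of the two-sample KS statistic itself: the largest gap between the empirical distributions of two samples. The function names, the sample values, and the 0.5 threshold are illustrative assumptions, not part of the tool.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # The empirical CDFs only change at observed values, so check those points.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def looks_consistent(sample_a, sample_b, threshold=0.5):
    """Hypothetical decision rule: accept when the gap stays under a threshold."""
    return ks_statistic(sample_a, sample_b) <= threshold


# Hypothetical attribute values, e.g. byte counts passed to write().
normal_run = [512, 1024, 1024, 2048]
similar_run = [512, 1024, 2048, 2048]
shifted_run = [8192, 8192, 16384, 16384]
```

A small statistic means the two runs draw the attribute from similar distributions; a value near 1 means the distributions barely overlap.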

Signatures for system calls...

Handling multiple processes... • Data is collected for each process separately • Relations between system calls are reflected correctly once interleaved system calls are separated • Some attributes (e.g. signals, UID) are specific to a process • For multithreaded applications, data collection and signatures are built separately for each thread • The current approach does not handle user-level threads
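Separating interleaved system calls by process can be sketched as a simple grouping step. The record layout (a `(pid, syscall)` pair) and the sample data are assumptions made for illustration; the real trace format is not given in the slides.

```python
from collections import defaultdict


def split_by_process(records):
    """Group interleaved (pid, syscall) records into one ordered list per pid."""
    per_process = defaultdict(list)
    for pid, syscall in records:
        per_process[pid].append(syscall)
    return dict(per_process)


# Hypothetical interleaved trace from two processes.
interleaved = [
    (100, "open"),
    (200, "socket"),
    (100, "read"),
    (200, "connect"),
    (100, "close"),
]
streams = split_by_process(interleaved)
```

After the split, each process keeps its own call order, so a sequence such as open, read, close is no longer broken up by calls from another process.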

Tool Design • System Architecture

One...Application Tracer... • The tracer tool launches the target application itself, e.g. ‘tracer application_program’ • Low overhead is crucial • Uses the ptrace interface to build signatures for system calls • Some runtime behaviours (environment variables, resource limits, user ID, etc.) are not tied to system calls
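The attributes that do not come from system calls (environment variables, resource limits, user ID) can be read directly from a process. The sketch below snapshots them for the current Python process on a Unix system; it is not the tracer's own mechanism, which the slides do not describe, and the function name and dictionary layout are assumptions.

```python
import os
import resource


def snapshot_runtime_context():
    """Collect the user ID, environment and two resource limits of this process."""
    return {
        "uid": os.getuid(),
        "env": dict(os.environ),
        "limits": {
            # Each limit is a (soft, hard) pair as returned by getrlimit().
            "RLIMIT_FSIZE": resource.getrlimit(resource.RLIMIT_FSIZE),
            "RLIMIT_NOFILE": resource.getrlimit(resource.RLIMIT_NOFILE),
        },
    }


context = snapshot_runtime_context()
```

Storing such a snapshot next to the system call trace makes it possible to compare, for example, the file size limit of a normal run with that of a failing run.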

Two...Signature Bank...

Three...Fault Diagnosis... • The classifier tool identifies the root cause of a deviation from normal behaviour: • Access the signature bank for normal traces • Compare them with the faulty trace obtained • Determine the root cause of the fault • Report the diagnosis to the user
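The comparison step can be pictured as looking for events in the faulty trace that never occur in the normal signatures. This is a deliberately simplified sketch: the event layout (a system call name paired with an error name or `None`), the sample data, and the `diagnose` function are hypothetical, and the real classifier compares richer attributes.

```python
def diagnose(signature_bank, faulty_trace):
    """Return events of the faulty trace that are absent from the normal signatures."""
    deviations = []
    for event in faulty_trace:
        # Report each unseen event once, in the order it first appears.
        if event not in signature_bank and event not in deviations:
            deviations.append(event)
    return deviations


# Hypothetical data: each event is (system call, error name or None).
signature_bank = {("open", None), ("write", None), ("close", None)}
faulty_trace = [("open", None), ("write", "EFBIG"), ("close", None)]
report = diagnose(signature_bank, faulty_trace)
```

Here the only deviation is a `write` call that failed with `EFBIG`, which is the kind of candidate root cause that would be shown to the user.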

Case Studies

Testing with Apache... • WebStone 2.5 is used to test the tool with Apache • WebStone 2.5 is a free benchmarking tool for web servers • The signature bank was built by performing each operation ten times to generate the corresponding traces • Example: • Faulty execution of the write system call • Unable to write to the log file • Root cause: error number EFBIG, indicating that the file is too large

Testing with Apache...

Observation 1 • Comparison showing the change in trace size over a 45-minute period • 6.3 MB of space holds a recording of nearly 11 million system call invocations

Observation 2 • Comparison of the change in size of the trace file and the signature bank as the number of traces grows • The signature bank grows slowly because redundant data is merged
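The slow growth of the signature bank follows from merging: only events not seen before take up new space. A minimal sketch, assuming the bank is a set of event names (the real bank stores richer signatures, and `merge_trace` is a hypothetical helper):

```python
def merge_trace(bank, trace):
    """Merge a trace into the bank and return how many events were new."""
    size_before = len(bank)
    bank.update(trace)
    return len(bank) - size_before


bank = set()
# Hypothetical traces: the second one repeats most of the first.
added_first = merge_trace(bank, ["open", "read", "close"])
added_second = merge_trace(bank, ["open", "read", "write", "close"])
```

The first trace adds three events, while the second, although longer, adds only one, so the bank grows far more slowly than the total size of the collected traces.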

From other tests... • CVS • Average slowdown – 29.6% • Collected 26 traces ranging from 0.1 MB to 1.6 MB • Recorded signature bank is 6.5 MB consisting of about 1.8 million system calls • PostgreSQL • Average slowdown – 15.7% • Collected traces ranging from 0.6 MB to 2.1 MB • Recorded signature bank is 3.2 MB

Limiting false positives • The first cause is related to the KS-test • The second cause is that the signature bank cannot cover all normal variations of the attributes • Aggregating more traces would make the bank more complete and gradually reduce false positives

Performance measure • Most of the overhead comes from collecting information and updating the trace file whenever a system call happens • Overheads that occur: • Switching from the kernel to the tracer and back, at both system call entry and exit • Retrieving the system call number, return value and related attributes • Reading the user stack to get its contents • Improvement obtained by extending the ptrace code with two new primitives: PTRACE_SETBATCHSIZE and PTRACE_READBUFFER

Improvement...

Limitations • Labelling an application execution trace as faulty • Manual indication required • Conservative approach to the amount of information captured for a trace • More analysis is required to identify the minimum set of data that gives higher accuracy in detecting problems • Results are limited to a few case studies

THANK YOU
