Discovering Interesting Subsets Using Statistical Analysis. Maitreya Natu and Girish K. Palshikar, Tata Research Development and Design Centre (TRDDC), Pune, MH, India, 411013. {maitreya.natu, gk.palshikar}@tcs.com
Concept of Interesting Subsets • In many real-life applications, it is important to automatically discover subsets of records that are interesting with respect to a given measure • Database of customer support tickets: subsets of tickets that have very high or very low service time • Database of employee satisfaction survey: subsets of employees that have a very high or very low satisfaction index • Database of sales orders: subsets of orders that took a very long or very short time to fulfill
Insights from Interesting Subsets • Interesting subsets provide insights for improving business processes, e.g., • Identification of • bottlenecks • areas of improvement • ways to increase per-person productivity • What-if/Impact analysis • Finding the most effective way to improve the overall service time by x%
Interesting Subset Discovery (ISD) vs. Other Related Work • Anomaly detection • ISD focuses on finding interesting subsets, rather than individual interesting records • Top-k heavy hitter analysis • In top-k heavy hitter analysis, each record in the top-k subset has an unusual (high or low) value for the measure • ISD instead identifies common characteristics of the records in the interesting subsets (rather than individual interesting records)
Application to the domain of customer support tickets • Attributes: • Timestamp-begin, Timestamp-end, Priority, Location, Resource, Problem, Solution, Solution-provider, etc. • Service-time • Discovery of interesting subsets of tickets that have very high service times compared to the rest of the tickets
Two Central Questions • How to construct subsets of records? • How to measure interestingness of records?
How to Construct Subsets of Records? • SQL-like SELECT commands provide an intuitive way for the end user to characterize and understand a subset of records • We systematically explore subsets of records of a given table by progressively refining the condition part of the SELECT command
How to Construct Subsets of Records? • Attributes of customer support database records: • PR (Priority), CT (Category), AC (Affected City) • Domain of each attribute • D_PR = {L, M, H}, D_CT = {A, B}, D_AC = {X} • A descriptor {(PR, L), (AC, ‘New York’)} corresponds to the subset of records selected using • SELECT * FROM D WHERE PR = 'L' AND AC = 'New York' • The level of a descriptor is defined as the number of attributes in the descriptor
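The mapping from a descriptor to a subset of records can be sketched in a few lines of Python. This is only an illustration: the helper names select and level, the in-memory list of dictionaries and the toy ticket values are assumptions, not part of the original slides.

```python
from typing import Dict, List

Record = Dict[str, object]
Descriptor = Dict[str, str]  # attribute name -> required value


def select(records: List[Record], descriptor: Descriptor) -> List[Record]:
    """Return the records matching every (attribute, value) pair,
    i.e. the rows a SELECT ... WHERE a1 = v1 AND a2 = v2 would return."""
    return [r for r in records
            if all(r.get(attr) == val for attr, val in descriptor.items())]


def level(descriptor: Descriptor) -> int:
    """Level of a descriptor = number of attributes it constrains."""
    return len(descriptor)


# Hypothetical toy tickets (values invented for illustration).
tickets = [
    {"PR": "L", "CT": "A", "AC": "New York", "service_time": 12.0},
    {"PR": "L", "CT": "B", "AC": "Boston", "service_time": 3.0},
    {"PR": "H", "CT": "A", "AC": "New York", "service_time": 5.0},
]
descriptor = {"PR": "L", "AC": "New York"}
subset = select(tickets, descriptor)
```

Representing a descriptor as a dictionary keeps one value per attribute, which matches the conjunctive WHERE clause above.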
How to Measure Interestingness of Records? • A : subset of database D • A’ : D – A, records in D that are not in A • Φ(A) : set of measure values for records in A • Φ(A’) : set of measure values for records in A’ • We say A is an interesting subset if Φ(A) is statistically different from Φ(A’) • E.g.: In the customer support database, a subset A would be interesting if the service times of tickets in A are statistically very different from the service times of the rest of the tickets in A’
How to Measure Interestingness of Records? • More formally, A is an interesting subset of D if the probability distribution of the values in Φ(A) is very different from that of the values in Φ(A’) • We use statistical hypothesis tests (Student’s t-test) to measure interestingness • Note that we focus on the interestingness of subsets of tickets rather than of individual tickets
Student’s t-test • The null hypothesis of Student’s t-test is that the means of two sets do not differ significantly • Let X and Y be the two sets of numbers, of sizes n1 and n2 • The t-statistic for the unpaired sets X and Y assumes unequal variances and unequal sizes, and tests whether the means of the two sets are statistically different • The denominator of the t-statistic is a measure of the variability of the data and is called the standard error of the difference • The t-test then calculates a p-value, which is the probability of obtaining a result at least as extreme as the one actually observed, given that the null hypothesis is true • If the p-value is below a threshold, the null hypothesis is rejected
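The formula for the t-statistic did not survive in this text. For reference, the standard unequal-variance (Welch) form that matches the description is given below. The symbols for the sample means and the sample variances s_1^2 and s_2^2 are assumed notation; only n_1 and n_2 appear in the slide.

```latex
t = \frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
```

The square root in the denominator is the standard error of the difference between the two means.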
How to Measure Interestingness of Records? • For each subset of tickets A and its complement A’, we run Student’s t-test on the performance metric values and compute a t-value and a p-value • The t-value is positive if the mean of the first subset is larger than the mean of the second subset, and negative if it is smaller • The p-value is the probability of observing a difference at least this large if A and A’ had the same mean; the smaller the p-value, the stronger the evidence that A is statistically different from its complement A’
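A minimal sketch of this computation, using only the Python standard library, might look as follows. The function name welch_t_test and the sample service times are hypothetical. The p-value here uses a normal approximation to the t distribution, which is an assumption that is only reasonable for fairly large samples; an exact p-value would need the t distribution with Welch-Satterthwaite degrees of freedom.

```python
from math import sqrt
from statistics import NormalDist, mean, variance
from typing import Sequence, Tuple


def welch_t_test(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Return (t, approximate two-sided p) for two unpaired samples.

    Each sample needs at least two values and the combined standard
    error must be non-zero.
    """
    # Standard error of the difference (unequal variances, unequal sizes).
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    t = (mean(a) - mean(b)) / se
    # Assumption: normal approximation instead of the exact t distribution.
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p


# Hypothetical service times for a subset A and its complement A'.
subset_a = [30.0, 34.0, 29.0, 35.0, 32.0, 31.0]
complement = [10.0, 12.0, 9.0, 11.0, 13.0, 10.0, 12.0, 11.0]
t_value, p_value = welch_t_test(subset_a, complement)
```

With these numbers the t-value is positive because the subset's mean service time is higher than that of its complement; swapping the arguments flips the sign.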
How to Deal with the Large Search Space? • Search space of all attribute combinations can be very large in real data sets • We present three heuristics to prune the search space • Size heuristic • Goodness heuristic • p-prediction heuristic
How to Deal with the Large Search Space? • [Figure: stage-wise construction of candidate subsets from the Priority values L, M, H and I. Stage 1 lists the single values, Stage 2 the pairs of values, and Stage 3 the triples of values.]
How to Deal with the Large Search Space? • Size heuristic • Subsets of very small size can be noisy, leading to incorrect inference of interesting subsets • We apply a threshold Ms and do not explore subsets of size less than Ms
How to Deal with the Large Search Space? • Goodness heuristic • In the case of customer support tickets we are interested in identification of tickets with large service times • The set of tickets with service time significantly smaller than the rest of the tickets can be pruned • Prune a set if the t-test result has a t-value < 0 and p-value < Mg
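The size and goodness heuristics amount to a simple filter on each candidate subset. The sketch below assumes the t-value and p-value of the subset against its complement have already been computed; the function name keep_for_expansion and the parameter names are hypothetical.

```python
def keep_for_expansion(size: int, t_value: float, p_value: float,
                       m_s: int, m_g: float) -> bool:
    """Decide whether a subset should be refined into higher-level subsets."""
    # Size heuristic: very small subsets are too noisy to explore.
    if size < m_s:
        return False
    # Goodness heuristic: drop subsets whose measure is significantly
    # SMALLER than the rest (t < 0 with a small p-value).
    if t_value < 0 and p_value < m_g:
        return False
    return True
```

For example, with m_s = 10 and m_g = 0.05, a subset of 50 tickets with t = -4.0 and p = 0.001 is pruned, while one with t = -0.5 and p = 0.6 is kept because its service time is not significantly lower.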
How to Deal with the Large Search Space? • p-prediction heuristic • A level k subset is built from two level k-1 subsets • We observed that if two level k-1 subsets are statistically very different from each other, then the level k subset built from them is likely to be less different from its complement • The heuristic prevents the combination of two subsets that are statistically very different from each other, where the statistical difference is measured by the t-test
Accuracy of the p-prediction Heuristic • For sets with p-values p1 and p2 and mutual p-value p12, the p-value prediction heuristic states that • If p12 < Mp then p3 > min(p1, p2), where p3 is the p-value of the combined set • Accuracy = percentage of subset pairs with p12 less than Mp for which the p-value prediction property holds
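The accuracy measure defined above can be computed directly from recorded p-values. In this sketch the function name p_prediction_accuracy and the observation tuples are hypothetical; the numbers are invented purely to show the arithmetic.

```python
from typing import Iterable, Optional, Tuple


def p_prediction_accuracy(
    pairs: Iterable[Tuple[float, float, float, float]], m_p: float
) -> Optional[float]:
    """pairs holds (p1, p2, p12, p3) for each combined subset.

    Returns the percentage of pairs with p12 < m_p for which
    p3 > min(p1, p2), or None if no pair has p12 < m_p.
    """
    eligible = [(p1, p2, p3) for p1, p2, p12, p3 in pairs if p12 < m_p]
    if not eligible:
        return None
    hits = sum(1 for p1, p2, p3 in eligible if p3 > min(p1, p2))
    return 100.0 * hits / len(eligible)


# Invented observations: (p1, p2, p12, p3)
observations = [
    (0.010, 0.020, 0.001, 0.300),  # p12 < Mp, property holds
    (0.010, 0.020, 0.002, 0.005),  # p12 < Mp, property fails
    (0.030, 0.040, 0.500, 0.001),  # p12 >= Mp, not counted
    (0.200, 0.100, 0.010, 0.400),  # p12 < Mp, property holds
]
accuracy = p_prediction_accuracy(observations, m_p=0.05)
```

Here three pairs have p12 below Mp and the property holds for two of them, so the accuracy is two thirds, about 66.7%.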
Interesting Subset Discovery Algorithm • Build a level k subset from two level k-1 subsets • Two level k-1 subsets can be combined if they differ in exactly one attribute-value pair • Check the p-prediction heuristic and skip the set if the mutual p-value of the two level k-1 sets is less than Mp • Compute the interestingness of the subset by applying the t-test • If the t-value and p-value of the set indicate statistical significance, store the set descriptor in the result set R • Apply the size and goodness heuristics to level k sets to decide whether they should be used for building sets of subsequent levels • Sort the result set R by increasing p-value
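The steps above can be sketched as a level-wise search. This is a simplified illustration under stated assumptions, not the authors' implementation: the function names, default thresholds and toy data are hypothetical, "significant" is taken to mean t > 0 and p < alpha, and the p-value uses a normal approximation to the t distribution.

```python
from itertools import combinations
from math import sqrt
from statistics import NormalDist, mean, variance


def welch(a, b):
    """Return (t, approximate two-sided p) for two unpaired samples."""
    if len(a) < 2 or len(b) < 2:
        return 0.0, 1.0  # too little data: treated as not different
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0, 1.0  # degenerate case: treated as not comparable
    t = (mean(a) - mean(b)) / se
    # Assumption: normal approximation instead of the exact t distribution.
    return t, 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def isd(records, attrs, measure, max_level=2,
        m_s=3, m_g=0.05, m_p=0.05, alpha=0.05):
    def values(desc, inside=True):
        # Measure values of the records matching (or not matching) desc.
        return [r[measure] for r in records
                if all(r[a] == v for a, v in desc) == inside]

    results = []
    # Level 1: one descriptor per attribute-value pair seen in the data.
    level = {frozenset([(a, r[a])]) for r in records for a in attrs}
    for k in range(1, max_level + 1):
        survivors = []
        for desc in level:
            t, p = welch(values(desc), values(desc, inside=False))
            if t > 0 and p < alpha:          # assumed significance rule
                results.append((p, sorted(desc)))
            big_enough = len(values(desc)) >= m_s        # size heuristic
            too_good = t < 0 and p < m_g                 # goodness heuristic
            if big_enough and not too_good:
                survivors.append(desc)
        if k == max_level:
            break
        level = set()
        for d1, d2 in combinations(survivors, 2):
            merged = d1 | d2
            # Keep only merges that add exactly one new attribute.
            if len(merged) != k + 1 or len({a for a, _ in merged}) != k + 1:
                continue
            # p-prediction heuristic: skip mutually very different parents.
            if welch(values(d1), values(d2))[1] < m_p:
                continue
            level.add(merged)
    results.sort(key=lambda item: item[0])  # increasing p-value
    return results


# Hypothetical tickets: only (PR = H, AC = NY) has high service times.
groups = {("H", "NY"): [50.0, 52.0, 48.0, 51.0],
          ("H", "BOS"): [10.0, 11.0, 9.0, 12.0],
          ("L", "NY"): [10.0, 12.0, 11.0, 9.0],
          ("L", "BOS"): [10.0, 11.0, 12.0, 9.0]}
tickets = [{"PR": pr, "AC": ac, "service_time": s}
           for (pr, ac), times in groups.items() for s in times]
results = isd(tickets, ["PR", "AC"], "service_time")
```

On this toy data the level 2 descriptor (AC = NY, PR = H) ranks first, while PR = L and AC = BOS are dropped by the goodness heuristic before level 2.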
Interesting Subset Discovery Algorithm (using sampling) • The ISD algorithm reduces the search space using various heuristics, but for very large data sets (on the order of millions of records) the search space can still be very large • We hence propose interesting subset discovery using sampling of the data set
Interesting Subset Discovery Algorithm (using sampling) • The algorithm is based on the following observations: • A small number of interesting subsets that give major insight into the functioning and improvement of the system is preferred over a large number • Such sets give immediate action items for major improvement • Out of all the interesting subsets, the subsets that have a major impact on the overall system performance are of more importance • Such sets can provide insight into the areas of system improvement that can have maximum impact
Interesting Subset Discovery Algorithm (using sampling) • Proposed algorithm • Take samples of the original data set and run the ISD algorithm on the samples • Rank the results of all the runs based on the number of occurrences of a subset descriptor in the results of different samples • The larger the number of occurrences, the higher the rank • If the rank is less than a predefined threshold, then remove the subset from the result
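A rough sketch of the sampling scheme follows. It assumes the ISD step is supplied as a callable run_isd that returns hashable subset descriptors, and it interprets the rank threshold as a minimum number of occurrences across samples. The names isd_with_sampling, run_isd and min_occurrences, and the default values, are hypothetical.

```python
import random
from collections import Counter


def isd_with_sampling(records, run_isd, n_samples=10, sample_size=1000,
                      min_occurrences=5, seed=0):
    """Run run_isd on random samples and keep frequently found descriptors.

    Returns (descriptor, occurrences) pairs, most frequent first.
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        # Sample without replacement, capped at the data set size.
        sample = rng.sample(records, min(sample_size, len(records)))
        # Count each descriptor at most once per sample.
        counts.update(set(run_isd(sample)))
    # Assumed reading of the rank threshold: drop rarely found descriptors.
    return [(desc, n) for desc, n in counts.most_common()
            if n >= min_occurrences]
```

Counting occurrences across samples favors descriptors that are found repeatedly, in line with the preference for a small number of high-impact subsets.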
Experimental Results • Experimental setup • Data set of service request records of the IT division of a major financial institution • 6000 records • 7 attributes (PR, AC, ABU, AI, CS, CT, CD) with domain sizes (4, 23, 29, 48, 4, 9, 7) respectively • Each record contains a Service_Time attribute as a performance metric • We compare the results of the ISD algorithm with Brute Force and Random algorithms • Brute Force: algorithm based on combinatorial search of the set space • Random: randomly select k set descriptors for a level l and compute their interestingness; perform multiple such runs
Experimental Results • We successfully identified subsets of records with significantly high service time • We ran the algorithm from level 1 to 5 • Level 1 results contain large subsets defined by a single attribute-value pair • We found that tickets from a specific day of the week and tickets from a specific city had significantly higher service times than the rest of the tickets • With higher levels we were able to do finer analysis of ticket properties • The discovered interesting subsets pointed to the improvement areas that can have the highest impact on the overall service time of the system
Comparison with ISD_R algorithm: Coverage and Accuracy • Coverage = % of ISD_R results covered by ISD • 100% coverage • Accuracy = Average accuracy with which the ISD algorithm covers the ISD_R set descriptors • 80% to 90% accuracy
Comparison of ISD_HS with ISD_H • [Figure: two plots, one showing the % of ISD_HS results that match ISD_H and one showing the % of ISD_H results that are covered by ISD_HS]
Conclusion • We presented algorithms for discovery of interesting subsets from a given database of records with respect to a given quantitative measure • We proposed various heuristics to prune the search space • We presented an experimental evaluation of the proposed algorithm by applying it to the service request records of the IT division of a major financial institution • The discovered interesting subsets proved insightful for the given data set and point to improvements in the business processes
Future Work • Strengthening of heuristics to further reduce the search space • Use the full power of SQL commands to systematically explore more complex subsets • Application of the algorithm to real-life datasets from different domains