This paper discusses a linear method for detecting deviations in large databases using a dissimilarity function. Experimental results and conclusions are presented.
Data Mining: A Linear Method for Deviation Detection in Large Databases • Presented by: Ali Triki • Date: 09/30/1999
Content • What are Deviations? • Approach • Exact Exception Problem • Sequential Exception Problem • Algorithm • Dissimilarity Function • Experimental Results • Conclusion
What are Deviations? • Deviations are usually regarded as errors or noise in data • Several approaches for detecting deviations (or exceptions) exist in the areas of Databases and Machine Learning • Statistical approach (Hoaglin 1983) • Extending learning algorithms to cope with small amounts of noise (Aha 1991) • Impact of erroneous examples on the learning results (Quinlan 1986)
Approach • Use the implicit redundancy in the data to detect deviations. • Cluster the data into two groups: deviations and non-deviations. • Do not discard deviations as noise; instead, try to isolate small minorities.
Exact Exception Problem • Problem description • Set of items: I = {1, 4, 4, 4} • Cardinality function: C(I) • Dissimilarity function: D(I), here the variance of the numbers in the set, $D(I) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ • Smoothing factor: $SF(I_j) = C(I - I_j) \cdot (D(I) - D(I - I_j))$ • Computing the smoothing factor for each candidate exception set $I_j$ gives the following results:
Example • The candidate set {1} is the exception because it has the largest smoothing factor SF
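The exact exception example can be checked with a short brute-force sketch. It assumes that C simply counts the elements of a set and that D is the population variance, as on the slide; the function names `variance` and `smoothing_factors` are illustrative and not taken from the paper.

```python
from itertools import combinations


def variance(values):
    """Population variance of a non-empty list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def smoothing_factors(items):
    """Return {candidate exception set: SF} for every proper, non-empty subset."""
    d_all = variance(items)
    result = {}
    for size in range(1, len(items)):
        for idx in combinations(range(len(items)), size):
            candidate = tuple(sorted(items[i] for i in idx))
            rest = [x for i, x in enumerate(items) if i not in idx]
            # SF(I_j) = C(I - I_j) * (D(I) - D(I - I_j)); C is assumed to be the element count.
            result[candidate] = len(rest) * (d_all - variance(rest))
    return result


sf = smoothing_factors([1, 4, 4, 4])
best = max(sf, key=sf.get)
for candidate, value in sorted(sf.items(), key=lambda kv: -kv[1]):
    print(candidate, round(value, 4))
print("exception set:", best)
```

Enumerating all subsets is exponential in the number of items, which is why the presentation moves on to the sequential formulation.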
Sequential Exception Problem • After seeing a series of similar data, an element disturbing the series is considered an exception • Given: • A set of items I • A sequence S of subsets $I_j \subseteq I$ with $I_{j-1} \subset I_j$ • Cardinality function C • Smoothing factor: $SF(I_j) = C(I_j - I_{j-1}) \cdot (D(I_j) - D(I_{j-1}))$ • The smoothing factor considers the difference from the preceding set instead of the complementary set
Algorithm • 1. Take the first element $i_1$ of the item set I to form the one-element subset $I_1 \subseteq I$, and compute $D_s(I_1)$ • 2. For each following element $i_j$ in S, create the subset $I_j = I_{j-1} \cup \{i_j\}$ and compute the difference in dissimilarity values $d_j = D_s(I_j) - D_s(I_{j-1})$ • 3. Take the element $i_j$ with the maximal value of $d_j > 0$ as the answer for this iteration. • If $d_j \le 0$ for all $I_j$ in S, there is no exception
Algorithm • If an exception $i_j$ is found: • For each element $i_k$ with $k > j$, compute • $d_{k0} = D_s(I_{j-1} \cup \{i_k\}) - D_s(I_{j-1})$ • $d_{k1} = D_s(I_j \cup \{i_k\}) - D_s(I_j)$ • Add to $I_x$ those $i_k$ for which $d_{k0} - d_{k1} \ge d_j$ • Over m iterations we get m competing exception sets $I_x$; select the one with the largest difference in dissimilarity $d_j$, scaled with the cardinality function C
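A single iteration of the algorithm steps can be sketched as follows. This is a minimal illustration under stated assumptions: the dissimilarity function is taken to be the population variance (any function on a list could be passed as `dissim`), the selection among m competing runs is left out, and the name `sequential_exceptions` is hypothetical. For readability the sketch recomputes the dissimilarity on each prefix, so it is not linear as written; a linear run would need a dissimilarity function that can be updated incrementally.

```python
def variance(values):
    """Population variance of a non-empty list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def sequential_exceptions(sequence, dissim=variance):
    """One pass over `sequence`; returns (exception elements, d_j) or ([], 0.0)."""
    sequence = list(sequence)
    if len(sequence) < 2:
        return [], 0.0

    # Steps 1-2: grow I_j one element at a time and record d_j = D(I_j) - D(I_{j-1}).
    diffs = []
    prev = dissim(sequence[:1])
    for j in range(1, len(sequence)):
        cur = dissim(sequence[: j + 1])
        diffs.append(cur - prev)
        prev = cur

    # Step 3: the element with the largest positive d_j is the candidate exception.
    d_j = max(diffs)
    if d_j <= 0:
        return [], 0.0
    j = diffs.index(d_j) + 1      # 0-based position of i_j in the sequence
    before = sequence[:j]         # I_{j-1}
    upto = sequence[: j + 1]      # I_j

    # Later elements join the exception set when d_k0 - d_k1 >= d_j.
    exceptions = [sequence[j]]
    for i_k in sequence[j + 1:]:
        d_k0 = dissim(before + [i_k]) - dissim(before)
        d_k1 = dissim(upto + [i_k]) - dissim(upto)
        if d_k0 - d_k1 >= d_j:
            exceptions.append(i_k)
    return exceptions, d_j


print(sequential_exceptions([4, 4, 4, 1, 4]))
print(sequential_exceptions([4, 4, 4]))
```

For the sequence 4, 4, 4, 1, 4 the jump in variance occurs when the 1 is added, so that element is reported; a constant sequence yields no exception.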
Dissimilarity function • Handles the comparison of character strings: it maintains a pattern, in the form of a regular expression, that matches all the character strings seen so far. • Starting with the pattern of the first string, wildcard characters are introduced as more strings need to be covered. • $D_s(I_j) = D_s(I_{j-1}) + j \cdot \frac{M_s(I_j) - M_s(I_{j-1})}{M_s(I_j)}$ • Auxiliary function: $M_s(I_j) = \frac{1}{3c - w + 2}$ • with c the total number of characters • and w the number of wildcards needed
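The two formulas can be tried out on a toy pattern model. The sketch below makes several assumptions that the slide does not state: all strings have the same length, a wildcard `*` stands for exactly one position, c counts every position of the pattern (wildcards included), and $D_s(I_1) = 0$. The helper names `generalize`, `m_s` and `dissimilarities` are illustrative only.

```python
from fractions import Fraction

WILDCARD = "*"


def generalize(pattern, string):
    """Put a wildcard at every position where `string` disagrees with `pattern`."""
    if len(pattern) != len(string):
        raise ValueError("this sketch only handles equal-length strings")
    return "".join(p if p == s else WILDCARD for p, s in zip(pattern, string))


def m_s(pattern):
    """Auxiliary function M_s = 1 / (3c - w + 2) for the current pattern."""
    c = len(pattern)                  # assumed: all pattern positions count as characters
    w = pattern.count(WILDCARD)       # number of wildcards needed so far
    return Fraction(1, 3 * c - w + 2)


def dissimilarities(strings):
    """Return D_s(I_1), D_s(I_2), ... for the strings in the given order."""
    pattern = strings[0]
    d = Fraction(0)                   # assumed starting value D_s(I_1) = 0
    values = [d]
    for j, s in enumerate(strings[1:], start=2):
        new_pattern = generalize(pattern, s)
        # D_s(I_j) = D_s(I_{j-1}) + j * (M_s(I_j) - M_s(I_{j-1})) / M_s(I_j)
        d = d + j * (m_s(new_pattern) - m_s(pattern)) / m_s(new_pattern)
        values.append(d)
        pattern = new_pattern
    return values


values = dissimilarities(["abc", "abd", "abe"])
print(values)
```

Under these assumptions the dissimilarity rises only when a new wildcard is introduced: the second string raises it to 2/11, and the third string leaves it unchanged because the pattern `ab*` already covers it.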
Why did it fail? • The dissimilarity function used could not catch the exception. • Once two values ‘..,n,..’ and ‘..,y,..’ have been seen, the pattern takes the form ‘..,*,..’. From then on, the pattern does not change when ‘?’ appears in the same column, because the wildcard already covers it. • A more powerful dissimilarity function is needed.
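The failure mode can be reproduced with a few made-up records. The sketch assumes a column-wise pattern in which a disagreeing field is replaced by `*`; the records and the helper `cover` are hypothetical and only mirror the ‘n’, ‘y’, ‘?’ situation described on the slide.

```python
def cover(pattern, record):
    """Generalize `pattern` so that it also matches `record`, field by field."""
    return [p if p == v else "*" for p, v in zip(pattern, record)]


# Hypothetical records; the middle column holds 'n', 'y', and the odd value '?'.
records = [["a", "n", "b"], ["a", "y", "b"], ["a", "?", "b"]]

pattern = records[0]
history = [pattern]
for record in records[1:]:
    pattern = cover(pattern, record)
    history.append(pattern)

for step in history:
    print(",".join(step))
```

The pattern after the second record is already `a,*,b`, so the third record causes no change, and a dissimilarity function driven only by pattern changes cannot register the ‘?’.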
Conclusion • We presented a linear algorithm for the sequential exception problem. • Experimental evaluation shows that the effectiveness of the algorithm depends on the dissimilarity function used. • It seems helpful to have some predefined dissimilarity functions that work well for particular datasets.
References: • A. Arning, R. Agrawal, P. Raghavan: "A Linear Method for Deviation Detection in Large Databases", Proc. of the 2nd Int'l Conference on Knowledge Discovery in Databases and Data Mining, Portland, Oregon, August 1996 • S. Sarawagi, R. Agrawal, N. Megiddo: "Discovery-driven Exploration of OLAP Data Cubes", Proc. of the 6th Int'l Conference on Extending Database Technology (EDBT), Valencia, Spain, March 1998 • R. Agrawal, R. Srikant: "Fast Algorithms for Mining Association Rules", Proc. of the VLDB Conference, 1994