RDF: A Density-based Outlier Detection Method Using Vertical Data Representation
Dongmei Ren, Baoying Wang, William Perrizo
North Dakota State University, U.S.A.
Introduction
• Related Work
  • Breunig et al. [6] proposed a density-based approach to mining outliers over datasets with different densities.
  • Papadimitriou et al. [7] introduced the local correlation integral (LOCI).
• Contributions in this paper
  • A relative density factor (RDF)
    • RDF expresses density information similar to LOF (local outlier factor) [6] and MDEF (multi-granularity deviation factor) [7].
    • RDF is easier to compute using vertical data.
  • An RDF-based outlier detection method
    • It efficiently prunes the data points that are deep in clusters.
    • It detects outliers within the remaining small subset of the data.
    • It uses a vertical data representation (Predicate-trees = P-trees).
Definitions
[Figure: a point x with its direct disk neighborhood (Direct DiskNbr) and its indirect disk neighborhoods (Indirect DiskNbr)]
• Definition 1: Disk Neighborhood, DiskNbr(x, r)
  • Given a point x and a radius r, the disk neighborhood of x is the set DiskNbr(x, r) = {x' ∈ X | d(x, x') ≤ r}, where d(x, x') is the distance between x and x'.
  • The indirect disk neighborhood of x consists of the disk neighborhoods centered at any point in DiskNbr(x, r).
• Definition 2: Density of DiskNbr(x, r), Density(x, r)
  • Density(x, r) = |DiskNbr(x, r)| / r^dim, where dim is the number of dimensions.
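A minimal Python sketch of the two definitions may help. It assumes Euclidean distance for d and reads the density as the neighbor count divided by r^dim; the function names `disk_nbr` and `density` are illustrative and do not come from the slides.

```python
import math

def disk_nbr(points, x, r):
    """Points whose Euclidean distance to x is at most r (x itself included if present)."""
    return [p for p in points if math.dist(p, x) <= r]

def density(points, x, r):
    """Assumed reading of Definition 2: neighbor count divided by r**dim."""
    dim = len(x)
    return len(disk_nbr(points, x, r)) / r ** dim

# Hypothetical 2-D sample data
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 1.0), (3.0, 3.0)]
nbrs = disk_nbr(points, (0.0, 0.0), 1.0)
dens = density(points, (0.0, 0.0), 2.0)
```

A horizontal scan like this touches every point for every query; the vertical P-tree representation described later is meant to replace that scan with bitwise operations.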
Definitions (Continued)
[Figure: point x, its direct neighbors within radius r (Direct DiskNbr), and its indirect neighbors within radius 2r (Indirect DiskNbr)]
• Definition 3: Relative Density Factor for a point x and radius r, RDF(x, r)
  • RDF(x, r) close to 1 means x is a deep-cluster point.
  • RDF(x, r) close to 0 means x is a borderline cluster point.
  • RDF(x, r) much larger than 1 means x is an outlier.
The Outlier Detection Method
Given a dataset X, a radius r, and a threshold ε:
• Let R be the RemainingSet of points yet to be considered (initially R = X).
• Let O be the set of outliers identified so far (initially empty).
• Pick any x ∈ R.
• Decide whether the points of DiskNbr(x, r) are outliers:
  • If RDF(x, r) > 1+ε (outliers): O := O ∪ DiskNbr(x, r) and R := R − DiskNbr(x, r).
  • If RDF(x, r) < 1/(1+ε) (borderline cluster points): R := R − DiskNbr(x, r).
  • If 1/(1+ε) < RDF(x, r) < 1+ε (deep cluster points): before updating R, keep doubling r while 1/(1+ε) < RDF(x, 2^n·r) < 1+ε holds, then keep incrementing r while 1/(1+ε) < RDF(x, 2^n·r + m) < 1+ε holds, and finally set R := R − DiskNbr(x, 2^n·r + m).
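The three-way decision on RDF can be sketched as a small Python function. This only illustrates the threshold test; how RDF itself is computed is left out, and the name `classify` and the returned labels are hypothetical.

```python
def classify(rdf, eps):
    """Map an RDF value to one of the three cases, using the band [1/(1+eps), 1+eps]."""
    upper = 1.0 + eps
    lower = 1.0 / upper
    if rdf > upper:
        return "outlier"      # neighborhood goes to O and is removed from R
    if rdf < lower:
        return "borderline"   # neighborhood is removed from R only
    return "deep"             # keep expanding r before removing anything

labels = [classify(v, 0.8) for v in (2.5, 0.3, 1.0)]
```

With ε = 0.8 the band is roughly [0.56, 1.8], so a larger ε sends more neighborhoods down the "deep" path, where they are pruned in bulk.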
The Outlier Detection Method
[Figure: Finding Outliers. Labels: start point x, point p, radii r, 2r, 4r, 6r, "Prune out non-outlier", "Outliers"]
Finding Outliers
[Figure: three density profiles around x]
• (a) 1/(1+ε) ≤ RDF ≤ 1+ε: deep within clusters
• (b) RDF < 1/(1+ε): hill-tops and boundaries
• (c) RDF > 1+ε: valleys (outliers)
Finding Outliers using Predicate-Trees
• P-tree based direct neighbors, PDN_{x,r}
  • PDN_{x,r} = P_{x' > x−r} AND P_{x' ≤ x+r}
  • |DiskNbr(x, r)| = rc(PDN_{x,r}), where rc is the root count
• P-tree based indirect neighbors, PIN_{x,r}
  • PIN_{x,r} = (OR over q ∈ DiskNbr(x, r) of PDN_{q,r}) AND PDN'_{x,r}
• Pruning is done by ANDing P-trees, based on the three distributions above:
  • (a), (c): PU = PU AND PDN'_{x,r} AND PIN'_{x,r}
  • (b): PU = PU AND PDN'_{x,r}
  • Here PU is a P-tree representing the unprocessed data points, and ' denotes the complement.
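The neighbor and pruning expressions can be tried out with plain, uncompressed bitmaps. The sketch below stands in a Python integer for each P-tree (bit i represents data point i), takes the direct neighborhood as the intersection of the two range predicates, and uses one-dimensional sample values; `mask`, `direct_nbrs`, and `root_count` are illustrative names, and real P-trees would add compression on top.

```python
def mask(values, pred):
    """Bitmap as a Python int: bit i is set when pred(values[i]) holds."""
    m = 0
    for i, v in enumerate(values):
        if pred(v):
            m |= 1 << i
    return m

def root_count(m):
    """Number of 1-bits, i.e. the number of points the bitmap selects."""
    return bin(m).count("1")

def direct_nbrs(values, center, r):
    """PDN: points with center - r < x' <= center + r."""
    return mask(values, lambda v: v > center - r) & mask(values, lambda v: v <= center + r)

values = [1, 2, 3, 4, 5, 9, 10, 20]   # hypothetical 1-D data
x, r = 3, 2
all_bits = (1 << len(values)) - 1

pdn = direct_nbrs(values, x, r)

# PIN: union of the neighborhoods of every direct neighbor, minus the direct neighbors
pin = 0
for i, q in enumerate(values):
    if pdn >> i & 1:
        pin |= direct_nbrs(values, q, r)
pin &= ~pdn & all_bits

# Pruning for cases (a) and (c): drop direct and indirect neighbors from PU
pu = all_bits & ~pdn & ~pin
```

Here the direct neighbors of 3 are 2, 3, 4, 5, the only indirect neighbor is 1, and the points 9, 10, 20 remain unprocessed.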
Pruning Non-outliers
[Figure: from start point x, the neighborhood grows from radius r to 2r and 4r; non-outliers are pruned out]
• The pruning is a neighborhood-expanding process. It calculates RDF between {DiskNbr(x, 2kr) − DiskNbr(x, kr)} and DiskNbr(x, kr), where k is an integer, and prunes based on the value of RDF:
  • 1/(1+ε) ≤ RDF ≤ 1+ε (density stays constant): continue expanding the neighborhood by doubling the radius.
  • RDF < 1/(1+ε) (significant decrease of density): stop expanding, prune DiskNbr(x, kr), and call the “Finding Outliers” process.
  • RDF > 1+ε (significant increase of density): stop expanding and call “Pruning Non-outliers”.
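The expansion loop can be sketched in Python as follows. The ratio used here (ring density over inner-disk density, each normalized by its own volume in units of r^dim) is an assumption made for illustration, not the formula from the paper; `ring_rdf`, `expand`, and `max_doublings` are hypothetical names, and x is assumed to be one of the data points so the inner disk is never empty.

```python
import math

def count_within(points, x, r):
    return sum(1 for p in points if math.dist(p, x) <= r)

def ring_rdf(points, x, r):
    """Assumed ratio: density of the ring between r and 2r over density of the disk of radius r."""
    dim = len(x)
    inner = count_within(points, x, r)
    outer = count_within(points, x, 2 * r)
    ring_density = (outer - inner) / ((2 * r) ** dim - r ** dim)
    disk_density = inner / r ** dim
    return ring_density / disk_density

def expand(points, x, r, eps, max_doublings=10):
    """Double r while the density stays roughly constant; report why expansion stopped."""
    lower, upper = 1 / (1 + eps), 1 + eps
    for _ in range(max_doublings):
        rdf = ring_rdf(points, x, r)
        if rdf < lower:
            return r, "density drop"   # prune DiskNbr(x, r), then look for outliers
        if rdf > upper:
            return r, "density rise"   # a denser region starts here
        r *= 2
    return r, "limit reached"

# Hypothetical 1-D cluster: 41 evenly spaced points from -20 to 20
points = [(float(i),) for i in range(-20, 21)]
result = expand(points, (0.0,), 5, 0.8)
```

On this evenly spaced cluster the radius doubles from 5 to 10 to 20 and stops there, because the ring between 20 and 40 is empty.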
Pruning Non-outliers Using P-Trees
• We define ξ-neighbors as the neighbors with ξ bits of dissimilarity from x, e.g. ξ = 1, 2, ..., 8 if x is an 8-bit value.
• For a point x, let X = (x_1, x_2, …, x_n), or in bits X = (x_{1,m}, …, x_{1,0}), (x_{2,m}, …, x_{2,0}), …, (x_{n,m}, …, x_{n,0}), where x_{i,j} is the jth bit value in the ith attribute. For the ith attribute, the ξ-neighbors of x are calculated by [formula not preserved in the source].
• The pruning is accomplished by PU = PU AND PX_ξ', where PX_ξ' is the complement of PX_ξ.
RDF-based Outlier Detection Process
• Algorithm: RDF-based Outlier Detection using P-Trees
• Input: dataset X, radius r, distribution parameter ε.
• Output: an outlier set Ols.
  // PU: unprocessed points represented by P-Trees
  // |PU|: number of points in PU
  // PO: outliers
  // Build up P-Trees for dataset X
  PU ← createP-Trees(X);
  i ← 1;
  WHILE |PU| > 0 DO
    x ← PU.first;  // pick an arbitrary point x
    PO ← FindOutliers(x, r, ε);
    i ← i+1
  ENDWHILE
Experimental Study
• NHL data set (1996)
• Compared with LOF and aLOCI
  • LOF: Local Outlier Factor method
  • aLOCI: approximate Local Correlation Integral method
• Run-time comparison
• Scalability comparison
  • Starting from a data size of 16,384, RDF outperforms both in scalability and speed.
References
[1] V. Barnett and T. Lewis, “Outliers in Statistical Data”, John Wiley & Sons.
[2] E. M. Knorr and R. T. Ng, “A Unified Notion of Outliers: Properties and Computation”, Proc. 3rd International Conference on Knowledge Discovery and Data Mining, 1997, pp. 219-222.
[3] E. M. Knorr and R. T. Ng, “Algorithms for Mining Distance-Based Outliers in Large Datasets”, Proc. Very Large Data Bases Conference, 1998, pp. 24-27.
[4] E. M. Knorr and R. T. Ng, “Finding Intensional Knowledge of Distance-Based Outliers”, Proc. Very Large Data Bases Conference, 1999, pp. 211-222.
[5] S. Ramaswamy, R. Rastogi, and K. Shim, “Efficient Algorithms for Mining Outliers from Large Datasets”, Proc. 2000 ACM SIGMOD International Conference on Management of Data, 2000, ISSN 0163-5808.
[6] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: Identifying Density-based Local Outliers”, Proc. ACM SIGMOD 2000 Int. Conf. on Management of Data, Dallas, TX, 2000.
[7] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos, “LOCI: Fast Outlier Detection Using the Local Correlation Integral”, 19th International Conference on Data Engineering, March 5-8, 2003, Bangalore, India.
[8] A. K. Jain, M. N. Murty, and P. J. Flynn, “Data Clustering: A Review”, ACM Computing Surveys, 31(3):264-323, 1999.
[9] A. Arning, R. Agrawal, and P. Raghavan, “A Linear Method for Deviation Detection in Large Databases”, Proc. 2nd International Conference on Knowledge Discovery and Data Mining, 1996, pp. 164-169.
[10] S. Sarawagi, R. Agrawal, and N. Megiddo, “Discovery-Driven Exploration of OLAP Data Cubes”, EDBT'98.
[11] Q. Ding, M. Khan, A. Roy, and W. Perrizo, “The P-tree Algebra”, Proc. ACM SAC, Symposium on Applied Computing, 2002.
[12] W. Perrizo, “Peano Count Tree Technology”, Technical Report NDSU-CSOR-TR-01-1, 2001.
[13] M. Khan, Q. Ding, and W. Perrizo, “k-Nearest Neighbor Classification on Spatial Data Streams Using P-Trees”, Proc. of PAKDD 2002, Springer-Verlag LNAI 2776, 2002.
[14] B. Wang, F. Pan, Y. Cui, and W. Perrizo, “Efficient Quantitative Frequent Pattern Mining Using Predicate Trees”, CAINE 2003.
[15] F. Pan, B. Wang, Y. Zhang, D. Ren, X. Hu, and W. Perrizo, “Efficient Density Clustering for Spatial Data”, PKDD 2003.
Determination of Parameters
• Determination of r
  • Breunig et al. [6] show that choosing MinPts = 10-30 works well in general (MinPts-neighborhood).
  • Choosing MinPts = 20, we take the average radius of the 20-neighborhood, r_average.
  • In our algorithm, r = r_average = 0.5.
• Determination of ε
  • Selection of ε is a tradeoff between accuracy and speed. The larger ε is, the faster the algorithm works; the smaller ε is, the more accurate the results are.
  • We chose ε = 0.8 experimentally and got the same result (same outliers) as Breunig's method, but much faster.
  • The results shown in the experimental part are based on ε = 0.8.
Vertical Data Structures History
• In the 1980s, vertical data structures were proposed for record-based workloads
  • Decomposition Storage Model (DSM, Copeland et al.)
  • Attribute Transposed File (ATF)
  • Bit Transposed File (BTF, Wang et al.); Viper
  • Band Sequential Format (BSQ) for remotely sensed imagery
  • DSM and BTF initiatives have disappeared. Why? (next slide)
• Vertical auxiliary and system structures
  • Domain & Request Vectors (DVA/ROLL/ROCC; Perrizo, Shi, et al.)
    • vertical system structures (query optimization & synchronization)
  • Bit Mapped Indexes (BMIs, very popular in data warehouses)
    • all indexes are really vertical auxiliary structures
    • BMIs use bit maps (positional approach to identifying records)
    • other indexes use RID lists (keyword or value approach)
• Current practice: structure data into horizontal records and process them vertically (scans).
• Predicate tree technology: vertically project each attribute, then vertically project each bit position of each attribute, then compress each bit slice into a basic Ptree. Processing is done by horizontally ANDing the basic Ptrees.

The example relation R(A1 A2 A3 A4), in base 10 and base 2:
Base 10    Base 2
2 7 6 1    010 111 110 001
3 7 6 0    011 111 110 000
2 6 5 1    010 110 101 001
2 7 5 7    010 111 101 111
5 2 1 4    101 010 001 100
2 2 1 5    010 010 001 101
7 0 1 4    111 000 001 100
7 0 1 4    111 000 001 100

Each bit column of the base-2 table is one vertical bit slice (R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43), and each slice Rij is compressed into a basic Ptree Pij. For example, R11 = 0 0 0 0 1 0 1 1.

Top-down construction of the 1-dimensional Ptree representation of R11, denoted P11, records the truth of the universal predicate “pure 1” in a tree, recursively on halves (1/2^1 subsets), until purity is achieved:
1. Whole is pure1? false → 0
2. Left half pure1? false → 0 (but it is pure (pure0), so this branch ends)
3. Right half pure1? false → 0
4. Left half of right half pure1? false → 0
5. Right half of right half pure1? true → 1 (and it is pure, so this branch ends)
6. Left half of left of right pure1? true → 1
7. Right half of left of right pure1? false → 0
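The top-down construction can be written as a short recursive Python function. The tree encoding used here is a choice made for this sketch, not the P-tree storage format: a pure-1 run is the integer 1, a pure-0 run is the integer 0, and a mixed node is a tuple (0, left, right). The slice length is assumed to be a power of two.

```python
def build_ptree(bits):
    """Top-down 1-D P-tree: 1 = pure-1 run, 0 = pure-0 run, (0, left, right) = mixed."""
    if all(bits):
        return 1
    if not any(bits):
        return 0                      # pure0, so the branch ends here
    mid = len(bits) // 2
    return (0, build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def root_count(node, size):
    """Number of 1-bits represented by a node that covers `size` positions."""
    if node == 1:
        return size
    if node == 0:
        return 0
    _, left, right = node
    return root_count(left, size // 2) + root_count(right, size // 2)

R11 = [0, 0, 0, 0, 1, 0, 1, 1]
P11 = build_ptree(R11)
ones = root_count(P11, len(R11))
```

For R11 the left half collapses to a single pure-0 leaf, which is where the compression comes from.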
• To count occurrences of (7, 0, 1, 4) in R, use the pattern 111 000 001 100, complementing the Ptree of every bit position where the pattern has a 0:
  P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43
• During the AND, a 0 at a node of any operand makes the entire corresponding branch of the result 0; a result node is 1 only where every uncomplemented operand has a 1 and every complemented operand has a 0.
• In the result, the 2^1-level has the only 1-bit, so the 1-count = 1 × 2^1 = 2.
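The same count can be reproduced with uncompressed bit slices. In this sketch each vertical slice is a Python integer (bit i stands for row i) and the AND runs over all twelve slices, complementing those where the pattern has a 0; `bit_slice` and `count_pattern` are illustrative names, and the rows are the base-2 rows of R.

```python
ROWS = ["010111110001", "011111110000", "010110101001", "010111101111",
        "101010001100", "010010001101", "111000001100", "111000001100"]

def bit_slice(col):
    """Vertical bit slice for one bit column; bit i of the result corresponds to row i."""
    m = 0
    for i, row in enumerate(ROWS):
        if row[col] == "1":
            m |= 1 << i
    return m

def count_pattern(pattern):
    """AND all slices, complementing those where the pattern bit is 0; return the 1-count."""
    all_rows = (1 << len(ROWS)) - 1
    result = all_rows
    for col, bit in enumerate(pattern):
        s = bit_slice(col)
        result &= s if bit == "1" else ~s & all_rows
    return bin(result).count("1")

occurrences = count_pattern("111000001100")   # the tuple (7, 0, 1, 4)
```

No row is ever reassembled: the answer comes straight from the bit count of the ANDed result.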
Top-down construction of basic P-trees is best for understanding, but bottom-up is much more efficient.
Bottom-up construction of the 1-dimensional P-tree P11 from R11 = 0 0 0 0 1 0 1 1 is done using an in-order tree traversal and the collapsing of pure siblings.
[Figure: step-by-step bottom-up construction of P11]
2-Dimensional P-trees: a natural choice for, e.g., image files.
For images, any ordering of pixels will work (raster, diagonalized, Peano, Hilbert, Jordan), but the space-filling “Peano” ordering has advantages for fast processing, yet compresses well in the presence of spatial continuity.
For an image bit-file (e.g., the high-order bit of the red band of an image file):
1111110011111000111111001111111011110000111100001111000001110000
which, in spatial raster order, is:
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
0 1 1 1 0 0 0 0
Top-down construction of its 2-dimensional P-tree is done by recording the truth of the universal predicate “pure 1” in a fanout=4 tree recursively on quarters, until purity is achieved.
[Figure: the resulting tree; the root is not pure-1 (0), and each branch ends as soon as its quadrant is pure]
Bottom-up construction of the 2-dimensional P-tree is done using an in-order traversal of a fanout=4 tree with log4(64) = 3 levels below the root (4 levels in all), collapsing pure siblings as it goes. From here on we will take 4 bit positions at a time, for efficiency.
[Figure: step-by-step bottom-up construction over the 8×8 image]
Some aspects of 2-D P-trees
[Figure: an 8×8 image and its 2-D P-tree, with the pixel at (7, 1) highlighted]
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1
• Tree levels (going down): 3, 2, 1, 0, with purity-factors of 4^3, 4^2, 4^1, 4^0 respectively.
• Fan-out = 2^dimension = 2^2 = 4.
• ROOT-COUNT = sum over levels of (level-sum × level-purity-factor). Root count = 7 × 4^0 + 4 × 4^1 + 2 × 4^2 = 55.
• Node ID (NID) = 2.2.3, which in binary is 10.10.11 and corresponds to the pixel (7, 1) = (111, 001).
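Both the root count and the node ID can be checked with a few lines of Python. The sketch assumes the quadrant numbering 0 = upper left, 1 = upper right, 2 = lower left, 3 = lower right, and (row, column) pixel coordinates counted from 0; `pure1_per_level` and `node_id` are illustrative names.

```python
IMAGE = ["11111100", "11111000", "11111100", "11111110",
         "11111111", "11111111", "11111111", "01111111"]

def pure1_per_level(img, top, left, size, counts):
    """Tally pure-1 quadrants per level; a level-k quadrant holds 4**k pixels."""
    cells = [img[r][c] for r in range(top, top + size) for c in range(left, left + size)]
    level = size.bit_length() - 1          # side length is 2**level
    if all(ch == "1" for ch in cells):
        counts[level] = counts.get(level, 0) + 1
    elif any(ch == "1" for ch in cells):   # mixed quadrant: recurse on its quarters
        half = size // 2
        for dr, dc in ((0, 0), (0, half), (half, 0), (half, half)):
            pure1_per_level(img, top + dr, left + dc, half, counts)
    return counts

def node_id(row, col, levels=3):
    """Interleave row and column bits, most significant first, into a quadrant path."""
    digits = []
    for k in reversed(range(levels)):
        digits.append(2 * (row >> k & 1) + (col >> k & 1))
    return ".".join(map(str, digits))

counts = pure1_per_level(IMAGE, 0, 0, 8, {})
root = sum(n * 4 ** level for level, n in counts.items())
nid = node_id(7, 1)
```

The per-level tallies come out as 2 at level 2, 4 at level 1, and 7 at level 0, matching the level sums in the root-count formula.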
Logical Operations on Ptrees (used to get counts of any pattern)
• Ptree dimension is a user parameter and can be chosen to fit the data: default = 1-D Ptrees (recursive halving); images → 2-D Ptrees (recursive quartering); 3-D solids → 3-D Ptrees (recursive eighth-ing). Or the dimension can be chosen based on other considerations (to optimize compression, increase processing speed, ...).
[Figure: Ptree 1, Ptree 2, their AND result, and their OR result]
• Ptree AND is faster than bit-by-bit AND, since any pure0 operand node means the result node is pure0 (e.g., only quadrant 2 needs to be loaded to AND Ptree 1 and Ptree 2). The more operands there are in the AND, the greater the benefit from this shortcut (more pure0 nodes).
• Using logical operators on the basic P-trees (predicate = universal predicate “purely 1-bits”), one can construct, for any domain: constant-P-trees (predicate: “value = const”), range-P-trees (predicate: “value ∈ range”), and interval-P-trees (predicate: “value ∈ interval”). In fact, there is a domain P-tree for every predicate defined on it.
• ANDing domain-predicate P-trees gives tuple-predicate P-trees, e.g., the rectangle-P-tree (predicate: “tuple ∈ rectangle”).
• The next slide shows some of these constructions.
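The pure0 shortcut can be illustrated on small 1-D trees. This sketch uses an ad hoc encoding (1 = pure-1 node, 0 = pure-0 node, tuple (0, left, right) = mixed node) that is an assumption of the example rather than the actual P-tree format; the two operands encode the bit slices 00001011 and 01001011.

```python
def ptree_and(a, b):
    """AND two trees of the same shape; a pure-0 operand ends the recursion immediately."""
    if a == 0 or b == 0:
        return 0                      # shortcut: the other operand is never inspected
    if a == 1:
        return b
    if b == 1:
        return a
    children = tuple(ptree_and(ca, cb) for ca, cb in zip(a[1:], b[1:]))
    if all(c == 0 for c in children):
        return 0                      # collapse pure-0 siblings
    return (0,) + children

P_a = (0, 0, (0, (0, 1, 0), 1))              # bits 0000 1011
P_b = (0, (0, (0, 0, 1), 0), (0, (0, 1, 0), 1))  # bits 0100 1011
result = ptree_and(P_a, P_b)
```

The whole left subtree of `P_b` is skipped because the matching node of `P_a` is pure-0, which is the effect the slide describes as loading only the quadrants that can still contribute.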
Basic, Value and Tuple Ptrees
• Basic Ptrees for a 7-column, 8-bit table, indexed by target attribute and target bit position, e.g., P11, P12, …, P18, P21, …, P28, …, P71, …, P78
• Value Ptrees (predicate: quad is purely target value in target attribute), built by ANDing basic Ptrees and indexed by target attribute and target value, e.g., P1,5 = P1,101 = P11 AND P12' AND P13
• Tuple Ptrees (predicate: quad is purely target tuple), built by ANDing value Ptrees, e.g., P(1, 2, 3) = P(001, 010, 011) = P1,001 AND P2,010 AND P3,011
• Rectangle Ptrees (predicate: quad is purely in target rectangle, a product of intervals), built by AND/OR, e.g., P([1,3], , [0,2]) = (P1,1 OR P1,2 OR P1,3) AND (P3,0 OR P3,1 OR P3,2)
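A value Ptree and a rectangle Ptree can be mimicked with uncompressed bitmaps. The sketch below uses 3-bit attributes instead of 8-bit ones, and takes the A1 and A3 columns of the example relation R from the base-2 table shown earlier; `basic`, `value_mask`, and `interval_mask` are illustrative names.

```python
A1 = [2, 3, 2, 2, 5, 2, 7, 7]   # attribute A1 of R, read from the base-2 rows
A3 = [6, 6, 5, 5, 1, 1, 1, 1]   # attribute A3 of R
FULL = (1 << len(A1)) - 1

def basic(values, bit, width=3):
    """Basic bit slice: bit 1 is the most significant of `width` bits."""
    m = 0
    for i, v in enumerate(values):
        if v >> (width - bit) & 1:
            m |= 1 << i
    return m

def value_mask(values, target, width=3):
    """AND of basic slices, complemented wherever the target value has a 0 bit."""
    m = FULL
    for bit in range(1, width + 1):
        s = basic(values, bit, width)
        m &= s if target >> (width - bit) & 1 else ~s & FULL
    return m

def interval_mask(values, low, high):
    """OR of the value masks for every value in [low, high]."""
    m = 0
    for v in range(low, high + 1):
        m |= value_mask(values, v)
    return m

p1_101 = value_mask(A1, 5)                                   # P11 AND P12' AND P13
rect = interval_mask(A1, 1, 3) & interval_mask(A3, 0, 2)     # P([1,3], , [0,2])
```

Only row 5 has A1 = 5, and only row 6, the tuple (2, 2, 1, 5), falls inside the rectangle.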
Horizontal Processing of Vertical Structures for Record-based Workloads
[Figure: relation R(A1 A2 A3 A4) and its vertical bit slices R11 … R43]
• For record-based workloads (where the result is a set of records), changing the horizontal record structure and then having to reconstruct it may introduce too much post-processing?
• For data mining workloads, the result is often a bit (Yes/No, True/False) or another unstructured result, where there is no reconstructive post-processing?
• But even for some standard SQL queries, vertical data may be faster (evaluating when this is true would be an excellent research project).
• For example, the SQL query
  SELECT COUNT(*) FROM purchases WHERE price ≥ $4,000.00 AND 1000 ≥ sales ≥ 500
• The answer is the root-count of the P-tree resulting from ANDing the price-interval-P-tree, P_price[4000, ∞), and the sales-interval-P-tree, P_sales[500, 1000].
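The query can be mimicked with two interval bitmaps and one AND. The `purchases` rows below are made-up sample data, and the bitmaps are built by scanning purely for illustration; in the P-tree setting the interval P-trees would be derived from the basic P-trees instead.

```python
# Hypothetical (price, sales) rows
purchases = [(4500.0, 700), (3999.0, 800), (4000.0, 500), (9000.0, 1200), (5200.0, 1000)]

def interval_mask(rows, column, low, high):
    """Bitmap with bit i set when low <= rows[i][column] <= high."""
    m = 0
    for i, row in enumerate(rows):
        if low <= row[column] <= high:
            m |= 1 << i
    return m

price_mask = interval_mask(purchases, 0, 4000.0, float("inf"))   # price in [4000, inf)
sales_mask = interval_mask(purchases, 1, 500, 1000)              # sales in [500, 1000]
count = bin(price_mask & sales_mask).count("1")                  # root count of the AND
```

The count is read off the ANDed bitmap, so no matching record has to be materialized.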
Architecture for the DataMIME™ System (DataMIME™ = data mining, NO NOISE) (PDMS = P-tree Data Mining System)
[Figure: system architecture. Labels: "Your data" and "Your data mining" reach the system over the Internet; Data Integration Language (DIL) and DII (Data Integration Interface); Ptree (Predicates) Query Language (PQL) and DMI (Data Mining Interface); Data Repository: lossless, compressed, distributed, vertically-structured database]
Generalized Raster and Peano Sorting: generalizes to any table with numeric attributes (not just images).
• Raster sorting: attributes 1st, bit position 2nd
• Peano sorting: bit position 1st, attributes 2nd
[Figure: an unsorted relation shown in decimal and binary, with its raster-sorted and Peano-sorted orders]
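One way to read the two orderings is as two different sort keys. In the Python sketch below, the raster key concatenates whole attributes (attributes first, bit position second), while the Peano key interleaves bits across attributes (bit position first, attributes second). The 3-bit width and the sample rows are assumptions for illustration.

```python
def raster_key(row, width=3):
    """Concatenate the attributes: all bits of attribute 1, then attribute 2, ..."""
    key = 0
    for v in row:
        key = key << width | v
    return key

def peano_key(row, width=3):
    """Interleave bits: the top bit of every attribute, then the next bit of every attribute, ..."""
    key = 0
    for k in reversed(range(width)):
        for v in row:
            key = key << 1 | (v >> k & 1)
    return key

rows = [(2, 7), (3, 0), (0, 4), (7, 1)]   # hypothetical two-attribute table
raster_sorted = sorted(rows, key=raster_key)
peano_sorted = sorted(rows, key=peano_key)
```

For the row (7, 1) the Peano key is 101011 in binary, the same interleaving that gives the path 10.10.11 when the pairs of bits are read as quadrant numbers.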
Generalized Peano Sorting: KNN speed improvement (using 5 UCI Machine Learning Repository data sets)
[Figure: bar chart of time in seconds (0 to 120) for the Unsorted, Generalized Raster, and Generalized Peano orderings on the crop, adult, spam, function, and mushroom data sets]