D.Y. Ye and Z.J. Chen College of Math. & Computer Science Fuzhou University, China

A New Algorithm for High Dimensional Outlier Detection Based on Constrained Particle Swarm Intelligence

Inadequacy of the proximity-based notion of outliers in high dimensional space
• Outlier detection has become an active topic in data mining.
• Most existing outlier detection algorithms use concepts of proximity to define and detect outliers.
• In high dimensional space the data are sparse, and the proximity-based notion of outliers loses its effectiveness: the sparsity of high dimensional data implies that every point is an almost equally good outlier from the perspective of proximity-based definitions (Beyer et al., 1998).

A new definition of outliers by Aggarwal and Yu
• An alternative is to examine the data in lower dimensional projections (or subspaces).
• A data point is considered an outlier if it is located in some abnormally low density subspace (Aggarwal and Yu, 2005).
• The density of a lower k-dimensional projection is measured by the "sparsity coefficient".

Sparsity coefficient in k-dim subspace
• Assume an n-dimensional data set with a total of N points, and assume the data are uniformly distributed.
• Each attribute is divided into Φ equi-depth ranges, so each range contains a fraction f = 1/Φ of the total points.
• The most negative S(D) value indicates the cube D that contains the fewest points relative to what the uniformity assumption predicts.
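The formula for S(D) did not survive on this slide. In Aggarwal and Yu's formulation, a k-dimensional cube is expected to hold N·f^k points under the uniformity assumption, with standard deviation sqrt(N·f^k·(1 − f^k)), and S(D) is the actual count standardized against these values. A minimal sketch follows; the function and parameter names are illustrative and not taken from the slides.

```python
import math

def sparsity_coefficient(count, total_points, phi, k):
    """Sparsity coefficient S(D) of a k-dimensional cube D.

    count        -- number of points actually inside D
    total_points -- N, the size of the data set
    phi          -- number of equi-depth ranges per attribute
    k            -- dimensionality of the projection
    """
    f_k = (1.0 / phi) ** k                      # fraction expected in D
    expected = total_points * f_k               # N * f^k
    std_dev = math.sqrt(total_points * f_k * (1.0 - f_k))
    return (count - expected) / std_dev
```

A cube holding exactly the expected number of points scores about 0; an empty cube scores well below 0, which is what the search looks for.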

Nature of the detection problem and approach used
• Outlier detection in this context boils down to finding the combinations of dimensions and ranges with the most abnormally sparse data (the most negative S(D) values).
• This is a very difficult combinatorial optimization problem: the number of combinations of dimensions grows exponentially with the dimensionality, so examining all possible subsets of dimensions is impractical.
• We use a PSO-based algorithm, instead of the genetic algorithms proposed by Aggarwal and Yu, to solve the problem.
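To give a feel for the size of the search space: a k-dimensional cube is fixed by choosing k of the n dimensions and one of the Φ ranges in each chosen dimension. The short sketch below counts these candidates; the function name is illustrative.

```python
from math import comb

def num_candidate_cubes(n_dims, k, phi):
    """Number of distinct k-dimensional cubes in an n_dims-dimensional grid
    with phi ranges per attribute: C(n_dims, k) * phi**k."""
    return comb(n_dims, k) * phi ** k

# 20 attributes, 4-dimensional projections, 10 ranges per attribute
print(num_candidate_cubes(20, 4, 10))  # 48450000
```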

Basic PSO algorithm
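The equations on this slide were lost in extraction. For orientation, here is a minimal sketch of the textbook PSO with an inertia weight, which may differ in detail from the form the authors used. Each particle's velocity is pulled toward its own best position and the swarm's best position. The parameter values (w, c1, c2) are common defaults, chosen here as an assumption.

```python
import random

def pso_minimize(objective, dim, bounds, n_particles=20, n_iters=100,
                 w=0.729, c1=1.49445, c2=1.49445, seed=0):
    """Minimize objective over a dim-dimensional box using basic PSO."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                  # personal best positions
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # global best so far

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # inertia + cognitive pull + social pull
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val
```

For example, `pso_minimize(lambda x: sum(v * v for v in x), dim=2, bounds=(-5.0, 5.0))` searches for the minimum of the sphere function.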

Effective use of PSO approach
• Encoding: a particle is a vector in the discretized n-dimensional space and is mapped to a pattern by pattern conversion.
• Fitness: the sparsity coefficient of the pattern.
• Constraint: dimensionality preservation, i.e., patterns are searched only within k-dimensional subspaces, so the particle updating strategy must be modified.

Pattern Conversion and Fitness
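The details of this slide were lost in extraction. The sketch below shows one way the two steps could fit together, under stated assumptions: a pattern is a list of length n whose entries are either 0 ("don't care") or a range index 1..Φ, and a continuous particle position is converted by rounding and clamping. Both the encoding and the rounding rule are hypothetical, not the authors' definition. The fitness is the sparsity coefficient of the cube that the pattern describes.

```python
import math

WILDCARD = 0  # assumed "don't care" marker; ranges are numbered 1..phi

def to_pattern(position, phi):
    """Hypothetical pattern conversion: round each coordinate, clamp to 0..phi."""
    return [min(max(int(round(x)), 0), phi) for x in position]

def matches(point, pattern):
    """True if the discretized point lies in the cube described by pattern."""
    return all(p == WILDCARD or p == x for x, p in zip(point, pattern))

def fitness(pattern, grid_data, phi):
    """Sparsity coefficient of the cube given by pattern.

    grid_data holds points already discretized to range indices 1..phi.
    """
    k = sum(1 for p in pattern if p != WILDCARD)
    if k == 0:
        raise ValueError("pattern must fix at least one dimension")
    n_total = len(grid_data)
    count = sum(1 for point in grid_data if matches(point, pattern))
    f_k = (1.0 / phi) ** k
    return (count - n_total * f_k) / math.sqrt(n_total * f_k * (1.0 - f_k))
```

Lower (more negative) fitness values mark sparser cubes, so the swarm minimizes this quantity.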

Modify particle's updating
• The search for abnormally sparse lower dimensional projections must be conducted in subspaces of a given dimensionality k, so the traditional particle updating strategy (3) needs to be modified.
• Three cases are considered.
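The three cases themselves are not given in this text. One plausible reading, offered purely as an assumption, is that an updated particle may fix more than k dimensions, fewer than k, or exactly k, and is repaired accordingly. The sketch below illustrates such a repair step on a pattern that uses 0 as the "don't care" marker; the function name and the random choice of which dimensions to drop or add are hypothetical.

```python
import random

WILDCARD = 0  # assumed "don't care" marker; ranges are numbered 1..phi

def repair_to_k(pattern, k, phi, rng=random):
    """Return a copy of pattern with exactly k fixed (non-wildcard) entries."""
    pattern = list(pattern)
    fixed = [i for i, p in enumerate(pattern) if p != WILDCARD]
    free = [i for i, p in enumerate(pattern) if p == WILDCARD]
    if len(fixed) > k:
        # Case 1: too many fixed dimensions, release the surplus at random.
        for i in rng.sample(fixed, len(fixed) - k):
            pattern[i] = WILDCARD
    elif len(fixed) < k:
        # Case 2: too few fixed dimensions, fix extra ones with random ranges.
        for i in rng.sample(free, k - len(fixed)):
            pattern[i] = rng.randint(1, phi)
    # Case 3: exactly k fixed dimensions, nothing to do.
    return pattern
```

A smarter repair could choose which dimensions to drop or add by their effect on the sparsity coefficient instead of at random.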

The proposed outlier detection algorithm

Experimental results

Some remarks
• Table 1 reports the running time (in seconds) and, under the "quality" column, the average sparsity coefficient of the best 20 projections.
• We could not reproduce the Gen0 results reported by Aggarwal and Yu. The reasons are differences in the mutation probabilities, in the population sizes, and in the implementation, and the fact that the result of a single run is not necessarily the best possible solution.
• The experimental results show that CPSO performs as well as, and sometimes better than, the baseline GA-based algorithm Gen0 in both computational efficiency and outlier detection quality.

Conclusion
• We discussed the applicability of particle swarm optimization to the problem of detecting outliers in high dimensional spaces, where outliers are defined as abnormally sparse lower dimensional patterns.
• With suitably modified particle updating and search strategies, PSO-based algorithms can detect such outliers effectively.

Thanks
