Term Paper: DETECTING OUTLIERS • Group – 5 • Santhosh Kumar Kotagiri • Allam Swetha Reddy • Manideep Krishna Bhimavarapu
Outliers • Data that deviates from the normal data is called an outlier. • An outlier can be: • Any data that is inconsistent • Rare data • Deviant object • Exceptional transactions • Outlier detection is a data mining technique that detects outliers in a given set of data
Papers Selected • Outlier Detection for Transaction Databases using Association Rules • Spatio-Temporal Outlier Detection in Large Databases • Detecting Spatio-temporal Outliers in Climate Dataset: A Method Study
Outlier Detection for Transaction Databases using Association Rules
Preliminaries • Support (sup) & min_sup • Frequent Itemset (FI) & maximal FI • Association Rule • Confidence & min_conf • High-confidence rules • Unobserved Rule • Associative Closure • Outlier Degree • Outlier Transaction
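Support and confidence, the first terms in the list above, can be illustrated with a minimal sketch. The transactions and item names are hypothetical and do not come from the paper; the definitions used are the standard ones (support as the fraction of transactions containing an itemset, confidence as sup(X ∪ Y) / sup(X)).

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(itemset: FrozenSet[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent: FrozenSet[str], consequent: FrozenSet[str],
               transactions: List[Transaction]) -> float:
    """Confidence of the rule antecedent -> consequent."""
    sup_x = support(antecedent, transactions)
    if sup_x == 0:
        return 0.0
    return support(antecedent | consequent, transactions) / sup_x


# Hypothetical toy database, for illustration only.
transactions = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "milk", "jam"},
    {"bread", "jam"},
    {"milk"},
)]

sup_bread = support(frozenset({"bread"}), transactions)
conf_bread_milk = confidence(frozenset({"bread"}), frozenset({"milk"}), transactions)
```

A rule counts as high-confidence when its confidence is at least min_conf, and an itemset as frequent when its support is at least min_sup.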
Basic Concept • Existing model: brute-force algorithm • The paper presents two techniques for faster detection of outliers: • Remove redundant association rules • Prune outlier-transaction candidates using maximal frequent itemsets
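The brute-force baseline can be sketched roughly as follows. The slides do not spell out the formulas, so this sketch assumes that the associative closure of a transaction t is obtained by repeatedly applying high-confidence rules until nothing new is added, and that the outlier degree is the fraction of the closure that is missing from t. The rule set, item names, and the `min_od` threshold are hypothetical.

```python
from typing import FrozenSet, List, Tuple

Itemset = FrozenSet[str]
Rule = Tuple[Itemset, Itemset]  # (antecedent, consequent)


def associative_closure(t: Itemset, rules: List[Rule]) -> Itemset:
    """Add consequents of applicable rules until a fixed point is reached."""
    closure = set(t)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in rules:
            if antecedent <= closure and not consequent <= closure:
                closure |= consequent
                changed = True
    return frozenset(closure)


def outlier_degree(t: Itemset, rules: List[Rule]) -> float:
    """Assumed definition: share of the closure not observed in t."""
    closure = associative_closure(t, rules)
    if not closure:
        return 0.0
    return len(closure - t) / len(closure)


def brute_force_outliers(transactions: List[Itemset], rules: List[Rule],
                         min_od: float) -> List[Itemset]:
    """Score every transaction; no pruning of rules or candidates."""
    return [t for t in transactions if outlier_degree(t, rules) >= min_od]


# Hypothetical high-confidence rules and transactions.
rules = [
    (frozenset({"bread"}), frozenset({"milk"})),
    (frozenset({"milk"}), frozenset({"butter"})),
]
transactions = [
    frozenset({"bread"}),
    frozenset({"bread", "milk", "butter"}),
]
outliers = brute_force_outliers(transactions, rules, min_od=0.5)
```

Every transaction is scored against every rule here, which is the cost the two speed-ups listed above are meant to reduce.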
Outlier Candidate Detection • Maximal Associative Closure • Upper Bound of Outlier Degrees • If the upper bound of a transaction t's outlier degree is less than the given minimum outlier degree, t can never be an outlier and is pruned from the candidates.
Pruning Redundant Rules • Nonredundant Rules • X → Y ∈ R is a nonredundant rule when there is no other association rule Z → W ∈ R and S → V ∈ R such that (i) X ∪ Y = Z ∪ W ∧ X ⊃ Z and (ii) X = S ∧ Y ⊂ V • Minimal rule set for R (Rmin) • |Rmin| ≤ |R|
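One possible reading of the nonredundancy condition is sketched below. It assumes that a rule X → Y is dropped as soon as either kind of dominating rule exists: a rule Z → W over the same items with a strictly smaller antecedent (condition i), or a rule S → V with the same antecedent and a strictly larger consequent (condition ii). This interpretation and the example rules are assumptions, not taken from the paper.

```python
from typing import FrozenSet, List, Tuple

Itemset = FrozenSet[str]
Rule = Tuple[Itemset, Itemset]  # (antecedent, consequent)


def is_nonredundant(rule: Rule, rules: List[Rule]) -> bool:
    """True if no other rule in `rules` dominates `rule` under (i) or (ii)."""
    x, y = rule
    for other in rules:
        if other == rule:
            continue
        z, w = other
        # (i) same overall itemset, other antecedent is a proper subset of X
        if (x | y) == (z | w) and z < x:
            return False
        # (ii) same antecedent, other consequent is a proper superset of Y
        if x == z and y < w:
            return False
    return True


def minimal_rule_set(rules: List[Rule]) -> List[Rule]:
    """Keep only the nonredundant rules (Rmin under the reading above)."""
    return [r for r in rules if is_nonredundant(r, rules)]


# Hypothetical rule set R.
a_bc = (frozenset({"a"}), frozenset({"b", "c"}))
a_b = (frozenset({"a"}), frozenset({"b"}))
ab_c = (frozenset({"a", "b"}), frozenset({"c"}))
R = [a_bc, a_b, ab_c]
R_min = minimal_rule_set(R)
```

In this toy rule set only a → bc survives, since it subsumes both a → b and ab → c, which is consistent with |Rmin| ≤ |R|.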
Experiments and Results • Datasets • Intrusion • Synthetic • Accuracy measures
Spatio-Temporal Outlier Detection in Large Databases
Drawbacks of other detection algorithms • What are ST-Outliers? • 3-step approach • Clustering • Checking spatial neighbors • Checking temporal neighbors
Clustering: • DBSCAN algorithm • Modifications made: • To support temporal aspects • To find outliers from clusters with different densities • Input parameters: • Eps1 • Eps2 • MinPts • Δε
Checking Spatial and Temporal Outliers • An object is considered an S-outlier if it lies outside the interval [L, U] • L = A − k0·σ and U = A + k0·σ • σ = √V • Dataset: • Wave height values of four seas: the Black Sea, the Marmara Sea, the Aegean Sea, and the east of the Mediterranean Sea.
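The interval test can be sketched as below, with the lower bound A − k0·σ and the upper bound A + k0·σ. The sketch assumes that A is the mean and V the population variance of the values the object is compared against (for example its spatial neighbors); the wave heights and the choice k0 = 2 are hypothetical.

```python
import math
from typing import List, Tuple


def outlier_interval(values: List[float], k0: float) -> Tuple[float, float]:
    """Return (L, U) = (A - k0*sigma, A + k0*sigma) for the given values."""
    n = len(values)
    a = sum(values) / n                          # A: mean
    v = sum((x - a) ** 2 for x in values) / n    # V: population variance (assumed)
    sigma = math.sqrt(v)                         # sigma = sqrt(V)
    return a - k0 * sigma, a + k0 * sigma


def is_s_outlier(value: float, neighbours: List[float], k0: float = 2.0) -> bool:
    """An object is an S-outlier if its value falls outside [L, U]."""
    low, up = outlier_interval(neighbours, k0)
    return value < low or value > up


# Hypothetical wave heights (metres) of an object's spatial neighbours.
neighbour_heights = [1.0, 1.2, 0.9, 1.1, 1.0]
low, up = outlier_interval(neighbour_heights, k0=2.0)
flag_high = is_s_outlier(2.5, neighbour_heights)
flag_normal = is_s_outlier(1.1, neighbour_heights)
```

A larger k0 widens the interval and flags fewer objects. The same test can be repeated on temporal neighbors for the third step of the approach.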
Detecting Spatio-temporal Outliers in Climate Dataset: A Method Study
To detect useful and meaningful outliers in a climate dataset, this paper introduces a formalized way to define outliers in spatio-temporal data. • The definition of an outlier needs to consider 3 aspects: • The basic element • The compare element • The compare function
Location outliers given a time period: the basic element • We focus on the spatial location in this dataset, so the basic element is simply a location (grid cell). • < i, Li, Ti > represents a location together with all observations of its temperature time series. • i is the ID of the element • Li stands for its location • Ti stands for its temperature time series.
The compare element • Find the difference between the location and its neighbors. • The compare element is defined as an aggregation function over the neighborhood.
The Compare Function • If f(i) ≥ θ, we classify location i as a location outlier in the given time period. • θ is a parameter that can be adjusted.
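The three parts of the definition can be put together in a small sketch. The slides leave the aggregation and the compare function open, so this example assumes the compare element is the pointwise mean of the neighbors' time series and that f(i) is the Euclidean distance between Ti and that mean. The temperature values, the neighborhood map, and θ = 8 are hypothetical.

```python
import math
from typing import Dict, List


def neighbourhood_mean(series: Dict[int, List[float]],
                       neighbour_ids: List[int]) -> List[float]:
    """Compare element (assumed): pointwise mean of the neighbours' series."""
    length = len(series[neighbour_ids[0]])
    return [sum(series[n][k] for n in neighbour_ids) / len(neighbour_ids)
            for k in range(length)]


def f(i: int, series: Dict[int, List[float]],
      neighbours: Dict[int, List[int]]) -> float:
    """Compare function (assumed): distance between Ti and the compare element."""
    reference = neighbourhood_mean(series, neighbours[i])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(series[i], reference)))


def location_outliers(series: Dict[int, List[float]],
                      neighbours: Dict[int, List[int]],
                      theta: float) -> List[int]:
    """Locations i with f(i) >= theta."""
    return [i for i in series if f(i, series, neighbours) >= theta]


# Hypothetical temperature series Ti for four locations on a line 0-1-2-3.
series = {
    0: [10.0, 11.0, 12.0],
    1: [10.0, 11.0, 12.0],
    2: [11.0, 12.0, 13.0],
    3: [20.0, 21.0, 22.0],
}
neighbours = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
found = location_outliers(series, neighbours, theta=8.0)
```

Location 2 also gets a fairly high score because its neighborhood contains the anomalous location 3, which shows how sensitive the result is to the choice of θ.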
Time period outliers given a region • Location outliers can easily be extended to region outliers by replacing the location in the basic element with a region. • In other cases, we look for the anomalous time period in a given area. • For instance, find the years with too much precipitation. • This problem can sometimes be solved using simple statistics. • However, in a certain region, a flood cannot be detected by considering only the average precipitation of the year.
As illustrated in Figure 2, although the average precipitation in 1994 and 2002 was larger than in 1998, a flood happened only in 1998.
Basic Element • This time we consider the time period as the basic element. • The basic element is < i, Ti, STDistri_i > • i is the ID number • Ti is the time period • STDistri_i is the spatio-temporal distribution of this time period.
Compare element • In general, we compare each time period with all other time periods, • since we generally don't just compare a certain year with the years immediately before and after it, but with most of the other years. • It is defined as an aggregation function over all the time periods.
The Compare Function • The dimension of STDistri_i is extremely large (687 locations × 12 months in our case), which makes it hard to handle. • A simple way to reduce it is to divide the area into several regions, such as 8 × 3, and to divide time into 4 seasons.
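The dimension reduction and the comparison against all other time periods can be sketched together. This is an illustration under several assumptions: seasons are taken as consecutive three-month blocks, each location is assigned to a region through a hypothetical `region_of` list, and a year's score is the Euclidean distance between its reduced distribution and the mean of all other years. The data are made up and much smaller than the 687 × 12 case.

```python
import math
from typing import Dict, List

Matrix = List[List[float]]


def reduce_distribution(monthly: Matrix, region_of: List[int],
                        n_regions: int) -> Matrix:
    """Average a locations x 12 months matrix into regions x 4 seasons."""
    sums = [[0.0] * 4 for _ in range(n_regions)]
    counts = [[0] * 4 for _ in range(n_regions)]
    for loc, months in enumerate(monthly):
        region = region_of[loc]
        for month, value in enumerate(months):
            season = month // 3  # assumed: consecutive 3-month blocks
            sums[region][season] += value
            counts[region][season] += 1
    return [[sums[r][s] / counts[r][s] if counts[r][s] else 0.0
             for s in range(4)] for r in range(n_regions)]


def flatten(matrix: Matrix) -> List[float]:
    return [v for row in matrix for v in row]


def period_scores(reduced: Dict[str, Matrix]) -> Dict[str, float]:
    """Distance of each period from the mean of all other periods."""
    scores = {}
    for year, dist in reduced.items():
        others = [flatten(d) for y, d in reduced.items() if y != year]
        mean_other = [sum(col) / len(others) for col in zip(*others)]
        scores[year] = math.sqrt(
            sum((a - b) ** 2 for a, b in zip(flatten(dist), mean_other)))
    return scores


# Hypothetical data: 2 locations, one region each, 3 years.
region_of = [0, 1]
normal = [[1.0] * 12, [2.0] * 12]
spike = [[1.0] * 6 + [10.0] * 3 + [1.0] * 3, [2.0] * 12]
raw = {"y1": normal, "y2": normal, "y3": spike}
reduced = {y: reduce_distribution(m, region_of, 2) for y, m in raw.items()}
scores = period_scores(reduced)
most_anomalous = max(scores, key=scores.get)
```

Keeping the seasonal and regional breakdown lets a concentrated spike stand out, whereas a single yearly average over the whole area would dilute it.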
Conclusion • Outliers in transactional data, unlike in numerical data, are detected with a model based on the outlier degree. • Outliers in spatio-temporal data are detected using a clustering method. • Outliers in climate data are defined using a "basic element", which is then extended to spatio-temporal data.