An extended K-means++ with mixed attributes for outlier detection

An extended K-means++ with mixed attributes for outlier detection Presented by Miss SarunyaKanjanawattana

Examination Committee Dr. SumantaGuha (Chairperson)Prof. Dr. Phan Minh Dung (Committee)Dr. Matthew N. Dailey (Committee)

:: Agenda :: • Background • Literaturereview • Methodologies

Background • Problem statement • Objective of the study • Scope and Limitation • Contribution

« Background » • Data mining : • huge volume of data and information are collected in databases. • These tremendous data has far exceeded the human ability to analyze extract valuable information for the purpose of decision-making support. “data mining helps to transform the collected data into valuable information”

« Background » • Outlier detection : • Outlier cluster is a popular methodology that uses to detect fraud in data sets. • identify data points as “normal” or “outlier” Outlier data point => fraudulent sample

« Background » • Fraud detection • Health insurance fraud detection is a beneficial and challenging task. • The detection helps to observe the fraud and abuse pattern. Example : Institutional or health professional led health insurance fraud include the falsification of information on forms.

« Background » • The National Health Security office • is an autonomous state agency, officially founded in 2002 , stated by the National Health Security Act • The vital duties of NHSO • are to manage the health security fund and allocate the subsidiary budget to 236 clinics and 963 hospitals to promote and develop a good health care system for all Thai people.

« Problem statement » • Fraud and abuse • led to significant additional expense in the health care system. • A case study : NHSO database • Occurred with the large number of data . • Many transactions emerge constantly daily hour. • These become huge and hard to use human inspections for detecting fraud. • Outlier clustering approach : • Need fast and more accuracy algorithm to monitor outliers

« Objective of the study » • To provide a process of extracting the fraud instances and uncover unusual activities in NHSO. • To develop the K-means++, that is another variation of standard k-means algorithm, with mixed attributes of dataset for detecting outliers. • To answer what is the optimal “”.

« Scope and Limitation » • The data source only involved in 4 provinces in Thailand • Nakhonratchasima, Chaiyaphom, Burirum and Surin. • The transaction comes from a group of High-costs diseases • There is high chance to occur fraudulent behaviors larger than other groups of diseases.

« Contribution » • The proposed study provides the methodology to detect fraud and abuse in NSHO, Thailand. It will present some results of outlier cluster. • This study proposes a novel algorithm based on extended K-means++ to work with mixed attributes and detect outliers.

Literaturereview • Fraud detection • The process of data mining

« Literaturereview » Frauddetection Yi et al. 2006 : • understand and detect suspicious health care frauds from large databases using clustering technique • Use two clusters to compare : SAS EM and CLUTO • As the experimental results indicate that CLUTO is faster than SAS EM while SAS EM provides more useful clusters than CLUTO.

« Literaturereview » Frauddetection Liou, Tang, and Chen 2008 : • Applies data mining techniques to detect fraudulent or abusive reporting by healthcare providers using their invoices for diabetic outpatient services. • Logistic regression, neural network, classification trees • The classification tree model performs the best with an overall correct identification rate of 99%.

« Literaturereview » The process of data mining • Data preprocessing • The data that obtain from the real databases are often incomplete, noisy and inconsistent. • The target of data preprocessing is to clean a rough data set for improve accuracy. • The process of data preprocessing : • data cleaning, data transformation and integration and data reduction.

« Literaturereview » The process of data mining • Data preprocessing Wang and Chiang 2009 : • presents an efficient data preprocessing procedure for the support of vector clustering (SVC) to reduce the size of a training dataset.

« Literaturereview » The process of data mining • K-means algorithm

« Literaturereview » The process of data mining • K-means algorithm • The benefits of K-means • fast and simplicity. Its algorithm is really easy to understand and implementation. • The shortcoming of K-means • number of clusters dependency • degeneracy

« Literaturereview » The process of data mining • K-means++ algorithm

« Literaturereview » The process of data mining • K-means++ algorithm • Arthur and Vassilvitskii 2007 • Fast and more efficient • K-means : O(i * n * k) • K-means++ : O(log k) • not pretty good to work with a dataset which combines categorical and numerical attribute

D(x) = • the shortest distance from • a data point x to the • closest center we have • already chosen. « Literaturereview » The process of data mining • K-means++ algorithm • Example (k=3)

« Literaturereview » The process of data mining • K-means++ algorithm • Example (k=3)

« Literaturereview » The process of data mining • K-means++ algorithm • Example D2=12+72 D2=82+42 D2=72+32 D2=22+12 (k=3)

« Literaturereview » The process of data mining • K-means++ algorithm • Example D2=12+72 D2=12+12 D2=22+12 (k=3)

« Literaturereview » The process of data mining • K-means++ algorithm • Example (k=3)

« Literaturereview » The process of data mining • Y-means algorithm

« Literaturereview » The process of data mining • Y-means algorithm • Guan, Ghorbani, and Belacel 2003 • based on the K-means algorithm • It overcomes two shortcomings of K-means: • number of clusters dependency and degeneracy

« Literaturereview » The process of data mining • Koufakou, Ortiz, Georgiopoulos, Anagnostopoulos, and Reynolds 2007 • Introduced a strategy named “Attribute Value Frequency (AVF)”. • That is a fast and scalable outlier detection strategy for categorical data.

Methodologies • Methodology • Data collection • Data evaluation • Tasks and timeline

« Methodologies » • It can divide into 3 phases. • Phases 1: Data preprocessing • Convert categorical data to numeric data • Phases 2: Clustering • Followed by K-means++ algorithm • Phases 3: Outlier detection • Local and global outlier • Determine what cluster is outlier

« Methodologies » • Overview of the extended K-means++ algorithm

« Methodologies » • Phases 1: Data preprocessing

« Methodologies » • Phases 1: Data preprocessing • Normalizes the numeric attributes’ value into the range of 0 and 1

« Methodologies » • Phases 1: Data preprocessing 2) A categorical attribute A with most number of items is selected to be the base attribute. • 2 items: A,B • 3 items: C,D,E

« Methodologies » • Phases 1: Data preprocessing 3) Counting the frequency of co-occurrence, represent by Matrix M • A B C D E • A • B • C • D • E • Matrix M = • 4 0 2 2 0 • 0 311 1 • 0 0 3 0 0 • 0 003 0 • 0 0 0 0 1

« Methodologies » • Phases 1: Data preprocessing 4) Calculate similarity between items represent by equation D • A B C D E • A • B • C • D • E • Matrix M = • 4 0 2 2 0 • 0 311 1 • 0 0 3 0 0 • 0 003 0 • 0 0 0 0 1

« Methodologies » • Phases 1: Data preprocessing 5) Find group variance of numerical value by following equation: • SSw(Y) = 0.04 • SSw(Z) = 1.294 • << Select Y

« Methodologies » • Phases 1: Data preprocessing 6) Every base item can be quantified by assigning mean of the mapping value in the selected numeric attribute.

« Methodologies » • Phases 1: Data preprocessing 7) All other categorical items can be quantified by applying the function: F(A) = 0.4 * 0.2 + 0.4 * 0.8 + 0 * 0.6 = 0.4 F(B) = 0.2 * 0.2 + 0.2 * 0.8 + 0.33 * 0.6 = 0.398 • *All data in data set are numeric now.

« Methodologies » • Phases 2: Clustering • Probability : • D(x) : denote the shortest distance from a data point x to the closest center we have already chosen.

« Methodologies » • Phases 2: Clustering • Define initial values: •  = Cluster width • for detect local outlier • Followed by previous study = 2.32. •  = Cluster population ratio • for detect global outlier • My assumption :  = 0.9 • Detection rate and false negative rate should be get the highest values with optimal “”.

« Methodologies » • Phases 3: Outlier detection

« Methodologies » • Phases 3: Outlier detection • There are 2 stages • Local outlier detection : •  = cluster width

« Methodologies » • Phases 3: Outlier detection • There are 2 stages • Global outlier detection •  = population ratio

« Data collection » • A real dataset provided by National Health Security office of Thailand was applied to demonstrate the effectiveness of the proposed method. • Primary data will gather information from database especially statement information that contains all financial transactions, Thailand.

« Data collection » • Overview of data set

An extended K-means++ with mixed attributes for outlier detection

An extended K-means++ with mixed attributes for outlier detection

Presentation Transcript

K-means algorithm

k -means Clustering

K-means Clustering

K-means Clustering

K-Means

Scalable K-Means++

K-means algorithm

K-means and Fuzzy K-means

K-MEANS CLUSTERING

K-Means Clustering

K-means Clustering

K-means algorithm

Cartesian k-means

K-means Clustering

K-means Clustering

Clustering: K-Means

K-means algorithm

K-means

K-means properties