Combining Semi-Supervised Clustering with Social Network Analysis: A Case Study on Fraud Detection. Mining Data Semantics (MDS'2011) Workshop in conjunction with SIGKDD2011, August 21-24, 2011, San Diego, CA, USA. João Botelho, joao.botelho@ist.utl.pt
Combining Semi-Supervised Clustering with Social Network Analysis: A Case Study on Fraud Detection. Mining Data Semantics (MDS'2011) Workshop in conjunction with SIGKDD2011, August 21-24, 2011, San Diego, CA, USA. João Botelho, joao.botelho@ist.utl.pt | Cláudia Antunes, claudia.antunes@ist.utl.pt
CONTENTS • Motivation and problem statement • S2C+SNA methodology • Case study • Conclusions
FRAUD DETECTION IN TAX PAYMENTS • Fraud in tax payments • Improper tax payments due to fraud, waste and abuse; • Involves millions of possible fraud targets; • Effective tools are needed to prevent fraud, or at least to identify it in time.
S2C+SNA METHODOLOGY
DATA PREPARATION > DATASET This methodology assumes the existence of two datasets: - A dataset with labeled and unlabeled instances; - Social network data (describing the interactions between these instances).
DATA PREPARATION > SNOWBALL SAMPLING • To discard irrelevant components of the social network and save computational resources, the target population can be reached using snowball sampling.
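Snowball sampling can be sketched as a bounded breadth-first expansion from a set of seed members: each wave adds the not-yet-sampled neighbors of the previous wave. The following is a minimal illustration, not the authors' implementation; the function name `snowball_sample`, the `max_waves` parameter and the toy network are assumptions introduced here.

```python
from collections import deque


def snowball_sample(adjacency, seeds, max_waves):
    """Return the nodes reachable from `seeds` within `max_waves` hops."""
    sampled = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        node, wave = frontier.popleft()
        if wave == max_waves:
            continue  # do not expand beyond the last wave
        for neighbor in adjacency.get(node, ()):
            if neighbor not in sampled:
                sampled.add(neighbor)
                frontier.append((neighbor, wave + 1))
    return sampled


# Hypothetical toy network: each key interacts with the listed entities.
network = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a"],
    "d": ["b", "e"],
    "e": ["d"],
    "x": ["y"],  # disconnected component, never reached from "a"
    "y": ["x"],
}
sample = snowball_sample(network, seeds=["a"], max_waves=2)
```

Here the component containing "x" and "y" is discarded because no seed reaches it, and "e" is left out because it lies three hops from the seed.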
DATA PREPARATION > BAD RANK • Derived from PageRank and HITS • Used by Google to detect web spam • BadRank lets us estimate the risk associated with a member by analyzing its links to other “bad” members.
DATA PREPARATION > BAD RANK • Applying BadRank yields a new attribute that enriches the entity description used in the classification process.
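The slides do not state the BadRank formula. A commonly cited formulation, assumed here, is BR(A) = E(A)(1 - d) + d * sum_i BR(T_i) / C(T_i), where T_i are the members that A links to, C(T_i) is the number of inbound links of T_i, d is a damping factor, and E(A) is a starting "badness" value. The sketch below sets E(A) to 1 for known bad members and 0 otherwise, which is also an assumption; `bad_rank`, `out_links` and `known_bad` are illustrative names.

```python
def bad_rank(out_links, known_bad, damping=0.85, iterations=50):
    """Iterate the assumed BadRank recurrence and return a score per node."""
    nodes = set(out_links)
    for targets in out_links.values():
        nodes.update(targets)

    # C(T): number of inbound links of each node.
    in_degree = {node: 0 for node in nodes}
    for targets in out_links.values():
        for target in targets:
            in_degree[target] += 1

    # E(A): 1 for members already known to be bad, 0 otherwise (assumption).
    seed = {node: (1.0 if node in known_bad else 0.0) for node in nodes}
    score = dict(seed)
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            # Badness flows backwards: a node inherits it from what it links to.
            inherited = sum(
                score[target] / in_degree[target]
                for target in out_links.get(node, ())
            )
            updated[node] = (1 - damping) * seed[node] + damping * inherited
        score = updated
    return score


# Hypothetical toy network: "a" links to a known bad member, "b" links to "a".
links = {"a": ["bad"], "b": ["a"], "c": [], "bad": []}
scores = bad_rank(links, known_bad={"bad"})
```

The resulting score per member is the kind of attribute that can be appended to each instance before clustering: members closer to known bad members receive higher values, and members with no path to one receive zero.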
MODELING > SEMI-SUPERVISED CLUSTERING • The semi-supervised algorithms studied in this paper are the most common ones: modifications of the (unsupervised) K-Means algorithm that incorporate domain knowledge. • Typically, this knowledge is incorporated: • when the initial centroids are chosen (seeding): Seeded-KMeans, Constrained-KMeans • as constraints that must be satisfied when grouping similar objects (constrained algorithms): PCK-Means, MPCK-Means
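The two seeding variants can be sketched together, since they differ in one step. Both initialize each centroid as the mean of the labeled seeds of that cluster; Seeded-KMeans then lets seeds be reassigned like any other point, while Constrained-KMeans keeps every seed in its labeled cluster. This is a minimal pure-Python illustration that assumes at least one seed per cluster; the function and parameter names are introduced here, and PCK-Means and MPCK-Means are not covered.

```python
def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(component) / len(points) for component in zip(*points)]


def sq_dist(p, q):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def seeded_kmeans(points, seeds, k, constrained=False, iterations=20):
    """points: list of vectors; seeds: {point index: cluster id in range(k)}."""
    # Seeding: each initial centroid is the mean of that cluster's seeds.
    centroids = [
        mean([points[i] for i, cluster in seeds.items() if cluster == j])
        for j in range(k)
    ]
    assignment = []
    for _ in range(iterations):
        assignment = []
        for i, point in enumerate(points):
            if constrained and i in seeds:
                # Constrained variant: seed labels are never changed.
                assignment.append(seeds[i])
            else:
                assignment.append(
                    min(range(k), key=lambda j: sq_dist(point, centroids[j]))
                )
        for j in range(k):
            members = [p for p, c in zip(points, assignment) if c == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = mean(members)
    return assignment, centroids


# Hypothetical toy data: two well-separated groups; point 2 carries a label
# that disagrees with its position, to show how the variants differ.
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels = {0: 0, 3: 1, 2: 1}
seeded, _ = seeded_kmeans(data, labels, k=2)
constrained, _ = seeded_kmeans(data, labels, k=2, constrained=True)
```

In this toy run the seeded variant moves point 2 back to the cluster it sits in, whereas the constrained variant keeps the supplied label, which is the behavior to prefer when the labels are trusted.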
CASE STUDY • Dataset: fraud in tax payments; • Because the experiments in this work focus only on detecting fraud with small fractions of labeled data, a balanced dataset was extracted, with equal numbers of fraud and non-fraud instances: • 3000 instances; • 50% fraud; 50% non-fraud.
EXPERIMENTS SETUP • All experiments were conducted by randomly selecting 10 different sets of pre-labeled instances for each algorithm and for each fraction of incorporated labeled instances. • The results presented next report the best, worst and average accuracy obtained over these sets.
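The protocol above can be sketched as a small evaluation loop: for each labeled fraction, draw 10 random sets of pre-labeled instances, score the algorithm on each, and keep the best, worst and average. Everything named here (`evaluate`, `run_and_score`, the placeholder scorer) is hypothetical; the placeholder returns no real accuracy and only stands in for "run one clustering algorithm and measure its accuracy".

```python
import random
from statistics import mean


def evaluate(run_and_score, labels, fractions, repetitions=10, seed=0):
    """Summarize scores over random pre-labeled sets for each fraction."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    summary = {}
    for fraction in fractions:
        n_labeled = int(round(fraction * len(labels)))
        scores = []
        for _ in range(repetitions):
            chosen = rng.sample(indices, n_labeled)
            prelabeled = {i: labels[i] for i in chosen}
            scores.append(run_and_score(prelabeled))
        summary[fraction] = {
            "best": max(scores),
            "worst": min(scores),
            "average": mean(scores),
        }
    return summary


# Placeholder data and scorer, for illustration only: the "score" is just the
# share of instances that were pre-labeled, not a measured accuracy.
toy_labels = [0, 1] * 10


def placeholder_scorer(prelabeled):
    return len(prelabeled) / len(toy_labels)


summary = evaluate(placeholder_scorer, toy_labels, fractions=[0.1, 0.25])
```

Passing the same `seed` to every algorithm would give each one the same 10 pre-labeled sets, which makes the comparison fairer; whether the original experiments did so is not stated in the slides.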
CONCLUSIONS • With only a small fraction of labeled instances, all the semi-supervised algorithms improve significantly on unsupervised clustering (K-Means). • Constrained-KMeans performs best among the semi-supervised algorithms. • Semi-supervised clustering performs better when the data is enriched with social network analysis. • With BadRank, the results show significant improvements in all experiments beyond 15% of labeled instances.
CONCLUSIONS • This methodology can also be applied to other areas: • where supervised information is very difficult to obtain • where social network analysis can provide important information about human entities, revealing patterns, linkages and connections that could not be discovered using only static data (transitional data). • Churn detection is a good candidate for this methodology.
THE END. QUESTIONS?