Personalized Presentation: Pick Your Choice • My PhD Theme • Research that contributes to making data mining widely applicable • Your choice of data mining topics: • To Protect and Serve: Automated Construction of Classifiers for Scene Classification and Porn Filtering • Viral Mining: Benchmarking Artificial Immune Systems for Classification Tasks • Keep it Simple and Be Self Confident: a Bias Variance Analysis of the CoIL Challenge 2000 Data Mining Competition
Viral Mining: Benchmarking Artificial Immune Systems for Classification Tasks Ling Jun Meng & Peter van der Putten
Problem Statement • Artificial Immune Systems – the newest biologically inspired computing paradigm • Question • ‘Old wine in new bags’? Added value for real world data mining? • Approach: • Benchmark AIRS for classification by end user data mining (real world conditions) • Characterize AIRS relative to other algorithms and data sets
Background • ‘The Second Brain’ • Immune response • Primary: in response to an intruder • Secondary: remember the intrusion • Immune System Entities • Antigens: intruders • B-cells, antibodies, T-cells: cells / proteins produced in response • Memory cells: memorize the intrusion
Artificial Immune Recognition System (AIRS) • 1. Present a training instance • 2. Generate a candidate memory cell • 3. Add it to the memory cell pool or replace an existing memory cell • 4. Repeat until all the training instances are represented • 5. Classification • Figure labels: Memory cell, ARB
Overview of the AIRS algorithm • Seed the memory cell pool (MC) • For each training instance (ag_i), do: • If MC is empty, add ag_i to MC • Select the memory cell (mc) in MC of the same class that has the highest affinity to ag_i • Clone mc in proportion to its affinity to ag_i • Mutate each clone and add it to the ARB pool (AB) • Allocate resources to AB and remove the weak cells (limited resource mechanism) • Calculate the average stimulation of AB to ag_i to check for termination • If the termination condition is not met, clone and mutate a random selection of ARB cells and check again. Repeat until termination. • Select the ARB cell with the highest affinity as mc_candidate. If mc_candidate has higher affinity than mc, add mc_candidate to MC. If mc and mc_candidate are sufficiently similar, remove mc from MC. • Perform kNN classification using MC.
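The steps above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the AIRS implementation benchmarked in the slides: the affinity function (inverse Euclidean distance), the parameter names and defaults, and the single round of cloning that stands in for the resource competition and stimulation loop are all assumptions made for the example.

```python
import numpy as np

def affinity(a, b):
    # Assumption: affinity is an inverse function of Euclidean distance.
    return 1.0 / (1.0 + np.linalg.norm(a - b))

def train_airs_sketch(X, y, n_clones=10, mutation_scale=0.1,
                      replace_threshold=0.95, seed=0):
    """Build a memory cell pool (MC) from training data.

    Simplification: one round of cloning and mutation per training
    instance replaces the ARB resource allocation and the
    stimulation-based termination loop.
    """
    rng = np.random.default_rng(seed)
    mc_X, mc_y = [], []
    for ag, label in zip(X, y):
        same = [i for i, c in enumerate(mc_y) if c == label]
        if not same:
            # No memory cell of this class yet: seed the pool with the antigen.
            mc_X.append(ag.copy())
            mc_y.append(label)
            continue
        # Best matching memory cell of the same class.
        best = max(same, key=lambda i: affinity(mc_X[i], ag))
        mc = mc_X[best]
        mc_aff = affinity(mc, ag)
        # Clone in proportion to affinity, then mutate each clone.
        k = max(1, int(round(n_clones * mc_aff)))
        clones = mc + rng.normal(0.0, mutation_scale, size=(k, ag.shape[0]))
        candidate = max(clones, key=lambda c: affinity(c, ag))
        if affinity(candidate, ag) > mc_aff:
            mc_X.append(candidate)
            mc_y.append(label)
            # Drop the old cell if the candidate is nearly identical to it.
            if affinity(candidate, mc) > replace_threshold:
                del mc_X[best]
                del mc_y[best]
    return np.array(mc_X), np.array(mc_y)

def predict_knn(mc_X, mc_y, X, k=1):
    # Majority vote among the k nearest memory cells.
    preds = []
    for x in X:
        dist = np.linalg.norm(mc_X - x, axis=1)
        nearest = mc_y[np.argsort(dist)[:k]]
        values, counts = np.unique(nearest, return_counts=True)
        preds.append(values[np.argmax(counts)])
    return np.array(preds)
```

Because classification only consults the memory cell pool, the pool acts as a compressed stand-in for the training set, which is one reason to expect behaviour close to an instance-based learner.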
Algorithm similarity methods • Correlation measure • Prediction better or worse than average • Correlation on standardized accuracy
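One plausible reading of these similarity measures is sketched below. The slide does not spell out the computation, so the choices here are assumptions: accuracies are standardized per data set across algorithms (z-scores), algorithms are then compared by Pearson correlation over data sets, and "better or worse than average" is taken as a comparison against the per-data-set mean. The function names and the accuracy matrix are hypothetical.

```python
import numpy as np

def standardize_per_dataset(acc):
    # acc: rows are data sets, columns are algorithms.
    mean = acc.mean(axis=1, keepdims=True)
    std = acc.std(axis=1, keepdims=True)
    std = np.where(std == 0, 1.0, std)  # guard against identical scores
    return (acc - mean) / std

def algorithm_similarity(acc):
    # Pearson correlation between algorithms (columns) on standardized accuracy.
    z = standardize_per_dataset(acc)
    return np.corrcoef(z, rowvar=False)

def better_than_average(acc):
    # True where an algorithm beats the mean accuracy on that data set.
    return acc > acc.mean(axis=1, keepdims=True)

# Hypothetical accuracies, for illustration only (not results from the study).
example_acc = np.array([
    [0.70, 0.70, 0.90],
    [0.80, 0.80, 0.60],
    [0.60, 0.60, 0.65],
    [0.90, 0.90, 0.50],
])
similarity = algorithm_similarity(example_acc)
```

Standardizing first removes the effect of some data sets simply being easier than others, so the correlation reflects where algorithms agree on relative strengths rather than on raw difficulty.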
The influence of data set size • Experimental design • Ten artificial data sets derived from Diabetes: Diabetes-10%, Diabetes-20%, up to Diabetes-90% and Diabetes-100% • Fit a logarithmic trend curve to make the pattern easier to see
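A minimal sketch of this design, under stated assumptions: the subsets are drawn as nested random samples (the slides do not say how they were sampled), and the log trend is an ordinary least-squares fit of accuracy against the natural log of the data set size. The function names are hypothetical.

```python
import numpy as np

def subsample_percentages(n_rows, percents=range(10, 101, 10), seed=0):
    # Assumption: nested subsets taken from a single random permutation,
    # so each larger subset contains all smaller ones.
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    return {p: order[: max(1, n_rows * p // 100)] for p in percents}

def fit_log_trend(sizes, accuracies):
    # Fit accuracy = a + b * ln(size) by least squares.
    b, a = np.polyfit(np.log(sizes), accuracies, 1)
    return a, b
```

Each value returned by `subsample_percentages` is an array of row indices, so a subset such as Diabetes-30% would be `data[subsets[30]]` for a NumPy array `data`. Comparing the fitted slope `b` between algorithms gives a single number for how quickly each one improves as more data becomes available.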
Discussion • AIRS can be considered a reasonable classifier. However, contrary to earlier claims that AIRS performs far better than average algorithms, its performance is very close to the average of these algorithms. • On these benchmark data sets, AIRS behaves most like IBk and MLP. • As data set size increases, AIRS improves faster than MLP but more slowly than IB1. Its learning curve is similar in shape to those of the other algorithms.