Relevant characteristics extraction from semantically unstructured data
PhD title: Data mining in unstructured data
Daniel I. MORARIU, MSc
PhD Supervisor: Lucian N. VINŢAN
Sibiu, 2006
Contents • Prerequisites • Correlation of the SVM kernel’s parameters • Polynomial kernel • Gaussian kernel • Feature selection using Genetic Algorithms • Chromosome encoding • Genetic operators • Meta-classifier with SVM • Non-adaptive method – Majority Vote • Adaptive methods • Selection based on Euclidean distance • Selection based on cosine • Initial data set scalability • Choosing training and testing data sets • Conclusions and further work
Prerequisites • Reuters database processing • 806,791 total documents, 126 topics, 366 regions, 870 industry codes • Industry category selection – “system software” • 7,083 documents (4,722 training / 2,361 testing) • 19,038 attributes (features) • 24 classes (topics) • Data representation • Binary • Nominal • Cornell SMART • Classifier using Support Vector Machine techniques • Kernels
Correlation of the SVM kernel’s parameters • Polynomial kernel • Gaussian kernel
Polynomial kernel parameters’ correlation • Polynomial kernel • A commonly used kernel • d – the degree of the kernel • b – the offset • Our suggestion • b = 2 · d
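The kernel formula itself did not survive the export; assuming the common polynomial form K(x, y) = (x · y + b)^d, a minimal sketch of the suggested correlation b = 2 · d:

```python
import numpy as np

def polynomial_kernel(x, y, d):
    """Polynomial kernel with the offset tied to the degree, b = 2 * d,
    as suggested on the slide (the kernel form itself is an assumption)."""
    b = 2 * d                         # suggested offset correlation
    return (np.dot(x, y) + b) ** d

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.5, 1.0, 1.0])
print(polynomial_kernel(x, y, d=2))   # (2.5 + 4)**2 = 42.25
```

Tying b to d removes one free parameter from the grid search, which is what makes the correlation useful in practice.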
Gaussian kernel parameters’ correlation • Gaussian kernel • A commonly used kernel • C – usually represents the dimension of the data set • Our suggestion • n – the number of distinct features greater than 0
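Here too the formula did not survive the export; assuming the common form K(x, y) = exp(−‖x − y‖² / (n · C)), a minimal sketch in which n is computed automatically as the number of distinct non-zero features:

```python
import numpy as np

def gaussian_kernel(x, y, C):
    """Gaussian kernel whose width is scaled by n, the number of features
    that are non-zero in at least one of the two vectors (the slide's
    automatically computed n; its exact placement in the formula is an
    assumption)."""
    n = np.count_nonzero((x != 0) | (y != 0))
    return np.exp(-np.sum((x - y) ** 2) / (n * C))

x = np.array([1.0, 0.0, 2.0, 0.0])
y = np.array([0.5, 1.0, 0.0, 0.0])
print(gaussian_kernel(x, y, C=1.3))
```

For sparse text vectors n is far smaller than the full feature count, so scaling by n keeps the kernel from collapsing toward zero in very high-dimensional representations.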
[Chart: n – Gaussian kernel auto]
Feature selection using Genetic Algorithms • Chromosome • Fitness (ci) = SVM (ci) • Methods of selecting parents • Roulette Wheel • Gaussian selection • Genetic operators • Selection • Mutation • Crossover
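A chromosome can be encoded as a binary mask over the feature set, its fitness given by the accuracy of an SVM trained on the selected features; a toy sketch of the encoding and the genetic operators, with `svm_accuracy` as a stand-in for the real SVM training step:

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES = 20          # toy size; the thesis selects from thousands of features

def random_chromosome(n_selected):
    """A chromosome is a binary mask choosing n_selected features."""
    mask = np.zeros(N_FEATURES, dtype=bool)
    mask[rng.choice(N_FEATURES, size=n_selected, replace=False)] = True
    return mask

def fitness(chromosome, svm_accuracy):
    """Fitness(c_i) = SVM(c_i): the accuracy of an SVM trained on the
    features the chromosome selects; svm_accuracy stands in for that
    training/evaluation step."""
    return svm_accuracy(chromosome)

def mutate(chromosome, p=0.01):
    """Mutation: flip each bit independently with probability p."""
    return chromosome ^ (rng.random(N_FEATURES) < p)

def crossover(a, b):
    """One-point crossover of two feature masks."""
    cut = rng.integers(1, N_FEATURES)
    return np.concatenate([a[:cut], b[cut:]])

c1, c2 = random_chromosome(5), random_chromosome(5)
child = crossover(mutate(c1), c2)
print(child.sum(), "features selected in the offspring")
```

Because each fitness evaluation is a full SVM training run, the population size and generation count dominate the cost of GA-based feature selection.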
Methods of selecting the parents • Roulette Wheel • each individual is represented by a slice of the wheel proportional to its fitness • Gaussian • maximum value (m = 1) and dispersion (σ = 0.4)
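The two selection methods above can be sketched as follows; the Gaussian weighting formula is an assumption, only its parameters (m = 1, σ = 0.4) come from the slide:

```python
import numpy as np

rng = np.random.default_rng(1)

def roulette_wheel(fitnesses):
    """Each individual occupies a slice of the wheel proportional to its
    fitness; spin once and return the chosen parent's index."""
    f = np.asarray(fitnesses, dtype=float)
    return rng.choice(len(f), p=f / f.sum())

def gaussian_weight(fitness, m=1.0, sigma=0.4):
    """Gaussian selection weight, peaking at m = 1 for a perfect fitness
    with dispersion sigma = 0.4 (formula assumed, parameters from the
    slide)."""
    return m * np.exp(-((1.0 - fitness) ** 2) / (2.0 * sigma ** 2))

print(roulette_wheel([0.1, 0.3, 0.6]))   # index 2 is picked most often
```

Roulette-wheel selection favours fit individuals linearly, while a Gaussian weighting concentrates the selection pressure more sharply around the best chromosomes.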
Meta-classifier with SVM • Set of SVMs • Polynomial, degree 1, nominal • Polynomial, degree 2, binary • Polynomial, degree 2, Cornell SMART • Polynomial, degree 3, Cornell SMART • Gaussian, C=1.3, binary • Gaussian, C=1.8, Cornell SMART • Gaussian, C=2.1, Cornell SMART • Gaussian, C=2.8, Cornell SMART • Upper limit: 94.21%
Meta-classifier methods • Non-adaptive method • Majority Vote – each classifier votes a class for the current document • Adaptive methods – compute the similarity between the current sample and the error samples in each classifier’s self-queue • Selection based on Euclidean distance (SBED) • First good classifier • The best classifier • Selection based on cosine (SBCOS) • First good classifier • The best classifier • Using the average
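The two families of methods can be sketched as follows; the self-queue bookkeeping is simplified to a plain list of past error samples, and the exact selection rule (first good vs. best classifier) is left out:

```python
import numpy as np
from collections import Counter

def majority_vote(predictions):
    """Non-adaptive method: every SVM in the set votes a class for the
    current document and the most voted class wins."""
    return Counter(predictions).most_common(1)[0][0]

def nearest_error_distance(sample, error_queue):
    """Adaptive methods (SBED flavour): Euclidean distance from the
    current sample to its closest error sample in a classifier's
    self-queue; a classifier whose past errors lie far from the sample
    is preferred."""
    return min(np.linalg.norm(sample - e) for e in error_queue)

# e.g. the 8 SVMs of the previous slide voting on one document
print(majority_vote(["c4", "c4", "c7", "c4", "c1", "c7", "c4", "c4"]))  # c4
```

The cosine variant (SBCOS) replaces the Euclidean distance with the cosine between the sample and the queued error vectors; otherwise the scheme is the same.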
Initial data set scalability • Normalize each sample (7,053) • Group the initial set based on distance (4,474 groups) • Take one relevant vector per group (4,474) • Use the relevant vectors in the classification process • Select only the support vectors (847) • Take the samples grouped into the selected support vectors (4,256) • Run the classification with the 4,256 samples
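The first grouping steps can be sketched as below; the greedy threshold grouping is an assumption (the slide does not give the exact grouping rule), and the later steps would hand the representatives to an SVM and keep only its support vectors:

```python
import numpy as np

def normalize(X):
    """Step 1: scale each sample to unit length."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.where(norms == 0, 1, norms)

def group_by_distance(X, threshold):
    """Steps 2-3: greedily assign each sample to the first representative
    closer than the threshold, creating a new group (with the sample as
    its relevant vector) when none is close enough."""
    reps, assign = [], []
    for x in X:
        for i, r in enumerate(reps):
            if np.linalg.norm(x - r) < threshold:
                assign.append(i)
                break
        else:
            assign.append(len(reps))
            reps.append(x)
    return np.array(reps), np.array(assign)

X = normalize(np.random.default_rng(2).random((50, 8)))
reps, assign = group_by_distance(X, threshold=0.5)
print(len(reps), "representatives for", len(X), "samples")
```

Classifying with the representatives first and then recovering only the samples behind the support vectors is what shrinks the training set with little loss of accuracy.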
Conclusions – other results • Using our parameter correlations • 3% better for the polynomial kernel • 15% better for the Gaussian kernel • Reduced the number of features to between 2.5% (475) and 6% (1,309) of the original set • GA_FS is faster than SVM_FS • Polynomial kernel: best with nominal representation and a small degree • Gaussian kernel: best with Cornell SMART representation • The Reuters database is linearly separable • SBED is better and faster than SBCOS • Classification accuracy decreases by only 1.2% when the data set is reduced
Further work • Feature extraction and selection • Association rules between words (Mutual Information) • The synonymy and polysemy problems • Using families of words (WordNet) • Web mining application • Classifying larger text data sets • A better method of grouping data • Using classification and clustering together