
Relevant characteristics extraction from semantically unstructured data

Relevant characteristics extraction from semantically unstructured data. PhD title: Data mining in unstructured data. Daniel I. MORARIU, MSc. PhD Supervisor: Lucian N. VINŢAN. Sibiu, 2006. Contents: Prerequisites; Correlation of the SVM kernel’s parameters; Polynomial kernel




Presentation Transcript


  1. Relevant characteristics extraction from semantically unstructured data • PhD title: Data mining in unstructured data • Daniel I. MORARIU, MSc • PhD Supervisor: Lucian N. VINŢAN • Sibiu, 2006

  2. Contents • Prerequisites • Correlation of the SVM kernel’s parameters • Polynomial kernel • Gaussian kernel • Feature selection using Genetic Algorithms • Chromosome encoding • Genetic operators • Meta-classifier with SVM • Non-adaptive method – Majority Vote • Adaptive methods • Selection based on Euclidean distance • Selection based on cosine • Initial data set scalability • Choosing training and testing data sets • Conclusions and further work

  3. Prerequisites • Reuters Database Processing • 806791 total documents, 126 topics, 366 regions, 870 industry codes • Industry category selection – “system software” • 7083 documents (4722 training / 2361 testing) • 19038 attributes (features) • 24 classes (topics) • Data representation • Binary • Nominal • Cornell SMART • Classifier using Support Vector Machine techniques • kernels
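
The three data representations named on this slide are not defined in the transcript. A minimal sketch, assuming the definitions commonly used in the text-mining literature (binary presence, term count divided by the largest count in the document, and the Cornell SMART double logarithm); the function names are illustrative and the formulas are an assumption, not taken from the slides:

```python
import math

def binary(counts):
    # 1 if the term occurs in the document, 0 otherwise.
    return [1.0 if n > 0 else 0.0 for n in counts]

def nominal(counts):
    # Term count divided by the largest term count in the same document.
    peak = max(counts, default=0)
    return [n / peak if peak else 0.0 for n in counts]

def cornell_smart(counts):
    # Assumed Cornell SMART weighting: 0 for absent terms,
    # 1 + log(1 + log(n)) for a term that occurs n times.
    return [1 + math.log(1 + math.log(n)) if n > 0 else 0.0 for n in counts]

counts = [0, 1, 4]  # hypothetical term counts for one document
print(binary(counts), nominal(counts), cornell_smart(counts))
```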

  4. Correlation of the SVM kernel’s parameters • Polynomial kernel • Gaussian kernel

  5. Polynomial kernel parameter’s correlation • Polynomial kernel • Commonly used kernel • d – degree of the kernel • b – the offset • Our suggestion • b = 2 * d
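
The kernel formula on this slide was an image and did not survive in the transcript. A minimal sketch, assuming the commonly used form k(x, y) = (x·y + b)^d, with the offset defaulting to the suggested b = 2 * d; the function name is illustrative:

```python
def polynomial_kernel(x, y, d, b=None):
    # Assumed standard polynomial kernel: (x . y + b) ** d.
    # When no offset is given, apply the slide's suggestion b = 2 * d.
    if b is None:
        b = 2 * d
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return (dot + b) ** d

# Degree 2, so the offset becomes 4: (1*3 + 2*4 + 4) ** 2
print(polynomial_kernel([1, 2], [3, 4], d=2))
```

Tying b to d leaves a single parameter to tune instead of two.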

  6. Bias – Polynomial kernel

  7. Gaussian kernel parameter’s correlation • Gaussian kernel • Commonly used kernel • C – usually represents the dimension of the set • Our suggestion • n – number of distinct features greater than 0
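
The Gaussian kernel formula was also an image. A sketch under two stated assumptions: that the suggested kernel has the form exp(-||x - y||² / (n * C)), and that n counts the feature positions that are greater than 0 in at least one of the two input vectors. Neither detail is confirmed by the transcript:

```python
import math

def gaussian_kernel(x, y, C):
    # Assumption: n = number of feature positions that are > 0
    # in at least one of the two vectors.
    n = sum(1 for xi, yi in zip(x, y) if xi > 0 or yi > 0)
    if n == 0:
        return 1.0  # two all-zero vectors are identical
    squared_distance = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    # Assumed form: the fixed denominator is replaced by n * C.
    return math.exp(-squared_distance / (n * C))

print(gaussian_kernel([1, 0, 0], [0, 1, 0], C=1.3))
```

Under this reading, the kernel width adapts to each pair of documents, so C no longer has to reflect the size of the full feature set.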

  8. n – Gaussian kernel auto

  9. Feature selection using Genetic Algorithms • Chromosome • Fitness (ci) = SVM (ci) • Methods of selecting parents • Roulette Wheel • Gaussian selection • Genetic operators • Selection • Mutation • Crossover
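
The slide lists mutation and crossover without describing them. A minimal sketch of generic versions of both operators, assuming a chromosome is a list of real-valued per-feature weights (the transcript does not specify the encoding); one-point crossover and Gaussian mutation are swapped in here as common textbook choices, not as the author's operators:

```python
import random

def one_point_crossover(parent_a, parent_b, rng):
    # Cut both parents at the same random point and swap the tails.
    point = rng.randrange(1, len(parent_a))
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def mutate(chromosome, rate, rng, scale=0.1):
    # Perturb each gene with probability `rate` by Gaussian noise.
    # `scale` is a hypothetical step size, not a value from the slides.
    return [g + rng.gauss(0, scale) if rng.random() < rate else g
            for g in chromosome]

rng = random.Random(0)
a, b = one_point_crossover([0.0] * 5, [1.0] * 5, rng)
print(a, b, mutate(a, rate=0.2, rng=rng))
```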

  10. Methods of selecting the parents • Roulette Wheel • each individual is represented by a space that corresponds proportionally to its fitness • Gaussian: • maximum value (m = 1) and dispersion (σ = 0.4)
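
Roulette Wheel selection as described above can be sketched as follows: each individual owns a slice of the wheel proportional to its fitness, and a random spin picks the slice it lands in. The function name and the example fitness values are illustrative; the Gaussian method is not covered because the transcript gives only its parameters:

```python
import random
from bisect import bisect_right
from itertools import accumulate

def roulette_wheel(fitness, rng):
    # Cumulative sums mark where each individual's slice ends.
    cumulative = list(accumulate(fitness))
    spin = rng.random() * cumulative[-1]
    # The first slice whose end lies beyond the spin is selected.
    return min(bisect_right(cumulative, spin), len(fitness) - 1)

rng = random.Random(0)
picks = [roulette_wheel([0.0, 1.0, 3.0], rng) for _ in range(1000)]
print(picks.count(1), picks.count(2))
```

An individual with zero fitness owns no slice and is never chosen, while one with three times the fitness of another is chosen about three times as often.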

  11. The process of obtaining the next generation

  12. GA_FS versus SVM_FS for 1309 features

  13. Training time, polynomial kernel, d= 2, NOM

  14. GA_FS versus SVM_FS for 1309 features

  15. Training time, Gaussian kernel, C=1.3, BIN

  16. Meta-classifier with SVM • Set of SVMs • Polynomial degree 1, Nominal • Polynomial degree 2, Binary • Polynomial degree 2, Cornell SMART • Polynomial degree 3, Cornell SMART • Gaussian C=1.3, Binary • Gaussian C=1.8, Cornell SMART • Gaussian C=2.1, Cornell SMART • Gaussian C=2.8, Cornell SMART • Upper limit (94.21%)

  17. Meta-classifier methods • Non-adaptive method • Majority Vote – each classifier votes for a specific class for the current document • Adaptive methods – compute the similarity between the current sample and the error samples stored in each classifier’s own queue • Selection based on Euclidean distance • First good classifier • The best classifier • Selection based on cosine • First good classifier • The best classifier • Using average
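
The non-adaptive Majority Vote method can be sketched in a few lines: every classifier in the set casts one vote and the most frequent class wins. The tie-breaking rule (the class that received its first vote earliest) is an assumption, since the slide does not state one, and the class labels are hypothetical:

```python
from collections import Counter

def majority_vote(votes):
    # `votes` holds one predicted class per classifier.
    # Counter.most_common keeps first-seen order among equal counts,
    # so ties go to the class that was voted for first.
    return Counter(votes).most_common(1)[0][0]

# Eight classifiers, as in the set listed on slide 16.
print(majority_vote(["c15", "c18", "c15", "c15", "c22", "c18", "c15", "c15"]))
```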

  18. Selection based on Euclidean distance

  19. Selection based on cosine
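
The two similarity measures behind slides 18 and 19 are standard and can be sketched directly. How each measure is thresholded to accept or reject a classifier is not given in the transcript, so only the measures themselves are shown; the function names are illustrative:

```python
import math

def euclidean_distance(x, y):
    # Smaller value means the sample is closer to a stored error sample.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    # Larger value means the two vectors point in a more similar direction.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_x * norm_y)

print(euclidean_distance([0, 0], [3, 4]), cosine_similarity([1, 0], [0, 1]))
```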

  20. Comparison between SBED and SBCOS

  21. Comparison between SBED and SBCOS

  22. Initial data set scalability • Normalize each sample (7053) • Group initial set based on distance (4474) • Take relevant vector (4474) • Use relevant vector in classification process • Select only support vectors (847) • Take samples grouped in selected support vectors (4256) • Make the classification (with 4256 samples)
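
The first two steps above (normalizing, then grouping by distance) can be sketched as follows. This is a hedged illustration only: it assumes unit-length (L2) normalization, a greedy single-pass grouping, a hypothetical distance threshold, and that the first member of each group serves as its relevant vector. None of these details are stated in the transcript:

```python
import math

def normalize(vector):
    # Assumption: scale each sample to unit Euclidean length.
    norm = math.sqrt(sum(a * a for a in vector))
    return [a / norm for a in vector] if norm else list(vector)

def group_by_distance(samples, threshold):
    # Greedy pass: join the first group whose representative is within
    # `threshold`, otherwise start a new group with this sample as its
    # representative (a stand-in for the "relevant vector").
    groups = []  # list of (representative, member indices)
    for index, sample in enumerate(samples):
        for representative, members in groups:
            if math.dist(representative, sample) <= threshold:
                members.append(index)
                break
        else:
            groups.append((sample, [index]))
    return groups

samples = [normalize(v) for v in [[1, 0], [2, 0.1], [0, 3]]]
for representative, members in group_by_distance(samples, threshold=0.2):
    print(representative, members)
```

Training on one representative per group, and then on the members of the groups whose representatives became support vectors, is what shrinks the training set from 7053 samples to 4256 in the steps listed above.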

  23. Polynomial kernel – 1309 features, NOM

  24. Gaussian kernel – 1309 features, CS

  25. Training time

  26. Choosing training and testing data set

  27. Choosing training and testing data set

  28. Conclusions – other results • Using our correlation • 3% better for Polynomial kernel • 15% better for Gaussian kernel • Reduced number of features between 2.5% (475) and 6% (1309) • GA_FS faster than SVM_FS • Polynomial kernel with nominal representation and small degree • Gaussian kernel with Cornell SMART representation • Reuters database is linearly separable • SBED is better and faster than SBCOS • Classification accuracy decreases by 1.2% when the data set is reduced

  29. Further work • Feature extraction and selection • Association rules between words (Mutual Information) • Synonymy and polysemy problem • Using families of words (WordNet) • Web mining application • Classifying larger text data sets • A better method of grouping data • Using classification and clustering together
