Contributions to MiningMart
Petr Berka
Laboratory for Intelligent Systems
University of Economics, Prague
berka@vse.cz

University of Economics, Prague
• LISp - Laboratory for Intelligent Systems
• SALOME - Laboratory for Multidisciplinary Approaches to Decision-making Support in Economics and Management
MiningMart presentation (c) Petr Berka, LISp, 2001

LISp research
• probabilistic methods - decomposable probability models and Bayesian networks
• symbolic ML methods - 4FT association rules and decision rules
• logical calculi for knowledge discovery in databases

LISp activities
• Organized conferences: ECML’97, PKDD’99
• Organized workshops: Discovery Challenge (PKDD’99, PKDD2000, PKDD2001), WUPES’97, WUPES2000
• International projects: MLNet, Sol-Eu-Net, EUNITE, MUM, MGT KDNet

SALOME research
• Quantitative and AI (pattern recognition, fuzzy, neural nets) approaches to decision-making support in economics and management

SALOME activities
• Organized workshops: STIPR’97, MME’99
• International projects: Univ. Salzburg, Univ. Hokkaido, Univ. Cambridge

LISp software
• LISp-Miner (data mining system)
  • DataSource (for data manipulation)
  • 4FT Miner (4FT association rules)
  • KEX (decision rules)
• experimental software for building graphical models
• preprocessing procedures
  • related to KEX
  • based on an information-theoretic approach

LISp-Miner procedures
• DataSource
  • creating new (virtual) attributes using SQL
  • equidistant and equifrequent discretization
  • grouping attribute values
  • computing attribute-value frequencies

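The two discretization modes named above can be illustrated with a minimal Python sketch. This is not the DataSource implementation; the function names, the half-open interval convention, and the sample `ages` list are assumptions made for illustration only.

```python
def equidistant_cuts(values, k):
    """Cut points that split the value range into k intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equifrequent_cuts(values, k):
    """Cut points chosen so each interval holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]


def assign_interval(value, cuts):
    """Index of the interval containing value; a cut point belongs to the interval above it."""
    return sum(value >= c for c in cuts)


# Hypothetical attribute values
ages = [18, 19, 20, 21, 22, 23, 40, 60]
width_cuts = equidistant_cuts(ages, 3)
freq_cuts = equifrequent_cuts(ages, 4)
freq_bins = [assign_interval(a, freq_cuts) for a in ages]
```

With skewed data such as `ages`, equal-width intervals leave most values in the first interval, while equal-frequency cuts spread them evenly.
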
LISp-Miner procedures
• 4FT-Miner (GUHA procedure)
  • 4FT association rules in the form Ant ~ Suc / Cond
• KEX
  • weighted decision rules in the form Ant ⇒ C (weight)

4FT-Miner basic idea
• Generate a (potential) rule, e.g.
  COLOUR(red) ∧ SIZE(small) ⇒ (0.9, 20) TEMP(high)
  AGE(21-30) ∧ SALARY(low) ⇒ (0.85, 15) PAYMENTS(High) / LOAN(bad)
• Verify the rule using a four-fold table

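The verification step can be sketched in Python as follows. The sketch assumes that the pair of numbers attached to a rule (such as 0.9, 20) is a founded-implication quantifier, i.e. a minimum confidence p and a minimum support count Base; that reading, the function names, and the toy rows are assumptions, not the 4FT-Miner code.

```python
def four_fold_table(rows, antecedent, succedent):
    """Count the four frequencies a, b, c, d for a rule Ant ~ Suc.

    a: Ant and Suc hold, b: Ant holds but Suc does not,
    c: Suc holds but Ant does not, d: neither holds.
    """
    a = b = c = d = 0
    for row in rows:
        ant = all(row.get(k) == v for k, v in antecedent.items())
        suc = all(row.get(k) == v for k, v in succedent.items())
        if ant and suc:
            a += 1
        elif ant:
            b += 1
        elif suc:
            c += 1
        else:
            d += 1
    return a, b, c, d


def founded_implication(a, b, p, base):
    """Assumed quantifier: at least `base` supporting rows and a/(a+b) >= p."""
    return a >= base and (a + b) > 0 and a / (a + b) >= p


# Hypothetical data: 15 rows
rows = (
    [{"COLOUR": "red", "SIZE": "small", "TEMP": "high"}] * 9
    + [{"COLOUR": "red", "SIZE": "small", "TEMP": "low"}]
    + [{"COLOUR": "blue", "SIZE": "large", "TEMP": "low"}] * 5
)
table = four_fold_table(rows, {"COLOUR": "red", "SIZE": "small"}, {"TEMP": "high"})
holds_small_base = founded_implication(table[0], table[1], p=0.9, base=5)
holds_large_base = founded_implication(table[0], table[1], p=0.9, base=20)
```

On this toy data the rule reaches the confidence threshold but is rejected once the required support count exceeds the nine supporting rows.
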
KEX basic idea
• Generate a (potential) rule, e.g.
  YEARS-IN-COMPANY(0-3) ∧ AGE(0-25) ⇒ LOAN(GOOD)
• If the rule refines the current set of rules (its validity a/(a+b) differs from the weight inferred during consultation), add it to the rule base with the proper weight

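A rough Python sketch of this refinement test follows. It rests on two assumptions that the slide does not state: weights are combined with a PROSPECTOR-style pseudo-Bayesian formula, and a fixed `tolerance` stands in for whatever statistical test the real system applies. All names are hypothetical.

```python
def combine(x, y):
    """Assumed PROSPECTOR-style combination of two weights in (0, 1)."""
    return x * y / (x * y + (1 - x) * (1 - y))


def weight_to_add(validity, inferred):
    """Weight w for which combine(w, inferred) equals validity (solved via odds)."""
    odds = (validity / (1 - validity)) / (inferred / (1 - inferred))
    return odds / (1 + odds)


def consider_rule(a, b, inferred, tolerance=0.05, eps=1e-6):
    """Return the weight for a new rule, or None if the rule adds nothing.

    a, b: four-fold table counts for the candidate rule.
    inferred: weight the current rule base already infers for this antecedent.
    tolerance: hypothetical stand-in for a significance test.
    """
    validity = a / (a + b)
    if abs(validity - inferred) <= tolerance:
        return None  # the rule base already predicts this validity
    # Clamp away from 0 and 1 so the odds stay finite
    validity = min(max(validity, eps), 1 - eps)
    inferred = min(max(inferred, eps), 1 - eps)
    return weight_to_add(validity, inferred)


new_weight = consider_rule(a=90, b=10, inferred=0.6)
redundant = consider_rule(a=52, b=48, inferred=0.5)
```

After the rule is added with `new_weight`, combining it with the previously inferred weight reproduces the observed validity, which is the sense in which the weight is "proper" in this sketch.
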
LISp-Miner architecture
[Diagram components: MetaData (ODBC ACCESS), Data (ODBC ACCESS), LM (Windows), Results]

Preprocessing (LISp)
• KEX-oriented
  • (fuzzy) discretization + grouping of values
  • computing the amount of noise in data
  • random sampling + balancing of data
  • handling missing values
• Information theory
  • attribute selection
  • attribute grouping

… fuzzy discretization

… amount of noise
Amount of noise: 20%, hence max. possible accuracy = 80%

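One way to arrive at such a figure is sketched below in Python. The slide does not say how the noise is computed, so the sketch assumes that noise means contradictory examples: rows with identical attribute values but different classes, of which only the majority class can ever be predicted correctly. The function name and data are hypothetical.

```python
from collections import Counter, defaultdict


def amount_of_noise(examples):
    """Fraction of examples no classifier can get right.

    examples: list of (attribute_tuple, class_label) pairs.
    Within each group of identical attribute tuples, every example outside
    the majority class is counted as unavoidable error.
    """
    by_attrs = defaultdict(Counter)
    for attrs, label in examples:
        by_attrs[tuple(attrs)][label] += 1
    unavoidable = sum(
        sum(counts.values()) - max(counts.values()) for counts in by_attrs.values()
    )
    return unavoidable / len(examples)


# Hypothetical data: 10 examples, 2 of them contradict the majority of their group
examples = (
    [(("red", "small"), "good")] * 4
    + [(("red", "small"), "bad")]
    + [(("blue", "large"), "bad")] * 4
    + [(("blue", "large"), "good")]
)
noise = amount_of_noise(examples)
max_accuracy = 1 - noise
```
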
… data sampling
• random split into training and testing set
• select random stratified sample
• balance unbalanced classes

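The sampling options listed above can be illustrated with a short Python sketch. It is not the LISp preprocessing code: the function names are hypothetical, and balancing is done here by undersampling the larger classes, which is only one possible choice.

```python
import random
from collections import defaultdict


def group_by_class(examples, label_of):
    """Group examples by their class label."""
    groups = defaultdict(list)
    for ex in examples:
        groups[label_of(ex)].append(ex)
    return groups


def stratified_split(examples, label_of, test_fraction, seed=0):
    """Random train/test split that keeps class proportions in both parts."""
    rng = random.Random(seed)
    train, test = [], []
    for members in group_by_class(examples, label_of).values():
        members = members[:]  # do not shuffle the caller's list
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def balance_by_undersampling(examples, label_of, seed=0):
    """Assumed balancing strategy: shrink every class to the smallest class size."""
    rng = random.Random(seed)
    groups = group_by_class(examples, label_of)
    smallest = min(len(members) for members in groups.values())
    balanced = []
    for members in groups.values():
        balanced.extend(rng.sample(members, smallest))
    return balanced


# Hypothetical unbalanced data: 80 "good" and 20 "bad" examples
data = [(i, "good") for i in range(80)] + [(i, "bad") for i in range(80, 100)]
label = lambda ex: ex[1]
train, test = stratified_split(data, label, test_fraction=0.25)
balanced = balance_by_undersampling(data, label)
```
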
… handling missing values
• remove example
• substitute missing with new value
• substitute missing with majority value
• proportional substitution

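The four strategies can be sketched in Python as below. The representation (rows as dictionaries, `None` for a missing value), the function names, and the reading of proportional substitution as drawing a replacement according to the observed value frequencies are assumptions made for illustration.

```python
import random
from collections import Counter


def observed_values(rows, col):
    """All non-missing values of a column (None marks a missing value)."""
    return [r[col] for r in rows if r[col] is not None]


def remove_examples(rows, col):
    """Strategy 1: drop every example whose value is missing."""
    return [r for r in rows if r[col] is not None]


def substitute_new_value(rows, col, new_value="unknown"):
    """Strategy 2: treat 'missing' as one more attribute value."""
    return [r if r[col] is not None else {**r, col: new_value} for r in rows]


def substitute_majority(rows, col):
    """Strategy 3: replace missing values with the most frequent observed value."""
    majority = Counter(observed_values(rows, col)).most_common(1)[0][0]
    return [r if r[col] is not None else {**r, col: majority} for r in rows]


def substitute_proportional(rows, col, seed=0):
    """Strategy 4 (assumed reading): draw a replacement with probability
    proportional to each observed value's frequency."""
    rng = random.Random(seed)
    pool = observed_values(rows, col)
    return [r if r[col] is not None else {**r, col: rng.choice(pool)} for r in rows]


# Hypothetical data with one missing SALARY value
rows = [{"SALARY": "low"}, {"SALARY": "low"}, {"SALARY": "high"}, {"SALARY": None}]
removed = remove_examples(rows, "SALARY")
as_new = substitute_new_value(rows, "SALARY")
as_majority = substitute_majority(rows, "SALARY")
as_proportional = substitute_proportional(rows, "SALARY")
```
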
… information theory
• Attribute selection - based on mutual information
• Attribute grouping - based on information content

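Attribute selection by mutual information can be illustrated with the following Python sketch. It computes the standard mutual information I(X;Y) in bits from empirical frequencies and ranks attributes by it; the function names and toy columns are hypothetical, and the actual selection procedure may differ.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits between two value lists."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items()
    )


def rank_attributes(columns, labels):
    """Attribute names ordered from most to least informative about the class."""
    return sorted(
        columns,
        key=lambda name: mutual_information(columns[name], labels),
        reverse=True,
    )


# Hypothetical data: SALARY determines the class, COLOUR is unrelated to it
labels = ["good", "good", "bad", "bad"]
columns = {
    "COLOUR": ["red", "blue", "red", "blue"],
    "SALARY": ["high", "high", "low", "low"],
}
ranking = rank_attributes(columns, labels)
```
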
Preprocessing architecture
[Diagram labels: Input data (ASCII), procedure, Output data (ASCII); Data (ASCII), procedure, Results]

SALOME software
• Feature Selection Toolbox (Multi-Purpose Tool for Pattern Recognition)
  • feature selection
  • approximation-based modeling
  • classification
• A consulting system that helps choose the most suitable method is being developed

Search strategies for FS
Search for a subset maximizing a criterion function (distance, divergence):
• with a priori information
  • exhaustive search
  • branch and bound based algorithms
  • floating search algorithms
• without a priori information
  • approximation method
  • divergence method

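The simplest strategy in the list, exhaustive search, can be sketched in Python as follows. The criterion used here is a made-up score (per-feature relevance minus a redundancy penalty), not one of the distance or divergence measures of the Feature Selection Toolbox; all names and numbers are hypothetical.

```python
from itertools import combinations


def exhaustive_search(features, d, criterion):
    """Evaluate every subset of size d and return the one with the highest criterion."""
    return max(combinations(features, d), key=criterion)


# Hypothetical per-feature relevance scores and one redundant pair
RELEVANCE = {"AGE": 0.6, "SALARY": 0.5, "INCOME": 0.55, "COLOUR": 0.1}
REDUNDANT = {frozenset({"SALARY", "INCOME"}): 0.4}


def toy_criterion(subset):
    """Sum of relevance scores, penalised when a redundant pair is selected together."""
    score = sum(RELEVANCE[f] for f in subset)
    for pair, penalty in REDUNDANT.items():
        if pair <= set(subset):
            score -= penalty
    return score


best_pair = exhaustive_search(list(RELEVANCE), 2, toy_criterion)
```

Exhaustive search guarantees the optimum but evaluates every subset of the requested size, which is why branch and bound and floating search are listed as alternatives for larger feature sets.
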
FST architecture
[Diagram components: Data (ASCII), FST (Windows), Results]

References
LISp-Miner:
• Berka, P. - Ivanek, J.: Automated Knowledge Acquisition for PROSPECTOR-like Expert Systems. In: (Bergadano, deRaedt eds.) Proc. ECML'94, Springer 1994, 339-342.
• Berka, P. - Rauch, J.: Data Mining using GUHA and KEX. In: (Callaos, Yang, Aguilar eds.) 4th Int. Conf. on Information Systems, Analysis and Synthesis ISAS'98, 1998, Vol. 2, 238-244.
• Rauch, J.: Classes of Four Fold Table Quantifiers. In: (Zytkow, Quafafou eds.) Principles of Data Mining and Knowledge Discovery. Springer 1998, 203-211.

References
Preprocessing:
• Bruha, I. - Berka, P.: Discretization and Fuzzification of Numerical Attributes in Attribute-Based Learning. In: (Szczepaniak, Lisboa, Kacprzyk eds.) Fuzzy Systems in Medicine, Physica Verlag, 2000, 112-138.
• Pudil, P. - Novovičová, J.: Novel Methods for Subset Selection with Respect to Problem Knowledge. IEEE Transactions on Intelligent Systems - Special Issue on Feature Transformation and Subset Selection, 1998, 66-74.
• Zvarova, J. - Studeny, M.: Information theoretical approach to constitution and reduction of medical data. International Journal of Medical Informatics 45 (1997), n. 1-2, 65-74.