Self-Organised Data Mining – 20 Years after GUHA-80

Martin Kejkula, KEG, 8th April 2004, http://gama.vse.cz/keg/

Agenda • Idea of Self-Organised Data Mining • GUHA-80 revival • Process of Self-Organised Data Mining • Key factors for Self-Organised Data Mining • Metabase, Knowledge Base, etc. • Proposed EverMiner system for Self-Organised Data Mining

Introduction • Motivation: support X-Miner users • Collection of best practices and known problems • Mueller, Lemke: Self-Organising Data Mining (2000) • My thesis: • Design and test strings of jobs for EverMiner • Formalize and use heuristics

References (1) • Hájek, P. – Havránek, T.: GUHA 80: An Application of Artificial Intelligence to Data Analysis. Computers and Artificial Intelligence, Vol. 1, 1982, pp. 107-134 • Hájek, P. – Ivánek, J.: Artificial Intelligence and Data Analysis. Proc. COMPSTAT’82, Wien, Physica Verlag 1982, pp. 54-60

References (2) • Hájek, P. – Havránek, T.: GUHA-80 – An Application of Artificial Intelligence to Data Analysis. Matematické středisko biologických ústavů ČSAV, Praha, 1982 • Jirků, P. – Havránek, T.: On Verbosity Levels in Cognitive Problem Solvers. Proc. Computational Linguistics, 1982, http://acl.eldoc.ub.rug.nl/mirror/C/C82/

References (3) • Rauch, J.: EverMiner – studie projektu [EverMiner: a project study]. Dokumentace projektu LISp-Miner [LISp-Miner project documentation], 2003. • Mueller, J.-A. – Lemke, F.: Self-Organising Data Mining. Extracting Knowledge from Data. Dresden, Berlin, 2000.

GUHA-80: Main Features • Application of artificial intelligence to exploratory data analysis • Generates interesting views of given empirical data (recognizes interesting logical patterns) • Views should be relevant and useful

GUHA-80 Sources (1) • GUHA: automatically generate all interesting hypotheses • Lenat’s AM: jobs (tasks), agenda of jobs, hundreds of heuristic rules, concepts

GUHA-80 Sources (2) • GUHA-80 vs. Lenat’s AM • Data • Data-processing procedures • Statistical program packages • Effective modules

GUHA-80 Paradigm • Open-ended data analysis • Maximize the interestingness value • Hundreds of heuristic rules • Rules guide the definition and study of the next step • Access potentially relevant rules, find truly relevant rules, follow truly relevant rules

Interestingness in GUHA-80 • No explicit definition • Determined by the interplay of: • Heuristic rules • Weighting mechanisms • Testing in practice (adequate behaviour?) • No algorithm, but constraints

Principles of GUHA-80 • Domain dependence (…exploratory data analysis) • Combine human capabilities with those of the machine • Multiple heuristics are relevant • Interactivity with the user • Non-routine (GUHA-80 is not for everyday data processing)

GUHA-80 Structure (1)

GUHA-80 Structure (2) • Input empirical data • Input parameters: how “interestingness” is understood • Effective modules (the system’s knowledge): clustering procedures, GUHA procedures • Agenda of jobs (priority/weight)
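
The "agenda of jobs (priority/weight)" idea can be sketched as a priority queue. This is only an illustration, not GUHA-80's actual mechanism: the `Agenda` class, the job names, and the weights are all hypothetical.

```python
import heapq
import itertools


class Agenda:
    """Agenda of jobs ordered by weight (highest weight first)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: insertion order

    def add(self, name, weight):
        # heapq is a min-heap, so the weight is stored negated.
        heapq.heappush(self._heap, (-weight, next(self._counter), name))

    def pop(self):
        neg_weight, _, name = heapq.heappop(self._heap)
        return name, -neg_weight

    def __len__(self):
        return len(self._heap)


# Hypothetical jobs and weights, for illustration only.
agenda = Agenda()
agenda.add("discretize attributes", weight=0.9)
agenda.add("run 4ft task", weight=0.5)
agenda.add("cluster objects", weight=0.7)

order = []
while agenda:
    name, weight = agenda.pop()
    order.append(name)

print(order)
```

In a fuller system a heuristic would add new jobs or change weights while the agenda is being processed; the sketch only shows the ordering.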

GUHA-80 Structure (3) • Heuristics: optimal way to realize a job • Changing system of concepts • Hierarchy of concepts (applicability) • Possible unification of heuristics, jobs,…

GUHA-80 Input • Data • Input information • Decompositions/orderings of sets of quantities • Helps the system understand “interestingness”

GUHA-80 Effective modules • Evaluation of usual statistical characteristics,… • Complicated procedures • Synthesis of parameters (“job on job”)

GUHA-80 • Hundreds of heuristic rules • No explicit definition of interestingness (exploration in a space) • Interactivity with the user • Non-routine character

Process of S-O Data Mining (diagram) • Input: empirical data, domain knowledge, … • Chains of data and knowledge processing tasks (DataSource, TimeTransf, SumatraTT, 4ft, KL, CF, …) • Output: all interesting views, patterns

Key Factors of S-O Data Mining • Data Preparation • Modeling • Evaluation • Knowledge Base • Domain Knowledge

Data Preparation • Discretization • Depends on attribute type (nominal/ordinal/interval/ratio) • Depends on the type of coefficient • Discretization-modeling cycle (KL, 4ft, CF, …) • Known problem: intervals of categories without values • Usually there is not a single target attribute

Attribute-type-dependent discretization • Nominal • Classes of values • Ordinal • Extreme/missing values • Type of coefficient • Usually not one target attribute

Intervals of Categories without Values • Solution: • Statistics: extreme values • 4ft task: correlations, implications • Potentially interesting patterns
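
One way to see how intervals without values arise is equidistant discretization of a column that contains an extreme value. The sketch below is an assumption about the problem the slide refers to; the function name and the sample data are hypothetical.

```python
def equidistant_bins(values, n_bins):
    """Split [min, max] into n_bins equal-width intervals and count values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; cannot build intervals")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum falls on the upper edge, so clamp it into the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    bins = [(lo + i * width, lo + (i + 1) * width) for i in range(n_bins)]
    return bins, counts


# Hypothetical column: five ordinary values and one extreme value.
values = [1, 2, 3, 4, 5, 100]
bins, counts = equidistant_bins(values, n_bins=5)
empty = [i for i, count in enumerate(counts) if count == 0]

print(counts)  # counts per interval
print(empty)   # indices of intervals without values
```

Here the single extreme value stretches the range so that the middle intervals stay empty, which is why checking the statistics for extreme values first is a natural step.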

Extreme/Missing Values • 4ft: find associations between extreme/missing values (implications/correlations) • CF, KL: find patterns with extreme/missing values
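
A minimal sketch of the 4ft idea applied to missing values: build the four-fold table (a, b, c, d) for two "value is missing" conditions and test an implication. It assumes the founded implication quantifier in its usual form, a/(a+b) ≥ p and a ≥ Base; the column names and data are hypothetical, and this is not LISp-Miner code.

```python
def four_fold_table(rows, antecedent, succedent):
    """Return (a, b, c, d) frequencies for antecedent vs. succedent."""
    a = b = c = d = 0
    for row in rows:
        ant, suc = antecedent(row), succedent(row)
        if ant and suc:
            a += 1
        elif ant:
            b += 1
        elif suc:
            c += 1
        else:
            d += 1
    return a, b, c, d


def founded_implication(a, b, p, base):
    """True when a >= base and the confidence a / (a + b) is at least p."""
    return a >= base and (a + b) > 0 and a / (a + b) >= p


# Hypothetical data matrix; None marks a missing value.
rows = [
    {"bmi": None, "pressure": None},
    {"bmi": None, "pressure": None},
    {"bmi": None, "pressure": 120},
    {"bmi": 24.0, "pressure": 130},
    {"bmi": 31.5, "pressure": None},
    {"bmi": 22.1, "pressure": 118},
]

a, b, c, d = four_fold_table(
    rows,
    antecedent=lambda r: r["bmi"] is None,
    succedent=lambda r: r["pressure"] is None,
)
print((a, b, c, d))
print(founded_implication(a, b, p=0.6, base=2))
```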

Data Preparation • Classes of attributes • Partial cedents • Associations between attributes in one class • Associations between partial cedents

Evaluation-Modeling • Input information for partial cedents • Mining for interesting patterns • Exceptions • Missing values • Extreme values • Discovered hypotheses • Groups of hypotheses • Coverage of input data by hypotheses
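
The coverage of input data by hypotheses can be sketched as follows. The definition of "covered" used here (a row satisfies both the antecedent and the succedent of at least one hypothesis) is an assumption, and the attributes and hypotheses are hypothetical.

```python
def uncovered_rows(rows, hypotheses):
    """Return the rows that no hypothesis covers.

    Assumption: a hypothesis (antecedent, succedent) covers a row when the
    row satisfies both parts.
    """
    return [
        row
        for row in rows
        if not any(ant(row) and suc(row) for ant, suc in hypotheses)
    ]


# Hypothetical input data and discovered hypotheses.
rows = [
    {"age": "young", "smoker": "no", "risk": "low"},
    {"age": "old", "smoker": "yes", "risk": "high"},
    {"age": "old", "smoker": "no", "risk": "medium"},
    {"age": "young", "smoker": "yes", "risk": "medium"},
]
hypotheses = [
    (lambda r: r["age"] == "old" and r["smoker"] == "yes",
     lambda r: r["risk"] == "high"),
    (lambda r: r["age"] == "young" and r["smoker"] == "no",
     lambda r: r["risk"] == "low"),
]

rest = uncovered_rows(rows, hypotheses)
coverage = 1 - len(rest) / len(rows)
print(coverage, rest)
```

The uncovered rows are a natural input for a follow-up task that searches for associations in exactly that part of the data.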

Heuristic Rules (1) • Examples: • IF several extreme/missing values are found THEN search for associations with extreme/missing values • IF 0 hypotheses are found THEN set weaker quantifier values (p, Base) • IF a subset of the input data is not covered by hypotheses THEN search for associations covering these data
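
The rule "IF 0 hypotheses are found THEN set weaker quantifier values (p, Base)" could be automated roughly as below. This is a sketch under stated assumptions: `run_task` is a hypothetical stand-in for a mining task, and the step sizes and lower limits are illustrative, not values from the presentation.

```python
def relax_until_found(run_task, p, base, p_step=0.05, base_step=5,
                      p_min=0.5, base_min=1):
    """Rerun a task with weaker (p, Base) until it returns hypotheses.

    Stops at the lower limits even if nothing was found.
    """
    while True:
        hypotheses = run_task(p, base)
        if hypotheses or (p <= p_min and base <= base_min):
            return hypotheses, p, base
        # Rounding keeps repeated subtraction from accumulating float error.
        p = max(p_min, round(p - p_step, 10))
        base = max(base_min, base - base_step)


def fake_task(p, base):
    """Hypothetical task: yields a hypothesis only for weak enough settings."""
    return ["A => B"] if p <= 0.8 and base <= 20 else []


found, final_p, final_base = relax_until_found(fake_task, p=0.95, base=30)
print(found, final_p, final_base)
```

Relaxing both parameters at once is only one possible design; a real heuristic might relax p first and Base second, or consult the knowledge base for settings that worked on similar data.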

Heuristic Rules (2) • Examples: • IF a column of the input data matrix is nominal AND there is no associated discretization table THEN each value becomes one category (attribute creation) • Use the “subset” coefficient type for nominal attributes
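
The first rule above can be sketched in a few lines. The function name, the `mapping` argument (standing in for an associated discretization table), and the sample column are hypothetical.

```python
def categorize_nominal(column, mapping=None):
    """Return a category label for every value of a nominal column.

    Missing values (None) and values absent from the mapping stay None.
    """
    if mapping is None:
        # No discretization table: each distinct value is its own category.
        mapping = {value: value for value in set(column) if value is not None}
    return [mapping.get(value) for value in column]


column = ["red", "blue", "red", None]

# No table available: one category per value.
plain = categorize_nominal(column)

# A table is available: values are grouped as the table says.
grouped = categorize_nominal(column, mapping={"red": "warm", "blue": "cold"})

print(plain)
print(grouped)
```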

Metabase, Knowledge Base • Metadata (Knowledge): • Results of Previous X-Miner Tasks • Domain Knowledge • Interaction with User (learning?)

GUHA-80 vs. X-Miner (1) • Task parameters (partial cedents, …) • SW, HW • Experience with LM applications, …

GUHA-80 vs. X-Miner (2) • More complex heuristics

EverMiner – Features • Based on LISp-Miner (X-Miners) • Agenda of jobs, priority/strings • Heuristics • Interaction with user • Allows the process to be repeated on new data (“check” vs. new KDD process)

EverMiner – where we are • Experience (medicine, traffic, shares, sociology, …) • Heuristics collection (www, brainstorming) • Co-operation with data preparation experts (FEL, SumatraTT) • Testing “strings of jobs” (learning)

Discussion
