Self-Organised Data Mining – 20 Years after GUHA-80. Martin Kejkula, KEG, 8th April 2004, http://gama.vse.cz/keg/
Agenda • Idea of Self-Organised Data Mining • GUHA-80 revival • Process of Self-Organised Data Mining • Key factors for Self-Organised Data Mining • Metabase, Knowledge Base, etc. • Proposed EverMiner system for Self-Organised Data Mining
Introduction • Motivation: support X-Miner users • Collection of best practices and known problems • Mueller, Lemke: Self-Organising Data Mining (2000) • My thesis: • Design and test strings of jobs for EverMiner • Formalize and use heuristics
References (1) • Hájek, P. – Havránek, T.: GUHA 80: An Application of Artificial Intelligence to Data Analysis. Computers and Artificial Intelligence, Vol. 1, 1982, pp. 107-134 • Hájek, P. – Ivánek, J.: Artificial Intelligence and Data Analysis. Proc. COMPSTAT’82, Wien, Physica Verlag 1982, pp. 54-60
References (2) • Hájek, P. – Havránek, T.: GUHA-80 – An Application of Artificial Intelligence to Data Analysis. Matematické středisko biologických ústavů ČSAV, Praha, 1982 • Jirků, P. – Havránek, T.: On Verbosity Levels in Cognitive Problem Solvers. Proc. Computational Linguistics, 1982, http://acl.eldoc.ub.rug.nl/mirror/C/C82/
References (3) • Rauch, J.: EverMiner – studie projektu [EverMiner: a project study]. Dokumentace projektu LISp-Miner [LISp-Miner project documentation], 2003. • Mueller, J.-A. – Lemke, F.: Self-Organising Data Mining. Extracting Knowledge from Data. Dresden, Berlin, 2000.
GUHA-80: Main Features • Application of artificial intelligence to exploratory data analysis • Generates interesting views of given empirical data (recognizes interesting logical patterns) • Views should be relevant and useful
GUHA-80 Sources (1) • GUHA: automatically generate all interesting hypotheses • Lenat’s AM: jobs (tasks), agenda of jobs, hundreds of heuristic rules, concepts
GUHA-80 Sources (2) • GUHA-80 vs. Lenat’s AM • Data • Data-processing procedures • Statistical program packages • Effective modules
GUHA-80 Paradigm • Open-ended data analysis • Aim: maximize the interestingness value • Hundreds of heuristic rules • Rules guide how the next step is defined and studied • Access potentially relevant rules, find the truly relevant rules, follow the truly relevant rules
Interestingness in GUHA-80 • No explicit definition • Determined by the interplay of • Heuristic rules • Weighting mechanisms • Testing in practice (adequate behaviour?) • No algorithm, but constraints
Principles of GUHA-80 • Domain dependence (…exploratory data analysis) • Combine human capabilities with those of the machine • More than one heuristic may be relevant • Interactivity with the user • Non-routine (GUHA-80 is not for everyday data processing)
GUHA-80 Structure (2) • Input empirical data • Input parameters • How “interestingness” is to be understood • Effective modules (system’s knowledge) • Clustering procedures • GUHA procedures • Agenda of jobs (priority/weight)
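The agenda of jobs with priorities/weights can be pictured as a priority queue. The sketch below is only an illustration: the class name `Agenda`, the job names and the numeric weights are hypothetical and are not taken from GUHA-80 itself. It assumes that a higher weight means the job is run earlier and that jobs with equal weight run in insertion order.

```python
import heapq
import itertools


class Agenda:
    """Hypothetical agenda of jobs; the job with the highest weight is popped first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: equal weights keep insertion order

    def add(self, job_name, weight):
        # heapq is a min-heap, so the weight is negated to pop the heaviest job first.
        heapq.heappush(self._heap, (-weight, next(self._counter), job_name))

    def pop(self):
        neg_weight, _, job_name = heapq.heappop(self._heap)
        return job_name, -neg_weight

    def __len__(self):
        return len(self._heap)


# Illustrative jobs and weights (assumptions, not GUHA-80 values).
agenda = Agenda()
agenda.add("cluster quantities", 0.4)
agenda.add("run GUHA procedure", 0.9)
agenda.add("compute basic statistics", 0.9)

order = [agenda.pop()[0] for _ in range(len(agenda))]
```

In a fuller system the heuristics would add new jobs or change weights while the agenda is being processed; the queue above only shows the ordering mechanism.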
GUHA-80 Structure (3) • Heuristics: the best way to carry out a job • Changing system of concepts • Hierarchy of concepts (applicability) • Possible unification of heuristics, jobs, …
GUHA-80 Input • Data • Input information • Decompositions/orderings of sets of quantities • Helps to understand “interestingness”
GUHA-80 Effective modules • Evaluation of the usual statistical characteristics, … • Complicated procedures • Synthesis of parameters (“job on job”)
GUHA-80 • Hundreds of heuristic rules • No explicit definition of interestingness (exploration in a space) • Interactivity with the user • Non-routine character
Process of S-O Data Mining • Input: empirical data, domain knowledge, … • Chains of data and knowledge processing tasks: DataSource, TimeTransf, SumatraTT, 4ft, KL, CF, … • Output: all interesting views, patterns
Key Factors of S-O Data Mining • Data Preparation • Modeling • Evaluation • Knowledge Base • Domain Knowledge
Data Preparation • Discretization • Dependent on attribute type: nominal/ordinal/interval/ratio • Dependent on the type of coefficient • Discretization-Modeling Cycle (KL, 4ft, CF, …) • Known problem: intervals of categories without values • Usually there is not a single target attribute
Attribute-type-dependent discretization • Nominal • Classes of values • Ordinal • Extreme/missing values • Type of coefficient • Usually there is not a single target attribute
Intervals of Categories without Values • Solution: • Statistics – extreme values • 4ft Task: correlations, implications • Potentially interesting patterns
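Intervals without values typically appear when a few extreme values stretch the range of an equal-width discretization. The sketch below shows one way to detect them. It assumes equal-width (equidistant) intervals; the function names and the sample data are hypothetical and do not come from LISp-Miner.

```python
def equal_width_bins(values, n_bins):
    """Split the range of `values` into n_bins equal-width intervals and count members."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; equal-width binning is undefined")
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value would land in bin n_bins, so it is clamped into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


def empty_intervals(edges, counts):
    """Return the (lower, upper) bounds of every interval that holds no value."""
    return [(edges[i], edges[i + 1]) for i, c in enumerate(counts) if c == 0]


# Hypothetical column: one extreme value (95) stretches the range.
ages = [21, 23, 24, 26, 29, 30, 95]
edges, counts = equal_width_bins(ages, 5)
gaps = empty_intervals(edges, counts)
```

Here six values fall into the first interval, one into the last, and the three intervals in between stay empty. This is the situation in which looking at the extreme values first, as the slide suggests, is worthwhile.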
Extreme/Missing Values • 4ft: Find associations between extreme/missing values (implications/correlations) • CF, KL: Find patterns with extreme/missing values
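A 4ft-style check for an association between missing values in two columns can be sketched as follows. The four-fold table (a, b, c, d) is built from two Boolean "is missing" indicators. The implication test assumes the founded-implication quantifier with parameters p and Base in its usual form, a / (a + b) ≥ p and a ≥ Base. The column names and data are hypothetical.

```python
def four_fold_table(antecedent, succedent):
    """Return counts (a, b, c, d) for two equally long lists of Booleans."""
    a = b = c = d = 0
    for x, y in zip(antecedent, succedent):
        if x and y:
            a += 1      # antecedent and succedent both hold
        elif x:
            b += 1      # antecedent holds, succedent does not
        elif y:
            c += 1      # succedent holds, antecedent does not
        else:
            d += 1      # neither holds
    return a, b, c, d


def founded_implication(a, b, p, base):
    """Assumed quantifier: a / (a + b) >= p and a >= base."""
    return (a + b) > 0 and a >= base and a / (a + b) >= p


# Hypothetical columns; None marks a missing value.
income = [None, 1200, None, None, 900, None, 1500, None]
employer = [None, "A", None, "B", "C", None, "A", None]

income_missing = [v is None for v in income]
employer_missing = [v is None for v in employer]

table = four_fold_table(income_missing, employer_missing)
a, b, c, d = table
holds = founded_implication(a, b, p=0.75, base=3)
```

With this data, four of the five rows with missing income also have a missing employer, so the hypothesis "income missing ⇒ employer missing" passes at p = 0.75, Base = 3.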
Data Preparation • Classes of attributes • Partial cedents • Associations between attributes in one class • Associations between partial cedents
Evaluation-Modeling • Input information for partial cedents • Mining for Interesting Patterns • Exceptions • Missing values • Extreme values • Discovered hypotheses • Groups of hypotheses • Coverage of input data by hypotheses
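Coverage of the input data by discovered hypotheses can be measured in several ways. The sketch below assumes one simple definition: a row is covered when both the antecedent and the succedent of at least one hypothesis hold for it. The representation of hypotheses as Python predicates and the sample rows are hypothetical.

```python
def coverage(rows, hypotheses):
    """Return (share of covered rows, list of uncovered rows).

    Each hypothesis is a predicate row -> bool that is True when the row
    satisfies both the antecedent and the succedent (assumed definition).
    """
    uncovered = [row for row in rows if not any(h(row) for h in hypotheses)]
    share = (len(rows) - len(uncovered)) / len(rows)
    return share, uncovered


# Hypothetical input data matrix.
rows = [
    {"smoker": True, "bmi": "high", "pressure": "high"},
    {"smoker": True, "bmi": "normal", "pressure": "high"},
    {"smoker": False, "bmi": "high", "pressure": "high"},
    {"smoker": False, "bmi": "normal", "pressure": "normal"},
]

# Hypothetical discovered hypotheses: smoker => high pressure, high BMI => high pressure.
hypotheses = [
    lambda r: r["smoker"] and r["pressure"] == "high",
    lambda r: r["bmi"] == "high" and r["pressure"] == "high",
]

share, uncovered = coverage(rows, hypotheses)
```

The list of uncovered rows is what a follow-up job would receive if the system decides to search for associations covering the remaining data.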
Heuristic Rules (1) • Examples: • IF more extreme/missing values are found THEN search for associations with extreme/missing values • IF 0 hypotheses are found THEN set weaker quantifier parameter values (p, Base) • IF a subset of the input data is not covered by hypotheses THEN search for associations covering these data
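The second rule, weakening (p, Base) when no hypotheses are found, can be sketched as a retry loop. This is an illustration under stated assumptions: the quantifier is taken to be founded implication (a / (a + b) ≥ p and a ≥ Base), candidate rules are given directly as (a, b) counts, and the step sizes and lower limits are hypothetical tuning values, not part of any X-Miner procedure.

```python
def mine(tables, p, base):
    """Return names of candidate rules whose (a, b) counts pass the assumed quantifier."""
    # base >= 1 is assumed, so a >= base also rules out a + b == 0.
    return [name for name, (a, b) in tables.items()
            if a >= base and a / (a + b) >= p]


def mine_with_relaxation(tables, p, base, p_step=0.05, base_step=5,
                         min_p=0.5, min_base=1):
    """IF 0 hypotheses are found THEN weaken (p, Base) and try again."""
    while True:
        found = mine(tables, p, base)
        if found or (p <= min_p and base <= min_base):
            return found, p, base
        # round() keeps p free of accumulated floating-point drift.
        p = max(min_p, round(p - p_step, 10))
        base = max(min_base, base - base_step)


# Hypothetical candidates: rule name -> (a, b) from its four-fold table.
tables = {"A => B": (42, 8), "C => D": (12, 1)}

found, final_p, final_base = mine_with_relaxation(tables, p=0.95, base=50)
```

Starting from p = 0.95 and Base = 50, nothing is found until the parameters have been weakened three times, to p = 0.8 and Base = 35, at which point "A => B" (confidence 0.84, a = 42) passes. The lower limits stop the loop from relaxing indefinitely.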
Heuristic Rules (2) • Examples: • IF a column of the input data matrix is nominal AND it has no associated discretization table THEN make each value one category (attribute creation) • Use the “subset” coefficient type for nominal attributes
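The first rule above can be written as a small attribute-creation function. The function name, the string type labels and the dictionary form of the discretization table are hypothetical; the sketch only shows the decision the rule describes.

```python
def create_attribute(column, column_type, discretization_table=None):
    """Map raw column values to categories.

    - With a discretization table (value -> category), the table is applied.
    - A nominal column without a table gets one category per distinct value.
    - Any other column without a table is rejected in this sketch.
    """
    if discretization_table is not None:
        return [discretization_table[value] for value in column]
    if column_type == "nominal":
        return [str(value) for value in column]
    raise ValueError("non-nominal column needs a discretization table")


# Hypothetical nominal column with no associated table.
district = ["Praha", "Brno", "Praha", "Ostrava"]
categories = create_attribute(district, "nominal")
distinct = sorted(set(categories))

# Hypothetical ordinal column with an associated table.
grades = create_attribute([1, 2, 1], "ordinal", {1: "low", 2: "high"})
```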
Metabase, Knowledge Base • Metadata (Knowledge): • Results of Previous X-Miner Tasks • Domain Knowledge • Interaction with User (learning?)
GUHA-80 vs. X-Miner (1) • Task parameters (partial cedents, …) • SW, HW • Experience with LM applications, …
GUHA-80 vs. X-Miner (2) • More complex heuristics
EverMiner – Features • Based on LISp-Miner (X-Miners) • Agenda of jobs, priority/strings • Heuristics • Interaction with the user • Allows the process to be repeated on new data (“check” vs. new KDD process)
EverMiner – where we are • Experience (medicine, traffic, shares, sociology, …) • Heuristics collection (www, brainstorming) • Co-operation with data preparation experts (FEL, SumatraTT) • Testing “strings of jobs” (learning)