Understanding of data using Computational Intelligence methods. Włodzisław Duch, Dept. of Informatics, Nicholas Copernicus University, Toruń, Poland. http://www.phys.uni.torun.pl/~duch
Computational Intelligence. Theory: neural networks, decision trees, similarity-based methods, data mining & understanding. Applications: psychometry, medical diagnosis support, hematology project, Bayer Diagnostics. • Bioinformatics: Children’s Medical Research Foundation, Cincinnati, Ohio, USA (J. Meller, R. Adamczak, L. Itert). • Cognitive Science. Brain, behavior and psychology: from neurodynamics to mind in psychological spaces; cognitive toys.
Plans for today: • Data and CI • What we hope for. • Forms of understanding. • Visualization. • Prototypes. • Logical rules. • Some knowledge discovered. • Expert system for psychometry. • Conclusions, or why am I saying this?
Types of Data • Data was precious! Now it is overwhelming ... • Statistical data – clean, numerical, controlled experiments, vector space model. • Relational data – marketing, finances. • Textual data – Web, NLP, search. • Complex structures – chemistry, economics. • Sequence data – bioinformatics. • Multimedia data – images, video. • Signals – dynamic data, biosignals. • AI data – logical problems, games, behavior …
Computational Intelligence: Data => Knowledge. [Diagram labels: evolutionary algorithms, pattern recognition, multivariate statistics, expert systems, fuzzy logic, machine learning, visualization, neural networks, probabilistic methods, soft computing, Computational Intelligence, Artificial Intelligence.]
CI & AI definition • Computational Intelligence is concerned with solving effectively non-algorithmic problems. This corresponds to all cognitive processes, including low-level ones (perception). • Artificial Intelligence is a part of CI concerned with solving effectively non-algorithmic problems requiring systematic reasoning and symbolic knowledge representation. Roughly this corresponds to high-level cognitive processes.
Turning data into knowledge What should CI methods do? • Provide descriptive and predictive non-parametric models of data. • Classify, approximate, associate, correlate, and complete patterns. • Discover new categories and interesting patterns. • Help to visualize multi-dimensional relationships among data samples. • Make the data understandable in some way. • Facilitate creation of expert systems (ES) and reasoning.
Forms of useful knowledge AI/Machine Learning camp: Neural nets are black boxes. Unacceptable! Symbolic rules forever. But ... knowledge accessible to humans is in: • symbols, • similarity to prototypes, • images, visual representations. What type of explanation is satisfactory? Interesting question for cognitive scientists. Different answers in different fields.
Data understanding • Humans remember examples of each category and refer to such examples – as similarity-based or nearest-neighbors methods do. • Humans create prototypes out of many examples – as Gaussian classifiers, RBF networks, neurofuzzy systems do. • Logical rules are the highest form of summarization of knowledge. Types of explanation: • visualization-based: maps, diagrams, relations ... • exemplar-based: prototypes and similarity; • logic-based: symbols and rules.
Visualization: dendrograms All projections (cuboids) on 2D subspaces are identical, dendrograms do not show the structure. Normal and malignant lymphocytes.
Visualization: 2D projections 3-bit parity + all 5-bit combinations, ex. 11100101. All projections (cuboids) on 2D subspaces are identical, dendrograms do not show any structure.
Visualization: MDS mapping Results of mapping using multidimensional scaling + centers of hierarchical clusters connected. 3-bit parity + all 5-bit combinations.
Visualization: 3D projections Fine Needle Aspirate of Breast Lesions, red=malignant, green=benign. Only age is continuous, other values are binary. A.J. Walker, S.S. Cross, R.F. Harrison, Lancet 1999, 354, 1518-1521
Visualization: MDS mappings Try to preserve all distances in a 2D nonlinear mapping. MDS for large sets: use LVQ + relative mapping.
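To make "preserve all distances" concrete, here is a minimal sketch of a raw stress measure that an MDS mapping tries to minimize. The function name `raw_stress` and the unnormalized sum-of-squares form are assumptions for illustration; practical MDS implementations usually normalize or weight this sum.

```python
import math

def raw_stress(original, mapped):
    """Sum of squared differences between pairwise distances
    in the original space and in the low-dimensional map."""
    total = 0.0
    n = len(original)
    for i in range(n):
        for j in range(i + 1, n):
            d_orig = math.dist(original[i], original[j])
            d_map = math.dist(mapped[i], mapped[j])
            total += (d_orig - d_map) ** 2
    return total

# Three points that already lie in a plane: dropping z loses nothing.
points_3d = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
good_map = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
bad_map = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]  # everything collapsed

print(raw_stress(points_3d, good_map))
print(raw_stress(points_3d, bad_map))
```

A mapping with zero stress reproduces every pairwise distance; the collapsed map is penalized by the full squared distances.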
Prototype-based rules
C-rules (crisp) are a special case of F-rules (fuzzy rules). F-rules are a special case of P-rules (prototype rules). P-rules may be crisp or fuzzy; crisp rules have the form:
IF $P = \arg\min_R D(X,R)$ THEN Class(X) = Class(P)
D(X,R) is a dissimilarity (distance) function, determining decision borders around prototype P. P-rules are easy to interpret!
IF X=You are most similar to P=Superman THEN You are in the Super-league.
IF X=You are most similar to P=Weakling THEN You are in the Failed-league.
“Similar” may involve different features or D(X,P).
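A crisp P-rule is a nearest-prototype decision. The following is a minimal sketch under stated assumptions: the two prototype vectors and the Euclidean distance are made up for illustration, and `p_rule` is a hypothetical helper name, not something from the slides.

```python
import math

def euclidean(x, p):
    return math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, p)))

def p_rule(x, prototypes, distance=euclidean):
    """Return the class of the prototype with the smallest D(X, R)."""
    _, label = min(prototypes, key=lambda pc: distance(x, pc[0]))
    return label

# Hypothetical prototypes: (feature vector, class label).
prototypes = [
    ((9.0, 9.0), "Super-league"),   # P = Superman
    ((1.0, 1.0), "Failed-league"),  # P = Weakling
]

print(p_rule((8.0, 7.0), prototypes))
print(p_rule((2.0, 3.0), prototypes))
```

Swapping the `distance` argument changes the shape of the decision borders without changing the rule itself.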
P-rules Euclidean distance leads to Gaussian fuzzy membership functions + product as T-norm. Manhattan function => $\mu(X;P)=\exp\{-|X-P|\}$. Various distance functions lead to different MF, e.g. data-dependent distance functions for symbolic data.
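The link between distances and membership functions can be checked numerically. This sketch assumes the form exp(-D²) for the Euclidean case and exp(-D) for the Manhattan case, with unit dispersions; the helper names are hypothetical. It shows that the multidimensional membership equals the product (T-norm) of one-dimensional memberships.

```python
import math

def gaussian_mf(x, p):
    """exp(-D^2), with D the Euclidean distance to prototype p."""
    return math.exp(-sum((xi - pi) ** 2 for xi, pi in zip(x, p)))

def manhattan_mf(x, p):
    """exp(-D), with D the Manhattan distance to prototype p."""
    return math.exp(-sum(abs(xi - pi) for xi, pi in zip(x, p)))

def product_of_1d(x, p, mf_1d):
    """Combine one-dimensional memberships with the product T-norm."""
    result = 1.0
    for xi, pi in zip(x, p):
        result *= mf_1d(xi, pi)
    return result

x, p = (1.0, 2.5, 0.0), (0.5, 2.0, 1.0)
gauss_nd = gaussian_mf(x, p)
gauss_prod = product_of_1d(x, p, lambda a, b: math.exp(-(a - b) ** 2))
manh_nd = manhattan_mf(x, p)
manh_prod = product_of_1d(x, p, lambda a, b: math.exp(-abs(a - b)))

print(math.isclose(gauss_nd, gauss_prod), math.isclose(manh_nd, manh_prod))
```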
Crisp P-rules
New distance functions from information theory => interesting MF. Membership functions => new distance functions, with local D(X,R) for each cluster.
Crisp logic rules: use the $L_\infty$ norm: $D_\infty(X,P) = \|X-P\|_\infty = \max_i W_i |X_i-P_i|$; $D_\infty(X,P) = \mathrm{const}$ => rectangular contours.
$L_\infty$ (Chebyshev) distance with threshold $\theta_P$:
IF $D_\infty(X,P) \le \theta_P$ THEN C(X)=C(P)
is equivalent to a conjunctive crisp rule
IF $X_1 \in [P_1-\theta_P/W_1,\, P_1+\theta_P/W_1] \wedge \dots \wedge X_N \in [P_N-\theta_P/W_N,\, P_N+\theta_P/W_N]$ THEN C(X)=C(P)
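The equivalence between a thresholded Chebyshev distance and a conjunction of interval conditions can be verified on a grid. This is a sketch with made-up weights, prototype, and threshold (named `theta` here as an assumption); weights are assumed positive.

```python
def chebyshev(x, p, w):
    """Weighted L-infinity distance: max_i W_i * |X_i - P_i|."""
    return max(wi * abs(xi - pi) for xi, pi, wi in zip(x, p, w))

def distance_rule(x, p, w, theta):
    return chebyshev(x, p, w) <= theta

def interval_rule(x, p, w, theta):
    """Conjunction of per-feature interval conditions."""
    return all(pi - theta / wi <= xi <= pi + theta / wi
               for xi, pi, wi in zip(x, p, w))

# Hypothetical prototype, weights and threshold.
P, W, THETA = (0.0, 0.0), (1.0, 2.0), 1.0

grid = [i * 0.25 for i in range(-6, 7)]  # -1.5 ... 1.5
agree = all(distance_rule((a, b), P, W, THETA) == interval_rule((a, b), P, W, THETA)
            for a in grid for b in grid)
print(agree)
```

The accepted region is the rectangle [-1, 1] x [-0.5, 0.5], which is why such rules produce hyper-rectangular decision regions.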
Decision borders D(P,X)=const and decision borders D(P,X)=D(Q,X). Euclidean distance from 3 prototypes, one per class. Minkowski distance (exponent α=20) from 3 prototypes.
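For reference, a small sketch of the Minkowski distance family, assuming the usual definition with exponent `alpha`. It illustrates why a large exponent such as 20 gives nearly rectangular contours: the distance approaches the Chebyshev (maximum) distance.

```python
def minkowski(x, p, alpha):
    """(sum_i |x_i - p_i|^alpha)^(1/alpha)."""
    return sum(abs(xi - pi) ** alpha for xi, pi in zip(x, p)) ** (1.0 / alpha)

x, p = (3.0, 4.0), (0.0, 0.0)
manhattan = minkowski(x, p, 1)    # |3| + |4|
euclid = minkowski(x, p, 2)       # sqrt(9 + 16)
nearly_max = minkowski(x, p, 20)  # close to max(|3|, |4|)

print(manhattan, euclid, nearly_max)
```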
P-rules for Wine
• L∞ distance (crisp rules): 15 prototypes kept, 5 errors, f2, f8, f10 removed.
• Euclidean distance: 11 prototypes kept, 7 errors.
• Manhattan distance: [?] prototypes kept, 4 errors, f2 removed.
Many other solutions. Prototypes: SV & clusters.
Complex objects Vector space concept is not sufficient for complex objects; a common set of features for such objects may not exist. AI: complex objects, states, subproblems. Evaluate similarity D(Oi,Oj); it is sufficient for classification! Compare Oi, Oj: define a transformation T built from elementary operators Wk, e.g. substring substitutions. Many transformations T connecting a pair of objects Oi and Oj exist. Cost of transformation = sum of Wk costs. Similarity: lowest transformation cost. Bioinformatics: sophisticated similarity functions for sequences. Dynamic programming finds similarities in reasonable time. Use adaptive costs and general framework for SBM methods.
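As an illustration of "lowest transformation cost found by dynamic programming", here is a minimal sketch of a string edit distance. The operator set (substitution, insertion, deletion) and the unit costs are assumptions; the slide's framework allows arbitrary operators Wk with adaptive costs.

```python
def transformation_cost(a, b, sub_cost=1.0, indel_cost=1.0):
    """Cheapest sequence of substitutions/insertions/deletions turning a into b."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,                           # delete ca
                curr[j - 1] + indel_cost,                       # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub_cost),  # match or substitute
            ))
        prev = curr
    return prev[-1]

print(transformation_cost("tacta", "tagta"))
print(transformation_cost("kitten", "sitting"))
```

The table has len(a) x len(b) cells, so the cost is found in time proportional to the product of the lengths rather than by enumerating all transformations.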
Promoters DNA strings, 57 nucleotides, 53 + and 53 - samples, e.g. tactagcaatacgcttgcgttcggtggttaagtatgtataatgcgcgggcttgtcgt. Euclidean distance: symbolic s = a, c, t, g replaced by x = 1, 2, 3, 4. PDF distance: symbolic s = a, c, t, g replaced by p(s|+).
Connection of CI with AI AI/CI division is harmful for science! GOFAI: operators, state transformations and search techniques are basic tools in AI solving problems requiring systematic reasoning. CI methods may provide useful heuristics for AI and define metric relations between states, problems or complex objects. Example: combinatorial productivity in AI systems and FSM. Later: decision tree for complex structures.
Electric circuit example Answering questions in complex domains requires reasoning. Qualitative behavior of an electric circuit: 7 variables, linked by Ohm’s law V=I*R or Kirchhoff’s law Vt=V1+V2. Train a NeuroFuzzy system on Ohm’s and Kirchhoff’s laws. Without solving equations, answer questions of the type: if R2 grows while R1 and Vt are constant, what will happen with the current I and voltages V1, V2? (taken from the PDP book, McClelland, Rumelhart, Hinton)
Electric circuit search AI: create search tree, CI: provide guiding intuition. Any law of the form A=B*C or A=B+C, e.g. V=I*R, has 13 true facts and 14 false facts, and may be learned by an NF system. Geometrical representation: + increasing, - decreasing, 0 constant. Find combinations of Vt, Rt, I, V1, V2, R1, R2 for which all 5 constraints are fulfilled: this holds for 111 cases out of $3^7 = 2187$. Search and check if X can be +, 0, -; laws are not satisfied if F(Vt=0, Rt, I, V1, V2, R1=0, R2=+) = 0.
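The counts on this slide can be reproduced by brute force. The sketch below encodes the qualitative changes +, 0, - as 1, 0, -1 and assumes the five constraints are Vt=V1+V2, Rt=R1+R2, Vt=I*Rt, V1=I*R1 and V2=I*R2 (the slides name only Ohm's and Kirchhoff's laws, so this exact list is an assumption). It enumerates states directly instead of using a trained neurofuzzy function F.

```python
from itertools import product

SIGNS = (1, 0, -1)  # + increasing, 0 constant, - decreasing

def allowed(b, c):
    """Qualitative changes of A compatible with A=B+C or A=B*C (positive quantities)."""
    if b * c < 0:              # opposite trends: A may go either way
        return {1, 0, -1}
    s = b + c
    return {(s > 0) - (s < 0)}

def law_holds(a, b, c):
    return a in allowed(b, c)

true_facts = sum(law_holds(a, b, c) for a, b, c in product(SIGNS, repeat=3))

NAMES = ("Vt", "Rt", "I", "V1", "V2", "R1", "R2")
LAWS = [("Vt", "V1", "V2"), ("Rt", "R1", "R2"),
        ("Vt", "I", "Rt"), ("V1", "I", "R1"), ("V2", "I", "R2")]

def consistent(state):
    return all(law_holds(state[a], state[b], state[c]) for a, b, c in LAWS)

n_consistent = sum(consistent(dict(zip(NAMES, values)))
                   for values in product(SIGNS, repeat=len(NAMES)))

print(true_facts, 27 - true_facts, n_consistent, 3 ** len(NAMES))
```

Under these assumptions the enumeration gives 13 true and 14 false facts per law, and 111 consistent states out of 2187.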
Heuristic search If R2 grows while R1 and Vt are constant, what will happen with the current I and voltages V1, V2? We know that: R2=+, R1=0, Vt=0, V1=?, V2=?, Rt=?, I=? Take V1=+ and check if: F(Vt=0, Rt=?, I=?, V1=+, V2=?, R1=0, R2=+) > 0. Since F() > 0 for all of V1 = +, 0 and -, take a variable that leads to a unique answer, Rt. A single search path solves the problem. Useful also in approximate reasoning where only some conditions are fulfilled.
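The single search path can be imitated with a simple propagation loop. In this sketch the neurofuzzy function F is replaced by a crude stand-in, `locally_consistent`, which only checks that each law can be satisfied on its own for some completion of its unknowns; that stand-in, the list of five laws, and all names are assumptions made for illustration.

```python
from itertools import product

SIGNS = (1, 0, -1)  # +, 0, -
VARIABLES = ("Vt", "Rt", "I", "V1", "V2", "R1", "R2")
LAWS = [("Vt", "V1", "V2"), ("Rt", "R1", "R2"),
        ("Vt", "I", "Rt"), ("V1", "I", "R1"), ("V2", "I", "R2")]

def allowed(b, c):
    """Qualitative changes of A compatible with A=B+C or A=B*C."""
    if b * c < 0:
        return {1, 0, -1}
    s = b + c
    return {(s > 0) - (s < 0)}

def law_ok(law, state):
    """True if some completion of this law's unknowns satisfies it."""
    unknown = [n for n in law if n not in state]
    for values in product(SIGNS, repeat=len(unknown)):
        trial = {**state, **dict(zip(unknown, values))}
        a, b, c = (trial[n] for n in law)
        if a in allowed(b, c):
            return True
    return False

def locally_consistent(state):
    return all(law_ok(law, state) for law in LAWS)

def options(state, var):
    return [s for s in SIGNS if locally_consistent({**state, var: s})]

def solve(known):
    """Repeatedly fix any variable that has exactly one admissible value."""
    state = dict(known)
    progress = True
    while progress:
        progress = False
        for var in VARIABLES:
            if var not in state:
                opts = options(state, var)
                if len(opts) == 1:
                    state[var] = opts[0]
                    progress = True
    return state

question = {"R2": 1, "R1": 0, "Vt": 0}
print(options(question, "V1"))  # V1 alone is not decided by the local checks
print(solve(question))
```

Starting from Rt, which has a unique value, each following variable becomes unique in turn, so no backtracking is needed for this question.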
Logical rules
Crisp logic rules: for continuous x use linguistic variables (predicate functions) $s_k(x) \equiv \mathrm{True}\,[X_k \le x \le X'_k]$, for example:
small(x) = True{x | x < 1}
medium(x) = True{x | x ∈ [1,2]}
large(x) = True{x | x > 2}
Linguistic variables are used in crisp (propositional, Boolean) logic rules:
IF small-height(X) AND has-hat(X) AND has-beard(X) THEN (X is a Brownie) ELSE IF ... ELSE ...
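A minimal sketch of linguistic variables as predicate functions and of one crisp rule built from them. The thresholds follow the small/medium/large example; the remaining ELSE IF branches are elided on the slide, so the sketch returns `None` for them rather than inventing rules.

```python
def small(x):
    return x < 1

def medium(x):
    return 1 <= x <= 2

def large(x):
    return x > 2

def classify(height, has_hat, has_beard):
    """IF small-height AND has-hat AND has-beard THEN Brownie; other branches elided."""
    if small(height) and has_hat and has_beard:
        return "Brownie"
    return None  # the slide's ELSE IF ... branches are not specified

print(classify(0.6, True, True))
print(classify(1.8, True, True))
```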
Crisp logic decisions Crisp logic is based on rectangular membership functions: True/False values jump from 0 to 1. Step functions are used for partitioning of the feature space. Very simple hyper-rectangular decision borders. Severe limitation on the expressive power of crisp logical rules!
DT decision borders Decision trees lead to specific decision borders. SSV tree on Wine data, proline + flavanoids content. Decision tree forests: many decision trees of similar accuracy, but different selectivity and specificity.
Logical rules - advantages Logical rules, if simple enough, are preferable. • Rules may expose limitations of black box solutions. • Only relevant features are used in rules. • Rules may sometimes be more accurate than NN and other CI methods. • Overfitting is easy to control, rules usually have a small number of parameters. • Rules forever!? A logical rule about logical rules is: IF the number of rules is relatively small AND the accuracy is sufficiently high THEN rules may be an optimal choice.
Logical rules - limitations Logical rules are preferred but ... • Only one class is predicted, p(Ci|X,M) = 0 or 1; a black-and-white picture may be inappropriate in many applications. • Discontinuous cost functions allow only non-gradient optimization. • Sets of rules are unstable: a small change in the dataset leads to a large change in the structure of complex sets of rules. • Reliable crisp rules may reject some cases as unclassified. • Interpretation of crisp rules may be misleading. • Fuzzy rules are not so comprehensible.
Rules - choices Simplicity vs. accuracy. Confidence vs. rejection rate. p++ is a hit; p-+ false alarm; p+- is a miss.
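The trade-offs above are usually read off the confusion matrix. A small sketch, assuming counts rather than probabilities and reading p+- as "true +, predicted -" (a miss) and p-+ as "true -, predicted +" (a false alarm); the function names and the example counts are made up.

```python
def sensitivity(hits, misses):
    """Fraction of true + cases that are recognized: p++ / (p++ + p+-)."""
    return hits / (hits + misses)

def specificity(correct_rejections, false_alarms):
    """Fraction of true - cases that are recognized: p-- / (p-- + p-+)."""
    return correct_rejections / (correct_rejections + false_alarms)

def accuracy(hits, misses, correct_rejections, false_alarms):
    total = hits + misses + correct_rejections + false_alarms
    return (hits + correct_rejections) / total

# Hypothetical counts for 200 classified cases.
se = sensitivity(hits=90, misses=10)
sp = specificity(correct_rejections=80, false_alarms=20)
acc = accuracy(hits=90, misses=10, correct_rejections=80, false_alarms=20)
print(se, sp, acc)
```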
Neural networks and rules [Figure: network estimating p(MI|X) ≈ 0.7 for Myocardial Infarction; inputs: Sex, Age, Smoking, ECG: ST Elevation, Pain Intensity, Pain Duration; example input values: -1, 65, 1, 5, 3, 1; input weights and output weights.]
Knowledge from networks Simplify networks: force most weights to 0, quantize remaining parameters, be constructive! • Regularization: mathematical technique improving predictive abilities of the network. • Result: MLP2LN neural networks that are equivalent to logical rules.
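One way to "force most weights to 0 and quantize the rest" is to add a penalty to the training error. The slide does not give a formula, so the form below (a weight-decay term plus a term that vanishes only at -1, 0, +1) and the names `lam1`, `lam2` are assumptions, shown only to make the idea concrete.

```python
def logical_penalty(weights, lam1, lam2):
    """lam1/2 * w^2 pushes weights to 0;
    lam2/2 * w^2 (w-1)^2 (w+1)^2 pushes the survivors to -1, 0 or +1."""
    return sum(
        lam1 / 2.0 * w ** 2
        + lam2 / 2.0 * w ** 2 * (w - 1.0) ** 2 * (w + 1.0) ** 2
        for w in weights
    )

print(logical_penalty([0.0, 1.0, -1.0], lam1=0.0, lam2=1.0))  # quantized weights cost nothing
print(logical_penalty([0.5], lam1=0.0, lam2=2.0))             # in-between weights are penalized
```

Once the remaining weights are 0 or ±1, each hidden unit can be read as a logical condition on its inputs.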
MLP2LN Converts MLP neural networks into a network performing logical operations (LN). [Figure labels: Input layer; Linguistic units: windows, filters; Aggregation: better features; Rule units: threshold logic; Output: one node per class.]
Learning dynamics Decision regions shown every 200 training epochs in x3, x4 coordinates; borders are optimally placed with wide margins.
Neurofuzzy systems Fuzzy: $\mu(x)=0,1$ (no/yes) replaced by a degree $\mu(x)\in[0,1]$. Triangular, trapezoidal, Gaussian ... MF. Feature Space Mapping (FSM) neurofuzzy system. Neural adaptation, estimation of probability density function (PDF) using single hidden layer network (RBF-like) with nodes realizing separable functions, i.e. membership functions in many dimensions.
Heterogeneous systems Homogeneous systems: one type of “building blocks”, same type of decision borders. Ex: neural networks, SVMs, decision trees, kNNs ... Committees combine many models together, but lead to complex models that are difficult to understand. Discovering the simplest class structure and its inductive bias requires heterogeneous adaptive systems (HAS). Ockham's razor: simpler systems are better. HAS examples: NN with many types of neuron transfer functions. k-NN with different distance functions. DT with different types of test criteria.
GhostMiner Philosophy GhostMiner, data mining tools from our lab. http://www.fqspl.com.pl/ghostminer/ • Separate the process of model building and knowledge discovery from model use => GhostMiner Developer & GhostMiner Analyzer. • There is no free lunch: provide different types of tools for knowledge discovery. Decision tree, neural, neurofuzzy, similarity-based, committees. • Provide tools for visualization of data. • Support the process of knowledge discovery/model building and evaluating, organizing it into projects.
Recurrence of breast cancer Data from: Institute of Oncology, University Medical Center, Ljubljana, Yugoslavia. 286 cases, 201 no recurrence (70.3%), 85 recurrence cases (29.7%). Example record: no-recurrence-events, 40-49, premeno, 25-29, 0-2, ?, 2, left, right_low, yes. 9 nominal features: age (9 bins), menopause, tumor-size (12 bins), nodes involved (13 bins), node-caps, degree-malignant (1,2,3), breast, breast quad, radiation.
Recurrence of breast cancer Many systems used, 65-78% accuracy reported. Single rule: IF (nodes-involved ∉ [0,2] ∧ degree-malignant = 3) THEN recurrence, ELSE no-recurrence. 76.2% accuracy, only trivial knowledge in the data: “Highly malignant breast cancer involving many nodes is likely to strike back.”
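The single rule is easy to state as code. This sketch assumes the "nodes involved" feature is stored as a bin label such as "0-2" or "6-8", as in the example record from this dataset; `predicts_recurrence` is a hypothetical helper name. It also computes the majority-class baseline from the class counts given above, against which the 76.2% should be judged.

```python
def predicts_recurrence(inv_nodes_bin, degree_malignant):
    """IF nodes-involved not in [0,2] AND degree-malignant = 3 THEN recurrence."""
    low = int(inv_nodes_bin.split("-")[0])  # lower edge of the bin
    return low > 2 and degree_malignant == 3

# Majority-class baseline: always answer "no recurrence".
baseline_percent = round(100 * 201 / 286, 1)

print(predicts_recurrence("0-2", 2))  # the example record: no recurrence predicted
print(predicts_recurrence("6-8", 3))
print(baseline_percent)
```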
Recurrence - comparison.
Method | 10xCV accuracy (%)
MLP2LN, 1 rule | 76.2
SSV DT, stable rules | 75.7 ± 1.0
k-NN, k=10, Canberra | 74.1 ± 1.2
MLP+backprop. | 73.5 ± 9.4 (Zarndt)
CART DT | 71.4 ± 5.0 (Zarndt)
FSM, Gaussian nodes | 71.7 ± 6.8
Naive Bayes | 69.3 ± 10.0 (Zarndt)
Other decision trees | < 70.0
Breast cancer diagnosis. Data from University of Wisconsin Hospital, Madison, collected by Dr. W.H. Wolberg. 699 cases, 9 cell features quantized from 1 to 10: clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses. Task: distinguish benign from malignant cases.
Breast cancer rules. Simplest rule from MLP2LN, large regularization: IF uniformity of cell size < 3 THEN benign ELSE malignant. Sensitivity=0.97, Specificity=0.85. More complex solutions (3 rules) give in 10CV: Sensitivity=0.95, Specificity=0.96, Accuracy=0.96
Breast cancer comparison.
Method | 10xCV accuracy (%)
k-NN, k=3, Manh | 97.0 ± 2.1 (GM)
FSM, neurofuzzy | 96.9 ± 1.4 (GM)
Fisher LDA | 96.8
MLP+backprop. | 96.7 (Ster, Dobnikar)
LVQ | 96.6 (Ster, Dobnikar)
IncNet (neural) | 96.4 ± 2.1 (GM)
Naive Bayes | 96.4
SSV DT, 3 crisp rules | 96.0 ± 2.9 (GM)
LDA (linear discriminant) | 96.0
Various decision trees | 93.5-95.6
SSV HAS Wisconsin Heterogeneous decision tree that searches not only for logical rules but also for prototype-based rules. A single P-rule gives the simplest known description of this data: IF ||X-R303|| < 20.27 THEN malignant ELSE benign; 18 errors, acc=97.3%, Se=97.9%, Sp=96.9%. Good prototype for malignant! Simple thresholds, that’s what MDs like the most! Best L1O accuracy 98.3% (FSM), best 10CV around 97.5% (Naïve Bayes + kernel, SVM). C4.5 gives 94.7±2.0%. SSV without distances: 96.4±2.1%. Several simple rules of similar accuracy in CV tests exist.
Melanoma skin cancer • Collected in the Outpatient Center of Dermatology in Rzeszów, Poland. • Four types of Melanoma: benign, blue, suspicious, or malignant. • 250 cases, with almost equal class distribution. • Each record in the database has 13 attributes: asymmetry, border, color (6), diversity (5). • TDS (Total Dermatoscopy Score) - single index, linear combination of melanoma spot properties. • Goal: hardware scanner for preliminary diagnosis.
Melanoma results
Method | Rules | Training % | Test %
MLP2LN, crisp rules | 4 | 98.0 all | 100
SSV Tree, crisp rules | 4 | 97.5±0.3 | 100
FSM, rectangular f. | 7 | 95.5±1.0 | 100
knn + prototype selection | 13 | 97.5±0.0 | 100
FSM, Gaussian f. | 15 | 93.7±1.0 | 95±3.6
knn k=1, Manh, 2 features | -- | 97.4±0.3 | 100
LERS, rough rules | 21 | -- | 96.2
Antibiotic activity of pyrimidine compounds. Pyrimidines: which compound has stronger antibiotic activity? Common template, substitutions added at 3 positions, R3, R4 and R5. 27 features taken into account: polarity, size, hydrogen-bond donor or acceptor, pi-donor or acceptor, polarizability, sigma effect. Pairs of chemicals, 54 features, are compared, which one has higher activity? 2788 cases, 5-fold crossvalidation tests.
Antibiotic activity - results. Pyrimidines: which compound has stronger antibiotic activity? Mean Spearman's rank correlation coefficient used: $-1 \le r_s \le +1$
Method | Rank correlation
FSM, 41 Gaussian rules | 0.77±0.03
Golem (ILP) | 0.68
Linear regression | 0.65
CART (decision tree) | 0.50
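For readers unfamiliar with the measure used in this table, here is a minimal sketch of Spearman's rank correlation for data without ties, using the textbook formula r_s = 1 - 6 Σd² / (n(n² - 1)). The activity values are made up; tied values would need averaged ranks, which this sketch does not handle.

```python
def ranks(values):
    """Rank 1 for the smallest value; assumes no ties."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Hypothetical true and predicted activities for five compounds.
true_activity = [0.1, 0.4, 0.5, 0.7, 0.9]
predicted = [0.3, 0.2, 0.8, 0.6, 1.0]

print(spearman(true_activity, predicted))
print(spearman(true_activity, true_activity[::-1]))
```

Only the ordering of the predictions matters, which suits a task that asks which of two compounds is more active.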