390 likes | 412 Views
DISSIMILARITIES AND MATCHING BETWEEN SYMBOLIC OBJECTS. Prof. Donato Malerba Department of Informatics, University of Bari, Italy malerba@di.uniba.it ASSO School Porto, Portugal May 13-15, 2002. COMPUTING DISSIMILARITIES: WHY?.
E N D
DISSIMILARITIES AND MATCHING BETWEEN SYMBOLIC OBJECTS Prof. Donato Malerba Department of Informatics, University of Bari, Italy malerba@di.uniba.it ASSO School Porto, Portugal May 13-15, 2002
COMPUTING DISSIMILARITIES: WHY? • Several data analysis techniques are based on quantifying a dissimilarity (or similarity) measure between multivariate data. • Clustering • Discriminant analysis • Visualization-based approaches • Symbolic objects are a kind of multivariate data. • Ex.: [colour={red, black}][weight {60,70,80}][height []1.50,1.60] • The dissimilarity measures presented here are among those investigated in the SODAS Project.
BOOLEAN SYMBOLIC OBJECTS (BSO’S) A BSO is a conjunction of boolean elementary events: [Y1=A1] [Y2=A2] ... [Yp=Ap] where each variableYitakes values in Yiand Aiis a subset of Yi Let a and b be two BSO’s: a = [Y1=A1] [Y2=A2] ... [Yp=Ap] b = [Y1=B1] [Y2=B2] ... [Yp=Bp] where each variable Yjtakes values in Yjand Aj and Bjare subsets of Yj. We are interested to compute the dissimilarity d(a,b).
CONSTRAINED BSO’S Two types of dependencies between variables: • Hierarchical dependence (mother-daughter): A variable Yi may be inapplicable if another variable Yj takes its values in a subset SjYj. This dependence is expressed as a rule: if[Yj= Sj]then[Yi = NA] • Logical dependence: This case occurs, if a subset Sj Yj of a variable Yj is related to a subset Si Yi of a variable Yiby a rule such as: if[Yj = Sj]then[Yi = Si]
DISSIMILARITY AND SIMILARITY MEASURES Dissimilarity Measure d:EER such thatd*a =d(a,a) d(a,b) = d(b,a) <a,bE Similarity Measure s:EE R such that s*a= s(a,a) s(a,b) = s(b,a) 0a,bE Generally: aE: d*a=d*and s*a=s*andspecifically, d* = 0 while s*= 1 Dissimilarity measures can be transformed into similarity measures (and viceversa): d=(s) ( s=-1(d) ) where: (s) strictly decreasing function, and (1) = 0, (0) =
DISSIMILARITY AND SIMILARITY MEASURES: PROPERTIES Some propertiesthat a dissimilarity measure d on E may satisfy are: 1. d(a, b) = 0 c E: d(a, c) = d(b, c) (eveness) 2. d(a, b) = 0 a = b (definiteness) 3. d(a, b) d(a, c) + d(c, b) (triangle inequality) 4. d(a, b) max(d(a, c), d(c, b)) (ultrametric inequality ) 5. d(a, b) + d(c, d) max(d(a, c) + d(b, d), d(a, d) +d(b, c)) (Buneman's inequality) 6. Let (E, +) be a group, then d(a, b) = d(a+c, b+c) (translation invariance) • A dissimilarity function that satisfies proprieties 2 and 3 is called metric. • A dissimilarity function that satisfies only property 3 is called pseudo metric or semi- distance.
DISSIMILARITY MEASURES BETWEEN BSO’S Author(s) (Year) Notation from the SODAS Package • Gowda & Diday (1991) U_1 • Ichino & Yaguchi (1994) U_2, U_3, U_4 • De Carvalho (1994) SO_1, SO_2 • De Carvalho (1996, 1998) SO_3, SO_4, SO_5, C_1 U: only for unconstrained BSO’s C: only for constrained BSO’s SO: for both constrained and unconstrained BSO’s
Aj Bj D Ds Dc GOWDA & DIDAY’S DISSIMILARITY MEASURE Gowda & Diday’s dissimilarity measures for two BSO’s a and b: U_1 D(a, b) = If Yj is a continuous variable: D(Aj, Bj) = D(Aj, Bj) + Ds(Aj, Bj) + Dc(Aj, Bj) while if Yj is a nominal variable: D(Aj, Bj) = Ds(Aj, Bj) + Dc(Aj, Bj) where the components are defined so that their values are normalized between 0 and 1: • D(Aj, Bj) due to position, • Ds(Aj, Bj) due to span, • Dc(Aj, Bj) due to content
GOWDA & DIDAY’S DISSIMILARITY MEASURE Properties: D(a, b) = 0 a = b (definiteness property), No proof is reported for the triangle inequality property
AjBj Aj Bj AjBj ICHINO & YAGUCHI’S DISSIMILARITY MEASURES Ichino & Yaguchi’s dissimilarity measures are based on the Cartesian operators join and meet. For continuous variables: AjBj AjBj while for nominal variables: AjBj = AjBj AjBj = AjBj Given a pair of subsets (Aj, Bj) of Yj the componentwise dissimilarity(Aj,Bj) is: (Aj, Bj) =AjBjAj Bj+(2Aj BjAjBj) where 0 0.5 and Ajis defined depending on variable types.
ICHINO & YAGUCHI’S DISSIMILARITY MEASURES (Aj,Bj)are aggregated by an aggregation function such as the generalised Minkowski’s distance of order q: U_2 Drawback: dependence on the chosen units of measurements. Solution: normalization of the componentwise dissimilarity: U_3 The weighted formulation guarantees that dq(a,b)[0,1]. U_4 The above measures are metrics
DE CARVALHO’S DISSIMILARITY MEASURES A straightforward extension of similarity measures for classical data matrices with nominal variables. where (Vj) is either the cardinality of the set Vj(if Yjis a nominal variable) or the length of the interval Vj(if Yjis a continuous variable).
DE CARVALHO’S DISSIMILARITY MEASURES Five different similarity measures si, i = 1, ..., 5, are defined: The corresponding dissimilarities are di = 1 si. The diare aggregated by an aggregation function AF such as the generalised Minkowski metric, thus obtaining: SO_1
DE CARVALHO’S EXTENSION OF ICHINO & YAGUCHI’S DISSIMILARITY MEASURE A different componentwise dissimilarity measure: where is defined as in Ichino & Yaguchi’s dissimilarity measure. The aggregation function AF suggested by De Carvalho is: SO_2 This measure is a metric.
THE DESCRIPTION-POTENTIAL APPROACH All dissimilarity measures considered so far are defined by two functions: a comparison function (componentwise measure) and an aggregation function. A different approach is based on the concept of description potential(a) of a symbolic object a. where (Vj) is either the cardinality of the set Vj(if Yj is a nominal variable) or the length of the interval Vj (if Yj is a continuous variable).
THE DESCRIPTION-POTENTIAL APPROACH SO_3 SO_4 SO_5 The triangular inequality does not hold for SO_3 and SO_4, which are equivalent. SO_5 is a metric.
DESCRIPTION POTENTIAL FOR CONSTRAINED BSO’S Given a BSO a and a logical dependence expressed by the rule: if[Yj = Sj]then[Yi = Si] the incoherent restrictiona’of a is defined as: a’= [Y1=A1] ... [Yj-1=Aj-1] [Yj=Aj Sj] ... [Yi-1=Ai-1] [Yi=Ai (Yi\Si)] ... [Yp=Ap] Then the description potential of a is: A similar extension exists for hierarchical dependencies.
DISSIMILARITY MEASURES FOR CONSTRAINED BSO’S • The extended definition of description potential can be applied to the computation of the distances SO_3, SO_4 and SO_5. • De Carvalho proposed an extension of ’, so that SO_2 can also be applied to constrained BSO. • He also proposed an extension of , , , and in order to take into account of constraints. Therefore, SO_1 can also be applied to constrained BSO. Finally, C_1 is defined as follows: where: If all BSO’s are coherent, then the dissimilarity measures do not change.
MATCHING • Matching is the process of comparing two or more structures to discover their similarities or differences. • Similarity judgements in the matching process are directional: They have a • referent, a, a prototype or the description of a class of objects • subject, b, a variant of the prototype or an instance of a class of objects. • Matching two structures is a common problem to many domains, like symbolic classification, pattern recognition, data mining and expert systems.
MATCHING BSO’S • Generally, a BSO represents a class description and plays the role of the referent in the matching process. a: [color = {black, white}] [height =[170, 200]] describes a set of individuals either black or white, whose height is in the interval [170,200]. Such a set of individuals is called extension of the BSO. The extension is a subset of the universe of individuals. Given two BSO’s a and b, the matching operators define whether b is the description of an individual in the extension of a. • In the SODAS software two matching operators for BSO’s have been defined.
CANONICAL MATCHING OPERATOR • The result of the canonical matching operator is either 0 (false) or 1 (true). • If E denotes the space of BSO’s described by a set of p variables Yi taking values in the corresponding domains Yi, then the matching operator is a function: Match: E × E {0, 1} such that for any two BSO’s a, b E: a = [Y1=A1] [Y2=A2] ... [Yp=Ap] b = [Y1=B1] [Y2=B2] ... [Yp=Bp] it happens that: • Match(a,b) = 1 if BiAi for each i=1, 2, , p, • Match(a,b) = 0 otherwise.
CANONICAL MATCHING OPERATOR Examples: District1 = [profession={farmer, driver}] [age=[24,34]] Indiv1 = [profession=farmer] [age=28] Indiv2 = [profession=salesman] [age=[27,28]] Match(District1,Indiv1) = 1 Match(District1,Indiv2) = 0
CANONICAL MATCHING OPERATOR • The canonical matching function satisfies two out of three properties of a similarity measure: • a, b E: Match(a, b) 0 • a, b E: Match(a, a) Match(a, b) while it does not satisfy the commutativity or simmetry property: • a, b E: Match(a, b) = Match(b, a) because of the different role played by a and b.
FLEXIBLE MATCHING OPERATOR • The requirement BiAifor each i=1, 2, , p, might be too strict for real-world problems, because of the presence of noise in the description of the individuals of the universe. • Example: District1 = [profession={farmer, driver}] [age=[24,34]] Indiv3 = [profession=farmer] [age=23] Match(District1,Indiv3) = 0 • It is necessary to rely on a flexible definition of matching operator, which returns a number in [0,1] corresponding to the degree of match between two BSO’s, that is flexible-matching: E × E® [0,1]
FLEXIBLE MATCHING OPERATOR For any two BSO’s a and b, i) flexible-matching(a,b)=1 if Match(a,b)=true, ii) flexible-matching(a,b)Î[0,1) otherwise. The result of the flexible matching can be interpreted as the probability of amatching b provided that a change is made in b. Let Ea = {b'E | Match(a,b')=1}and P(b | b') be the conditional probability of observing b given that the original observation was b'. Then that is flexible-matching(a,b) equals the maximum conditional probability over the space of BSO’s canonically matched by a.
FLEXIBLE MATCHING: AN APPLICATION • Credit card applications (Quinlan) • Fifteen variables whose names and values have been changed to meaningless symbols to protect the confidentiality of the data. + • class variable: positive in case of approval of credit facilities, negative otherwise. • Training set: 490 cases • 6 rules generated by Quinlan’s system C4.5
FLEXIBLE MATCHING: AN APPLICATION • Such rules can be easily represented by means of Boolean symbolic objects. • Both matching operators can be considered in order to test the validity of the induced rules.
REFERENCES • Esposito F., Malerba D., V. Tamma, H.-H. Bock. Classical resemblance measures. Chapter 8.1 • Esposito F., Malerba D., V. Tamma. Dissimilarity measures for symbolic objects. Chapter 8.3 • Esposito F., Malerba D., F.A. Lisi. Matching symbolic objects. Chapter 8.4 in H.-H. Bock, E. Diday (eds.): Analysis of Symbolic Data. Exploratory methods for extracting statistical information from complex data. Springer Verlag, Heidelberg, 2000. • D. Malerba, L. Sanarico, & V. Tamma (2000). A comparison of dissimilarity measures for Boolean symbolic data. In P. Brito, J. Costa, & D. Malerba (Eds.), Proc. of the ECML 2000 Workshop on “Dealing with Structured Data in Machine Learning and Statistics”, Barcelona. • D. Malerba, F. Esposito, V. Gioviale, & V. Tamma. Comparing Dissimilarity Measures in Symbolic Data Analysis. Pre-Proceedings of EKT-NTTS, vol. 1, pp. 473-481.
METHOD DI • Both dissimilarity measures and matching operators between BSO’s are available in the method DI (Distance matrix) of the SODAS software. • Input: Sodas file of BSO’s • Output for dissimilarities: Report + Sodas file with distance matrix • Output for matching operators: Report • Developer: Dipartimento di Informatica, University of Bari, Italy. DI method Report file
Create a SODAS chaining with the DI method Create a SODAS chaining with the DI method Set up parameters of the DI method Set up parameters of the DI method User Run the DI method and generate a report file Run the DI method and generate a new SODAS file with a dissimilarity matrix User View report file Create a new chaining with the new SODAS file TWO USE CASE DIAGRAMS
PARAMETER SETUP • The user can select a subset of variables Yi on which the dissimilarity measure or the matching operator has to computed .
PARAMETER SETUP • The user can select a number of parameters. Dissimilarity measure or matching operator Name of the new SODAS file
OUTPUT SODAS FILE • The output SODAS file contains both the same input data and an additional dissimilarity matrix. The dissimilarity between the i-th and the j-th BSO is written in the cell (entry) (i, j) of the matrix. • Only the lower part of the dissimilarity matrix is reported in the file, since dissimilarities are symmetric. TRIANGE_MATRIX = ( (0) , (0.20531, 0) , (12.8626, 15.0793, 0) , (14.0338, 15.0403, 8.64626, 0) , (14.1655, 15.1651, 11.4512, 6.90074, 0) , …)
OUTPUT REPORT FILE The report file is organized as follows: Page 1 SODAS 07/26/01 Sodas The Statistical Package for Symbolic Data Analysis Version 1.0 - 05 January 2001 **************D I S T A N C E M E A S U R E S*********** Data Information:
OUTPUT REPORT FILE Input Sodas File: C:\ASSO_L~1\SPERIM~3\NUOVAS~1\ABALO.SDS 28 Boolean Symbolic Objects (BSOs) read. 8 Variables selected for each BSO: 1 -- 8 Selected Distance Function: U_1 Gowda & Diday Distance Matrix BSO 1 2 3 4 1 0 2 0.2053 0 3 12.86 15.08 0 4 14.03 15.04 8.646 0 5 14.17 15.17 11.45 6.901 ------------------------------------------------------------------------- Page 2 SODAS 07/26/01
PARAMETER SETUP • Only one parameter has to be specified for matching BSO’s. • The Class BSO represents the BSO that will be used as referent (default: 2nd BSO). • In the case of canonical matching the subject can be any BSO. • In the case of flexible matching the subject must be a BSO representing an individual.
OUTPUT REPORT FILE The report file is organized as follows: Page 1 SODAS 12/31/99 Sodas The Statistical Package for Symbolic Data Analysis *********** C A N O N I C A L M A T C H I N G ********** Data Information 29 Boolean Symbolic Objects (BSOs) read. 1 : Boolean Symbolic Object (BSO) selected as class. Matching Vector BSO 1 1 Match 2 No Match ... 29 No Match This procedure was completed at 17:57:33
FURTHER DEVELOPMENTS IN THE ASSO PROJECT Two modules, DISS and MATCH. Both modules will be upgraded to work with both BSO’s and Problabilistic Symbolic Objects (PSO’s). Formally, a PSO, which is made up of M probabilistic elementary events (PEE) ai , is defined as: a = = Each PEE ai is attached to an observed quantitative variables Yiwhich takes kivalues cijawith probability pija (which sum up to 1).