
DISSIMILARITIES AND MATCHING BETWEEN SYMBOLIC OBJECTS

Prof. Donato Malerba, Department of Informatics, University of Bari, Italy (malerba@di.uniba.it). ASSO School, Porto, Portugal, May 13-15, 2002.


Presentation Transcript


  1. DISSIMILARITIES AND MATCHING BETWEEN SYMBOLIC OBJECTS Prof. Donato Malerba Department of Informatics, University of Bari, Italy malerba@di.uniba.it ASSO School Porto, Portugal May 13-15, 2002

  2. COMPUTING DISSIMILARITIES: WHY? • Several data analysis techniques are based on quantifying a dissimilarity (or similarity) measure between multivariate data: • Clustering • Discriminant analysis • Visualization-based approaches • Symbolic objects are a kind of multivariate data. • Ex.: $[\text{colour} = \{\text{red}, \text{black}\}] \wedge [\text{weight} \in \{60, 70, 80\}] \wedge [\text{height} \in [1.50, 1.60]]$ • The dissimilarity measures presented here are among those investigated in the SODAS Project.

  3. BOOLEAN SYMBOLIC OBJECTS (BSO’S) A BSO is a conjunction of Boolean elementary events: $[Y_1 = A_1] \wedge [Y_2 = A_2] \wedge \dots \wedge [Y_p = A_p]$, where each variable $Y_i$ takes values in the domain $\mathcal{Y}_i$ and $A_i$ is a subset of $\mathcal{Y}_i$. Let $a$ and $b$ be two BSO’s: $a = [Y_1 = A_1] \wedge [Y_2 = A_2] \wedge \dots \wedge [Y_p = A_p]$ and $b = [Y_1 = B_1] \wedge [Y_2 = B_2] \wedge \dots \wedge [Y_p = B_p]$, where each variable $Y_j$ takes values in $\mathcal{Y}_j$, and $A_j$ and $B_j$ are subsets of $\mathcal{Y}_j$. We are interested in computing the dissimilarity $d(a, b)$.

  4. CONSTRAINED BSO’S Two types of dependencies between variables: • Hierarchical dependence (mother-daughter): a variable $Y_i$ may be inapplicable if another variable $Y_j$ takes its values in a subset $S_j \subseteq \mathcal{Y}_j$. This dependence is expressed as a rule: if $[Y_j = S_j]$ then $[Y_i = \text{NA}]$ • Logical dependence: this case occurs if a subset $S_j \subseteq \mathcal{Y}_j$ of the domain of a variable $Y_j$ is related to a subset $S_i \subseteq \mathcal{Y}_i$ of the domain of a variable $Y_i$ by a rule such as: if $[Y_j = S_j]$ then $[Y_i = S_i]$

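  5. DISSIMILARITY AND SIMILARITY MEASURES Dissimilarity measure: $d: E \times E \rightarrow \mathbb{R}$ such that $d^*_a = d(a, a) \le d(a, b) = d(b, a) < \infty$ for all $a, b \in E$. Similarity measure: $s: E \times E \rightarrow \mathbb{R}$ such that $s^*_a = s(a, a) \ge s(a, b) = s(b, a) \ge 0$ for all $a, b \in E$. Generally, $\forall a \in E$: $d^*_a = d^*$ and $s^*_a = s^*$, and specifically $d^* = 0$ while $s^* = 1$. Dissimilarity measures can be transformed into similarity measures (and vice versa): $d = \varphi(s)$ ($s = \varphi^{-1}(d)$), where $\varphi(s)$ is a strictly decreasing function with $\varphi(1) = 0$ and $\varphi(0) = \infty$.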
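
The transformation between similarities and dissimilarities can be illustrated with one possible choice of the decreasing function. The sketch below assumes $\varphi(s) = -\ln s$, which is strictly decreasing and maps 1 to 0 and 0 to infinity; this particular function and the names `phi` and `phi_inverse` are illustrative assumptions, not taken from the slides.

```python
import math


def phi(s: float) -> float:
    """Map a similarity s in [0, 1] to a dissimilarity in [0, inf].

    Assumed choice: phi(s) = -ln(s), strictly decreasing on (0, 1].
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    return math.inf if s == 0.0 else -math.log(s)


def phi_inverse(d: float) -> float:
    """Map a dissimilarity d >= 0 back to a similarity in [0, 1]."""
    if d < 0.0:
        raise ValueError("dissimilarity must be non-negative")
    return math.exp(-d)
```

Any other strictly decreasing function with the same boundary values would serve equally well; the logarithm is used here only because its inverse is easy to write down.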

  6. DISSIMILARITY AND SIMILARITY MEASURES: PROPERTIES Some properties that a dissimilarity measure $d$ on $E$ may satisfy are: 1. $d(a, b) = 0 \Rightarrow \forall c \in E: d(a, c) = d(b, c)$ (evenness) 2. $d(a, b) = 0 \Rightarrow a = b$ (definiteness) 3. $d(a, b) \le d(a, c) + d(c, b)$ (triangle inequality) 4. $d(a, b) \le \max(d(a, c), d(c, b))$ (ultrametric inequality) 5. $d(a, b) + d(c, d) \le \max(d(a, c) + d(b, d), d(a, d) + d(b, c))$ (Buneman's inequality) 6. Let $(E, +)$ be a group, then $d(a, b) = d(a + c, b + c)$ (translation invariance) • A dissimilarity function that satisfies properties 2 and 3 is called a metric. • A dissimilarity function that satisfies only property 3 is called a pseudo-metric or semi-distance.
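
On a finite set of objects, the triangle and ultrametric inequalities can be checked by brute force over all triples. The following is a minimal sketch, assuming the dissimilarities are given as a full square matrix (list of lists); the function names and the three small matrices are hypothetical and exist only for illustration.

```python
from itertools import product


def satisfies_triangle(d, tol=1e-12):
    """Property 3: d(a, b) <= d(a, c) + d(c, b) for all a, b, c."""
    n = len(d)
    return all(d[a][b] <= d[a][c] + d[c][b] + tol
               for a, b, c in product(range(n), repeat=3))


def satisfies_ultrametric(d, tol=1e-12):
    """Property 4: d(a, b) <= max(d(a, c), d(c, b)) for all a, b, c."""
    n = len(d)
    return all(d[a][b] <= max(d[a][c], d[c][b]) + tol
               for a, b, c in product(range(n), repeat=3))


# Hypothetical 3 x 3 dissimilarity matrices, invented for illustration.
chain = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]   # triangle holds, ultrametric fails
tree = [[0, 1, 2], [1, 0, 2], [2, 2, 0]]    # both hold
broken = [[0, 1, 5], [1, 0, 1], [5, 1, 0]]  # triangle fails
```

The check is cubic in the number of objects, which is acceptable for the small matrices typical of an exploratory session but not for large data sets.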

  7. DISSIMILARITY MEASURES BETWEEN BSO’S Author(s) (year) and notation from the SODAS package: • Gowda & Diday (1991): U_1 • Ichino & Yaguchi (1994): U_2, U_3, U_4 • De Carvalho (1994): SO_1, SO_2 • De Carvalho (1996, 1998): SO_3, SO_4, SO_5, C_1. U: only for unconstrained BSO’s. C: only for constrained BSO’s. SO: for both constrained and unconstrained BSO’s.

  8. GOWDA & DIDAY’S DISSIMILARITY MEASURE Gowda & Diday’s dissimilarity measure for two BSO’s $a$ and $b$ (U_1): $D(a, b) = \sum_{j=1}^{p} D(A_j, B_j)$. If $Y_j$ is a continuous variable: $D(A_j, B_j) = D_\pi(A_j, B_j) + D_s(A_j, B_j) + D_c(A_j, B_j)$, while if $Y_j$ is a nominal variable: $D(A_j, B_j) = D_s(A_j, B_j) + D_c(A_j, B_j)$, where the components are defined so that their values are normalized between 0 and 1: • $D_\pi(A_j, B_j)$ due to position, • $D_s(A_j, B_j)$ due to span, • $D_c(A_j, B_j)$ due to content.

  9. GOWDA & DIDAY’S DISSIMILARITY MEASURE Properties: • $D(a, b) = 0 \Rightarrow a = b$ (definiteness property). • No proof is reported for the triangle inequality property.

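  10. ICHINO & YAGUCHI’S DISSIMILARITY MEASURES Ichino & Yaguchi’s dissimilarity measures are based on the Cartesian operators join ($\oplus$) and meet ($\otimes$). For continuous variables, $A_j \oplus B_j$ is the smallest closed interval containing both $A_j$ and $B_j$, and $A_j \otimes B_j = A_j \cap B_j$, while for nominal variables: $A_j \oplus B_j = A_j \cup B_j$ and $A_j \otimes B_j = A_j \cap B_j$. Given a pair of subsets $(A_j, B_j)$ of $\mathcal{Y}_j$, the componentwise dissimilarity $\phi(A_j, B_j)$ is: $\phi(A_j, B_j) = |A_j \oplus B_j| - |A_j \otimes B_j| + \gamma \, (2 |A_j \otimes B_j| - |A_j| - |B_j|)$, where $0 \le \gamma \le 0.5$ and $|A_j|$ is defined depending on the variable type.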
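
The componentwise dissimilarity can be sketched for both variable types. The code below is an illustration under stated assumptions, not the SODAS implementation: nominal values are Python sets, continuous values are closed intervals written as `(low, high)` tuples, $|A_j|$ is taken to be the cardinality of a set or the length of an interval, and the names `join`, `meet`, `size` and `iy_phi` are hypothetical.

```python
def join(a, b):
    """Cartesian join: set union, or smallest interval covering both."""
    if isinstance(a, tuple):
        return (min(a[0], b[0]), max(a[1], b[1]))
    return a | b


def meet(a, b):
    """Cartesian meet: intersection; None stands for an empty interval."""
    if isinstance(a, tuple):
        low, high = max(a[0], b[0]), min(a[1], b[1])
        return (low, high) if low <= high else None
    return a & b


def size(v):
    """Assumed |V|: cardinality of a set or length of an interval."""
    if v is None:
        return 0.0
    if isinstance(v, tuple):
        return float(v[1] - v[0])
    return float(len(v))


def iy_phi(a, b, gamma=0.5):
    """Componentwise dissimilarity with 0 <= gamma <= 0.5."""
    if not 0.0 <= gamma <= 0.5:
        raise ValueError("gamma must lie in [0, 0.5]")
    m = size(meet(a, b))
    return size(join(a, b)) - m + gamma * (2.0 * m - size(a) - size(b))
```

For example, with `gamma=0.5` the sets `{"red", "black"}` and `{"red", "white"}` give 3 - 1 + 0.5 * (2 - 2 - 2) = 1.0, and the intervals `(24, 34)` and `(30, 40)` give 16 - 4 + 0.5 * (8 - 10 - 10) = 6.0.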

  11. ICHINO & YAGUCHI’S DISSIMILARITY MEASURES The componentwise dissimilarities $\phi(A_j, B_j)$ are aggregated by an aggregation function such as the generalised Minkowski distance of order $q$ (U_2): $d_q(a, b) = \left( \sum_{j=1}^{p} \phi(A_j, B_j)^q \right)^{1/q}$. Drawback: dependence on the chosen units of measurement. Solution: normalization of the componentwise dissimilarity (U_3). The weighted formulation (U_4) guarantees that $d_q(a, b) \in [0, 1]$. The above measures are metrics.
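
The aggregation step can be sketched as a plain Minkowski sum over the componentwise values. In the sketch below, dividing each component by the size of the variable's domain and using weights that sum to one are assumptions about how the normalised and weighted variants are built; `minkowski` and `normalise` are hypothetical helper names.

```python
def minkowski(components, q=2, weights=None):
    """Generalised Minkowski aggregation of order q >= 1."""
    if q < 1:
        raise ValueError("q must be at least 1")
    if weights is None:
        weights = [1.0] * len(components)
    if len(weights) != len(components):
        raise ValueError("one weight per component is required")
    total = sum(w * c ** q for w, c in zip(weights, components))
    return total ** (1.0 / q)


def normalise(components, domain_sizes):
    """Assumed normalisation: divide each component by its domain size."""
    return [c / s for c, s in zip(components, domain_sizes)]


# Hypothetical componentwise values for two variables and their domain sizes.
raw = [1.0, 6.0]
psi = normalise(raw, [3.0, 16.0])
weighted = minkowski(psi, q=2, weights=[0.5, 0.5])
```

Because each normalised component lies in [0, 1] and the weights sum to one, the weighted result also stays in [0, 1], which is the property the slide attributes to the weighted formulation.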

  12. DE CARVALHO’S DISSIMILARITY MEASURES A straightforward extension of similarity measures for classical data matrices with nominal variables, where $\mu(V_j)$ is either the cardinality of the set $V_j$ (if $Y_j$ is a nominal variable) or the length of the interval $V_j$ (if $Y_j$ is a continuous variable).

  13. DE CARVALHO’S DISSIMILARITY MEASURES Five different similarity measures $s_i$, $i = 1, \dots, 5$, are defined. The corresponding dissimilarities are $d_i = 1 - s_i$. The $d_i$ are aggregated by an aggregation function AF such as the generalised Minkowski metric, thus obtaining SO_1.

  14. DE CARVALHO’S EXTENSION OF ICHINO & YAGUCHI’S DISSIMILARITY MEASURE A different componentwise dissimilarity measure $\psi'$ is used, where $\phi$ is defined as in Ichino & Yaguchi’s dissimilarity measure. Aggregating it with the function AF suggested by De Carvalho yields SO_2. This measure is a metric.

  15. THE DESCRIPTION-POTENTIAL APPROACH All dissimilarity measures considered so far are defined by two functions: a comparison function (componentwise measure) and an aggregation function. A different approach is based on the concept of description potential $\pi(a)$ of a symbolic object $a$: $\pi(a) = \prod_{j=1}^{p} \mu(A_j)$, where $\mu(V_j)$ is either the cardinality of the set $V_j$ (if $Y_j$ is a nominal variable) or the length of the interval $V_j$ (if $Y_j$ is a continuous variable).

  16. THE DESCRIPTION-POTENTIAL APPROACH SO_3 SO_4 SO_5 The triangular inequality does not hold for SO_3 and SO_4, which are equivalent. SO_5 is a metric.

  17. DESCRIPTION POTENTIAL FOR CONSTRAINED BSO’S Given a BSO $a$ and a logical dependence expressed by the rule: if $[Y_j = S_j]$ then $[Y_i = S_i]$, the incoherent restriction $a'$ of $a$ is defined as: $a' = [Y_1 = A_1] \wedge \dots \wedge [Y_{j-1} = A_{j-1}] \wedge [Y_j = A_j \cap S_j] \wedge \dots \wedge [Y_{i-1} = A_{i-1}] \wedge [Y_i = A_i \cap (\mathcal{Y}_i \setminus S_i)] \wedge \dots \wedge [Y_p = A_p]$. The description potential of $a$ is then corrected using $a'$. A similar extension exists for hierarchical dependencies.

  18. DISSIMILARITY MEASURES FOR CONSTRAINED BSO’S • The extended definition of description potential can be applied to the computation of the distances SO_3, SO_4 and SO_5. • De Carvalho proposed an extension of the componentwise measure $\psi'$, so that SO_2 can also be applied to constrained BSO’s. • He also proposed an extension of $\alpha$, $\beta$, $\gamma$, and $\delta$ in order to take constraints into account. Therefore, SO_1 can also be applied to constrained BSO’s. • Finally, C_1 is defined only for constrained BSO’s. If all BSO’s are coherent, then the dissimilarity measures do not change.

  19. MATCHING • Matching is the process of comparing two or more structures to discover their similarities or differences. • Similarity judgements in the matching process are directional. They have a • referent, $a$: a prototype or the description of a class of objects • subject, $b$: a variant of the prototype or an instance of a class of objects. • Matching two structures is a problem common to many domains, such as symbolic classification, pattern recognition, data mining and expert systems.

  20. MATCHING BSO’S • Generally, a BSO represents a class description and plays the role of the referent in the matching process. $a: [\text{color} = \{\text{black}, \text{white}\}] \wedge [\text{height} = [170, 200]]$ describes a set of individuals that are either black or white and whose height is in the interval $[170, 200]$. Such a set of individuals is called the extension of the BSO. The extension is a subset of the universe $\Omega$ of individuals. Given two BSO’s $a$ and $b$, the matching operators determine whether $b$ is the description of an individual in the extension of $a$. • In the SODAS software two matching operators for BSO’s have been defined.

  21. CANONICAL MATCHING OPERATOR • The result of the canonical matching operator is either 0 (false) or 1 (true). • If $E$ denotes the space of BSO’s described by a set of $p$ variables $Y_i$ taking values in the corresponding domains $\mathcal{Y}_i$, then the matching operator is a function $\text{Match}: E \times E \rightarrow \{0, 1\}$ such that for any two BSO’s $a, b \in E$: $a = [Y_1 = A_1] \wedge [Y_2 = A_2] \wedge \dots \wedge [Y_p = A_p]$ and $b = [Y_1 = B_1] \wedge [Y_2 = B_2] \wedge \dots \wedge [Y_p = B_p]$, it holds that: • $\text{Match}(a, b) = 1$ if $B_i \subseteq A_i$ for each $i = 1, 2, \dots, p$, • $\text{Match}(a, b) = 0$ otherwise.

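  22. CANONICAL MATCHING OPERATOR Examples: $\text{District1} = [\text{profession} = \{\text{farmer}, \text{driver}\}] \wedge [\text{age} = [24, 34]]$, $\text{Indiv1} = [\text{profession} = \text{farmer}] \wedge [\text{age} = 28]$, $\text{Indiv2} = [\text{profession} = \text{salesman}] \wedge [\text{age} = [27, 28]]$. $\text{Match}(\text{District1}, \text{Indiv1}) = 1$ and $\text{Match}(\text{District1}, \text{Indiv2}) = 0$.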
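
The canonical matching test reduces to a containment check on every variable. The sketch below reproduces the District1 examples under a few representation assumptions that are not in the slides: a BSO is a Python dict, nominal values are sets, intervals are `(low, high)` tuples, and a single numeric value such as age 28 is written as the degenerate interval `(28, 28)`.

```python
def contains(a_i, b_i):
    """True if B_i is a subset of A_i (sets) or lies inside A_i (intervals)."""
    if isinstance(a_i, tuple):
        return a_i[0] <= b_i[0] and b_i[1] <= a_i[1]
    return b_i <= a_i


def match(a, b):
    """Canonical matching: 1 if B_i is contained in A_i for every variable."""
    return int(all(contains(a[var], b[var]) for var in a))


district1 = {"profession": {"farmer", "driver"}, "age": (24, 34)}
indiv1 = {"profession": {"farmer"}, "age": (28, 28)}
indiv2 = {"profession": {"salesman"}, "age": (27, 28)}

print(match(district1, indiv1))  # referent District1, subject Indiv1
print(match(district1, indiv2))  # referent District1, subject Indiv2
print(match(indiv1, district1))  # roles swapped
```

Swapping referent and subject changes the outcome, since `{"farmer", "driver"}` is not contained in `{"farmer"}`.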

  23. CANONICAL MATCHING OPERATOR • The canonical matching function satisfies two out of three properties of a similarity measure: • $\forall a, b \in E: \text{Match}(a, b) \ge 0$ • $\forall a, b \in E: \text{Match}(a, a) \ge \text{Match}(a, b)$, while it does not satisfy the commutativity or symmetry property: • $\forall a, b \in E: \text{Match}(a, b) = \text{Match}(b, a)$, because of the different roles played by $a$ and $b$.

  24. FLEXIBLE MATCHING OPERATOR • The requirement $B_i \subseteq A_i$ for each $i = 1, 2, \dots, p$ might be too strict for real-world problems, because of the presence of noise in the description of the individuals of the universe. • Example: $\text{District1} = [\text{profession} = \{\text{farmer}, \text{driver}\}] \wedge [\text{age} = [24, 34]]$, $\text{Indiv3} = [\text{profession} = \text{farmer}] \wedge [\text{age} = 23]$, $\text{Match}(\text{District1}, \text{Indiv3}) = 0$ • It is necessary to rely on a flexible definition of the matching operator, which returns a number in $[0, 1]$ corresponding to the degree of match between two BSO’s, that is $\text{flexible-matching}: E \times E \rightarrow [0, 1]$.

  25. FLEXIBLE MATCHING OPERATOR For any two BSO’s $a$ and $b$: i) $\text{flexible-matching}(a, b) = 1$ if $\text{Match}(a, b) = \text{true}$, ii) $\text{flexible-matching}(a, b) \in [0, 1)$ otherwise. The result of the flexible matching can be interpreted as the probability of $a$ matching $b$ provided that a change is made in $b$. Let $E_a = \{b' \in E \mid \text{Match}(a, b') = 1\}$ and let $P(b \mid b')$ be the conditional probability of observing $b$ given that the original observation was $b'$. Then $\text{flexible-matching}(a, b) = \max_{b' \in E_a} P(b \mid b')$, that is, flexible-matching(a, b) equals the maximum conditional probability over the space of BSO’s canonically matched by $a$.
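
The maximisation over $E_a$ can be illustrated on nominal variables with a simple probability model. Everything about the model below is an assumption made for the sketch: variables are independent, the subject $b$ is an individual with one value per variable, and the probability of observing a value different from the original one is $(n - 1)/n$ for a domain of $n$ equally likely values (and 1 when the value is unchanged). Under independence, the maximum over $E_a$ factorises into a product of per-variable maxima.

```python
from math import prod


def value_prob(observed, original, domain_size):
    """Assumed P(observed | original) for one nominal variable."""
    if observed == original:
        return 1.0
    return (domain_size - 1) / domain_size


def flexible_matching(a, b, domains):
    """Maximum of P(b | b') over the individuals b' canonically matched by a.

    a: dict variable -> non-empty set of admissible values (referent)
    b: dict variable -> single observed value (subject)
    domains: dict variable -> set of all possible values
    """
    return prod(
        max(value_prob(b[var], original, len(domains[var])) for original in a[var])
        for var in a
    )


# Hypothetical domains and objects, invented for illustration.
domains = {"profession": {"farmer", "driver", "salesman"},
           "colour": {"black", "white"}}
referent = {"profession": {"farmer", "driver"}, "colour": {"black"}}
exact = {"profession": "farmer", "colour": "black"}
one_off = {"profession": "salesman", "colour": "black"}
two_off = {"profession": "salesman", "colour": "white"}
```

With this model the result is 1 whenever the subject is canonically matched and strictly below 1 otherwise, in line with conditions i) and ii); a different noise model would change the values but not the structure of the computation.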

  26. FLEXIBLE MATCHING: AN APPLICATION • Credit card applications (Quinlan) • Fifteen variables whose names and values have been changed to meaningless symbols to protect the confidentiality of the data. • Class variable: positive in case of approval of credit facilities, negative otherwise. • Training set: 490 cases • 6 rules generated by Quinlan’s system C4.5

  27. FLEXIBLE MATCHING: AN APPLICATION • Such rules can be easily represented by means of Boolean symbolic objects. • Both matching operators can be considered in order to test the validity of the induced rules.

  28. REFERENCES • Esposito F., Malerba D., V. Tamma, H.-H. Bock. Classical resemblance measures. Chapter 8.1 • Esposito F., Malerba D., V. Tamma. Dissimilarity measures for symbolic objects. Chapter 8.3 • Esposito F., Malerba D., F.A. Lisi. Matching symbolic objects. Chapter 8.4 in H.-H. Bock, E. Diday (eds.): Analysis of Symbolic Data. Exploratory methods for extracting statistical information from complex data. Springer Verlag, Heidelberg, 2000. • D. Malerba, L. Sanarico, & V. Tamma (2000). A comparison of dissimilarity measures for Boolean symbolic data. In P. Brito, J. Costa, & D. Malerba (Eds.), Proc. of the ECML 2000 Workshop on “Dealing with Structured Data in Machine Learning and Statistics”, Barcelona. • D. Malerba, F. Esposito, V. Gioviale, & V. Tamma. Comparing Dissimilarity Measures in Symbolic Data Analysis. Pre-Proceedings of EKT-NTTS, vol. 1, pp. 473-481.

  29. METHOD DI • Both dissimilarity measures and matching operators between BSO’s are available in the method DI (Distance matrix) of the SODAS software. • Input: SODAS file of BSO’s • Output for dissimilarities: report + SODAS file with distance matrix • Output for matching operators: report • Developer: Dipartimento di Informatica, University of Bari, Italy.

  30. TWO USE CASE DIAGRAMS • User, report output: create a SODAS chaining with the DI method; set up parameters of the DI method; run the DI method and generate a report file; view report file. • User, dissimilarity matrix output: create a SODAS chaining with the DI method; set up parameters of the DI method; run the DI method and generate a new SODAS file with a dissimilarity matrix; create a new chaining with the new SODAS file.

  31. PARAMETER SETUP • The user can select a subset of variables $Y_i$ on which the dissimilarity measure or the matching operator has to be computed.

  32. PARAMETER SETUP • The user can select a number of parameters, including the dissimilarity measure or matching operator and the name of the new SODAS file.

  33. OUTPUT SODAS FILE • The output SODAS file contains both the same input data and an additional dissimilarity matrix. The dissimilarity between the i-th and the j-th BSO is written in the cell (entry) (i, j) of the matrix. • Only the lower part of the dissimilarity matrix is reported in the file, since dissimilarities are symmetric. TRIANGE_MATRIX = ( (0) , (0.20531, 0) , (12.8626, 15.0793, 0) , (14.0338, 15.0403, 8.64626, 0) , (14.1655, 15.1651, 11.4512, 6.90074, 0) , …)
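
Since only the lower triangle is stored, a consumer of the file has to mirror it to obtain the full symmetric matrix. The sketch below assumes the rows have already been parsed into Python lists (parsing the SODAS file format itself is not shown) and uses the five rows from the excerpt above; `full_matrix` is a hypothetical helper name.

```python
def full_matrix(lower_rows):
    """Expand a lower-triangular list of rows into a full symmetric matrix."""
    n = len(lower_rows)
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(lower_rows):
        if len(row) != i + 1:
            raise ValueError(f"row {i} should have {i + 1} entries")
        for j, value in enumerate(row):
            # Mirror each entry across the diagonal.
            full[i][j] = full[j][i] = float(value)
    return full


lower = [
    [0],
    [0.20531, 0],
    [12.8626, 15.0793, 0],
    [14.0338, 15.0403, 8.64626, 0],
    [14.1655, 15.1651, 11.4512, 6.90074, 0],
]
matrix = full_matrix(lower)
```

After expansion, `matrix[i][j]` and `matrix[j][i]` hold the same dissimilarity, so downstream code such as a clustering routine can index the matrix in either order.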

  34. OUTPUT REPORT FILE The report file is organized as follows: Page 1 SODAS 07/26/01 Sodas The Statistical Package for Symbolic Data Analysis Version 1.0 - 05 January 2001 **************D I S T A N C E M E A S U R E S*********** Data Information:

  35. OUTPUT REPORT FILE
      Input Sodas File: C:\ASSO_L~1\SPERIM~3\NUOVAS~1\ABALO.SDS
      28 Boolean Symbolic Objects (BSOs) read.
      8 Variables selected for each BSO: 1 -- 8
      Selected Distance Function: U_1 Gowda & Diday
      Distance Matrix
      BSO   1        2        3        4
      1     0
      2     0.2053   0
      3     12.86    15.08    0
      4     14.03    15.04    8.646    0
      5     14.17    15.17    11.45    6.901
      -------------------------------------------------------------------------
      Page 2 SODAS 07/26/01

  36. PARAMETER SETUP • Only one parameter has to be specified for matching BSO’s. • The Class BSO represents the BSO that will be used as referent (default: 2nd BSO). • In the case of canonical matching the subject can be any BSO. • In the case of flexible matching the subject must be a BSO representing an individual.

  37. OUTPUT REPORT FILE The report file is organized as follows:
      Page 1 SODAS 12/31/99
      Sodas The Statistical Package for Symbolic Data Analysis
      *********** C A N O N I C A L M A T C H I N G **********
      Data Information
      29 Boolean Symbolic Objects (BSOs) read.
      1 : Boolean Symbolic Object (BSO) selected as class.
      Matching Vector
      BSO   1
      1     Match
      2     No Match
      ...
      29    No Match
      This procedure was completed at 17:57:33

  38. FURTHER DEVELOPMENTS IN THE ASSO PROJECT Two modules, DISS and MATCH. Both modules will be upgraded to work with both BSO’s and Probabilistic Symbolic Objects (PSO’s). Formally, a PSO $a$ is made up of $M$ probabilistic elementary events (PEE’s) $a_i$. Each PEE $a_i$ is attached to an observed quantitative variable $Y_i$, which takes $k_i$ values $c^a_{ij}$ with probabilities $p^a_{ij}$ (which sum to 1).
