
Tutorial 1

Tutorial 1. General Introduction to SDA. Yin-Jing Tien ( 田銀錦 ) Institute of Statistical Science Academia Sinica gary@stat.sinica.edu.tw June 13, 2014. Symbolic data Analysis (SDA) ( Diday 1987). Text: Billard and Diday (2006):


Presentation Transcript


  1. Tutorial 1: General Introduction to SDA
     Yin-Jing Tien (田銀錦), Institute of Statistical Science, Academia Sinica
     gary@stat.sinica.edu.tw
     June 13, 2014

  2. Symbolic Data Analysis (SDA) (Diday 1987)
     Texts:
     • Billard and Diday (2006): Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley.
     • Diday, E., Noirhomme-Fraiture, M. (2008): Symbolic Data Analysis and the SODAS Software. John Wiley & Sons Ltd., Chichester, England.

  3. Symbolic data (Diday 1987)
     • Classical data: individuals, each described by a single value
       • Single player: age = 25, eye color = blue
     • Symbolic data: symbolic units (concepts: groups)
       • Team
       • Interval: age range = [20, 36]
       • Multiple values: eye color = {blue, brown, black}

  4. Symbolic data analysis: when?
     • When we are interested in higher-level units (concepts: groups/classes).
     • When the initial data are composed of symbolic data tables.
     • When the data set is big.

  5. Symbolic data types

  6. Symbolic data types (quantitative)
     • Multi-valued: a symbolic random variable Y that takes one or more values, e.g. {12, 23, 20}
     • Interval-valued: a symbolic random variable Y that takes values in an interval, e.g. [17, 25]
     • Modal multi-valued: each value carries a weight, e.g. {0.5, 3/8; 1.5, 4/8; 2, 1/8}
     • Modal interval-valued (histogram): each subinterval carries a weight, e.g. {[12, 40), 1/7; [40, 60), 2/7; [60, 80], 4/7}

  7. Symbolic data types (qualitative)
     • Multi-valued: a symbolic random variable Y that takes one or more values, e.g. bird colors, Y = color
     • Modal multi-valued: each category carries a weight, e.g. {single, 3/8; married, 5/8}

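  8. Basic Descriptive Statistics: Interval Value
     Let $Z_i = (I_{1i}, I_{2i}, \ldots, I_{ki})^T$ be the interval data for the $i$th variable with $k$ concepts, where $I_{ci} = [a_{ci}, b_{ci}]$, $c = 1, 2, \ldots, k$.
     • Sample mean of $Z_i$: [formula not preserved in the transcript]
     • Sample variance of $Z_i$: [formula not preserved in the transcript]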

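  9. Basic Descriptive Statistics: Interval Value
     The sample variance can be rewritten as
     Total Variation = Within Variation + Between Variation
     • Within Variation: [formula not preserved in the transcript]
     • Between Variation: [formula not preserved in the transcript]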
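The formulas for slides 8 and 9 are not preserved in the transcript. The sketch below assumes the standard interval-valued forms given in Billard and Diday (2006): mean $\bar{Z} = \frac{1}{2k}\sum_c (a_c + b_c)$ and variance $S^2 = \frac{1}{3k}\sum_c (a_c^2 + a_c b_c + b_c^2) - \bar{Z}^2$. The within/between split used here (interval width squared over 12, plus squared deviation of midpoints) follows algebraically from that variance, but it is an assumption that the slide used exactly this form. The three intervals are hypothetical illustration values.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (a, b) with a <= b


def interval_mean(data: List[Interval]) -> float:
    """Assumed form: (1 / 2k) * sum(a_c + b_c)."""
    return sum(a + b for a, b in data) / (2 * len(data))


def interval_variance(data: List[Interval]) -> float:
    """Assumed form: (1 / 3k) * sum(a^2 + a*b + b^2) - mean^2."""
    k = len(data)
    second_moment = sum(a * a + a * b + b * b for a, b in data) / (3 * k)
    return second_moment - interval_mean(data) ** 2


def within_variation(data: List[Interval]) -> float:
    """Average internal spread of each interval: (b - a)^2 / 12."""
    return sum((b - a) ** 2 for a, b in data) / (12 * len(data))


def between_variation(data: List[Interval]) -> float:
    """Spread of the interval midpoints around the overall mean."""
    m = interval_mean(data)
    return sum(((a + b) / 2 - m) ** 2 for a, b in data) / len(data)


# Hypothetical interval data for one variable and k = 3 concepts.
ages = [(20.0, 36.0), (17.0, 25.0), (12.0, 40.0)]

mean = interval_mean(ages)
total = interval_variance(ages)
within = within_variation(ages)
between = between_variation(ages)
print(mean, total, within + between)
```

The last two printed numbers agree up to rounding, which is the Total = Within + Between identity in numerical form.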

  10. Similarity between Variables (interval-valued data) (Billard and Diday (2006))
     • The empirical covariance function between $Z_i$ and $Z_j$: [formula not preserved in the transcript]
     • The empirical correlation coefficient between $Z_i$ and $Z_j$: [formula not preserved in the transcript]

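  11. Distance between concepts
     • Definition 7.6: The Cartesian join $A \oplus B$ between two sets $A$ and $B$ is their componentwise union, $A \oplus B = (A_1 \oplus B_1, \ldots, A_p \oplus B_p)$.
     • Definition 7.7: The Cartesian meet $A \otimes B$ between two sets $A$ and $B$ is their componentwise intersection, $A \otimes B = (A_1 \otimes B_1, \ldots, A_p \otimes B_p)$.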
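As a minimal sketch of Definitions 7.6 and 7.7, the code below applies join and meet component by component. It assumes that a multi-valued component is a `frozenset` and an interval component is an `(a, b)` tuple, and that the join of two intervals is the smallest interval covering both (the convention in Billard and Diday (2006)). The function names and the two example "teams" are hypothetical.

```python
from typing import FrozenSet, Optional, Tuple, Union

Interval = Tuple[float, float]
Component = Union[FrozenSet[str], Interval]


def join(a: Component, b: Component) -> Component:
    """Join of one component: set union, or the covering interval."""
    if isinstance(a, frozenset):
        return a | b
    return (min(a[0], b[0]), max(a[1], b[1]))


def meet(a: Component, b: Component) -> Optional[Component]:
    """Meet of one component: set intersection, or the overlapping interval.

    Returns None for intervals that do not overlap (empty meet).
    """
    if isinstance(a, frozenset):
        return a & b
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


def cartesian_join(A, B):
    """Componentwise join of two symbolic descriptions."""
    return tuple(join(a, b) for a, b in zip(A, B))


def cartesian_meet(A, B):
    """Componentwise meet of two symbolic descriptions."""
    return tuple(meet(a, b) for a, b in zip(A, B))


# Hypothetical descriptions: (eye colors, age range).
team_a = (frozenset({"blue", "brown"}), (20.0, 36.0))
team_b = (frozenset({"brown", "black"}), (17.0, 25.0))

joined = cartesian_join(team_a, team_b)
met = cartesian_meet(team_a, team_b)
print(joined)
print(met)
```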

  12. Distance between concepts

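  13. Distance between concepts (Multi-valued)
     The Gowda-Diday dissimilarity measure (Gowda and Diday, 1991): $D(\omega_1, \omega_2) = \sum_{j=1}^{p} [D_{1j}(\omega_1, \omega_2) + D_{2j}(\omega_1, \omega_2)]$, where $A_j$ and $B_j$ are the values of $Y_j$ for $\omega_1$ and $\omega_2$, and
     • $D_{1j}(\omega_1, \omega_2) = \big| |A_j| - |B_j| \big| / |A_j \oplus B_j|$ (relative sizes)
     • $D_{2j}(\omega_1, \omega_2) = (|A_j| + |B_j| - 2 |A_j \otimes B_j|) / |A_j \oplus B_j|$ (relative content)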

  14. Distance between concepts (Multi-valued)
     Example: Color and Habitat of Birds (Table 7.2), with $Y_1$ = Color, $Y_2$ = Habitat and $p = 2$.
     • For $Y_1$: $D_{11}(\omega_1, \omega_2) = |2 - 1|/2 = 1/2$ and $D_{21}(\omega_1, \omega_2) = (2 + 1 - 2 \cdot 1)/2 = 1/2$
     • For $Y_2$: $D_{12}(\omega_1, \omega_2) = |2 - 1|/2 = 1/2$ and $D_{22}(\omega_1, \omega_2) = (2 + 1 - 2 \cdot 1)/2 = 1/2$
     • The Gowda-Diday dissimilarity: $D(\omega_1, \omega_2) = (1/2 + 1/2) + (1/2 + 1/2) = 2$
     • Normalized (adjusted for scale), with weights 3 and 2: $D(\omega_1, \omega_2) = (1/2 + 1/2)/3 + (1/2 + 1/2)/2 = 5/6$
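The arithmetic of this example can be checked with a short sketch. The set contents below are hypothetical stand-ins (Table 7.2 is not reproduced in the transcript); they are chosen only so that the sizes match the slide: two values for $\omega_1$, one for $\omega_2$, one value in common. The function names are likewise illustrative.

```python
from fractions import Fraction
from typing import FrozenSet, Optional, Sequence


def gowda_diday_component(a: FrozenSet[str], b: FrozenSet[str]) -> Fraction:
    """D_1j + D_2j for one multi-valued variable (a and b not both empty)."""
    join_size = len(a | b)
    meet_size = len(a & b)
    relative_size = Fraction(abs(len(a) - len(b)), join_size)
    relative_content = Fraction(len(a) + len(b) - 2 * meet_size, join_size)
    return relative_size + relative_content


def gowda_diday(
    obs1: Sequence[FrozenSet[str]],
    obs2: Sequence[FrozenSet[str]],
    weights: Optional[Sequence[int]] = None,
) -> Fraction:
    """Sum of per-variable components, each optionally divided by a weight."""
    total = Fraction(0)
    for j, (a, b) in enumerate(zip(obs1, obs2)):
        w = weights[j] if weights is not None else 1
        total += gowda_diday_component(a, b) / w
    return total


# Hypothetical values for (Y1 = color, Y2 = habitat).
omega1 = (frozenset({"red", "black"}), frozenset({"urban", "rural"}))
omega2 = (frozenset({"red"}), frozenset({"rural"}))

d_plain = gowda_diday(omega1, omega2)
d_normalized = gowda_diday(omega1, omega2, weights=[3, 2])
print(d_plain, d_normalized)  # 2 5/6
```

Using `Fraction` keeps the result exact, so the output can be compared directly with the 2 and 5/6 on the slide.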

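  15. Distance between concepts (Multi-valued)
     The Ichino-Yaguchi dissimilarity measure (Ichino and Yaguchi, 1994):
     $\phi_j(\omega_1, \omega_2) = |A_j \oplus B_j| - |A_j \otimes B_j| + \gamma (2 |A_j \otimes B_j| - |A_j| - |B_j|)$, where $A_j$ and $B_j$ are the values of $Y_j$ for $\omega_1$ and $\omega_2$.
     • For $Y_1$: $\phi_1(\omega_1, \omega_2) = 2 - 1 + \gamma (2 \cdot 1 - 2 - 1) = 1 - \gamma$
     • For $Y_2$: $\phi_2(\omega_1, \omega_2) = 2 - 1 + \gamma (2 \cdot 1 - 2 - 1) = 1 - \gamma$
     • Taking $\gamma = 0.5$ gives $\phi_1 = \phi_2 = 0.5$.
     • Unweighted Minkowski distance: $D_q(\omega_1, \omega_2) = (0.5^q + 0.5^q)^{1/q}$
     • Weighted Minkowski distance: $D_q(\omega_1, \omega_2) = ((0.5/3)^q + (0.5/2)^q)^{1/q}$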
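A minimal sketch of the same calculation, assuming the Ichino-Yaguchi form $|A \oplus B| - |A \otimes B| + \gamma(2|A \otimes B| - |A| - |B|)$ that the worked numbers imply. The set contents are hypothetical (only their sizes match the slide), and the `scales` argument is an illustrative name for the per-variable weights 3 and 2.

```python
from typing import FrozenSet, Optional, Sequence


def ichino_yaguchi(a: FrozenSet[str], b: FrozenSet[str], gamma: float = 0.5) -> float:
    """Ichino-Yaguchi dissimilarity for one multi-valued variable."""
    join_size = len(a | b)
    meet_size = len(a & b)
    return join_size - meet_size + gamma * (2 * meet_size - len(a) - len(b))


def minkowski(
    phis: Sequence[float], q: float, scales: Optional[Sequence[float]] = None
) -> float:
    """(sum_j (phi_j / scale_j)^q)^(1/q); scales default to 1 (unweighted)."""
    if scales is None:
        scales = [1.0] * len(phis)
    return sum((p / s) ** q for p, s in zip(phis, scales)) ** (1.0 / q)


# Hypothetical values for (Y1 = color, Y2 = habitat).
omega1 = (frozenset({"red", "black"}), frozenset({"urban", "rural"}))
omega2 = (frozenset({"red"}), frozenset({"rural"}))

phis = [ichino_yaguchi(a, b, gamma=0.5) for a, b in zip(omega1, omega2)]
city_block = minkowski(phis, q=1)
euclidean = minkowski(phis, q=2)
weighted_city_block = minkowski(phis, q=1, scales=[3, 2])
print(phis, city_block, euclidean, weighted_city_block)
```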

  16. Distance between concepts (Interval-valued)
     Let $Z_i = (I_{1i}, I_{2i}, \ldots, I_{ki})^T$ be the interval data for the $i$th variable with $k$ concepts, where $I_{ci} = [a_{ci}, b_{ci}]$, $c = 1, 2, \ldots, k$.
     The Gowda-Diday dissimilarity measure (Gowda and Diday, 1991): $D(\omega_1, \omega_2) = \sum_j D_j(\omega_1, \omega_2)$, where $D_j(\omega_1, \omega_2)$ for the variable $Y_j$ is the sum of three components:
     • relative length
     • relative content
     • relative position
     The components are built from:
     • the length of the entire distance spanned by $\omega_1$ and $\omega_2$
     • the length of the intersection if the intervals overlap, 0 otherwise
     • the total length covered by the observed values of $Y_j$

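  17. Distance between concepts (Interval-valued)
     The Ichino-Yaguchi dissimilarity measure (Ichino and Yaguchi, 1994):
     $\phi_j(\omega_1, \omega_2) = |A_j \oplus B_j| - |A_j \otimes B_j| + \gamma (2 |A_j \otimes B_j| - |A_j| - |B_j|)$, where, for intervals $A_j = [a_{1j}, b_{1j}]$ and $B_j = [a_{2j}, b_{2j}]$,
     • $A_j \oplus B_j = [\min(a_{1j}, a_{2j}), \max(b_{1j}, b_{2j})]$
     • $A_j \otimes B_j = [\max(a_{1j}, a_{2j}), \min(b_{1j}, b_{2j})]$ (empty if there is no intersection)
     The generalized Minkowski distance of order $q \ge 1$ between two interval-valued observations $\xi(\omega_1)$ and $\xi(\omega_2)$ is
     $d_q(\omega_1, \omega_2) = \big( \sum_j w_j [\phi_j(\omega_1, \omega_2)]^q \big)^{1/q}$,
     where $\phi_j(\omega_1, \omega_2)$ is the Ichino-Yaguchi distance and $w_j$ is a weight function associated with variable $Y_j$.
     • When $q = 1$: City Block distance
     • When $q = 2$: Euclidean distance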

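  18. Distance between concepts (Interval-valued)
     The Hausdorff distance (Chavent and Lechevallier, 2002):
     $\phi_j(\omega_1, \omega_2) = \max(|a_{1j} - a_{2j}|, |b_{1j} - b_{2j}|)$, with $d(\omega_1, \omega_2) = \sum_j \phi_j(\omega_1, \omega_2)$.
     • The Euclidean Hausdorff distance: $d(\omega_1, \omega_2) = \big( \sum_j [\phi_j(\omega_1, \omega_2)]^2 \big)^{1/2}$, where $\phi_j(\omega_1, \omega_2)$ is the Hausdorff distance.
     • The normalized Euclidean Hausdorff distance: [formula not preserved in the transcript]
     • The span-normalized Euclidean Hausdorff distance: $d(\omega_1, \omega_2) = \big( \sum_j [\phi_j(\omega_1, \omega_2) / |Y_j|]^2 \big)^{1/2}$, where the span $|Y_j| = \max_c b_{cj} - \min_c a_{cj}$.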

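  19. Distance between concepts (Interval-valued)
     Example: take only the first 3 observations of the veterinary data.
     Gowda-Diday dissimilarity $D(\omega_1, \omega_2)$:
     • ($Y_1$) the relative-position term is $|120 - 158|/65$; the other terms are not preserved in the transcript
     • ($Y_2$) [terms not preserved in the transcript]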
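The transcript keeps only one term of this example, so the sketch below rebuilds the calculation under stated assumptions. The three observations are the veterinary values as recalled from Billard and Diday (2006); they are not in the transcript, though they reproduce the 120, 158, 65 and 237.8 that do appear. The component formulas (relative length, content, position) follow the verbal definitions on the earlier slide and should be read as an assumed form.

```python
from typing import List, Tuple

Interval = Tuple[float, float]

# Assumed data: first three veterinary observations, (Y1, Y2) per observation.
observations: List[Tuple[Interval, Interval]] = [
    ((120.0, 180.0), (222.2, 354.0)),  # omega_1
    ((158.0, 160.0), (322.0, 355.0)),  # omega_2
    ((175.0, 185.0), (117.2, 152.0)),  # omega_3
]


def span(j: int) -> float:
    """Total length covered by the observed values of variable j."""
    return max(obs[j][1] for obs in observations) - min(obs[j][0] for obs in observations)


def gowda_diday_interval(x: Interval, y: Interval, total_span: float) -> float:
    """Relative length + relative content + relative position (assumed form)."""
    (a1, b1), (a2, b2) = x, y
    spanned = max(b1, b2) - min(a1, a2)          # entire distance spanned
    overlap = max(0.0, min(b1, b2) - max(a1, a2))  # 0 if no intersection
    relative_length = abs((b1 - a1) - (b2 - a2)) / spanned
    relative_content = ((b1 - a1) + (b2 - a2) - 2 * overlap) / spanned
    relative_position = abs(a1 - a2) / total_span
    return relative_length + relative_content + relative_position


omega1, omega2 = observations[0], observations[1]
d_y1 = gowda_diday_interval(omega1[0], omega2[0], span(0))
d_y2 = gowda_diday_interval(omega1[1], omega2[1], span(1))
d_total = d_y1 + d_y2
print(span(0), span(1), d_y1, d_y2, d_total)
```

With these values the relative-position term for $Y_1$ is $|120 - 158|/65$, matching the fragment that survives on the slide.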

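  20. Distance between concepts (Interval-valued)
     The Ichino-Yaguchi dissimilarity:
     $\phi_j(\omega_1, \omega_2) = |A_j \oplus B_j| - |A_j \otimes B_j| + \gamma (2 |A_j \otimes B_j| - |A_j| - |B_j|)$, with $A_j \oplus B_j = [\min(a_{1j}, a_{2j}), \max(b_{1j}, b_{2j})]$ and $A_j \otimes B_j = [\max(a_{1j}, a_{2j}), \min(b_{1j}, b_{2j})]$ (empty if there is no intersection).
     • $\phi_1(\omega_1, \omega_2) = |180 - 120| - \ldots = 58 + \gamma(-58)$
     • $\phi_2(\omega_1, \omega_2) = |355 - 222.2| - \ldots = 100.8 + \gamma(\ldots)$
     The generalized Minkowski distance $d_q(\omega_1, \omega_2)$:
     • When $q = 1$: City Block distance
     • When $q = 2$: Euclidean distance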
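A hedged sketch of this example follows. The interval endpoints for $\omega_1$ and $\omega_2$ are the veterinary values as recalled from Billard and Diday (2006), an assumption that is consistent with the surviving numbers 58 and 100.8. The Minkowski form $(\sum_j w_j \phi_j^q)^{1/q}$ with unit weights is also assumed, since the slide formula is not preserved.

```python
from typing import Optional, Sequence, Tuple

Interval = Tuple[float, float]


def ichino_yaguchi_interval(x: Interval, y: Interval, gamma: float = 0.5) -> float:
    """|join| - |meet| + gamma * (2|meet| - |x| - |y|) using interval lengths."""
    (a1, b1), (a2, b2) = x, y
    join_len = max(b1, b2) - min(a1, a2)
    meet_len = max(0.0, min(b1, b2) - max(a1, a2))  # empty meet has length 0
    return join_len - meet_len + gamma * (2 * meet_len - (b1 - a1) - (b2 - a2))


def generalized_minkowski(
    phis: Sequence[float], q: float, weights: Optional[Sequence[float]] = None
) -> float:
    """Assumed form: (sum_j w_j * phi_j^q)^(1/q), unit weights by default."""
    if weights is None:
        weights = [1.0] * len(phis)
    return sum(w * p ** q for w, p in zip(weights, phis)) ** (1.0 / q)


# Assumed data: (Y1, Y2) for the first two veterinary observations.
omega1 = ((120.0, 180.0), (222.2, 354.0))
omega2 = ((158.0, 160.0), (322.0, 355.0))

# gamma = 0 isolates the leading constants 58 and 100.8 seen on the slide.
phi_gamma0 = [ichino_yaguchi_interval(x, y, gamma=0.0) for x, y in zip(omega1, omega2)]
phi_half = [ichino_yaguchi_interval(x, y, gamma=0.5) for x, y in zip(omega1, omega2)]

city_block = generalized_minkowski(phi_half, q=1)
euclidean = generalized_minkowski(phi_half, q=2)
print(phi_gamma0, phi_half, city_block, euclidean)
```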

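  21. Distance between concepts (Interval-valued)
     The Hausdorff distance:
     • $\phi_1(\omega_1, \omega_2) = 38$ and $\phi_2(\omega_1, \omega_2) = 99.8$, so $d(\omega_1, \omega_2) = 38 + 99.8 = 137.8$
     • The Euclidean Hausdorff distance $d(\omega_1, \omega_2)$: [value not preserved in the transcript]
     • The normalized Euclidean Hausdorff distance $d(\omega_1, \omega_2)$: [calculation not preserved in the transcript; the surviving fragment is 288.78]
     • The span-normalized Euclidean Hausdorff distance $d(\omega_1, \omega_2)$ uses the spans $|Y_1| = 185 - 120 = 65$ and $|Y_2| = 355 - 117.2 = 237.8$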
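The surviving numbers (38, 99.8, 137.8, 65, 237.8) can be reproduced with a short sketch. The interval endpoints are the veterinary values as recalled from Billard and Diday (2006) and are an assumption, as are the function names. The normalized variant is left out because its normalizing constant is not preserved in the transcript.

```python
from typing import List, Tuple

Interval = Tuple[float, float]

# Assumed data: first three veterinary observations, (Y1, Y2) per observation.
observations: List[Tuple[Interval, Interval]] = [
    ((120.0, 180.0), (222.2, 354.0)),  # omega_1
    ((158.0, 160.0), (322.0, 355.0)),  # omega_2
    ((175.0, 185.0), (117.2, 152.0)),  # omega_3
]


def hausdorff(x: Interval, y: Interval) -> float:
    """Hausdorff distance between two intervals: max endpoint difference."""
    return max(abs(x[0] - y[0]), abs(x[1] - y[1]))


def span(j: int) -> float:
    """max upper bound minus min lower bound of variable j over all observations."""
    return max(obs[j][1] for obs in observations) - min(obs[j][0] for obs in observations)


omega1, omega2 = observations[0], observations[1]
phis = [hausdorff(x, y) for x, y in zip(omega1, omega2)]

d_sum = sum(phis)
d_euclidean = sum(p ** 2 for p in phis) ** 0.5
d_span = sum((p / span(j)) ** 2 for j, p in enumerate(phis)) ** 0.5
print(phis, d_sum, d_euclidean, d_span)
```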

  22. Distance between concepts (groups) of interval-valued data

  23. Comparison of between-concept distance measures

  24. Interval-valued symbolic data analysis
     • Books (Bock and Diday (2000); Billard and Diday (2003, 2006); Diday and Noirhomme-Fraiture (2008))
     • PCA (Chouakria, Cazes, and Diday (2000); Palumbo and Lauro (2003); Gioia and Lauro (2006); Hamada, Minami, and Mizuta (2008))
     • Clustering analysis (Brito (2002); Souza and de Carvalho (2004); Chavent et al. (2006); Bock (2008))
     • Discriminant analysis (Lauro, Verde, and Palumbo (2000); Duarte Silva and Brito (2006))
     • MDS (Groenen et al. (2006); Minami and Mizuta (2008))
     • Regression (Billard and Diday (2000); de Carvalho et al. (2004))

  25. Visualization Tools for Symbolic Data (Analysis)

  26. Symbolic Data Analysis Software
     • SODAS (2003): free, from 2 European consortia
     • SYR (2008): more professional, from the SYROKKO company, www.syrokko.com
