Tutorial 1 General Introduction to SDA Yin-Jing Tien (田銀錦) Institute of Statistical Science Academia Sinica gary@stat.sinica.edu.tw June 13, 2014
Symbolic Data Analysis (SDA) (Diday 1987) Text: Billard and Diday (2006): Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley. Diday, E., Noirhomme-Fraiture, M. (2008): Symbolic Data Analysis and the SODAS Software. John Wiley & Sons Ltd., Chichester, England.
Symbolic data (Diday 1987)
• Classical data: individuals, each described by a single value
• Example: a single player, age = 25, eye color = blue
• Symbolic data: symbolic units (concepts: groups)
• Example: a team
• Interval: age range = [20, 36]
• Multiple values: eye color = {blue, brown, black}
Symbolic data analysis: when?
• When we are interested in higher-level units (concepts: groups/classes).
• When the initial data are composed of symbolic data tables.
• When the data are big.
Symbolic data types (quantitative)
• Multi-valued: the symbolic random variable Y takes one or more values, e.g. {12, 23, 20}
• Interval-valued: Y takes values in an interval, e.g. [17, 25]
• Modal multi-valued: each value carries a weight, e.g. {0.5, 3/8; 1.5, 4/8; 2, 1/8}
• Modal interval-valued (histogram): each subinterval carries a weight, e.g. {[12, 40), 1/7; [40, 60), 2/7; [60, 80], 4/7}
Symbolic data types (qualitative)
• Multi-valued: the symbolic random variable Y takes one or more categorical values, e.g. bird colors, Y = color
• Modal multi-valued: each category carries a weight, e.g. {single, 3/8; married, 5/8}
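The data types above can be illustrated by aggregating classical records into one symbolic description per group. This is a minimal sketch; the helper names (`to_interval`, `to_multi_valued`, `to_modal`) and the player records are hypothetical and chosen only to mirror the examples on the slides.

```python
from collections import Counter
from fractions import Fraction


def to_interval(values):
    """Interval-valued description: [min, max] of the observed values."""
    return (min(values), max(values))


def to_multi_valued(values):
    """Multi-valued description: the set of distinct observed values."""
    return set(values)


def to_modal(values):
    """Modal multi-valued description: each category with its relative frequency."""
    counts = Counter(values)
    n = len(values)
    return {category: Fraction(count, n) for category, count in counts.items()}


# Hypothetical classical records for the members of one team (illustrative only).
ages = [20, 25, 31, 36]
eye_colors = ["blue", "brown", "black", "brown"]
status = ["single"] * 3 + ["married"] * 5

team_age = to_interval(ages)            # interval-valued
team_eyes = to_multi_valued(eye_colors)  # multi-valued
team_status = to_modal(status)          # modal multi-valued
```

Each classical column collapses to one symbolic value for the team, which is why the approach scales to large data sets.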
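Basic Descriptive Statistics: Interval Value
Let $Z_i = (I_{1i}, I_{2i}, \ldots, I_{ki})^T$ be the interval data for the $i$th variable with $k$ concepts, where $I_{ci} = [a_{ci}, b_{ci}]$, $c = 1, 2, \ldots, k$.
Sample mean of $Z_i$:
$$\bar{Z}_i = \frac{1}{2k}\sum_{c=1}^{k}\,(a_{ci} + b_{ci})$$
Sample variance of $Z_i$:
$$S_i^2 = \frac{1}{3k}\sum_{c=1}^{k}\,(a_{ci}^2 + a_{ci}b_{ci} + b_{ci}^2) \;-\; \frac{1}{4k^2}\Bigl[\sum_{c=1}^{k}\,(a_{ci} + b_{ci})\Bigr]^2$$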
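A minimal sketch of these two statistics, assuming the usual convention for interval data that values are uniformly spread within each interval: the mean is the average midpoint, and the variance is the average second moment of the intervals minus the squared mean. The function names and the two example intervals are hypothetical.

```python
def interval_mean(intervals):
    """Mean of interval data: the average of the interval midpoints."""
    k = len(intervals)
    return sum(a + b for a, b in intervals) / (2 * k)


def interval_variance(intervals):
    """Variance of interval data, assuming uniformity within each interval."""
    k = len(intervals)
    # (a^2 + a*b + b^2) / 3 is the second moment of a uniform variable on [a, b].
    second_moment = sum(a * a + a * b + b * b for a, b in intervals) / (3 * k)
    return second_moment - interval_mean(intervals) ** 2


# Hypothetical intervals [a_c, b_c] for k = 2 concepts (not from the slides).
z = [(0.0, 2.0), (2.0, 4.0)]
mean_z = interval_mean(z)
var_z = interval_variance(z)
```

For these two adjacent intervals the result coincides with the variance of a uniform variable on [0, 4], which is a quick sanity check on the formula.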
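Basic Descriptive Statistics: Interval Value
Rewrite the sample variance, with $\bar{Z}_i$ the sample mean of $Z_i$, as
$$S_i^2 = \frac{1}{3k}\sum_{c=1}^{k}\bigl[(a_{ci}-\bar{Z}_i)^2 + (a_{ci}-\bar{Z}_i)(b_{ci}-\bar{Z}_i) + (b_{ci}-\bar{Z}_i)^2\bigr]$$
Total Variation = Within Variation + Between Variation
Within Variation:
$$\frac{1}{k}\sum_{c=1}^{k}\frac{(b_{ci}-a_{ci})^2}{12}$$
Between Variation:
$$\frac{1}{k}\sum_{c=1}^{k}\Bigl(\frac{a_{ci}+b_{ci}}{2}-\bar{Z}_i\Bigr)^2$$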
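The decomposition can be checked numerically. The sketch below assumes the split in which the within part is the average variance of a uniform variable on each interval and the between part is the variance of the midpoints; the function names and data are hypothetical.

```python
def _mean(intervals):
    """Average of the interval midpoints."""
    return sum(a + b for a, b in intervals) / (2 * len(intervals))


def total_variation(intervals):
    """Average squared deviation from the overall mean, uniform within intervals."""
    k, m = len(intervals), _mean(intervals)
    return sum((a - m) ** 2 + (a - m) * (b - m) + (b - m) ** 2
               for a, b in intervals) / (3 * k)


def within_variation(intervals):
    """Average internal variance: (b - a)^2 / 12 for each interval."""
    return sum((b - a) ** 2 for a, b in intervals) / (12 * len(intervals))


def between_variation(intervals):
    """Variance of the interval midpoints around the overall mean."""
    k, m = len(intervals), _mean(intervals)
    return sum(((a + b) / 2 - m) ** 2 for a, b in intervals) / k


# Hypothetical intervals (not from the slides).
z = [(0.0, 2.0), (2.0, 4.0)]
total = total_variation(z)
within = within_variation(z)
between = between_variation(z)
```

A classical analysis of the midpoints alone would report only the between part and ignore the spread inside each interval.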
Similarity between Variables (interval-valued data) (Billard and Diday (2006))
The empirical covariance function between $Z_i$ and $Z_j$ is
The empirical correlation coefficient between $Z_i$ and $Z_j$ is
Where
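One commonly cited form of the interval covariance is a sign-weighted sum, Cov = (1/(3k)) Σ G_i G_j (Q_i Q_j)^(1/2), where Q is the per-interval term of the variance and G is -1 when the interval midpoint lies at or below the variable's mean and +1 otherwise. The sketch below assumes that form; the definitions of `_q` and `_g` are assumptions and should be checked against Billard and Diday (2006) before use. The data are hypothetical.

```python
import math


def _mean(intervals):
    return sum(a + b for a, b in intervals) / (2 * len(intervals))


def _q(a, b, m):
    """Per-interval contribution to the variance (assumed form)."""
    return (a - m) ** 2 + (a - m) * (b - m) + (b - m) ** 2


def _g(a, b, m):
    """Sign term (assumed form): -1 if the midpoint is at or below the mean."""
    return -1 if (a + b) / 2 <= m else 1


def interval_covariance(zi, zj):
    k = len(zi)
    mi, mj = _mean(zi), _mean(zj)
    total = 0.0
    for (ai, bi), (aj, bj) in zip(zi, zj):
        total += (_g(ai, bi, mi) * _g(aj, bj, mj)
                  * math.sqrt(_q(ai, bi, mi) * _q(aj, bj, mj)))
    return total / (3 * k)


def interval_correlation(zi, zj):
    return interval_covariance(zi, zj) / math.sqrt(
        interval_covariance(zi, zi) * interval_covariance(zj, zj))


# Hypothetical interval data for two variables over k = 2 concepts.
zi = [(0.0, 2.0), (2.0, 4.0)]
zj = [(1.0, 3.0), (5.0, 9.0)]
r = interval_correlation(zi, zj)
```

With this form the covariance of a variable with itself reduces to the interval variance, so the correlation of a variable with itself is 1.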
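Distance between concepts
Definition 7.6: The Cartesian join $A \oplus B$ between two sets $A$ and $B$ is their componentwise union, $A \oplus B = (A_1 \cup B_1, \ldots, A_p \cup B_p)$.
Definition 7.7: The Cartesian meet $A \otimes B$ between two sets $A$ and $B$ is their componentwise intersection, $A \otimes B = (A_1 \cap B_1, \ldots, A_p \cap B_p)$.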
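A minimal sketch of the two operators. For multi-valued variables each component is a Python set. For interval-valued variables the sketch assumes the join is the smallest interval spanning both inputs and the meet is their overlap (or `None` when they do not intersect). The function names and the sets are hypothetical.

```python
def join_sets(a, b):
    """Cartesian join: componentwise union of two descriptions."""
    return [x | y for x, y in zip(a, b)]


def meet_sets(a, b):
    """Cartesian meet: componentwise intersection of two descriptions."""
    return [x & y for x, y in zip(a, b)]


def join_interval(x, y):
    """Join of two intervals, assumed to be the interval spanning both."""
    return (min(x[0], y[0]), max(x[1], y[1]))


def meet_interval(x, y):
    """Meet of two intervals: their overlap, or None if they do not intersect."""
    lo, hi = max(x[0], y[0]), min(x[1], y[1])
    return (lo, hi) if lo <= hi else None


# Hypothetical two-variable descriptions (illustrative values only).
A = [{"red", "black"}, {"urban", "rural"}]
B = [{"black"}, {"rural"}]
joined = join_sets(A, B)
met = meet_sets(A, B)
```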
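Distance between concepts (Multi-valued)
The Gowda-Diday dissimilarity measure (Gowda and Diday, 1991): for variable $Y_j$, let $A_j$ and $B_j$ be the sets of values taken by $\omega_1$ and $\omega_2$.
$$D_{1j}(\omega_1,\omega_2) = \frac{\bigl|\,|A_j| - |B_j|\,\bigr|}{|A_j \oplus B_j|} \quad \text{(relative sizes)}$$
$$D_{2j}(\omega_1,\omega_2) = \frac{|A_j| + |B_j| - 2\,|A_j \otimes B_j|}{|A_j \oplus B_j|} \quad \text{(relative content)}$$
$$D(\omega_1,\omega_2) = \sum_{j=1}^{p}\bigl[D_{1j}(\omega_1,\omega_2) + D_{2j}(\omega_1,\omega_2)\bigr]$$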
Distance between concepts (Multi-valued)
Example: Color and Habitat of Birds (Table 7.2), Y1 = Color, Y2 = Habitat, p = 2
For Y1: D11(ω1, ω2) = |2 − 1|/2 = 1/2, D21(ω1, ω2) = (2 + 1 − 2×1)/2 = 1/2
For Y2: D12(ω1, ω2) = |2 − 1|/2 = 1/2, D22(ω1, ω2) = (2 + 1 − 2×1)/2 = 1/2
The Gowda-Diday dissimilarity: D(ω1, ω2) = (1/2 + 1/2) + (1/2 + 1/2) = 2
Normalized (adjusted for scale), with weights 3 and 2: D(ω1, ω2) = (1/2 + 1/2)/3 + (1/2 + 1/2)/2 = 5/6
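The arithmetic of this example can be reproduced with a short sketch. The actual bird data from Table 7.2 are not reproduced here, so the sets below are hypothetical and chosen only so that the set sizes (2 and 1, with one shared value) match the worked figures; `domain_sizes` holds the normalizing weights 3 and 2.

```python
from fractions import Fraction


def gowda_diday_component(a, b):
    """Relative-size plus relative-content terms for one multi-valued variable."""
    join_size = len(a | b)
    size_term = Fraction(abs(len(a) - len(b)), join_size)
    content_term = Fraction(len(a) + len(b) - 2 * len(a & b), join_size)
    return size_term + content_term


def gowda_diday(omega1, omega2, domain_sizes=None):
    """Sum of per-variable components, optionally divided by a scale weight."""
    total = Fraction(0)
    for j, (a, b) in enumerate(zip(omega1, omega2)):
        d = gowda_diday_component(a, b)
        if domain_sizes is not None:
            d = d / domain_sizes[j]
        total += d
    return total


# Hypothetical values with the same set sizes as the worked example.
omega1 = [{"red", "black"}, {"urban", "rural"}]
omega2 = [{"red"}, {"rural"}]

d_raw = gowda_diday(omega1, omega2)
d_norm = gowda_diday(omega1, omega2, domain_sizes=[3, 2])
```

Using `Fraction` keeps the results exact, so they can be compared directly with the fractions on the slide.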
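Distance between concepts (Multi-valued)
The Ichino-Yaguchi dissimilarity measure (Ichino and Yaguchi, 1994): for variable $Y_j$, with $A_j$ and $B_j$ the sets of values taken by $\omega_1$ and $\omega_2$,
$$\phi_j(\omega_1,\omega_2) = |A_j \oplus B_j| - |A_j \otimes B_j| + \gamma\,\bigl(2\,|A_j \otimes B_j| - |A_j| - |B_j|\bigr)$$
For Y1: ϕ1(ω1, ω2) = 2 − 1 + γ(2×1 − 2 − 1) = 1 − γ
For Y2: ϕ2(ω1, ω2) = 2 − 1 + γ(2×1 − 2 − 1) = 1 − γ
Taking γ = 0.5:
Unweighted Minkowski distance: Dq(ω1, ω2) = (0.5^q + 0.5^q)^(1/q)
Weighted Minkowski distance: Dq(ω1, ω2) = ((0.5/3)^q + (0.5/2)^q)^(1/q)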
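A minimal sketch of the calculation above. The sets are hypothetical and only match the sizes used in the example (2 and 1, with one shared value). The weights are applied inside the power, as in the weighted expression on the slide; that placement and the function names are assumptions of this sketch.

```python
def ichino_yaguchi(a, b, gamma=0.5):
    """Ichino-Yaguchi term for one multi-valued variable."""
    join_size = len(a | b)
    meet_size = len(a & b)
    return join_size - meet_size + gamma * (2 * meet_size - len(a) - len(b))


def minkowski(omega1, omega2, q=1, gamma=0.5, weights=None):
    """Minkowski distance of order q over the per-variable terms."""
    if weights is None:
        weights = [1.0] * len(omega1)
    total = sum((w * ichino_yaguchi(a, b, gamma)) ** q
                for w, a, b in zip(weights, omega1, omega2))
    return total ** (1.0 / q)


# Hypothetical values with the same set sizes as the worked example.
omega1 = [{"red", "black"}, {"urban", "rural"}]
omega2 = [{"red"}, {"rural"}]

phi_1 = ichino_yaguchi(omega1[0], omega2[0], gamma=0.5)
d_city = minkowski(omega1, omega2, q=1)
d_euclid = minkowski(omega1, omega2, q=2)
d_weighted = minkowski(omega1, omega2, q=1, weights=[1 / 3, 1 / 2])
```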
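Distance between concepts (Interval-valued)
Let $Z_i = (I_{1i}, I_{2i}, \ldots, I_{ki})^T$ be the interval data for the $i$th variable with $k$ concepts, where $I_{ci} = [a_{ci}, b_{ci}]$, $c = 1, 2, \ldots, k$. For variable $Y_j$, write $[a_{1j}, b_{1j}]$ and $[a_{2j}, b_{2j}]$ for the intervals of $\omega_1$ and $\omega_2$.
The Gowda-Diday dissimilarity measure (Gowda and Diday, 1991):
$$D(\omega_1,\omega_2) = \sum_{j=1}^{p} D_j(\omega_1,\omega_2), \qquad D_j = D_{1j} + D_{2j} + D_{3j} \text{ for the variable } Y_j$$
$$D_{1j} = \frac{\bigl|(b_{1j}-a_{1j}) - (b_{2j}-a_{2j})\bigr|}{k_j} \quad \text{(relative length)}$$
$$D_{2j} = \frac{(b_{1j}-a_{1j}) + (b_{2j}-a_{2j}) - 2I_j}{k_j} \quad \text{(relative content)}$$
$$D_{3j} = \frac{|a_{1j}-a_{2j}|}{|Y_j|} \quad \text{(relative position)}$$
where
• $k_j = |\max(b_{1j}, b_{2j}) - \min(a_{1j}, a_{2j})|$ is the length of the entire distance spanned by $\omega_1$ and $\omega_2$;
• $I_j = |\min(b_{1j}, b_{2j}) - \max(a_{1j}, a_{2j})|$ if the intervals overlap, and $I_j = 0$ otherwise, is the length of the intersection;
• $|Y_j|$ is the total length covered by the observed values of $Y_j$.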
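Distance between concepts (Interval-valued)
The Ichino-Yaguchi dissimilarity measure (Ichino and Yaguchi, 1994): for variable $Y_j$, let $A_j = [a_{1j}, b_{1j}]$ and $B_j = [a_{2j}, b_{2j}]$ be the intervals of $\omega_1$ and $\omega_2$, and let $|\cdot|$ denote interval length.
$$\phi_j(\omega_1,\omega_2) = |A_j \oplus B_j| - |A_j \otimes B_j| + \gamma\,\bigl(2\,|A_j \otimes B_j| - |A_j| - |B_j|\bigr)$$
$$A_j \oplus B_j = [\min(a_{1j}, a_{2j}),\ \max(b_{1j}, b_{2j})]$$
$$A_j \otimes B_j = [\max(a_{1j}, a_{2j}),\ \min(b_{1j}, b_{2j})] \quad \text{(empty if there is no intersection)}$$
The generalized Minkowski distance of order $q \ge 1$ between two interval-valued observations $\xi(\omega_1)$ and $\xi(\omega_2)$ is
$$d_q(\omega_1,\omega_2) = \Bigl(\sum_{j=1}^{p}\bigl[w_j\,\phi_j(\omega_1,\omega_2)\bigr]^q\Bigr)^{1/q}$$
where $\phi_j(\omega_1,\omega_2)$ is the Ichino-Yaguchi distance and $w_j$ is a weight function associated with variable $Y_j$.
• When q = 1: city block distance
• When q = 2: Euclidean distance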
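Distance between concepts (Interval-valued)
The Hausdorff distance (Chavent and Lechevallier, 2002): for variable $Y_j$, with intervals $[a_{1j}, b_{1j}]$ and $[a_{2j}, b_{2j}]$ for $\omega_1$ and $\omega_2$,
$$\phi_j(\omega_1,\omega_2) = \max\bigl(|a_{1j}-a_{2j}|,\ |b_{1j}-b_{2j}|\bigr), \qquad d(\omega_1,\omega_2) = \sum_{j=1}^{p}\phi_j(\omega_1,\omega_2)$$
The Euclidean Hausdorff distance, where $\phi_j(\omega_1,\omega_2)$ is the Hausdorff distance:
$$d(\omega_1,\omega_2) = \Bigl(\sum_{j=1}^{p}\bigl[\phi_j(\omega_1,\omega_2)\bigr]^2\Bigr)^{1/2}$$
The normalized Euclidean Hausdorff distance, where $H_j$ is the normalizing term for $Y_j$:
$$d(\omega_1,\omega_2) = \Bigl(\sum_{j=1}^{p}\Bigl[\frac{\phi_j(\omega_1,\omega_2)}{H_j}\Bigr]^2\Bigr)^{1/2}$$
The span-normalized Euclidean Hausdorff distance, where the span is $|Y_j| = \max_c b_{cj} - \min_c a_{cj}$:
$$d(\omega_1,\omega_2) = \Bigl(\sum_{j=1}^{p}\Bigl[\frac{\phi_j(\omega_1,\omega_2)}{|Y_j|}\Bigr]^2\Bigr)^{1/2}$$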
Distance between concepts (Interval-valued)
Example: take only the first 3 observations of the veterinary data.
Gowda-Diday dissimilarity: D(ω1, ω2) = [… + |120 − 158|/65] (Y1) + […] (Y2)
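A minimal sketch of the interval-valued Gowda-Diday components (relative length, relative content, relative position) for this example. The interval endpoints for ω1 and ω2 are not listed in the text above; the values below are assumptions inferred from the figures in the worked calculations (for example |120 − 158| and the spans 65 and 237.8) and should be checked against the veterinary data table.

```python
def gowda_diday_interval(x, y, span):
    """Return (length, content, position) components for one interval variable."""
    (a1, b1), (a2, b2) = x, y
    k = abs(max(b1, b2) - min(a1, a2))                # total stretch covered
    overlap = max(0.0, min(b1, b2) - max(a1, a2))     # 0 if no intersection
    d_length = abs((b1 - a1) - (b2 - a2)) / k
    d_content = ((b1 - a1) + (b2 - a2) - 2 * overlap) / k
    d_position = abs(a1 - a2) / span
    return d_length, d_content, d_position


# Assumed endpoints (Y1, Y2) for the first two observations; see the lead-in.
omega1 = [(120.0, 180.0), (222.2, 354.0)]
omega2 = [(158.0, 160.0), (322.0, 355.0)]
spans = [65.0, 237.8]  # spans of Y1 and Y2 as given in the example

components = [gowda_diday_interval(x, y, s)
              for x, y, s in zip(omega1, omega2, spans)]
d_total = sum(sum(c) for c in components)
```

Only the position term for Y1, |120 − 158|/65, can be compared directly with the surviving text; the other terms depend on the assumed endpoints.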
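Distance between concepts (Interval-valued)
The Ichino-Yaguchi dissimilarity: with $A_j$ and $B_j$ the intervals of $\omega_1$ and $\omega_2$ for $Y_j$,
$$\phi_j(\omega_1,\omega_2) = |A_j \oplus B_j| - |A_j \otimes B_j| + \gamma\,\bigl(2\,|A_j \otimes B_j| - |A_j| - |B_j|\bigr)$$
where $A_j \oplus B_j$ is the interval spanning both and $A_j \otimes B_j$ is their intersection (empty if there is no intersection).
ϕ1(ω1, ω2) = 58 + γ(−58), with |A1 ⊕ B1| = |180 − 120|
ϕ2(ω1, ω2) = 100.8 + γ(−100.8), with |A2 ⊕ B2| = |355 − 222.2|
The generalized Minkowski distance dq(ω1, ω2):
• When q = 1: city block distance
• When q = 2: Euclidean distance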
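The two values 58(1 − γ) and 100.8(1 − γ) can be reproduced with a short sketch. The interval endpoints are assumptions inferred from the figures in the example, and the Minkowski distance is left unweighted (all weights equal to 1), which is also an assumption.

```python
def ichino_yaguchi_interval(x, y, gamma=0.5):
    """Ichino-Yaguchi term for one interval-valued variable."""
    (a1, b1), (a2, b2) = x, y
    join_len = max(b1, b2) - min(a1, a2)
    meet_len = max(0.0, min(b1, b2) - max(a1, a2))  # 0 if no intersection
    return join_len - meet_len + gamma * (2 * meet_len - (b1 - a1) - (b2 - a2))


def generalized_minkowski(omega1, omega2, q=1, gamma=0.5):
    """Unweighted Minkowski distance of order q over the per-variable terms."""
    total = sum(ichino_yaguchi_interval(x, y, gamma) ** q
                for x, y in zip(omega1, omega2))
    return total ** (1.0 / q)


# Assumed endpoints (Y1, Y2) for the first two observations; see the lead-in.
omega1 = [(120.0, 180.0), (222.2, 354.0)]
omega2 = [(158.0, 160.0), (322.0, 355.0)]

phi_1 = ichino_yaguchi_interval(omega1[0], omega2[0], gamma=0.0)
phi_2 = ichino_yaguchi_interval(omega1[1], omega2[1], gamma=0.0)
city_block = generalized_minkowski(omega1, omega2, q=1, gamma=0.5)
euclidean = generalized_minkowski(omega1, omega2, q=2, gamma=0.5)
```

Setting γ = 0 isolates the leading constants 58 and 100.8; with γ = 0.5 each term is halved.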
Distance between concepts (Interval-valued)
The Hausdorff distance:
ϕ1(ω1, ω2) = 38, ϕ2(ω1, ω2) = 99.8
d(ω1, ω2) = 38 + 99.8 = 137.8
The Euclidean Hausdorff distance: d(ω1, ω2) = (38² + 99.8²)^(1/2)
The normalized Euclidean Hausdorff distance: normalizing term […] = 288.78
The span-normalized Euclidean Hausdorff distance: the spans are |Y1| = 185 − 120 = 65 and |Y2| = 355 − 117.2 = 237.8, so d(ω1, ω2) = ((38/65)² + (99.8/237.8)²)^(1/2)
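A minimal sketch that reproduces the Hausdorff figures above. The interval endpoints are assumptions inferred from the worked values; the spans 65 and 237.8 are taken from the example. The normalized variant is not sketched because its normalizing term is not recoverable from the text.

```python
import math


def hausdorff_component(x, y):
    """Hausdorff distance between two intervals: the larger endpoint gap."""
    return max(abs(x[0] - y[0]), abs(x[1] - y[1]))


def hausdorff(omega1, omega2):
    """Sum of per-variable Hausdorff distances."""
    return sum(hausdorff_component(x, y) for x, y in zip(omega1, omega2))


def euclidean_hausdorff(omega1, omega2, scales=None):
    """Euclidean combination, optionally dividing each term by a scale."""
    if scales is None:
        scales = [1.0] * len(omega1)
    return math.sqrt(sum((hausdorff_component(x, y) / s) ** 2
                         for x, y, s in zip(omega1, omega2, scales)))


# Assumed endpoints (Y1, Y2) for the first two observations; see the lead-in.
omega1 = [(120.0, 180.0), (222.2, 354.0)]
omega2 = [(158.0, 160.0), (322.0, 355.0)]
spans = [65.0, 237.8]

phi = [hausdorff_component(x, y) for x, y in zip(omega1, omega2)]
d_sum = hausdorff(omega1, omega2)
d_euclid = euclidean_hausdorff(omega1, omega2)
d_span = euclidean_hausdorff(omega1, omega2, scales=spans)
```

Dividing by the span makes the two variables comparable, since Y2 is measured on a much wider scale than Y1.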
Interval-valued symbolic data analysis
• Books (Bock and Diday (2000); Billard and Diday (2003, 2006); Diday and Noirhomme-Fraiture (2008))
• PCA (Chouakria, Cazes, and Diday (2000); Palumbo and Lauro (2003); Gioia and Lauro (2006); Hamada, Minami, and Mizuta (2008))
• Clustering analysis (Brito (2002); Souza and de Carvalho (2004); Chavent et al. (2006); Bock (2008))
• Discriminant analysis (Lauro, Verde, and Palumbo (2000); Duarte Silva and Brito (2006))
• MDS (Groenen et al. (2006); Minami and Mizuta (2008))
• Regression (Billard and Diday (2000); de Carvalho et al. (2004))
Symbolic Data Analysis Software
• SODAS (2003): free, from two European consortia
• SYR (2008): more professional, from the SYROKKO company, www.syrokko.com