500 likes | 726 Views
Understanding on Data. 201 4 . 9. What is Data?. Attributes. Data sets are made up of data objects. Data objects are described by attributes An attribute is a property or characteristic of an object Examples: eye color of a person, temperature, etc.
E N D
Understanding on Data 2014. 9
What is Data? Attributes • Data sets are made up of data objects. • Data objects are described by attributes • An attribute is a property or characteristic of an object • Examples: eye color of a person, temperature, etc. • Attribute is also known as variable, field, characteristic, dimension, or feature • Object is also known as record, point, case, sample, entity, tuple, or instance Objects
Attributes • Attribute: a data field, representing a characteristic or feature of a data object. • E.g., customer _ID, name, address • Types: • Nominal • Binary • Ordinal • Numeric: quantitative • Interval-scaled • Ratio-scaled
Attributes • Nominal: categories, states, or “names of things” • Hair_color = {auburn, black, blond, brown, grey, red, white} • marital status, occupation, ID numbers, zip codes, nationality • Binary • Nominal attribute with only 2 states (0 and 1) • Symmetric binary: both outcomes equally important • e.g., gender • Asymmetric binary: outcomes not equally important. • e.g., medical test (positive vs. negative) • Convention: assign 1 to most important outcome (e.g., HIV positive) • Ordinal • Values have a meaningful order (ranking) but magnitude between successive values is not known. • Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types • Quantity (integer or real-valued) • Interval-scaled • Measured on a scale of equal-sized units • Values have order • Difference ( but not ratio) is meaningful • E.g., temperature in C˚or F˚, calendar dates • No true zero-point • Ratio-scaled • Inherent zero-point • Both difference and ratio are meaningful • 10 K˚ is twice as high as 5 K˚ • e.g., temperature in Kelvin, length, counts, monetary quantities, age, etc
Properties of Attribute Values • The type of an attribute depends on which of the following properties it possesses: • Distinctness: = • Order: < > • Addition: + - • Multiplication: * / • Nominal attribute: distinctness • Ordinal attribute: distinctness & order • Interval attribute: distinctness, order & addition • Ratio attribute: all 4 properties
Discrete vs. Continuous Attributes • DiscreteAttribute • Has only a finite or countably infinite set of values • E.g., zip codes, profession, or the set of words in a collection of documents • Sometimes, represented as integer variables • Note: Binary attributes are a special case of discrete attributes • ContinuousAttribute • Has real numbers as attribute values • E.g., temperature, height, or weight • Practically, real values can only be measured and represented using a finite number of digits • Continuous attributes are typically represented as floating-point variables
Types of Data and Data Sets • Record • Relational records • Data matrix, e.g., numerical matrix, crosstabs • Document data: text documents: term-frequency vector • Transaction data • Graph and network • World Wide Web • Social or information networks • Molecular Structures • Ordered • Video data: sequence of images • Temporal data: time-series • Sequential Data: transaction sequences • Genetic sequence data • Spatial, image and multimedia: • Spatial data: maps • Image data: • Video data:
Important Characteristics of Structured Data • Dimensionality • Curse of dimensionality • Sparsity • Only presence counts • Resolution • Patterns depend on the scale • Distribution • Centrality and dispersion
Record Data • Data that consists of a collection of records, each of which consists of a fixed set of attributes
Data Matrix • If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute • Such data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute
Document Data • Each document becomes a `term' vector, • each term is a component (attribute) of the vector, • the value of each component is the number of times the corresponding term occurs in the document.
Transaction Data • A special type of record data, where • each record (transaction) involves a set of items. • For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items.
Graph Data • Examples: Generic graph and HTML Links
Chemical Data • Benzene Molecule: C6H6
Ordered Data • Sequences of transactions Items/Events An element of the sequence
Ordered Data • Genomic sequence data
Ordered Data • Spatio-Temporal Data Average Monthly Temperature of land and ocean
Summary Statistics • Summary statistics are numbers that summarize properties of the data • Summarized properties include frequency, location and spread • Examples: location - meanspread - standard deviation • Most summary statistics can be calculated in a single pass through the data
Frequency and Mode • The frequency of an attribute value is the percentage of time the value occurs in the data set • For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time. • The mode of a an attribute is the most frequent attribute value • The notions of frequency and mode are typically used with categorical data
Measures of Location: Mean and Median • The mean is the most common measure of the location of a set of points. • However, the mean is very sensitive to outliers. • Thus, the median or a trimmed mean is also commonly used.
Symmetric vs. Skewed Data • Median, mean and mode of symmetric, positively and negatively skewed data
Measuring the Dispersion of Data • Variance and standard deviation • Variance: • Standard deviationσis the square root of variance σ2 • Percentile • The value that meets the given percentage of data sets • Five number summary • min, Q1, median,Q3, max • Quartiles: Q1 (25th percentile), Q3 (75th percentile) • Outlier: points beyond a specified outlier threshold, plotted individually • Graphically represented as a boxplot
Variance • The variance or standard deviation is the most common measure of the spread of a set of points. • However, this is also sensitive to outliers, so that other measures are often used.
Percentiles • For continuous data, the notion of a percentile is more useful. • Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile is a value of x such that p% of the observed values of x are less than . • For instance, the 50th percentile is the value such that 50% of all values of x are less than . • Matlab: prctile
Boxplot Analysis • Five-number summary of a distribution • Minimum, Q1, Median, Q3, Maximum • Quartiles: Q1 (25th percentile), Q3 (75th percentile) • Inter-quartile range: IQR = Q3 –Q1 • Five number: min, Q1, median,Q3, max • Boxplot • Data is represented with a box • The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR • The median is marked by a line within the box • Whiskers: two lines outside the box extended to Minimum and Maximum • Outliers: points beyond a specified outlier threshold, plotted individually • X > Q3 + 1.5 *IQR or X < Q1 – 1.5*IQR
Example of Box Plots • Box plots can be used to compare attributes
Homework • DB_score 파일의 모든 필드(attendance, mid-term, final, report, project, score)에 대하여 다음을 계산하시오. • mean, median • standard deviation, AAD, MAD (3)five-number summary,outliers (boxplot은 그림으로 표현할 것.) • Programming language: C, C++, Java
Graphic Displays of Basic Statistical Descriptions • Boxplot: graphic display of five-number summary • Matlab: boxplot • Histogram: x-axis are values, y-axis represents frequencies • Matlab: hist • Quantile plot: each value xiis paired with fi indicating that approximately 100*fi % of data are xi • Matlab: quantile • Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane • Matlab: scatter
Histogram Analysis • Histogram: Graph display of tabulated frequencies, shown as bars • It shows what proportion of cases fall into each of several categories • The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
Histograms Often Tell More than Boxplots • The two histograms shown in the left may have the same boxplot representation • The same values for: min, Q1, median, Q3, max • But they have rather different data distributions
Quantile Plot • Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences) • Plots quantile information • For a data xidata sorted in increasing order, fiindicates that approximately 100 * fi% of the data are below or equal to the value xi
Scatter plot • Shows the relationship and dependency between two variables • Univariate analysis, Bivariateanalysis, Multivariate analysis • Each pair of values is treated as a pair of coordinates and plotted as points in the plane
Positively and Negatively Correlated Data • The left half fragment is positively correlated • The right half is negative correlated
Visually Evaluating Correlation Matlab: plotmatrix
Similarity and Dissimilarity • Similarity • Numerical measure of how alike two data objects are. • Is higher when objects are more alike. • Often falls in the range [0,1] • Dissimilarity • Numerical measure of how different two data objects are • Lower when objects are more alike • Minimum dissimilarity is often 0 • Upper limit varies • Proximity refers to a similarity or dissimilarity
Similarity/Dissimilarity for Simple Attributes p and q are the attribute values for two data objects.
Minkowski Distance • Minkowski Distance is a generalization of Euclidean Distance Where r is a parameter, n is the number of dimensions (attributes) and pk and qk are, respectively, the kth attributes (components) or data objects p and q.
Minkowski Distance: Examples • r = 1. City block (Manhattan, taxicab, L1 norm) distance. • A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors • r = 2. Euclidean distance • r. “supremum” (Lmaxnorm, Lnorm) distance. • This is the maximum difference between any component of the vectors • Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
Example: Minkowski Distance Manhattan (L1) Euclidean (L2) Supremum
Common Properties of a Distance • Distances, such as the Euclidean distance, have some well known properties. • d(p, q) 0 for all p and q and d(p, q) = 0 only if p= q. (Positive definiteness) • d(p, q) = d(q, p) for all p and q. (Symmetry) • d(p, r) d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality) where d(p, q) is the distance (dissimilarity) between points (data objects), p and q. • A distance that satisfies these properties is a metric
Common Properties of a Similarity • Similarities, also have some well known properties. • s(p, q) = 1 (or maximum similarity) only if p= q. • s(p, q) = s(q, p) for all p and q. (Symmetry) where s(p, q) is the similarity between points (data objects), p and q.
Similarity Between Binary Vectors • Common situation is that objects, p and q, have only binary attributes • Compute similarities using the following quantities M01= the number of attributes where p was 0 and q was 1 M10 = the number of attributes where p was 1 and q was 0 M00= the number of attributes where p was 0 and q was 0 M11= the number of attributes where p was 1 and q was 1 • Simple Matching and Jaccard Coefficients SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00) J = number of 11 matches / number of not-both-zero attributes values = (M11) / (M01 + M10 + M11)
SMC versus Jaccard: Example p = 1 0 0 0 0 0 0 0 0 0 q = 0 0 0 0 0 0 1 0 0 1 M01= 2 (the number of attributes where p was 0 and q was 1) M10= 1 (the number of attributes where p was 1 and q was 0) M00= 7 (the number of attributes where p was 0 and q was 0) M11= 0 (the number of attributes where p was 1 and q was 1) SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7 J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
Cosine Similarity • A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document. • Applications: information retrieval, biologic taxonomy, gene feature mapping, ... • Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then cos(d1, d2) = (d1d2) /||d1|| ||d2|| , where indicates vector dot product, ||d||: the length of vector d
Example: Cosine Similarity • Ex: Find the similarity between documents 1 and 2. d1= (5, 0, 3, 0, 2, 0, 0, 2, 0, 0) d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1) d1d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25 ||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 = 6.481 ||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5 = 4.12 cos(d1, d2 ) = 0.94