Understanding on Data

Understanding on Data 2014. 9

What is Data? • Data sets are made up of data objects. • Data objects are described by attributes. • An attribute is a property or characteristic of an object. • Examples: eye color of a person, temperature, etc. • An attribute is also known as a variable, field, characteristic, dimension, or feature. • An object is also known as a record, point, case, sample, entity, tuple, or instance.

Attributes • Attribute: a data field, representing a characteristic or feature of a data object. • E.g., customer_ID, name, address • Types: • Nominal • Binary • Ordinal • Numeric: quantitative • Interval-scaled • Ratio-scaled

Attributes • Nominal: categories, states, or “names of things” • Hair_color = {auburn, black, blond, brown, grey, red, white} • marital status, occupation, ID numbers, zip codes, nationality • Binary • Nominal attribute with only 2 states (0 and 1) • Symmetric binary: both outcomes equally important • e.g., gender • Asymmetric binary: outcomes not equally important. • e.g., medical test (positive vs. negative) • Convention: assign 1 to most important outcome (e.g., HIV positive) • Ordinal • Values have a meaningful order (ranking) but magnitude between successive values is not known. • Size = {small, medium, large}, grades, army rankings

Numeric Attribute Types • Quantity (integer or real-valued) • Interval-scaled • Measured on a scale of equal-sized units • Values have order • Difference (but not ratio) is meaningful • E.g., temperature in °C or °F, calendar dates • No true zero-point • Ratio-scaled • Inherent zero-point • Both difference and ratio are meaningful • 10 K is twice as high as 5 K • E.g., temperature in Kelvin, length, counts, monetary quantities, age, etc.

Properties of Attribute Values • The type of an attribute depends on which of the following properties it possesses: • Distinctness: = ≠ • Order: < > • Addition: + - • Multiplication: * / • Nominal attribute: distinctness • Ordinal attribute: distinctness & order • Interval attribute: distinctness, order & addition • Ratio attribute: all 4 properties

Discrete vs. Continuous Attributes • Discrete Attribute • Has only a finite or countably infinite set of values • E.g., zip codes, profession, or the set of words in a collection of documents • Sometimes represented as integer variables • Note: Binary attributes are a special case of discrete attributes • Continuous Attribute • Has real numbers as attribute values • E.g., temperature, height, or weight • Practically, real values can only be measured and represented using a finite number of digits • Continuous attributes are typically represented as floating-point variables

Types of Data and Data Sets • Record • Relational records • Data matrix, e.g., numerical matrix, crosstabs • Document data: text documents: term-frequency vector • Transaction data • Graph and network • World Wide Web • Social or information networks • Molecular Structures • Ordered • Video data: sequence of images • Temporal data: time-series • Sequential Data: transaction sequences • Genetic sequence data • Spatial, image and multimedia: • Spatial data: maps • Image data: • Video data:

Important Characteristics of Structured Data • Dimensionality • Curse of dimensionality • Sparsity • Only presence counts • Resolution • Patterns depend on the scale • Distribution • Centrality and dispersion

Record Data • Data that consists of a collection of records, each of which consists of a fixed set of attributes

Data Matrix • If data objects have the same fixed set of numeric attributes, then the data objects can be thought of as points in a multi-dimensional space, where each dimension represents a distinct attribute • Such data set can be represented by an m by n matrix, where there are m rows, one for each object, and n columns, one for each attribute

Document Data • Each document becomes a `term' vector, • each term is a component (attribute) of the vector, • the value of each component is the number of times the corresponding term occurs in the document.
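The term-vector idea can be sketched in a few lines of Python. This is a minimal illustration only: the vocabulary and the document below are hypothetical, and tokenization is assumed to be a plain lowercase split on whitespace.

```python
from collections import Counter

def term_frequency_vector(document, vocabulary):
    """Return one count per vocabulary term, in vocabulary order."""
    counts = Counter(document.lower().split())
    # Counter returns 0 for terms that never occur in the document.
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary and document, chosen only for illustration.
vocabulary = ["team", "coach", "play", "ball", "score"]
document = "team play ball team score play team"

tf_vector = term_frequency_vector(document, vocabulary)
print(tf_vector)  # [3, 0, 2, 1, 1]
```

Each position of the vector is one attribute of the document, so a collection of documents over the same vocabulary forms a data matrix.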

Transaction Data • A special type of record data, where • each record (transaction) involves a set of items. • For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items.

Graph Data • Examples: Generic graph and HTML Links

Graph Data : Social Network

Graph data: Food Relationship

Chemical Data • Benzene Molecule: C6H6

Ordered Data • Sequences of transactions Items/Events An element of the sequence

Ordered Data • Genomic sequence data

Ordered Data • Spatio-Temporal Data Average Monthly Temperature of land and ocean

Summary Statistics • Summary statistics are numbers that summarize properties of the data • Summarized properties include frequency, location and spread • Examples: location: mean; spread: standard deviation • Most summary statistics can be calculated in a single pass through the data
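As one possible illustration of the single-pass claim, the sketch below keeps a running mean and a running sum of squared deviations using Welford's method. The slides do not name a specific algorithm, so the choice of Welford's method, the function name, and the sample values are assumptions; the variance returned is the population form (divide by the count).

```python
def running_mean_variance(values):
    """Return (count, mean, population variance) from one pass (Welford's method)."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / count if count else 0.0
    return count, mean, variance

# Hypothetical sample values.
count, mean, variance = running_mean_variance([2, 4, 4, 4, 5, 5, 7, 9])
print(count, mean, variance)
```

The median, by contrast, needs the values in sorted order, so it is not a single-pass statistic in this simple sense.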

Frequency and Mode • The frequency of an attribute value is the percentage of times the value occurs in the data set • For example, given the attribute ‘gender’ and a representative population of people, the gender ‘female’ occurs about 50% of the time. • The mode of an attribute is the most frequent attribute value • The notions of frequency and mode are typically used with categorical data
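A small sketch of both notions for a categorical attribute. The hair-color values are made up for illustration, and ties for the mode are broken by first occurrence, which is an assumption of this sketch.

```python
from collections import Counter

def frequencies(values):
    """Map each distinct value to the fraction of records that carry it."""
    counts = Counter(values)
    total = len(values)
    return {value: n / total for value, n in counts.items()}

def mode(values):
    """Return the most frequent value (first seen wins on ties)."""
    return Counter(values).most_common(1)[0][0]

# Hypothetical categorical attribute.
hair_color = ["black", "brown", "black", "blond", "black", "brown"]

freq = frequencies(hair_color)
print(freq["black"])     # 0.5
print(mode(hair_color))  # black
```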

Measures of Location: Mean and Median • The mean is the most common measure of the location of a set of points. • However, the mean is very sensitive to outliers. • Thus, the median or a trimmed mean is also commonly used.
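The effect of an outlier on the three measures can be seen in a short sketch. The data are hypothetical, and the trimmed mean here is assumed to drop the same fraction of values from each end of the sorted data.

```python
import statistics

def trimmed_mean(values, fraction):
    """Average after dropping `fraction` of the values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

# Hypothetical data with one extreme value (100).
values = [2, 3, 4, 5, 6, 7, 8, 9, 10, 100]

mean_value = statistics.mean(values)      # pulled upward by the outlier
median_value = statistics.median(values)  # unaffected by the outlier
trimmed_value = trimmed_mean(values, 0.1)  # drops 2 and 100
print(mean_value, median_value, trimmed_value)  # 15.4 6.5 6.5
```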

Symmetric vs. Skewed Data • Median, mean and mode of symmetric, positively and negatively skewed data

Measuring the Dispersion of Data • Variance and standard deviation • Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$, where $\mu$ is the mean of the N values • Standard deviation σ is the square root of the variance σ² • Percentile • The value below which a given percentage of the data falls • Five-number summary • min, Q1, median, Q3, max • Quartiles: Q1 (25th percentile), Q3 (75th percentile) • Outlier: points beyond a specified outlier threshold, plotted individually • Graphically represented as a boxplot

Variance • The variance or standard deviation is the most common measure of the spread of a set of points. • However, this is also sensitive to outliers, so that other measures are often used.
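Two of the "other measures" often used are the average absolute deviation (AAD) and the median absolute deviation (MAD). The sketch below compares them with the standard deviation; it assumes the population standard deviation (divide by N), AAD taken around the mean, and MAD taken around the median, and the sample values are hypothetical.

```python
import statistics

def average_absolute_deviation(values):
    """Mean of the absolute deviations from the mean (AAD)."""
    mean = statistics.mean(values)
    return sum(abs(x - mean) for x in values) / len(values)

def median_absolute_deviation(values):
    """Median of the absolute deviations from the median (MAD)."""
    med = statistics.median(values)
    return statistics.median([abs(x - med) for x in values])

# Hypothetical sample values.
data = [2, 4, 4, 4, 5, 5, 7, 9]

std = statistics.pstdev(data)          # population standard deviation
aad = average_absolute_deviation(data)
mad = median_absolute_deviation(data)
print(std, aad, mad)  # 2.0 1.5 0.5
```

Because MAD is built from medians, replacing the 9 with a far larger value leaves it unchanged here, while the standard deviation grows.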

Percentiles • For continuous data, the notion of a percentile is more useful. • Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile is a value x_p of x such that p% of the observed values of x are less than x_p. • For instance, the 50th percentile is the value x_50% such that 50% of all values of x are less than x_50%. • Matlab: prctile
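Several percentile conventions exist. The sketch below uses the simple nearest-rank rule, which is an assumption of this example and not necessarily the rule used by a given tool; routines that interpolate between neighboring values can return slightly different numbers. The data are hypothetical.

```python
import math

def percentile_nearest_rank(values, p):
    """Return the smallest observed value with at least p% of the data at or below it."""
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical observations.
data = [15, 20, 35, 40, 50]

print(percentile_nearest_rank(data, 30))   # 20
print(percentile_nearest_rank(data, 50))   # 35
print(percentile_nearest_rank(data, 100))  # 50
```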

Percentile graph

Boxplot Analysis • Five-number summary of a distribution • Minimum, Q1, Median, Q3, Maximum • Quartiles: Q1 (25th percentile), Q3 (75th percentile) • Inter-quartile range: IQR = Q3 - Q1 • Boxplot • Data is represented with a box • The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR • The median is marked by a line within the box • Whiskers: two lines outside the box extended to Minimum and Maximum • Outliers: points beyond a specified outlier threshold, plotted individually • X > Q3 + 1.5 * IQR or X < Q1 - 1.5 * IQR
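The five-number summary and the 1.5 * IQR outlier rule can be sketched as follows. Quartile conventions differ between tools; this example assumes the "inclusive" interpolation method of Python's `statistics.quantiles` (Python 3.8 or later), and the data are hypothetical.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using inclusive quartiles."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def iqr_outliers(values):
    """Return the values beyond Q3 + 1.5*IQR or below Q1 - 1.5*IQR."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in values if x < low or x > high]

# Hypothetical data with one extreme value (30).
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

summary = five_number_summary(data)
outliers = iqr_outliers(data)
print(summary)   # (1, 3.25, 5.5, 7.75, 30)
print(outliers)  # [30]
```

Here IQR = 4.5, so the upper threshold is 7.75 + 6.75 = 14.5 and only 30 is flagged.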

Example of Box Plots • Box plots can be used to compare attributes

Homework • For every field in the DB_score file (attendance, mid-term, final, report, project, score), compute the following: (1) mean, median (2) standard deviation, AAD, MAD (3) five-number summary, outliers (present the boxplot as a figure) • Programming language: C, C++, Java

Graphic Displays of Basic Statistical Descriptions • Boxplot: graphic display of five-number summary • Matlab: boxplot • Histogram: x-axis shows values, y-axis represents frequencies • Matlab: hist • Quantile plot: each value x_i is paired with f_i, indicating that approximately 100*f_i % of the data are ≤ x_i • Matlab: quantile • Scatter plot: each pair of values is a pair of coordinates and plotted as points in the plane • Matlab: scatter

Histogram Analysis • Histogram: Graph display of tabulated frequencies, shown as bars • It shows what proportion of cases fall into each of several categories • The categories are usually specified as non-overlapping intervals of some variable. The categories (bars) must be adjacent
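The tabulation behind a histogram can be sketched without any plotting. This example assumes equal-width, adjacent bins spanning the minimum to the maximum, with the maximum value placed in the last bin; the values and the bin count are hypothetical.

```python
def histogram_counts(values, bins):
    """Count values in `bins` equal-width, non-overlapping, adjacent intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in values:
        # The maximum value would land in bin index `bins`; clamp it into the last bin.
        index = min(int((x - lo) / width), bins - 1)
        counts[index] += 1
    return counts

# Hypothetical values; three bins cover [1, 4), [4, 7), [7, 10].
values = [1, 2, 2, 3, 5, 7, 8, 9, 9, 10]

counts = histogram_counts(values, 3)
print(counts)  # [4, 1, 5]
```

Dividing each count by the number of values gives the proportion of cases per bin.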

Histograms Often Tell More than Boxplots • The two histograms shown on the left may have the same boxplot representation • The same values for: min, Q1, median, Q3, max • But they have rather different data distributions

Quantile Plot • Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences) • Plots quantile information • For data x_i sorted in increasing order, f_i indicates that approximately 100 * f_i % of the data are below or equal to the value x_i
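The pairs plotted in a quantile plot can be computed as below. The slide does not say how f_i is obtained; this sketch assumes the common convention f_i = (i - 0.5) / N for the i-th smallest of N values, and the data are hypothetical.

```python
def quantile_points(values):
    """Return (f_i, x_i) pairs with f_i = (i - 0.5) / N for sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]

# Hypothetical observations, deliberately unsorted.
points = quantile_points([40, 10, 30, 20])
print(points)  # [(0.125, 10), (0.375, 20), (0.625, 30), (0.875, 40)]
```

Plotting x_i against f_i gives the quantile plot.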

Scatter plot • Shows the relationship and dependency between two variables • Univariate analysis, Bivariate analysis, Multivariate analysis • Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Positively and Negatively Correlated Data • The left half of the figure is positively correlated • The right half is negatively correlated

Uncorrelated Data

Visually Evaluating Correlation Matlab: plotmatrix

Similarity and Dissimilarity • Similarity • Numerical measure of how alike two data objects are. • Is higher when objects are more alike. • Often falls in the range [0,1] • Dissimilarity • Numerical measure of how different two data objects are • Lower when objects are more alike • Minimum dissimilarity is often 0 • Upper limit varies • Proximity refers to a similarity or dissimilarity

Similarity/Dissimilarity for Simple Attributes p and q are the attribute values for two data objects.

Minkowski Distance • Minkowski Distance is a generalization of Euclidean Distance: $d(p, q) = \left(\sum_{k=1}^{n} |p_k - q_k|^r\right)^{1/r}$, where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Minkowski Distance: Examples • r = 1. City block (Manhattan, taxicab, L1 norm) distance. • A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors • r = 2. Euclidean distance • r → ∞. “supremum” (L_max norm, L_∞ norm) distance. • This is the maximum difference between any component of the vectors • Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
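A minimal sketch of the three cases, taking the Minkowski distance as the sum of |p_k - q_k|^r over all attributes, raised to the power 1/r, with the supremum distance as the limiting case. The function name and the two points are hypothetical.

```python
import math

def minkowski_distance(p, q, r):
    """Minkowski distance between equal-length vectors p and q for parameter r."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if math.isinf(r):
        # Limit r -> infinity: the largest per-attribute difference.
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

# Hypothetical two-dimensional points.
p, q = (0, 2), (3, 1)

manhattan = minkowski_distance(p, q, 1)        # 3 + 1 = 4
euclidean = minkowski_distance(p, q, 2)        # sqrt(9 + 1)
supremum = minkowski_distance(p, q, math.inf)  # max(3, 1) = 3
print(manhattan, euclidean, supremum)
```

The same function works for any number of dimensions n, which is independent of the parameter r.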

Example: Minkowski Distance Manhattan (L1) Euclidean (L2) Supremum

Common Properties of a Distance • Distances, such as the Euclidean distance, have some well known properties. • d(p, q) ≥ 0 for all p and q and d(p, q) = 0 only if p = q. (Positive definiteness) • d(p, q) = d(q, p) for all p and q. (Symmetry) • d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality) where d(p, q) is the distance (dissimilarity) between points (data objects), p and q. • A distance that satisfies these properties is a metric

Common Properties of a Similarity • Similarities also have some well known properties. • s(p, q) = 1 (or maximum similarity) only if p = q. • s(p, q) = s(q, p) for all p and q. (Symmetry) where s(p, q) is the similarity between points (data objects), p and q.

Similarity Between Binary Vectors • Common situation is that objects, p and q, have only binary attributes • Compute similarities using the following quantities M01= the number of attributes where p was 0 and q was 1 M10 = the number of attributes where p was 1 and q was 0 M00= the number of attributes where p was 0 and q was 0 M11= the number of attributes where p was 1 and q was 1 • Simple Matching and Jaccard Coefficients SMC = number of matches / number of attributes = (M11 + M00) / (M01 + M10 + M11 + M00) J = number of 11 matches / number of not-both-zero attributes values = (M11) / (M01 + M10 + M11)

SMC versus Jaccard: Example p = 1 0 0 0 0 0 0 0 0 0 q = 0 0 0 0 0 0 1 0 0 1 M01= 2 (the number of attributes where p was 0 and q was 1) M10= 1 (the number of attributes where p was 1 and q was 0) M00= 7 (the number of attributes where p was 0 and q was 0) M11= 0 (the number of attributes where p was 1 and q was 1) SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7 J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
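The example can be reproduced with a short sketch. Returning 0.0 for the Jaccard coefficient when both vectors are all zeros is an assumption of this sketch, since the coefficient is undefined in that case; the function name is hypothetical.

```python
def binary_similarities(p, q):
    """Return (SMC, Jaccard) for two equal-length 0/1 vectors."""
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    smc = (m11 + m00) / (m01 + m10 + m11 + m00)
    not_both_zero = m01 + m10 + m11
    # Assumption: report 0.0 when no attribute is 1 in either vector.
    jaccard = m11 / not_both_zero if not_both_zero else 0.0
    return smc, jaccard

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

smc, jaccard = binary_similarities(p, q)
print(smc, jaccard)  # 0.7 0.0
```

SMC counts the seven shared zeros as agreement, while Jaccard ignores them, which is why the two coefficients differ so much on sparse vectors.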

Cosine Similarity • A document can be represented by thousands of attributes, each recording the frequency of a particular word (such as keywords) or phrase in the document. • Applications: information retrieval, biologic taxonomy, gene feature mapping, ... • Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates vector dot product, ||d||: the length of vector d

Example: Cosine Similarity • Ex: Find the similarity between documents 1 and 2. d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0) d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1) d1 · d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25 ||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481 ||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.12 cos(d1, d2) = 25 / (6.481 * 4.12) = 0.94
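The same calculation as a short sketch, using the two document vectors from the example. The function name is hypothetical, and the sketch assumes neither vector is all zeros, since the measure is undefined for a zero-length vector.

```python
import math

def cosine_similarity(d1, d2):
    """Dot product of d1 and d2 divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

similarity = cosine_similarity(d1, d2)
print(round(similarity, 2))  # 0.94
```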
