660 likes | 750 Views
Data Science: Data. Lecture Notes for Chapter 2. What is Data?. Collection of data objects and their attributes An attribute is a property or characteristic of an object Examples: Name, Gender, Age , etc. Attribute is also known as variable, field, characteristic, or feature
E N D
Data Science: Data Lecture Notes for Chapter 2
What is Data? • Collection of data objects and their attributes • An attribute is a property or characteristic of an object • Examples:Name,Gender,Age, etc. • Attribute is also known as variable, field, characteristic, or feature • A collection of attributes describe an object • Object is also known as record, point, case, sample, entity, or instance Attributes Objects
Attribute Values • Attribute values are numbers or symbols assigned to an attribute • Question: what is the first attribute of students and what is its value of the third student?
Types of Attributes • There are different types of attributes • Nominal • Examples: ID numbers, eye color, zip codes • Ordinal • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} • Interval • Examples: calendar dates, temperatures in Celsius or Fahrenheit. • Ratio • Examples: temperature in Kelvin, length, time, counts
Attribute Type Description Examples Operations Nominal The values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. (=, ) zip codes, employee ID numbers, eye color, sex: {male, female} mode, entropy, contingency correlation, 2 test Ordinal The values of an ordinal attribute provide enough information to order objects. (<, >) hardness of minerals, {good, better, best}, grades, street numbers median, percentiles, rank correlation, run tests, sign tests Interval For interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. (+, - ) calendar dates, temperature in Celsius or Fahrenheit mean, standard deviation, Pearson's correlation, t and F tests Ratio For ratio variables, both differences and ratios are meaningful. (*, /) temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current geometric mean, harmonic mean, percent variation
Types of Attributes • Order the objects according to grade level from A to E. • From the birth year, the third student is 3 years older than the first student. • Assume that the length of one ruler is 20 centimeters and the length of another is 40 cm. The length is a ratio attribute. Specially, the second ruler is twice as long as the first one.
Discrete and Continuous Attributes • Discrete Attribute • Has only a finite or countably infinite set of values • Examples: zip codes, student ID • Note: binary attributes are a special case of discrete attributes (Gender) • Continuous Attribute • Has real numbers as attribute values • Examples: age, length, or weight.
Discrete and Continuous Attributes • In some cases, an attribute can be viewed as a discrete attribute. In other cases, it can be viewed as a continuous attribute. • For example, age is an attribute of persons. If age denotes an attribute of students in a university, its values usually fall into the range [10, 30]. In this case, it has 21 values, so it can be viewed as a discrete attribute. If we don’t limit the value range of age, it can be viewed as a continuous attribute. For example, we can say the age of a person is 23.3. (23 years old + 3.6 months) • Therefore, an interval attribute and a ration attribute may be a discrete attribute or a continuous attribute in the different cases.
Types of data sets • Record • Data Matrix • Document Data • Transaction Data • Graph • World Wide Web • Molecular Structures • Ordered • Spatial Data • Temporal Data • Sequential Data • Genetic Sequence Data
Types of data sets • For a specific data mining technique, it usually focuses on just one type of data sets. • In this course, the techniques that we will learn focus on the record data.
Record Data • A record data set consists of some records, each of which consists of a fixed set of attributes
Transaction Data • A special type of record data is transaction data, where • each record (transaction) involves a set of items. • For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitute a transaction, while the individual products that were purchased are the items.
Important Characteristics of Record Data • Dimensionality • The dimensionality of a data set is the number of attributes that the objects have. • Sparsity • When most attributes of an object have values of 0, we say that the data set is sparse.
Data Quality • What kinds of data quality problems are there? • How can we detect problems with the data? • Examples of data quality problems: • Noise and outliers • missing values • duplicate data
Noise • Noise refers to modification of original values • Examples: distortion of a person’s voice when talking on a poor phone and “snow” on television screen Two Sine Waves Two Sine Waves + Noise
Noise • Noise reduces the data quality. Furthermore, it reduces the quality of data mining results, such as reducing classification accuracy and clustering accuracy. Noise can even cause the incorrect data mining results. • Therefore, it is an important task to reduce noise in the data preprocessing. • Nonetheless, the elimination of noise is frequently difficult, which requires us to devise robust data mining algorithms that produce acceptable results when noise is present.
Outliers • Outliers are data objects with characteristics that are considerably different than most of the other data objects in the data set
Outliers • Just note: Outliers are different from noise. • Outliers can be legitimate data objects or values • Unlike noise, outliers may sometimes be of interest. • One task of data mining is to detect outliers from a large amount of data. • Outliers detection algorithms will be presented in Chapter 10 (if time permits).
Missing Values • It is usual for one or more objects to be missing one or more attribute values. • Why?
Missing Values • Reasons for missing values • Information is not collected (e.g., people decline to give their age and weight) • Attributes may not be applicable to all cases (e.g., annual income is not applicable to children) Table: Members of Tom’s family
Missing Values • Generally speaking, most of data mining algorithms cannot handle data sets with missing values. • Handling missing values--one task of preprocessing • Eliminate Data Objects • Estimate Missing Values
Duplicate Data • Data set may include data objects that are duplicates, or almost duplicates of one another • Issue often appears when merging data from heterogeneous sources • Examples: • Same person with multiple email addresses • Many people receive duplicate mailings.
Data Postprocessing: Visualization Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported. • Visualization of data is one of the most powerful and appealing techniques for data exploration. • Humans have a well developed ability to analyze large amounts of information that is presented visually • Can detect general patterns and trends • Can detect outliers and unusual patterns
Arrangement • Example of visualization: arrangement by tabular • Help to well understand the data. • Example: --Nine objects with six binary attributes. --we can not observe any clear relationship between objects and attributes at first glance.
Arrangement • Help to well understand the data. • Example: --Permute the rows and columns. --Only two types of objects, one that has all ones for the first three attributes and one that has all ones for the last three attributes.
Histogram Example: Show the sepal length data (Figure 3.4) using Histogram. shows the distribution of values
Celsius Visualization Techniques: Scatter Plots • Many visualization techniques have been developed, such as histogram, scatter plots, box plot, pie chart, contour plot… • If you are interested in data visualization, you can refer to the specialized books for visualization techniques.
Similarity and Dissimilarity • Similarity • Numerical measure of how alike two data objects are. • Is higher when objects are more alike. • Dissimilarity • Numerical measure of how different two data objects are • Lower when objects are more alike • The term distance is used as a synonym for dissimilarity. • Proximity refers to a similarity or dissimilarity • Similarity and dissimilarity are important to data mining techniques. In many cases, the initial data set is not needed once the similarities and dissimilarities have been computed.
Examples • In the classification task • We need to compute the dissimilarity (distance) between the test record and each training record so than we can find the most similar training objects.
Example • In the clustering task • We need to compute the dissimilarity (distance) between each pair of points so that we can cluster the points into the different groups.
Similarity and Dissimilarity • Proximity Transformation 1 • Generally, proximity measures, especially similarities, are transformed to have values in the interval [0,1], so that we can use a scale in which a proximity value indicates the fraction of similarity or dissimilarity. • Example, if the similarities between objects range from 1 (not at all similar) to 10 (completely similar), fall them within the range [0,1] by using the transformation: where s’ and s are the new similarity and original similarity values, respectively. • Exercise: if the original similarity between objects is 6.4, what is the similarity when transformed to the range [0,1]?
Similarity and Dissimilarity • Proximity Transformation 2 • More generally, the transformation of similarities to the interval [0,1] is given by the expression: where min_sand max_sare the original minimum and maximum similarity values, respectively. • Exercise: if all original similarities between two objects fall within [10, 30] and the real similarity between two certain objects is 14, what is the new similarity when transformed to the range [0,1]?
Similarity and Dissimilarity • Proximity Transformation 3 • Likewise, dissimilarity measures with a finite range can be mapped to [0,1] by using the formula: where min_dand max_dare the minimum and maximum distance values, respectively. If the proximity measure originally takes values in the interval , one transformation of proximity measure to [0,1] is:
Similarity and Dissimilarity • Proximity Transformation 4 • Transform similarities to dissimilarity. • If the similarity falls in [0,1], we can use the transformation: where d and s are the dissimilarity and similarity values, respectively.
Proximity of Objects • Proximity of objects with a number of attributes is typically defined by combining the proximities of individual attributes. • How to compute the dissimilarity of the two objects in the following table? • Suppose that Grade level={A,B,C,D,E} and the age of each student falls in [10,30].
Proximity of Objects • Proximity of objects with a number of attributes is typically defined by combining the proximities of individual attributes. • d=distance(student ID)+distance(Grade level)+distance(Age) • Student ID is a nominal attribute, Grade level is an ordinal attribute, and age is a ration/interval attribute.
Proximity of Objects • Distance (dissimilarity) of two nominal attribute values. • Let p and q be two nominal attribute values, we define the distance between two attribute values: • d=distance(student ID)+distance(Grade level)+distance(Age) =1+distance(Grade level)+distance(Age)
Proximity of Objects • Distance (dissimilarity) of two ordinal attribute values. • Let p and q be two ordinal attribute values, we define the distance between them: • Map each value of an ordinal attribute to integer 0 to n-1 • Suppose that Grade level={A,B,C,D,E}, so map the values to: A=4, B=3, C=2, D=1, E=0.
Proximity of Objects • Distance (dissimilarity) of two ordinal attributes. • Suppose that Grade level={A,B,C,D,E}, so map the values to: A=4, B=3, C=2, D=1, E=0. • distance(Grade level)=|C-A|/n-1=|2-4|/5-1=0.5 • d=distance(student ID)+distance(Grade level)+distance(Age) =1+0.5+distance(Age)
Proximity of Objects • Distance (dissimilarity) of two ration attribute values. • Let p and q be two ratio attribute values, we define the distance of them: • (Note: the distance of two interval attributes can also be computed using this formula.) • distance(age)=|23-20|=3 • Furthermore, we transform it to [0,1]. • Suppose that the age of each student falls in [10,30]. So, the distance interval of the age is [0, 20].
Proximity of Objects • Distance (dissimilarity) of two ration attribute values. • distance(age)=|23-20|=3 • Furthermore, we transform it to [0,1]. • Suppose that the age of each student falls in [10,30]. So, the distance interval of the age is [0, 20]. • Transformed distance(age)=(3-0)/(20-0)=0.15 • d=distance(student ID)+distance(Grade level)+distance(Age) =1+0.5+0.15=1.65
Proximity of Objects • Exercise. • What is the dissimilarity of the two objects in the following table? • Suppose that Grade level={A,B,C,D,E,F} and the age of each student falls in [15,30].
Proximity of Objects • In many cases, each object has only a number of ratio attributes. Two cuboids with length, width and height • How to compute the distance between two objects with a number of ratio attributes?
Euclidean Distance • Euclidean Distance Where n is the number of dimensions (attributes) and pk and qk are, respectively, the kth attributes (components) or data objects p and q. • Standardization is necessary, if scales differ.
Euclidean Distance Distance Matrix
Minkowski Distance • Minkowski Distance is a generalization of Euclidean Distance Where r is a parameter, n is the number of dimensions (attributes) and pk and qk are, respectively, the kth attributes (components) or data objects p and q.
Minkowski Distance: Examples • r = 1. City block (Manhattan, taxicab, L1 norm) distance. • A common example of this is the Hamming distance, which is just the number of bits that are different between two binary vectors • r = 2. Euclidean distance (L2 norm) distance. • r. “supremum” (Lmaxnorm, Lnorm) distance. • This is the maximum difference between any component of the vectors • Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
Minkowski Distance Distance Matrix
Common Properties of a Distance • Distances, such as the Euclidean distance, have some well known properties. • d(p, q) 0 for all p and q, d(p, q) = 0 only if p= q. (Positive definiteness) • d(p, q) = d(q, p) for all p and q. (Symmetry) • d(p, r) d(p, q) + d(q, r) for all points p, q, and r. (Triangle Inequality) where d(p, q) is the distance (dissimilarity) between points (data objects), p and q. • A distance that satisfies these properties is a metric
Similarity Between Binary Vectors • Common situation is that objects, p and q, have only binary attributes