
Data Science: Data


Presentation Transcript


  1. Data Science: Data Lecture Notes for Chapter 2

  2. What is Data? • Collection of data objects and their attributes • An attribute is a property or characteristic of an object • Examples: Name, Gender, Age, etc. • An attribute is also known as a variable, field, characteristic, or feature • A collection of attributes describes an object • An object is also known as a record, point, case, sample, entity, or instance • [Figure: data table labeled with Attributes and Objects]

  3. Attribute Values • Attribute values are numbers or symbols assigned to an attribute • Question: what is the first attribute of the students, and what is its value for the third student?

  4. Types of Attributes • There are different types of attributes • Nominal • Examples: ID numbers, eye color, zip codes • Ordinal • Examples: rankings (e.g., taste of potato chips on a scale from 1-10), grades, height in {tall, medium, short} • Interval • Examples: calendar dates, temperatures in Celsius or Fahrenheit. • Ratio • Examples: temperature in Kelvin, length, time, counts

  5. Attribute Types: Description, Examples, and Operations • Nominal (=, ≠) • Description: the values of a nominal attribute are just different names, i.e., nominal attributes provide only enough information to distinguish one object from another. • Examples: zip codes, employee ID numbers, eye color, sex: {male, female} • Operations: mode, entropy, contingency correlation, χ² test • Ordinal (<, >) • Description: the values of an ordinal attribute provide enough information to order objects. • Examples: hardness of minerals, {good, better, best}, grades, street numbers • Operations: median, percentiles, rank correlation, run tests, sign tests • Interval (+, -) • Description: for interval attributes, the differences between values are meaningful, i.e., a unit of measurement exists. • Examples: calendar dates, temperature in Celsius or Fahrenheit • Operations: mean, standard deviation, Pearson's correlation, t and F tests • Ratio (*, /) • Description: for ratio attributes, both differences and ratios are meaningful. • Examples: temperature in Kelvin, monetary quantities, counts, age, mass, length, electrical current • Operations: geometric mean, harmonic mean, percent variation
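
As a rough illustration of the attribute-type summary above, the sketch below applies one statistic that is meaningful for each attribute type. All sample values and variable names are invented for illustration and do not come from the lecture.

```python
import statistics

# Hypothetical sample values, invented for illustration only.
eye_color = ["brown", "blue", "brown", "green"]   # nominal
grade = ["C", "A", "B", "A", "E"]                 # ordinal (A is best)
temp_celsius = [20.0, 22.0, 24.0]                 # interval
length_cm = [20.0, 40.0]                          # ratio

# Nominal: only equality is meaningful, so the mode is a valid summary.
nominal_mode = statistics.mode(eye_color)

# Ordinal: order is meaningful, so sort by rank and take the middle value.
rank = {"E": 0, "D": 1, "C": 2, "B": 3, "A": 4}
by_rank = sorted(grade, key=rank.get)
ordinal_median = by_rank[len(by_rank) // 2]       # odd count: middle element

# Interval: differences are meaningful, so the mean is valid.
interval_mean = statistics.mean(temp_celsius)

# Ratio: ratios are meaningful ("twice as long").
length_ratio = length_cm[1] / length_cm[0]

print(nominal_mode, ordinal_median, interval_mean, length_ratio)
```

Each statistic is also valid for the types listed after it: a median can be taken of interval or ratio values, but a mean of nominal values has no meaning.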

  6. Types of Attributes • Order the objects according to grade level from A to E. • From the birth year, the third student is 3 years older than the first student. • Assume that one ruler is 20 cm long and another is 40 cm long. Length is a ratio attribute: specifically, the second ruler is twice as long as the first one.

  7. Discrete and Continuous Attributes • Discrete Attribute • Has only a finite or countably infinite set of values • Examples: zip codes, student ID • Note: binary attributes are a special case of discrete attributes (Gender) • Continuous Attribute • Has real numbers as attribute values • Examples: age, length, or weight.

  8. Discrete and Continuous Attributes • The same attribute can be viewed as discrete in some cases and as continuous in others. • For example, age is an attribute of persons. If age denotes an attribute of students in a university, its values usually fall into the range [10, 30]. Counted in whole years, it has 21 values, so it can be viewed as a discrete attribute. If we don’t limit the value range of age, it can be viewed as a continuous attribute. For example, we can say the age of a person is 23.3 (23 years + 3.6 months). • Therefore, an interval attribute or a ratio attribute may be discrete or continuous, depending on the case.

  9. Types of data sets • Record • Data Matrix • Document Data • Transaction Data • Graph • World Wide Web • Molecular Structures • Ordered • Spatial Data • Temporal Data • Sequential Data • Genetic Sequence Data

  10. Types of data sets • A specific data mining technique usually focuses on just one type of data set. • In this course, the techniques that we will learn focus on record data.

  11. Record Data • A record data set consists of some records, each of which consists of a fixed set of attributes

  12. Transaction Data • Transaction data is a special type of record data in which each record (transaction) involves a set of items. • For example, consider a grocery store. The set of products purchased by a customer during one shopping trip constitutes a transaction, while the individual products that were purchased are the items.
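
One way to see why transaction data counts as record data is to turn each transaction into a record with one binary attribute per item. The slides do not prescribe this encoding, so the sketch below is an assumption, and the transactions are made up.

```python
# Hypothetical transactions, invented for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer", "cola"},
]

# Every distinct item, in a fixed order, becomes one attribute.
items = sorted(set().union(*transactions))

# One record per transaction: 1 if the item was purchased, else 0.
records = [[1 if item in t else 0 for item in items] for t in transactions]

print(items)
for row in records:
    print(row)
```

With many distinct items and small baskets, most entries in such records are 0.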

  13. Important Characteristics of Record Data • Dimensionality • The dimensionality of a data set is the number of attributes that the objects have. • Sparsity • When most attributes of an object have values of 0, we say that the data set is sparse.
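
Both characteristics can be computed directly from a record data set stored as rows of attribute values. The data and the function names `dimensionality` and `sparsity` below are illustrative assumptions, and measuring sparsity as the fraction of zero values is one possible reading of "most attributes have values of 0".

```python
# Hypothetical record data: each row is an object, each column an attribute.
data = [
    [0, 0, 3, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 2],
]

def dimensionality(rows):
    """Number of attributes per object (assumes all rows have equal length)."""
    return len(rows[0])

def sparsity(rows):
    """Fraction of all attribute values that are zero."""
    total = sum(len(row) for row in rows)
    zeros = sum(value == 0 for row in rows for value in row)
    return zeros / total

print(dimensionality(data))  # 5
print(sparsity(data))        # 12 of the 15 values are zero
```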

  14. Data Quality • What kinds of data quality problems are there? • How can we detect problems with the data? • Examples of data quality problems: • Noise and outliers • missing values • duplicate data

  15. Noise • Noise refers to modification of original values • Examples: distortion of a person’s voice when talking on a poor phone connection and “snow” on a television screen • [Figure: two sine waves; two sine waves + noise]

  16. Noise • Noise reduces data quality. It also reduces the quality of data mining results, for example by lowering classification and clustering accuracy, and it can even lead to incorrect results. • Therefore, reducing noise is an important task in data preprocessing. • Nonetheless, eliminating noise is frequently difficult, so we need robust data mining algorithms that produce acceptable results even when noise is present.

  17. Outliers • Outliers are data objects with characteristics that are considerably different from those of most other data objects in the data set

  18. Outliers • Note: outliers are different from noise. • Outliers can be legitimate data objects or values • Unlike noise, outliers may sometimes be of interest. • One task of data mining is to detect outliers in a large amount of data. • Outlier detection algorithms will be presented in Chapter 10 (if time permits).

  19. Missing Values • It is common for one or more objects to be missing one or more attribute values. • Why?

  20. Missing Values • Reasons for missing values • Information is not collected (e.g., people decline to give their age and weight) • Attributes may not be applicable to all cases (e.g., annual income is not applicable to children) Table: Members of Tom’s family

  21. Missing Values • Generally speaking, most data mining algorithms cannot handle data sets with missing values. • Handling missing values is one task of preprocessing: • Eliminate data objects • Estimate missing values
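
The two strategies can be sketched on a single attribute as follows. The values are hypothetical, `None` is used here as the missing-value marker, and filling with the mean of the observed values is just one simple estimator chosen for illustration; the slides do not specify one.

```python
# Hypothetical ages; None marks a missing value.
ages = [23, None, 20, 31, None, 26]

# Strategy 1: eliminate data objects that have a missing value.
eliminated = [a for a in ages if a is not None]

# Strategy 2: estimate missing values, here with the mean of observed values.
mean_age = sum(eliminated) / len(eliminated)
estimated = [a if a is not None else mean_age for a in ages]

print(eliminated)  # [23, 20, 31, 26]
print(estimated)   # [23, 25.0, 20, 31, 25.0, 26]
```

Elimination shrinks the data set, while estimation keeps every object at the cost of introducing values that were never observed.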

  22. Duplicate Data • Data set may include data objects that are duplicates, or almost duplicates of one another • Issue often appears when merging data from heterogeneous sources • Examples: • Same person with multiple email addresses • Many people receive duplicate mailings.

  23. Data Postprocessing: Visualization • Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported. • Visualization of data is one of the most powerful and appealing techniques for data exploration. • Humans have a well-developed ability to analyze large amounts of information that is presented visually • Can detect general patterns and trends • Can detect outliers and unusual patterns

  24. Arrangement • Example of visualization: tabular arrangement • Helps in understanding the data. • Example: nine objects with six binary attributes. • At first glance, we cannot observe any clear relationship between objects and attributes.

  25. Arrangement • Helps in understanding the data. • Example: permute the rows and columns. • Only two types of objects remain: one that has all ones for the first three attributes and one that has all ones for the last three attributes.

  26. Histogram • Example: show the sepal length data (Figure 3.4) using a histogram. • A histogram shows the distribution of values.
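
A histogram boils down to counting how many values fall into each bin. The sketch below uses equal-width bins and a text rendering. The measurements are made up (they are not the Figure 3.4 data), and the function assumes every value lies in [low, high].

```python
def histogram(values, low, high, bins):
    """Count values per equal-width bin; assumes low <= v <= high."""
    width = (high - low) / bins
    counts = [0] * bins
    for v in values:
        # Clamp so that v == high lands in the last bin.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    return counts

# Hypothetical length measurements, invented for illustration.
lengths = [4.6, 4.9, 5.0, 5.1, 5.4, 5.8, 6.1, 6.3, 6.4, 7.0]
counts = histogram(lengths, low=4.0, high=8.0, bins=4)

for i, c in enumerate(counts):
    print(f"[{4.0 + i:.1f}, {5.0 + i:.1f}) {'*' * c}")
```

The choice of bin width affects how the distribution appears, so it is worth trying more than one.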

  27. Visualization Techniques: Scatter Plots • Many visualization techniques have been developed, such as histograms, scatter plots, box plots, pie charts, and contour plots. • If you are interested in data visualization, refer to specialized books on visualization techniques.

  28. Similarity and Dissimilarity • Similarity • Numerical measure of how alike two data objects are. • Is higher when objects are more alike. • Dissimilarity • Numerical measure of how different two data objects are • Lower when objects are more alike • The term distance is used as a synonym for dissimilarity. • Proximity refers to a similarity or dissimilarity • Similarity and dissimilarity are important to data mining techniques. In many cases, the initial data set is not needed once the similarities and dissimilarities have been computed.

  29. Examples • In the classification task • We need to compute the dissimilarity (distance) between the test record and each training record so that we can find the most similar training objects.

  30. Example • In the clustering task • We need to compute the dissimilarity (distance) between each pair of points so that we can cluster the points into different groups.

  31. Similarity and Dissimilarity • Proximity Transformation 1 • Generally, proximity measures, especially similarities, are transformed to have values in the interval [0,1], so that we can use a scale in which a proximity value indicates the fraction of similarity or dissimilarity. • Example: if the similarities between objects range from 1 (not at all similar) to 10 (completely similar), we can map them into the range [0,1] by using the transformation $s' = (s - 1)/9$, where s' and s are the new similarity and original similarity values, respectively. • Exercise: if the original similarity between objects is 6.4, what is the similarity when transformed to the range [0,1]?

  32. Similarity and Dissimilarity • Proximity Transformation 2 • More generally, the transformation of similarities to the interval [0,1] is given by the expression $s' = (s - min\_s)/(max\_s - min\_s)$, where min_s and max_s are the original minimum and maximum similarity values, respectively. • Exercise: if all original similarities between two objects fall within [10, 30] and the real similarity between two certain objects is 14, what is the new similarity when transformed to the range [0,1]?
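
A minimal sketch of this rescaling, assuming the usual min-max form s' = (s - min_s)/(max_s - min_s); the function name `rescale` is illustrative. The 1-to-10 scale of the previous slide is the special case min_s = 1, max_s = 10.

```python
def rescale(value, old_min, old_max):
    """Map a value from [old_min, old_max] onto [0, 1] (min-max rescaling)."""
    if old_max == old_min:
        raise ValueError("the original range must have nonzero width")
    return (value - old_min) / (old_max - old_min)

# Similarities on a 1..10 scale: (6.4 - 1) / 9, approximately 0.6
s1 = rescale(6.4, 1, 10)

# Similarities in [10, 30]: (14 - 10) / (30 - 10)
s2 = rescale(14, 10, 30)

print(s1, s2)
```

The same function works for dissimilarities with a finite range, since only the endpoints of the original interval matter.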

  33. Similarity and Dissimilarity • Proximity Transformation 3 • Likewise, dissimilarity measures with a finite range can be mapped to [0,1] by using the formula $d' = (d - min\_d)/(max\_d - min\_d)$, where min_d and max_d are the minimum and maximum distance values, respectively. • If the proximity measure originally takes values in the interval $[0, \infty)$, one transformation of the proximity measure to [0,1] is $d' = d/(1 + d)$.
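
For a dissimilarity with no upper bound, min-max rescaling is not available because there is no maximum. The sketch below assumes the commonly used mapping d' = d/(1 + d); the function name `squash` is illustrative.

```python
def squash(d):
    """Map a distance in [0, inf) into [0, 1) via d / (1 + d)."""
    if d < 0:
        raise ValueError("distance must be non-negative")
    return d / (1 + d)

for d in (0, 1, 3, 100):
    print(d, squash(d))
```

Unlike min-max rescaling, this mapping is nonlinear: it preserves order but compresses large distances toward 1, so differences between large distances become small.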

  34. Similarity and Dissimilarity • Proximity Transformation 4 • Transform similarities to dissimilarities. • If the similarity falls in [0,1], we can use the transformation $d = 1 - s$, where d and s are the dissimilarity and similarity values, respectively.

  35. Proximity of Objects • Proximity of objects with a number of attributes is typically defined by combining the proximities of individual attributes. • How to compute the dissimilarity of the two objects in the following table? • Suppose that Grade level={A,B,C,D,E} and the age of each student falls in [10,30].

  36. Proximity of Objects • Proximity of objects with a number of attributes is typically defined by combining the proximities of individual attributes. • d=distance(student ID)+distance(Grade level)+distance(Age) • Student ID is a nominal attribute, Grade level is an ordinal attribute, and Age is a ratio/interval attribute.

  37. Proximity of Objects • Distance (dissimilarity) of two nominal attribute values. • Let p and q be two nominal attribute values; we define the distance between them as $d = 0$ if $p = q$ and $d = 1$ if $p \neq q$. • d=distance(student ID)+distance(Grade level)+distance(Age) =1+distance(Grade level)+distance(Age)

  38. Proximity of Objects • Distance (dissimilarity) of two ordinal attribute values. • Map each value of an ordinal attribute to an integer from 0 to n-1, where n is the number of values. • Let p and q be two mapped ordinal attribute values; we define the distance between them as $d = |p - q|/(n - 1)$. • Suppose that Grade level={A,B,C,D,E}, so map the values to: A=4, B=3, C=2, D=1, E=0.

  39. Proximity of Objects • Distance (dissimilarity) of two ordinal attribute values. • Suppose that Grade level={A,B,C,D,E}, so map the values to: A=4, B=3, C=2, D=1, E=0. • distance(Grade level)=|C-A|/(n-1)=|2-4|/(5-1)=0.5 • d=distance(student ID)+distance(Grade level)+distance(Age) =1+0.5+distance(Age)

  40. Proximity of Objects • Distance (dissimilarity) of two ratio attribute values. • Let p and q be two ratio attribute values; we define the distance between them as $d = |p - q|$. • (Note: the distance of two interval attribute values can also be computed using this formula.) • distance(age)=|23-20|=3 • Furthermore, we transform it to [0,1]. • Suppose that the age of each student falls in [10,30]. So, the distance interval of the age is [0, 20].

  41. Proximity of Objects • Distance (dissimilarity) of two ratio attribute values. • distance(age)=|23-20|=3 • Furthermore, we transform it to [0,1]. • Suppose that the age of each student falls in [10,30]. So, the distance interval of the age is [0, 20]. • Transformed distance(age)=(3-0)/(20-0)=0.15 • d=distance(student ID)+distance(Grade level)+distance(Age) =1+0.5+0.15=1.65
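
The worked example can be reproduced with one small function per attribute type. This is a sketch under assumptions: the student IDs are placeholders (the original table is not part of the transcript), the function names are illustrative, and the three per-attribute distances are summed without weights, as in the slides.

```python
GRADES = ["E", "D", "C", "B", "A"]   # list index gives the mapping E=0 ... A=4

def nominal_distance(p, q):
    """0 if the two values are equal, 1 otherwise."""
    return 0 if p == q else 1

def ordinal_distance(p, q, ordered_values):
    """|rank(p) - rank(q)| / (n - 1), where n is the number of values."""
    n = len(ordered_values)
    return abs(ordered_values.index(p) - ordered_values.index(q)) / (n - 1)

def ratio_distance(p, q, low, high):
    """|p - q| rescaled from [0, high - low] to [0, 1]."""
    return abs(p - q) / (high - low)

def student_distance(a, b):
    return (nominal_distance(a["id"], b["id"])
            + ordinal_distance(a["grade"], b["grade"], GRADES)
            + ratio_distance(a["age"], b["age"], 10, 30))

# IDs are hypothetical; grades and ages are those used in the worked example.
a = {"id": "S1", "grade": "A", "age": 20}
b = {"id": "S2", "grade": "C", "age": 23}

d = student_distance(a, b)   # 1 + 0.5 + 0.15, approximately 1.65
print(d)
```

Because each component lies in [0, 1], no single attribute dominates the sum merely because of its scale.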

  42. Proximity of Objects • Exercise. • What is the dissimilarity of the two objects in the following table? • Suppose that Grade level={A,B,C,D,E,F} and the age of each student falls in [15,30].

  43. Proximity of Objects • In many cases, each object has only a number of ratio attributes. • Example: two cuboids, each described by length, width, and height. • How to compute the distance between two objects with a number of ratio attributes?

  44. Euclidean Distance • Euclidean Distance: $d(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$, where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the kth attributes (components) of data objects p and q. • Standardization is necessary if scales differ.
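
The Euclidean distance is the square root of the sum of squared differences over all attributes. A minimal sketch follows; the two cuboids and their measurements are made up for illustration.

```python
import math

def euclidean(p, q):
    """Square root of the sum of squared attribute differences."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of attributes")
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

# Two hypothetical cuboids described by (length, width, height).
box_a = (1.0, 2.0, 2.0)
box_b = (4.0, 6.0, 2.0)

print(euclidean(box_a, box_b))  # sqrt(9 + 16 + 0) = 5.0
```

If one attribute were measured in millimeters and the others in meters, it would dominate the sum, which is why the attributes should be standardized first when scales differ.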

  45. Euclidean Distance • [Figure: distance matrix]

  46. Minkowski Distance • Minkowski Distance is a generalization of Euclidean Distance: $d(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}$, where r is a parameter, n is the number of dimensions (attributes), and $p_k$ and $q_k$ are, respectively, the kth attributes (components) of data objects p and q.

  47. Minkowski Distance: Examples • r = 1. City block (Manhattan, taxicab, $L_1$ norm) distance. • A common example of this is the Hamming distance, which is just the number of bits that differ between two binary vectors • r = 2. Euclidean distance ($L_2$ norm). • r → ∞. “Supremum” ($L_{max}$ norm, $L_\infty$ norm) distance. • This is the maximum difference between corresponding components of the vectors • Do not confuse r with n, i.e., all these distances are defined for all numbers of dimensions.
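
The three special cases can be obtained from one function that takes r as a parameter. In this sketch, passing `math.inf` for r selects the supremum distance; the sample points and the function name are illustrative.

```python
import math

def minkowski(p, q, r):
    """Minkowski distance with parameter r; r = math.inf gives the supremum."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of attributes")
    diffs = [abs(pk - qk) for pk, qk in zip(p, q)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1 / r)

# Hypothetical two-dimensional points.
p, q = (0, 2), (3, 6)
print(minkowski(p, q, 1))         # city block: 3 + 4
print(minkowski(p, q, 2))         # Euclidean: sqrt(9 + 16)
print(minkowski(p, q, math.inf))  # supremum: max(3, 4)

# On binary vectors, r = 1 counts the differing bits (Hamming distance).
x, y = (1, 0, 1, 1), (0, 0, 1, 0)
print(minkowski(x, y, 1))
```

Note that r changes how the differences are combined, while n (here 2 or 4) is simply the length of the vectors.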

  48. Minkowski Distance • [Figure: distance matrix]

  49. Common Properties of a Distance • Distances, such as the Euclidean distance, have some well-known properties. • d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q. (Positive definiteness) • d(p, q) = d(q, p) for all p and q. (Symmetry) • d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality) • Here d(p, q) is the distance (dissimilarity) between points (data objects) p and q. • A distance that satisfies these properties is a metric
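
The three properties can be spot-checked numerically on a finite sample of points. Such a check can refute that a function is a metric but cannot prove it. The helper name `is_metric_on`, the tolerance, and the sample points are assumptions made for this sketch; `math.dist` (Python 3.8+) is the standard-library Euclidean distance.

```python
import itertools
import math

def is_metric_on(points, dist, tol=1e-12):
    """Check the three metric properties on a finite sample of points."""
    for p, q in itertools.product(points, repeat=2):
        d = dist(p, q)
        if d < 0 or (d == 0) != (p == q):        # positive definiteness
            return False
        if abs(d - dist(q, p)) > tol:            # symmetry
            return False
    for p, q, r in itertools.product(points, repeat=3):
        if dist(p, r) > dist(p, q) + dist(q, r) + tol:   # triangle inequality
            return False
    return True

def squared_euclidean(p, q):
    return math.dist(p, q) ** 2

# Hypothetical sample points.
points = [(0, 2), (2, 0), (3, 1), (5, 1)]

print(is_metric_on(points, math.dist))          # True
print(is_metric_on(points, squared_euclidean))  # False
```

Squared Euclidean distance fails the triangle inequality on this sample: from (0, 2) to (5, 1) it is 26, but going through (3, 1) costs only 10 + 4.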

  50. Similarity Between Binary Vectors • Common situation is that objects, p and q, have only binary attributes
