Data Mining Chapter 2 Input: Concepts, Instances, and Attributes


  1. Data Mining Chapter 2 Input: Concepts, Instances, and Attributes Kirk Scott

  2. Hopefully the idea of instances and attributes is clear • Assuming there is something in the data to be mined, either the data itself is the concept, or the concept is inherent in it • Earlier, data mining was defined as finding a structural representation • Essentially the same idea is now expressed as finding a concept description

  3. Concept Description • The concept description needs to be: • Intelligible • It can be understood, discussed, disputed • Operational • It can be applied to actual examples

  4. 2.1 What’s a Concept?

  5. Reiteration of Types of Discovery • Classification • Prediction • Clustering • Outliers • Association • Each of these is a concept • Successful accomplishment of these for a data set is a concept description

  6. Recall Examples • Weather, contact lenses, iris, labor contracts • All were essentially classification problems • In general, the assumption is that classes are mutually exclusive • In complicated problems, data sets may be classified in multiple ways • This means individual instances can be “multilabeled”

  7. Supervised Learning • Classification learning is supervised • There is a training set • A structural representation is derived by examining a set of instances where the classification is known • How to test this? • Apply the results to another data set with known classifications
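The train-then-test idea on this slide can be sketched in a few lines. This is a minimal illustration, not Weka's machinery: the "classifier" is deliberately trivial (it just learns the majority class), and the weather-style instances are invented for the example.

```python
# A minimal sketch of supervised learning: derive a structure from a
# training set with known classes, then test it on a separate data
# set whose classes are also known. Data here is hypothetical.

def train_majority(instances):
    """Learn the simplest possible structure: the majority class."""
    labels = [label for _, label in instances]
    return max(set(labels), key=labels.count)

def accuracy(predicted_label, test_instances):
    """Fraction of held-out instances whose known class matches."""
    hits = sum(1 for _, label in test_instances if label == predicted_label)
    return hits / len(test_instances)

train = [({"outlook": "sunny"}, "yes"), ({"outlook": "rainy"}, "no"),
         ({"outlook": "overcast"}, "yes")]
test = [({"outlook": "sunny"}, "yes"), ({"outlook": "rainy"}, "yes")]

majority = train_majority(train)   # -> "yes"
print(accuracy(majority, test))    # -> 1.0
```

A real classifier would of course learn a richer structure than a single label, but the evaluation step, applying the result to a second labeled data set, is the same.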

  8. Association Rules • In any given data set there can be many association rules • The total may approach n(n – 1) / 2 for n attributes • The book doesn’t use the terms support and confidence, but it discusses these concepts • These terms will be introduced

  9. Support for Association Rules • Let an association rule X = (x1, x2, …, xi) → y be given in a data set with m instances • The support for X → y is the count of the number of instances where the combination of x values, X, occurs in the data set, divided by m • In other words, the association rule may be interesting if it occurs frequently enough

  10. Confidence for Association Rules • Confidence here is based on the statistical use of the term • The confidence for X → y is the count of the number of occurrences in the data set where this relationship holds true, divided by the number of occurrences of X overall • The book describes this idea as accuracy • In other words, the more likely it is that X determines y, the more interesting the association

  11. Clustering • We haven’t gotten the details yet, but this is an interesting data mining problem • Given a data set without predefined classes, is it possible to determine classes that the instances fall into? • Having determined the classes, can you then classify future instances into them? • Outliers are instances that you can definitely say do not fall into any of the classes

  12. Numerical Prediction • This is a variation on classification • Given n attribute values, determine the (n + 1)st attribute value • Recall the CPU performance problem • It would be a simple matter to dream up sample data where the weather data predicted how long you would play rather than a simple yes or no • (The book does so)
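Following the slide's suggestion, here is some dreamed-up sample data where the numeric "class" is minutes played, predicted from temperature. The least-squares fit is just one simple way to do numerical prediction; the numbers are invented.

```python
# Numerical prediction sketched with made-up data: the (n+1)st value
# is minutes played, predicted from temperature by a least-squares
# line fit (pure Python, no libraries).

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

temps   = [60, 65, 70, 75, 80]   # attribute values (hypothetical)
minutes = [20, 35, 50, 65, 80]   # numeric class to predict

slope, intercept = fit_line(temps, minutes)
print(slope * 72 + intercept)    # predicted minutes at 72 degrees -> 56.0
```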

  13. 2.2 What’s in an Example?

  14. The authors are trying to present some important ideas • In case their presentation isn’t clear, I present it here in a slightly different way • The basic premise goes back to this question: • What form does a data set have to be in in order to apply data mining techniques to it?

  15. Data Sets Should Be Tabular • The simple answer based on the examples presented so far: • The data has to be in tabular form, instances with attributes • The remainder of the discussion will revolve around questions related to normalization in db

  16. Not All Data is Naturally Tabular • Some data is not most naturally represented in tabular form • Consider OO db’s, where the natural representation is tree-like • How should such a representation be converted to tabular form that is amenable to data mining?

  17. Correctly Normalized Data May Fall into Multiple Tables • You might also have data which naturally falls into >1 table • Or, you might have data (god forbid) that has been normalized into >1 table • How do you make it conform to the single table model (instances with attributes) for data mining?

  18. Tree-like data and multi-table data may be related questions • It would not be surprising to find that a conversion of a tree to a table resulted in >1 table

  19. Denormalization • The situation goes against the grain of correct database design • The classification, association, and clustering you intend to do may cross db entity boundaries • The fact that you want to do mining on a single tabular representation of the data means you have to denormalize

  20. In short, you combine multiple tables back into one table • The end result is the monstrosity that is railed against in normalization theory: • The monolithic, one-table db
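The "combine multiple tables back into one" step can be sketched as a pk-fk join collapsed into one flat table. The table names and fields below are invented; the point is only that every row on the "many" side gets the "one" side's attributes copied in, producing exactly the redundancy normalization exists to avoid.

```python
# Denormalization sketched as a pk-fk join flattened into one table,
# ready for mining. Tables and fields are hypothetical.

people = {1: {"name": "Ann", "gender": "f"},
          2: {"name": "Bob", "gender": "m"}}

orders = [{"person_id": 1, "item": "skis"},
          {"person_id": 1, "item": "boots"},
          {"person_id": 2, "item": "sled"}]

# Join each "many"-side row with its "one"-side attributes:
flat = [{**people[o["person_id"]], **o} for o in orders]

for row in flat:
    print(row)
# Note that Ann's attributes repeat in her two rows -- the
# monolithic one-table form the slides describe.
```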

  21. The Book’s Family Examples • Family relationships are typically viewed in tree-like form • The book considers a family tree and the relationship “is a sister of” • The factors for inferring sisterhood: • Two people, one female • The same (or at least one common) parents for both people

  22. Two People in the Same Table • Suppose you want to do this in tabular form • You end up with the two people who might be in a sisterhood relationship in the same table • These occurrences of people are matched with a classification, yes or no

  23. Recall that according to normalization, a truly one-to-one relationship can be stored in a single table • Pairings of all people would result in lots of instances/rows where the classification was simply no • This isn’t too convenient

  24. In theory, you might restrict your attention only to those rows where the classification was yes • This restriction is known as the “closed world assumption” in data mining • Unfortunately, it is hardly ever the case that you have a problem where this kind of simplifying assumption applies • You have to deal with all cases

  25. Two People with Attributes in the Same Table • Suppose the two people are only listed by name in the table, without parent information • The classification might be correct, but this is of no help • There are no attributes to infer sisterhood from • The table has to include attributes about the two people, namely parent information
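Slides 22–25 can be made concrete with a tiny sketch: every pair of people becomes one row, the row carries both people's attributes (including parents), and the class says whether person 2 is a sister of person 1. The family below is invented for illustration.

```python
# The sister-of relation flattened into one table, per the slides:
# each pair of people is an instance with parent attributes and a
# yes/no class. The three people are hypothetical.

people = {
    "ann": {"gender": "f", "parent1": "pam", "parent2": "tom"},
    "eve": {"gender": "f", "parent1": "pam", "parent2": "tom"},
    "bob": {"gender": "m", "parent1": "pam", "parent2": "tom"},
}

rows = []
for a, pa in people.items():
    for b, pb in people.items():
        if a == b:
            continue
        label = "yes" if (pb["gender"] == "f"
                          and pa["parent1"] == pb["parent1"]
                          and pa["parent2"] == pb["parent2"]) else "no"
        rows.append({"person1": a, "gender1": pa["gender"],
                     "person2": b, "gender2": pb["gender"],
                     "parents2": (pb["parent1"], pb["parent2"]),
                     "sister_of": label})

yes = sum(r["sister_of"] == "yes" for r in rows)
print(yes, "yes rows out of", len(rows))  # 4 yes rows out of 6
```

Even with only three people the table has six rows, most of them redundant restatements of the same people's attributes, which is the normalization complaint the next slides raise.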

  26. The Connection with Normalization • There is a problem with denormalized data mining which is completely analogous to the normalization problem • Suppose you have two people in the same instance (the same row) with their attributes • By definition, you will have stray dependencies • The Person identifiers determine the attribute values

  27. So far we’ve considered classification • However, what would happen if you mined for associations? • The algorithm would find the perfectly true, but already known associations between the pk identifiers of the people and their attribute fields • This is not helpful • It’s a waste of effort

  28. Recursive Relationships • Recall the monarch and product-assembly examples from db • These give tables in recursive relationships with themselves or others • In terms of the book’s example, how do you deal with parenthood when there is a potentially unlimited sequence of ancestors?

  29. In general, the answer is that you would need recursive rules • Mining recursive rules is a step beyond classification, association, etc. • The good news is that this topic will not be covered further • It’s simply of interest to know that such problems can arise

  30. One-to-Many Relationships • A denormalized table might be the result of joining two tables in a pk-fk relationship • If the classification is on the “one” side of the relationship, then you have multiple instances in the table which are not independent • In data mining this is called a multi-instance situation

  31. The multiple instances belonging to one classification together actually form one example of the concept under consideration in such a problem • Data mining algorithms have been developed to handle cases like these • They will be presented with the other algorithms later

  32. Summary of 2.2 • The fundamental practical idea here is that data sets have to be manipulated into a form that’s suitable for mining • This is the input side of data mining • The reality is that denormalized tables may be required • Data mining can facetiously be referred to as file mining, since the required form does not necessarily agree with db theory

  33. The situation can be restated in this way: • Assemble the query results first; then mine them • This leads to an open question: • Would it be possible to develop a data mining system that could encompass >1 table, crawling through the pk-fk relationships like a query, finding associations?

  34. 2.3 What’s in an Attribute?

  35. This subsection falls into two parts: • 1. Some ideas that go back to db design and normalization questions • 2. Some ideas having to do with data type

  36. Design and Normalization • You could include different kinds (subtypes) of entities in the same table • To make this work you would have to include all of the fields of all of the kinds of entities • The fields that didn’t apply to a particular instance would be null • The book uses transportation vehicles as an example: ships and trucks

  37. You could also have fields in a table that depend on each other (ack) • The book gives married T/F and spouse’s name as examples • Again, you can handle this with null values

  38. Data Types • The simplest distinction is numeric vs. categorical • Some synonyms for categorical: symbolic, nominal, enumerated, discrete • There are also two-valued variables, known as Boolean or dichotomous

  39. Spectrum of Data Types • 1. Nominal = unordered, unmeasurable named categories • Example: sunny, overcast, rainy • 2. Ordinal = named categories that can be put into a logical order but which have no intrinsic numeric value and no defined distance between them (support < or >) • Example: hot, mild, cool

  40. 3. Interval = numeric values where the distance between them makes sense (support subtraction) but other operations do not • Example: Time expressed in years

  41. 4. Ratio = numeric values where all operations make sense • These are real or continuous (or possibly integer) values on a scale with a natural 0 point • Example: Physical distance

  42. In principle, data mining has to handle all possible types of data • In practice, applied systems typically have some useful subset of the type distinctions given above • You adapt your data to the types provided

  43. 2.4 Preparing the Input

  44. In practice, preparing the data can take more time and effort than doing the mining • Data needs to be in the format required by whatever mining software you’re using • In Weka, this is ARFF, the Attribute-Relation File Format
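For a sense of what ARFF looks like, here is a cut-down weather data set built as a string: a header declaring each attribute (nominal values listed in braces, `numeric` for numbers), followed by an `@data` section listing the instances. The three instances are abbreviated sample data, not the book's full table.

```python
# A minimal ARFF file for a cut-down weather data set. The header
# declares each attribute's type; @data then lists the instances,
# one comma-separated line per instance.

arff = "\n".join([
    "@relation weather",
    "@attribute outlook {sunny, overcast, rainy}",
    "@attribute temperature numeric",
    "@attribute play {yes, no}",
    "@data",
    "sunny,85,no",
    "overcast,83,yes",
    "rainy,70,yes",
])
print(arff)
```

A file in this form can be opened directly in the Weka Explorer.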

  45. Real data tends to be low in quality • Think data integrity and completeness • “Cleaning” the data before mining it pays off

  46. Weka • From Wikipedia, the free encyclopedia

  47. The Weka or woodhen (Gallirallus australis) is a flightless bird species of the rail family. It is endemic to New Zealand, where four subspecies are recognized. Weka are sturdy brown birds, about the size of a chicken. As omnivores, they feed mainly on invertebrates and fruit. Weka usually lay eggs between August and January; both sexes help to incubate.

  48. Behaviour • … • Where the Weka is relatively common, their furtive curiosity leads them to search around houses and camps for food scraps, or anything unfamiliar and transportable.[2]
