
Data Mining Chapter 9 Moving on: Applications and Beyond



  1. Data Mining, Chapter 9: Moving on: Applications and Beyond • Kirk Scott

  2. So-called machine learning is a broad topic with many ramifications • Data mining is just an applied subset of this overall field • The book says the algorithms aren’t “abstruse or complicated” but they’re also not “completely obvious and trivial”

  3. The book identifies the challenge of the future as lying in the realm of applications • In this sense, data mining has something in common with database management systems • For some people the interesting part is figuring out how to apply the techniques to a given problem

  4. The book notes that the sources of these applications are people working in the problem domains • People specializing in data mining will continue to develop new algorithms • But this doesn’t happen in a vacuum • Much of the real, interesting work will come out of applications

  5. 9.1 Applying Data Mining • The book lists the “Top 10” data mining algorithms • These are given in Table 9.1, shown on the following overhead • Recall that number 1, C4.5, was for decision tree induction • Notice also that the majority of these algorithms are for classification

  6. Progress in Data Mining • There is a pitfall of applying algorithms to data sets, comparing results, and drawing broad conclusions about what is best in certain problem domains • Reasoning from the specific to the general, without further information, is not necessarily correct

  7. Even statistically significant differences in outcomes may not be important in practice • Quite often, simple methods get reasonably good results • Complicated methods have their own shortcomings, including computational cost
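The gap between "statistically significant" and "important in practice" can be made concrete with a paired test on per-fold accuracies. A minimal sketch, with made-up fold scores (the numbers are illustrative, not from the book):

```python
# Sketch: a paired t-test on per-fold accuracies of two classifiers.
# The fold scores below are hypothetical illustration data.
import math
import statistics

scores_simple  = [0.81, 0.79, 0.83, 0.80, 0.82, 0.78, 0.81, 0.80, 0.79, 0.82]
scores_complex = [0.82, 0.80, 0.84, 0.79, 0.83, 0.80, 0.82, 0.81, 0.80, 0.83]

diffs = [b - a for a, b in zip(scores_simple, scores_complex)]
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)
t = mean_d / (sd_d / math.sqrt(len(diffs)))   # paired t statistic
print(f"mean difference {mean_d:.3f}, t = {t:.2f}")
```

Here the t statistic comes out large, so the difference is "significant" in the statistical sense; yet the underlying gap is under one percentage point of accuracy, which may be irrelevant in an application.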

  8. Something to always keep in mind: • There may just be a lot of noise in the data • Or, there may just be a lot of statistical variation • There ARE limits on the ability to draw inferences from data

  9. Also, training sets, by definition, are historical • They can’t perfectly reflect new data in a changing world

  10. Another point to consider: • Recall that some classification schemes give probabilities that an instance falls into a class • However, in reality, classification categories might not be mutually exclusive • There may be data points in the training set which are partially one and partially another

  11. However, the training set is considered to have instances that are rigidly classified as one or the other • The training set doesn’t reflect probability • Thus, the training set which you are basing your inferences on already contains inaccuracies • You might think of this as conceptual noise resulting from forcing an instance into one class

  12. The book admits that tweaking (picking parameters) can affect performance • As a result, small empirical differences in data mining results do not necessarily reflect actual differences in the quality of the algorithms • In application, one might have been more successfully tweaked than another

  13. Another interesting point on the Occam’s Razor/Epicurus divide: • Complicated methods may be harder to criticize than simple ones • That fact alone doesn’t make them better • The authors still favor “All else being equal, simpler is better”.

  14. 9.2 Learning from Massive Datasets • Basic constraints: Computational space and time • Data stream methods are unaffected by this (more later) • In other types of algorithms implementation techniques like hashing, caching, indexing and other data structures may be critical to practicality

  15. Massive data sets typically imply large numbers of instances • Any algorithm with time complexity > linear will eventually be swamped by massive data • Depending on the algorithm, too many attributes may also render it impractical because the computational complexity is in the dimension of the problem space

  16. General ways to adapt to large data sets: • Train on a sample or subset of the data only • Do a parallel implementation of the algorithm in a multi-processor environment • Invent new algorithms…

  17. Training on samples or subsets can give you as good a result as training on the whole set • The law of diminishing returns says that after a certain point, more instances don’t give significant increases in accuracy

  18. Training on Samples • There are two ways of looking at this: • If a problem is simple, a small data set may encapsulate all there is to know about it • If the problem is complex but the data mining algorithm is simple, the algorithm may max out on its predictive power no matter how many training instances there are
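The diminishing-returns effect can be shown with a toy experiment: fit a 1R-style one-threshold classifier on growing samples of a synthetic 1-D dataset and watch holdout accuracy level off. All data here is synthetic; this is a sketch, not an experiment from the book.

```python
# Synthetic data: class 1 when x > 0.6, with 10% label noise.
import random

random.seed(0)

def make_instance():
    x = random.random()
    label = 1 if x > 0.6 else 0
    if random.random() < 0.1:          # 10% label noise
        label = 1 - label
    return x, label

train = [make_instance() for _ in range(2000)]
test = [make_instance() for _ in range(1000)]

def fit_threshold(data):
    # Pick the threshold (on a coarse grid) with the fewest training errors.
    best_t, best_err = 0.0, len(data) + 1
    for i in range(101):
        cand = i / 100
        err = sum(1 for x, y in data if (1 if x > cand else 0) != y)
        if err < best_err:
            best_t, best_err = cand, err
    return best_t

for n in (25, 100, 400, 2000):
    t = fit_threshold(train[:n])
    acc = sum(1 for x, y in test if (1 if x > t else 0) == y) / len(test)
    print(f"n={n:5d}  threshold={t:.2f}  test accuracy={acc:.3f}")
```

With label noise capping accuracy at about 90%, a few hundred instances already get close to the ceiling; the remaining instances buy very little.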

  19. Parallelization • Algorithms like nearest neighbor, tree formation, etc. can be parallelized • Figuring out how to parallelize is a challenge in itself • And parallelization is no defense against combinatorial explosion • If the complexity is exponential but the growth in the number of processors is linear, you eventually lose
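The basic parallelization pattern for nearest neighbor is data partitioning: each worker scans its own slice, then the local winners are combined. A minimal sketch (threads are used for brevity; a real CPU-bound version would use processes):

```python
# Sketch: parallel nearest-neighbor search by partitioning the instances.
from concurrent.futures import ThreadPoolExecutor
import math

instances = [(i / 100.0, (i * 7 % 100) / 100.0) for i in range(10000)]
query = (0.5, 0.5)

def nearest_in(chunk):
    # Local nearest neighbor within one partition of the data.
    return min(chunk, key=lambda p: math.dist(p, query))

chunks = [instances[i::4] for i in range(4)]      # 4 partitions
with ThreadPoolExecutor(max_workers=4) as pool:
    local_bests = list(pool.map(nearest_in, chunks))

# Combine: the global nearest neighbor is the best of the local winners.
best = min(local_bests, key=lambda p: math.dist(p, query))
print("nearest neighbor:", best)
```

Note the structure of the slide's warning: this gives at best a 4x speedup on a linear scan; it does nothing for an algorithm whose cost is exponential in the number of attributes.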

  20. New Algorithms • This is where research comes in • In general, the sky is the limit • In some situations (tree building, for example) there is a provable floor on complexity • Even here, new methods may be simpler and still have the characteristic of approximating the solutions given by deterministic methods

  21. A side note on this: • Virtually everything we’ve looked at has been a heuristic anyway—including greedy tree formation • Exhaustive search would give genuinely optimal results • Everything else is essentially an approximation approach

  22. Another aspect to improving algorithm performance: • A lot of the data in a set, both instances and attributes, may be redundant • Simply finding ways to throw out useless data may improve performance • This idea will recur in the next section

  23. 9.3 Data Stream Learning • For data streams, the overriding assumption for algorithms is that each instance will be examined at most one time • The model of the data is updated incrementally based on the incoming instance • Then the instance is discarded

  24. An example of an application area is sensor readings • As long as the sensor is active, the readings just keep on coming • It seems that things like Web transactions might be another example

  25. Both time and space are issues • Discarding instances saves space • The examination of each instance also has to be fixed in time, with an average rate of examination no less than the arrival rate of instances • This constraint rules out major changes or reorganization of the model
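The constraint on slide 25 (constant work per instance, then discard it) is easiest to see in a concrete incremental model. A sketch using Welford's online mean/variance as the "model", which is my choice of illustration rather than an algorithm from the book:

```python
# Each reading updates the model in O(1) time and O(1) space
# (Welford's online mean/variance), then is discarded.
class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # running sum of squared deviations

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for reading in (4.0, 7.0, 13.0, 16.0):   # stands in for a sensor stream
    stats.update(reading)                # examine once, update, discard

print(stats.n, stats.mean, stats.variance)
# -> 4 10.0 30.0
```

No instance is ever stored, and each update costs the same regardless of how long the stream has been running, so the processing rate can keep up with the arrival rate.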

  26. Modifications to the model resulting from an instance either have to be counted in the examination time • Or they have to occur infrequently enough that they are averaged out over the examination time of multiple instances

  27. You may come up with ways of throwing out unneeded data • But the goal is not to throw out data simply because you couldn’t handle it fast enough • (Although stay tuned for a later comment on throwing out data)

  28. Algorithms That Are Directly Suited to Data Streams • 1R • Naïve Bayes • Perceptrons • Multi-layer neural networks • Rules with exceptions (although you can’t simply accumulate unlimited exceptions)
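The perceptron on the list above is a good example of why these algorithms suit streams: its training rule is already one update per arriving instance. A minimal sketch on a toy, linearly separable stream (the data is invented for illustration):

```python
# Streaming perceptron: one weight update per instance, then discard it.
def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

w, b = [0.0, 0.0], 0.0
stream = [([1.0, 0.0], 1), ([0.0, 1.0], 0),
          ([1.0, 1.0], 1), ([0.0, 0.0], 0)] * 20

for x, y in stream:                     # each instance seen exactly once
    error = y - predict(w, b, x)        # 0 if correct, +/-1 otherwise
    if error:
        w = [wi + error * xi for wi, xi in zip(w, x)]
        b += error

print(w, b)
```

1R and Naïve Bayes fit the same mold: their models are just counts, and counts can be incremented as instances arrive.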

  29. The book notes that other kinds of algorithms can be adapted to data streams • It spends some time explaining how this might be done with trees • The details are obscure and not important at this late stage in the semester

  30. The key insight about throwing out instances is this: • Will you lose important data if you throw out instances? • In an unending data stream, if information or a pattern is significant, it will recur • So in the long run, throwing out an instance doesn’t hurt the model that results from the algorithm

  31. 9.4 Incorporating Domain Knowledge • The overall topic here is metadata—data about the data • How to put this to use is an open area of research • There can be various kinds of relationships between attributes • They include semantic, causal, and functional

  32. Semantic relationships can be summarized in this way: • If one attribute is included in a rule, another should also be included • Informally, in the problem domain, this means that these two attributes aren’t (fully) meaningful without each other • Somehow this could be included as a condition in a data mining scheme

  33. The idea of causality also comes from the problem domain • The point is that if causality exists, the data mining scheme should be able to detect it • This causality may go through multiple attributes • A → B → C → …

  34. Functional dependency in data mining refers to the same concept as in database design • The point with respect to metadata is that if the functional dependency is already known, it’s not productive to have data mining “discover” it

  35. On the one hand, there may be ways of applying data mining to normalization • On the other hand, if that’s not your purpose, functional dependencies that are mined will tend to have high confidence and most likely high support • These associations will end up outweighing other, new associations that an algorithm might mine
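To see why a known functional dependency drowns out genuinely new associations: it holds for every row, so mined as a rule it has 100% confidence. A sketch with a hypothetical zip → city dependency (the table is invented):

```python
# Check whether attribute `lhs` functionally determines attribute `rhs`.
rows = [
    {"zip": "99501", "city": "Anchorage", "sale": "book"},
    {"zip": "99501", "city": "Anchorage", "sale": "pen"},
    {"zip": "99701", "city": "Fairbanks", "sale": "book"},
]

def holds_fd(rows, lhs, rhs):
    seen = {}
    for r in rows:
        # First occurrence of each lhs value records its rhs value;
        # any later disagreement breaks the dependency.
        if seen.setdefault(r[lhs], r[rhs]) != r[rhs]:
            return False
    return True

print(holds_fd(rows, "zip", "city"))   # zip determines city
print(holds_fd(rows, "sale", "city"))  # sale does not
```

Filtering out rules whose antecedent/consequent pair passes a check like this, before mining, keeps the output focused on associations that are actually new.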

  36. How should metadata be represented? • A straightforward approach is to list what you already know about the data set using rules • Logical deduction schemes can produce other rules resulting from the ones you already know
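The "logical deduction" step on this slide can be sketched as simple forward chaining: repeatedly fire any rule whose premises are all known until nothing new is derived. The domain facts below are hypothetical:

```python
# Forward chaining over known rules to derive further rules/facts
# before any mining is run.
rules = [
    ({"mammal"}, "warm_blooded"),
    ({"warm_blooded", "lays_eggs"}, "monotreme"),
]
facts = {"mammal", "lays_eggs"}

changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        if premises <= facts and conclusion not in facts:
            facts.add(conclusion)
            changed = True

print(sorted(facts))
# -> ['lays_eggs', 'mammal', 'monotreme', 'warm_blooded']
```

Anything derivable this way is already implied by the metadata, so a mining run need not rediscover it.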

  37. The data mining scheme works with instances to produce other new rules that you didn’t know before • The bodies of rules, merged together, give the sum total of knowledge gleaned about the problem

  38. 9.5 Text Mining • Compare text mining to standard data mining • Data sets roughly parallel database tables • There are identifiable instances and well-defined attributes • Text is emphatically not structured in this way

  39. An interesting comparison: • Data mining is supposed to find information about data where that information was not known • By definition, text is different • In text, the information is out in the open, in the form of language • It is simply not in a form suitable for easy computerized analysis

  40. Data mining can be said to have as its goal the acquisition of “actionable” information • Based on a training set you can classify or cluster future instances, for example • In a derivative way, you can make decisions that make money, etc. • Another goal of data mining is to develop a data model • Again, this is out in the open with text

  41. There are several applications of text mining: • Text summarization, document classification, and clustering • Language identification and authorship ascription • Assigning key descriptive phrases to documents • Metadata, entity, and information extraction

  42. Document classification can be done based on the (count of) occurrences of words in the document • There is a feature extraction aspect to this problem • Frequent words don’t help classify • Infrequent words don’t help classify • There is still an overwhelming number of words in the middle that have value
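The "middle band" of useful words can be found by document frequency: drop words that appear in nearly every document and words that appear in almost none. A sketch on a toy four-document collection (the documents and thresholds are invented for illustration):

```python
# Feature extraction by document frequency: keep only mid-frequency words.
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as the market closed",
    "the market rallied on trade news",
]

df = Counter()                          # document frequency of each word
for doc in docs:
    df.update(set(doc.split()))         # count each word once per document

n = len(docs)
features = {w for w, c in df.items()
            if 0.25 < c / n < 0.9}      # discard the rare and the ubiquitous
print(sorted(features))
# -> ['cat', 'market', 'on']
```

"the" is dropped for being everywhere, the one-off words for being too rare; what survives is the vocabulary with discriminating value.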

  43. More complex methods step up from counting words alone • Context, word order, grammatical constructs, etc. all affect meaning • At the very least phrases might be mined instead of words • Natural language processing, syntax, and semantics may come into play

  44. Document classification may be done with predefined classes • Document clustering doesn’t have predefined classes • Mining techniques can be used to identify the language of a document • n-grams (n-letter sequences) correlate highly with different languages • n = 3 is usually sufficient for this
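The n = 3 case can be sketched directly: build a character-trigram profile per language, then score a new document by how much it overlaps each profile. The sample sentences are toy stand-ins for real training corpora:

```python
# Language identification by character trigram (n = 3) profiles.
from collections import Counter

def trigrams(text):
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

profiles = {
    "english": trigrams("the quick brown fox jumps over the lazy dog"),
    "german": trigrams("der schnelle braune fuchs springt ueber den hund"),
}

def identify(text):
    tg = trigrams(text)
    # Score each language by the trigram overlap with the document.
    return max(profiles, key=lambda lang: sum((tg & profiles[lang]).values()))

print(identify("the dog jumps"))
```

With real corpora the profiles would hold thousands of trigrams, but the principle is the same: languages have sharply different trigram distributions.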

  45. Authorship ascription is done by counting common (stylistic) words, not the content words which define classifications • A more complex approach would again do more than just count words

  46. Assignment of key phrases to a document corresponds to the problem of assigning subject headings in the library catalog • You start with established sets of phrases with defined meanings • The goal is to assign one or more of these phrases to the document

  47. Metadata extraction is a related idea with further ramifications • Is it possible to find specific information like author and title, automatically? • Can you extract useful identifying key words and phrases? • Note that the ability to do this may result in “actionable” information

  48. The next step is entity extraction • Not only do you want to extract more obvious things like the author and title: • You also want to identify any entities that are mentioned in the document

  49. How do you identify entities? • You can look them up in reference resources like dictionaries, lists of names, etc. • You may rely on simple things like capitalization or titles of address, etc. • You may search for regular expressions or use simple grammars for expressions
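The capitalization and regular-expression cues can be combined in a few lines. A sketch of lightweight entity spotting on an invented sentence; this is a toy, not a real NER system:

```python
# Entity spotting via capitalization runs and a date regular expression.
import re

text = ("Dr. Ada Lovelace met Charles Babbage in London "
        "on 5 June 1833 to discuss the Analytical Engine.")

# Runs of two or more capitalized words as candidate names/entities.
names = re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+", text)
# Day-Month-Year dates such as "5 June 1833".
dates = re.findall(r"\b\d{1,2}\s+[A-Z][a-z]+\s+\d{4}\b", text)

print(names)   # -> ['Ada Lovelace', 'Charles Babbage', 'Analytical Engine']
print(dates)   # -> ['5 June 1833']
```

The misses are as instructive as the hits: "London" is skipped because the pattern demands two capitalized words, and "Dr." would need a title-of-address list, which is exactly why real extractors layer dictionaries and grammars on top of such cues.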
