Data Mining Chapter 9 Moving on: Applications and Beyond. Kirk Scott. So-called machine learning is a broad topic with many ramifications. Data mining is just an applied subset of this overall field.
Data Mining • Chapter 9 • Moving on: Applications and Beyond • Kirk Scott
So-called machine learning is a broad topic with many ramifications • Data mining is just an applied subset of this overall field • The book says the algorithms aren’t “abstruse or complicated” but they’re also not “completely obvious and trivial”
The book identifies the challenge of the future as lying in the realm of applications • In this sense, data mining has something in common with database management systems • For some people the interesting part is figuring out how to apply the techniques to a given problem
The book notes that these applications come from people working in the problem domains • People specializing in data mining will continue to develop new algorithms • But this doesn’t happen in a vacuum • Much of the real, interesting work will come out of applications
9.1 Applying Data Mining • The book lists the “Top 10” data mining algorithms • These are given in Table 9.1, shown on the following overhead • Recall that number 1, C4.5, was for decision tree induction • Notice also that the majority of these algorithms are for classification
Progress in Data Mining • There is a pitfall of applying algorithms to data sets, comparing results, and drawing broad conclusions about what is best in certain problem domains • Reasoning from the specific to the general, without further information, is not necessarily correct
Even statistically significant differences in outcomes may not be important in practice • Quite often, simple methods get reasonably good results • Complicated methods have their own shortcomings, including computational cost
Something to always keep in mind: • There may just be a lot of noise in the data • Or, there may just be a lot of statistical variation • There ARE limits on the ability to draw inferences from data
Also, training sets, by definition, are historical • They can’t perfectly reflect new data in a changing world
Another point to consider: • Recall that some classification schemes give probabilities that an instance falls into a class • However, in reality, classification categories might not be mutually exclusive • There may be data points in the training set which are partially one and partially another
However, the training set is considered to have instances that are rigidly classified as one or the other • The training set doesn’t reflect probability • Thus, the training set which you are basing your inferences on already contains inaccuracies • You might think of this as conceptual noise resulting from forcing an instance into one class
The book admits that tweaking (picking parameters) can affect performance • As a result, small empirical differences in data mining results do not necessarily reflect actual differences in the quality of the algorithms • In application, one might have been more successfully tweaked than another
Another interesting point on the Occam’s Razor/Epicurus divide: • Complicated methods may be harder to criticize than simple ones • That fact alone doesn’t make them better • The authors still favor “All else being equal, simpler is better”.
9.2 Learning from Massive Datasets • Basic constraints: Computational space and time • Data stream methods are unaffected by this (more later) • In other types of algorithms, implementation techniques like hashing, caching, indexing, and other data structures may be critical to practicality
Massive data sets typically imply large numbers of instances • Any algorithm with worse than linear time complexity will eventually be swamped by massive data • Depending on the algorithm, too many attributes may also render it impractical, because the computational complexity grows with the dimension of the problem space
General ways to adapt to large data sets: • Train on a sample or subset of the data only • Do a parallel implementation of the algorithm in a multi-processor environment • Invent new algorithms…
Training on samples or subsets can give you as good a result as training on the whole set • The law of diminishing returns says that after a certain point, more instances don’t give significant increases in accuracy
Training on Samples • There are two ways of looking at this: • If a problem is simple, a small data set may encapsulate all there is to know about it • If the problem is complex but the data mining algorithm is simple, the algorithm may max out on its predictive power no matter how many training instances there are
Parallelization • Algorithms like nearest neighbor, tree formation, etc. can be parallelized • Not only do you have to figure out how to parallelize, but parallelization is also no defense against combinatorial explosion • If the complexity is exponential but the growth in the number of processors is linear, you eventually lose
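A small numeric sketch of why linear hardware growth loses to exponential work. The numbers are assumptions made up for illustration: the work is taken to be 2^n unit steps for problem size n, the processor count is taken to be 10·n, and the speedup is assumed to be perfect. The function name is hypothetical.

```python
def steps_per_processor(n: int, processors_per_unit: int = 10) -> float:
    """Idealized wall-clock steps when exponential work is split evenly
    over a processor pool that grows only linearly with n."""
    total_work = 2 ** n                    # assumed exponential cost
    processors = processors_per_unit * n   # assumed linear growth in hardware
    return total_work / processors


# Each processor's share keeps growing even though more processors are added.
for n in (10, 20, 30, 40):
    print(n, steps_per_processor(n))
```

Even with perfect speedup, the per-processor share grows at every step, so adding processors only delays the point at which the problem becomes impractical.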
New Algorithms • This is where research comes in • In general, the sky is the limit • In some situations (tree building, for example) there is a provable floor on complexity • Even here, new methods may be simpler and still have the characteristic of approximating the solutions given by deterministic methods
A side note on this: • Virtually everything we’ve looked at has been a heuristic anyway—including greedy tree formation • Exhaustive search would give genuinely optimal results • Everything else is essentially an approximation approach
Another aspect to improving algorithm performance: • A lot of the data in a set, both instances and attributes, may be redundant • Simply finding ways to throw out useless data may improve performance • This idea will recur in the next section
9.3 Data Stream Learning • For data streams, the overriding assumption for algorithms is that each instance will be examined at most one time • The model of the data is updated incrementally based on the incoming instance • Then the instance is discarded
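The examine-once, update, discard cycle can be sketched with a very small "model": a running mean and variance. This is only an illustration, not an algorithm from the book. The class name `RunningStats` is hypothetical, and the update rule used is Welford's method, chosen here because it needs constant time and space per instance.

```python
class RunningStats:
    """Running mean and variance, updated one instance at a time."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        """Fold one value into the model; the value is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in (2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0):  # stand-in for an unbounded stream
    stats.update(value)  # after this call the value can be discarded
```

The work per instance is fixed, which is the property the stream setting requires.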
An example of an application area is sensor readings • As long as the sensor is active, the readings just keep on coming • It seems that things like Web transactions might be another example
Both time and space are issues • Discarding instances saves space • The examination of each instance also has to be fixed in time, with an average rate of examination no less than the arrival rate of instances • This constraint rules out major changes or reorganization of the model
Modifications to the model resulting from an instance either have to be counted in the examination time • Or they have to occur infrequently enough that they are averaged out over the examination time of multiple instances
You may come up with ways of throwing out unneeded data • But the goal is not to throw out data simply because you couldn’t handle it fast enough • (Although stay tuned for a later comment on throwing out data)
Algorithms That Are Directly Suited to Data Streams • 1R • Naïve Bayes • Perceptrons • Multi-layer neural networks • Rules with exceptions (although you can’t simply accumulate unlimited exceptions)
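As a rough sketch of why Naïve Bayes suits data streams: the model is nothing more than counts, so each instance can update it and then be thrown away. The class below is hypothetical, handles nominal attributes only, and uses Laplace smoothing as an assumed design choice. The weather-style stream is made-up toy data.

```python
from collections import defaultdict


class StreamingNaiveBayes:
    """Naive Bayes over nominal attributes, trained one instance at a time."""

    def __init__(self) -> None:
        self.class_counts = defaultdict(int)
        self.value_counts = defaultdict(int)  # (attribute index, value, class) -> count
        self.values_seen = defaultdict(set)   # attribute index -> values seen so far

    def learn(self, attributes, label) -> None:
        """Update the counts from one instance; the instance itself is not kept."""
        self.class_counts[label] += 1
        for i, value in enumerate(attributes):
            self.value_counts[(i, value, label)] += 1
            self.values_seen[i].add(value)

    def predict(self, attributes):
        """Return the class with the highest Laplace-smoothed score."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -1.0
        for label, count in self.class_counts.items():
            score = count / total  # class prior
            for i, value in enumerate(attributes):
                numerator = self.value_counts.get((i, value, label), 0) + 1
                denominator = count + max(len(self.values_seen[i]), 1)
                score *= numerator / denominator
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = StreamingNaiveBayes()
stream = [  # made-up instances: (outlook, temperature) -> play
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("overcast", "hot"), "yes"),
    (("rainy", "cool"), "yes"),
]
for attributes, label in stream:
    model.learn(attributes, label)
```

Memory use depends on the number of distinct attribute values and classes, not on the number of instances seen.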
The book notes that other kinds of algorithms can be adapted to data streams • It spends some time explaining how this might be done with trees • The details are obscure and not important at this late stage in the semester
The key insight about throwing out instances is this: • Will you lose important data if you throw out instances? • In an unending data stream, if information or a pattern is significant, it will recur • So in the long run, throwing out an instance doesn’t hurt the model that results from the algorithm
9.4 Incorporating Domain Knowledge • The overall topic here is metadata—data about the data • How to put this to use is an open area of research • There can be various kinds of relationships between attributes • They include semantic, causal, and functional
Semantic relationships can be summarized in this way: • If one attribute is included in a rule, another should also be included • Informally, in the problem domain, this means that these two attributes aren’t (fully) meaningful without the other • Somehow this could be included as a condition in a data mining scheme
The idea of causality also comes from the problem domain • The point is that if causality exists, the data mining scheme should be able to detect it • This causality may go through multiple attributes • A → B → C…
Functional dependency in data mining refers to the same concept as in database design • The point with respect to metadata is that if the functional dependency is already known, it’s not productive to have data mining “discover” it
On the one hand, there may be ways of applying data mining to normalization • On the other hand, if that’s not your purpose, functional dependencies that are mined will tend to have high confidence and most likely high support • These associations will end up outweighing other, new associations that an algorithm might mine
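A minimal sketch of why a functional dependency shows up as a maximally confident rule. The function `rule_confidence` and its definition (the share of rows that carry the most common right-hand value for their left-hand value) are assumptions made for this illustration. The rows and attribute names are made up.

```python
from collections import Counter, defaultdict


def rule_confidence(rows, lhs, rhs):
    """Share of rows whose rhs value is the most common one for their lhs value.
    A result of 1.0 means lhs functionally determines rhs in this data."""
    if not rows:
        raise ValueError("rows must not be empty")
    by_lhs = defaultdict(Counter)
    for row in rows:
        by_lhs[row[lhs]][row[rhs]] += 1
    agreeing = sum(max(counter.values()) for counter in by_lhs.values())
    return agreeing / len(rows)


rows = [  # made-up data: dept_id determines dept_name, but not item
    {"dept_id": 10, "dept_name": "Sales", "item": "tea"},
    {"dept_id": 10, "dept_name": "Sales", "item": "coffee"},
    {"dept_id": 20, "dept_name": "Support", "item": "tea"},
    {"dept_id": 20, "dept_name": "Support", "item": "tea"},
]

fd_confidence = rule_confidence(rows, "dept_id", "dept_name")  # known dependency
other_confidence = rule_confidence(rows, "dept_id", "item")    # ordinary association
```

Because the already-known dependency scores at the top, it would crowd out weaker but genuinely new associations unless it is filtered out using the metadata.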
How should metadata be represented? • A straightforward approach is to list what you already know about the data set using rules • Logical deduction schemes can produce other rules resulting from the ones you already know
A data mining scheme works with instances to produce other new rules that you didn’t know before • The bodies of rules, merged together, give the sum total of knowledge gleaned about the problem
9.5 Text Mining • Compare text mining to standard data mining • Data sets roughly parallel database tables • There are identifiable instances and well-defined attributes • Text is emphatically not structured in this way
An interesting comparison: • Data mining is supposed to find information about data where that information was not known • By definition, text is different • In text, the information is out in the open, in the form of language • It is simply not in a form suitable for easy computerized analysis
Data mining can be said to have as its goal the acquisition of “actionable” information • Based on a training set you can classify or cluster future instances, for example • In a derivative way, you can make decisions that make money, etc. • Another goal of data mining is to develop a data model • Again, this is out in the open with text
There are several applications of text mining: • Text summarization, document classification, and clustering • Language identification and authorship ascription • Assigning key descriptive phrases to documents • Metadata, entity, and information extraction
Document classification can be done based on the (count of) occurrences of words in the document • There is a feature extraction aspect to this problem • Frequent words don’t help classify • Infrequent words don’t help classify • There is still an overwhelming number of words in the middle that have value
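The feature extraction step can be sketched as follows: count in how many documents each word occurs, drop the words at both extremes, and represent each document by its counts of the remaining words. The function names and the thresholds `min_docs` and `max_doc_fraction` are hypothetical, and the documents are made up.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace, stripping surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]


def build_vocabulary(documents, min_docs=2, max_doc_fraction=0.8):
    """Keep words that appear in at least min_docs documents but in no more
    than max_doc_fraction of all documents."""
    doc_freq = Counter()
    for text in documents:
        doc_freq.update(set(tokenize(text)))  # count each word once per document
    limit = max_doc_fraction * len(documents)
    return sorted(w for w, df in doc_freq.items() if min_docs <= df <= limit)


def count_vector(text, vocabulary):
    """Occurrence counts of each vocabulary word in one document."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocabulary]


docs = [
    "The cat chased the mouse.",
    "The dog chased the cat.",
    "The stock market fell.",
    "The bond market and the stock market rose.",
]
vocabulary = build_vocabulary(docs)
vector = count_vector(docs[3], vocabulary)
```

Here "the" is dropped as too frequent and words seen in a single document are dropped as too rare. The resulting count vectors could then be fed to any of the classifiers discussed earlier.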
More complex methods step up from counting words alone • Context, word order, grammatical constructs, etc. all affect meaning • At the very least phrases might be mined instead of words • Natural language processing, syntax, and semantics may come into play
Document classification may be done with predefined classes • Document clustering doesn’t have predefined classes • Mining techniques can be used to identify the language of a document • n-grams (n-letter sequences) correlate highly with different languages • n = 3 is usually sufficient for this
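A minimal sketch of language identification with 3-letter sequences. The scoring rule (sum of the profile frequencies of the text's trigrams) is an assumption chosen for simplicity, the function names are hypothetical, and the two one-sentence "training" samples are far too small for real use.

```python
from collections import Counter


def trigrams(text):
    """All overlapping 3-letter sequences, with spaces kept as word boundaries."""
    cleaned = " " + " ".join(text.lower().split()) + " "
    return [cleaned[i:i + 3] for i in range(len(cleaned) - 2)]


def profile(text):
    """Relative frequency of each trigram in a sample text."""
    counts = Counter(trigrams(text))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def identify(text, profiles):
    """Pick the language whose profile gives the text's trigrams the most weight."""
    grams = trigrams(text)
    scores = {
        lang: sum(prof.get(g, 0.0) for g in grams)
        for lang, prof in profiles.items()
    }
    return max(scores, key=scores.get)


profiles = {  # toy samples, for illustration only
    "english": profile("the quick brown fox jumps over the lazy dog and the cat"),
    "spanish": profile("el rápido zorro marrón salta sobre el perro perezoso y el gato"),
}
```

For example, `identify("el perro y el gato", profiles)` scores the Spanish profile higher because trigrams such as " el" and "el " are common in the Spanish sample and absent from the English one.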
Authorship ascription is done by counting common (stylistic) words, not the content words which define classifications • A more complex approach would again do more than just count words
Assignment of key phrases to a document corresponds to the problem of assigning subject headings in the library catalog • You start with established sets of phrases with defined meanings • The goal is to assign one or more of these phrases to the document
Metadata extraction is a related idea with further ramifications • Is it possible to find specific information like author and title, automatically? • Can you extract useful identifying key words and phrases? • Note that the ability to do this may result in “actionable” information
The next step is entity extraction • Not only do you want to extract more obvious things like the author and title: • You want to identify any things that are mentioned in the document
How do you identify entities? • You can look them up in reference resources like dictionaries, lists of names, etc. • You may rely on simple things like capitalization or titles of address, etc. • You may search for regular expressions or use simple grammars for expressions
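The three approaches above can be sketched with Python's standard `re` module. Everything specific here is an assumption made for illustration: the reference list `KNOWN_NAMES`, the list of titles, the date format, and the function name `extract_entities`. Real entity extraction needs far richer resources.

```python
import re

# Hypothetical reference list; in practice this would be a large name list.
KNOWN_NAMES = {"Alaska", "Anchorage"}

# A title of address followed by one or more capitalized words.
TITLE_PATTERN = re.compile(r"\b(?:Mr|Mrs|Ms|Dr|Prof)\.\s+[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")
# Two or more capitalized words in a row.
CAPITALIZED_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")
# Dates written as YYYY-MM-DD (an assumed format).
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def extract_entities(text):
    """Collect candidate entities using lookup, capitalization, titles, and a regex."""
    words = set(re.findall(r"\b\w+\b", text))
    return {
        "known_names": sorted(w for w in words if w in KNOWN_NAMES),
        "titled_names": TITLE_PATTERN.findall(text),
        "capitalized_phrases": CAPITALIZED_PATTERN.findall(text),
        "dates": DATE_PATTERN.findall(text),
    }


entities = extract_entities(
    "Dr. Jane Smith visited Anchorage on 2011-03-15 with Mr. Jones."
)
```

Capitalization alone is a weak signal (every sentence starts with a capital), which is one reason these simple cues are usually combined.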