180 likes | 199 Views
Outline. Knowledge discovery in databases. Data warehousing. Data mining. Different types of data mining. The Apriori algorithm for generating association rules. Knowledge Discovery in Databases. What is it? Why do we need it? How does data mining fit in?. Steps of a KDD Process.
E N D
Outline • Knowledge discovery in databases. • Data warehousing. • Data mining. • Different types of data mining. • The Apriori algorithm for generating association rules .
Knowledge Discovery in Databases What is it? Why do we need it? How does data mining fit in?
Steps of a KDD Process • Learn the application domain. • Create a target dataset. • Data cleaning and preprocessing. • Choose the type of data mining to perform. • Pick an algorithm. • Data mining! • Interpretation.
Data Warehousing • How does it differ from a database? • Databases provide support for: • Queries over current data • Persistent storage • Atomic updates • Data warehouses provide support for: • Storage of all data, current or not • Details or summaries • Metadata • Integrated data (to reduce time in cleaning data)
Types of Data Mining Classification Rules Clustering Sequence similarity Sampling/summarizing Association Rules
Classification Rules • Rules that partition data into separate groups. • Used to classify people as good/bad credit risks, etc. • A variation is the BestN problem; decide the best N of a set for a given problem (such as find the best N people to mail ski vacation packages to).
Clustering • Goal is to put data tuples into a class. • Map each data tuple to a point in n dimensional space, and identify clusters based on spatial proximity. • Differs from classification rules because here the idea is to group based on similarity overall, not to find the ones that lead to a similar outcome.
Sequence Similarity • Used where there are time series or ordered data. • The goal is generally to give an example and look for “similar” patterns. • Example: “When AT&T stock goes up on 2 consecutive days and DEC stock does not fall during this period, IBM stock goes up the next day 75% of the time.”
Sampling/Summarization • Sampling: finding samples of the data to carry out more detailed analysis on. Goal is to get the best sample. • Summarization: finding summaries of all of the data. Goal is to help people figure out what to do with their data, or to prepare reports.
Association Rules Rules that express when two or more items are found in the same “basket” of data. Used to try to find when certain members of the data cause other members: example, people who buy diapers tend to buy beer.
Support and Confidence • Association rules are measured in terms of • support = a and b occur together in at lest s% of the n baskets • confidence = of all of the baskets containing a, at least c% also contain b.
Association Rules Algorithms • Focuses on measuring support for “itemsets” (the number of transactions that contain the data set). Confidence is an easier problem and is figured out later. • The naïve method: Start with all the itemsets of size 1. Find the “large” itemsets. Combine to find itemsets of size 2, etc. • Apriori algorithm: Tests to make sure that not only does each step combine only large itemsets, but that every subset of the set is also a large itemset.
Apriori Algorithm • In general, the set of large itemsets of size k is referred to as Lk and Ck is the candidate set of size k • For each itemset in Lk-1, if all of the items in the set are the same except the last item, then the two itemsets are combined and this is put into the set Ck, which is the list of candidates for large itemsets of size k.
Check to see if for each itemset in Ck all of the subsets are in L k-1. This allows you to discard some itemsets without having to check for them in the dataset • Count the support of the remaining items in Ck and remove those without enough support. This is now the set Lk. • Bump k and repeat steps 2 through 4 until there is only one set remaining in Lk. When there is only one set in Lk you have found all of the large itemsets.
Example Find all Itemsets with support of 2