A Brief History of Data Mining Society • 1989 IJCAI Workshop on Knowledge Discovery in Databases • Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991) • 1991-1994 Workshops on Knowledge Discovery in Databases • Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996) • 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98) • Journal of Data Mining and Knowledge Discovery (1997)
A Brief History of Data Mining Society • ACM SIGKDD conferences since 1998 and SIGKDD Explorations • More conferences on data mining • PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc. • ACM Transactions on KDD starting in 2007
Conferences and Journals on Data Mining • KDD Conferences • ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD) • SIAM Data Mining Conf. (SDM) • (IEEE) Int. Conf. on Data Mining (ICDM) • European Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD) • Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
Where to Find References? DBLP, CiteSeer, Google • Data mining and KDD (SIGKDD: CDROM) • Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc. • Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD • Bioinformatics • Conferences: RECOMB, CSB, PSB, BIBE, etc. • Journals: Bioinformatics, BMC Bioinformatics, TCBB, …
Top-10 Algorithms Finally Selected at ICDM’06 • #1: Decision Tree (61 votes) • #2: K-Means (60 votes) • #3: SVM (58 votes) • #4: Apriori (52 votes) • #5: EM (48 votes) • #6: PageRank (46 votes) • #7: AdaBoost (45 votes) • #8: kNN (45 votes) • #9: Naive Bayes (45 votes) • #10: CART (34 votes)
Association Rules • support, s: the probability that a transaction contains X ∪ Y • confidence, c: the conditional probability that a transaction containing X also contains Y
Association Rules • Let’s work through an example (a small computation sketch follows the transaction list below)
Association Rules • Example transaction database (TID: items) • T100: 1, 2, 5 • T200: 2, 4 • T300: 2, 3 • T400: 1, 2, 4 • T500: 1, 3 • T600: 2, 3 • T700: 1, 3 • T800: 1, 2, 3, 5 • T900: 1, 2, 3
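As a concrete illustration (not part of the original slides), here is a minimal Python sketch that computes support and confidence over the nine transactions above; the candidate rule {1, 2} ⇒ {5} is an arbitrary example chosen just to show the calculation.

```python
# Minimal sketch: support and confidence over the example transactions.
transactions = [
    {1, 2, 5}, {2, 4}, {2, 3}, {1, 2, 4}, {1, 3},
    {2, 3}, {1, 3}, {1, 2, 3, 5}, {1, 2, 3},
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """Conditional probability that a transaction containing lhs also contains rhs."""
    return support(lhs | rhs, db) / support(lhs, db)

# Example rule {1, 2} => {5}
print(support({1, 2, 5}, transactions))       # s = 2/9 ≈ 0.22
print(confidence({1, 2}, {5}, transactions))  # c = 2/4 = 0.50
```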
Classification—A Two-Step Process • Classification • predicts categorical class labels (discrete or nominal) • Step 1 (learning): construct a model from the training set, using the values (class labels) of a classifying attribute • Step 2 (classification): use the model to assign class labels to new data • The two steps are sketched in code below
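A minimal sketch of the two steps, assuming scikit-learn and a toy integer-encoded dataset purely for illustration (none of the data below comes from the slides):

```python
# Minimal two-step sketch (assumed toy data, encoded as integers).
from sklearn.tree import DecisionTreeClassifier

# Step 1: model construction from class-labeled training tuples.
X_train = [[0, 1], [1, 0], [1, 1], [0, 0]]   # feature vectors
y_train = ["yes", "no", "yes", "no"]          # class labels
model = DecisionTreeClassifier(criterion="entropy")
model.fit(X_train, y_train)

# Step 2: use the model to classify previously unseen data.
print(model.predict([[0, 1], [1, 0]]))        # predicted class labels
```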
Classification • Typical applications • Credit approval • Target marketing • Medical diagnosis • Fraud detection • And much more
Decision Tree • Decision tree induction is the learning of decision trees from class-labeled training tuples • A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute • Each branch represents an outcome of the test • Each leaf node holds a class label
Decision Tree Algorithm • Basic algorithm (a greedy algorithm) • Tree is constructed in a top-down recursive divide-and-conquer manner • At start, all the training examples are at the root • Attributes are categorical (if continuous-valued, they are discretized in advance) • Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
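A minimal Python sketch of this greedy, top-down, recursive partitioning, under stated assumptions: each training tuple is a dict of categorical attribute values plus a "class" key, and the attribute-selection heuristic is passed in as a function (an information-gain chooser is sketched after the information-gain slides). All names here are illustrative, not from the slides.

```python
from collections import Counter

def majority_class(rows):
    """Most common class label among the tuples."""
    return Counter(r["class"] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, select_attribute):
    """Greedy, top-down, recursive divide-and-conquer induction."""
    classes = {r["class"] for r in rows}
    if len(classes) == 1:                     # all tuples in one class -> leaf
        return classes.pop()
    if not attributes:                        # no attributes left -> majority-class leaf
        return majority_class(rows)
    best = select_attribute(rows, attributes)            # heuristic, e.g. information gain
    node = {"test": best, "branches": {}}
    for value in {r[best] for r in rows}:                # one branch per outcome of the test
        subset = [r for r in rows if r[best] == value]
        rest = [a for a in attributes if a != best]
        node["branches"][value] = build_tree(subset, rest, select_attribute)
    return node

# A trivial chooser works as a smoke test; ID3/C4.5 would plug in information gain:
# build_tree(rows, ["age", "income"], lambda rows, attrs: attrs[0])
```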
Attribute Selection Measure: Information Gain (ID3/C4.5) • Select the attribute with the highest information gain • Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D| • Expected information (entropy) needed to classify a tuple in D: Info(D) = -Σi pi * log2(pi)
Attribute Selection Measure: Information Gain (ID3/C4.5) • Information needed (after using A to split D into v partitions) to classify D: InfoA(D) = Σj (|Dj| / |D|) * Info(Dj), for j = 1, …, v • Information gained by branching on attribute A: Gain(A) = Info(D) - InfoA(D)
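These formulas map directly to a few lines of Python. The sketch below is illustrative (not from the slides); it assumes the same dict-of-attributes tuple representation as the induction sketch above and could be passed to it as the attribute-selection heuristic.

```python
from collections import Counter
from math import log2

def info(rows):
    """Info(D) = -sum_i p_i * log2(p_i): expected information (entropy) of D."""
    counts = Counter(r["class"] for r in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_after_split(rows, attr):
    """Info_A(D) = sum_j |D_j|/|D| * Info(D_j) over the v partitions induced by A."""
    total = len(rows)
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attr], []).append(r)
    return sum(len(part) / total * info(part) for part in partitions.values())

def gain(rows, attr):
    """Gain(A) = Info(D) - Info_A(D)."""
    return info(rows) - info_after_split(rows, attr)

def select_attribute(rows, attributes):
    """Pick the attribute with the highest information gain (ID3)."""
    return max(attributes, key=lambda a: gain(rows, a))
```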
Decision Tree • The term 5/14 * I(2,3) means that the partition “age <= 30” contains 5 of the 14 samples, with 2 yes’s and 3 no’s • I(2,3) = -2/5 * log2(2/5) - 3/5 * log2(3/5) ≈ 0.971
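A quick numeric check of this arithmetic (illustrative only; the 2-yes/3-no counts and the 5/14 weight come from the slide above):

```python
from math import log2

# I(2,3): entropy of a 5-sample partition with 2 "yes" and 3 "no" tuples.
i_2_3 = -(2/5) * log2(2/5) - (3/5) * log2(3/5)
print(round(i_2_3, 3))          # 0.971

# Contribution of the "age <= 30" branch to Info_age(D): weight by 5/14.
print(round(5/14 * i_2_3, 3))   # 0.347
```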
Decision Tree • Similarly, we can compute • Gain(income) = 0.029 • Gain(student) = 0.151 • Gain(credit_rating) = 0.048 • Since “age” yields the highest information gain, it is selected as the splitting attribute and the data are partitioned on its values