TOP DM 10 Algorithms • C4.5
Research issue: stable trees. It is well known that the error rate of a tree on the cases from which it was constructed (the resubstitution error rate) is much lower than the error rate on unseen cases (the predictive error rate). For example, on a well-known letter recognition dataset with 20,000 cases, the resubstitution error rate for C4.5 is 4%, but the error rate from a leave-one-out (20,000-fold) cross-validation is 11.7%. As this demonstrates, leaving out a single case from 20,000 often affects the tree that is constructed! Suppose now that we could develop a non-trivial tree-construction algorithm that was hardly ever affected by omitting a single case. For such stable trees, the resubstitution error rate should approximate the leave-one-out cross-validated error rate, suggesting that the tree is of the "right" size.
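The gap between resubstitution and cross-validated error is easy to reproduce. A minimal sketch, assuming scikit-learn and using its CART-style DecisionTreeClassifier as a stand-in for C4.5 (and 10-fold cross-validation in place of full leave-one-out, which is slow):

```python
# Sketch: resubstitution vs. cross-validated error for a decision tree.
# DecisionTreeClassifier is a CART-style stand-in for C4.5; the digits
# dataset stands in for the letter recognition dataset cited above.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
tree = DecisionTreeClassifier(random_state=0)

# Resubstitution error: test on the same cases used to grow the tree.
resub_error = 1.0 - tree.fit(X, y).score(X, y)

# Predictive error estimated by cross-validation (LOOCV would use cv=len(X)).
cv_error = 1.0 - cross_val_score(tree, X, y, cv=10).mean()

print(f"resubstitution error: {resub_error:.3f}")  # near 0 for an unpruned tree
print(f"cross-validated error: {cv_error:.3f}")    # noticeably higher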
TOP DM 10 Algorithms • K-Means Research Issues
• Local minima can be countered by running the algorithm multiple times with different seeds (see the sketch below)
• Use k-means with different distance metrics
• Limitations:
• Hard assignments of points to clusters
• Will falter if the spherical balls are not well separated (see next slide!)
• Improvements: fuzzy k-means / EM
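As a concrete illustration of the multiple-seeds remedy, here is a minimal sketch assuming scikit-learn (whose KMeans already does such restarts internally via its n_init parameter; the loop is written out here to make the idea explicit):

```python
# Sketch: countering local minima by restarting k-means with different seeds
# and keeping the solution with the lowest within-cluster sum of squares.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)

best_model, best_inertia = None, np.inf
for seed in range(10):                        # 10 independent restarts
    km = KMeans(n_clusters=4, n_init=1, random_state=seed).fit(X)
    if km.inertia_ < best_inertia:            # inertia_ = within-cluster SSE
        best_model, best_inertia = km, km.inertia_

print(f"best within-cluster SSE over 10 seeds: {best_inertia:.1f}")
```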
TOP DM 10 Algorithms • K-Means Research Issues (continued)
So, k-means will falter whenever the data is not well described by reasonably separated spherical balls, for example, if there are non-convex shaped clusters in the data. This problem may be alleviated by rescaling the data to "whiten" it before clustering, or by using a different distance measure that is more appropriate for the dataset. For example, information-theoretic clustering uses the KL-divergence to measure the distance between two data points…
K-means can also be paired with another algorithm to describe non-convex clusters. One first clusters the data into a large number of groups using k-means. These groups are then agglomerated into larger clusters using single-link hierarchical clustering, which can detect complex shapes. This approach also makes the solution less sensitive to initialization, and since the hierarchical method provides results at multiple resolutions, one does not need to pre-specify k either. (A two-stage sketch follows below.)
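The two-stage idea can be sketched in a few lines, assuming scikit-learn; AgglomerativeClustering with linkage="single" plays the role of single-link hierarchical clustering, applied here to the k-means centroids:

```python
# Sketch: describe non-convex clusters by over-clustering with k-means,
# then agglomerating the k-means centroids with single-link clustering.
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=1000, noise=0.05, random_state=0)

# Stage 1: many small spherical groups.
km = KMeans(n_clusters=30, n_init=10, random_state=0).fit(X)

# Stage 2: single-link merging of the group centroids can follow the
# non-convex "moon" shapes that plain k-means with k=2 would split badly.
agg = AgglomerativeClustering(n_clusters=2, linkage="single")
centroid_labels = agg.fit_predict(km.cluster_centers_)

# Map each point to the final cluster of its k-means centroid.
point_labels = centroid_labels[km.labels_]
print(point_labels[:10])
```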
Top Ten DM Algorithms Continued
• SVM
• Apriori
• EM (a generalization of k-means): uses a mixture of Gaussian distributions instead of centroids as cluster models. Basic loop (sketched below):
DO
• Assign points to clusters (E-step)
• Update model parameters (M-step)
UNTIL there is no change
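A minimal numpy sketch of that E/M loop for a one-dimensional mixture of two Gaussians; illustrative only, since real implementations (e.g. scikit-learn's GaussianMixture) handle multivariate data, full covariances, and degenerate cases:

```python
# Sketch: EM for a 1-D mixture of two Gaussians, mirroring the
# DO { E-step; M-step } UNTIL-no-change loop above.
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])

mu = np.array([-1.0, 1.0])        # initial component means
sigma = np.array([1.0, 1.0])      # initial standard deviations
pi = np.array([0.5, 0.5])         # initial mixture weights

def gauss(v, m, s):
    return np.exp(-0.5 * ((v - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

for _ in range(200):
    # E-step: soft cluster assignment -- responsibility of each component.
    dens = pi * gauss(x[:, None], mu, sigma)          # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate model parameters from the responsibilities.
    nk = resp.sum(axis=0)
    new_mu = (resp * x[:, None]).sum(axis=0) / nk
    sigma = np.sqrt((resp * (x[:, None] - new_mu) ** 2).sum(axis=0) / nk)
    pi = nk / len(x)

    delta = np.abs(new_mu - mu).max()
    mu = new_mu
    if delta < 1e-6:                                  # "no change" -> stop
        break

print(f"means: {mu}, stds: {sigma}, weights: {pi}")
```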
Top Ten Continued
• PAGE RANK (determines the importance of web pages based on link structure)
• Solves a large system of linear score equations (equivalently, finds the principal eigenvector of the link matrix)
• PageRank is a probability distribution used to represent the likelihood that a person randomly clicking on links will arrive at any particular page
• Uses a random walk to determine page importance (see the sketch below)
• More information:
http://www.prchecker.info/check_page_rank.php
http://en.wikipedia.org/wiki/PageRank
http://infolab.stanford.edu/~backrub/google.html (original PageRank paper)
• AdaBoost (ensemble approach)
• k-NN (k-nearest neighbor)
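The random-walk view translates directly into power iteration. A minimal sketch on a hypothetical four-page toy graph, using a damping factor of 0.85 as in the original paper:

```python
# Sketch: PageRank by power iteration on a tiny hand-made link graph.
import numpy as np

# links[i] = pages that page i links to (a toy 4-page web).
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n, d = len(links), 0.85                       # d = damping factor

rank = np.full(n, 1.0 / n)                    # start from the uniform distribution
for _ in range(100):
    new_rank = np.full(n, (1 - d) / n)        # teleportation term
    for page, outlinks in links.items():
        for target in outlinks:               # random surfer follows a link
            new_rank[target] += d * rank[page] / len(outlinks)
    delta = np.abs(new_rank - rank).sum()
    rank = new_rank
    if delta < 1e-10:                         # converged
        break

print(rank)                                   # sums to 1: a probability distribution
```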
Top Ten Continued
• Naïve Bayes (covered in the Machine Learning class)
• CART
• Recursive partitioning procedure
• Uses the Gini index as its splitting criterion (see the sketch below)
• Similar to C4.5, but uses other techniques to obtain trees
• Some newer work extends it to forests (ensembles of trees)
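For concreteness, a small sketch of the Gini criterion CART uses to score candidate splits (hypothetical helper functions, not from any particular library):

```python
# Sketch: Gini impurity, the splitting criterion used by CART
# (C4.5 uses information gain ratio instead).
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(left, right):
    """Weighted Gini impurity of a binary split, as minimized by CART."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# A pure node has impurity 0; a 50/50 binary node has impurity 0.5.
print(gini(["a", "a", "a"]))               # 0.0
print(gini(["a", "b", "a", "b"]))          # 0.5
print(split_gini(["a", "a"], ["b", "b"]))  # 0.0 -- a perfect split
```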