Big Data Machine Learning Algorithms Student-Faculty Research Day DPS Class ‘2016 DPS Class - 2016
Abstract
Machine Learning (ML) is a combination of computer science techniques, statistical methods and classification schemes harmonized to produce inference capabilities against (big) data sets. To be useful it has to be modeled into a system that has the ability to receive feedback from decisions and actions such that the model can be updated to produce better outcomes. Four popular Machine Learning algorithms discussed in this paper are:
• Bayesian Decision Theory
• k-Nearest-Neighbor Classification
• k-Means Clustering
• Linear Regression
Introduction / Background
• Machine Learning is applied to a range of computing tasks in which discovering patterns implicit in the data leads to better decision-making: data mining, digital recognition, speech understanding, biometrics, telemedical diagnosis, and anomaly (fraud) detection.
• Machine learning algorithms can be grouped according to the underlying problem classes that they address: classification, regression, clustering, and rule extraction. Additionally, an algorithm can be classified according to its learning style: supervised, unsupervised, semi-supervised, or reinforcement learning.
• Supervised learning: pattern classification procedures take in labeled data related to a pattern (object) and make a decision about the class (category) to which the pattern belongs.
• Unsupervised learning: the data is not labeled and there are no target variables. Clustering groups the data points so that similar points fall in one cluster and dissimilar points fall in different clusters. A number of different measures can be used to quantify similarity.
Bayes Decision Theory
Bayes Decision Theory is a statistical approach to the problem of pattern classification; it is a decision theory informed by Bayes' theorem. The assumption is that the problem is posed in probabilistic terms and that all of the relevant probability values are known.
Bayes Rule (Bayes Law), where j indexes the class and x is the observation:
$$P(j \mid x) = \frac{P(x \mid j)\, P(j)}{P(x)}$$
Posterior = (Likelihood) * (Prior) / Evidence
Decision rule with only the prior information: decide class 1 if $P(1) > P(2)$; otherwise decide class 2.
Decision rule given the posterior probabilities, where x is an observation:
if $P(1 \mid x) > P(2 \mid x)$, decision = 1
if $P(1 \mid x) < P(2 \mid x)$, decision = 2
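A minimal sketch of the two decision rules may help make them concrete. The likelihood and prior values below are hypothetical numbers chosen for illustration, and the function names `posteriors` and `decide` are not from the slides.

```python
def posteriors(likelihoods, priors):
    """Apply Bayes' rule: posterior = likelihood * prior / evidence."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)  # P(x), the sum of likelihood * prior over all classes
    return [j / evidence for j in joint]


def decide(probabilities):
    """Return the 1-based index of the class with the largest probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return best + 1


# Hypothetical two-class problem (illustrative values only).
priors = [0.3, 0.7]        # P(1), P(2)
likelihoods = [0.6, 0.2]   # P(x | 1), P(x | 2) for one observation x

prior_decision = decide(priors)          # uses the prior information only
post = posteriors(likelihoods, priors)   # P(1 | x), P(2 | x)
posterior_decision = decide(post)        # uses the observation as well

print(prior_decision, post, posterior_decision)
```

With these numbers the prior alone favors class 2, while the posterior favors class 1, which shows how an observation can overturn the prior-only decision.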
Bayes Decision Boundary Example
• Training samples for a three-class (red, blue, green), 2D (two-feature) problem are shown:
• Red = {(1,4), (-3,0), (5,0), (1,-4)}
• Green = {(-6,17), (-10,13), (-2,13), (-6,9)}
• Blue = {(10,15), (6,11), (14,11), (10,7)}
The Bayes decision boundaries partition the feature space and are used to classify all new data presented to the system. Here they are the perpendicular bisectors of the line segments joining the class means.
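Boundaries made of perpendicular bisectors between the class means are equivalent to assigning each new point to the nearest class mean. The sketch below computes the means of the three training sets and classifies by nearest mean. It assumes equal priors and identical spherical Gaussian class distributions (the conditions named later in the deck), and it assumes the Green set consists of exactly the four points listed. The names `class_mean` and `classify` are illustrative.

```python
TRAINING = {
    "red": [(1, 4), (-3, 0), (5, 0), (1, -4)],
    "green": [(-6, 17), (-10, 13), (-2, 13), (-6, 9)],
    "blue": [(10, 15), (6, 11), (14, 11), (10, 7)],
}


def class_mean(points):
    """Average each coordinate separately over the points of one class."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


MEANS = {label: class_mean(points) for label, points in TRAINING.items()}


def classify(point):
    """Assign the point to the class whose mean is nearest (squared Euclidean)."""
    def sq_dist(label):
        mx, my = MEANS[label]
        return (point[0] - mx) ** 2 + (point[1] - my) ** 2
    return min(MEANS, key=sq_dist)


print(MEANS)
print(classify((0, 0)), classify((-5, 10)), classify((9, 9)))
```

The computed means are (1, 0) for red, (-6, 13) for green and (10, 11) for blue.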
K-Nearest Neighbor
The k-Nearest-Neighbor (k-NN) classifier is based on the distance between a test sample and the training samples.
• When: It is used to perform discriminant analysis when reliable parametric estimates of the probability densities are unknown or difficult to determine.
• Distance Measure: The k-NN algorithm usually uses the Euclidean distance or the Manhattan distance. However, other distances such as the Chebyshev norm or the Mahalanobis distance can also be used.
• Pro/Con: The nearest-neighbor algorithm is conceptually and computationally simple, but it is a sub-optimal procedure; it will usually lead to an error rate greater than the minimum possible, the Bayes rate.
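As a rough illustration of three of the distance measures named above, the sketch below implements Euclidean, Manhattan and Chebyshev distance for points given as tuples. The Mahalanobis distance is left out because it additionally needs a covariance estimate. The function names and the sample points are illustrative choices.

```python
import math


def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def manhattan(a, b):
    """City-block distance: sum of the absolute coordinate differences."""
    return sum(abs(p - q) for p, q in zip(a, b))


def chebyshev(a, b):
    """Chebyshev distance: largest absolute coordinate difference."""
    return max(abs(p - q) for p, q in zip(a, b))


a, b = (0, 0), (3, 4)
print(euclidean(a, b), manhattan(a, b), chebyshev(a, b))
```

Any of these functions could be passed to a k-NN routine as its distance; the choice changes which training samples count as "nearest".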
k-Nearest Neighbor Example
• Drawn in this example are the k-NN decision boundaries which provide the boundaries for classifying all new data presented to the system.
• Training samples for a three-class (red, blue, green), 2D (two-feature) problem are shown:
• Red = {(1,4), (-3,0), (5,0), (1,-4)}
• Green = {(-6,17), (-10,13), (-2,13), (-6,9)}
• Blue = {(10,15), (6,11), (14,11), (10,7)}
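A minimal k-NN classifier over these training samples could look like the sketch below. It assumes Euclidean distance, a plain (unweighted) majority vote and k = 3; the slides do not state the value of k used for the figure, so that default is an assumption, as is the name `knn_classify`.

```python
from collections import Counter

SAMPLES = (
    [((1, 4), "red"), ((-3, 0), "red"), ((5, 0), "red"), ((1, -4), "red")]
    + [((-6, 17), "green"), ((-10, 13), "green"), ((-2, 13), "green"), ((-6, 9), "green")]
    + [((10, 15), "blue"), ((6, 11), "blue"), ((14, 11), "blue"), ((10, 7), "blue")]
)


def sq_dist(a, b):
    """Squared Euclidean distance (same ordering as Euclidean distance)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def knn_classify(point, k=3):
    """Label a point by majority vote among its k nearest training samples."""
    neighbors = sorted(SAMPLES, key=lambda s: sq_dist(s[0], point))[:k]
    votes = Counter(label for _, label in neighbors)
    # Ties between classes are not handled specially in this sketch.
    return votes.most_common(1)[0][0]


print(knn_classify((0, 1)), knn_classify((-6, 12)), knn_classify((10, 10)))
```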
K-Means
The k-Means algorithm clusters observations into related groups without any prior knowledge of those relationships. The algorithm uses the following steps:
Step 1: Select a value of “k.” k is the number of clusters into which you would like to split the data points. Note that the value of k can be adjusted many times and the algorithm repeated in order to find an optimal solution.
Step 2: Randomly assign a center/centroid to each cluster. Generally, extreme opposite points are selected as initial cluster centroids.
Step 3: Assign each point in the data set to the nearest centroid. This is generally done using Euclidean distance.
Step 4: Re-compute the cluster centroids. Each new centroid is the average of all the points in the cluster; that is, its coordinates are the means for each dimension separately over all the points in the cluster.
Step 5: Repeat steps 3 and 4 until the clusters no longer change; that is to say, the assignment of points to clusters becomes stable.
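The steps above can be sketched in a few lines. This version takes the initial centroids as an argument instead of choosing them at random, so that the run is repeatable. The starting centroids used here (the first point listed in each colour group of the deck's example data) are a hypothetical choice, not the ones used for the slides' figures, and `k_means` is an illustrative name.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def k_means(points, initial_centroids, max_iter=100):
    """Alternate assignment (step 3) and centroid update (step 4) until stable."""
    centroids = list(initial_centroids)
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid.
        new_assignment = [
            min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            for p in points
        ]
        # Step 5: stop once the assignment no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: move each centroid to the mean of its assigned points.
        for i in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, assignment


# The twelve example points, pooled without their colour labels.
POINTS = [
    (1, 4), (-3, 0), (5, 0), (1, -4),
    (-6, 17), (-10, 13), (-2, 13), (-6, 9),
    (10, 15), (6, 11), (14, 11), (10, 7),
]
centroids, assignment = k_means(POINTS, [(1, 4), (-6, 17), (10, 15)])
print(centroids, assignment)
```

From this starting point the clusters coincide with the three colour groups, with final centroids at (1, 0), (-6, 13) and (10, 11). Other starting centroids may need more iterations or settle on a different grouping.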
k-means Example
• Training samples for a three-class (red, blue, green), 2D (two-feature) problem are shown:
• Red = {(1,4), (-3,0), (5,0), (1,-4)}
• Green = {(-6,17), (-10,13), (-2,13), (-6,9)}
• Blue = {(10,15), (6,11), (14,11), (10,7)}
Five steps are needed to reach the k-Means decision boundaries when three non-distant points are chosen as the initial centroids. Only two steps are required when three distant points are chosen.
Linear Regression
Regression analysis is a statistical method for investigating relationships between variables.
• Linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory (independent) variables denoted X.
• The case of one explanatory variable is called simple linear regression.
• When there is more than one explanatory variable, the process is called multiple linear regression.
• Simple linear regression describes the relationship between two variables through the equation of a straight line, called the line of best fit, that most closely models the relationship.
Simple Linear Regression Example
The following linear regression formula provides the least-squares, best-fit equation for the line $y = a + bx$:
$$b = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}$$
where $\bar{x}$ and $\bar{y}$ are the means of the x's and y's. This formula can be applied to a simple example. The computed least-squares, best-fit line through the five points {(2,2), (0,0), (-2,-2), (-1,1), (1,-1)} is $y = 0.6x$.
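The five-point example can be checked with a short sketch. It assumes the standard closed-form least-squares estimates for a straight line (slope equal to the sum of the products of the x and y deviations from their means divided by the sum of squared x deviations, and intercept equal to the y mean minus the slope times the x mean). The function name `fit_line` is illustrative.

```python
def fit_line(points):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    # Sum of products of deviations, and sum of squared x deviations.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    s_xx = sum((x - x_bar) ** 2 for x, _ in points)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return slope, intercept


points = [(2, 2), (0, 0), (-2, -2), (-1, 1), (1, -1)]
slope, intercept = fit_line(points)
print(slope, intercept)
```

For these points both means are zero, the sum of products is 6 and the sum of squared x deviations is 10, so the fitted line has slope 0.6 and passes through the origin.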
Applications of Machine Learning Algorithms - Classifiers
• Bayesian Decision Theory Classification
  • Known probability structure for each class
  • Uses parameters
  • Training data is labeled
  • Scales well with diverse datasets
  • Gives classification regions similar to k-NN and k-Means when identical spherical Gaussian probability distributions for all classes and equal prior probabilities are assumed
  • Used for facial recognition, adaptive teaching systems, medical imaging, and spam filtering
• k-Nearest Neighbor
  • Unknown probability structure for each class
  • Does not use parameters
  • Training data is labeled (k-NN is a supervised method)
  • Does not scale well with diverse datasets (curse of dimensionality)
  • Object is classified by majority vote of its neighbors, optionally with greater weight given to closer neighbors in the vote
Applications of Machine Learning Algorithms – Non-Classifiers
• k-Means Clustering
  • Takes data with possible but undefined groupings
  • No prior knowledge of relationships between elements
  • Identifies groupings to which analytics can be applied
  • Performance and results can differ depending on whether distant or non-distant points are selected as the initial center of each cluster
• Linear Regression
  • Used to predict a dependent variable from a number of independent variables
  • Observed XY values are used to calculate the least-squares, best-fit equation for the line, which is then used to predict a new value
Discussions & Conclusions
• In practice, k-NN and k-Means are good choices for Big Data Machine Learning because they are easier to use than Bayesian Decision Theory Classification and, in our examples, give equivalent decision boundaries.
Backup
Applications of Machine Learning Algorithms
• Bayesian Decision Theory Classification: Probability structure for each class is known, uses parameters, training data is labeled. Gives classification regions similar to k-NN and k-Means when identical spherical Gaussian probability distributions for all classes and equal prior probabilities are assumed. Scales well with diverse datasets. Is a classifier.
• k-NN: No parameters; training data is labeled (k-NN is a supervised method). Object is classified by majority vote of its neighbors, optionally with greater weight given to closer neighbors in the vote. Does not scale well with diverse datasets (curse of dimensionality). Is a classifier.
• k-Means: Is not a classifier. Takes data with possible but undefined groupings, and identifies groupings to which analytics can be applied. Performance and results can differ depending on whether distant or non-distant points are selected as the initial center of each cluster.
• Linear Regression: Used to predict a dependent variable from a number of independent variables. Observed XY values are used to calculate the least-squares, best-fit equation for the line, which is then used to predict a new value.
Applications of Machine Learning Algorithm – Bayes Decision Theory Classification
• ADVANTAGES AND LIMITATIONS: Bayesian decision has the following advantages:
(1) Most general decision methods rely on incomplete information or subjective probability, whereas Bayesian decision can make a principled judgment about the value of the information and about whether new information needs to be collected.
(2) Bayesian decision can quantitatively evaluate the likelihood of survey results, unlike general decision methods, which either believe the survey results completely or not at all.
(3) The results of any investigation cannot be completely accurate, and neither the prior knowledge nor the subjective probability can be fully trusted. Bayesian decision combines these two kinds of information.
(4) It can be applied repeatedly throughout the decision process, according to the specific circumstances, so that the decision is refined step by step.
• Bayesian decision also has some limitations:
(1) Bayesian decision is based on learning from samples, and most sample learning needs human experience or expert knowledge to specify the prior probability. This requires experts to do a large amount of detailed study and still carries a certain degree of subjective one-sidedness.
• THE LATEST APPLICATIONS: Bayesian decision theory has a very wide range of applications. Recent applications include the following:
(1) Artificial intelligence: for example, in face recognition, face images are treated as matrices; the computed eigenvalues and corresponding eigenvectors are taken as the algebraic feature set, and Bayesian decision is used to make the judgment.
(2) Education and teaching: for example, in adaptive teaching systems, it is important to study how to design student models that achieve personalized and adaptive teaching. With the student model built as a Bayesian network, experiments show that an adaptive teaching system based on Bayesian networks can effectively provide adaptive teaching resources, contributing to an adaptive learning platform.
(3) Medical imaging: for problems with small samples and high-dimensional variables, hierarchical clustering and principal component analysis are first used to reduce the dimensionality, and the resulting principal components are then used for Bayesian network learning and classification.
(4) Network spam filtering: for example, Bayesian methods are used to classify message content in order to identify and filter spam effectively [1].
• [1] Hai-ming Li, Ting-lei Huang, Xin Wang, "Research on Bayesian Decision Theory in Pattern Recognition," School of Computer and Control, Guilin University of Electronic Technology.