Chapter 6 Machine Learning Algorithms and Prediction Model Fitting

  1. Chapter 6 Machine Learning Algorithms and Prediction Model Fitting

  2. Taxonomy of Machine Learning Methods • The main idea of machine learning (ML) • To use computers to learn from massive amounts of data • ML forms the core of artificial intelligence (AI) • Especially in the era of big data • This field is highly relevant to statistical decision making and data mining • In building AI or expert systems for tedious or unstructured data • Machines can often make better and less biased decisions than a human learner • By learning from given data objects • One can reveal the categorical class or experience affiliation of future data to be tested

  3. Taxonomy of Machine Learning Methods (cont.) • This concept essentially defines ML as an operational term • To implement an ML task • Need to explore or construct computer algorithms to learn from data • Make predictions on data based on their specific features, similarities, or correlations • ML algorithms are operated by building a decision-making model from sample data inputs • The outputs of the ML model are data-driven predictions or decisions

  4. Classification by Learning Paradigms • ML algorithms can be built with different styles in order to model a problem • The style is dictated by the interaction with the data environment • Expressed as the input to the model • The data interaction style decides the learning models that an ML algorithm can produce • The user must understand the roles of the input data and the model’s construction process • The goal is to select the ML model that can solve the problem with the best prediction result • ML sometimes overlaps with the goal of data mining

  5. Classification by Learning Paradigms • Three classes of ML algorithms based on different learning styles • Supervised, unsupervised, and semi-supervised • All three ML methods are viable in real-life applications • The style hinges on how training data is used in the learning process

  6. Classification by Learning Paradigms (cont.) • Supervised learning • The input data is called training data and carries a known label or result • A model is constructed through training on the training data set • And improved by receiving feedback on its predictions • The learning process continues • Until the model achieves a desired level of accuracy on the training data • Future incoming data without known labels is then tested on the model at an acceptable level of accuracy • Unsupervised learning • None of the input data is labeled with a known result

  7. Classification by Learning Paradigms (cont.) • A model is generated by exploring the hidden structures present in the input data • To extract general rules, go through a mathematical process to reduce redundancy, or organize data by similarity testing • Semi-supervised learning • The input data is a mixture of labeled and unlabeled examples • The model must learn the structures to organize the data in order to make predictions possible • Such problems and other ML algorithms will be treated under different assumptions on how to model the unlabeled data

  8. Methodologies for Machine/Deep Learning • ML algorithms are distinguishable • By applying different similarity testing functions in the learning process • e.g., Tree-based methods apply decision trees • A neural network is inspired by artificial neurons in a connectionist brain model • One can handle the ML process subjectively • By finding the best fit to solve the decision problem based on the characteristics in data sets • Ensemble methods are composed of multiple weaker models • The prediction results of these models are combined

  9. Methodologies for Machine/Deep Learning (cont.) • This makes the collective prediction more accurate • These models are independently trained • Much effort is put into what types of weak learners to combine • And the ways in which to combine them effectively • An ensemble may consist of mixed learners applying supervised, unsupervised, or semi-supervised algorithms • Deep learning methods extend artificial neural networks (ANNs) • By building much deeper and more complex neural networks • Built of multiple layers of interconnected artificial neurons

  10. Methodologies for Machine/Deep Learning (cont.) • Often used to mimic the human brain’s processing of light, sound, and visual signals • Often applied to semi-supervised learning problems • Where large data sets contain very little labeled data

  11. Supervised Machine Learning Algorithms • In a supervised ML system • The computer learns from a training data set of {input, output} pairs • The input comes from sample data given in a certain format • e.g., The credit reports of borrowers • The output may be discrete • e.g., yes or no to a loan application • The output can also be continuous • e.g., The probability distribution that the loan can be paid off in a timely manner • The goal is to work out a reliable ML model • One that can map new, previously unseen inputs to the correct outputs

  12. Supervised Machine Learning Algorithms (cont.) • The ML system acts like a finely tuned predictor function g(x) • The learning system is built with a sophisticated algorithm to optimize this function • e.g., Given an input data x in a credit report of a borrower, the bank will make a loan decision based on the predicted outcome • Four families of important supervised ML algorithms • Including regression, decision trees, Bayesian networks, and support vector machines • In solving a classification problem, the inputs are divided into two or more classes

  13. Supervised Machine Learning Algorithms (cont.) • The learner must produce a model that assigns unseen inputs to one or more of these classes • Typically tackled in a supervised way • Spam filtering is a good example of classification • The inputs are e-mails, blogs, or document files • The output classes are spam and non-spam
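
To make the spam-filtering example concrete, the sketch below builds a tiny supervised text classifier. The use of scikit-learn and the four-message corpus are assumptions for illustration, not the chapter's own code.

    # Minimal supervised spam classifier: learn from {input, output} pairs.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical training e-mails with known labels (1 = spam, 0 = non-spam).
    emails = ["win a free prize now", "meeting agenda for monday",
              "cheap loans click here", "project report attached"]
    labels = [1, 0, 1, 0]

    vectorizer = CountVectorizer()               # turn text into word-count features
    X_train = vectorizer.fit_transform(emails)
    model = MultinomialNB().fit(X_train, labels)

    # Test future incoming data without a known label.
    X_new = vectorizer.transform(["free loans now"])
    print(model.predict(X_new))                  # expected: [1], i.e., spam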

  14. Supervised Machine Learning Algorithms (cont.) • Regression is also a supervised problem • The outputs are continuous in general but discrete in special cases • Uses statistical learning • Models the relationship between input and output data • The regression process is iteratively refined using an error criterion to make better predictions • Minimizes the error between the predicted value and the actual experience in the input data • Decision trees offer a predictive model • They solve classification and regression problems • Mapping observations about an item to conclusions about the item’s target value

  15. Supervised Machine Learning Algorithms (cont.) • Along various feature nodes in a tree-structured decision process • Various decision paths fork in the tree structure • Until a prediction decision is made hierarchically at the leaf node • Trees are trained on given data for better accuracy • Bayesian methods are based on statistical decision theory • Often applied in pattern recognition, feature extraction, and regression applications • A Bayesian network offers a directed acyclic graph (DAG) model • Its nodes represent random variables and its edges encode their conditional dependencies

  16. Supervised Machine Learning Algorithms (cont.) • e.g., A Bayesian network can represent the probabilistic relationships between diseases and symptoms • Given symptoms, the system computes the probabilities of having various diseases • Many such prediction algorithms are used in medical diagnosis to assist doctors, nurses, and patients in the healthcare industry • Both prior and posterior probabilities are applied in making predictions • The predictions can also be improved with the provision of a better training data set • Support vector machines (SVMs) are often used in supervised learning methods

  17. Supervised Machine Learning Algorithms (cont.) • For regression and classification applications • SVMs decide how to generate a hyperplane • To separate the training sample data space into distinct subspaces • e.g., A plane in a 3D space • An SVM builds a model to predict whether a new sample falls into one subspace or another
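
As a minimal sketch of the hyperplane idea, the snippet below trains a linear SVM on made-up 2D points; scikit-learn's SVC is an assumed tool, not one named by the slides.

    # Linear SVM: find a hyperplane (a line in 2D) separating two classes.
    import numpy as np
    from sklearn.svm import SVC

    X = np.array([[1.0, 1.2], [1.5, 0.8], [1.2, 1.0],    # class 0 samples
                  [3.0, 3.2], [3.5, 2.8], [3.2, 3.0]])   # class 1 samples
    y = np.array([0, 0, 0, 1, 1, 1])

    clf = SVC(kernel="linear").fit(X, y)

    # The separating hyperplane is w.x + b = 0.
    print(clf.coef_, clf.intercept_)
    print(clf.predict([[2.9, 3.1]]))   # predicted to fall in the class-1 subspace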

  18. Unsupervised Machine Learning Algorithms • Unsupervised learning is typically used • In finding special relationships within the data set • No labeled training examples are used in this process • The system is given a set of data to find the patterns and correlations therein • It attempts to reveal hidden structures or properties in the entire input data set • Several reported ML algorithms operate without supervision • Including clustering methods, association analysis, dimension reduction, and artificial neural networks

  19. Unsupervised Machine Learning Algorithms (cont.) • Association rule learning generates inference rules • Used to discover useful associations in large multidimensional data sets • The rules sought are those that best explain observed relationships between variables in the data
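
As a toy illustration of association analysis (the slides name no tool, so this is a plain-Python sketch over a hypothetical transaction database), the support and confidence of a candidate rule can be counted directly:

    # Support and confidence of the candidate rule {bread} -> {milk}.
    transactions = [
        {"bread", "milk"}, {"bread", "butter"},
        {"bread", "milk", "butter"}, {"milk"},
    ]

    def support(itemset):
        # Fraction of transactions containing every item in itemset.
        hits = sum(1 for t in transactions if itemset <= t)
        return hits / len(transactions)

    sup = support({"bread", "milk"})      # joint support = 0.50
    conf = sup / support({"bread"})       # confidence P(milk | bread) ~ 0.67
    print(sup, conf)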

  20. Unsupervised Machine Learning Algorithms (cont.) • These association patterns are often exploited by enterprises or large organizations • e.g., Association rules are generated from input data to identify close-knit groups of friends in a social network database • In clustering, a set of inputs is to be divided into groups • Grouping similar data objects as clusters • Modeled by using centroid-based clustering and/or hierarchical clustering • All clustering methods are based on similarity testing • Unlike supervised classification, the groups are not known in advance • Making this an unsupervised task
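
A minimal centroid-based clustering sketch follows, assuming scikit-learn's KMeans and made-up unlabeled 2D points; similarity testing here is implicit in the Euclidean distance to each centroid.

    # Centroid-based clustering: the groups are not known in advance.
    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[1.0, 1.1], [0.9, 1.3], [1.2, 0.8],    # hypothetical group A
                  [8.0, 8.2], [8.3, 7.9], [7.8, 8.1]])   # hypothetical group B

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)            # cluster assignment for each point
    print(km.cluster_centers_)   # the two learned centroids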

  21. Unsupervised Machine Learning Algorithms (cont.) • Density estimation finds the distribution of inputs in some space • Dimensionality reduction exploits the inherent structure in the data • In an unsupervised manner • The purpose is to summarize or describe data using less information • This is done by visualizing multidimensional data with principal components or dimensions • It then simplifies the inputs by mapping them into a lower-dimensional space • The simplified data can then be applied in a supervised learning method
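
A dimensionality-reduction sketch, assuming scikit-learn's PCA on hypothetical 3D inputs; the projected data could then feed a supervised learner.

    # Map 3D inputs onto their 2 principal components.
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                    # hypothetical input data
    X[:, 2] = X[:, 0] + 0.1 * rng.normal(size=100)   # third dimension is nearly redundant

    pca = PCA(n_components=2)
    X_low = pca.fit_transform(X)                     # lower-dimensional mapping
    print(pca.explained_variance_ratio_)             # information kept per component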

  22. Unsupervised Machine Learning Algorithms (cont.) • Artificial neural networks (ANNs) are cognitive models • Inspired by the structure and function of biological neurons • ANNs try to model the complex relationships between inputs and outputs • They form a class of pattern matching algorithms • Used for solving regression and classification problems
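
To make the "interconnected artificial neurons" idea concrete, here is a bare forward pass through one hidden layer in NumPy; the weights are random placeholders rather than trained values.

    # Forward pass of a tiny artificial neural network (one hidden layer).
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(1)
    x = np.array([0.5, -1.2, 0.3])    # one input sample with 3 features
    W1 = rng.normal(size=(4, 3))      # hidden layer: 4 neurons (untrained weights)
    b1 = np.zeros(4)
    W2 = rng.normal(size=(1, 4))      # output layer: 1 neuron
    b2 = np.zeros(1)

    h = sigmoid(W1 @ x + b1)          # hidden activations
    y = sigmoid(W2 @ h + b2)          # network output in (0, 1)
    print(y)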

  23. Regression Analysis • Regression analysis is widely used in ML for prediction, classification, and forecasting • It essentially performs a sequence of parametric or nonparametric estimations • Finding the causal relationship between the input and output variables • One must be careful in making such predictions • Apparent causality may be an illusion, and false relationships can mislead the users • The estimation function can be determined • By experience, using a priori knowledge or visual observation of the data • One then needs to calculate the undetermined coefficients of the function by using some error criterion

  24. Regression Analysis (cont.) • The regression method can be applied to classify data by predicting the category tag of data • Regression analysis tries to understand • How the value of the dependent variable changes when one independent variable is varied • While the other independent variables are held unchanged • The independent variables are the inputs of the regression process, aka the predictors • The dependent variable is the output of the process • Regression analysis estimates the average value of the dependent variable • When the independent variables are fixed • The estimated value is a function of the independent variables known as the regression function • It can be described by a probability distribution

  25. Regression Analysis (cont.) • Most regression methods are parametric in nature • With a finite dimension in the analysis space • Nonparametric regression may be infinite-dimensional • Accuracy or performance depends on the quality of the data set used • It is related to the data generation process and the underlying assumptions made • Regression offers estimation of continuous response variables • As opposed to the discrete decision values used in classification, which demand higher accuracy

  26. Regression Analysis (cont.) • In the formulation of a regression process • The unknown parameters are often denoted as β • They may appear as a scalar or a vector • The independent variables are denoted by a vector X and the dependent variable by Y • When multiple dimensions are involved, these parameters are vectors in form • A regression model establishes the approximated relation between X, β, and Y as Y ≈ f(X, β) • The function f(X, β) approximates the expected value E(Y|X)

  27. Regression Analysis (cont.) • The regression function f is based on the knowledge of the relationship between a continuous variable Y and vector X • If no such knowledge is available, an approximated handy form is chosen for f • Example: Measuring the Height after Tossing a Small Ball in the Air • Measure its height of ascent h at various time instants t • The relationship is modeled as h = β1t + β2t² + ε • β1 determines the initial velocity of the ball • β2 is proportional to standard gravity • ε is due to measurement errors
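
Because the model is linear in β1 and β2, ordinary least squares applies directly; a NumPy sketch with hypothetical measurements (chosen to roughly follow h = 10t − 4.9t²):

    # Fit h = b1*t + b2*t^2 by least squares; linear in the parameters b1, b2.
    import numpy as np

    t = np.array([0.1, 0.2, 0.3, 0.4, 0.5])          # measurement times (s)
    h = np.array([0.95, 1.80, 2.56, 3.22, 3.77])     # hypothetical heights (m)

    A = np.column_stack([t, t**2])                   # design matrix [t, t^2]
    (b1, b2), *_ = np.linalg.lstsq(A, h, rcond=None)
    print(b1, b2)   # roughly 10 (initial velocity) and -4.9 (proportional to -g)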

  28. Regression Analysis (cont.) • Linear regression is used to estimate the values of β1 and β2 from the measured data • This model is nonlinear with respect to the time variable t • But it is linear with respect to the parameters β1 and β2 • Consider k components in the vector of unknown parameters β • Three cases relate the inputs to the outputs • Depending on the relative magnitude between the number N of observed data points of the form (X, Y) and the dimension k of the sample space • When N < k, most classical regression analysis methods cannot be applied

  29. Regression Analysis (cont.) • The defining equation is underdetermined • There is not enough data to recover the unknown parameters β • When N = k and the function f is linear • The equation Y = f(X, β) can be solved exactly without approximation • There are N equations to solve for the N components in β • The solution is unique as long as the X components are linearly independent • If f is nonlinear, many solutions may exist, or no solution at all • In general, there are N > k data points • Then there is enough information in the data to estimate a unique value for β under this overdetermined situation • The measurement errors εi are assumed to follow a normal distribution

  30. Regression Analysis (cont.) • There exists an excess of information contained in the (N − k) extra measurements • Known as the degrees of freedom of the regression • Basic assumptions of regression analysis under various error conditions • The sample is representative of the data space involved • The error is a random variable with a mean of zero conditioned on the input variables • The independent variables are measured with no error • The predictors are linearly independent • The errors are uncorrelated

  31. Regression Analysis (cont.) • The variance of the error is constant across observations • If not, a weighted least squares method is needed • Regression analysis is a statistical method • It determines the quantitative relation, in a machine learning process, in which two or more variables depend on each other • It includes linear regression and nonlinear regression

  32. Linear Regression • Unitary linear regression analysis • Only one independent variable and one dependent variable are included in the analysis • The relation between the two can be approximately represented by a straight line • Multivariate linear regression analysis • Two or more independent variables are included in the regression analysis • A linear relation holds between the dependent variable and the independent variables • The model of a linear regression is y = f(X) • X = (x1, x2,⋯, xn) with n ≥ 1 is a multidimensional vector and y is a scalar variable

  33. Linear Regression (cont.) • A linear predictor function is used to estimate the unknown parameters from data • Linear regression is applied mainly in two areas • As an approximation process for prediction, forecasting, or error reduction • Predictive linear regression models are fitted to an observed data set of y and X values • The fitted model makes a prediction of the value of y for a future unknown input vector X • And to quantify the strength of the relationship between output y and each input component Xj • Assessing which Xj is irrelevant to y and which subsets of the Xj contain redundant information about y

  34. Unitary Linear Regression • Fit linear regression models with a least squares approach • Consider a set of data points in a 2D sample space (x1, y1), (x2, y2), ..., (xn, yn) • Mapped into a scatter diagram • They may be covered approximately by a straight line: y = ax + b + ε • x is an input variable, y is an output variable in the real number range, a and b are coefficients • ε is a random error that follows a normal distribution • One needs to work out the expectation by using the linear regression expression: y = ax + b

  35. Unitary Linear Regression (cont.) • The residual error of a unitary model is εi = yi − (axi + b) • The approximation is shown by a straight line • Passing amid the middle or center of all data points in the data space • The main task of regression analysis is as follows

  36. Unitary Linear Regression (cont.) • To conduct estimations for the coefficients a and b via observation of n groups of input samples • The common method applies a least squares approach • The objective function is given by Q(a, b) = Σi (yi − axi − b)² • To minimize the sum of squares, calculate the partial derivatives of Q • With respect to a and b, and set them to zero

  37. Unitary Linear Regression (cont.) • The solutions are a = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)² and b = ȳ − ax̄ • x̄ and ȳ are the mean values of the input variable and the dependent variable, respectively • After working out the specific expression for the model • One needs to know its fitting degree to the dataset • Whether the expression can express the relation between the two variables and can be used in actual predictions • To do so, figure out the estimated value ŷi = axi + b of the dependent variable • For each sample in the training data set

  38. Unitary Linear Regression (cont.) • The coefficient of determination is R² = 1 − Σi (yi − ŷi)² / Σi (yi − ȳ)² • The closer R² is to 1, the better the fitting degree is • The further R² is from 1, the worse the fitting degree is • Linear regression can also be used for classification • It is only used in binary classification problems • To decide between the two classes • For multivariate linear regression, this method can likewise be applied to classify a dataset
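
Putting slides 34–38 together, a NumPy sketch on hypothetical samples: estimate a and b from the closed-form least-squares expressions, then score the fit with R².

    # Unitary linear regression: closed-form least squares plus R^2.
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])    # hypothetical data, roughly y = 2x

    x_bar, y_bar = x.mean(), y.mean()
    a = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    b = y_bar - a * x_bar

    y_hat = a * x + b                            # estimated dependent values
    r2 = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y_bar) ** 2)
    print(a, b, r2)                              # R^2 close to 1 -> good fit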

  39. Multiple Linear Regression • When solving actual problems • One often encounters many variables • e.g., The scores of a student may be influenced by factors like earnestness in class, preparation before class, and review after class • e.g., The health of a person is influenced not only by the environment but also by dietary habits • The model of unitary linear regression is not adequate for many such conditions • It is improved by a model of multivariate linear regression analysis • Consider the case of m input variables • The output is expressed as a linear combination of the input variables, as given on the next slide

  40. Multiple Linear Regression (cont.) • The model is y = β0 + β1x1 + ⋯ + βmxm + ε • β0, β1,⋯, βm, σ² are unknown parameters • ε complies with a normal distribution • Its mean value is 0 and its variance is equal to σ² • Working out the expectation yields the multivariate linear regression equation E(y) = β0 + β1x1 + ⋯ + βmxm • With ŷ substituted for E(y) • Its matrix form is given as y = Xβ • X = [1, x1,⋯, xm], β = [β0, β1,⋯, βm]T • Our goal is to compute the coefficients • By minimizing the scalar objective function

  41. Multiple Linear Regression (cont.) • Q(β) = Σi (yi − Xiβ)², defined over n sample data points • To minimize Q, make the partial derivative of Q with respect to each βi zero • This yields the multiple linear regression solution β̂ = (XTX)−1XTy
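
Setting the partial derivatives of Q to zero gives the normal equations (XTX)β = XTy; the sketch below solves them on simulated data (sizes and coefficients are made up).

    # Multiple linear regression via the normal equations.
    import numpy as np

    rng = np.random.default_rng(2)
    n, m = 50, 3
    X_raw = rng.normal(size=(n, m))               # n samples, m input variables
    beta_true = np.array([0.5, 1.0, -2.0, 3.0])   # [b0, b1, b2, b3], simulation only
    X = np.column_stack([np.ones(n), X_raw])      # prepend the constant column
    y = X @ beta_true + 0.1 * rng.normal(size=n)  # noisy observations

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # solve (X^T X) beta = X^T y
    print(beta_hat)                               # close to beta_true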

  42. Multiple Linear Regression (cont.) • Multivariate regression is an expansion and extension of unitary regression • They are identical in nature • But their ranges of application differ • Unitary regression has limited applications • Multivariate regression is applicable to many real-life problems • Example: Estimate the Density of the Pollutant Nitric Oxide at a Spotted Location • Estimation of the density of nitric oxide (NO) gas, an air pollutant, in an urban location • Vehicles discharge NO gas during their movement

  43. Multiple Linear Regression (cont.) • This creates a pollution problem proven harmful to human health • The NO density is attributed to four input variables • Vehicle traffic, temperature, air humidity, and wind velocity • 16 data points were collected at various spotted locations observed in the city • Apply the multiple linear regression method to estimate the NO density • A test spotted location is measured with a data vector of {1436, 28.0, 68, 2.00} for the four features {x1, x2, x3, x4}, respectively • X = [1, xn1, xn2, xn3, xn4]T and the weight vector W = [b, β1, β2, β3, β4]T for n = 1, 2, …, 16

  44. Multiple Linear Regression (cont.) • [Table of the 16 training samples: vehicle traffic x1, temperature x2, air humidity x3, wind velocity x4, and measured NO density y]

  45. Multiple Linear Regression (cont.) • e.g., For the first row of training data, [1300, 20, 80, 0.45, 0.066], X1 = [1, 1300, 20, 80, 0.45]T, which gives the output value y1 = 0.066 • We need to compute W = [b, β1, β2, β3, β4]T and minimize the mean square error • The 16 × 5 matrix X is directly obtained from the sample data table • y = [0.066, 0.005, …, 0.039]T is the given column vector of data labels

  46. Multiple Linear Regression (cont.) • To make the prediction on the testing sample vector x = [1, 1436, 28.0, 68, 2.00]T given above • By substituting in the weight vector obtained • The final answer is {β1 = 0.029, β2 = 0.015, β3 = 0.002, β4 = −0.029, b = 0.070} • The NO gas density is predicted as ŷ = 0.065 or 6.5%
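
A sketch of the NO-density computation follows. Only the first training row and the test vector come from the slides; the remaining 15 rows of the sample data table are not reproduced here, so the placeholder must be filled in before the numbers can be checked.

    # Least-squares fit of y = b + beta1*x1 + ... + beta4*x4 for the NO density.
    import numpy as np

    data = np.array([
        [1300, 20.0, 80, 0.45, 0.066],   # traffic, temp, humidity, wind, density
        # ... the other 15 rows of the sample data table go here ...
    ])
    X = np.column_stack([np.ones(len(data)), data[:, :4]])  # 16 x 5 once complete
    y = data[:, 4]                                          # NO-density labels

    W, *_ = np.linalg.lstsq(X, y, rcond=None)               # W = [b, beta1..beta4]

    x_test = np.array([1, 1436, 28.0, 68, 2.00])            # test location
    print(x_test @ W)                                       # predicted NO density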

  47. Logistic Regression Method • A linear regression analysis model can be extended to broader applications • For prediction and classification • Commonly used in fields like data mining, automatic diagnosis of diseases, and economic prediction • The logistic model may only be used to solve problems of dichotomy (binary classification) • The principle is to classify sample data with a logistic function • The logistic function is expressed as g(z) = 1/(1 + e^(−z)) • Known as a sigmoid function

  48. Logistic Regression Method (cont.) • The input domain of the sigmoid function is (−∞, +∞) and the range is (0, 1) • The sigmoid output can be interpreted as a probability for the sample data • The basic idea of logistic regression • Sample data become concentrated at both ends of the sigmoid curve through the use of an intermediate feature z of the sample • So the data can be divided into two classes

  49. Logistic Regression Method (cont.) • Consider vector X = (x1,⋯, xm) with m independent input variables • Each dimension of X stands for one attribute (feature) of the sample data (training data) • Multiple features of the sample data are combined into one feature by z = w0 + w1x1 + ⋯ + wmxm • One needs to figure out the probability of the feature with designated data • And apply the sigmoid function to act on that feature

  50. Logistic Regression Method (cont.) • In combining multiple features into one feature • Make use of the linear function above • The coefficients of the linear function, i.e., the feature weights of the sample data, need to be determined • Maximum likelihood estimation is adopted to transform this into an optimization problem • The coefficients are then determined through the optimization method, as sketched below
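
A sketch of the whole pipeline under these slides' assumptions: combine the features linearly into z, squash with the sigmoid, and fit the weights by gradient descent on the negative log-likelihood (one common way to carry out the maximum likelihood optimization). The data are simulated.

    # Logistic regression fitted by gradient descent on the negative log-likelihood.
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(3)
    n, m = 200, 2
    X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])  # constant + features
    w_true = np.array([-0.5, 2.0, -1.0])
    y = (rng.random(n) < sigmoid(X @ w_true)).astype(float)     # simulated 0/1 labels

    w = np.zeros(m + 1)
    for _ in range(2000):
        p = sigmoid(X @ w)            # predicted probabilities
        w -= 0.1 * X.T @ (p - y) / n  # gradient step on the negative log-likelihood
    print(w)                          # close to w_true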
