Supervised Machine Learning Algorithms
Taxonomy of Machine Learning Methods • The main idea of machine learning (ML) • To use computers to learn from massive amounts of data • For tedious or unstructured data, machines can often make better and more unbiased decisions than a human learner • ML forms the core of artificial intelligence (AI) • Especially in the era of big data • A computer program must be written based on a model algorithm • By learning from given data objects, the program can reveal the categorical class or experience affiliation of future data to be tested • This essentially defines ML as an operational term
Taxonomy of Machine Learning Methods (cont.) • To implement an ML task • Need to explore or construct computer algorithms to learn from data • Make predictions on data based on their specific features, similarities, or correlations • ML algorithms operate by building a decision-making model from sample data inputs • The model defines the relationship between features and labels • A feature is an input variable for the algorithm • A label is an output variable for the algorithm • The outputs are data-driven predictions or decisions • One can handle the ML process by finding the best fit to solve the decision problem based on the characteristics of the data sets
Classification by Learning Paradigms • ML algorithms can be built with different styles in order to model a problem • The style is dictated by the interaction with the data environment • Expressed as the input to the model • The data interaction style decides the learning models that an ML algorithm can produce • The user must understand the roles of the input data and the model’s construction process • The goal is to select the ML model that can solve the problem with the best prediction result • ML sometimes overlaps with the goal of data mining
Classification by Learning Paradigms (cont.) • Three classes of ML algorithms based on different learning styles • Supervised, unsupervised, and semi-supervised • All three methods are viable in real-life applications • The style hinges on how training data is used in the learning process
Classification by Learning Paradigms (cont.) • Supervised learning • The input data is called training data and has a known label or result • A model is constructed through training on the training dataset • The model is improved by receiving feedback on its predictions • The learning process continues until the model achieves a desired level of accuracy on the training data • Future incoming data without known labels can then be tested on the model with an acceptable level of accuracy • Unsupervised learning • None of the input data is labeled with a known result
Classification by Learning Paradigms (cont.) • A model is generated by exploring the hidden structures present in the input data • To extract general rules, go through a mathematical process to reduce redundancy, or organize data by similarity testing • Semi-supervised learning • The input data is a mixture of labeled and unlabeled examples • The model must learn the structures to organize the data in order to make predictions possible • Under different assumptions on how to model the unlabeled data
Supervised Machine Learning Algorithms • In a supervised ML system • The computer learns from a training data set of {input, output} pairs • The input comes from sample data given in a certain format • e.g., the credit reports of borrowers • The output may be discrete • e.g., yes or no to a loan application • The output can also be continuous • e.g., the probability distribution that the loan can be paid off in a timely manner • The goal is to work out a reliable ML model • One that can map new, previously unseen inputs to the correct outputs
Supervised Machine Learning Algorithms (cont.) • Four families of supervised ML algorithms • Regression, decision trees, Bayesian networks, and support vector machines • The ML system acts like a finely tuned predictor function g(x) • The learning system is built with a sophisticated algorithm to optimize this function • e.g., given input data x in the credit report of a borrower, the bank will make a loan decision based on the predicted outcome • The learning process is iteratively refined using an error criterion to make better predictions • It minimizes the error between the predicted value and the actual outcome observed in the input data
Supervised Machine Learning Algorithms (cont.) • The iterative trial-and-error process is suggested for machine learning algorithms to train a model
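A minimal sketch of this trial-and-error loop, assuming a one-feature linear predictor g(x) = w·x + b trained under a squared-error criterion; the data values and learning rate are invented for illustration.

```python
import numpy as np

# Illustrative {input, output} training pairs (invented values)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

w, b = 0.0, 0.0   # initial guesses for the predictor g(x) = w*x + b
lr = 0.01         # learning rate for each trial-and-error step

for step in range(2000):
    pred = w * X + b                 # current predictions
    err = pred - y                   # error against the actual training labels
    w -= lr * 2 * np.mean(err * X)   # adjust weights to reduce squared error
    b -= lr * 2 * np.mean(err)

print(w, b)  # converges near the best-fit slope and intercept
```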
Regression Analysis • The outputs of regression are continuous rather than discrete • Regression finds the causal relationship between the input and output variables • It applies mathematical statistics to establish dependent and independent variables in learning • The independent variables are the inputs of the regression process, aka the predictors • The dependent variable is the output of the process • Regression essentially performs a sequence of parametric or nonparametric estimations • Care must be taken in making predictions • Assumed causality may produce illusory or false relationships that mislead users
Regression Analysis (cont.) • The estimation function can be determined • By experience using a priori knowledge or visual observation of the data • The regression method can be applied to classify data by predicting the category tag of data • Regression analysis determines the quantitative relation in a learning process • How the value of the dependent variable changes • When any independent variable varies while the other independent variables are left unchanged • Regression analysis estimates the average value of the dependent variable when the independent variables are fixed
Regression Analysis (cont.) • The estimated value is a function of the independent variables known as the regression function • Can be described by a probability distribution • Most regression methods are parametric in nature • Need to calculate the undetermined coefficients of the function by using some error criteria • With a finite dimension in the analysis space • Nonparametric regression may be infinite-dimensional • Accuracy or performance depends on the quality of the dataset used • Related to the data generation process and the underlying assumptions made
Regression Analysis (cont.) • Regression offers estimation of continuous response variables • As opposed to the discrete decision values used in classification, which demand higher accuracy • In the formulation of a regression process • The unknown parameters are often denoted as β • May appear as a scalar or a vector • The independent variables are denoted by a vector X and the dependent variable by Y • When multiple dimensions are involved, these parameters are vectors in form • A regression model establishes the approximated relation between X, β, and Y: Y ≈ f(X, β)
Regression Analysis (cont.) • The function f(X, β) is approximated by the expected value E(Y|X) • The regression function f is based on the knowledge of the relationship between a continuous variable Y and vector X • If no such knowledge is available, an approximated handy form is chosen for f • Measuring the Height after Tossing a Small Ball in the Air • Measure its height of ascent h at various time instants t • The relationship is modeled as h = β1t + β2t2 + ε • β1 determines the initial velocity of the ball
Regression Analysis (cont.) • β2 is proportional to standard gravity • ε is due to measurement errors • Linear regression is used to estimate the values of β1 and β2 from the measured data, as sketched below • This model is nonlinear with respect to the time variable t • But it is linear with respect to the parameters β1 and β2 • Consider k components in the vector of unknown parameters β • Three cases relate the inputs to the outputs • Depending on the relative magnitude between the number N of observed data points of the form (X, Y) and the dimension k of the sample space
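A short sketch of estimating β1 and β2 for the ball-toss model; the (t, h) measurements are invented, and NumPy's least-squares solver stands in for the fitting step. The model is nonlinear in t but linear in the parameters, so ordinary least squares applies.

```python
import numpy as np

# Hypothetical (t, h) measurements of the tossed ball (invented values)
t = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
h = np.array([1.46, 2.79, 4.08, 5.20, 6.28, 7.22])

# h = b1*t + b2*t^2 + eps: build the design matrix with columns [t, t^2]
A = np.column_stack([t, t ** 2])
(b1, b2), *_ = np.linalg.lstsq(A, h, rcond=None)
print(b1, b2)   # b1 ~ initial velocity, b2 ~ -g/2 (negative)
```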
Regression Analysis (cont.) • When N < k, most classical regression analysis methods cannot be applied • The defining equation is underdetermined • There is not enough data to recover the unknown parameters β • When N = k and the function f is linear • The equation Y = f(X, β) can be solved exactly without approximation • There are N equations to solve the N components in β • The solution is unique as long as the X components are linearly independent • If f is nonlinear, many solutions may exist, or no solution at all
Regression Analysis (cont.) • In general, the situation with N > k data points • There is enough information in the data to estimate a unique value for β under this overdetermined situation • Assuming the measurement errors εi follow a normal distribution • There exists an excess of information contained in the (N − k) extra measurements • Known as the degrees of freedom of the regression • Regression with a Necessary Set of Independent Measurements • Need the necessary number of independent data points to perform the regression analysis of continuous data measurements
Regression Analysis (cont.) • Consider a regression model with four unknown parameters, 𝛽0, 𝛽1, 𝛽2 and 𝛽3 • An experimenter performs 10 measurements • All at exactly the same value of the independent variable vector X = (X1, X2, X3, X4) • Regression analysis fails to give a unique set of estimated values for the four unknown parameters • There is not enough information to perform the prediction • One can only estimate the average value and the standard deviation of the dependent variable Y • Measuring at two different values of X • Only gives enough data for a regression with two unknowns, but not for three or more unknowns • Only if measurements are performed at four different values of the independent variable vector X
Regression Analysis (cont.) • Regression analysis will provide a unique set of estimates for the four unknown parameters in β • Basic assumptions on regression analysis under various error conditions • The sample is representative of the data space involved • The error is a random variable with a mean of zero conditioned over the input variables • The independent variables are measured with no error • The predictors are linearly independent • The errors are uncorrelated • The variance of error is a constant across observations
Linear Regression • Regression analysis includes linear regression and nonlinear regression • Unitary linear regression analysis • Only one independent variable and one dependent variable are included in the analysis • The approximate relation between the two can be represented by a straight line • Multivariate linear regression analysis • Two or more independent variables are included in the regression analysis • A linear relation holds between the dependent variable and the independent variables • The model of a linear regression is y = f(X)
Linear Regression (cont.) • X = (x1, x2,⋯, xn) with n 1 is a multidimensional vector and y is scalar variable • f(X) is a linear predictor function used to estimate the unknown parameters from data • Linear regression is applied mainly in the two areas • An approximation process for prediction, forecasting, or error reduction • Predictive linear regression models for an observed data set of y and X values • The fitted model makes a prediction of the value of y for future unknown input vector X • To quantify the strength of the relationship between output y and each input component Xj
Linear Regression (cont.) • Assess which Xj is irrelevant to y and which subsets of the Xj contain redundant information about y • Major steps in linear regression
Unitary Linear Regression • Crickets chirp more frequently on hotter days than on cooler days
Unitary Linear Regression (cont.) • Consider a set of data points in a 2D sample space (x1, y1), (x2, y2), ..., (xn, yn) • Mapped into a scatter diagram • If they can be covered approximately by a straight line: y = ax + b + ε • x is an input variable, y is an output variable in the real number range, a and b are coefficients • ε is a random error that follows a normal distribution with mean E(ε) = 0 and variance Var(ε) = σ2 • Work out the expectation by using the linear regression expression: y = ax + b • The main task is to estimate the coefficients a and b via observation of n groups of input samples
Unitary Linear Regression (cont.) • Fit linear regression models with a least squares approach • The approximation is shown by a straight line • Passing amid the middle or center of all data points in the data space • The residual error (loss) of a unitary model is ei = yi − (axi + b) for each sample point (xi, yi)
Unitary Linear Regression (cont.) • The convex objective function is given by Q(a, b) = Σi (yi − axi − b)2 • To minimize the sum of squares, calculate the partial derivatives of Q with respect to a and b, and set them to zero • This yields a = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)2 and b = ȳ − ax̄ • x̄ and ȳ are the mean values of the input variable and the dependent variable, respectively
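A minimal sketch of these closed-form estimates; the data arrays are invented for illustration.

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])    # illustrative inputs
y = np.array([3.1, 6.9, 11.2, 14.8, 19.1])  # illustrative outputs

x_bar, y_bar = x.mean(), y.mean()           # sample means
a = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b = y_bar - a * x_bar                       # from the zeroed partial derivatives
print(a, b)
```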
Unitary Linear Regression (cont.) • After working out the specific expression for the model • Need to know its fitting degree to the dataset • Whether the expression can express the relation between the two variables and can be used in actual predictions • Figure out the estimated value of the dependent variable ŷi = axi + b • For each sample in the training data set
Unitary Linear Regression (cont.) • The closer the coefficient of determination R2 is to 1, the better the fitting degree is • The further R2 is from 1, the worse the fitting degree is • Linear regression can also be used for classification • It is only used in a binary classification problem • To decide between the two classes • For multivariate linear regression, this method can also be applied to classify a dataset
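Continuing the sketch above, a hedged illustration of computing R2 from the fitted line (it reuses the a, b, x, y, and y_bar values from the previous block).

```python
y_hat = a * x + b                    # estimated values for each training sample
ss_res = np.sum((y - y_hat) ** 2)    # residual sum of squares
ss_tot = np.sum((y - y_bar) ** 2)    # total sum of squares
r2 = 1 - ss_res / ss_tot             # coefficient of determination R^2
print(r2)                            # near 1 => good fit
```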
Unitary Linear Regression (cont.) • Healthcare Data Analysis • Obesity is reflected by the body weight index • An obese person is more likely to have high blood pressure or diabetes • Predict the relationship between obesity and high blood pressure • The dataset contains the body weight index and blood pressure of some people at a hospital in Wuhan
Unitary Linear Regression (cont.) • Conduct a preliminary judgment on the blood pressure of a person with a body weight index of 24 • A prediction model with two variables • The unitary linear regression model may be considered • Determine the distribution of the data points • Scatter diagram of body weight index versus blood pressure
Unitary Linear Regression (cont.) • All data points lie almost on or below the straight line • They are linearly distributed • The data space is modeled by a unitary linear regression process • By the least squares method • We get a = 1.32 and b = 96.58 • Therefore we have y = 1.32x + 96.58 • A significance test is needed to verify whether the model fits well with the current data • A prediction is made through calculation • The mean residual and coefficient of determination of the model are: average error 1.17 and R2 = 0.90
Unitary Linear Regression (cont.) • The mean residual is much less than the mean value 125.6 of blood pressure • The coefficient of determination is close to 1 • This regression equation is significant • It fits the dataset well • Predictions may be conducted for unknown data on this basis • Given the body weight index, the blood pressure of a person may be determined with the model • Substituting 24 for x • We get the blood pressure of that person as y = 1.32 × 24 + 96.58 ≈ 128
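A one-line check of this prediction, plugging the slide's reported coefficients into the fitted line.

```python
a, b = 1.32, 96.58        # coefficients reported for the hospital dataset
x = 24                    # body weight index of the person
y = a * x + b             # predicted blood pressure
print(y)                  # 128.26, i.e., about 128
```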
Multiple Linear Regression • In solving actual problems • We often encounter many variables • e.g., the scores of a student may be influenced by factors like earnestness in class, preparation before class, and review after class • e.g., the health of a person is influenced not only by the environment but also by dietary habits • The model of unitary linear regression is not adequate for these conditions • Improve it with a model of multivariate linear regression analysis • Consider the case of m input variables • The output is expressed as a linear combination of the input variables: y = 𝛽0 + 𝛽1x1 + ⋯ + 𝛽mxm + ε
Multiple Linear Regression (cont.) • 𝛽0, 𝛽1,⋯, 𝛽m, 𝜎2 are unknown parameters • ε complies with normal distribution • The mean value is 0 and the variance is equal to 𝜎2 • By working out the expectation for the structure to get the multivariate linear regression equation • Substituted y for E(y) • Its matrix form is given as E(y) = X𝛽 • X = [1, x1,⋯, xm], 𝛽 = [𝛽0, 𝛽1,⋯, 𝛽m]T • Our goal is to compute the coefficients by minimizing the objective function
Multiple Linear Regression (cont.) • The objective function, defined over the n sample data points, is Q(𝛽) = Σi (yi − Xi𝛽)2, where Xi = [1, xi1,⋯, xim] • To minimize Q, set the partial derivative of Q with respect to each βi to zero • Solving the resulting equations yields the multiple linear regression equation, with coefficients 𝛽 = (XTX)−1XTy in matrix form
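A compact sketch of the multivariate fit on invented data; np.linalg.lstsq solves the zeroed-derivative (normal) equations numerically, so the recovered coefficients approach the true ones.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 50, 3                                # 50 samples, 3 input variables
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])  # rows [1, x1..xm]
beta_true = np.array([2.0, 0.5, -1.0, 3.0])                 # invented weights
y = X @ beta_true + 0.1 * rng.normal(size=n)                # noisy outputs

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes Q(beta)
print(beta_hat)                                   # close to beta_true
```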
Multiple Linear Regression (cont.) • Multivariate regression is an expansion and extension of unitary regression • The two are identical in nature • But their ranges of application differ • Unitary regression has limited applications • Multivariate regression is applicable to many real-life problems • Estimate the Density of the Pollutant Nitric Oxide in a Spotted Location • Estimation of the density of nitric oxide (NO) gas, an air pollutant, in an urban location • Vehicles discharge NO gas during their movement
Multiple Linear Regression (cont.) • Creates a pollution problem proven harmful to human health • The NO density is attributed to four input variables • Vehicle traffic, temperature, air humidity, and wind velocity • 16 data points collected in various observed spotted locations in the city • Apply the multiple linear regression method to estimate the NO density • In testing a spotted location measured with a data vector of {1436, 28.0, 68, 2.00} for four features {x1, x2, x3, x4}, respectively • X = [1, xn1, xn2, xn3, xn4]T and the weight vector W = [b, β1, β2, β3, β4]T for n = 1,2,.…,16
Multiple Linear Regression (cont.) • e.g., for the first row of training data, [1300, 20, 80, 0.45, 0.066], X1 = [1, 1300, 20, 80, 0.45]T, which gives the output value y1 = 0.066 • Need to compute W = [b, β1, β2, β3, β4]T and minimize the mean square error • The 16 × 5 matrix directly obtained from the sample data table • y = [0.066, 0.005,.…, 0.039]Tis the given column vector of data labels
Multiple Linear Regression (cont.) • To make the prediction results on the testing sample vector x = [1, 1300, 20, 80, 0.45]T • By substituting the weight vector obtained • The final answer is {β1 = 0.029, β2 = 0.015, β3 = 0.002, β4 = −0.029, b = 0.070} • The NO gas density is predicted as = 0.065 or 6.5%
Logistic Regression Method • Many problems require a probability estimate as output • Logistic regression is an extremely efficient mechanism for calculating probabilities • Commonly used in fields like data mining, automatic disease diagnosis, and economic prediction • The logistic model may be used to solve binary classification problems • In solving a classification problem • The inputs are divided into two or more classes • The learner must produce a model that assigns unseen inputs to one or more of these classes • Typically tackled in a supervised way
Logistic Regression Method (cont.) • Spam filtering is a good example of classification • The inputs are e-mails, blogs, or document files • The output classes are spam and non-spam • For logistic regression classification • The principle is to classify the sample data with a logistic function • The function maps the logistic regression output to probabilities • Known as a sigmoid function: y = 1/(1 + e−z) • The input domain of the sigmoid function is (−∞, +∞) and its range is (0, 1) • One can regard the sigmoid output as a probability estimate for the sample data
Logistic Regression Method (cont.) • The function curve is sensitive near z = 0 • And not sensitive when z ≫ 0 or z ≪ 0
Logistic Regression Method (cont.) • The basic idea of logistic regression • Sample data may be concentrated at both ends of the sigmoid range (0, 1) by the use of an intermediate feature z of the sample • Thus the data can be divided into two classes • Consider a vector X = (x1,⋯, xm) with m independent input variables • Each dimension of X stands for one attribute (feature) of the sample data (training data) • Multiple features of the sample data are combined into one feature z by a linear function • Figure out the probability of the z feature for the designated data
Logistic Regression Method (cont.) • And apply the sigmoid function to act on that feature • Obtain the expression for the logistic regression • During combining of multiple features into one feature
Logistic Regression Method (cont.) • Make use of the linear function • The coefficient of the linear function, i.e., feature weight of sample data, needs to be determined • Maximum likelihood Estimation (MLE) is adopted to transform it into an optimization problem • Attempts to find the parameter values that maximize the likelihood function, given the observations • The coefficient is determined through the optimization method • The loss function is Log Loss • D is the data set containing many labeled examples, i.e., (x, y) pairs
Logistic Regression Method (cont.) • y is the label in a labeled example and its value must either be 0 or 1 • y′ is the predicted value, somewhere between 0 and 1, given the set of features in x • Minimizing this negative logarithm of the likelihood function yields a maximum likelihood estimate • Logistic regression returns a probability • To map a regression value to a binary category must define a classification or decision threshold • Thresholds are problem-dependent • Tempting to assume that it should always be 0.5 • Its value must be tuned • Part of choosing a threshold is assessing how much one will suffer for making a mistake
Logistic Regression Method (cont.) • General steps for logistic regression • Accuracy is one metric for evaluating classification models • The fraction of predictions the model gets right
Logistic Regression Method (cont.) • Four possible statuses for binary classification • TP (True Positive) refers to an outcome where the model correctly predicts the positive class • TN (True Negative) means an outcome where the model correctly predicts the negative class • FP (False Positive) is an outcome where the model incorrectly predicts the positive class • FN (False Negative) an outcome where the model incorrectly predicts the negative class