Feedback from last week
Too many slides / too much information in too little time:
AI is a complex topic with many different subjects, and the overall lecture time is not enough (1.5 h + practice).
Use the summary and the learning results: each slide is there to help you understand the topic in full, but the focus of what you need to learn is defined by the learning results.
Trust us: this is the first lecture of its kind, and we will align the learning results with the exam questions.
Live coding / practice coding: the code could not be read on the beamer.
Practice code and live code will be uploaded before the lecture, and Jupyter notebooks will be used for better explanation.
Objectives for Lecture 3: Regression Depth of understanding • After the lecture you are able to…
Chapter: Motivation • Chapter: Linear Models • Chapter: Loss functions • Chapter: Regularization & Validation • Chapter: Practical considerations • Chapter: Summary
Motivation – Regression Example
Figure: data points with the input variable (size in m) on the horizontal axis and the output variable or label (weight in kg) on the vertical axis, together with the regression result.
Motivation – Algorithms in Machine Learning
• House pricing • Sales • A person's weight • Object detection • Spam detection • Cancer detection • Genome patterns • Google News • Point cloud (lidar) processing
Motivation – Regression in Automotive Technology: Sensor calibration
• Usually electric quantities are measured
• It is necessary to convert them to physical quantities
• Examples: accelerometers, gyroscopes, displacement sensors
Motivation – Regression in Automotive Technology: Parameter estimation
• Vehicle parameters are often only roughly known
• They can be estimated via regression techniques
Motivation – Regression in Automotive Technology: Vehicle pricing
• Regression is widely used for financial relations
• It allows compressing data into a simple model and evaluating derivatives
Motivation – Why should you use Regression?
Diagram: model structure and training data are combined into a predictive model, which maps previously unseen sets of input variables to predictions about the output variables.
• Based on the combination of data and model structure, it is possible to predict the outcome of a process or system
• The training dataset is usually only a representation at sparse points and contains lots of noise
• This allows the usage of the information in simulation, optimization, etc.
Relation of statistics and machine learning
How can we extract information from data and use it to reason and predict in beforehand unseen cases? (learning)
• Nearly all classic machine learning methods can be reinterpreted in terms of statistics
• The focus in machine learning is mainly on prediction
• Statistics often focuses on relation analysis
• Lots of advanced regression techniques build upon a statistical interpretation of regression
Chapter: Motivation • Chapter: Linear Models • Chapter: Loss functions • Chapter: Regularization & Validation • Chapter: Practical considerations • Chapter: Summary
Linear Basis Function Model
y(x, w) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j(x)
with input variables x, output variable y, bias term w_0, weight parameters w_j, and basis functions \phi_j.
Representing the dataset as a matrix
Collecting the targets in the output vector t = (t_1, ..., t_N)^T and the weights in the weight vector w = (w_0, ..., w_{M-1})^T, the model evaluated at all data points becomes \Phi w, where \Phi is the design matrix with entries \Phi_{nj} = \phi_j(x_n).
Basis functions – examples
• Linear function
• Polynomial function
• Sinusoidal function
• Gaussian basis function
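For reference, commonly used forms of these basis functions for a one-dimensional input x (standard textbook definitions in the style of Bishop's Pattern Recognition and Machine Learning; the exact parameterization on the slides may differ):

\phi(x) = x (linear),
\phi_j(x) = x^j (polynomial),
\phi_j(x) = \sin(\omega_j x + \varphi_j) (sinusoidal with frequency \omega_j and phase \varphi_j),
\phi_j(x) = \exp\!\left(-\frac{(x - \mu_j)^2}{2 s^2}\right) (Gaussian with mean \mu_j and width s).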
Basis functions – Polynomials
• Globally defined on the independent variable domain
• The design matrix becomes ill-conditioned for large input domain variables with standard polynomials
• Hyperparameter: polynomial degree
Basis functions – Gaussians
• Locally defined on the independent variable domain
• Sparse design matrix
• Infinitely differentiable
• Hyperparameters: number of Gaussian functions, width of each basis function, mean of each basis function
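As a minimal sketch (not the lecture's practice code), the following NumPy snippet builds design matrices for a polynomial and a Gaussian basis; the function names and the one-dimensional input are illustrative assumptions:

```python
import numpy as np

def polynomial_design_matrix(x, degree):
    """Global polynomial basis: columns are x**0, x**1, ..., x**degree."""
    x = np.asarray(x, dtype=float)
    return np.vstack([x**j for j in range(degree + 1)]).T

def gaussian_design_matrix(x, means, width):
    """Local Gaussian basis: a constant bias column plus one bump per mean."""
    x = np.asarray(x, dtype=float)
    bumps = [np.exp(-(x - m)**2 / (2.0 * width**2)) for m in means]
    return np.vstack([np.ones_like(x)] + bumps).T

# Hyperparameters in action: polynomial degree vs. number/mean/width of the Gaussians
x = np.linspace(0.0, 1.0, 20)
Phi_poly = polynomial_design_matrix(x, degree=3)                               # shape (20, 4)
Phi_gauss = gaussian_design_matrix(x, means=np.linspace(0, 1, 5), width=0.3)   # shape (20, 6)
```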
Basis functions – comparison of local and global
Figure: a global basis function compared with a local basis function (spread parameter 0.3).
Chapter: Motivation • Chapter: Linear Models • Chapter: Loss functions • Chapter: Regularization & Validation • Chapter: Practical considerations • Chapter: Summary
Loss functions
• The loss function measures the accuracy of the model based on the training dataset
• The best model we can obtain is the minimum-loss model
• The choice of a loss function is fundamental in the regression problem
• Minimize the loss function for the training dataset, consisting of independent variables and target variables, by variation of the basis function weights.
Loss functions – Mean Squared Error (MSE or L2)
Pros: very important in practical applications; the solution can easily be obtained analytically.
Cons: not robust to outliers.
Examples: basic regression, energy optimization, control applications.
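For reference, the MSE loss written in the notation of the linear basis function model above (a standard form; the slides may use an additional scaling such as a factor 1/2):

E_{MSE}(w) = \frac{1}{N} \sum_{n=1}^{N} \left( t_n - y(x_n, w) \right)^2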
Loss functions – Mean Absolute Error (MAE or L1)
Pros: robust to outliers.
Cons: no analytical solution; non-differentiable at the origin.
Examples: financial applications.
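Analogously, a standard form of the MAE loss:

E_{MAE}(w) = \frac{1}{N} \sum_{n=1}^{N} \left| t_n - y(x_n, w) \right|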
Loss functions – Huber Loss
Pros: combines the strengths of the L1 and L2 loss functions; robust and differentiable.
Cons: more hyperparameters; no analytical solution.
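A common definition of the Huber loss applied to each residual r_n = t_n - y(x_n, w), with the threshold \delta as the additional hyperparameter (the symbol used on the slides may differ):

L_\delta(r) = \frac{1}{2} r^2 for |r| \le \delta, and L_\delta(r) = \delta \left( |r| - \frac{1}{2}\delta \right) otherwise.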
Loss functions – Comparison
• The L2 loss is differentiable
• The L1 loss is more intuitive
• The Huber loss combines theoretical strengths of both
Practical hints: start with the L2 loss whenever possible; think about physical insights and your intent!
Analytic Solution – Low-dimensional example
• Solve the optimization problem with the model
• Insert the model and the data points
• In general, optimal solutions are obtained at the points where the gradient is equal to zero.
Analytic Solution – Low-dimensional example: calculate the gradient and set it equal to zero.
Analytic Solution – Low-dimensional example: solve the resulting equation (also called the normal equation).
Analytic Solution – General form
• Minimizing the MSE loss function can be rewritten in matrix form
• The optimum value for the weights is obtained by setting the gradient to zero and solving for the weight vector
• The importance of this loss function is tightly related to the fact that the analytical solution is available and can be calculated explicitly for low- to medium-sized datasets!
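Concretely, a standard matrix form of this result (assuming \Phi^T \Phi is invertible; this is the textbook least-squares derivation rather than a transcription of the slide):

E(w) = \frac{1}{2} \lVert t - \Phi w \rVert^2, \qquad \nabla_w E = \Phi^T (\Phi w - t) = 0 \;\Rightarrow\; w^* = (\Phi^T \Phi)^{-1} \Phi^T t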
Sequential Analytic Solution – Motivation
Diagram: current best estimate → RLS update rule ← new data point.
• Consider the following cases: regression has to be applied during operation of the product, or there is not enough memory to store all data points
• A possible solution is given by Recursive Least Squares (RLS)
Sequential Analytic Solution – The algorithm
For each new data point:
• prediction based on the old parameters
• residual between the new data point and this prediction
• correction gain
• update of the parameters (old parameter estimate plus correction)
• update of the memory matrix, with the identity matrix of appropriate dimension appearing in the update
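One common formulation of these update equations (my reconstruction of the standard RLS recursion, not a transcription of the slide), with feature vector \phi_n = \phi(x_n) for the new data point (x_n, t_n):

prediction: \hat{y}_n = \phi_n^T w_{n-1}
residual: e_n = t_n - \hat{y}_n
correction gain: k_n = \frac{P_{n-1} \phi_n}{1 + \phi_n^T P_{n-1} \phi_n}
parameter update: w_n = w_{n-1} + k_n e_n
memory matrix update: P_n = (I - k_n \phi_n^T) P_{n-1}, with I the identity matrix of appropriate dimension.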
Sequential Analytic Solution – Forgetting factor
• Some applications show slowly varying conditions in the long term, but can be considered stationary on short to medium time periods
• Aging of products leads to slight parameter changes
• The vehicle mass is usually constant over a significant period of time
• The RLS algorithm can deal with this by introduction of a forgetting factor, which reduces the weight of old samples.
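A minimal NumPy sketch of RLS with a forgetting factor (0 < λ ≤ 1), following the standard formulation above; the class name, the initialization P = δ·I, and the example data are illustrative assumptions, not the lecture's code:

```python
import numpy as np

class RecursiveLeastSquares:
    """Sequential least squares with exponential forgetting (lambda_ = 1.0: no forgetting)."""

    def __init__(self, n_features, lambda_=0.99, delta=1e4):
        self.w = np.zeros(n_features)        # current parameter estimate
        self.P = delta * np.eye(n_features)  # memory matrix
        self.lambda_ = lambda_

    def update(self, phi, t):
        phi = np.asarray(phi, dtype=float)
        e = t - phi @ self.w                                           # residual of the old prediction
        k = self.P @ phi / (self.lambda_ + phi @ self.P @ phi)         # correction gain
        self.w = self.w + k * e                                        # parameter update
        self.P = (self.P - np.outer(k, phi) @ self.P) / self.lambda_   # memory matrix update
        return self.w

# Example: sequentially estimate w0 + w1 * x from incoming data points
rls = RecursiveLeastSquares(n_features=2, lambda_=0.98)
for x, t in [(0.0, 1.0), (1.0, 3.1), (2.0, 4.9), (3.0, 7.2)]:
    rls.update(np.array([1.0, x]), t)
print(rls.w)  # roughly [1, 2]
```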
Numerical Iterative Solutions
Figure: cost function over a parameter, with the optimum marked.
• Regression can also be solved numerically
• Important for large-scale problems and for non-quadratic loss functions
• Popular methods: gradient descent, Gauss-Newton, Levenberg-Marquardt
Pros: very generic.
Cons: knowledge about numeric optimization is necessary.
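To illustrate the simplest of these methods, a gradient-descent sketch for the quadratic MSE loss (step size, iteration count, and the example data are illustrative choices):

```python
import numpy as np

def gradient_descent_mse(Phi, t, lr=0.01, n_iter=5000):
    """Minimize 0.5 * ||t - Phi @ w||**2 by plain gradient descent."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad = Phi.T @ (Phi @ w - t)  # gradient of the quadratic loss
        w -= lr * grad
    return w

# Example: noisy samples of t = 2 + 3x, fitted with a [1, x] design matrix
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
t = 2.0 + 3.0 * x + 0.05 * rng.standard_normal(x.size)
Phi = np.vstack([np.ones_like(x), x]).T
print(gradient_descent_mse(Phi, t))  # close to [2, 3]
```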
Constraints on the weights
• Weights can be interpreted as physical quantities: temperature (non-negative), spring constants (non-negative), mass (non-negative)
• A valid range is known for the weights: tire and other friction models, efficiency (0 – 100 %)
• Improves robustness
• More difficult to solve
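A hedged sketch of how such bound constraints can be handled in practice with SciPy's bounded linear least squares solver (scipy.optimize.lsq_linear); the data and bounds are purely illustrative:

```python
import numpy as np
from scipy.optimize import lsq_linear

# Illustrative design matrix and targets
Phi = np.array([[1.0, 0.2],
                [1.0, 0.5],
                [1.0, 0.9]])
t = np.array([0.30, 0.55, 0.95])

# Both weights non-negative; the second one additionally bounded by 1 (e.g. an efficiency)
result = lsq_linear(Phi, t, bounds=([0.0, 0.0], [np.inf, 1.0]))
print(result.x)  # constrained least-squares weights
```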
How to solve the regression problem?
Decision flow over four questions: Is the cost function quadratic? Are there parameter constraints? Is the dataset very large? Is all data available instantaneously? Depending on the answers, one of three approaches is chosen: the Analytic Solution (quadratic cost, no constraints, moderate dataset size, all data available at once), the Sequential Analytic Solution (data arriving over time or too large to store), or a Numeric Iterative Solution (non-quadratic cost or parameter constraints).
Chapter: Motivation • Chapter: Linear Models • Chapter: Loss functions • Chapter: Regularization & Validation • Chapter: Practical considerations • Chapter: Summary
How to choose the model?
Figure: three fits of the same data – underfitted, well done, overfitted.
• Overfitted: too many features, irrelevant features
• Underfitted: not enough features, wrong structure
Overfitting – Choice of hyperparameters
(Figure source: Bishop – Pattern Recognition and Machine Learning)
• Overfitting is the failure to generalize properly between the data points
• The cost function decreases with increased model complexity
• Noise and irrelevant effects become too important
Overfitting – Curse of dimensionality
Figure: 16 samples in one-, two- and three-dimensional space.
• Overfitting occurs if data points are sparse and the model complexity is high
• Sparsity of data points is difficult to grasp
• Sparsity increases fast with increasing input dimension
Validation datasets
Diagram: the available data is split into training data and validation data; models A, B, and C are trained on the training data and evaluated on the validation data, and the best model is finally trained on the complete dataset.
• It is difficult to judge overfitting in high-dimensional domains and autonomous systems
• A standard technique is to separate the data into training and validation data
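A minimal sketch of this train/validation workflow (the candidate models, the split ratio, and the MSE metric are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

# Split the available data into training and validation parts (here 2/3 vs. 1/3)
n_train = int(2 / 3 * x.size)
x_tr, t_tr, x_va, t_va = x[:n_train], t[:n_train], x[n_train:], t[n_train:]

def fit_poly(x, t, degree):
    Phi = np.vstack([x**j for j in range(degree + 1)]).T
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

def mse(x, t, w):
    Phi = np.vstack([x**j for j in range(len(w))]).T
    return np.mean((t - Phi @ w) ** 2)

# Train candidate models A, B, C (different polynomial degrees) and evaluate on validation data
candidates = {degree: fit_poly(x_tr, t_tr, degree) for degree in (1, 3, 9)}
errors = {degree: mse(x_va, t_va, w) for degree, w in candidates.items()}
best_degree = min(errors, key=errors.get)

# Finally, train the best model structure on the complete dataset
w_best = fit_poly(x, t, best_degree)
```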
Validation datasets
(Figure source: Bishop – Pattern Recognition and Machine Learning; x-axis: increased model complexity)
• Different hyperparameters can be used to tune the model
• The validation technique works for all of them
Common pitfalls with validation datasets
• Be aware that your validation dataset must reflect the future properties of the underlying physical relationship.
• Do not reuse validation datasets. If the same validation set is used again and again for testing the model performance, it is implicitly incorporated into the modelling process and no longer gives the expected results!
• Splitting the data before fitting the model is therefore essential. Taking 2/3 of the data as training data is a good starting value. Visualize your data as much as possible!
k-Fold Cross-validation
Diagram: the training data is split into k folds; in each round a different fold serves as validation data.
• In the case of limited dataset sizes, one may not want to remove a substantial part of the data from the fitting process
• One can use smaller validation sets to estimate the true prediction error by splitting the data into multiple 'folds'
• The variance of the estimation error is an indicator for model stability
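A sketch of k-fold cross-validation using only NumPy (the fold count, the polynomial model, and the MSE metric are illustrative; libraries such as scikit-learn provide ready-made versions):

```python
import numpy as np

def k_fold_cv(x, t, degree, k=5, seed=0):
    """Mean and standard deviation of the validation MSE over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(x.size), k)
    errors = []
    for i in range(k):
        va = folds[i]                                                 # validation fold
        tr = np.concatenate([folds[j] for j in range(k) if j != i])   # remaining folds for training
        Phi_tr = np.vstack([x[tr]**j for j in range(degree + 1)]).T
        w, *_ = np.linalg.lstsq(Phi_tr, t[tr], rcond=None)
        Phi_va = np.vstack([x[va]**j for j in range(degree + 1)]).T
        errors.append(np.mean((t[va] - Phi_va @ w) ** 2))
    # The spread of the fold errors indicates model stability
    return np.mean(errors), np.std(errors)

# Example usage on the dataset from the previous sketch: k_fold_cv(x, t, degree=3)
```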
Regularization
• From a design point of view, we want to choose the model structure based on underlying physical principles and not on the characteristics of the dataset
• Polynomial basis functions tend to have large coefficients for sparse datasets
• Gaussian basis functions tend to overfit locally, which leads to single, large coefficients
• A technique to circumvent this is regularization
• Penalizing high coefficients in the optimization prevents these effects
• The weighting of the penalty term gives an intuitive hyperparameter to control model complexity
Typical Regularization – Ridge Regression
• A regularization term is added to the loss
• Other names: L2 regularization, Tikhonov regularization
• Prevents overfitting well
• An analytic solution is available as an extension of the MSE problem
• Difficult to apply and tune in high-dimensional feature spaces
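In the notation used above, a common form of the ridge objective and its closed-form solution, with regularization strength \lambda (whether the bias weight w_0 is also penalized is a modelling choice):

E_{ridge}(w) = \frac{1}{2} \lVert t - \Phi w \rVert^2 + \frac{\lambda}{2} \lVert w \rVert_2^2, \qquad w^* = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T t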
Typical Regularization – Lasso Regression
• A regularization term is added to the loss
• Other names: L1 regularization
• Tends to produce sparse solutions and can therefore be applied for feature selection
• A sparse solution means that several coefficients go to zero
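The corresponding standard form of the lasso objective (unlike ridge, it has no closed-form solution and is typically minimized iteratively, e.g. by coordinate descent):

E_{lasso}(w) = \frac{1}{2} \lVert t - \Phi w \rVert^2 + \lambda \lVert w \rVert_1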