Artificial Intelligence Lecture 2 Dr. Bo Yuan, Professor Department of Computer Science and Engineering Shanghai Jiaotong University boyuan@sjtu.edu.cn
Review of Lecture One
• Overview of AI
  • Knowledge-based rules in logics (expert systems, automata, …): Symbolism in logics
  • Kernel-based heuristics (neural networks, SVM, …): Connectionism for nonlinearity
  • Learning and inference (Bayesian, Markovian, …): Sparse sampling for convergence
  • Interactive and stochastic computing (uncertainty, heterogeneity): To overcome the limits of the Turing Machine
• Course Content
  • Focus mainly on learning and inference
  • Discuss current problems and research efforts
  • Perception and behavior (vision, robotics, NLP, bionics, …) not included
• Exam
  • Papers (Nature, Science, Nature Reviews, Reviews of Modern Physics, PNAS, TICS)
  • Course materials
Today’s Content • Overview of machine learning • Linear regression • Gradient descent • Least-squares fit • Stochastic gradient descent • The normal equation • Applications
Basic Terminologies
• x = Input variables/features
• y = Output variables/target variables
• (x, y) = Training example; the ith training example = (x(i), y(i))
• m = Number of training examples (i = 1, …, m)
• n = Number of input variables/features (j = 0, …, n)
• h(x) = Hypothesis/function/model that outputs the predicted value for a given input x
• θ = Parameters/weights, which parameterize the mapping from x to its predicted value, thus hθ(x) = θ0 + θ1x1 + … + θnxn
• We define x0 = 1 (the intercept), thus able to use a vector representation: hθ(x) = Σj=0..n θjxj = θ^T x
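To make the vector representation concrete, here is a minimal NumPy sketch that prepends the intercept feature x0 = 1 to each example and evaluates hθ(x) = θ^T x; the data and parameter values are illustrative assumptions, not from the lecture.

```python
import numpy as np

# Hypothetical design matrix: m = 3 examples, n = 2 features each
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0]])

# Prepend the intercept feature x0 = 1 to every example
X = np.hstack([np.ones((X.shape[0], 1)), X])

# Hypothetical parameter vector [theta_0, theta_1, theta_2]
theta = np.array([50.0, 0.1, 20.0])

# Hypothesis h_theta(x) = theta^T x, evaluated for all m examples at once
predictions = X @ theta
print(predictions)
```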
Gradient Descent
Using a matrix to represent the training samples, the Cost Function with respect to θ is defined as:
J(θ) = (1/2) Σi=1..m (hθ(x(i)) - y(i))²
Gradient descent is based on the partial derivatives of J with respect to θ:
∂J(θ)/∂θj = Σi=1..m (hθ(x(i)) - y(i)) xj(i)
The (batch) algorithm is therefore:
Loop {
  θj := θj - α Σi=1..m (hθ(x(i)) - y(i)) xj(i)   (for every j)
}
where α is the learning rate. There is an alternative way to iterate, called stochastic gradient descent, which updates θ using one training example at a time:
For i = 1 to m {
  θj := θj - α (hθ(x(i)) - y(i)) xj(i)   (for every j)
}
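A minimal NumPy sketch of both update rules; the function names, learning rate, and iteration counts are illustrative assumptions, not the lecture's code.

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=1e-8, iters=1000):
    """Batch LMS: theta_j := theta_j - alpha * sum_i (h(x(i)) - y(i)) * xj(i)."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        errors = X @ theta - y            # h_theta(x(i)) - y(i) for all i
        theta -= alpha * (X.T @ errors)   # full gradient of J(theta)
    return theta

def stochastic_gradient_descent(X, y, alpha=1e-8, epochs=10):
    """Stochastic variant: update theta after each single training example."""
    theta = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in range(X.shape[0]):
            error = X[i] @ theta - y[i]
            theta -= alpha * error * X[i]
    return theta
```

Both assume X already contains the intercept column x0 = 1; the stochastic version makes progress after every example instead of after a full pass, which is why it scales better to large m.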
Normal Equation An explicit, closed-form way to obtain θ directly, without iterating
The Optimization Problem by the Normal Equation
Writing the training inputs as the design matrix X (one example per row) and the targets as the vector y, we set the derivatives of J(θ) to zero and obtain the Normal Equations:
X^T X θ = X^T y
which gives the closed-form solution θ = (X^T X)^(-1) X^T y.
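A one-function NumPy sketch of the closed-form solution; using solve instead of an explicit matrix inverse is a standard numerical choice, not something stated on the slide.

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: theta = (X^T X)^(-1) X^T y."""
    # solve() avoids forming the explicit inverse, which is better conditioned
    return np.linalg.solve(X.T @ X, X.T @ y)
```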
Today’s Content • Linear Regression • Locally Weighted Regression (an adaptive method) • Probabilistic Interpretation • Maximum Likelihood Estimation vs. Least Squares (Gaussian Distribution) • Classification by Logistic Regression • LMS updating • A Perceptron-based Learning Algorithm
Linear Regression
• Number of Features
  • Over-fitting and under-fitting issues
  • Feature selection problem (to be covered later)
  • Adaptive issue
• Some definitions:
  • Parametric Learning (fixed set of θ, with n being constant)
  • Non-parametric Learning (number of θ grows linearly with m)
• Locally Weighted Regression (Loess/Lowess Regression): non-parametric
  • A bell-shaped weighting, e.g. w(i) = exp(-(x(i) - x)²/(2τ²)) (not a Gaussian density)
  • Every prediction requires fitting on the entire training data set for the given query input (high computational complexity); see the sketch below
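A minimal NumPy sketch of locally weighted regression, assuming the bell-shaped weight above with bandwidth τ; the function name and interface are illustrative, not the lecture's code.

```python
import numpy as np

def locally_weighted_regression(X, y, x_query, tau=1.0):
    """Predict y at x_query by solving a weighted least-squares problem.

    Weights: w_i = exp(-||x_i - x_query||^2 / (2 * tau^2)), so nearby training
    points dominate the fit. Assumes the rows of X and x_query both include
    the intercept feature x0 = 1.
    """
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(w)
    # Weighted normal equations: theta = (X^T W X)^(-1) X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta
```

Note that the whole training set is revisited for every query point, which is exactly the computational cost the slide warns about.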
Extension of Linear Regression
• Linear Additive (straight line): x1 = 1, x2 = x
• Polynomial: x1 = 1, x2 = x, …, xn = x^(n-1)
• Chebyshev Orthogonal Polynomial: x1 = 1, x2 = x, …, xn = 2x·xn-1 - xn-2
• Fourier Trigonometric Polynomial: x1 = 0.5, followed by sin and cos terms at different frequencies of x
• Pairwise Interaction: linear terms + xk1·xk2 (k1, k2 = 1, …, n)
• …
The central problem underlying these representations is whether or not the optimization process for θ remains convex (see the sketch below).
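Each of these is still linear in the parameters θ; only the feature map changes, so the least-squares problem stays convex. A small self-contained NumPy sketch of the polynomial feature map, fitted by ordinary least squares; the toy data and helper name are illustrative assumptions.

```python
import numpy as np

def polynomial_features(x, degree):
    """Map a scalar input x to the feature vector [1, x, x^2, ..., x^degree]."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.hstack([x ** d for d in range(degree + 1)])

# Hypothetical 1-D data: fit a cubic, but theta is still obtained by the
# same convex least-squares problem (normal equations) as before.
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x)
X = polynomial_features(x, degree=3)
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)
```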
Probabilistic Interpretation
• Why Ordinary Least Squares (OLS)? Why not other power terms?
• Assume y(i) = θ^T x(i) + ε(i), where ε(i) = random noise, ε(i) ~ N(0, σ²)
• The PDF of the Gaussian is p(ε(i)) = (1/(√(2π)σ)) exp(-(ε(i))²/(2σ²))
• This implies that p(y(i) | x(i); θ) = (1/(√(2π)σ)) exp(-(y(i) - θ^T x(i))²/(2σ²))
• Or, y(i) | x(i); θ ~ N(θ^T x(i), σ²)
• Why a Gaussian for the random noise? The central limit theorem.
Maximum Likelihood (updated)
• Consider the training data as stochastic
• Assume the ε(i) (and hence the y(i) given x(i)) are i.i.d. (independently and identically distributed)
• Likelihood L(θ) = the probability of the y's given the x's, parameterized by θ: L(θ) = Πi=1..m p(y(i) | x(i); θ)
• What is Maximum Likelihood Estimation (MLE)?
  • Choose the parameters θ that maximize L(θ), so as to make the training data set as probable as possible
  • Likelihood L(θ) is a function of the parameters; probability is of the data.
The Equivalence of MLE and OLS
Maximizing the likelihood L(θ) under the Gaussian noise assumption is the same as minimizing the least-squares cost: the log-likelihood reduces, up to constants, to -J(θ)!
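Written out, the standard derivation under the Gaussian noise model from the previous slides:

```latex
\begin{aligned}
\ell(\theta) = \log L(\theta)
 &= \sum_{i=1}^{m} \log\!\left[ \frac{1}{\sqrt{2\pi}\,\sigma}
      \exp\!\left(-\frac{\bigl(y^{(i)} - \theta^{T}x^{(i)}\bigr)^{2}}{2\sigma^{2}}\right) \right] \\
 &= m \log\frac{1}{\sqrt{2\pi}\,\sigma}
    \;-\; \frac{1}{\sigma^{2}} \cdot
      \underbrace{\frac{1}{2}\sum_{i=1}^{m}\bigl(y^{(i)} - \theta^{T}x^{(i)}\bigr)^{2}}_{J(\theta)}
\end{aligned}
```

So maximizing ℓ(θ) is exactly minimizing J(θ); the Gaussian noise assumption makes MLE and ordinary least squares coincide.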
Sigmoid (Logistic) Function
g(z) = 1 / (1 + e^(-z)), giving the hypothesis hθ(x) = g(θ^T x) = 1 / (1 + e^(-θ^T x)).
Other functions that smoothly increase from 0 to 1 could also be used, but for a couple of good reasons (which we will see next time with Generalized Linear Models) the choice of the logistic function is a natural one.
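A two-function NumPy sketch of the logistic hypothesis; the names are illustrative, not the lecture's code.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + exp(-z)); maps the reals to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def h(theta, x):
    """Logistic-regression hypothesis h_theta(x) = g(theta^T x)."""
    return sigmoid(x @ theta)
```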
Recall the LMS update, now with a positive sign rather than a negative one, since here we maximize the log-likelihood (gradient ascent): θj := θj + α ∂ℓ(θ)/∂θj. Let's work with just one training example (x, y) and derive the gradient ascent rule, written out below.
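Starting from the per-example log-likelihood ℓ(θ) = y log hθ(x) + (1 - y) log(1 - hθ(x)) and the identity g′(z) = g(z)(1 - g(z)), the standard derivation gives:

```latex
\begin{aligned}
\frac{\partial}{\partial \theta_j}\,\ell(\theta)
 &= \left(\frac{y}{g(\theta^{T}x)} - \frac{1-y}{1-g(\theta^{T}x)}\right)
    \frac{\partial}{\partial \theta_j}\, g(\theta^{T}x) \\
 &= \left(\frac{y}{g(\theta^{T}x)} - \frac{1-y}{1-g(\theta^{T}x)}\right)
    g(\theta^{T}x)\bigl(1-g(\theta^{T}x)\bigr)\, x_j \\
 &= \bigl(y - h_\theta(x)\bigr)\, x_j
\end{aligned}
```

So the ascent update is θj := θj + α (y - hθ(x)) xj, the same LMS form as in linear regression, except that hθ is now the logistic hypothesis.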