CV192: Introduction to Deep Learning
Oren Freifeld, Ron Shapira Weber
Computer Science, Ben-Gurion University
Contents • Introduction – What is Deep Learning? • Linear / Binary Perceptron • Multi-Layer Perceptron [Figure from previous slide taken from https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html]
What is Deep Learning? From perceptron to deep neural networks
Example – Object recognition and localization [Andrej Karpathy, Li Fei-Fei (2015): Deep Visual-Semantic Alignments for Generating Image Descriptions]
Some history – ImageNet challenge • 1.2 million images in the training set, each labeled with one of 1000 categories • Image classification problem https://cs.stanford.edu/people/karpathy/cnnembed/
Some history – ImageNet challenge • One of the Top-5 guesses needs to be the correct one. https://blog.acolyer.org/2016/04/20/imagenet-classification-with-deep-convolutional-neural-networks/
Increasing Depth on ImageNet challenge Trend of increasing depth (Img Credit: Kaiming He)
ImageNet architecture comparison • Number of operations for a single forward pass vs. top-1 accuracy [Canziani et al., (2016). An analysis of deep neural network models for practical applications.]
Supervised Learning • Data: • X – dataset: images, videos, text, etc… • y – labels (cat, dog, platypus) • Image classification example: a classifier (SVM, LDA, deep neural network, etc…) maps an image to a probability distribution over classes. *We’ll also see variants of deep learning algorithms where the learning isn’t supervised.
Supervised Learning • An example of a supervised learning algorithm we saw in this course? • Least-Squares Estimation in a Linear Model: • A known function, $h: \mathbb{R}^d \to \mathbb{R}^k$ • Data: $N$ pairs $(x_i, y_i)$ where $x_i \in \mathbb{R}^d$, $y_i \in \mathbb{R}$ • Define $H = [h(x_1)^T; \dots; h(x_N)^T]$ ($H$ is an $N \times k$ matrix). • Goal: find the optimal (in the least-squares sense) parameter $\theta$ assuming the model $y \approx H\theta$. In other words: $\theta^* = \arg\min_\theta \|y - H\theta\|_2^2$ • Note that in this framework we try to predict the label $y$ of the input $x$.
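A minimal numpy sketch of this closed-form least-squares fit; the feature map $h(x) = [1, x, x^2]$ and the synthetic data below are made up for illustration:

```python
import numpy as np

# Toy data: y is (roughly) a quadratic function of a scalar x
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.1 * rng.standard_normal(100)

# Design matrix H: row i is h(x_i) = [1, x_i, x_i^2]
H = np.stack([np.ones_like(x), x, x**2], axis=1)   # shape (N, k)

# Least-squares solution: theta* = argmin_theta ||y - H theta||^2
theta, *_ = np.linalg.lstsq(H, y, rcond=None)
print(theta)  # approximately [1, 2, -3]
```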
Unsupervised Learning • Solve some task given “unlabeled” data. • An example of an unsupervised learning algorithm we saw in this course?
Supervised Learning Framework: • Provide data and labels – $(X, y)$ • Split data into: • Training data: majority of the data (for instance, 60%), used to train the model. • Validation set: a partition of the data (20%) used for tuning the hyper-parameters. • Test data: a partition of the data (20%) used to test the accuracy of the model. • Define an algorithm (a model) $f(x; \theta)$ • Define a loss function $L(f(x; \theta), y)$: • In the case of linear regression, the L2 norm: $L = \|f(x; \theta) - y\|_2^2$ • Define an optimization method to find $\theta^*$ such that: $\theta^* = \arg\min_\theta \sum_i L(f(x_i; \theta), y_i)$
Example: Deep Learning for Image Classification • Provide data and labels – $(X, y)$ • Split data into: • Training data • Validation set • Test data • Define an algorithm: Artificial Neural Network, Convolutional NN, etc… • Define a loss function: • L2 norm • Cross-Entropy • Define an optimization method to find $\theta^*$ such that: $\theta^* = \arg\min_\theta \sum_i L(f(x_i; \theta), y_i)$ • Usually there is no closed-form solution, so we use iterative gradient-based methods.
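A minimal sketch of such an iterative gradient-based update (plain gradient descent on a generic differentiable loss); the toy objective, learning rate, and step count below are placeholders, not part of the original slides:

```python
import numpy as np

def gradient_descent(grad_loss, theta0, lr=0.1, num_steps=1000):
    """Generic gradient descent: theta <- theta - lr * dL/dtheta."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - lr * grad_loss(theta)
    return theta

# Example: minimize L(theta) = ||theta - [3, -1]||^2, whose gradient is 2 * (theta - [3, -1])
target = np.array([3.0, -1.0])
theta_star = gradient_descent(lambda th: 2 * (th - target), theta0=np.zeros(2))
print(theta_star)  # approximately [3, -1]
```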
When working with images • Represent images as vectors: an image $I \in \mathbb{R}^{H \times W \times C}$. Flatten the image so that $x \in \mathbb{R}^{H \cdot W \cdot C}$ (one column vector per image).
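A quick numpy illustration of this flattening; the image size here is arbitrary:

```python
import numpy as np

image = np.random.rand(32, 32, 3)   # a 32x32 RGB image, shape (H, W, C)
x = image.reshape(-1)               # flatten to a vector of length H*W*C
print(image.shape, "->", x.shape)   # (32, 32, 3) -> (3072,)
```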
Perceptron [diagram of a single perceptron: inputs, weights, weighted sum, activation, output]
Some History • The perceptron algorithm was invented in 1957 at the Cornell Aeronautical Laboratory by Frank Rosenblatt. • In 1969, a famous book entitled “Perceptrons” by Marvin Minsky and Seymour Papert showed that it was impossible for perceptrons to learn an XOR function without adding a hidden layer. • Adding hidden layers extends the original perceptron of the 1950s, hence the term Multilayer Perceptron. https://en.wikipedia.org/wiki/Perceptron
Linear Perceptron • Try to predict $y$ by $\hat{y} = Xw$ • This is a linear least-squares problem: • Find: $w^* = \arg\min_w \|Xw - y\|_2^2$ • Therefore there is a closed-form solution: $w^* = (X^T X)^{-1} X^T y$ • Where $X$ is the entire dataset (each row is a sample).
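A minimal numpy sketch of this closed-form solution on made-up data; in practice `np.linalg.lstsq` is preferable to forming $(X^T X)^{-1}$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 5))            # 200 samples, 5 features (each row is a sample)
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.05 * rng.standard_normal(200)

# Closed-form (normal equations): w* = (X^T X)^{-1} X^T y
w_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically safer equivalent
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_closed, w_lstsq))        # True
```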
Binary Perceptron • The binary perceptron acts as a binary classifier: $\hat{y} = \sigma(w^T x)$, where $\sigma(z) = \frac{1}{1 + e^{-z}}$ is the sigmoid function, and $\hat{y} \in (0, 1)$ is interpreted as the probability of the positive class.
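A small numpy sketch of this forward pass; the weights and input below are arbitrary illustrative values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([2.0, -1.0, 0.5])   # arbitrary weights, for illustration only
x = np.array([1.0, 3.0, 2.0])    # a single input vector
y_hat = sigmoid(w @ x)           # probability of the positive class
label = int(y_hat > 0.5)         # threshold at 0.5 to get a hard decision
print(y_hat, label)
```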
(Softmax) Binary Perceptron - Multiple Outputs • A generalization of the sigmoid function, called softmax: $\mathrm{softmax}(z)_j = \frac{e^{z_j}}{\sum_k e^{z_k}}$ [diagram: a layer with multiple output units, each computing a weighted sum of the inputs, followed by the softmax]
Multiclass Binary Perceptron • The softmax outputs form a probability distribution over the classes. [diagram]
Multiclass Binary Perceptron • Correct class distribution (the one-hot target). [diagram]
Need to calculate loss: how different is ‘our’ probability distribution over the possible classes from the correct one? • Cross-entropy (not to be confused with the joint entropy of two random variables): $H(p, q) = -\sum_j p_j \log q_j$ • Since our target distribution $p$ is “one-hot encoded”, this is equivalent to minimizing the KL divergence between the two distributions. • In other words, the cross-entropy objective ‘wants’ the predicted distribution to have all of its mass on the correct answer. • When using the SoftMax activation function with the cross-entropy loss function we get: $L_i = -\log\left(\frac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$ • Note: when implementing, use the log-sum-exp trick. http://cs231n.github.io/linear-classify/
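A minimal numpy sketch of this loss, computed stably with the log-sum-exp (max-subtraction) trick; the scores and label below are made up:

```python
import numpy as np

def cross_entropy_from_scores(scores, correct_class):
    """Softmax + cross-entropy in one step, stabilized via the log-sum-exp trick."""
    shifted = scores - np.max(scores)                    # subtracting the max keeps exp() from overflowing
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[correct_class]                     # -log q_{correct class}

scores = np.array([2.0, 5.0, -1.0])                      # raw class scores f_j (arbitrary)
print(cross_entropy_from_scores(scores, correct_class=1))
```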
The XOR (“exclusive OR”) problem • Given 4 points in $\mathbb{R}^2$, the corners of the unit square, return: $f(0,0) = 0$, $f(0,1) = 1$, $f(1,0) = 1$, $f(1,1) = 0$ • Can we solve the problem with a linear/binary perceptron (with a single output)? • Is it linearly separable?
The XOR problem [Figure from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning]
The XOR problem • A single-layer perceptron is a linear combination of its inputs. • The classification of the input is given by a line which separates between the classes of the input. • If we look at the equations for an output $w_1 x_1 + w_2 x_2 + b$ on the four points: $b = 0$, $w_2 + b = 1$, $w_1 + b = 1$, $w_1 + w_2 + b = 0$ • There is no solution to this linear system (the first three equations force $w_1 = w_2 = 1$ and $b = 0$, which contradicts the fourth).
The XOR problem • We can also try to treat this problem as a least-squares problem: • Loss function: $J(w, b) = \frac{1}{4}\sum_x \left(f^*(x) - f(x; w, b)\right)^2$ • Model: $f(x; w, b) = x^T w + b$ • (Exercise) solving the normal equations we get $w = 0$ and $b = \frac{1}{2}$, i.e., the model outputs $0.5$ for every point. [Example from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning]
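A quick numpy check of this result, solving the least-squares problem for the affine model on the four XOR points:

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Affine model f(x) = x^T w + b: append a column of ones for the bias
X_aug = np.hstack([X, np.ones((4, 1))])
params, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = params[:2], params[2]
print(w, b)            # w is approximately [0, 0], b is approximately 0.5
print(X_aug @ params)  # the model outputs 0.5 for every point
```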
The XOR problem • Adding a hidden layer can help solve the XOR problem. • We will add a vector of hidden units $h = f^{(1)}(x; W, c)$. • The values of these hidden units are then used as input for the second/output layer: $y = f^{(2)}(h; w, b)$. • Our model is now: $f(x; W, c, w, b) = f^{(2)}(f^{(1)}(x))$ [Example from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning]
The XOR problem [Example from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning]
The XOR problem • What should be our choice of $f^{(1)}$? • $f^{(1)}$ can’t be linear, otherwise (ignoring biases): $f^{(1)}(x) = W^T x$ and $f^{(2)}(h) = h^T w$. Then: $f(x) = w^T W^T x = x^T w'$ where $w' = W w$, so the whole model is still linear in $x$. • We must use a non-linear function for $f^{(1)}$. [Example from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning]
The XOR problem • We use the activation $g(z) = \max\{0, z\}$, which is known as the Rectified Linear Unit (ReLU). [Figure from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning]
The XOR problem • Our new model: $f(x; W, c, w, b) = w^T \max\{0, W^T x + c\} + b$ • You can find a complete walkthrough of the problem at: http://www.deeplearningbook.org/, chapter 6.1 [Figure from: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning]
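A small numpy sketch that plugs in the solution given in chapter 6.1 of the Deep Learning book and checks that this one-hidden-layer ReLU model reproduces XOR exactly:

```python
import numpy as np

# Weights of the XOR solution from Goodfellow et al., chapter 6.1
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

h = np.maximum(0.0, X @ W + c)   # hidden layer: ReLU(W^T x + c) for every input row
y = h @ w + b                    # output layer: w^T h + b
print(y)                         # [0. 1. 1. 0.], exactly the XOR labels
```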
No hidden layers http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
MLP with one hidden layer http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
MLP with one hidden layer http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
MLP with one hidden layer [Lecun, Y., Bengio, Y., & Hinton, G. (2015)]
How big should our hidden layer be? https://cs.stanford.edu/people/karpathy/convnetjs/demo/classify2d.html
Summary • Deep learning is a class of (mostly) supervised learning algorithms. • The linear / binary perceptron acts as a linear classifier. • Hidden layers (followed by a non-linear activation function) allow for a non-linear transformation of the input so that it becomes linearly separable. • The number of neurons and connections in each layer determines our model’s capacity.