This study examines the intrinsic dimension (ID) of data representations in deep neural networks such as AlexNet, VGG, VGG-bn, and ResNet: the minimal number of coordinates needed to describe the data points without significant information loss. By analyzing representations across layers, including the input, output, pooling, and fully connected layers, the work tracks how the ID varies with depth and how it relates to network performance. The analysis combines a linear method, Principal Component Analysis (PCA), with a non-linear estimator, TwoNN, to probe the structure and complexity of the data manifolds and their relation to classification accuracy.
Intrinsic dimension of data representations in deep neural networks
Alessio Ansuini (SISSA), Alessandro Laio (SISSA), Jakob Macke (TUM), Davide Zoccolan (SISSA)
Neuroscience, physics, machine learning
Deep neural nets for image classification: training with a very large set of labelled images (> 1 million). [Adapted from Yamins & DiCarlo et al. (2016)]
Deep neural nets for image classification: after training, the network is tested on held-out images, mapping input images to output category predictions. [Adapted from LeCun et al. (2015) and Yamins & DiCarlo et al. (2016)]
Machine vs. human vision: what do artificial and human visual processing have in common?
The ventral stream: an object-processing path
retina → thalamus → primary and secondary visual cortex (V1 & V2) → area V4 → inferotemporal cortex (IT)
[Adapted from Yamins & DiCarlo et al. (2016)]
The ventral stream: an object-processing path
Along the stream, each object's representation is described as an object manifold: the "Sam" and "Joe" manifolds are traced from retinal space, through V1 space, to IT space.
[Adapted from Yamins & DiCarlo et al. (2016) and DiCarlo & Cox (2007)]
The untangling hypothesis
Along the ventral stream, object manifolds are progressively untangled: the "Sam" and "Joe" manifolds, tangled in retinal space, become separable in IT space.
[Adapted from Yamins & DiCarlo et al. (2016) and DiCarlo & Cox (2007)]
The untangling hypothesis
Applied to deep nets: between pixel space and the last hidden layer space, object manifolds are expected to become:
• Less tangled ✓
• Flattened ?
• Lower dimensional ?
[Adapted from DiCarlo & Cox (2007)]
The goal: to track the intrinsic dimension (ID) of image representations in deep nets
• We tested various state-of-the-art deep nets: AlexNet, VGG, VGG-bn & ResNet
• We estimated the ID in a subset of layers: input & output layers, pooling layers & fully connected layers (a sketch of how such per-layer representations can be collected follows below)
• Empirical questions:
• How does the ID vary across the layers of a deep net?
• How linear (i.e., flat) are the data manifolds?
• Is there any relationship between the ID in the last hidden layer and the classification performance of the network?
[VGG-16: Simonyan & Zisserman (2015); source: https://blog.heuritech.com]
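As a concrete illustration of how per-layer representations can be extracted from a pre-trained network, here is a minimal sketch using PyTorch forward hooks on torchvision's VGG-16; the choice of hooked modules, the dummy input batch, and all names are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch (illustrative, not the authors' pipeline): collecting per-layer
# activations from a pre-trained VGG-16 with PyTorch forward hooks.
import torch
import torchvision.models as models

# pretrained=True is the classic API; newer torchvision versions use weights=...
model = models.vgg16(pretrained=True).eval()

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        # One row per image: flatten each sample's activation into a vector.
        activations[name] = output.detach().flatten(start_dim=1)
    return hook

# Hook, for example, every max-pooling layer and every fully connected layer.
for idx, layer in enumerate(model.features):
    if isinstance(layer, torch.nn.MaxPool2d):
        layer.register_forward_hook(save_activation(f"pool_{idx}"))
for idx, layer in enumerate(model.classifier):
    if isinstance(layer, torch.nn.Linear):
        layer.register_forward_hook(save_activation(f"fc_{idx}"))

with torch.no_grad():
    images = torch.randn(8, 3, 224, 224)   # placeholder batch of preprocessed images
    model(images)

# Each entry of `activations` is an (n_images, embedding_dim) matrix, i.e. the
# kind of data representation whose intrinsic dimension (ID) is estimated below.
```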
Estimation of the intrinsic dimension
Intrinsic dimension of a data representation: the minimal number of coordinates that are necessary to describe its points without significant information loss.
• The linear case: Principal Component Analysis (PCA). In the illustration, the data points (P1, P2, ...) live in a 2D embedding space (activations x1, x2) but lie on a 1D linear subspace, so their ID is 1.
Estimation of the intrinsic dimension
• The general (non-linear) case: TwoNN (Facco et al., 2017)
1) For each data point i, compute the distances to its first and second nearest neighbours, r_i,1 and r_i,2.
2) For each i, compute μ_i = r_i,2 / r_i,1. The probability distribution of μ is f(μ) = d μ^(−d−1), where d is the ID, independently of the local density of points.
3) Infer d from the empirical probability distribution of all the μ_i (a numerical sketch follows below).
[Illustration: points on a 1D manifold embedded in a 2D space of activations (x1, x2).]
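A minimal numerical sketch of the TwoNN estimator is given below; it uses the closed-form maximum-likelihood estimate d = N / Σ_i log μ_i implied by the Pareto form of the distribution, whereas Facco et al. (2017) fit the empirical cumulative distribution (typically after discarding the largest μ values). Function and variable names are illustrative.

```python
# Minimal sketch of the TwoNN intrinsic-dimension estimator (Facco et al., 2017).
# Uses the maximum-likelihood estimate d = N / sum_i log(mu_i); the original
# method instead fits the empirical CDF of mu, usually discarding the top ~10%.
import numpy as np
from scipy.spatial import cKDTree

def twonn_id(X):
    """X: (n_points, embedding_dim) matrix of activations; returns the estimated ID."""
    tree = cKDTree(X)
    # k=3: each query returns the point itself plus its first and second neighbours.
    dists, _ = tree.query(X, k=3)
    r1, r2 = dists[:, 1], dists[:, 2]
    valid = r1 > 0                      # guard against duplicated points
    mu = r2[valid] / r1[valid]
    return valid.sum() / np.sum(np.log(mu))

# Example: points on a circle (a 1D manifold embedded in 2D) give an ID close to 1,
# even though the manifold occupies both embedding coordinates.
t = np.random.rand(2000) * 2 * np.pi
X = np.column_stack([np.cos(t), np.sin(t)])
print(twonn_id(X))
```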
Results 1: Evolution of ID across the layers of a deep net
VGG-16, pre-trained on ImageNet. [Architecture: Simonyan & Zisserman (2015); source: https://blog.heuritech.com]
Results 1: Evolution of ID across the layers of a deep net
• The ID evolution across layers has a hunchback shape
• In each layer: ID << ED (the embedding dimension)
• In the last hidden layers:
• the ID decreases monotonically
• the ID reaches very small values (~10 or lower)
Results 2: Evolution of ID in state-of-the-art deep nets
• Four families of state-of-the-art deep net architectures: AlexNet, VGG, VGG-bn, ResNet
• Pre-trained on ImageNet
• ID computed for the 7 most populated categories (500 images each)
Results 2: Evolution of ID in state-of-the-art deep nets
• The ID evolution across layers has a hunchback shape
• In each layer: ID << ED (the embedding dimension)
• Considerable overlap of the ID profiles as a function of relative layer depth: after an initial growth of the ID, deep nets perform a progressive dimensionality reduction of the object manifolds
• Any relationship with classification accuracy?
Results 3: Performance vs. ID in the last hidden layer
• Four families of state-of-the-art deep net architectures: AlexNet, VGG, VGG-bn, ResNet
• Pre-trained on ImageNet
• ID computed over a random mixture of 2000 training images
• Classification performance measured as the top-5 error
Results 3: Performance vs. ID in the last hidden layer
• Correlation between ID and performance: r = 0.94 (training set), r = 0.99 (test set)
• The ID on the training set is a strong predictor of performance on the test set
• ID = a proxy for the generalization ability of a deep network
The untangling hypothesis, revisited in the light of the ID results: between pixel space and the last hidden layer space, object manifolds become:
• Less tangled ✓
• Flattened ?
• Lower dimensional ✓
[Adapted from DiCarlo & Cox (2007)]
Results 4: Linear vs. non-linear ID estimates
• If a 1D manifold is a 1D linear subspace of the 2D embedding space, ID-PCA ≈ ID-TwoNN
• If the 1D manifold is curved, ID-PCA >> ID-TwoNN
Results 4: Linear vs. non-linear ID estimates
• No gap in the eigenvalue spectrum yielded by PCA → the data manifolds are not linear
• Yet we define ID-PCA = the number of PCs that account for 90% of the variance in the data (see the sketch below)
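A minimal sketch of this ID-PCA criterion, assuming the activations of a layer are arranged as an (n_points, embedding_dim) matrix; the 90% threshold is from the slide, everything else (names, example data) is illustrative.

```python
# Minimal sketch of the ID-PCA criterion used on the slide: the number of
# principal components needed to explain 90% of the variance in the data.
import numpy as np

def pca_id(X, variance_threshold=0.90):
    """X: (n_points, embedding_dim) activation matrix; returns ID-PCA."""
    Xc = X - X.mean(axis=0)                       # center the data
    # Singular values of the centered data give the PCA eigenvalue spectrum.
    s = np.linalg.svd(Xc, compute_uv=False)
    explained = (s ** 2) / np.sum(s ** 2)
    cumulative = np.cumsum(explained)
    return int(np.searchsorted(cumulative, variance_threshold) + 1)

# On a curved 1D manifold (a circle in 2D), ID-PCA overestimates the
# dimensionality, whereas TwoNN would report a value close to 1.
t = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
X = np.column_stack([np.cos(t), np.sin(t)])
print(pca_id(X))   # 2: both principal components are needed
```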
Results 4: Linear vs. non-linear ID estimates
• No gap in the eigenvalue spectrum yielded by PCA
• ID-PCA >> ID-TwoNN
• ID-PCA is not much different for trained and randomly initialized networks
• ID-TwoNN is flat for randomly initialized networks (orthogonal transformations)
→ the data manifolds are not linear
The untangling hypothesis, concluded: between pixel space and the last hidden layer space, object manifolds become:
• Less tangled ✓
• Flattened ✗ (the manifolds remain non-linear, i.e. they are not flattened)
• Lower dimensional ✓
[Adapted from DiCarlo & Cox (2007)]
Results 5: Looking into the origin of the initial ID expansion
VGG-16. [Source: https://blog.heuritech.com]
Results 5: Looking into the origin of the initial ID expansion, studied on MNIST.
Results 5: Looking into the origin of the initial ID expansion
• In a trained network, the initial ID expansion reflects the pruning of low-level, highly correlated visual features that carry no information about the correct labeling
Summary
• Hunchback shape of ID vs. layer depth
• Correlation of ID with performance
• Low ID, but non-linear manifolds
• Initial ID expansion = pruning of low-level information
Summary
• FIRST LAYERS: remove irrelevant features; the ID grows
• LAST LAYERS: dimensionality reduction; the ID shrinks along the layers, and the more gradual the reduction, the better (deep networks win)
Acknowledgments
Alessio Ansuini (SISSA), Jakob Macke (TUM), Davide Zoccolan (SISSA)
NeurIPS 2019, Vancouver