This study examines the intrinsic dimension (ID) of data representations in deep neural networks such as AlexNet, VGG, VGG-bn, and ResNet: the minimal number of coordinates needed to describe the data points without significant information loss. By analyzing representations across layers, including the input, output, pooling, and fully connected layers, the work tracks how the ID varies with depth and how it relates to network performance. The analysis combines a linear method, Principal Component Analysis (PCA), with a non-linear estimator, TwoNN, to probe the structure and complexity of the data manifolds and their relation to classification accuracy.
Intrinsic dimension of data representations in deep neural networks
Alessio Ansuini (SISSA), Alessandro Laio (SISSA), Jakob Macke (TUM), Davide Zoccolan (SISSA)
Neuroscience, physics, machine learning
Deep neural nets for image classification: training with a very large set of labelled images (> 1 million). [Adapted from Yamins & DiCarlo et al. (2016)]
Deep neural nets for image classification: after training, the network is tested on held-out images, mapping input images to output category predictions. [Adapted from LeCun et al. (2015) and Yamins & DiCarlo et al. (2016)]
Machine vs. human vision: what do artificial and human visual processing have in common?
The ventral stream: an object-processing path
retina → thalamus → primary and secondary visual cortex (V1 & V2) → area V4 → inferotemporal cortex (IT)
[Adapted from Yamins & DiCarlo et al. (2016)]
The ventral stream: an object-processing path
Along the stream, each object's representation is described as an object manifold: the "Sam" and "Joe" manifolds are traced from retinal space, through V1 space, to IT space.
[Adapted from Yamins & DiCarlo et al. (2016) and DiCarlo & Cox (2007)]
The untangling hypothesis
Along the ventral stream, object manifolds are progressively untangled: the "Sam" and "Joe" manifolds, tangled in retinal space, become separable in IT space.
[Adapted from Yamins & DiCarlo et al. (2016) and DiCarlo & Cox (2007)]
The untangling hypothesis
Applied to deep nets: between pixel space and the last hidden layer space, object manifolds are expected to become:
• Less tangled ✓
• Flattened ?
• Lower dimensional ?
[Adapted from DiCarlo & Cox (2007)]
The goal: to track the intrinsic dimension (ID) of image representations in deep nets
• We tested various state-of-the-art deep nets: AlexNet, VGG, VGG-bn & ResNet
• We estimated the ID in a subset of layers: input & output layers, pooling layers & fully connected layers (a sketch of how such per-layer representations can be collected follows below)
• Empirical questions:
• How does the ID vary across the layers of a deep net?
• How linear (i.e., flat) are the data manifolds?
• Is there any relationship between the ID in the last hidden layer and the classification performance of the network?
[VGG-16: Simonyan & Zisserman (2015); source: https://blog.heuritech.com]
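As a concrete illustration of how per-layer representations can be extracted from a pre-trained network, here is a minimal sketch using PyTorch forward hooks on torchvision's VGG-16; the choice of hooked modules, the dummy input batch, and all names are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch (illustrative, not the authors' pipeline): collecting per-layer
# activations from a pre-trained VGG-16 with PyTorch forward hooks.
import torch
import torchvision.models as models

# pretrained=True is the classic API; newer torchvision versions use weights=...
model = models.vgg16(pretrained=True).eval()

activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        # One row per image: flatten each sample's activation into a vector.
        activations[name] = output.detach().flatten(start_dim=1)
    return hook

# Hook, for example, every max-pooling layer and every fully connected layer.
for idx, layer in enumerate(model.features):
    if isinstance(layer, torch.nn.MaxPool2d):
        layer.register_forward_hook(save_activation(f"pool_{idx}"))
for idx, layer in enumerate(model.classifier):
    if isinstance(layer, torch.nn.Linear):
        layer.register_forward_hook(save_activation(f"fc_{idx}"))

with torch.no_grad():
    images = torch.randn(8, 3, 224, 224)   # placeholder batch of preprocessed images
    model(images)

# Each entry of `activations` is an (n_images, embedding_dim) matrix, i.e. the
# kind of data representation whose intrinsic dimension (ID) is estimated below.
```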
Estimation of the intrinsic dimension
Intrinsic dimension of a data representation: the minimal number of coordinates that are necessary to describe its points without significant information loss.
• The linear case: Principal Component Analysis (PCA). In the illustration, the data points (P1, P2, ...) live in a 2D embedding space (activations x1, x2) but lie on a 1D linear subspace, so their ID is 1.
Estimation of the intrinsic dimension
• The general (non-linear) case: TwoNN (Facco et al., 2017)
1) For each data point i, compute the distances to its first and second nearest neighbours, r_i,1 and r_i,2.
2) For each i, compute μ_i = r_i,2 / r_i,1. The probability distribution of μ is f(μ) = d μ^(−d−1), where d is the ID, independently of the local density of points.
3) Infer d from the empirical probability distribution of all the μ_i (a numerical sketch follows below).
[Illustration: points on a 1D manifold embedded in a 2D space of activations (x1, x2).]
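A minimal numerical sketch of the TwoNN estimator is given below; it uses the closed-form maximum-likelihood estimate d = N / Σ_i log μ_i implied by the Pareto form of the distribution, whereas Facco et al. (2017) fit the empirical cumulative distribution (typically after discarding the largest μ values). Function and variable names are illustrative.

```python
# Minimal sketch of the TwoNN intrinsic-dimension estimator (Facco et al., 2017).
# Uses the maximum-likelihood estimate d = N / sum_i log(mu_i); the original
# method instead fits the empirical CDF of mu, usually discarding the top ~10%.
import numpy as np
from scipy.spatial import cKDTree

def twonn_id(X):
    """X: (n_points, embedding_dim) matrix of activations; returns the estimated ID."""
    tree = cKDTree(X)
    # k=3: each query returns the point itself plus its first and second neighbours.
    dists, _ = tree.query(X, k=3)
    r1, r2 = dists[:, 1], dists[:, 2]
    valid = r1 > 0                      # guard against duplicated points
    mu = r2[valid] / r1[valid]
    return valid.sum() / np.sum(np.log(mu))

# Example: points on a circle (a 1D manifold embedded in 2D) give an ID close to 1,
# even though the manifold occupies both embedding coordinates.
t = np.random.rand(2000) * 2 * np.pi
X = np.column_stack([np.cos(t), np.sin(t)])
print(twonn_id(X))
```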
Results 1: Evolution of ID across the layers of a deep net
VGG-16, pre-trained on ImageNet. [Architecture: Simonyan & Zisserman (2015); source: https://blog.heuritech.com]
Results 1: Evolution of ID across the layers of a deep net
• The ID evolution across layers has a hunchback shape
• In each layer: ID << ED (the embedding dimension)
• In the last hidden layers:
• the ID decreases monotonically
• the ID reaches very small values (~10 or lower)
Results 2: Evolution of ID in state-of-the-art deep nets
• Four families of state-of-the-art deep net architectures: AlexNet, VGG, VGG-bn, ResNet
• Pre-trained on ImageNet
• ID computed for the 7 most populated categories (500 images each)
Results 2: Evolution of ID in state-of-the-art deep nets
• The ID evolution across layers has a hunchback shape
• In each layer: ID << ED (the embedding dimension)
• Considerable overlap of the ID profiles as a function of relative layer depth: after an initial growth of the ID, deep nets perform a progressive dimensionality reduction of the object manifolds
• Any relationship with classification accuracy?
Results 3: Performance vs. ID in the last hidden layer
• Four families of state-of-the-art deep net architectures: AlexNet, VGG, VGG-bn, ResNet
• Pre-trained on ImageNet
• ID computed over a random mixture of 2000 training images
• Classification performance measured as the top-5 error
Results 3: Performance vs. ID in the last hidden layer
• Correlation between ID and performance: r = 0.94 (training set), r = 0.99 (test set)
• The ID on the training set is a strong predictor of performance on the test set
• ID = a proxy for the generalization ability of a deep network
The untangling hypothesis, revisited in the light of the ID results: between pixel space and the last hidden layer space, object manifolds become:
• Less tangled ✓
• Flattened ?
• Lower dimensional ✓
[Adapted from DiCarlo & Cox (2007)]
Results 4: Linear vs. non-linear ID estimates
• If a 1D manifold is a 1D linear subspace of the 2D embedding space, ID-PCA ≈ ID-TwoNN
• If the 1D manifold is curved, ID-PCA >> ID-TwoNN
Results 4: Linear vs. non-linear ID estimates
• No gap in the eigenvalue spectrum yielded by PCA → the data manifolds are not linear
• Yet we define ID-PCA = the number of PCs that account for 90% of the variance in the data (see the sketch below)
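A minimal sketch of this ID-PCA criterion, assuming the activations of a layer are arranged as an (n_points, embedding_dim) matrix; the 90% threshold is from the slide, everything else (names, example data) is illustrative.

```python
# Minimal sketch of the ID-PCA criterion used on the slide: the number of
# principal components needed to explain 90% of the variance in the data.
import numpy as np

def pca_id(X, variance_threshold=0.90):
    """X: (n_points, embedding_dim) activation matrix; returns ID-PCA."""
    Xc = X - X.mean(axis=0)                       # center the data
    # Singular values of the centered data give the PCA eigenvalue spectrum.
    s = np.linalg.svd(Xc, compute_uv=False)
    explained = (s ** 2) / np.sum(s ** 2)
    cumulative = np.cumsum(explained)
    return int(np.searchsorted(cumulative, variance_threshold) + 1)

# On a curved 1D manifold (a circle in 2D), ID-PCA overestimates the
# dimensionality, whereas TwoNN would report a value close to 1.
t = np.linspace(0, 2 * np.pi, 1000, endpoint=False)
X = np.column_stack([np.cos(t), np.sin(t)])
print(pca_id(X))   # 2: both principal components are needed
```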
Results 4: Linear vs. non-linear ID estimates
• No gap in the eigenvalue spectrum yielded by PCA
• ID-PCA >> ID-TwoNN
• ID-PCA is not much different for trained and randomly initialized networks
• ID-TwoNN is flat for randomly initialized networks (orthogonal transformations)
→ the data manifolds are not linear
The untangling hypothesis, concluded: between pixel space and the last hidden layer space, object manifolds become:
• Less tangled ✓
• Flattened ✗ (the manifolds remain non-linear, i.e. they are not flattened)
• Lower dimensional ✓
[Adapted from DiCarlo & Cox (2007)]
Results 5: Looking into the origin of the initial ID expansion
VGG-16. [Source: https://blog.heuritech.com]
Results 5: Looking into the origin of the initial ID expansion, studied on MNIST.
Results 5: Looking into the origin of the initial ID expansion
• In a trained network, the initial ID expansion reflects the pruning of low-level, highly correlated visual features that carry no information about the correct labeling
Summary
• Hunchback shape of ID vs. layer depth
• Correlation of ID with performance
• Low ID, but non-linear manifolds
• Initial ID expansion = pruning of low-level information
Summary
• FIRST LAYERS: remove irrelevant features; the ID grows
• LAST LAYERS: dimensionality reduction; the ID shrinks along the layers, and the more gradual the reduction, the better (deep networks win)
Acknowledgments
Alessio Ansuini (SISSA), Jakob Macke (TUM), Davide Zoccolan (SISSA)
NeurIPS 2019, Vancouver