
Intelligent Systems Laboratory of Moscow Research Center

The Intelligent Systems Laboratory at Huawei Moscow Research Center aims to advance mathematical algorithm design through collaboration with leading Russian universities and institutes. The lab focuses on mathematical modeling, optimization, and other research directions to enhance intelligent systems.


Presentation Transcript


  1. Intelligent Systems Laboratory of Moscow Research Center • Mazurenko Ivan • 2019.07.03 • Mazurenko.Ivan1@Huawei.com

  2. Huawei Moscow Research Center • The Moscow Research Center concept is algorithm design based on strong mathematical knowledge, supported by leading Russian universities and institutes of the Russian Academy of Sciences. • Moscow RC: 200+ in total, with 40+ mathematical researchers from Lomonosov Moscow State University, MIPT, Skoltech, HSE, Bauman and other universities, among them 25 PhDs and 2 Russian Doctors of Science, plus 20 interns. • Units: Mathematical Modeling & Optimization CC, IT Algorithm Lab, Media Algorithm Lab, RTT Algorithm Lab, IRF Algorithm Competence Center, Coding Algorithm Competence Center, Intelligent Systems Lab.

  3. Intelligent Systems Lab Directions • Video intelligence group (video surveillance): FaceID, dataset augmentation, AI chip, … • Camera AI group (smartphone camera): segmentation, denoising, anti-distortion, … • Big data group: graph mining, content-aware video compression, video surveillance, … • Image search group • SSD group • Quantum ECC group • ADAS group (new)

  4. Video Intelligence Direction • Valuable features: easy to see (IPC imaging), accurate capture (target snapshot), recognizable (person/vehicle identification), easy to understand (content and behavior understanding), fast search (person/vehicle search), good coding (transcoding and enhancement), easy to transfer (conference and live broadcast, cloud transcoding). • (Slide matrix of example capabilities per feature: face recognition and facial attributes, license plate and car brand & color, multi-category behavior such as fights and gatherings, perceptual coding (PVC), feature quantization, scenario identification, dark-light imaging, facial/pedestrian/vehicle snapshots, human body/vehicle image search, data clustering, KNN, multi-focal multi-camera imaging, video image stitching, intra-class segmentation, FEC/ARQ, ROI transmission, device-cloud congestion control, desktop coding, video network collaboration, live video on demand (VoD), enhancement and quality detection.) • Key technologies: • Algorithm: detection, tracing, quality evaluation, segmentation, identification, retrieval, data-distributed training / AutoML algorithms; transcoding and packet loss concealment; RGB+IR fusion and AI enhancement. • Market practice: snapshot and identification practices in Guangzhou, Hefei, and Xinjiang; CloudBU Yingkou live broadcast practice; video conferencing company office practice. • Data: data design and data production system design; semi-automatic and automatic labeling of data; data augmentation and data cleaning. • System design: vehicle association identification system; interactive recognition system; IPC imaging and snapshot system; video cloud processing and device-cloud collaborative transmission system. • Calculation: multi-machine and multi-GPU training cluster; D chip / Hi35XX joint development; GPU/DSP/IVE/ARM/FPGA/X86 optimization.

  5. On Fundamental Mathematical Problems in Deep Learning • Huawei Moscow Research Center, Intelligent Systems Lab • Mazurenko Ivan, Petyushko Alexander • 2019.05.31

  6. Contents • Introduction • 1 Expressiveness • 2 Convergence • 3 Generalization • 4 Robustness • Related math problems • 5 Metric learning • 6 Feature mapping

  7. 1 Expressiveness Can a neural network (and a convolutional neural network in particular) represent any continuous function with any desired precision? • In 1957, Andrey Kolmogorov (USSR, Moscow State University) proved a universal representation theorem [1] for continuous functions: any multivariate continuous function can be expressed as a finite superposition of continuous functions of a single variable and addition. • This result closely relates to Hilbert's 13th problem and may be considered a fundamental theoretical basis for the whole theory of neural networks. • In 1989, George Cybenko (Dartmouth College, USA) proved a universal approximation theorem [2] for multi-layer feedforward neural networks ("perceptrons"): any continuous function on a compact domain can be approximated to any given precision by a 2-layer perceptron with a monotonically increasing, nonconstant, bounded activation function. • The theorem was independently proven by the authors of [3]. • Some recent works, such as a paper by Dmitry Yarotsky (Skoltech, Russia) [4] and a work of Ding-Xuan Zhou (City University of Hong Kong) [5], provide universal approximation theorems for special classes of convolutional neural networks. • But the universal approximation problem for convolutional neural networks in general is not solved yet. (Portraits: Andrey Kolmogorov, George Cybenko, Dmitry Yarotsky.) [1] Kolmogorov, A. N. (1957). "On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition". Doklady Akademii Nauk SSSR, 114, 953-956. [2] Cybenko, G. (1988). Approximation by superpositions of a sigmoidal function (Tech. Rep. No. 856). Urbana, IL: University of Illinois at Urbana-Champaign, Department of Electrical and Computer Engineering. [3] Hornik, K., Stinchcombe, M., & White, H. (1988). "Multilayer feedforward networks are universal approximators" (Discussion Paper 88-45). San Diego, CA: Department of Economics, University of California, San Diego. [4] Yarotsky, Dmitry. "Universal approximations of invariant maps by neural networks." arXiv preprint arXiv:1804.10306 (2018). [5] Zhou, Ding-Xuan. "Universality of Deep Convolutional Neural Networks." arXiv preprint arXiv:1805.10769 (2018).
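As a toy illustration of the universal approximation property discussed above (our own sketch, not part of the original slides, and not the construction used in the proofs), the following numpy code fits a single-hidden-layer sigmoid network to the continuous function sin(x) with plain gradient descent:

```python
# Toy illustration of universal approximation (our sketch): one hidden sigmoid layer
# trained by full-batch gradient descent to approximate the continuous function sin(x).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-np.pi, np.pi, 256).reshape(-1, 1)
y = np.sin(x)                                   # target continuous function

H = 32                                          # number of hidden units
W1 = rng.normal(0.0, 1.0, (1, H)); b1 = np.zeros(H)
W2 = rng.normal(0.0, 0.1, (H, 1)); b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

lr = 0.1
for step in range(20000):
    h = sigmoid(x @ W1 + b1)                    # hidden activations, shape (256, H)
    pred = h @ W2 + b2                          # linear read-out, shape (256, 1)
    err = pred - y                              # gradient of 0.5 * MSE w.r.t. pred
    gW2 = h.T @ err / len(x); gb2 = err.mean(0)
    dh = (err @ W2.T) * h * (1.0 - h)           # backprop through the sigmoid
    gW1 = x.T @ dh / len(x); gb1 = dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

approx = sigmoid(x @ W1 + b1) @ W2 + b2
print("max approximation error:", np.abs(approx - y).max())
```

Increasing the number of hidden units H lets the error be driven lower, which is exactly what the theorems in [2,3] promise for this class of activations.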

  8. Expressiveness of different DNNs • Multi-Layer Perceptron • Convolutional network • Graph neural network • CNN with discrete weights • Binary Neural Network • Capsule Neural Network • Impulse (spiking) Neural Network • Memristor Neural Network • Open questions: What is the expressiveness of each of these neural network classes? How to train discrete (binary) NNs? How to automatically convert a DNN from one class to another? What is the optimal bit width for discrete DNNs?

  9. 2 Convergence Do Gradient Descent (GD) and Stochastic Gradient Descent (SGD) converge to the global optimum? • In 1957, Frank Rosenblatt (USA) [1] proposed the concept of the perceptron and a heuristic perceptron training algorithm. • In 1962, Albert Novikoff (Stanford, USA) proved a convergence theorem [2] for it: perceptron training converges after making not more than (R/G)² updates (where R is the maximal norm of an input and G is the margin between the linearly separable classes). • In recent works (2017-2019), multiple researchers provide theorems that prove approximate convergence of Gradient Descent [3] and Stochastic Gradient Descent [4,5,6] with a polynomial number of iterations for convolutional neural networks under constraints and for special classes of CNNs (e.g. ResNet-based). • Nevertheless, a convergence theorem for the general case of convolutional neural networks is not proven yet. (Portraits: Frank Rosenblatt, Albert Novikoff.) [1] Rosenblatt, Frank (1957). The Perceptron - a perceiving and recognizing automaton. Report 85-460-1, Cornell Aeronautical Laboratory. [2] Novikoff, A. B. J. (1962). On convergence proofs on perceptrons. Symposium on the Mathematical Theory of Automata, 12, 615-622. Polytechnic Institute of Brooklyn. [3] Du, Simon S., et al. "Gradient descent finds global minima of deep neural networks." arXiv preprint arXiv:1811.03804 (2018). [4] Allen-Zhu, Zeyuan, Yuanzhi Li, and Zhao Song. "A convergence theory for deep learning via over-parameterization." arXiv preprint arXiv:1811.03962 (2018). [5] Difan Zou, Yuan Cao, Dongruo Zhou, Quanquan Gu. "Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks." arXiv preprint arXiv:1811.08888 (2018). [6] Yi Zhou, Junjie Yang, Huishuai Zhang, Yingbin Liang, Vahid Tarokh. "SGD Converges to Global Minimum in Deep Learning via Star-convex Path." arXiv preprint arXiv:1901.00451 (2019).
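The perceptron algorithm and Novikoff's mistake bound are easy to see in code. The following toy sketch (our own illustration, with synthetic linearly separable 2-D data) runs Rosenblatt's update rule and counts the number of updates, which Novikoff's theorem bounds by (R/G)²:

```python
# Rosenblatt's perceptron update rule on toy linearly separable 2-D data (our sketch).
# Novikoff's theorem bounds the number of updates by (R/G)^2, where R bounds the input
# norms and G is the separation margin; here we simply count the updates made.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 2))
X = X[np.abs(X[:, 0] + X[:, 1]) > 0.3][:200]    # keep a margin around the line x1 + x2 = 0
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)      # labels from that separating line
Xb = np.hstack([X, np.ones((len(X), 1))])       # append a bias feature

w = np.zeros(3)
updates = 0
for epoch in range(1000):
    mistakes = 0
    for xi, yi in zip(Xb, y):
        if yi * (w @ xi) <= 0:                  # misclassified (or on the boundary)
            w += yi * xi                        # perceptron update
            updates += 1
            mistakes += 1
    if mistakes == 0:                           # all points correct: training has converged
        break

print("updates made:", updates, "final weights:", w)
```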

  10. DNN Convergence – Open Problems • How to compare two neural networks, taking into account the instability of the training procedure (dependence on random weight initialization, learning-rate schedule, etc.)? • When to stop the training procedure to prevent overfitting to the training set? • How to accelerate the training procedure? (tensor train, proper weight initialization, multiple GPUs for data and model parallelization, tensor neuro chips, analog memristor-based networks, ...)

  11. 3 Generalization How to select the optimal complexity of the model to reach its maximal generalization ability while preventing overfitting on the training data? (Figure: training-error and test-error curves illustrating the generalization error; portraits: Vladimir Vapnik, Alexey Chervonenkis, Sanjeev Arora, Dmitry Yarotsky.) [1] Vapnik, V. N.; Chervonenkis, A. Ya. (1971). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Theory of Probability & Its Applications, 16 (2): 264. [2] Vapnik, Vladimir (2000). The Nature of Statistical Learning Theory. Springer. [3] Arora, Sanjeev, et al. "Stronger generalization bounds for deep nets via a compression approach." arXiv preprint arXiv:1802.05296 (2018). [4] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Proceedings of the 28th Conference on Learning Theory (COLT), 2015. [5] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017.
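For context, one classical form of the Vapnik-Chervonenkis generalization bound from [1,2] can be written as follows (notation ours; h is the VC dimension of the hypothesis class and n the number of training samples):

```latex
% One classical VC-style bound (notation ours): with probability at least 1 - \delta
% over an i.i.d. training set of size n, every hypothesis f from a class of
% VC dimension h satisfies
\[
  R(f) \;\le\; R_{\mathrm{emp}}(f)
  \;+\; \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) + \ln\frac{4}{\delta}}{n}} ,
\]
% where R(f) is the expected (test) risk and R_emp(f) the empirical (training) risk.
% Choosing the "optimal model complexity" means trading the two terms off against each other.
```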

  12. DNN Generalization – Open Problems • How to estimate the optimal number of DNN weights? (an analog of the VC dimension for neural networks) • How to automatically find the optimal DNN architecture for the given task? • Examples: genetic architecture search, Monte-Carlo architecture search, gradient descent-based architecture search methods • How to automatically select the best hyper-parameters (parameters of loss functions, learning rate of SGD, weight initialization, etc.)? • How to accelerate the inference of DNNs? (quantization, pruning, tensor decomposition, ...; a toy quantization sketch follows)
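To make the last bullet concrete, here is a toy post-training uniform quantization of a weight tensor (our own sketch; production toolchains use per-channel scales, calibration data, quantization-aware training, etc.):

```python
# Toy post-training symmetric uniform quantization of a weight tensor (our own sketch).
import numpy as np

def quantize_uniform(w, num_bits=8):
    """Quantize a float array to signed integers with a single per-tensor scale."""
    qmax = 2 ** (num_bits - 1) - 1                   # e.g. 127 for 8 bits
    scale = np.abs(w).max() / qmax                   # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax)
    return q.astype(np.int8 if num_bits <= 8 else np.int32), scale

def dequantize(q, scale):
    """Map quantized integers back to floats for accuracy checks."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, size=(256, 256)).astype(np.float32)   # fake layer weights
q, s = quantize_uniform(w, num_bits=8)
w_hat = dequantize(q, s)
print("mean abs quantization error:", np.abs(w - w_hat).mean())
```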

  13. 4 Robustness How can we guarantee that output distributions of a CNN are close when the distributions of input images are close? • Recently, Adversarial Examples [1] for convolutional neural networks were discovered by Christian Szegedy (Google AI, USA) et al. Two images are called adversarial if they are very close (the difference is almost invisible to the human eye) but produce significantly different outputs when fed to the input of the same CNN. So-called Adversarial Attacks (including Black Box Attacks and Real World Attacks) can fool the neural network by slightly changing the input images, which may lead to incorrect classification of NN inputs. • The problem of Adversarial Defense is closely connected to the problem of calculating the distance between two probability distributions: we may guarantee the robustness of a neural network by training it with special types of losses based on calculating the distance between two distributions. Such methods include the KL divergence / relative entropy [2], the Inception score [3], the Fréchet distance [4], etc. • In 2017, Martin Arjovsky (New York University, USA) et al. suggested using the Wasserstein metric, named after the Russian mathematician Leonid Vaserstein, who introduced this concept in 1969. This novel idea was used in the new Wasserstein Generative Adversarial Networks [5], which guarantee convergence of their training. • A new mathematical theory needs to be developed to provide a guaranteed defense against any type of adversarial attack. (Portraits: Christian Szegedy, Leonid Vaserstein, Martin Arjovsky.) [1] Szegedy, Christian, et al. "Intriguing properties of neural networks." arXiv preprint arXiv:1312.6199 (2013). [2] Kullback, Solomon, and Richard A. Leibler. "On information and sufficiency." The Annals of Mathematical Statistics 22.1 (1951): 79-86. [3] Salimans, Tim, et al. "Improved techniques for training GANs." Advances in Neural Information Processing Systems. 2016. [4] Heusel, Martin, et al. "GANs trained by a two time-scale update rule converge to a local Nash equilibrium." Advances in Neural Information Processing Systems. 2017. [5] Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein GAN." arXiv preprint arXiv:1701.07875 (2017).
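For reference, the two distribution distances named on this slide have the following standard definitions (notation ours):

```latex
% Kullback-Leibler divergence between distributions P and Q with densities p, q:
\[
  D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \int p(x)\,\log\frac{p(x)}{q(x)}\,dx .
\]
% Wasserstein-1 (earth mover's) distance, with \Pi(P, Q) the set of couplings of P and Q:
\[
  W_1(P, Q) \;=\; \inf_{\gamma \in \Pi(P, Q)} \; \mathbb{E}_{(x, y) \sim \gamma}\big[\|x - y\|\big].
\]
% Unlike D_KL, W_1 stays finite and meaningful when P and Q have disjoint supports,
% which is the property exploited by Wasserstein GANs [5].
```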

  14. DNN Robustness – Open Problems • How to train a DNN in a robust way to defend against all known adversarial attacks? • How to compare the similarity of the distributions at the output of a DNN and at its input? • How to evaluate the quality of generative adversarial networks (GANs)? • How to learn the topological properties of the manifolds of input images / DNN outputs?

  15. FaceID Attacks: Geometrical & Real-world (Figure: example photos of two attack types on subjects Ivan and Alex: impersonation and dodging (no recognition).)

  16. Generative Adversarial Networks (GANs) A generative adversarial network (GAN) is a class of machine learning systems in which two neural networks contest with each other in a zero-sum game framework. This technique can generate photographs that look at least superficially authentic to human observers, having many realistic characteristics. It is a form of unsupervised learning. The generative network G generates candidates while the discriminative network D evaluates them. The contest operates in terms of data distributions: typically, the generative network learns to map from a latent space to a data distribution of interest, while the discriminative network distinguishes candidates produced by the generator from the true data distribution. The name "GAN" was introduced by Ian Goodfellow (Google AI) et al. in 2014. Their paper popularized the concept and influenced subsequent work. (Portrait: Ian Goodfellow.)
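For completeness, the minimax objective introduced in Goodfellow et al.'s 2014 paper is (notation ours):

```latex
% Original GAN minimax objective: the discriminator D maximizes, the generator G minimizes.
\[
  \min_G \max_D \; V(D, G) \;=\;
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big].
\]
% For a fixed G, the optimal D turns the objective into (up to constants) the
% Jensen-Shannon divergence between p_data and the generator's distribution.
```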

  17. Which of these two face images is real?

  18. Which of these two face images is real? Answer: both are fake and generated by GANs!

  19. Problem: How to Evaluate the GAN? • Which of these two GANs is better?
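One widely used quantitative answer (not stated on this slide, but cited as [4] on slide 13) is the Fréchet Inception Distance of Heusel et al.: fit Gaussians to Inception features of real and generated images and compute the Fréchet distance between them:

```latex
% Frechet Inception Distance (FID), notation ours: m_r, C_r are the mean and covariance of
% Inception features of real images, m_g, C_g those of generated images.
\[
  \mathrm{FID} \;=\; \|m_r - m_g\|_2^2
  + \operatorname{Tr}\!\big(C_r + C_g - 2\,(C_r C_g)^{1/2}\big),
\]
% i.e. the squared Frechet (Wasserstein-2) distance between the two fitted Gaussians;
% the GAN with the lower FID is usually judged the better one.
```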

  20. Manifold hypothesis for AI • Manifold hypothesis: a reason why NNs work as classifiers • Almost all of R^N consists of invalid samples (from the point of view of the real data distribution) • A random image is almost always noise • All interesting cases are concentrated along a small number of low-dimensional manifolds • Example: • Initial image samples in the multidimensional space of size Height × Width (where height and width can be thousands) are hard to separate (linearly): they are entangled • The NN's aim is to disentangle the classes using an inner representation of smaller dimension (on a manifold) • If we work on a manifold of much smaller dimension, the task can be solved much more easily

  21. 5 Related problem - Metric Learning • We need to minimize the average difference between the maximal similarity of images from different classes and the minimal similarity of images from the same class (one formalization is sketched below).
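One way to formalize the objective stated above (our reading of the slide; f is the embedding network, s a similarity function, C the set of classes, X the training set and X_c the images of class c) is:

```latex
% Our formalization of the slide's metric-learning objective (notation ours).
\[
  \min_f \;\; \frac{1}{|C|} \sum_{c \in C}
  \Big[
    \max_{x \in X_c,\; y \in X \setminus X_c} s\big(f(x), f(y)\big)
    \;-\;
    \min_{x,\, x' \in X_c} s\big(f(x), f(x')\big)
  \Big]
\]
% i.e. the hardest negative pair should be less similar than the least similar positive
% pair, averaged over classes - a worst-case flavor of triplet/contrastive losses.
```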

  22. 6 Related problem - Feature Mapping • We need to find a mapping M that minimizes the mean square distance between the similarities computed by the target network F2 and the similarities obtained from the source network F1 via the mapping M (one possible formalization is given below).
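Under one reading of this slide (our formalization, in which M acts on the source network's embeddings and s is the deployment similarity, e.g. cosine), the objective over image pairs (x, y) is:

```latex
% Our formalization of the feature-mapping objective (notation ours).
\[
  \min_M \;\; \mathbb{E}_{x, y}
  \Big( s\big(F_2(x), F_2(y)\big) - s\big(M(F_1(x)), M(F_1(y))\big) \Big)^{2}
\]
% With a linear M and s the inner product, this reduces to a least-squares problem
% over the sampled pairs.
```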

  23. 7 Other - Disentangled Representations • The task is related to all embedding search / classification problems (such as FaceID) • The idea is to propose a training method and / or NN architecture such that: • There exists a robust, well-defined mapping between some elements of the NN embedding vector and semantic features of the objects being classified • e.g. for FaceID: eye color, skin color, hair length, etc. • Reference paper: "Towards a Definition of Disentangled Representations" • We need to find a mapping between embeddings and semantic features of objects. The task is tightly connected to feature mapping and manifold learning.

  24. 8 Other - Video face recognition • Background • A single image is widely adopted for face recognition in both academia and industry. However, much information is lost after selecting a key frame from the original sequence. Would it help to recognize the video sequence as a whole? If so, how? • Problem description • Input: cropped face image sequence of the same person • Task: find a single feature to represent the video sequence, and use it for 1:N face identification • Goal: improve performance for scenarios where a single image is inadequate, such as low resolution, blur and large pose • (Diagram: key frame selection → feature extraction → [0.01, 0.03, …, 0.05], versus whole-sequence feature extraction → [0.02, 0.04, …, 0.06])
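A simple baseline for this task (our own sketch, not the method proposed on the slide) is to average per-frame embeddings into one sequence-level feature and run 1:N identification by cosine similarity; the per-frame extractor embed_frame below is a hypothetical stand-in for a real face embedding network:

```python
# Sequence-level baseline (our sketch): average per-frame face embeddings, L2-normalize,
# and identify by cosine similarity against a gallery of N identities.
import numpy as np

rng = np.random.default_rng(0)
_PROJ = rng.normal(size=(112 * 112 * 3, 128))        # toy projection standing in for a CNN

def embed_frame(face_crop: np.ndarray) -> np.ndarray:
    """Hypothetical per-frame face embedding (here: flatten + fixed random projection)."""
    return face_crop.reshape(-1) @ _PROJ

def sequence_feature(face_crops) -> np.ndarray:
    """Aggregate a cropped face sequence into a single L2-normalized feature vector."""
    feats = np.stack([embed_frame(f) for f in face_crops])      # (T, D)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)       # normalize each frame
    agg = feats.mean(axis=0)                                    # average pooling over time
    return agg / np.linalg.norm(agg)

def identify(seq_feat: np.ndarray, gallery: dict) -> tuple:
    """1:N identification: return (name, score) of the most similar gallery identity."""
    names = list(gallery)
    G = np.stack([gallery[n] for n in names])                   # (N, D), assumed normalized
    scores = G @ seq_feat                                       # cosine similarities
    best = int(np.argmax(scores))
    return names[best], float(scores[best])

# Usage with synthetic data: a 10-frame "video" of random 112x112 RGB crops.
video = [rng.random((112, 112, 3)) for _ in range(10)]
gallery = {name: sequence_feature([rng.random((112, 112, 3))]) for name in ("Ivan", "Alex")}
print(identify(sequence_feature(video), gallery))
```

More elaborate aggregations (quality-weighted pooling, attention over frames) fit the same interface and directly target the low-resolution, blur and large-pose cases mentioned in the goal.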

  25. Problems in Deep Learning (summary slide, four directions: Expressibility, Convergence, Generalization, Robustness) • Automatic optimal architecture search • Automatic hyper-parameter search • Model compression • Fast convergence algorithms • Training dataset cleanup • Single/few-shot learning • Deep online learning • Loss function research • NN fine-tuning • Optimal selection of network size • Optimal initialization • Overfitting prevention • Validation and testing methodology • Adversarial attacks • Black box attacks • Real-world attacks • Adversarial defense • Identity-preserving dataset augmentation
