(Deep) Learning from Programs Marc Brockschmidt - MSR Cambridge @mmjb86
Personal History: Termination Proving
System.out.println(“Hello World!”)
MSR Team Overview
Deep Learning: • Understands images/language/speech • Finds patterns in noisy data • Requires many samples • Handling structured data is hard
Procedural Artificial Intelligence: • Interpretable • Generalisation verifiable • Manual effort • Limited to specialists
Program Structure
MSR Team Overview
At the intersection of Deep Learning, Procedural Artificial Intelligence and Program Structure: • Understanding Programs • Program-structured ML models • Generating Programs
Overview
Pipeline: Program → (1) Intermediate Representation → (2) Learned Representation → (3) Result
• Step 1: Transform to an ML-compatible representation – Interface between PL and ML world – Often re-uses compiler infrastructure
• Step 2: Obtain (latent) program representation – Often re-uses natural language processing infrastructure – Usually produces a (set of) vectors
• Step 3: Produce output useful to programmer/PL tool – Interface between ML and PL world
ML Models – Part 1 • Things not covered • (Linear) Regression • Multi-Layer Perceptrons • How to split data
Not Covered Today • Details of training ML systems See generic tutorials for internals (e.g., backprop) • Concrete implementations But: Many relevant papers come with F/OSS artefacts • Non-neural approaches But: Ask me for pointers, and I’ll allude to these from time to time
(Linear) Regression
Given: data (x_1, y_1), …, (x_n, y_n)
Aim: Find f such that f(x_i) = y_i
Fudge 1: Allow mistakes: Find f such that Σ_i ||f(x_i) − y_i||² (“loss”) is minimal
Fudge 2: Restrict the shape of f to linear functions f(x) = W·x + b: Find W, b such that Σ_i ||W·x_i + b − y_i||² (“loss”) is minimal
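As a minimal sketch (not part of the slides), the restricted linear form can be fitted directly with numpy's least-squares solver; the synthetic data and all shapes are illustrative assumptions:

```python
# Minimal sketch: fit a linear model f(x) = W·x + b by minimising the squared
# loss sum_i ||W·x_i + b - y_i||^2 with numpy's least-squares solver.
# The data is synthetic and purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # 100 samples, 3 features
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3           # "ground truth" W and b

X_aug = np.hstack([X, np.ones((100, 1))])          # absorb b as an extra weight
W_aug, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
W, b = W_aug[:-1], W_aug[-1]
print("learned W:", W, "learned b:", b)            # ~[2, -1, 0.5] and ~0.3
```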
Gradient Descent
Aim: Find parameters θ such that the loss L(θ) is minimal
Idea:
1. Choose a random θ
2. Given the fixed data, L is a function of θ
3. Compute the partial derivatives of L with respect to the elements of θ
4. Modify θ according to the derivatives to make L smaller
5. Go to 2
Stochastic Gradient Descent
Aim: Find parameters θ such that the loss L(θ) is minimal
Problem: Too much data to handle at once
Idea:
1. Choose a random θ
2. Using a sampled subset of the data, L is a function of θ
3. Compute the partial derivatives of L with respect to the elements of θ
4. Modify θ according to the derivatives to make L smaller
5. Go to 2
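A minimal sketch of this loop, applied to the squared loss of the linear model from the regression slide; batch size, learning rate and the synthetic data are illustrative choices, not values from the talk:

```python
# Minimal sketch: stochastic gradient descent on the squared loss of a linear
# model. theta collects W and b; each step uses a sampled subset of the data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3
X_aug = np.hstack([X, np.ones((1000, 1))])             # absorb b as an extra weight

theta = rng.normal(size=4)                             # step 1: choose random theta
for step in range(2000):
    idx = rng.choice(len(y), size=32, replace=False)   # sampled subset of the data
    Xb, yb = X_aug[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ theta - yb) / len(yb)      # partial derivatives of L
    theta -= 0.05 * grad                               # modify theta to make L smaller
print("theta:", theta)                                 # ~[2, -1, 0.5, 0.3]
```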
Multi-Layer Perceptrons
Problem: Linear functions are very restricted
Idea 1: Use a richer class of functions
But: linear is nice – easy to differentiate, fast, etc.
Idea 2: Alternate linear and parameter-free non-linear functions σ, e.g. f(x) = W₂·σ(W₁·x + b₁) + b₂
Common choices for σ: sigmoid, tanh, ReLU, …
http://playground.tensorflow.org
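A minimal sketch (illustrative sizes, not from the slides) of such an alternation of linear layers and a parameter-free ReLU, in numpy:

```python
# Minimal sketch: a two-layer perceptron forward pass, alternating learnable
# linear maps with a parameter-free ReLU non-linearity. Sizes are illustrative.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.normal(size=(16, 3)), np.zeros(16)
W2, b2 = 0.1 * rng.normal(size=(1, 16)), np.zeros(1)

def mlp(x):
    return W2 @ relu(W1 @ x + b1) + b2     # f(x) = W2·σ(W1·x + b1) + b2

print(mlp(np.array([1.0, -2.0, 0.5])))
```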
Data splitting Usual procedure: Train on ~2/3 of the data Validate performance on ~1/6 of the data (during development) Test & report performance on the rest (touching this set very, very rarely) Avoids specialising the model to the concrete data in the test set
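A minimal sketch of this split; the ~2/3, ~1/6, ~1/6 fractions follow the slide, while the shuffling and the toy "dataset" are illustrative:

```python
# Minimal sketch: shuffle the dataset, then split into ~2/3 train,
# ~1/6 validation and ~1/6 test.
import random

examples = list(range(600))          # stand-in for a dataset of programs
random.seed(0)
random.shuffle(examples)             # avoid a split biased by file order

n = len(examples)
train = examples[: 2 * n // 3]
valid = examples[2 * n // 3 : 5 * n // 6]
test  = examples[5 * n // 6 :]
print(len(train), len(valid), len(test))   # 400 100 100
```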
Programs are no Snowflakes • Program data is often sourced from crawling GitHub/Bitbucket/… • GitHub is full of duplicates: DéjàVu: A Map of Code Duplicates on GitHub, Lopes et al., OOPSLA 2017 • This impacts ML models: The Adverse Effects of Code Duplication in Machine Learning Models of Code, Allamanis, 2018
ML Models – Part 2: Sequences • Vocabularies & Embeddings • Bag of Words • Recurrent NNs • 1D Convolutional NNs • Attentional Models
Learning from !Numbers
Observation: Data usually does not come from ℝⁿ
Standard Solutions:
• Feature Extraction: Design a numerical representation by hand. Examples: presence of keywords, # of selected phrases, etc.
• Deep Learning: (Mostly) lossless transformation to a very large numerical representation; learn how to reduce it to a useful size
Vocabularies
Input: Token sequences t_1 … t_n
Aim: Represent tokens as vectors in ℝ^D with large D (D >> 1000)
Idea:
1. Assign ids 1 … D to the D most common tokens
2. Define e(t) as the one-hot vector with a 1 in the id(t)-th dimension
Consequence: Vocabulary is fixed at train time; all new test tokens are handled as a special “unknown” token
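A minimal sketch of such a vocabulary; the cut-off D, the reserved UNK id and the toy token stream are illustrative assumptions:

```python
# Minimal sketch: assign ids to the D most common training tokens; everything
# unseen at test time maps to a special "unknown" id.
from collections import Counter

UNK = 0

def build_vocab(token_stream, D=10000):
    most_common = [tok for tok, _ in Counter(token_stream).most_common(D)]
    return {tok: i + 1 for i, tok in enumerate(most_common)}   # 0 reserved for UNK

def token_to_id(vocab, tok):
    return vocab.get(tok, UNK)       # new test tokens are handled as unknown

vocab = build_vocab(["public", "class", "Foo", "Bar", "{", "}", "public"], D=5)
print([token_to_id(vocab, t) for t in ["public", "Foo", "NeverSeenBefore"]])
```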
Embeddings
Input: Token sequences t_1 … t_n
Aim: Represent tokens as vectors in ℝ^D with reasonably small D (e.g. D < 500)
Idea:
1. Assign learnable “embedding” vectors emb(t) ∈ ℝ^D to the most common tokens
2. Define e(t) = emb(t), sharing one embedding for all unknown tokens
3. Learn the embeddings together with the rest of the NN
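A minimal sketch in PyTorch; the embedding matrix is an ordinary learnable parameter, trained jointly with the rest of the network (vocabulary size and dimension are illustrative):

```python
# Minimal sketch: a learnable embedding table; token ids index into a matrix
# whose rows are trained together with the rest of the model.
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128)

token_ids = torch.tensor([[12, 7, 0, 42]])      # one sequence of 4 token ids (0 = UNK)
vectors = embedding(token_ids)                  # shape (1, 4, 128)
print(vectors.shape)
```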
Neural Bag of Words
Aim: Represent token sequences
Idea 1: Order does not matter – represent the sequence (e.g. x > 0 && x < 42) as a bag of embeddings e(t_1), …, e(t_n)
Idea 2: “Pool” the element-wise representations into a sequence representation, e.g. r = Σ_i e(t_i) or an element-wise max/mean
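A minimal sketch of mean pooling over embeddings in PyTorch (ids and sizes illustrative); note that any permutation of the token ids gives the same result:

```python
# Minimal sketch: neural bag of words - embed each token, then mean-pool over
# the sequence, discarding order.
import torch
import torch.nn as nn

embedding = nn.Embedding(10000, 128)
token_ids = torch.tensor([[5, 17, 3, 9, 5, 20, 42]])   # e.g. ids of "x > 0 && x < 42"

embedded = embedding(token_ids)                        # (1, 7, 128)
sequence_repr = embedded.mean(dim=1)                   # pool over tokens -> (1, 128)
print(sequence_repr.shape)
```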
Recurrent Neural Networks – Overview
Aim: Represent token sequences
Idea: Order is crucial – compute on the sequence of embeddings (e.g. x > 0 && x < 42): h_i = cell(h_{i−1}, e(t_i))
For functional programmers: let cell = … in foldl cell h_0 over the embedded sequence
Recurrent Neural Networks – Cell Types
Aim: Represent token sequences
Idea: Order is crucial – compute h_i = cell(h_{i−1}, e(t_i)) on the sequence of embeddings
Common choices for the cell: Basic RNN, GRU, LSTM
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Recurrent Neural Networks – Details
Aim: Represent token sequences
Idea: Order is crucial – compute h_i = cell(h_{i−1}, e(t_i)) on the sequence of embeddings
• Bidirectional RNNs: Run one RNN left-to-right and one right-to-left, and combine (e.g. concatenate) their states
• Deep RNNs: Use the outputs h_1 … h_n as inputs of another RNN (& repeat …)
Recurrent Neural Networks – Key Points
Aim: Represent token sequences
Idea: Order is crucial – compute on the sequence of embeddings
Key points:
• RNNs process sequences taking preceding inputs into account
• Sequential computation ⇒ slow on long sequences, high memory usage
• Even complex LSTM cells “forget” state after ~10 tokens
• Construction makes it hard to recognise repeated occurrences of tokens/entities
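A minimal sketch of an RNN encoder of this kind in PyTorch, using an LSTM cell and taking the final hidden state as the sequence representation (all sizes illustrative):

```python
# Minimal sketch: run an LSTM over an embedded token sequence and use the
# final hidden state as the sequence representation.
import torch
import torch.nn as nn

embedding = nn.Embedding(10000, 128)
lstm = nn.LSTM(input_size=128, hidden_size=256, batch_first=True)

token_ids = torch.tensor([[5, 17, 3, 9, 5, 20, 42]])   # ids of "x > 0 && x < 42"
embedded = embedding(token_ids)                        # (1, 7, 128)
outputs, (h_n, c_n) = lstm(embedded)                   # outputs: (1, 7, 256)
sequence_repr = h_n[-1]                                # final hidden state, (1, 256)
print(sequence_repr.shape)
```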
1D Convolutional Neural Networks – Overview Aim: Represent token sequences Idea: Context is crucial Compute on windows of tokens: x > 0 && x < 42
1D Convolutional Neural Networks – Details
Aim: Represent token sequences
Idea: Context is crucial – compute on windows of tokens
• Kernels are usually of the form k(t_i, …, t_{i+w}) = σ(W·[e(t_i); …; e(t_{i+w})] + b), where W, b are learnable weights/bias and σ a non-linearity (such as ReLU)
• Usually stack 1D CNNs (3–7 layers)
• Often use windowed pooling (mean, max, …) layers between convolutional layers
• Variations such as dilations (skipping positions within a window) are common
1D Convolutional Neural Networks – Key Points
Aim: Represent token sequences
Idea: Context is crucial – compute on windows of tokens
Key points:
• CNNs process sequences taking the context of each token into account
• Parallel computation, can leverage optimisations from computer vision ⇒ very fast
• By construction, cannot recognise long-distance relationships
• Construction makes it hard to recognise repeated occurrences of tokens/entities
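A minimal sketch of a single convolutional layer over an embedded token sequence in PyTorch, followed by max pooling over positions; the window size (kernel_size) and channel sizes are illustrative:

```python
# Minimal sketch: a 1D convolution slides a window over the embedded tokens;
# max pooling over positions yields a sequence representation.
import torch
import torch.nn as nn

embedding = nn.Embedding(10000, 128)
conv = nn.Conv1d(in_channels=128, out_channels=256, kernel_size=3, padding=1)

token_ids = torch.tensor([[5, 17, 3, 9, 5, 20, 42]])
embedded = embedding(token_ids).transpose(1, 2)   # Conv1d expects (batch, channels, length)
features = torch.relu(conv(embedded))             # (1, 256, 7): one vector per window
sequence_repr = features.max(dim=2).values        # pool over positions -> (1, 256)
print(sequence_repr.shape)
```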
Attentional Models – Overview
Aim: Represent token sequences
Idea: Context is crucial – compute relationships between all pairs of tokens (e.g. x > 42):
• Compute a “query” q_i for each token
• Compute “keys” k_j for all tokens
• Compute matches q_i·k_j
• Compute relative matches via softmax
• Compute “values” v_j
• Compute the weighted sum Σ_j softmax(q_i·k)_j · v_j as the new representation of token i
Attentional Models – Details
Aim: Represent token sequences
Idea: Context is crucial – compute relationships between tokens
• Queries, keys and values are computed by learnable linear layers (matrix multiplications)
• Usually uses several “attention heads” (independent copies of the model); their results are concat’ed
• Often a “positional encoding” is added, to help the model distinguish near/far pairs of inputs
• Usually several layers are stacked
http://nlp.seas.harvard.edu/2018/04/03/attention.html
Attentional Models – Key Points Aim: Represent token sequences Idea: Context is crucial Compute relationship between tokens: Key points: • Faster than RNNs, slower than CNNs • Can handle long-distance relationships • Many choices (attention mechanism, number of heads, number of layers)
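A minimal sketch of a single attention head in PyTorch (scaled dot-product self-attention, without positional encodings or the multi-head machinery; all sizes illustrative):

```python
# Minimal sketch: single-head self-attention. Queries, keys and values come
# from learnable linear layers; each token becomes a softmax-weighted sum of
# all values.
import math
import torch
import torch.nn as nn

dim = 128
W_q, W_k, W_v = nn.Linear(dim, dim), nn.Linear(dim, dim), nn.Linear(dim, dim)

x = torch.randn(1, 3, dim)                          # embeddings of e.g. "x > 42"
q, k, v = W_q(x), W_k(x), W_v(x)                    # (1, 3, 128) each
scores = q @ k.transpose(1, 2) / math.sqrt(dim)     # match of every query with every key
weights = scores.softmax(dim=-1)                    # relative matches
attended = weights @ v                              # weighted sums of values, (1, 3, 128)
print(attended.shape)
```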
ML Models – Part 3: Structures • TreeNNs • Graph NNs
N-Ary Tree Neural Networks
Aim: Represent n-ary trees
Idea: Structure is crucial – compute along tree edges, e.g. over the expression tree of x > 0 && x < 42 (root &&, children > and <, leaves x, 0, x, 42)
General Tree Neural Networks
Aim: Represent trees
Idea: Structure is crucial – compute along tree edges, e.g. over the expression tree of x > 0 && x < 42
Tree Neural Networks – Key Points Aim: Represent trees Idea: Context is crucial Compute along tree edges: Key points: • Generalisation of RNNs to trees • Inherit problems of RNNs: Long-distance relationships remain hard • No exchange of information between neighbouring tokens if in different subtrees • Implementations often inefficient (hard to parallelise computation)
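A minimal sketch of a child-sum style tree neural network in PyTorch: each node's representation is computed from its label embedding and the sum of its children's representations. The hand-written tree for x > 0 && x < 42, the combination function and all sizes are illustrative assumptions:

```python
# Minimal sketch: recursive bottom-up computation along tree edges.
import torch
import torch.nn as nn

dim = 64
label_vocab = {"&&": 0, ">": 1, "<": 2, "x": 3, "0": 4, "42": 5}
embedding = nn.Embedding(len(label_vocab), dim)
combine = nn.Linear(2 * dim, dim)

def encode(node):
    label, children = node
    label_vec = embedding(torch.tensor(label_vocab[label]))
    child_sum = sum((encode(c) for c in children), torch.zeros(dim))
    return torch.tanh(combine(torch.cat([label_vec, child_sum])))

# ("&&" ("x" ">" "0") ("x" "<" "42")) as nested (label, children) pairs
tree = ("&&", [(">", [("x", []), ("0", [])]), ("<", [("x", []), ("42", [])])])
print(encode(tree).shape)        # torch.Size([64])
```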
Graph Neural Networks – Overview
Aim: Represent a graph G = (V, E) with edge types and node labels
Idea: Structure is crucial – compute along graph edges:
1. Embed node labels into initial node states h_v
Graph Neural Networks – Overview
Aim: Represent a graph G = (V, E) with edge types and node labels
Idea: Structure is crucial – compute along graph edges:
1. Embed node labels into initial node states h_v
2. Compute messages m = f(h_u) along each edge (u, v), with one function f per edge type
3. Aggregate the messages arriving at each node: m_v = Σ m
4. Compute the new state h_v' from h_v and m_v with a recurrent unit (and repeat steps 2–4 for several rounds)
Graph Neural Networks – Details
Aim: Represent a graph G = (V, E) with edge types and node labels
Idea: Structure is crucial – compute along graph edges
• Shown: Generalisation of RNNs to graphs
• Variants: Generalisations of CNNs and of Self-Attention to graphs
• Usually add implicit reverse edges for all input edges
• Labels on edges are supported as well
Graph Neural Networks – Key Points
Aim: Represent a graph G = (V, E) with edge types and node labels
Idea: Structure is crucial – compute along graph edges
Key points:
• Generalisation of sequence models to graphs
• Can be implemented efficiently (linear in the number of edges)
• Can model complex relationships
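A minimal sketch of the message-passing scheme above in PyTorch (GGNN-style: one linear message function per edge type, sum aggregation, a GRU cell as the recurrent unit); the toy graph, sizes and number of propagation rounds are illustrative:

```python
# Minimal sketch: typed message passing over a small graph.
import torch
import torch.nn as nn

dim, num_edge_types, num_nodes = 64, 2, 4
node_states = torch.randn(num_nodes, dim)                  # initial label embeddings
msg_fns = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_edge_types))
update = nn.GRUCell(dim, dim)                              # recurrent state update

edges = [[(0, 1), (1, 2)], [(2, 3), (3, 0)]]               # edges[t]: (src, tgt) pairs of type t

for _ in range(4):                                         # propagation rounds
    incoming = torch.zeros(num_nodes, dim)
    for t, edge_list in enumerate(edges):
        for src, tgt in edge_list:
            incoming[tgt] = incoming[tgt] + msg_fns[t](node_states[src])  # typed message
    node_states = update(incoming, node_states)            # aggregate + recurrent unit
print(node_states.shape)                                   # (4, 64)
```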
Model Summary • Learning from structure widely studied; many variations: • Sets (bag of words) • Sequences (RNNs, 1D-CNNs, Self-Attention) • Trees (TreeNNs variants) • Graphs (GNNs variants) • Tension between computational effort and precision • Key insight: Domain knowledge needed to select right model
Overview
Pipeline: Program → (1) Intermediate Representation → (2) Learned Representation → (3) Result
• Step 1: Transform to an ML-compatible representation – Interface between PL and ML world – Often re-uses compiler infrastructure
• Step 2: Obtain (latent) program representation – Often re-uses natural language processing infrastructure – Usually produces a (set of) vectors
• Step 3: Produce output useful to programmer/PL tool – Interface between ML and PL world
“Programs” Liberal definition: Element of language with semantics Examples: • Files in Java, C, Haskell, … • Single Expressions • Compiler IR / Assembly • (SMT) Formulas • Diffs of Programs
Programs as Sequences – Tokenization
Easy first steps:
public class FooBar { int BAZ_CONST = 42; } → Lexer → “public”, “class”, “FooBar”, “{”, “int”, “BAZ_CONST”, “=”, “42”, “;”, “}”
Not so easy first steps:
class FooBar(): pass
BAZ_CONST = 42
→ Lexer → “class”, “FooBar”, “(”, “)”, “:”, “pass”, “BAZ_CONST”, “=”, “42”
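As a minimal sketch, Python's standard-library tokenizer can play the role of the lexer for the Python example; the token-type filter is an illustrative choice:

```python
# Minimal sketch: source text -> token string sequence via the stdlib tokenizer,
# dropping layout-only tokens.
import io
import tokenize

source = "class FooBar(): pass\nBAZ_CONST = 42\n"
tokens = [tok.string
          for tok in tokenize.generate_tokens(io.StringIO(source).readline)
          if tok.type not in (tokenize.NEWLINE, tokenize.NL,
                              tokenize.INDENT, tokenize.DEDENT,
                              tokenize.ENDMARKER)]
print(tokens)   # ['class', 'FooBar', '(', ')', ':', 'pass', 'BAZ_CONST', '=', '42']
```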
Programs as Sequences – Subtokens
• Programs differ from natural language: the set of distinct tokens (e.g. “CatRecognitionModelClass”) is very large!
• Problematic for two reasons:
• Long tail of tokens appearing only rarely, hard to learn
• Size of vocabulary bounded by memory
Solution: Split compound names into subtokens
Sometimes: Introduce special tokens to mark word boundaries
public class FooBar { int BAZ_CONST = 42; } → Lexer++ → “public”, “class”, “Foo”, “Bar”, “{”, “int”, “BAZ”, “CONST”, “=”, “42”, “;”, “}”
int BAZ_CONST = 42; → Lexer++ → “int”, “WordStart”, “BAZ”, “CONST”, “WordEnd”, “=”, “42”, “;”
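A minimal sketch of such a splitter; the regular expression and the WordStart/WordEnd markers follow the slide's idea, but exact conventions differ between papers:

```python
# Minimal sketch: split identifiers on snake_case and camelCase boundaries,
# optionally wrapping multi-part names in word-boundary markers.
import re

def subtokenize(token):
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", token)
    return [p for p in parts if p]

def with_markers(token):
    parts = subtokenize(token)
    return parts if len(parts) == 1 else ["WordStart", *parts, "WordEnd"]

print(subtokenize("FooBar"))        # ['Foo', 'Bar']
print(with_markers("BAZ_CONST"))    # ['WordStart', 'BAZ', 'CONST', 'WordEnd']
```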
Programs as Sequences – Variables
• Programs differ from natural language: the set of distinct tokens (e.g. “myMostSpecialVar”) is very large!
• Problematic for two reasons:
• Long tail of tokens appearing only rarely, hard to learn
• Size of vocabulary bounded by memory
Solution: Convert to a standardised set of names
Useful when names contain little information (e.g., on obfuscated code)
public class FooBar { int BAZ_CONST = 42; } → Lexer++ → “public”, “class”, “CLASS0”, “{”, “int”, “VAR0”, “=”, “42”, “;”, “}”
Programs as Sequences – Slicing Problem: Full programs usually too long Solution: Use point of interest to “slice” to manageable size. Examples: • Filter out keywords / brackets / etc • Given a location, use tokens before/after • Given a variable var, use windows of tokens around usages of var • Filter to only keep some kinds of statements (e.g., calls of known APIs)
Types I
• Variables, calls and operators can be typed
• Used by producing two aligned sequences:
Token sequence: t_1, …, t_n
Type sequence: τ_1, …, τ_n (with special NOTYPE for untyped tokens)
Then, have two embedding functions emb_tok and emb_type and define e(t_i) = [emb_tok(t_i); emb_type(τ_i)] (concatenation)
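A minimal sketch of this aligned-sequence encoding in PyTorch: two embedding tables whose outputs are concatenated per position (the NOTYPE id 0 and all sizes are illustrative):

```python
# Minimal sketch: token and type sequences are embedded separately and the
# embeddings are concatenated position by position.
import torch
import torch.nn as nn

tok_emb = nn.Embedding(10000, 96)
type_emb = nn.Embedding(500, 32)              # id 0 = NOTYPE for untyped tokens

token_ids = torch.tensor([[12, 7, 42, 3]])
type_ids  = torch.tensor([[ 4, 0,  0, 4]])    # aligned with token_ids
combined = torch.cat([tok_emb(token_ids), type_emb(type_ids)], dim=-1)
print(combined.shape)                         # (1, 4, 128)
```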
Types II
• Types can implement superclasses / be members of a typeclass
• Concrete subtypes (e.g. “CatRecognitionResultsList”) are often very rare
• Common supertypes (e.g. “List”) are very common
• Solution: Encode the set of implemented types for each token:
Type sequence: T_1, …, T_n, where T_i is the set of types of token i (with special NOTYPE for untyped tokens)
Then define the type representation of t_i by pooling (e.g. element-wise max or sum) the embeddings of all types in T_i
Program Representations
Approach 1: Sequences of words or trees (re-using NLP ideas)
Programs are different from natural language:
• Semantics of keywords are already known
• Many words (APIs, local methods) are used only rarely
• Long-distance dependencies are common
Approach 2: Graphs
• Nodes labelled with semantic information
• Edges for semantic relationships
Programs as Trees – Syntax Tree
Assert.NotNull(clazz); parses to an ExpressionStatement containing an InvocationExpression, whose children are a MemberAccessExpression (Assert . NotNull) and an ArgumentList
Programs as Graphs – Version I
• Use the token sequence
• Add edges for the syntax tree
• Add dataflow edges: Last Write, Last Use, Computed From
Example: (x, y) = Foo(); while (x > 0) x = x + y;