(Deep) Learning from Programs
Presentation Transcript


  1. (Deep) Learning from Programs Marc Brockschmidt - MSR Cambridge @mmjb86

  2. Personal History: Termination Proving System.out.println("Hello World!")

  3. Heuristics in Program Analysis

  4. MSR Team Overview. Deep Learning: • Understands images/language/speech • Finds patterns in noisy data • Requires many samples • Handling structured data is hard. Procedural Artificial Intelligence: • Interpretable • Generalisation verifiable • Manual effort • Limited to specialists. Program Structure.

  5. MSR Team Overview. Working at the intersection of Deep Learning, Procedural Artificial Intelligence, and Program Structure: • Understanding programs • Program-structured ML models • Generating programs

  6. Overview. Pipeline: Program → (1) Intermediate Representation → (2) Learned Representation → (3) Result. Step 1: Transform the program into an ML-compatible representation; interface between the PL and ML worlds; often re-uses compiler infrastructure. Step 2: Obtain a (latent) program representation; often re-uses natural language processing infrastructure; usually produces a (set of) vectors. Step 3: Produce output useful to the programmer/PL tool; interface between the ML and PL worlds.

  7. Overview (repeated). The same pipeline: Program → (1) Intermediate Representation → (2) Learned Representation → (3) Result, with Steps 1 to 3 as described on the previous slide.

  8. ML Models – Part 1 • Things not covered • (Linear) Regression • Multi-Layer Perceptrons • How to split data

  9. Not Covered Today • Details of training ML systems See generic tutorials for internals (e.g., backprop) • Concrete implementations But: Many relevant papers come with F/OSS artefacts • Non-neural approaches But: Ask me for pointers, and I’ll allude to these from time to time

  10. (Linear) Regression. Given: data points (x_1, y_1), …, (x_n, y_n). Aim: find f such that f(x_i) = y_i for all i. Fudge 1: allow mistakes: find f such that the loss Σ_i (f(x_i) − y_i)² is minimal. Fudge 2: restrict the shape of f to linear functions f(x) = W·x + b: find W, b such that the loss Σ_i (W·x_i + b − y_i)² is minimal.

  11. Gradient Descent. Aim: find parameters θ such that the loss L(θ) is minimal. Idea: 1. Choose a random θ. 2. Given the fixed data, L is a function of θ. 3. Compute the partial derivatives of L with respect to the elements of θ. 4. Modify θ according to the derivatives to make L smaller. 5. Go to 2.

  12. Stochastic Gradient Descent. Aim: find θ such that L(θ) is minimal. Problem: too much data to handle at once. Idea: 1. Choose a random θ. 2. Using a sampled subset (mini-batch) of the data, L is a function of θ. 3. Compute the partial derivatives of L with respect to the elements of θ. 4. Modify θ according to the derivatives to make L smaller. 5. Go to 2.
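
  Illustrative sketch (not from the slides): mini-batch gradient descent on a linear model with squared loss, in numpy; all names and constants here are made up for the example.

    import numpy as np

    # Synthetic data: y = 3*x + noise (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 1))
    y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=1000)

    W, b = rng.normal(size=1), 0.0          # randomly initialised parameters
    lr, batch_size = 0.1, 32

    for step in range(200):
        idx = rng.integers(0, len(X), size=batch_size)   # sample a mini-batch
        xb, yb = X[idx], y[idx]
        err = xb @ W + b - yb                            # residual of the squared loss
        grad_W = 2 * xb.T @ err / batch_size             # partial derivatives w.r.t. W, b
        grad_b = 2 * err.mean()
        W -= lr * grad_W                                 # move against the gradient
        b -= lr * grad_b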

  13. Multi-Layer Perceptrons. Problem: linear functions are very restricted. Idea 1: use a richer class of functions. But: linear is nice (easy to differentiate, fast, etc.). Idea 2: alternate linear and parameter-free non-linear functions, e.g. f(x) = W_2 · σ(W_1 · x + b_1) + b_2, where σ is a non-linearity such as sigmoid, tanh or ReLU. http://playground.tensorflow.org
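
  A minimal sketch of one such alternation (linear layer, ReLU, linear layer) in numpy; the layer sizes are arbitrary.

    import numpy as np

    def mlp(x, W1, b1, W2, b2):
        h = np.maximum(0.0, x @ W1 + b1)   # linear layer followed by the ReLU non-linearity
        return h @ W2 + b2                 # second linear layer

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(4, 16)), np.zeros(16)
    W2, b2 = rng.normal(size=(16, 1)), np.zeros(1)
    out = mlp(rng.normal(size=(8, 4)), W1, b1, W2, b2)   # (8, 1) predictions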

  14. Data splitting Usual procedure: Train on ~2/3 of data Validate performance on ~1/6 of data (during development) Test & report performance on rest (very, very rarely) Avoids specialising model to concrete data in test set
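
  A possible splitting routine, assuming the data is a plain list of samples; the 2/3, 1/6, 1/6 ratios follow the slide.

    import random

    def split(samples, seed=0):
        samples = samples[:]
        random.Random(seed).shuffle(samples)
        n = len(samples)
        train = samples[: 2 * n // 3]
        valid = samples[2 * n // 3 : 5 * n // 6]
        test  = samples[5 * n // 6 :]       # touch only for final reporting
        return train, valid, test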

  15. Programs are no Snowflakes • Program data is often sourced by crawling GitHub/Bitbucket/… • GitHub is full of duplicates • This impacts ML models of code: DéjàVu: A Map of Code Duplicates on GitHub, Lopes et al., OOPSLA 2017; The Adverse Effects of Code Duplication in Machine Learning Models of Code, Allamanis, 2018
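
  One common mitigation (a sketch, not the method of the cited papers): drop near-verbatim duplicate files before splitting, e.g. by hashing whitespace-normalised contents.

    import hashlib

    def dedupe(files):
        seen, unique = set(), []
        for path, text in files:
            # Crude normalisation: ignore whitespace differences before hashing.
            key = hashlib.sha256(" ".join(text.split()).encode()).hexdigest()
            if key not in seen:
                seen.add(key)
                unique.append((path, text))
        return unique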

  16. ML Models – Part 2: Sequences • Vocabularies & Embeddings • Bag of Words • Recurrent NNs • 1D Convolutional NNs • Attentional Models

  17. Learning from !Numbers. Observation: data is usually not from ℝⁿ. Standard solutions: • Feature extraction: design a numerical representation by hand; examples: presence of keywords, number of selected phrases, etc. • Deep learning: (mostly) lossless transformation into a very large numerical representation; learn how to reduce it to a useful size.

  18. Vocabularies. Input: token sequences t_1 … t_n. Aim: represent each token as a vector of large dimension D (one dimension per vocabulary entry). Idea: 1. Assign ids 1 … D to the D most common tokens. 2. Define emb(t) as the one-hot vector whose id(t)-th dimension is 1. Consequence: the vocabulary is fixed at training time; all new tokens seen at test time are handled as a special unknown (UNK) token.
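
  A minimal vocabulary sketch, assuming token sequences given as lists of strings; the UNK handling mirrors the slide.

    from collections import Counter

    def build_vocab(token_seqs, max_size=10_000):
        counts = Counter(tok for seq in token_seqs for tok in seq)
        # Ids for the most common tokens; id 0 is reserved for unknown tokens.
        vocab = {"<UNK>": 0}
        for tok, _ in counts.most_common(max_size - 1):
            vocab[tok] = len(vocab)
        return vocab

    def to_ids(seq, vocab):
        return [vocab.get(tok, vocab["<UNK>"]) for tok in seq]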

  19. Embeddings. Input: token sequences t_1 … t_n. Aim: represent each token as a vector in ℝ^d with reasonably small d (d ≪ vocabulary size). Idea: 1. Assign an embedding vector e(t) ∈ ℝ^d to each of the D most common tokens. 2. Define emb(t) = e(t). 3. Learn the embeddings together with the rest of the neural network.
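
  A sketch of the embedding lookup: a learnable matrix with one row per vocabulary id (in a real system this matrix is a layer trained with the rest of the network).

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, emb_dim = 10_000, 64
    E = rng.normal(scale=0.1, size=(vocab_size, emb_dim))   # learnable embedding matrix

    def embed(token_ids):
        return E[np.asarray(token_ids)]     # one emb_dim-dimensional vector per token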

  20. Neural Bag of Words. Aim: represent token sequences. Idea 1: order does not matter; represent the sequence (e.g. x > 0 && x < 42) as a bag of embeddings. Idea 2: "pool" the element-wise representations into a sequence representation, e.g. r = mean/max/sum of emb(t_1), …, emb(t_n).
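
  Pooling as a sketch: the sequence representation is an element-wise reduction over the token embeddings; order is ignored.

    import numpy as np

    def bag_of_words(token_embeddings, pool="mean"):
        # token_embeddings: array of shape (seq_len, emb_dim)
        if pool == "mean":
            return token_embeddings.mean(axis=0)
        if pool == "max":
            return token_embeddings.max(axis=0)
        return token_embeddings.sum(axis=0)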

  21. Recurrent Neural Networks – Overview. Aim: represent token sequences. Idea: order is crucial; compute on the sequence of embeddings (e.g. x > 0 && x < 42): h_i = cell(h_{i−1}, emb(t_i)), i.e. a foldl of the cell function over the token sequence.
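
  The fold from this slide as a numpy sketch with a basic RNN cell; LSTM or GRU cells (next slide) would replace the cell function.

    import numpy as np

    def cell(h_prev, x, W_h, W_x, b):
        # Basic RNN cell: new state from previous state and current embedding.
        return np.tanh(h_prev @ W_h + x @ W_x + b)

    def run_rnn(token_embeddings, W_h, W_x, b):
        h = np.zeros(W_h.shape[0])
        states = []
        for x in token_embeddings:          # a foldl over the sequence
            h = cell(h, x, W_h, W_x, b)
            states.append(h)
        return np.stack(states)             # per-token states; the last row summarises the sequence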

  22. Recurrent Neural Networks – Cell Types. Aim: represent token sequences. Idea: order is crucial; compute h_i = cell(h_{i−1}, emb(t_i)) on the sequence of embeddings. Common choices for the cell: basic RNN cell, GRU, LSTM. http://colah.github.io/posts/2015-08-Understanding-LSTMs/

  23. Recurrent Neural Networks – Details. Aim: represent token sequences. Idea: order is crucial; compute on the sequence of embeddings. • Bidirectional RNNs: run a second RNN over the reversed sequence and combine the two states per token • Deep RNNs: use the outputs as inputs of another RNN (& repeat …)

  24. Recurrent Neural Networks – Key Points. Aim: represent token sequences. Idea: order is crucial; compute on the sequence of embeddings. Key points: • RNNs process sequences taking preceding inputs into account • Sequential computation ⇒ slow on long sequences, high memory usage • Even complex LSTM cells "forget" state after ~10 tokens • The construction makes it hard to recognise repeated occurrences of tokens/entities

  25. 1D Convolutional Neural Networks – Overview Aim: Represent token sequences Idea: Context is crucial Compute on windows of tokens: x > 0 && x < 42

  26. 1D Convolutional Neural Networks – Details. Aim: represent token sequences. Idea: context is crucial; compute on windows of tokens. • Kernels are usually of the form conv(t_i) = σ(W · [emb(t_{i−k}); …; emb(t_{i+k})] + b), where W, b are learnable weights/bias and σ a non-linearity (such as ReLU) • Usually stack 1D CNNs (3 – 7 layers) • Often use windowed pooling layers (mean, max, …) between convolutional layers • Variations such as dilations (skipping tokens inside the window) are common
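
  A sketch of the kernel described above (window of embeddings, linear map, non-linearity), assuming "same" padding; shapes are illustrative.

    import numpy as np

    def conv1d(token_embeddings, W, b, window=3):
        # token_embeddings: (seq_len, emb_dim); W: (window * emb_dim, out_dim)
        seq_len, emb_dim = token_embeddings.shape
        pad = window // 2
        padded = np.pad(token_embeddings, ((pad, pad), (0, 0)))
        out = []
        for i in range(seq_len):
            ctx = padded[i : i + window].reshape(-1)     # concatenated window around token i
            out.append(np.maximum(0.0, ctx @ W + b))     # linear map + ReLU
        return np.stack(out)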

  27. 1D Convolutional Neural Networks – Key Points. Aim: represent token sequences. Idea: context is crucial; compute on windows of tokens. Key points: • CNNs process sequences taking the context of tokens into account • Parallel computation, can leverage optimisations from computer vision ⇒ very fast • By construction, cannot recognise long-distance relationships • The construction makes it hard to recognise repeated occurrences of tokens/entities

  28. Attentional Models – Overview. Aim: represent token sequences. Idea: context is crucial; compute relationships between tokens (e.g. in x > 42): compute a "query" q_i = Q · emb(t_i); compute "keys" k_j = K · emb(t_j); compute matches s_ij = q_i · k_j; compute the relative match a_ij via softmax over the s_ij; compute "values" v_j = V · emb(t_j); compute the weighted sum r_i = Σ_j a_ij · v_j.
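
  The steps on this slide as a single-head self-attention sketch in numpy; Q, K, V are the learnable matrices mentioned on the next slide.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(token_embeddings, Q, K, V):
        q = token_embeddings @ Q                   # queries, one per token
        k = token_embeddings @ K                   # keys
        v = token_embeddings @ V                   # values
        scores = q @ k.T / np.sqrt(k.shape[-1])    # pairwise matches
        weights = softmax(scores, axis=-1)         # relative match via softmax
        return weights @ v                         # weighted sum of values, one vector per token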

  29. Attentional Models – Details. Aim: represent token sequences. Idea: context is crucial; compute relationships between tokens. • Q, K, V are learnable linear layers (matrix multiplications) • Usually several "attention heads" (independent copies of the model) are used; their results are concatenated • Often a "positional encoding" is added to help the model distinguish near/far pairs of inputs • Usually several layers are stacked. http://nlp.seas.harvard.edu/2018/04/03/attention.html

  30. Attentional Models – Key Points Aim: Represent token sequences Idea: Context is crucial Compute relationship between tokens: Key points: • Faster than RNNs, slower than CNNs • Can handle long-distance relationships • Many choices (attention mechanism, number of heads, number of layers)

  31. ML Models – Part 3: Structures • TreeNNs • Graph NNs

  32. N-Ary Tree Neural Networks. Aim: represent n-ary trees. Idea: structure is crucial; compute along tree edges (e.g. on the syntax tree of x > 0 && x < 42, with && at the root, the comparisons > and < as its children, and the leaves x, 0, x, 42): the state of a node is computed from the states of its fixed number of children, h_node = f(h_child1, …, h_childN).

  33. General Tree Neural Networks. Aim: represent trees with arbitrary branching. Idea: structure is crucial; compute along tree edges (again on the syntax tree of x > 0 && x < 42): the states of a variable number of children are combined (e.g. pooled) before computing the parent's state.
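
  A recursive sketch, assuming each node is a (label, children) pair and a parent state is computed from the label embedding plus the mean of the child states; real TreeNN variants differ in the details.

    import numpy as np

    def encode_tree(node, embed_label, W_label, W_children, b):
        # node: (label, [child_node, child_node, ...])
        label, children = node
        child_states = [encode_tree(c, embed_label, W_label, W_children, b) for c in children]
        pooled = np.mean(child_states, axis=0) if child_states else np.zeros(W_children.shape[0])
        return np.tanh(embed_label(label) @ W_label + pooled @ W_children + b)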

  34. Tree Neural Networks – Key Points. Aim: represent trees. Idea: structure is crucial; compute along tree edges. Key points: • Generalisation of RNNs to trees • Inherit the problems of RNNs: long-distance relationships remain hard • No exchange of information between neighbouring tokens if they are in different subtrees • Implementations are often inefficient (hard to parallelise the computation)

  35. Graph Neural Networks – Overview. Aim: represent a graph G = (V, E) with edge types and node labels ℓ(v). Idea: structure is crucial; compute along graph edges (edges of different types, e.g. edge type 1 and edge type 2, are treated separately). 1. Embed the labels: h_v ← emb(ℓ(v)).

  36. Graph Neural Networks – Overview. Aim: represent a graph G = (V, E) with edge types and node labels ℓ(v). Idea: structure is crucial; compute along graph edges. 1. Embed the labels: h_v ← emb(ℓ(v)). 2. Compute messages m_{u→v} = f_k(h_u) for each edge (u, v) of type k. 3. Aggregate the messages: m_v = Σ_{(u,v) ∈ E} m_{u→v}. 4. Compute the new state with a recurrent unit: h_v ← GRU(h_v, m_v).
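
  One round of steps 2 to 4 as a numpy sketch; the per-edge-type message functions are plain matrices and the recurrent unit is replaced by a tanh update for brevity.

    import numpy as np

    def gnn_step(h, edges, W_msg, W_self, W_agg, b):
        # h: (num_nodes, dim); edges: dict mapping edge type -> list of (source, target) pairs
        messages = np.zeros_like(h)
        for edge_type, pairs in edges.items():
            W_t = W_msg[edge_type]                  # one learnable map per edge type
            for u, v in pairs:
                messages[v] += h[u] @ W_t           # steps 2 + 3: compute and aggregate messages
        return np.tanh(h @ W_self + messages @ W_agg + b)   # step 4: new state (stand-in for a GRU)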

  37. Graph Neural Networks – Details. Aim: represent a graph G = (V, E) with edge types and node labels. Idea: structure is crucial; compute along graph edges. • Shown: generalisation of RNNs to graphs • Variants: generalisations of CNNs and of self-attention to graphs • Usually implicit reverse edges are added for all input edges • Labels on edges are supported as well

  38. Graph Neural Networks – Key Points. Aim: represent a graph G = (V, E) with edge types and node labels. Idea: structure is crucial; compute along graph edges. Key points: • Generalisation of sequence models to graphs • Can be implemented efficiently (linear in the number of edges) • Can model complex relationships

  39. Model Summary • Learning from structure widely studied; many variations: • Sets (bag of words) • Sequences (RNNs, 1D-CNNs, Self-Attention) • Trees (TreeNNs variants) • Graphs (GNNs variants) • Tension between computational effort and precision • Key insight: Domain knowledge needed to select right model

  40. Overview (revisited). Program → (1) Intermediate Representation → (2) Learned Representation → (3) Result; the following slides focus on Step 1, the intermediate representations of programs.

  41. “Programs” Liberal definition: Element of language with semantics Examples: • Files in Java, C, Haskell, … • Single Expressions • Compiler IR / Assembly • (SMT) Formulas • Diffs of Programs

  42. Programs as Sequences – Tokenization. Easy first steps: public class FooBar { int BAZ_CONST = 42; } is lexed to “public”, “class”, “FooBar”, “{”, “int”, “BAZ_CONST”, “=”, “42”, “;”, “}”. Not so easy first steps: the Python snippet class FooBar(): pass / BAZ_CONST = 42 is lexed to “class”, “FooBar”, “(”, “)”, “:”, “pass”, “BAZ_CONST”, “=”, “42”; the indentation that carries Python's structure is lost in the plain token stream.
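
  A toy regex lexer for the Java-like example (illustrative only; real pipelines use a proper lexer for the language at hand).

    import re

    TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z_0-9]*|\d+|[{}();=]|\S")

    def tokenize(code):
        return TOKEN_RE.findall(code)

    print(tokenize("public class FooBar { int BAZ_CONST = 42; }"))
    # ['public', 'class', 'FooBar', '{', 'int', 'BAZ_CONST', '=', '42', ';', '}']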

  43. Programs as Sequences – Subtokens • Programs differ from natural language: the set of distinct tokens (e.g. “CatRecognitionModelClass”) is very large! • Problematic for two reasons: a long tail of tokens appears only rarely and is hard to learn, and the size of the vocabulary is bounded by memory. Solution: split compound names into subtokens; sometimes special tokens are introduced to mark word boundaries. Example: public class FooBar { int BAZ_CONST = 42; } is lexed (Lexer++) to “public”, “class”, “Foo”, “Bar”, “{”, “int”, “BAZ”, “CONST”, “=”, “42”, “;”, “}”; with boundary markers, int BAZ_CONST = 42; becomes “int”, “WordStart”, “BAZ”, “CONST”, “WordEnd”, “=”, “42”, “;”.
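
  A subtoken splitter sketch handling snake_case and camelCase; the exact splitting rules vary between papers.

    import re

    CAMEL_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+")

    def subtokens(token):
        return [sub for part in token.split("_") for sub in CAMEL_RE.findall(part)]

    print(subtokens("BAZ_CONST"))   # ['BAZ', 'CONST']
    print(subtokens("FooBar"))      # ['Foo', 'Bar']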

  44. Programs as Sequences – Variables • Programs differ from natural language: the set of distinct tokens (e.g. “myMostSpecialVar”) is very large! • Problematic for two reasons: a long tail of tokens appears only rarely and is hard to learn, and the size of the vocabulary is bounded by memory. Solution: convert names to a standardised set; useful when names contain little information (e.g., on obfuscated code). Example: public class FooBar { int BAZ_CONST = 42; } is lexed (Lexer++) to “public”, “class”, “CLASS0”, “{”, “int”, “VAR0”, “=”, “42”, “;”, “}”.
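
  A naive renaming sketch over a token list, assuming we already know which tokens are identifiers that should be anonymised.

    def anonymise(tokens, identifiers, prefix="VAR"):
        mapping, out = {}, []
        for tok in tokens:
            if tok in identifiers:
                # The first occurrence fixes the standardised name (VAR0, VAR1, ...).
                mapping.setdefault(tok, f"{prefix}{len(mapping)}")
                out.append(mapping[tok])
            else:
                out.append(tok)
        return out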

  45. Programs as Sequences – Slicing Problem: Full programs usually too long Solution: Use point of interest to “slice” to manageable size. Examples: • Filter out keywords / brackets / etc • Given a location, use tokens before/after • Given a variable var, use windows of tokens around usages of var • Filter to only keep some kinds of statements (e.g., calls of known APIs)
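
  A sketch of the "window of tokens around usages of var" idea, assuming a flat token list.

    def windows_around(tokens, var, width=5):
        slices = []
        for i, tok in enumerate(tokens):
            if tok == var:
                lo, hi = max(0, i - width), min(len(tokens), i + width + 1)
                slices.append(tokens[lo:hi])    # tokens before/after each usage of var
        return slices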

  46. Types I • Variables, calls and operators can be typed • Used by producing two aligned sequences: a token sequence t_1 … t_n and a type sequence τ_1 … τ_n (with a special NOTYPE for untyped tokens). Then, with two embedding functions emb_tok and emb_type, define emb(t_i) = emb_tok(t_i) ++ emb_type(τ_i) (concatenation).
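
  A sketch of the aligned embedding: concatenate a token embedding and a type embedding per position; the embedding matrices here are illustrative.

    import numpy as np

    def embed_with_types(token_ids, type_ids, E_tok, E_type):
        # Aligned sequences: token_ids[i] and type_ids[i] describe the same position;
        # untyped tokens carry the id of the special NOTYPE type.
        return np.concatenate([E_tok[np.asarray(token_ids)],
                               E_type[np.asarray(type_ids)]], axis=-1)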

  47. Types II • Types can implement superclasses / be members of a typeclass • A concrete subtype (e.g. “CatRecognitionResultsList”) is often very rare, while common supertypes (e.g. “List”) are very common • Solution: encode the set of implemented types for each token: a type-set sequence S_1 … S_n (with a special NOTYPE for untyped tokens); then define emb_type(S_i) by pooling the embeddings of all types in S_i.
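
  Extending the sketch above to a set of implemented types per token: pool the embeddings of all supertypes (element-wise max here; other pooling functions work as well).

    import numpy as np

    def embed_type_set(type_id_sets, E_type, notype_id=0):
        vecs = []
        for ids in type_id_sets:                 # one set of type ids per token position
            ids = list(ids) or [notype_id]       # empty set falls back to NOTYPE
            vecs.append(E_type[np.asarray(ids)].max(axis=0))   # pool over the implemented types
        return np.stack(vecs)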

  48. Program Representations. Approach 1: sequences of words or trees (re-using NLP ideas). But programs are different from natural language: • Semantics of keywords are already known • Many words (APIs, local methods) are used only rarely • Long-distance dependencies are common. Approach 2: graphs • Nodes labelled with semantic information • Edges for semantic relationships

  49. Programs as Trees – Syntax Tree. Example: Assert.NotNull(clazz); parses to an ExpressionStatement containing an InvocationExpression, whose children are a MemberAccessExpression (Assert . NotNull) and an ArgumentList ( … ).
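
  For illustration only: Python's own ast module exposes the same kind of structure for an analogous call (the slide's example is C#, so this is just an analogy).

    import ast

    tree = ast.parse("Assert.NotNull(clazz)", mode="eval")
    print(ast.dump(tree.body))
    # Roughly: Call(func=Attribute(value=Name(id='Assert', ...), attr='NotNull', ...),
    #               args=[Name(id='clazz', ...)], keywords=[])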

  50. Programs as Graphs – Version I • Use the token sequence • Add edges for the syntax tree • Add dataflow edges (LastWrite, LastUse, ComputedFrom), e.g. for: (x, y) = Foo(); while (x > 0) x = x + y;
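
  A sketch of how such a program graph might be stored: a node list plus one edge list per edge type. Only the NextToken edges are computed here; the syntax and dataflow edges would come from a parser and a dataflow analysis, which are omitted.

    def token_sequence_graph(tokens):
        # Nodes are the tokens; edges are grouped by type as (source, target) index pairs.
        edges = {"NextToken": [(i, i + 1) for i in range(len(tokens) - 1)],
                 "Child": [], "LastWrite": [], "LastUse": [], "ComputedFrom": []}
        return {"nodes": list(tokens), "edges": edges}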
