This presentation introduces TrellisNet, a network architecture that combines the strengths of Recurrent Neural Networks (RNNs) and Temporal Convolution Networks (TCNs) for sequence modelling. TrellisNet achieves state-of-the-art results in various sequence tasks, including language modelling and modelling long-range dependencies.
OTHER NETWORK ARCHITECTURES Yuguang Lin, Aashi Jain, Siddarth Ravichandran (Presenters) CSE 291 G00 Deep Learning for Sequences
TRELLIS NETWORK FOR SEQUENCE MODELLING Yuguang Lin, presenter
Background • Three popular ways for sequence modelling • Temporal Convolution Networks (TCN) • Recurrent Networks (LSTM, GRU, Tree-structured LSTM) • Self-attention (Transformer)
Motivation • TCNs can give good empirical results • RNNs with many tricks can achieve state-of-the-art results on different tasks, but neither family seems to dominate across multiple tasks • Can we combine TCNs with RNNs so that we can use techniques from both sides?
Previous Work • Convolutional LSTMs: combine convolutional and recurrent units (Donahue et al., 2015) • Quasi-recurrent neural networks: interleave convolutional and recurrent layers (Bradbury et al., 2017) • Dilation applied in RNNs (Chang et al., 2017) • This paper follows the direction of the authors' previous paper: An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
This work – Main Contribution • TrellisNet, the proposed network architecture, serves as a bridge between recurrent and convolutional models • It performs very well, achieving state-of-the-art results on several sequence tasks, including language modelling and modelling long-range dependencies
This presentation • TCN • TrellisNet • TrellisNet and TCN • RNN • TrellisNet and RNN • TrellisNet as a bridge between recurrent and convolutional models • An LSTM as a TrellisNet • Experiments • Results
TCN – A Quick Overview • A special kind of CNN with • 1) causality: no information from the future is used in the computation • 2) the ability to take a variable-length sequence and map it to a same-length output sequence – just like an RNN • In summary, a TCN = a fully-convolutional network with causal convolutions (a minimal sketch follows)
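To make the causality point concrete, here is a minimal PyTorch sketch (my own illustration, not from the slides or the paper; the sizes are made up): left-padding by kernel_size − 1 keeps the output the same length as the input while ensuring that position t never sees future inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

k, d = 3, 64                          # illustrative kernel size and channel width
conv = nn.Conv1d(d, d, kernel_size=k)

x = torch.randn(1, d, 100)            # (batch, channels, time)
y = conv(F.pad(x, (k - 1, 0)))        # pad on the left only: output at time t sees x[:t+1] only
assert y.shape[-1] == x.shape[-1]     # same-length output, just like an RNN
```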
RNN • Processes one input element at a time and unrolls along the time dimension • Non-parallel operation, unlike CNNs and TCNs, which operate on all elements in parallel within each layer
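For contrast, a small sketch of the strictly sequential recurrent loop (hypothetical sizes; any recurrent cell would do): each step needs the hidden state produced by the previous one, so the time dimension cannot be parallelized.

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=64, hidden_size=64)
x = torch.randn(100, 1, 64)           # (time, batch, features)
h = torch.zeros(1, 64)
for t in range(x.shape[0]):           # strictly sequential: step t depends on h from step t-1
    h = cell(x[t], h)
```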
TrellisNet and RNN • Induction step: • Suppose equation (6) holds at level j; we need to show that it also holds at level j + 1 • At level j + 1: W1 and W2 are now sparse matrices • We use an activation f in the TrellisNet of the form f(a, b) = g(a) (we only take the first input) • We then have the following equations
TrellisNet and RNN • Injected input at each layer • Hidden units from the previous layer at the current step and from the previous layer at previous steps • The causality condition is met • Weights are shared across layers • Note: the non-linearity after the linear combination is omitted for clarity
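A compact way to write the update these bullets describe, in the notation of the TrellisNet paper (a sketch, not a verbatim copy of its numbered equations): the input x is injected at every layer, z^(i) are the hidden units of layer i, and the weights W1, W2 and the activation f are shared across all layers.

```latex
\tilde{z}_t^{(i+1)} = W_1 \begin{bmatrix} x_{t-1} \\ z_{t-1}^{(i)} \end{bmatrix}
                    + W_2 \begin{bmatrix} x_t \\ z_t^{(i)} \end{bmatrix},
\qquad
z_t^{(i+1)} = f\!\left(\tilde{z}_t^{(i+1)},\, z_{t-1}^{(i)}\right)
```

Because only x_{t-1}, x_t, z_{t-1}^{(i)} and z_t^{(i)} appear on the right-hand side, no future information is used, which is the causality condition above.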
TrellisNet and RNN • A hidden unit at time t in layer i is computed from a hidden unit in layer i − j whose history starts at t − j and a hidden unit in layer i − 1 whose history starts at t • This is a mixed group convolution and can be represented with L = 2 in equation 5 • Notice that here we have 4 layers of hidden units; this will become clearer when we look at an LSTM as a TrellisNet (no, that is just the activation!)
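Below is a minimal PyTorch sketch of one such layer (my own illustration, not the authors' code): a kernel-size-2 causal convolution over the concatenation of the injected input and the previous layer's hidden state, with the same module reused at every depth. The real TrellisNet replaces the plain tanh here with an LSTM-style gated activation that also looks at the hidden state from the step below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrellisLayerSketch(nn.Module):
    """Kernel-size-2 causal convolution over [x; z]; weights shared across depth."""
    def __init__(self, d_input, d_hidden):
        super().__init__()
        # kernel_size=2 mixes time steps t-1 and t (the W1 / W2 split above)
        self.conv = nn.Conv1d(d_input + d_hidden, d_hidden, kernel_size=2)

    def forward(self, x, z):
        # x, z: (batch, channels, time)
        h = torch.cat([x, z], dim=1)
        h = F.pad(h, (1, 0))          # left-pad only, so causality is preserved
        pre = self.conv(h)            # pre-activations for all time steps in parallel
        return torch.tanh(pre)        # stand-in for the gated activation f

layer = TrellisLayerSketch(d_input=16, d_hidden=32)   # illustrative sizes
x = torch.randn(8, 16, 100)           # injected input, reused at every layer
z = torch.zeros(8, 32, 100)
for _ in range(4):                    # the 4 layers of hidden units on the slide
    z = layer(x, z)                   # same weights applied at every layer
```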
TrellisNet as a bridge between recurrent and convolutional models • TrellisNet is a special kind of TCN • TrellisNet is a generalization of truncated RNNs, and Theorem 1 allows it to benefit significantly from techniques developed for RNNs • From recurrent networks: • Structured nonlinear activations (e.g. LSTM and GRU gates) • Variational RNN dropout • Recurrent DropConnect • History compression and repackaging • From convolutional networks: • Large kernels and dilated convolutions • Auxiliary losses at intermediate layers • Weight normalization • Parallel convolutional processing • (A small sketch combining two of these techniques follows)
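As a small illustration of mixing the two toolboxes (a hedged sketch with made-up sizes, not the paper's code): weight normalization is applied to the shared convolution, and a variational-dropout-style mask is sampled once per sequence and reused at every time step.

```python
import torch
import torch.nn as nn

# From the convolutional side: weight normalization on the shared kernel
conv = nn.utils.weight_norm(nn.Conv1d(48, 32, kernel_size=2))

# From the recurrent side: variational dropout -- one mask per sequence,
# reused at every time step (and, with weight sharing, at every layer)
def variational_dropout(z, p=0.3):
    # z: (batch, channels, time); the mask is constant along the time axis
    mask = torch.bernoulli(torch.full(z.shape[:2] + (1,), 1 - p)) / (1 - p)
    return z * mask
```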
Questions for Discussion • Can you think of some ways to establish a connection between the Trellis network and the self-attention architecture? • What are some drawbacks of this model? • Do you think this architecture has potential, and would you like to try it in your research / project? Why or why not?
Benchmark Tasks • Word-level language modelling • Penn Treebank (PTB), WikiText-103 (110 times larger) • Character-level language modelling • PTB • Long-range modelling • Sequential MNIST, PMNIST, and CIFAR-10
References • Bai, S., Kolter, J. Z., & Koltun, V. (2018). An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. • Bai, S., Kolter, J. Z., & Koltun, V. (2018). Trellis Networks for Sequence Modeling.
Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks, by Kai Sheng Tai, Richard Socher, Christopher D. Manning, 2015. Aashi Jain, Feb 13, 2019
LSTMs • A type of RNN • Preserve sequence information over time • Address the problem of exploding or vanishing gradients by introducing a memory cell that can preserve state over long periods of time
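For reference, the standard LSTM update (one common formulation; notation may differ slightly from the paper's): the forget gate f_t and the cell c_t are what allow state to be preserved over long spans.

```latex
\begin{aligned}
i_t &= \sigma\big(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}\big) \\
f_t &= \sigma\big(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)}\big) \\
o_t &= \sigma\big(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)}\big) \\
u_t &= \tanh\big(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)}\big) \\
c_t &= f_t \odot c_{t-1} + i_t \odot u_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```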
LSTMs so far… • Have been explored mostly as a linear chain • However, language is not really sequential, so linear LSTMs can miss structure (e.g., "My dog, who I rescued in the past, eats rawhide") • So we turn to tree-structured models!
Limitation of Standard LSTM Only allows for strictly sequential information propagation
Why Tree-Structured? • Linguistically attractive for syntactic interpretations of sentence structure. • To model word/phrase dependencies in a tree-like format instead of a linear fashion.
Differences? • Standard LSTM: computes its hidden state from the input at the current time step and the hidden state of the LSTM unit at the previous time step • Tree-LSTM: computes its hidden state from an input vector and the hidden states of arbitrarily many child units
In more detail • In Tree-LSTMs: • Gating vectors and memory cell updates depend on the states of possibly many child units • Each unit contains one forget gate for each child k, which allows selective incorporation of information from each child (see the equations below)
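The Child-Sum Tree-LSTM update from Tai et al. (2015), written out so the per-child forget gate f_{jk} is visible (C(j) denotes the children of node j; notation may differ slightly from the paper):

```latex
\begin{aligned}
\tilde{h}_j &= \sum_{k \in C(j)} h_k \\
i_j &= \sigma\big(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}\big) \\
f_{jk} &= \sigma\big(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\big), \quad k \in C(j) \\
o_j &= \sigma\big(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}\big) \\
u_j &= \tanh\big(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)}\big) \\
c_j &= i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k \\
h_j &= o_j \odot \tanh(c_j)
\end{aligned}
```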
Types of Tree-LSTMs • Two variants: • Child-Sum Tree-LSTM (dependency-based; the number of dependents is highly variable) • N-ary Tree-LSTM (constituency-based; distinguishes left vs. right dependents)