This presentation introduces TrellisNet, a network architecture that combines the strengths of Recurrent Neural Networks (RNNs) and Temporal Convolution Networks (TCNs) for sequence modelling. TrellisNet achieves state-of-the-art results in various sequence tasks, including language modelling and modelling long-range dependencies.
OTHER NETWORK ARCHITECTURES Yuguang Lin, Aashi Jain, Siddarth Ravichandran (Presenters) CSE 291 G00 Deep Learning for Sequences
TRELLIS NETWORK FOR SEQUENCE MODELLING Yuguang Lin, presenter
Background • Three popular ways for sequence modelling • Temporal Convolution Networks (TCN) • Recurrent Networks (LSTM, GRU, Tree-structured LSTM) • Self-attention (Transformer)
Motivation • TCNs can give good empirical results • RNNs with many tricks can achieve state-of-the-art results on different tasks, but neither family seems to dominate across multiple tasks • Can we combine TCNs with RNNs so that we can use techniques from both sides?
Previous Work • Convolutional LSTMs: combine convolutional and recurrent units (Donahue et al., 2015) • Quasi-recurrent neural networks: interleave convolutional and recurrent layers (Bradbury et al., 2017) • Dilation applied in RNNs (Chang et al., 2017) • This paper follows the direction of the authors' previous paper: An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
This work – Main Contribution • TrellisNet, the proposed network architecture, serves as a bridge between recurrent and convolutional models • It performs very well, achieving state-of-the-art results on several sequence tasks, including language modelling and modelling long-range dependencies
This presentation • TCN • TrellisNet • TrellisNet and TCN • RNN • TrellisNet and RNN • TrellisNet as a bridge between recurrent and convolutional models • An LSTM as a TrellisNet • Experiments • Results
TCN – A Quick Overview • A special kind of CNN with • 1) causality: no information from the future is used in the computation • 2) the ability to take a variable-length sequence and map it to a same-length output sequence – just like an RNN • In summary, a TCN = a fully-convolutional network with causal convolutions (a minimal sketch follows)
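To make the causality point concrete, here is a minimal PyTorch sketch (my own illustration, not from the slides or the paper; the sizes are made up): left-padding by kernel_size − 1 keeps the output the same length as the input while ensuring that position t never sees future inputs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

k, d = 3, 64                          # illustrative kernel size and channel width
conv = nn.Conv1d(d, d, kernel_size=k)

x = torch.randn(1, d, 100)            # (batch, channels, time)
y = conv(F.pad(x, (k - 1, 0)))        # pad on the left only: output at time t sees x[:t+1] only
assert y.shape[-1] == x.shape[-1]     # same-length output, just like an RNN
```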
RNN • Processes one input element at a time and unrolls along the time dimension • Non-parallel operation, unlike CNNs and TCNs, which operate on all elements in parallel within each layer
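For contrast, a small sketch of the strictly sequential recurrent loop (hypothetical sizes; any recurrent cell would do): each step needs the hidden state produced by the previous one, so the time dimension cannot be parallelized.

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=64, hidden_size=64)
x = torch.randn(100, 1, 64)           # (time, batch, features)
h = torch.zeros(1, 64)
for t in range(x.shape[0]):           # strictly sequential: step t depends on h from step t-1
    h = cell(x[t], h)
```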
TrellisNet and RNN • Induction step: • Suppose equation (6) holds at level j; we need to show that it also holds at level j + 1 • At level j + 1: W1 and W2 are now sparse matrices • We use an activation f in the TrellisNet of the form f(a, b) = g(a) (we only take the first input) • We then have the following equations
TrellisNet and RNN • Injected input at each layer • Hidden units from the previous layer at the current step and from the previous layer at previous steps • The causality condition is met • Weights are shared across layers • Note: the non-linearity after the linear combination is omitted for clarity
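A compact way to write the update these bullets describe, in the notation of the TrellisNet paper (a sketch, not a verbatim copy of its numbered equations): the input x is injected at every layer, z^(i) are the hidden units of layer i, and the weights W1, W2 and the activation f are shared across all layers.

```latex
\tilde{z}_t^{(i+1)} = W_1 \begin{bmatrix} x_{t-1} \\ z_{t-1}^{(i)} \end{bmatrix}
                    + W_2 \begin{bmatrix} x_t \\ z_t^{(i)} \end{bmatrix},
\qquad
z_t^{(i+1)} = f\!\left(\tilde{z}_t^{(i+1)},\, z_{t-1}^{(i)}\right)
```

Because only x_{t-1}, x_t, z_{t-1}^{(i)} and z_t^{(i)} appear on the right-hand side, no future information is used, which is the causality condition above.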
TrellisNet and RNN • A hidden unit at time t in layer i is computed from a hidden unit in layer i − j whose history starts at t − j and a hidden unit in layer i − 1 whose history starts at t • This is a mixed group convolution and can be represented with L = 2 in equation 5 • Notice that here we have 4 layers of hidden units; this will become clearer when we look at an LSTM as a TrellisNet (no, that is just the activation!)
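Below is a minimal PyTorch sketch of one such layer (my own illustration, not the authors' code): a kernel-size-2 causal convolution over the concatenation of the injected input and the previous layer's hidden state, with the same module reused at every depth. The real TrellisNet replaces the plain tanh here with an LSTM-style gated activation that also looks at the hidden state from the step below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrellisLayerSketch(nn.Module):
    """Kernel-size-2 causal convolution over [x; z]; weights shared across depth."""
    def __init__(self, d_input, d_hidden):
        super().__init__()
        # kernel_size=2 mixes time steps t-1 and t (the W1 / W2 split above)
        self.conv = nn.Conv1d(d_input + d_hidden, d_hidden, kernel_size=2)

    def forward(self, x, z):
        # x, z: (batch, channels, time)
        h = torch.cat([x, z], dim=1)
        h = F.pad(h, (1, 0))          # left-pad only, so causality is preserved
        pre = self.conv(h)            # pre-activations for all time steps in parallel
        return torch.tanh(pre)        # stand-in for the gated activation f

layer = TrellisLayerSketch(d_input=16, d_hidden=32)   # illustrative sizes
x = torch.randn(8, 16, 100)           # injected input, reused at every layer
z = torch.zeros(8, 32, 100)
for _ in range(4):                    # the 4 layers of hidden units on the slide
    z = layer(x, z)                   # same weights applied at every layer
```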
TrellisNet as a bridge between recurrent and convolutional models • TrellisNet is a special kind of TCN • TrellisNet is a generalization of truncated RNNs, and Theorem 1 allows it to benefit significantly from techniques developed for RNNs • From recurrent networks: • Structured nonlinear activations (e.g. LSTM and GRU gates) • Variational RNN dropout • Recurrent DropConnect • History compression and repackaging • From convolutional networks: • Large kernels and dilated convolutions • Auxiliary losses at intermediate layers • Weight normalization • Parallel convolutional processing • (A small sketch combining two of these techniques follows)
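As a small illustration of mixing the two toolboxes (a hedged sketch with made-up sizes, not the paper's code): weight normalization is applied to the shared convolution, and a variational-dropout-style mask is sampled once per sequence and reused at every time step.

```python
import torch
import torch.nn as nn

# From the convolutional side: weight normalization on the shared kernel
conv = nn.utils.weight_norm(nn.Conv1d(48, 32, kernel_size=2))

# From the recurrent side: variational dropout -- one mask per sequence,
# reused at every time step (and, with weight sharing, at every layer)
def variational_dropout(z, p=0.3):
    # z: (batch, channels, time); the mask is constant along the time axis
    mask = torch.bernoulli(torch.full(z.shape[:2] + (1,), 1 - p)) / (1 - p)
    return z * mask
```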
Questions for Discussion • Can you think of some ways to establish a connection between the Trellis network and the self-attention architecture? • What are some drawbacks of this model? • Do you think this architecture has potential, and would you like to try it in your research / project? Why or why not?
Benchmark Tasks • Word-level language modelling • Penn Treebank (PTB), WikiText-103 (110 times larger) • Character-level language modelling • PTB • Long-range modelling • Sequential MNIST, PMNIST, and CIFAR-10
References • Bai, S., Kolter, J. Z., & Koltun, V. (2018). An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. • Bai, S., Kolter, J. Z., & Koltun, V. (2018). Trellis Networks for Sequence Modeling.
Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks, by Kai Sheng Tai, Richard Socher, Christopher D. Manning, 2015. Aashi Jain, Feb 13, 2019
LSTMs • A type of RNN • Preserve sequence information over time • Address the problem of exploding or vanishing gradients by introducing a memory cell that can preserve state over long periods of time
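For reference, the standard LSTM update (one common formulation; notation may differ slightly from the paper's): the forget gate f_t and the cell c_t are what allow state to be preserved over long spans.

```latex
\begin{aligned}
i_t &= \sigma\big(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}\big) \\
f_t &= \sigma\big(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)}\big) \\
o_t &= \sigma\big(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)}\big) \\
u_t &= \tanh\big(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)}\big) \\
c_t &= f_t \odot c_{t-1} + i_t \odot u_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```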
LSTMs so far… • Have been explored mostly as a linear chain • However, language is not really sequential, so linear LSTMs can miss structure (e.g., "My dog, who I rescued in the past, eats rawhide") • So we turn to tree-structured models!
Limitation of Standard LSTM Only allows for strictly sequential information propagation
Why Tree-Structured? • Linguistically attractive for syntactic interpretations of sentence structure. • To model word/phrase dependencies in a tree-like format instead of a linear fashion.
Differences? • Standard LSTM: computes its hidden state from the input at the current time step and the hidden state of the LSTM unit at the previous time step • Tree-LSTM: computes its hidden state from an input vector and the hidden states of arbitrarily many child units
In more detail • In Tree-LSTMs: • Gating vectors and memory cell updates depend on the states of possibly many child units • Each unit contains one forget gate for each child k, which allows selective incorporation of information from each child (see the equations below)
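The Child-Sum Tree-LSTM update from Tai et al. (2015), written out so the per-child forget gate f_{jk} is visible (C(j) denotes the children of node j; notation may differ slightly from the paper):

```latex
\begin{aligned}
\tilde{h}_j &= \sum_{k \in C(j)} h_k \\
i_j &= \sigma\big(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}\big) \\
f_{jk} &= \sigma\big(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\big), \quad k \in C(j) \\
o_j &= \sigma\big(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}\big) \\
u_j &= \tanh\big(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)}\big) \\
c_j &= i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k \\
h_j &= o_j \odot \tanh(c_j)
\end{aligned}
```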
Types of Tree-LSTMs • Two variants: • Child-Sum Tree-LSTM (dependency-based; the number of dependents is highly variable) • N-ary Tree-LSTM (constituency-based; distinguishes left vs. right dependents)