Introduction to Artificial Intelligence CSCE 4310 and 5210 Recurrent Neural Networks


Presentation Transcript


  1. Introduction to Artificial Intelligence, CSCE 4310 and 5210: Recurrent Neural Networks. Dr. Rodney Nielsen, Director, Human Intelligence and Language Technologies Lab

  2. Recurrent Neural Networks (RNNs) • ANNs where connections form a directed cycle • Exhibit dynamic temporal behavior • Can use memory to process arbitrary sequences of inputs

  3. Recurrent Neural Networks (RNNs) • Work on sequence data • Any sensor data • Physiological data • Audio • Connected handwriting recognition • Manufacturing • Language • Stock Market over time

  4. RNN History Recurrent neural networks were developed in the 1980s. Hopfield networks were invented by John Hopfield in 1982. In 1993, a neural history compressor system solved a "Very Deep Learning" task requiring more than 1000 subsequent layers in an RNN unfolded in time. Long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) set accuracy records in multiple application domains. Around 2007, LSTM started to revolutionize ASR, outperforming traditional models in certain speech applications. In 2009, Connectionist Temporal Classification (CTC)-trained LSTM was the first RNN to win pattern recognition contests, when it won several competitions in connected handwriting recognition. In 2014, the Chinese search giant Baidu used CTC-trained RNNs to break the Switchboard Hub5'00 speech recognition benchmark, without using any traditional speech processing methods. LSTM also improved large-vocabulary speech recognition, text-to-speech synthesis, and photo-real talking heads. In 2015, Google's speech recognition reportedly experienced a 49% performance jump through CTC-trained LSTM (Google voice search). LSTM broke records for improved machine translation, language modeling, and multilingual language processing. LSTM combined with convolutional neural networks (CNNs) improved automatic image captioning.

  5. Alternative Sequence-based ML • Alternatives • Conditional Random Fields (CRFs) • Hidden Markov Models (HMMs)

  6. Recurrent Neural Networks [Diagram: separate networks at time t-2, t-1, and t, each with inputs x1–x3, a hidden layer, outputs y1–y3, and weights w21–w23.] Effectively want multiple networks, but don't want to overfit

  7. Recurrent Neural Networks • Tying Parameters Together (Sharing) • As in CNNs, this improves generalization • With separate weights for each t, there would be poor learning and essentially no generalization • Also enables effective modeling of, and generalization across, variable-length sequences (see the sketch below)
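
A minimal numpy sketch of the shared-weight idea from this slide: a single W1, W2, and W3 are reused at every time step instead of learning a separate set per step (the shapes, names, and tanh nonlinearity are assumptions for illustration, not the course's code):

```python
# Vanilla RNN forward pass with one shared set of weights across all time steps.
import numpy as np

def rnn_forward(xs, W1, W2, W3):
    """xs: list of input vectors, one per time step."""
    h = np.zeros(W2.shape[0])           # initial hidden state
    outputs = []
    for x_t in xs:                      # the SAME W1, W2, W3 at every step
        h = np.tanh(W1 @ x_t + W2 @ h)  # hidden state carries memory forward
        outputs.append(W3 @ h)          # per-step output
    return outputs
```

Because the loop works for any sequence length, the same weights also handle variable-length inputs, which is the generalization point on the slide.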

  8. Recurrent Neural Networks [Diagram: the same networks at time t-2, t-1, and t as on slide 6, with inputs x1–x3, outputs y1–y3, and weights w21–w23.] Effectively want multiple networks, but don't want to overfit. Learn a single set of weights for each layer

  9. Recurrent Neural Networks [Diagram: same as slide 8.] Effectively want multiple networks, but don't want to overfit. Learn a single set of weights for each layer

  10. Recurrent Neural Networks [Diagram: the RNN unrolled over time steps t-1, t, and t+1; at every step the input weights W1 map xt into the hidden state ht, the recurrent weights W2 carry ht-1 forward, and the output weights W3 produce ŷt.] Where W2 is a diagonal matrix
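
A plausible reconstruction of the recurrence the diagram depicts, assuming standard vanilla-RNN notation (the transcript does not preserve the slide's equations, and the activation functions f and g are assumptions):

```latex
h_t = f\big(W_1 x_t + W_2 h_{t-1}\big), \qquad \hat{y}_t = g\big(W_3 h_t\big)
```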

  11. Dynamic Systems [Diagram: a dynamical system unrolled over time; each state st is computed from st-1 and xt by the same function fθ, with parameters θ1 and θ2 shared across steps and outputs yt at each step.] • Maintains information over the entire prior sequence: st = gt(x1, x2, …, xt)
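
The state can equivalently be written recursively; this is the standard restatement, with fθ matching the function label in the diagram:

```latex
s_t = f_\theta\big(s_{t-1}, x_t\big) = g_t\big(x_1, x_2, \dots, x_t\big)
```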

  12. Recurrent Neural Networks [Diagram: the same unrolled RNN as on slide 10, with the shared weights W1, W2, and W3 applied at time steps t-1, t, and t+1.]

  13. Recurrent Neural Networks • The second term of the recurrence (W2 ht-1) essentially carries information from the beginning of the sequence • Remember: Tying Parameters Together • As in CNNs, this improves generalization • With separate weights for each t there would be poor learning and essentially no generalization • This also enables effective modeling of, and generalization across, variable-length sequences

  14. Recurrent Neural Networks [Diagram: gradients flowing backward through the unrolled network, through W3 at each output ŷt and through W1 and W2 at each hidden state ht, ht-1, …] Back-Propagation Through Time (BPTT)
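
A minimal numpy sketch of BPTT for the vanilla RNN above, accumulating gradients for the shared W1, W2, and W3 while walking backward through time (the squared-error loss, tanh activation, and all names are assumptions for illustration):

```python
# BPTT for a vanilla RNN: forward over the sequence, then backward through time.
import numpy as np

def bptt(xs, targets, W1, W2, W3):
    """Return gradients of a squared-error loss w.r.t. the shared weights."""
    T, H = len(xs), W2.shape[0]
    hs, ys = {-1: np.zeros(H)}, {}
    # Forward: h_t = tanh(W1 x_t + W2 h_{t-1}),  y_t = W3 h_t
    for t in range(T):
        hs[t] = np.tanh(W1 @ xs[t] + W2 @ hs[t - 1])
        ys[t] = W3 @ hs[t]
    # Backward through time, reusing the same shared-weight gradients each step
    dW1, dW2, dW3 = np.zeros_like(W1), np.zeros_like(W2), np.zeros_like(W3)
    dh_next = np.zeros(H)                      # gradient arriving from the future
    for t in reversed(range(T)):
        dy = ys[t] - targets[t]                # d(squared error)/dy_t
        dW3 += np.outer(dy, hs[t])
        dh = W3.T @ dy + dh_next               # from the output and from step t+1
        dpre = dh * (1.0 - hs[t] ** 2)         # back through tanh
        dW1 += np.outer(dpre, xs[t])
        dW2 += np.outer(dpre, hs[t - 1])
        dh_next = W2.T @ dpre                  # pass gradient to step t-1
    return dW1, dW2, dW3
```

Note how dh_next is repeatedly multiplied by W2.T and the tanh derivative; this repeated product is exactly where the vanishing/exploding gradient discussed on the next slides comes from.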

  15. Recurrent Neural Networks • Vanishing (or Exploding) Gradient (Product of Jacobians) • In DNNs (including RNNs), we are multiplying the derivatives of many nonlinear activation functions together • These derivatives often tend to be very close to zero (or sometimes relatively large) • Gradients vanish exponentially quickly with the size of the time lag between important events • This is one key reason relatively little work was done on DNNs until around 2006, when researchers worked out how to handle this

  16. Vanishing/Exploding Gradient • The Jacobian (matrix of derivatives) of a composition is the product of the Jacobians of each stage • Hadamard product (element-wise matrix multiplication) • The Jacobian matrix of derivatives of f(x) wrt x is given below
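
The slide's equation is not preserved in the transcript; the standard definition, together with the chain-rule product the first bullet refers to, is:

```latex
\left[\frac{\partial f}{\partial x}\right]_{ij} = \frac{\partial f_i}{\partial x_j},
\qquad
\frac{\partial}{\partial x}\, g\big(f(x)\big) = \frac{\partial g}{\partial f}\,\frac{\partial f}{\partial x}
```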

  17. Vanishing/Exploding Gradient • Gradients tend to either vanish or explode over time • Echo State Networks (only learn the output weights) • E.g., W2 = I (identity matrix): remember everything • Leaky units: instead consider the update sketched below • τi = 1: normal RNN • τi = 1+ε: gradients propagate well • τi >> 1: ht ≈ ht-1, very little change in state
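
A hedged reconstruction of the leaky-integration update that the τi bullets describe (the transcript drops the slide's equation; the mixing form below is the standard one, and the tanh/weight notation follows the earlier slides):

```latex
h_{t,i} = \left(1 - \frac{1}{\tau_i}\right) h_{t-1,i} + \frac{1}{\tau_i}\, \tanh\!\big(W_1 x_t + W_2 h_{t-1}\big)_i
```

With τi = 1 this reduces to the normal RNN update; with τi >> 1 the state barely changes from step to step, so information (and gradient) persists over long time lags.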

  18. Gated Recurrent Units (GRU) • Standard RNN directly computes the hidden layer at the next time step • GRU first computes an update gate (an ANN/RNN layer) based on the current input and hidden state • Then computes a reset gate • New state: if the reset gate is 0, the unit resets its memory / discards the past • Final value combines the current and previous steps (equations sketched below)
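
The transcript drops the equations the bullets point to; the standard GRU formulation they describe is sketched below, using the same W•,1/W•,2 weight notation as the LSTM slide (some references swap the roles of zt and 1 − zt in the final line):

```latex
\begin{aligned}
z_t &= \sigma\big(x_t W_{z,1} + h_{t-1} W_{z,2}\big) && \text{update gate}\\
r_t &= \sigma\big(x_t W_{r,1} + h_{t-1} W_{r,2}\big) && \text{reset gate}\\
\tilde{h}_t &= \tanh\big(x_t W_{h,1} + (r_t \circ h_{t-1})\, W_{h,2}\big) && \text{new state}\\
h_t &= z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t && \text{final value}
\end{aligned}
```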

  19. Long Short-Term Memory (LSTM) • More complex • Input gate, how much does the current cell matter: it = σ(xt Wi,1 + ht-1 Wi,2) • Forget gate, how much does the past matter: ft = σ(xt Wf,1 + ht-1 Wf,2) • Output gate: ot = σ(xt Wo,1 + ht-1 Wo,2) • New memory cell: c~t = tanh(xt Wc,1 + ht-1 Wc,2) • Final memory cell: ct = ft ∘ ct-1 + it ∘ c~t • Final hidden state: ht = ot ∘ tanh(ct)
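
A minimal numpy sketch of one LSTM step following the slide's equations (the weight dictionary, its key names, and the sigmoid helper are assumptions, not the slide's code):

```python
# One LSTM time step: gates, candidate cell, cell update, and hidden state.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """W is a dict of weight matrices keyed like the slide's W_{gate,1/2}."""
    i_t = sigmoid(x_t @ W["i1"] + h_prev @ W["i2"])      # input gate
    f_t = sigmoid(x_t @ W["f1"] + h_prev @ W["f2"])      # forget gate
    o_t = sigmoid(x_t @ W["o1"] + h_prev @ W["o2"])      # output gate
    c_tilde = np.tanh(x_t @ W["c1"] + h_prev @ W["c2"])  # new memory cell
    c_t = f_t * c_prev + i_t * c_tilde                   # final memory cell
    h_t = o_t * np.tanh(c_t)                             # final hidden state
    return h_t, c_t
```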

  20. Bi-directional LSTM • A bi-directional RNN uses a finite sequence to classify each element of the sequence based on that element's past and future contexts • Concatenates the outputs of two RNNs, one processing the sequence left to right, the other right to left (see the sketch below) • The output classifications are combined • The technique proved especially useful when combined with LSTM • The general bi-directional approach has proven useful in a wide variety of (non-RNN) settings • Is not real-time and not directly biologically plausible
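
A minimal sketch of the bi-directional idea, substituting simple tanh cells for LSTM cells for brevity (all names and shapes are assumptions):

```python
# Bi-directional RNN sketch: run one pass left-to-right and one right-to-left,
# then concatenate the two hidden states at each position.
import numpy as np

def run_rnn(xs, W_in, W_rec):
    """Simple tanh RNN over a sequence; returns the hidden state at each step."""
    h = np.zeros(W_rec.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_in @ x + W_rec @ h)
        states.append(h)
    return states

def bidirectional(xs, fwd_weights, bwd_weights):
    fwd = run_rnn(xs, *fwd_weights)                 # left-to-right pass
    bwd = run_rnn(xs[::-1], *bwd_weights)[::-1]     # right-to-left pass, re-aligned
    # Each element's representation now sees both its past and its future context.
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```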

  21. Exploding Gradient Solution • Clip the gradient • E.g., cap the gradient whenever it exceeds a threshold such as 3
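
A short sketch of norm-based gradient clipping (the threshold of 3 comes from the slide; the function name and the L2-norm choice are assumptions):

```python
import numpy as np

def clip_gradient(grad, threshold=3.0):
    """Rescale the gradient so its L2 norm never exceeds the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad
```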

  22. Questions?
