Module Review: Sequence Models

Review the essential concepts of sequence modeling, focusing on RNNs, LSTMs, GRUs, and the Attention Mechanism to build a strong foundation.

1. Key Takeaways

  • Sequence Matters: Traditional feed-forward networks (FNNs) cannot handle variable-length sequences or capture temporal dependencies. RNNs are designed for this.
  • Memory: RNNs maintain a hidden state (h<sub>t</sub>) that acts as memory, updated at each time step.
  • Training Challenges: Training RNNs with BPTT often leads to vanishing or exploding gradients.
  • LSTMs & GRUs: These gated architectures mitigate the vanishing gradient problem. LSTMs use Forget, Input, and Output gates; GRUs use Reset and Update gates.
  • Seq2Seq: The Encoder-Decoder architecture is standard for tasks like translation and summarization.
  • Attention: The Attention Mechanism allows the decoder to access the entire input sequence, solving the information bottleneck problem of fixed-length context vectors.
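The hidden-state update in the first two bullets can be sketched in a few lines of NumPy. This is a minimal illustration, not a trainable model; the weight shapes and the `rnn_step` helper are my own naming, and the same weight matrices are reused at every time step, which is what makes the network recurrent.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh):
    """One RNN update: h_t = tanh(W_xh @ x_t + W_hh @ h_prev)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3
W_xh = rng.normal(size=(hidden_dim, input_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1

h = np.zeros(hidden_dim)                      # initial hidden state (the "memory")
sequence = [rng.normal(size=input_dim) for _ in range(5)]
for x_t in sequence:                          # same weights reused at every step
    h = rnn_step(x_t, h, W_xh, W_hh)

print(h.shape)  # (3,) — one hidden state regardless of sequence length
```

Note that the loop handles a sequence of any length with a fixed set of parameters, which is exactly what a fixed-input FNN cannot do.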

2. Flashcards

Test your understanding of the core concepts.

What is BPTT?

Backpropagation Through Time. It's the standard training algorithm for RNNs, unrolling the network across time steps to compute gradients.

Why use LSTM over RNN?

LSTMs solve the vanishing gradient problem using gates, allowing them to capture long-term dependencies that vanilla RNNs miss.

What is the "Forget Gate"?

A sigmoid layer in an LSTM that decides what information to discard from the cell state (0 = forget, 1 = keep).

What problem does Attention solve?

The Information Bottleneck. It allows the decoder to "look back" at all encoder states, not just the final context vector.
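The "look back" described in the last flashcard is just a score, a softmax, and a weighted sum. Below is a minimal NumPy sketch of dot-product attention over a stack of encoder states; the function and variable names are illustrative, not from any particular library.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())         # subtract max for numerical stability
    return e / e.sum()

def attention(h_decoder, encoder_states):
    """Dot-product attention: scores -> softmax weights -> weighted sum."""
    scores = encoder_states @ h_decoder       # one similarity score per encoder state
    alphas = softmax(scores)                  # attention weights, sum to 1
    context = alphas @ encoder_states         # weighted sum = context vector
    return context, alphas

rng = np.random.default_rng(1)
encoder_states = rng.normal(size=(6, 8))      # 6 time steps, hidden size 8
h_dec = rng.normal(size=8)                    # current decoder state
context, alphas = attention(h_dec, encoder_states)

print(np.isclose(alphas.sum(), 1.0))  # True — weights form a distribution
```

Because `context` is recomputed at every decoding step, the decoder is never forced to squeeze the whole input through one fixed-length vector.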

3. Cheat Sheet

| Concept | Equation / Description |
| --- | --- |
| RNN Update | h<sub>t</sub> = tanh(W<sub>xh</sub> x<sub>t</sub> + W<sub>hh</sub> h<sub>t-1</sub>) |
| LSTM Forget | f<sub>t</sub> = &sigma;(W<sub>f</sub> &middot; [h<sub>t-1</sub>, x<sub>t</sub>]) |
| LSTM Input | i<sub>t</sub> = &sigma;(W<sub>i</sub> &middot; [h<sub>t-1</sub>, x<sub>t</sub>]) |
| LSTM Cell | C<sub>t</sub> = f<sub>t</sub> * C<sub>t-1</sub> + i<sub>t</sub> * C&#771;<sub>t</sub> |
| Attention Score | score = h<sub>decoder</sub><sup>T</sup> &middot; h<sub>encoder</sub> (dot product) |
| Context Vector | c<sub>t</sub> = &Sigma; &alpha;<sub>ts</sub> h<sub>s</sub> (weighted sum) |
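The LSTM rows of the cheat sheet translate directly into code. The sketch below is a single LSTM step in NumPy under simplifying assumptions: biases are omitted (as in the cheat sheet), and the standard output gate o<sub>t</sub> and hidden-state update h<sub>t</sub> = o<sub>t</sub> * tanh(C<sub>t</sub>), which the cheat sheet does not list, are filled in from the usual LSTM formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, W_i, W_C, W_o):
    """One LSTM step following the cheat-sheet equations (biases omitted)."""
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z)                    # forget gate: what to discard from C
    i_t = sigmoid(W_i @ z)                    # input gate: what new info to admit
    C_tilde = np.tanh(W_C @ z)                # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde        # new cell state (cheat-sheet "LSTM Cell")
    o_t = sigmoid(W_o @ z)                    # output gate (standard addition)
    h_t = o_t * np.tanh(C_t)                  # new hidden state
    return h_t, C_t

rng = np.random.default_rng(2)
hidden_dim, input_dim = 3, 4
W_f, W_i, W_C, W_o = (rng.normal(size=(hidden_dim, hidden_dim + input_dim)) * 0.1
                      for _ in range(4))
h, C = np.zeros(hidden_dim), np.zeros(hidden_dim)
h, C = lstm_step(rng.normal(size=input_dim), h, C, W_f, W_i, W_C, W_o)

print(h.shape, C.shape)  # (3,) (3,)
```

The additive update of C<sub>t</sub> is the key: gradients can flow through the `f_t * C_prev` term without repeatedly passing through a squashing nonlinearity.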

4. Quick Revision

  • RNNs: Capture sequential dependencies by updating a hidden state at each time step.
  • Vanishing Gradients: Gradients shrink exponentially in long sequences during BPTT.
  • LSTMs & GRUs: Utilize gating mechanisms to control memory and solve vanishing gradients.
  • Seq2Seq: Encoder compresses input into a context vector; Decoder generates the sequence.
  • Attention: Enables the Decoder to attend to specific Encoder states dynamically, overcoming the information bottleneck.
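The vanishing-gradients bullet can be made concrete: in BPTT, the gradient at a distant time step is a product of per-step Jacobians diag(1 &minus; h<sub>t</sub><sup>2</sup>) &middot; W<sub>hh</sub>, and when each factor contracts, the product shrinks exponentially. The sketch below (my own toy demonstration, with deliberately small recurrent weights) tracks the norm of that product across 50 steps.

```python
import numpy as np

rng = np.random.default_rng(3)
hidden_dim, steps = 16, 50
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1   # small recurrent weights

grad = np.eye(hidden_dim)       # accumulated Jacobian product dh_T / dh_0
h = np.zeros(hidden_dim)
norms = []
for _ in range(steps):
    h = np.tanh(W_hh @ h + rng.normal(size=hidden_dim))
    jac = (1.0 - h**2)[:, None] * W_hh    # per-step Jacobian: diag(1 - h^2) @ W_hh
    grad = jac @ grad                     # chain rule across time steps
    norms.append(np.linalg.norm(grad))

print(norms[-1] < norms[0])  # True — gradient magnitude decays over long spans
```

With large recurrent weights the same product can instead blow up, which is the exploding-gradient side of the same coin; LSTM/GRU gating and gradient clipping are the standard mitigations.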

5. Glossary

Review all terminology in the Deep Learning Glossary.

6. Next Steps

You have mastered the fundamentals of sequence modeling with RNNs. However, RNNs are sequential by nature: each step depends on the previous one, so computation cannot be parallelized across time, making them slow to train on modern hardware (GPUs).

In the next module, we will explore Transformers, an architecture that relies entirely on Attention and discards recurrence, revolutionizing NLP and Deep Learning.

Start Module 05: Transformers