Module: Attention#

⭐⭐⭐ | ⏱️ 4-5 hours

📊 Module Info#

  • Difficulty: ⭐⭐⭐ Advanced

  • Time Estimate: 4-5 hours

  • Prerequisites: Tensor module

  • Next Steps: Training, Transformers modules

Build the core attention mechanism that powers modern AI! This module implements the fundamental scaled dot-product attention that’s used in ChatGPT, BERT, GPT-4, and virtually all state-of-the-art AI systems.

🎯 Learning Objectives#

By the end of this module, you will be able to:

  • Master the attention formula: Understand and implement Attention(Q,K,V) = softmax(QK^T/√d_k)V

  • Build self-attention: Create the core component that enables global context understanding

  • Control information flow: Implement masking for causal, padding, and bidirectional attention

  • Visualize attention patterns: See what the model “pays attention to”

  • Understand modern AI: Grasp the mechanism that revolutionized natural language processing

🧠 Build → Use → Understand#

This module follows TinyTorch’s Build → Use → Understand framework:

  1. Build: Implement the core attention mechanism and masking utilities from mathematical foundations

  2. Use: Apply attention to sequence tasks and visualize attention patterns

  3. Understand: How attention enables dynamic, global context modeling that powers modern AI

📚 What You’ll Build#

Scaled Dot-Product Attention#

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    The fundamental attention operation:
    Attention(Q,K,V) = softmax(QK^T/√d_k)V
    
    This exact function powers ChatGPT, BERT, and all transformers.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    attention_weights = softmax(scores)
    return attention_weights @ V, attention_weights

Self-Attention Wrapper#

class SelfAttention:
    """
    Convenient wrapper for self-attention where Q=K=V.
    The most common use case in transformer models.
    """
    def __init__(self, d_model):
        self.d_model = d_model
    
    def forward(self, x, mask=None):
        # Self-attention: Q = K = V = x
        return scaled_dot_product_attention(x, x, x, mask)

Attention Masking#

# Causal masking (GPT-style: can't see future tokens)
causal_mask = create_causal_mask(seq_len)

# Padding masking (ignore padding tokens)
padding_mask = create_padding_mask(lengths, max_length)

# Bidirectional masking (BERT-style: can see all tokens)
bidirectional_mask = create_bidirectional_mask(seq_len)

🔬 Key Concepts#

Why Attention Revolutionized AI#

  • Global connectivity: Unlike CNNs, attention connects any two positions directly

  • Dynamic weights: Attention adapts to input content, not fixed like convolution kernels

  • Parallel processing: Unlike RNNs, all positions computed simultaneously

  • Interpretability: You can visualize what the model pays attention to

The Attention Formula Explained#

Attention(Q,K,V) = softmax(QK^T/√d_k)V

Where:
- Q (Query): "What am I looking for?"
- K (Key): "What information is available?"  
- V (Value): "What is the actual content?"
- √d_k scaling: Prevents extreme softmax values

Attention vs Convolution#

Aspect

Convolution

Attention

Receptive field

Local, grows with depth

Global from layer 1

Computation

O(n) with kernel size

O(n²) with sequence length

Weights

Fixed learned kernels

Dynamic input-dependent

Best for

Spatial data (images)

Sequential data (text)

Real-World Applications#

  • Language Models: GPT, BERT, ChatGPT use self-attention to understand context

  • Machine Translation: Google Translate uses attention to align source and target words

  • Image Understanding: Vision Transformers apply attention to image patches

  • Multimodal AI: CLIP, DALL-E use attention to connect text and images

🚀 From Attention to Modern AI#

This module teaches the core building block of modern AI:

What you’re building: The fundamental attention mechanism
What it enables: Multi-head attention, positional encoding, transformer blocks
What it powers: ChatGPT, BERT, GPT-4, and contemporary AI systems

Understanding this module gives you the foundation to understand:

  • How ChatGPT generates coherent text

  • How BERT understands language bidirectionally

  • How Vision Transformers work without convolution

  • How modern AI achieves human-like language understanding

📈 Module Progression#

Tensors → **ATTENTION** → Layers → Networks → CNNs → Training
  ↑              ↑
Foundation   Modern AI Core

After completing this module, you’ll understand the mechanism that sparked the AI revolution, making you ready to work with state-of-the-art models and architectures.

🎯 Success Criteria#

You’ll know you’ve mastered this module when you can:

  • Implement scaled dot-product attention from scratch

  • Explain why the √d_k scaling prevents gradient problems

  • Create different types of attention masks for various use cases

  • Visualize and interpret attention weights

  • Understand why attention enabled the transformer revolution

  • Connect this foundation to modern AI systems like ChatGPT

Choose your preferred way to engage with this module:

🚀 Launch Binder

Run this module interactively in your browser. No installation required!

https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/source/07_attention/attention_dev.ipynb
📖 View Source

Browse the Python source code and understand the implementation.

https://github.com/mlsysbook/TinyTorch/blob/main/modules/source/07_attention/attention_dev.py

💾 Save Your Progress

Binder sessions are temporary! Download your completed notebook when done, or switch to local development for persistent work.

Ready for serious development? → 🏗️ Local Setup Guide