Module: Attention

Module: Attention#

⭐⭐⭐ | ⏱️ 4-5 hours

📊 Module Info#

Difficulty: ⭐⭐⭐ Advanced
Time Estimate: 4-5 hours
Prerequisites: Tensor module
Next Steps: Training, Transformers modules

Build the core attention mechanism that powers modern AI! This module implements the fundamental scaled dot-product attention that’s used in ChatGPT, BERT, GPT-4, and virtually all state-of-the-art AI systems.

🎯 Learning Objectives#

By the end of this module, you will be able to:

Master the attention formula: Understand and implement Attention(Q,K,V) = softmax(QK^T/√d_k)V
Build self-attention: Create the core component that enables global context understanding
Control information flow: Implement masking for causal, padding, and bidirectional attention
Visualize attention patterns: See what the model “pays attention to”
Understand modern AI: Grasp the mechanism that revolutionized natural language processing

🧠 Build → Use → Understand#

This module follows TinyTorch’s Build → Use → Understand framework:

Build: Implement the core attention mechanism and masking utilities from mathematical foundations
Use: Apply attention to sequence tasks and visualize attention patterns
Understand: How attention enables dynamic, global context modeling that powers modern AI

📚 What You’ll Build#

Scaled Dot-Product Attention#

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    The fundamental attention operation:
    Attention(Q,K,V) = softmax(QK^T/√d_k)V
    
    This exact function powers ChatGPT, BERT, and all transformers.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    attention_weights = softmax(scores)
    return attention_weights @ V, attention_weights

Self-Attention Wrapper#

class SelfAttention:
    """
    Convenient wrapper for self-attention where Q=K=V.
    The most common use case in transformer models.
    """
    def __init__(self, d_model):
        self.d_model = d_model
    
    def forward(self, x, mask=None):
        # Self-attention: Q = K = V = x
        return scaled_dot_product_attention(x, x, x, mask)

Attention Masking#

# Causal masking (GPT-style: can't see future tokens)
causal_mask = create_causal_mask(seq_len)

# Padding masking (ignore padding tokens)
padding_mask = create_padding_mask(lengths, max_length)

# Bidirectional masking (BERT-style: can see all tokens)
bidirectional_mask = create_bidirectional_mask(seq_len)

🔬 Key Concepts#

Why Attention Revolutionized AI#

Global connectivity: Unlike CNNs, attention connects any two positions directly
Dynamic weights: Attention adapts to input content, not fixed like convolution kernels
Parallel processing: Unlike RNNs, all positions computed simultaneously
Interpretability: You can visualize what the model pays attention to

The Attention Formula Explained#

Attention(Q,K,V) = softmax(QK^T/√d_k)V

Where:
- Q (Query): "What am I looking for?"
- K (Key): "What information is available?"  
- V (Value): "What is the actual content?"
- √d_k scaling: Prevents extreme softmax values

Attention vs Convolution#

Aspect	Convolution	Attention
Receptive field	Local, grows with depth	Global from layer 1
Computation	O(n) with kernel size	O(n²) with sequence length
Weights	Fixed learned kernels	Dynamic input-dependent
Best for	Spatial data (images)	Sequential data (text)

Real-World Applications#

Language Models: GPT, BERT, ChatGPT use self-attention to understand context
Machine Translation: Google Translate uses attention to align source and target words
Image Understanding: Vision Transformers apply attention to image patches
Multimodal AI: CLIP, DALL-E use attention to connect text and images

🚀 From Attention to Modern AI#

This module teaches the core building block of modern AI:

What you’re building: The fundamental attention mechanism
What it enables: Multi-head attention, positional encoding, transformer blocks
What it powers: ChatGPT, BERT, GPT-4, and contemporary AI systems

Understanding this module gives you the foundation to understand:

How ChatGPT generates coherent text
How BERT understands language bidirectionally
How Vision Transformers work without convolution
How modern AI achieves human-like language understanding

📈 Module Progression#

Tensors → **ATTENTION** → Layers → Networks → CNNs → Training
  ↑              ↑
Foundation   Modern AI Core

After completing this module, you’ll understand the mechanism that sparked the AI revolution, making you ready to work with state-of-the-art models and architectures.

🎯 Success Criteria#

You’ll know you’ve mastered this module when you can:

Implement scaled dot-product attention from scratch
Explain why the √d_k scaling prevents gradient problems
Create different types of attention masks for various use cases
Visualize and interpret attention weights
Understand why attention enabled the transformer revolution
Connect this foundation to modern AI systems like ChatGPT

Choose your preferred way to engage with this module:

🚀 Launch Binder

Run this module interactively in your browser. No installation required!

https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/source/07_attention/attention_dev.ipynb

⚡ Open in Colab

Use Google Colab for GPU access and cloud compute power.

https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/source/07_attention/attention_dev.ipynb

📖 View Source

Browse the Python source code and understand the implementation.

https://github.com/mlsysbook/TinyTorch/blob/main/modules/source/07_attention/attention_dev.py

💾 Save Your Progress

Binder sessions are temporary! Download your completed notebook when done, or switch to local development for persistent work.

← Previous Module Next Module →