Module: Training

Module: Training#

⭐⭐⭐⭐ | ⏱️ 8-10 hours

📊 Module Info#

Difficulty: ⭐⭐⭐⭐ Expert
Time Estimate: 8-10 hours
Prerequisites: Tensor, Activations, Layers, Networks, DataLoader, Autograd, Optimizers modules
Next Steps: Compression, Kernels, Benchmarking, MLOps modules

Build the complete training pipeline that brings all TinyTorch components together. This capstone module orchestrates data loading, model forward passes, loss computation, backpropagation, and optimization into the end-to-end training workflows that power modern AI systems.

🎯 Learning Objectives#

By the end of this module, you will be able to:

Design complete training architectures: Orchestrate all ML components into cohesive training systems
Implement essential loss functions: Build MSE, CrossEntropy, and BinaryCrossEntropy from mathematical foundations
Create evaluation frameworks: Develop metrics systems for classification, regression, and model performance assessment
Build production training loops: Implement robust training workflows with validation, logging, and progress tracking
Master training dynamics: Understand convergence, overfitting, generalization, and optimization in real scenarios

🧠 Build → Use → Optimize#

This module follows TinyTorch’s Build → Use → Optimize framework:

Build: Implement loss functions, evaluation metrics, and complete training orchestration systems
Use: Train end-to-end neural networks on real datasets with full pipeline automation
Optimize: Analyze training dynamics, debug convergence issues, and optimize training performance for production

🎯 NEW: Model Checkpointing & Evaluation Tools#

Complete Training with Checkpointing#

This module now includes production features for our north star goal:

from tinytorch.core.training import Trainer, CrossEntropyLoss, Accuracy
from tinytorch.core.training import evaluate_model, plot_training_history

# Train with automatic model checkpointing
trainer = Trainer(model, CrossEntropyLoss(), Adam(lr=0.001), [Accuracy()])
history = trainer.fit(
    train_loader,
    val_dataloader=test_loader,
    epochs=30,
    save_best=True,                    # ✅ NEW: Saves best model automatically
    checkpoint_path='best_model.pkl',  # ✅ NEW: Checkpoint location
    early_stopping_patience=5          # ✅ NEW: Stop if no improvement
)

# Load best model after training
trainer.load_checkpoint('best_model.pkl')
print(f"✅ Restored best model from epoch {trainer.current_epoch}")

# Evaluate with comprehensive metrics
results = evaluate_model(model, test_loader)
print(f"Test Accuracy: {results['accuracy']:.2%}")
print(f"Confusion Matrix:\n{results['confusion_matrix']}")

# Visualize training progress
plot_training_history(history)  # Shows loss and accuracy curves

What’s New in This Module#

✅ save_checkpoint()/load_checkpoint(): Save and restore model state during training
✅ save_best=True: Automatically saves model with best validation performance
✅ early_stopping_patience: Stop training when validation loss stops improving
✅ evaluate_model(): Comprehensive model evaluation with confusion matrix
✅ plot_training_history(): Visualize training and validation curves
✅ compute_confusion_matrix(): Analyze classification errors by class

📚 What You’ll Build#

Complete Training Pipeline#

# End-to-end training system
from tinytorch.core.training import Trainer
from tinytorch.core.losses import CrossEntropyLoss
from tinytorch.core.metrics import Accuracy

# Define complete model architecture
model = Sequential([
    Dense(784, 128), ReLU(),
    Dense(128, 64), ReLU(),
    Dense(64, 10), Softmax()
])

# Configure training components
optimizer = Adam(model.parameters(), learning_rate=0.001)
loss_fn = CrossEntropyLoss()
metrics = [Accuracy()]

# Create and configure trainer
trainer = Trainer(
    model=model,
    optimizer=optimizer, 
    loss_fn=loss_fn,
    metrics=metrics
)

# Train with comprehensive monitoring
history = trainer.fit(
    train_dataloader=train_loader,
    val_dataloader=val_loader,
    epochs=50,
    verbose=True
)

Loss Function Library#

# Regression loss for continuous targets
mse_loss = MeanSquaredError()
regression_loss = mse_loss(predictions, continuous_targets)

# Multi-class classification loss
ce_loss = CrossEntropyLoss()
classification_loss = ce_loss(logits, class_indices)

# Binary classification loss
bce_loss = BinaryCrossEntropyLoss()
binary_loss = bce_loss(sigmoid_outputs, binary_labels)

# All losses support batch processing and gradient computation
loss.backward()  # Automatic differentiation integration

Evaluation Metrics System#

# Classification performance measurement
accuracy = Accuracy()
acc_score = accuracy(predictions, true_labels)  # Returns 0.0 to 1.0

# Regression error measurement  
mae = MeanAbsoluteError()
error = mae(predictions, targets)

# Extensible metric framework
class CustomMetric:
    def __call__(self, y_pred, y_true):
        # Implement custom evaluation logic
        return custom_score

metrics = [Accuracy(), CustomMetric()]
trainer = Trainer(model, optimizer, loss_fn, metrics)

Real-World Training Workflows#

# Train on CIFAR-10 with full pipeline
from tinytorch.core.dataloader import CIFAR10Dataset, DataLoader

# Load and prepare data
train_dataset = CIFAR10Dataset("data/cifar10/", train=True, download=True)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

# Configure CNN for computer vision
cnn_model = Sequential([
    Conv2D(3, 16, kernel_size=3), ReLU(),
    MaxPool2D(kernel_size=2),
    Conv2D(16, 32, kernel_size=3), ReLU(),
    Flatten(),
    Dense(32 * 13 * 13, 128), ReLU(),
    Dense(128, 10)
])

# Train with monitoring and validation
trainer = Trainer(cnn_model, Adam(cnn_model.parameters()), CrossEntropyLoss(), [Accuracy()])
history = trainer.fit(train_loader, val_loader, epochs=100)

# Analyze training results
print(f"Final train accuracy: {history['train_accuracy'][-1]:.4f}")
print(f"Final val accuracy: {history['val_accuracy'][-1]:.4f}")

🚀 Getting Started#

Prerequisites#

Ensure you have completed the entire TinyTorch foundation:

# Activate TinyTorch environment
source bin/activate-tinytorch.sh

# Verify all prerequisite modules (this is the capstone!)
tito test --module tensor
tito test --module activations  
tito test --module layers
tito test --module networks
tito test --module dataloader
tito test --module autograd
tito test --module optimizers

Development Workflow#

Open the development file: modules/source/10_training/training_dev.py
Implement loss functions: Build MSE, CrossEntropy, and BinaryCrossEntropy with proper gradients
Create metrics system: Develop Accuracy and extensible evaluation framework
Build Trainer class: Orchestrate training loop with validation and monitoring
Test end-to-end training: Apply complete pipeline to real datasets and problems
Export and verify: tito export --module training && tito test --module training

🧪 Testing Your Implementation#

Comprehensive Test Suite#

Run the full test suite to verify complete training system functionality:

# TinyTorch CLI (recommended)
tito test --module training

# Direct pytest execution
python -m pytest tests/ -k training -v

Test Coverage Areas#

✅ Loss Function Implementation: Verify mathematical correctness and gradient computation
✅ Metrics System: Test accuracy calculation and extensible framework
✅ Training Loop Orchestration: Ensure proper coordination of all components
✅ End-to-End Training: Verify complete workflows on real datasets
✅ Convergence Analysis: Test training dynamics and optimization behavior

Inline Testing & Training Analysis#

The module includes comprehensive training validation and convergence monitoring:

# Example inline test output
🔬 Unit Test: CrossEntropy loss function...
✅ Mathematical correctness verified
✅ Gradient computation working
✅ Batch processing supported
📈 Progress: Loss Functions ✓

# Training monitoring
🔬 Unit Test: Complete training pipeline...
✅ Trainer orchestrates all components correctly
✅ Training loop converges on test problem
✅ Validation monitoring working
📈 Progress: End-to-End Training ✓

# Real dataset training
📊 Training on CIFAR-10 subset...
Epoch 1/10: train_loss=2.345, train_acc=0.234, val_loss=2.123, val_acc=0.278
Epoch 5/10: train_loss=1.456, train_acc=0.567, val_loss=1.543, val_acc=0.523
✅ Model converging successfully

Manual Testing Examples#

from training_dev import Trainer, CrossEntropyLoss, Accuracy
from networks_dev import Sequential
from layers_dev import Dense
from activations_dev import ReLU, Softmax
from optimizers_dev import Adam

# Test complete training on synthetic data
model = Sequential([Dense(4, 8), ReLU(), Dense(8, 3), Softmax()])
optimizer = Adam(model.parameters(), learning_rate=0.01)
loss_fn = CrossEntropyLoss()
metrics = [Accuracy()]

trainer = Trainer(model, optimizer, loss_fn, metrics)

# Create simple dataset
from dataloader_dev import SimpleDataset, DataLoader
train_dataset = SimpleDataset(size=1000, num_features=4, num_classes=3)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Train and monitor
history = trainer.fit(train_loader, epochs=20, verbose=True)
print(f"Training completed. Final accuracy: {history['train_accuracy'][-1]:.4f}")

🎯 Key Concepts#

Real-World Applications#

Production ML Systems: Companies like Netflix, Google use similar training pipelines for recommendation and search systems
Research Workflows: Academic researchers use training frameworks like this for experimental model development
MLOps Platforms: Production training systems extend these patterns with distributed computing and monitoring
Edge AI Training: Federated learning systems use similar orchestration patterns across distributed devices

Training System Architecture#

Loss Functions: Mathematical objectives that define what the model should learn
Metrics: Human-interpretable measures of model performance for monitoring and decision-making
Training Loop: Orchestration pattern that coordinates data loading, forward passes, backward passes, and optimization
Validation Strategy: Techniques for monitoring generalization and preventing overfitting

Machine Learning Engineering#

Training Dynamics: Understanding convergence, overfitting, underfitting, and optimization landscapes
Hyperparameter Tuning: Systematic approaches to learning rate, batch size, and architecture selection
Debugging Training: Common failure modes and diagnostic techniques for training issues
Production Considerations: Scalability, monitoring, reproducibility, and deployment readiness

Systems Integration Patterns#

Component Orchestration: How to coordinate multiple ML components into cohesive systems
Error Handling: Robust handling of training failures, data issues, and convergence problems
Monitoring and Logging: Tracking training progress, performance metrics, and system health
Extensibility: Design patterns that enable easy addition of new losses, metrics, and training strategies

🎉 Ready to Build?#

You’re about to complete the TinyTorch framework by building the training system that brings everything together! This is where all your hard work on tensors, layers, networks, data loading, gradients, and optimization culminates in a complete ML system.

Training is the heart of machine learning—it’s where models learn from data and become intelligent. You’re building the same patterns used to train GPT, train computer vision models, and power production AI systems. Take your time, understand how all the pieces fit together, and enjoy creating something truly powerful!

Choose your preferred way to engage with this module:

🚀 Launch Binder

Run this module interactively in your browser. No installation required!

https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/source/11_training/training_dev.ipynb

⚡ Open in Colab

Use Google Colab for GPU access and cloud compute power.

https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/source/11_training/training_dev.ipynb

📖 View Source

Browse the Python source code and understand the implementation.

https://github.com/mlsysbook/TinyTorch/blob/main/modules/source/11_training/training_dev.py

💾 Save Your Progress

Binder sessions are temporary! Download your completed notebook when done, or switch to local development for persistent work.

← Previous Module Next Module →