Module: DataLoader#

⭐⭐⭐ | ⏱️ 5-6 hours

📊 Module Info#

Difficulty: ⭐⭐⭐ Advanced
Time Estimate: 5-7 hours
Prerequisites: Tensor, Layers modules
Next Steps: Training, Networks modules

Build the data pipeline foundation of TinyTorch! This module implements efficient data loading, preprocessing, and batching systems—the critical infrastructure that feeds neural networks during training and powers real-world ML systems.

🎯 Learning Objectives#

By the end of this module, you will be able to:

Design data pipeline architectures: Understand data engineering as the foundation of scalable ML systems
Implement reusable dataset abstractions: Build flexible interfaces that support multiple data sources and formats
Create efficient data loaders: Develop batching, shuffling, and streaming systems for optimal training performance
Build preprocessing pipelines: Implement normalization, augmentation, and transformation systems
Apply systems engineering principles: Handle memory management, I/O optimization, and error recovery in data pipelines

🧠 Build → Use → Optimize#

This module follows TinyTorch’s Build → Use → Optimize framework:

Build: Implement dataset abstractions, data loaders, and preprocessing pipelines from engineering principles
Use: Apply your data system to the real CIFAR-10 dataset with complete train/test workflows
Optimize: Analyze performance characteristics, memory usage, and system bottlenecks for production readiness

📚 What You’ll Build#

Complete Data Pipeline System#

# End-to-end data pipeline creation
train_loader, test_loader, normalizer = create_data_pipeline(
    dataset_path="data/cifar10/",
    batch_size=32,
    normalize=True,
    shuffle=True
)

# Ready for neural network training
for batch_images, batch_labels in train_loader:
    # batch_images.shape: (32, 3, 32, 32) - normalized pixel values
    # batch_labels.shape: (32,) - class indices
    predictions = model(batch_images)
    loss = compute_loss(predictions, batch_labels)
    # Continue training loop...

Dataset Abstraction System#

# Flexible interface supporting multiple datasets
class Dataset:
    def __getitem__(self, index):
        # Return (data, label) for any dataset type
        pass
    def __len__(self):
        # Enable len() and iteration
        pass

# Concrete implementation with real data
dataset = CIFAR10Dataset("data/cifar10/", train=True, download=True)
print(f"Loaded {len(dataset)} real samples")  # 50,000 training images
image, label = dataset[0]  # Access individual samples
print(f"Sample shape: {image.shape}, Label: {label}")
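The interface above can be exercised without downloading anything. The following is a minimal sketch, not the module's reference solution: `ArrayDataset` is a hypothetical in-memory subclass introduced only for illustration, and it assumes samples are stored as NumPy arrays.

```python
import numpy as np


class Dataset:
    """Base interface: subclasses supply indexed access and a length."""

    def __getitem__(self, index):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError


class ArrayDataset(Dataset):
    """Hypothetical in-memory dataset backed by two NumPy arrays."""

    def __init__(self, data, labels):
        if len(data) != len(labels):
            raise ValueError("data and labels must have the same length")
        self.data = data
        self.labels = labels

    def __getitem__(self, index):
        # Return one (data, label) pair, matching the interface above.
        return self.data[index], self.labels[index]

    def __len__(self):
        return len(self.data)


# Eight fake "images" shaped like CIFAR-10 samples (3 channels, 32x32 pixels).
toy = ArrayDataset(np.zeros((8, 3, 32, 32), dtype=np.float32), np.arange(8))
image, label = toy[5]
print(len(toy), image.shape, label)
```

Because a loader only needs `__len__` and `__getitem__`, any object that provides both can stand in for a real dataset during testing.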

Efficient Data Loading System#

# High-performance batching with memory optimization
dataloader = DataLoader(
    dataset=dataset,
    batch_size=32,          # Configurable batch size
    shuffle=True,           # Training randomization
    drop_last=False         # Keep the final, smaller batch instead of dropping it
)

# Pythonic iteration interface
for batch_idx, (batch_data, batch_labels) in enumerate(dataloader):
    print(f"Batch {batch_idx}: {batch_data.shape}")
    # Automatic batching handles all the complexity
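To make the batching logic concrete, here is one possible sketch of a loader with the same three options. It is an assumption-laden illustration rather than TinyTorch's implementation: `SimpleDataLoader` and its `seed` parameter are hypothetical names, and batches are returned as plain NumPy arrays instead of Tensors.

```python
import numpy as np


class SimpleDataLoader:
    """Hypothetical loader: batches any object with __len__ and __getitem__."""

    def __init__(self, dataset, batch_size=32, shuffle=False, drop_last=False, seed=None):
        self.dataset = dataset
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.drop_last = drop_last
        self.rng = np.random.default_rng(seed)  # seed is an illustrative extra

    def __len__(self):
        # Number of batches per epoch.
        full, remainder = divmod(len(self.dataset), self.batch_size)
        return full if (self.drop_last or remainder == 0) else full + 1

    def __iter__(self):
        n = len(self.dataset)
        # A fresh permutation per epoch visits every sample exactly once.
        order = self.rng.permutation(n) if self.shuffle else np.arange(n)
        for start in range(0, n, self.batch_size):
            idx = order[start:start + self.batch_size]
            if self.drop_last and len(idx) < self.batch_size:
                break
            samples = [self.dataset[int(i)] for i in idx]
            data = np.stack([s[0] for s in samples])
            labels = np.array([s[1] for s in samples])
            yield data, labels


# A list of (image, label) pairs is enough to act as a dataset here.
pairs = [(np.full((3, 32, 32), i, dtype=np.float32), i) for i in range(10)]
loader = SimpleDataLoader(pairs, batch_size=4, shuffle=True, seed=0)
batch_shapes = [data.shape for data, _ in loader]
print(batch_shapes)  # two full batches of 4, then one batch of 2
```

Shuffling indices rather than the data itself keeps the dataset untouched and makes each epoch's order cheap to regenerate.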

Data Preprocessing Pipeline#

# Production-ready normalization system
normalizer = Normalizer()

# Fit on training data (compute statistics once)
normalizer.fit(training_images)
print(f"Mean: {normalizer.mean}, Std: {normalizer.std}")

# Apply to any dataset (training, validation, test)
normalized_images = normalizer.transform(test_images)
# Ensures consistent preprocessing across data splits
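The fit/transform pattern above can be sketched in a few lines. This version assumes a single global mean and standard deviation over all pixels; the module's `Normalizer` may compute per-channel statistics instead. `SketchNormalizer` and its `eps` parameter are hypothetical names used only here.

```python
import numpy as np


class SketchNormalizer:
    """Hypothetical normalizer using one global mean and standard deviation."""

    def __init__(self, eps=1e-8):
        self.mean = None
        self.std = None
        self.eps = eps  # guards against division by zero for constant inputs

    def fit(self, images):
        # Compute statistics once, from the training data only.
        images = np.asarray(images, dtype=np.float64)
        self.mean = float(images.mean())
        self.std = float(images.std())
        return self

    def transform(self, images):
        if self.mean is None:
            raise RuntimeError("call fit() before transform()")
        return (np.asarray(images, dtype=np.float64) - self.mean) / (self.std + self.eps)


rng = np.random.default_rng(0)
train = rng.uniform(0, 255, size=(100, 3, 32, 32))
test = rng.uniform(0, 255, size=(20, 3, 32, 32))

norm = SketchNormalizer().fit(train)
train_n = norm.transform(train)
test_n = norm.transform(test)  # reuses the training statistics
print(round(train_n.mean(), 6), round(train_n.std(), 6))
```

Fitting on the training split alone keeps test-set statistics from leaking into preprocessing, which is why `transform` never recomputes the mean or standard deviation.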

🎯 NEW: CIFAR-10 Support for North Star Goal#

Built-in CIFAR-10 Download and Loading#

This module now includes complete CIFAR-10 support to achieve our semester goal of 75% accuracy:

from tinytorch.core.dataloader import CIFAR10Dataset, download_cifar10

# Download CIFAR-10 automatically (one-time, ~170MB)
dataset_path = download_cifar10()  # Downloads to ./data/cifar-10-batches-py

# Load training and test data
dataset = CIFAR10Dataset(download=True, flatten=False)
print(f"✅ Loaded {len(dataset.train_data)} training samples")
print(f"✅ Loaded {len(dataset.test_data)} test samples")

# Create DataLoaders for training
from tinytorch.core.dataloader import DataLoader
train_loader = DataLoader(dataset.train_data, dataset.train_labels, batch_size=32, shuffle=True)
test_loader = DataLoader(dataset.test_data, dataset.test_labels, batch_size=32, shuffle=False)

# Ready for CNN training!
for batch_images, batch_labels in train_loader:
    print(f"Batch shape: {batch_images.shape}")  # (32, 3, 32, 32) for CNNs
    break

What’s New in This Module#

✅ download_cifar10(): Automatically downloads and extracts the CIFAR-10 dataset
✅ CIFAR10Dataset: Complete dataset class with train/test splits
✅ Real Data Support: Work with actual 32x32 RGB images, not toy data
✅ Production Features: Shuffling, batching, normalization for real training

🚀 Getting Started#

Prerequisites#

Make sure the prerequisite Tensor and Layers modules pass their tests:

# Activate TinyTorch environment
source bin/activate-tinytorch.sh

# Verify prerequisite modules
tito test --module tensor
tito test --module layers

Development Workflow#

Open the development file: modules/source/07_dataloader/dataloader_dev.py
Implement Dataset abstraction: Create the base interface for all data sources
Build CIFAR-10 dataset: Implement real dataset loading with binary file parsing
Create DataLoader system: Add batching, shuffling, and iteration functionality
Add preprocessing tools: Implement normalizer and transformation pipeline
Export and verify: tito export --module dataloader && tito test --module dataloader
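For the CIFAR-10 step in the workflow above, the sketch below shows one way to parse a single batch file. It assumes the "python version" of CIFAR-10 (the `cifar-10-batches-py` layout), where each file is a pickled dictionary with `b"data"` and `b"labels"` entries. `load_cifar_batch` is a hypothetical helper, not a TinyTorch function, and the example writes a tiny fake batch so it runs without a download.

```python
import os
import pickle
import tempfile

import numpy as np


def load_cifar_batch(path):
    """Read one CIFAR-10 "python version" batch file into image and label arrays."""
    with open(path, "rb") as f:
        batch = pickle.load(f, encoding="bytes")
    # Each row holds 3072 bytes: 1024 red, 1024 green, then 1024 blue values.
    images = np.asarray(batch[b"data"], dtype=np.uint8).reshape(-1, 3, 32, 32)
    labels = np.asarray(batch[b"labels"], dtype=np.int64)
    return images, labels


# Fake two-sample batch in the same layout (values are arbitrary test data).
fake = {
    b"data": (np.arange(2 * 3072) % 256).astype(np.uint8).reshape(2, 3072),
    b"labels": [3, 7],
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data_batch_fake")
    with open(path, "wb") as f:
        pickle.dump(fake, f)
    images, labels = load_cifar_batch(path)

print(images.shape, labels.tolist())
```

Unpickling can execute arbitrary code, so only load batch files obtained from a source you trust.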

🧪 Testing Your Implementation#

Comprehensive Test Suite#

Run the full test suite to verify data engineering functionality:

# TinyTorch CLI (recommended)
tito test --module dataloader

# Direct pytest execution
python -m pytest tests/ -k dataloader -v

Test Coverage Areas#

✅ Dataset Interface: Verify abstract base class and concrete implementations
✅ Real Data Loading: Test with actual CIFAR-10 dataset (downloads ~170MB)
✅ Batching System: Ensure correct batch shapes and memory efficiency
✅ Data Preprocessing: Verify normalization statistics and transformations
✅ Pipeline Integration: Test complete train/test workflow with real data

Inline Testing & Real Data Validation#

The module prints inline test feedback as it runs against real CIFAR-10 data:

# Example inline test output
🔬 Unit Test: CIFAR-10 dataset loading...
📥 Downloading CIFAR-10 dataset (170MB)...
✅ Successfully loaded 50,000 training samples
✅ Sample shapes correct: (3, 32, 32)
✅ Labels in valid range: [0, 9]
📈 Progress: CIFAR-10 Dataset ✓

# DataLoader testing with real data
🔬 Unit Test: DataLoader batching...
✅ Batch shapes correct: (32, 3, 32, 32)
✅ Shuffling produces different orders
✅ Iteration covers all samples exactly once
📈 Progress: DataLoader ✓

Manual Testing Examples#

from tinytorch.core.tensor import Tensor
from dataloader_dev import CIFAR10Dataset, DataLoader, Normalizer

# Test dataset loading with real data
dataset = CIFAR10Dataset("data/cifar10/", train=True, download=True)
print(f"Dataset size: {len(dataset)}")
print(f"Classes: {dataset.get_num_classes()}")

# Test data loading pipeline
dataloader = DataLoader(dataset, batch_size=16, shuffle=True)
for batch_images, batch_labels in dataloader:
    print(f"Batch shape: {batch_images.shape}")
    print(f"Label range: {batch_labels.min()} to {batch_labels.max()}")
    break  # Just test first batch

# Test preprocessing pipeline
normalizer = Normalizer()
sample_batch, _ = next(iter(dataloader))
normalizer.fit(sample_batch)
normalized = normalizer.transform(sample_batch)
print(f"Original range: [{sample_batch.min():.2f}, {sample_batch.max():.2f}]")
print(f"Normalized range: [{normalized.min():.2f}, {normalized.max():.2f}]")

🎯 Key Concepts#

Real-World Applications#

Production ML Systems: Companies such as Netflix and Spotify use similar data pipelines for recommendation training
Computer Vision: Loaders for datasets such as ImageNet and COCO power research and production vision systems
Natural Language Processing: Text preprocessing pipelines enable language model training
Autonomous Systems: Real-time data streams from sensors require efficient pipeline architectures

Data Engineering Principles#

Interface Design: Abstract Dataset class enables switching between data sources seamlessly
Memory Efficiency: Streaming data loading avoids holding the entire dataset in memory, so large datasets do not exhaust RAM
I/O Optimization: Batching reduces system calls and improves throughput
Preprocessing Consistency: Fit-transform pattern ensures identical preprocessing across data splits

Systems Performance Considerations#

Batch Size Trade-offs: Larger batches improve GPU utilization but increase memory usage
Shuffling Strategy: Training reads samples in random order, while inference can read them sequentially
Caching and Storage: Balance between memory usage and I/O performance
Error Handling: Robust handling of corrupted data, network failures, and disk issues
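The batch size trade-off can be estimated with simple arithmetic. The sketch below assumes float32 storage (4 bytes per value) and counts only the raw image data in one batch, not labels, activations, or gradients. `batch_stats` is a hypothetical helper.

```python
import math

num_samples = 50_000          # CIFAR-10 training images, as stated above
sample_shape = (3, 32, 32)
bytes_per_value = 4           # assumes float32 storage


def batch_stats(batch_size, drop_last=False):
    """Return (batches per epoch, bytes of image data per full batch)."""
    values_per_sample = math.prod(sample_shape)
    batch_bytes = batch_size * values_per_sample * bytes_per_value
    if drop_last:
        num_batches = num_samples // batch_size
    else:
        num_batches = math.ceil(num_samples / batch_size)
    return num_batches, batch_bytes


for bs in (32, 128, 512):
    n, b = batch_stats(bs)
    print(f"batch_size={bs:>3}: {n:>4} batches per epoch, {b / 1024:.0f} KiB per batch")
```

With a batch size of 32, one epoch takes 1563 batches of 384 KiB each; the last batch holds only 16 samples, and `drop_last=True` would reduce the count to 1562.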

Production ML Pipeline Patterns#

ETL Design: Extract (load files), Transform (preprocess), Load (batch) pattern
Data Versioning: Reproducible datasets with consistent preprocessing
Pipeline Monitoring: Track data quality, distribution shifts, processing times
Scalability Planning: Design for growing datasets and distributed processing

🎉 Ready to Build?#

You’re about to build the data engineering foundation that powers every successful ML system! From startup prototypes to billion-dollar recommendation engines, they all depend on robust data pipelines like the one you’re building.

This module teaches you the systems thinking that separates hobby projects from production ML systems. You’ll work with real data, handle real performance constraints, and build infrastructure that scales. Take your time, think about edge cases, and enjoy building the backbone of machine learning!

Choose your preferred way to engage with this module:

🚀 Launch Binder

Run this module interactively in your browser. No installation required!

https://mybinder.org/v2/gh/mlsysbook/TinyTorch/main?filepath=modules/source/08_dataloader/dataloader_dev.ipynb

⚡ Open in Colab

Use Google Colab for GPU access and cloud compute power.

https://colab.research.google.com/github/mlsysbook/TinyTorch/blob/main/modules/source/08_dataloader/dataloader_dev.ipynb

📖 View Source

Browse the Python source code and understand the implementation.

https://github.com/mlsysbook/TinyTorch/blob/main/modules/source/08_dataloader/dataloader_dev.py

💾 Save Your Progress

Binder sessions are temporary! Download your completed notebook when done, or switch to local development for persistent work.
