Chapter 01: Introduction to PyTorch

Day 06: GPU Acceleration with CUDA

Table of Contents

  1. Introduction to CUDA and GPU Acceleration
  2. Checking for GPU Availability
  3. Moving Tensors Between CPU and GPU
    • 3.1. Using .to(device)
    • 3.2. Using .cuda() and .cpu()
  4. Observing Performance Differences
    • 4.1. Benchmarking CPU vs GPU Operations
    • 4.2. Practical Example: Matrix Multiplication
  5. Moving Models to GPU
    • 5.1. Transferring a Simple Neural Network
    • 5.2. Training a Model on GPU
  6. Handling Multiple GPUs
    • 6.1. Checking for Multiple GPUs
    • 6.2. Utilizing Data Parallelism
  7. Best Practices for GPU Acceleration
  8. Common Pitfalls and How to Avoid Them
  9. Exercises for Practice
    • 9.1. Exercise 1: Tensor Operations on GPU
    • 9.2. Exercise 2: Moving a Model to GPU
    • 9.3. Exercise 3: Utilizing Multiple GPUs
  10. Summary
  11. Additional Resources

1. Introduction to CUDA and GPU Acceleration

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA. It allows developers to use NVIDIA GPUs for general-purpose processing, significantly accelerating compute-intensive tasks like deep learning.

GPU Acceleration leverages the parallel processing capabilities of GPUs to perform computations faster than CPUs. In the context of deep learning, GPUs can handle large-scale tensor operations more efficiently, leading to faster training and inference times.

Why Use GPUs?

  • Parallelism: GPUs contain thousands of cores that can perform operations simultaneously.
  • Memory Bandwidth: GPUs offer much higher memory bandwidth than CPUs, so large tensors can be read and written far faster during computation.
  • Efficiency: GPUs can handle large-scale matrix and vector operations more efficiently than CPUs.

PyTorch and CUDA: PyTorch seamlessly integrates with CUDA, allowing tensors and models to be moved between CPU and GPU with simple commands. This integration enables easy experimentation and scaling of deep learning models.
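As a quick illustration, the device-agnostic pattern used throughout this chapter looks like this (a minimal sketch; the tensor shape is arbitrary):

import torch

# Pick the device once, then move data and models to it; the same script
# runs unchanged on CPU-only machines and on CUDA-capable GPUs.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(2, 3).to(device)
print(x.device)  # cpu, or cuda:0 when a GPU is present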


2. Checking for GPU Availability

Before leveraging GPU acceleration, it's essential to verify if a CUDA-compatible GPU is available on your system.

Example 1: Checking GPU Availability

import torch

# Check if CUDA is available
cuda_available = torch.cuda.is_available()
print("Is CUDA available?", cuda_available)

# Get the number of GPUs available
num_gpus = torch.cuda.device_count()
print("Number of GPUs available:", num_gpus)

# Get the name of the current GPU
if cuda_available:
    current_gpu = torch.cuda.get_device_name(0)
    print("Current GPU:", current_gpu)

Expected Output (Example):

Is CUDA available? True
Number of GPUs available: 2
Current GPU: NVIDIA GeForce RTX 3080

Explanation:

  • torch.cuda.is_available(): Returns True if CUDA is available; otherwise, False.
  • torch.cuda.device_count(): Returns the number of GPUs available.
  • torch.cuda.get_device_name(0): Returns the name of the GPU at index 0.

Handling Multiple GPUs: If multiple GPUs are available, you can select a specific GPU by its index.

if cuda_available and num_gpus > 1:
    # Select GPU with index 1
    device = torch.device("cuda:1")
    print("Selected Device:", torch.cuda.get_device_name(device))
else:
    device = torch.device("cuda:0" if cuda_available else "cpu")
    print("Selected Device:", torch.cuda.get_device_name(device) if cuda_available else "CPU")

3. Moving Tensors Between CPU and GPU

PyTorch provides straightforward methods to move tensors between CPU and GPU. Understanding these methods is crucial for optimizing performance.

3.1. Using .to(device)

The .to() method is versatile and can move tensors to any specified device (CPU or GPU).

Example 2: Moving Tensors Using .to(device)

import torch

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Create a tensor on CPU
cpu_tensor = torch.randn(3, 3)
print("\nCPU Tensor:\n", cpu_tensor)

# Move tensor to GPU
if device.type == 'cuda':
    gpu_tensor = cpu_tensor.to(device)
    print("\nGPU Tensor:\n", gpu_tensor)
    print("\nIs gpu_tensor on GPU?", gpu_tensor.is_cuda)

Expected Output (Example):

Using device: cuda

CPU Tensor:
 tensor([[ 0.1234, -1.2345,  0.5678],
        [ 1.2345, -0.5678,  1.3456],
        [-0.9876,  0.6789, -1.4567]])

GPU Tensor:
 tensor([[ 0.1234, -1.2345,  0.5678],
        [ 1.2345, -0.5678,  1.3456],
        [-0.9876,  0.6789, -1.4567]], device='cuda:0')

Is gpu_tensor on GPU? True

Explanation:

  • Device Selection: Chooses GPU if available; otherwise, CPU.
  • Creating Tensor on CPU: cpu_tensor resides on the CPU.
  • Moving to GPU: .to(device) transfers the tensor to the selected device.
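A detail worth knowing: .to() is not limited to device moves; it can also change dtype, and both conversions can be combined in a single call. A small sketch (the half-precision choice here is just for illustration):

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.randn(3, 3)

# One call moves the tensor to the target device and casts it to float16
x_half_gpu = x.to(device, torch.float16)
print(x_half_gpu.device, x_half_gpu.dtype)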

3.2. Using .cuda() and .cpu()

PyTorch also provides .cuda() and .cpu() methods for moving tensors explicitly to GPU or CPU, respectively.

Example 3: Moving Tensors Using .cuda() and .cpu()

import torch

# Check CUDA availability
cuda_available = torch.cuda.is_available()
print("Is CUDA available?", cuda_available)

# Create a tensor on CPU
tensor_cpu = torch.randn(2, 2)
print("\nTensor on CPU:\n", tensor_cpu)

if cuda_available:
    # Move tensor to GPU
    tensor_gpu = tensor_cpu.cuda()
    print("\nTensor on GPU:\n", tensor_gpu)
    print("\nIs tensor_gpu on GPU?", tensor_gpu.is_cuda)
    
    # Move tensor back to CPU
    tensor_back = tensor_gpu.cpu()
    print("\nTensor moved back to CPU:\n", tensor_back)
    print("\nIs tensor_back on GPU?", tensor_back.is_cuda)

Expected Output (Example):

Is CUDA available? True

Tensor on CPU:
 tensor([[ 0.1234, -1.2345],
        [ 0.5678,  1.3456]])

Tensor on GPU:
 tensor([[ 0.1234, -1.2345],
        [ 0.5678,  1.3456]], device='cuda:0')

Is tensor_gpu on GPU? True

Tensor moved back to CPU:
 tensor([[ 0.1234, -1.2345],
        [ 0.5678,  1.3456]])

Is tensor_back on GPU? False

Explanation:

  • .cuda(): Moves the tensor to the GPU.
  • .cpu(): Moves the tensor back to the CPU.

Note: These methods are convenient shortcuts, but .to(device) is more flexible because the same code path works for any target device string (e.g., "cpu", "cuda", "cuda:1"). When a specific GPU index is needed, .cuda() also accepts it, as shown below.
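A small sketch of .cuda() with an explicit device index, guarded so the second transfer only runs on multi-GPU machines:

import torch

if torch.cuda.is_available():
    x = torch.randn(2, 2)

    x0 = x.cuda()        # defaults to the current CUDA device (usually cuda:0)
    print(x0.device)

    if torch.cuda.device_count() > 1:
        x1 = x.cuda(1)   # same effect as x.to("cuda:1")
        print(x1.device)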


4. Observing Performance Differences

One of the primary benefits of using GPUs is the acceleration of tensor operations. Let's observe the performance differences between CPU and GPU.

4.1. Benchmarking CPU vs GPU Operations

We'll perform a simple operation (matrix multiplication) on both CPU and GPU and measure the execution time.

Example 4: Benchmarking Matrix Multiplication

import torch
import time

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Create large tensors
size = 10000
tensor_a = torch.randn(size, size)
tensor_b = torch.randn(size, size)

if device.type == 'cuda':
    tensor_a = tensor_a.to(device)
    tensor_b = tensor_b.to(device)

# Warm-up (for GPU to initialize)
if device.type == 'cuda':
    torch.matmul(tensor_a, tensor_b)
    torch.cuda.synchronize()

# CPU Benchmark
if device.type != 'cuda':
    start_time = time.time()
    result_cpu = torch.matmul(tensor_a, tensor_b)
    end_time = time.time()
    print(f"\nCPU Matrix Multiplication Time: {end_time - start_time:.4f} seconds")
else:
    # Move tensors to CPU for comparison
    tensor_a_cpu = tensor_a.to('cpu')
    tensor_b_cpu = tensor_b.to('cpu')
    
    # CPU Benchmark
    start_time = time.time()
    result_cpu = torch.matmul(tensor_a_cpu, tensor_b_cpu)
    end_time = time.time()
    print(f"\nCPU Matrix Multiplication Time: {end_time - start_time:.4f} seconds")
    
    # GPU Benchmark
    start_time = time.time()
    result_gpu = torch.matmul(tensor_a, tensor_b)
    torch.cuda.synchronize()  # Wait for GPU operations to finish
    end_time = time.time()
    print(f"GPU Matrix Multiplication Time: {end_time - start_time:.4f} seconds")
    
    # Compare results (optional)
    difference = torch.abs(result_cpu - result_gpu.cpu()).max()
    print(f"Maximum difference between CPU and GPU results: {difference.item()}")

Expected Output (Example):

Using device: cuda

CPU Matrix Multiplication Time: 15.2345 seconds
GPU Matrix Multiplication Time: 0.4567 seconds
Maximum difference between CPU and GPU results: 0.0000

Explanation:

  • Tensor Size: Large tensors (10000x10000) are used to magnify performance differences.
  • Warm-up: For GPUs, a warm-up run initializes the CUDA context so that one-time startup costs do not skew the measurement.
  • Synchronization: CUDA kernels launch asynchronously, so torch.cuda.synchronize() ensures all GPU work has finished before the end time is recorded.
  • Performance Comparison: GPU operations are significantly faster than CPU operations for large tensor computations.
  • Result Comparison: Confirms that CPU and GPU results agree up to small floating-point rounding differences.
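For finer-grained GPU timing than time.time(), CUDA events measure elapsed time directly on the device. A sketch, assuming a CUDA GPU is available (the 4096x4096 size is arbitrary):

import torch

if torch.cuda.is_available():
    a = torch.randn(4096, 4096, device="cuda")
    b = torch.randn(4096, 4096, device="cuda")

    torch.matmul(a, b)           # warm-up
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    start.record()
    torch.matmul(a, b)
    end.record()

    torch.cuda.synchronize()     # wait until both events have completed
    print(f"GPU matmul time: {start.elapsed_time(end):.2f} ms")  # elapsed_time returns milliseconds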

4.2. Practical Example: Matrix Multiplication

Let's perform matrix multiplication on smaller tensors to see the speedup.

Example 5: Matrix Multiplication on Small Tensors

import torch
import time

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Create smaller tensors
size = 1000
tensor_a = torch.randn(size, size, device=device)
tensor_b = torch.randn(size, size, device=device)

# Warm-up
if device.type == 'cuda':
    torch.matmul(tensor_a, tensor_b)
    torch.cuda.synchronize()

# CPU Benchmark
if device.type != 'cuda':
    start_time = time.time()
    result_cpu = torch.matmul(tensor_a, tensor_b)
    end_time = time.time()
    print(f"\nCPU Matrix Multiplication Time: {end_time - start_time:.4f} seconds")
else:
    # Move tensors to CPU for comparison
    tensor_a_cpu = tensor_a.to('cpu')
    tensor_b_cpu = tensor_b.to('cpu')
    
    # CPU Benchmark
    start_time = time.time()
    result_cpu = torch.matmul(tensor_a_cpu, tensor_b_cpu)
    end_time = time.time()
    print(f"\nCPU Matrix Multiplication Time: {end_time - start_time:.4f} seconds")
    
    # GPU Benchmark
    start_time = time.time()
    result_gpu = torch.matmul(tensor_a, tensor_b)
    torch.cuda.synchronize()
    end_time = time.time()
    print(f"GPU Matrix Multiplication Time: {end_time - start_time:.4f} seconds")
    
    # Compare results (optional)
    difference = torch.abs(result_cpu - result_gpu.cpu()).max()
    print(f"Maximum difference between CPU and GPU results: {difference.item()}")

Expected Output (Example):

Using device: cuda

CPU Matrix Multiplication Time: 2.3456 seconds
GPU Matrix Multiplication Time: 0.1234 seconds
Maximum difference between CPU and GPU results: 0.0000

Explanation:

  • Smaller Tensors: Even at 1000x1000 the GPU is typically faster, although the relative advantage shrinks as tensors get smaller, because kernel-launch and transfer overheads take up a larger share of the runtime.
  • Consistency: Compares the GPU result against the CPU result; small floating-point differences are expected.

5. Moving Models to GPU

Just like tensors, neural network models need to be moved to the GPU to leverage acceleration during training and inference.

5.1. Transferring a Simple Neural Network

Example 6: Moving a Neural Network to GPU

import torch
import torch.nn as nn

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Define a simple neural network
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(10, 50)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(50, 1)
    
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

# Instantiate the model
model = SimpleNet()
print("\nModel on CPU:", next(model.parameters()).device)

# Move the model to GPU
if device.type == 'cuda':
    model = model.to(device)
    print("Model moved to GPU:", next(model.parameters()).device)
else:
    print("Model remains on CPU.")

Expected Output (Example):

Using device: cuda

Model on CPU: cpu
Model moved to GPU: cuda:0

Explanation:

  • Model Definition: A simple feedforward neural network with two linear layers and a ReLU activation.
  • Moving to GPU: .to(device) transfers all model parameters to the specified device.
  • Verification: Checks the device of the model's first parameter to confirm the transfer.
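One subtlety: calling .to(device) on a module moves its parameters in place (reassignment is optional), whereas calling it on a tensor returns a new tensor and leaves the original untouched. A small sketch:

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

layer = nn.Linear(4, 2)
t = torch.randn(2, 4)

layer.to(device)   # modules are modified in place; `layer = layer.to(device)` also works
t = t.to(device)   # tensors are NOT modified in place; keep the returned tensor

print(next(layer.parameters()).device, t.device)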

5.2. Training a Model on GPU

Example 7: Training a Simple Model on GPU

import torch
import torch.nn as nn
import torch.optim as optim

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Define a simple model
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(100, 50)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(50, 10)
    
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

# Instantiate the model and move to device
model = SimpleNet().to(device)

# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Dummy data
batch_size = 64
inputs = torch.randn(batch_size, 100).to(device)
labels = torch.randint(0, 10, (batch_size,)).to(device)

# Training loop
for epoch in range(5):
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    print(f"Epoch [{epoch+1}/5], Loss: {loss.item():.4f}")

Expected Output (Example):

Using device: cuda

Epoch [1/5], Loss: 2.3025
Epoch [2/5], Loss: 2.3024
Epoch [3/5], Loss: 2.3023
Epoch [4/5], Loss: 2.3022
Epoch [5/5], Loss: 2.3021

Explanation:

  • Model Training on GPU: Demonstrates how to perform a training loop with model and data on the GPU.
  • Loss Behavior: Because the inputs and labels are random, the loss is not expected to decrease meaningfully; the goal is to show that every step of the training loop runs on the GPU.
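In real training, data usually arrives in batches from a DataLoader, and each batch must be moved to the same device as the model. A minimal sketch using a dummy TensorDataset (all sizes here are arbitrary):

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Dummy dataset: 512 samples with 100 features each, 10 classes
dataset = TensorDataset(torch.randn(512, 100), torch.randint(0, 10, (512,)))
loader = DataLoader(dataset, batch_size=64, shuffle=True,
                    pin_memory=(device.type == "cuda"))

model = nn.Linear(100, 10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(2):
    for inputs, labels in loader:
        # Move each batch to the model's device
        inputs = inputs.to(device, non_blocking=True)
        labels = labels.to(device, non_blocking=True)

        loss = criterion(model(inputs), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch + 1}, last batch loss: {loss.item():.4f}")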

6. Handling Multiple GPUs

For larger models and datasets, utilizing multiple GPUs can further accelerate training. PyTorch provides utilities like DataParallel and DistributedDataParallel to facilitate multi-GPU training.

6.1. Checking for Multiple GPUs

Example 8: Detecting Multiple GPUs

import torch

# Check CUDA availability
cuda_available = torch.cuda.is_available()
print("Is CUDA available?", cuda_available)

if cuda_available:
    num_gpus = torch.cuda.device_count()
    print("Number of GPUs available:", num_gpus)
    for i in range(num_gpus):
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
else:
    print("No CUDA-compatible GPU found.")

Expected Output (Example):

Is CUDA available? True
Number of GPUs available: 2
GPU 0: NVIDIA GeForce RTX 3080
GPU 1: NVIDIA GeForce RTX 3070

Explanation:

  • Listing GPUs: Iterates through available GPUs and prints their names.
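If you also want to inspect each GPU's memory and compute capability, torch.cuda.get_device_properties exposes those details. A small sketch:

import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        # total_memory is reported in bytes
        print(f"GPU {i}: {props.name}, "
              f"{props.total_memory / 1024**3:.1f} GB, "
              f"compute capability {props.major}.{props.minor}")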

6.2. Utilizing Data Parallelism

Data Parallelism allows splitting the input data across multiple GPUs, processing each subset in parallel, and then combining the results. PyTorch's nn.DataParallel facilitates this process.

Example 9: Using nn.DataParallel for Multi-GPU Training

import torch
import torch.nn as nn
import torch.optim as optim

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Define a simple model
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(100, 50)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(50, 10)
    
    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

# Instantiate the model
model = SimpleNet()

if torch.cuda.device_count() > 1:
    print(f"Using {torch.cuda.device_count()} GPUs!")
    # Wrap the model with DataParallel
    model = nn.DataParallel(model)

# Move the model to device
model.to(device)

# Define loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Dummy data
batch_size = 128
inputs = torch.randn(batch_size, 100).to(device)
labels = torch.randint(0, 10, (batch_size,)).to(device)

# Training loop
for epoch in range(3):
    # Forward pass
    outputs = model(inputs)
    loss = criterion(outputs, labels)
    
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    print(f"Epoch [{epoch+1}/3], Loss: {loss.item():.4f}")

Expected Output (Example):

Using device: cuda
Using 2 GPUs!
Epoch [1/3], Loss: 2.3025
Epoch [2/3], Loss: 2.3024
Epoch [3/3], Loss: 2.3023

Explanation:

  • nn.DataParallel: Automatically splits the input across available GPUs, replicates the model on each GPU, and aggregates the results.
  • Model Wrapping: If multiple GPUs are detected, the model is wrapped with DataParallel.
  • Training: The training loop remains unchanged, but computations are distributed across GPUs.

Note: nn.DataParallel is easy to implement, but it runs in a single process, so it can be bottlenecked by the Python GIL and by replicating the model on every forward pass. For better scaling, including on a single multi-GPU machine, PyTorch recommends DistributedDataParallel; a minimal sketch follows.
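For reference, a minimal DistributedDataParallel sketch, assuming the script is saved to a file (the name ddp_example.py is hypothetical) and launched with torchrun --nproc_per_node=<num_gpus> ddp_example.py; torchrun sets the LOCAL_RANK environment variable for each process:

import os
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # one process per GPU
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)
    device = torch.device(f"cuda:{local_rank}")

    model = nn.Linear(100, 10).to(device)
    model = DDP(model, device_ids=[local_rank])

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    inputs = torch.randn(64, 100, device=device)
    labels = torch.randint(0, 10, (64,), device=device)

    loss = criterion(model(inputs), labels)
    optimizer.zero_grad()
    loss.backward()       # gradients are averaged across processes automatically
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()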


7. Best Practices for GPU Acceleration

To maximize the benefits of GPU acceleration, adhere to the following best practices:

7.1. Choosing Between view and reshape

  • Use view When:
    • The tensor is contiguous.
    • You require the fastest possible reshaping without data copying.
  • Use reshape When:
    • The tensor might be non-contiguous.
    • You prefer flexibility over raw speed.

Example 10: Choosing Between view and reshape

import torch

# Create a contiguous tensor
tensor_contig = torch.arange(12).reshape(3, 4)
print("\nContiguous Tensor Shape:", tensor_contig.shape)

# Use view
reshaped_view = tensor_contig.view(6, 2)
print("Reshaped with view:", reshaped_view.shape)

# Transpose to make it non-contiguous
tensor_non_contig = tensor_contig.transpose(0, 1)
print("\nNon-Contiguous Tensor Shape:", tensor_non_contig.shape)

# Attempt to use view (will fail)
try:
    reshaped_view = tensor_non_contig.view(6, 2)
except RuntimeError as e:
    print("Error with view on non-contiguous tensor:", e)

# Use reshape instead
reshaped_reshape = tensor_non_contig.reshape(6, 2)
print("Reshaped with reshape:", reshaped_reshape.shape)

Expected Output:

Contiguous Tensor Shape: torch.Size([3, 4])
Reshaped with view: torch.Size([6, 2])

Non-Contiguous Tensor Shape: torch.Size([4, 3])
Error with view on non-contiguous tensor: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
Reshaped with reshape: torch.Size([6, 2])
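If you specifically need view semantics on a non-contiguous tensor, calling .contiguous() first produces a contiguous copy on which view then works; a small sketch:

import torch

t = torch.arange(12).reshape(3, 4).transpose(0, 1)   # transposing makes it non-contiguous
print(t.is_contiguous())              # False

t_view = t.contiguous().view(6, 2)    # .contiguous() copies the data into a contiguous layout
print(t_view.shape)                   # torch.Size([6, 2])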

7.2. Minimizing Data Copies

Data transfer between CPU and GPU can be a bottleneck. To minimize data copies:

  • Batch Data Transfers: Move entire batches at once rather than individual samples.
  • Avoid Unnecessary Transfers: Keep data on the GPU once it's moved there, especially during training loops.
  • Use .to(device) Efficiently: Combine device and dtype changes in a single .to() call rather than chaining several conversions that each create an intermediate tensor.

Example 11: Minimizing Data Transfers

import torch
import time

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Create a large tensor
large_tensor = torch.randn(10000, 10000)

# Inefficient: Moving individual rows
start_time = time.time()
for row in large_tensor:
    row_gpu = row.to(device)
end_time = time.time()
print(f"Inefficient Data Transfer Time: {end_time - start_time:.4f} seconds")

# Efficient: Moving the entire tensor at once
start_time = time.time()
large_tensor_gpu = large_tensor.to(device)
end_time = time.time()
print(f"Efficient Data Transfer Time: {end_time - start_time:.4f} seconds")

Expected Output (Example):

Using device: cuda

Inefficient Data Transfer Time: 12.3456 seconds
Efficient Data Transfer Time: 0.1234 seconds

Explanation:

  • Inefficient Method: Moves each row individually, resulting in multiple data transfers and increased latency.
  • Efficient Method: Moves the entire tensor in a single operation, reducing overhead.
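When host-to-device transfers are on the critical path, pinned (page-locked) host memory speeds up copies and, with non_blocking=True, allows them to overlap with computation. A small sketch, assuming a CUDA GPU (the 5000x5000 size is arbitrary):

import torch
import time

if torch.cuda.is_available():
    device = torch.device("cuda")
    cpu_tensor = torch.randn(5000, 5000)

    # pin_memory() copies the tensor into page-locked host memory
    pinned = cpu_tensor.pin_memory()

    start = time.time()
    gpu_tensor = pinned.to(device, non_blocking=True)
    torch.cuda.synchronize()          # wait for the asynchronous copy to finish
    print(f"Pinned transfer time: {time.time() - start:.4f} seconds")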

7.3. Using In-Place Operations Wisely

In-place operations modify tensors without making copies, saving memory and potentially improving performance. However, they should be used cautiously to avoid disrupting gradient computations.

Example 12: Using In-Place Operations

import torch
import torch.nn as nn

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create a tensor with requires_grad=True
tensor = torch.randn(3, 3, requires_grad=True, device=device)
print("\nOriginal Tensor:\n", tensor)

# Apply ReLU (out-of-place; nn.ReLU() does not modify its input by default)
relu = nn.ReLU()
tensor_relu = relu(tensor)
print("\nTensor after ReLU:\n", tensor_relu)

# Apply in-place operation (unsafe)
try:
    tensor_relu += 1  # This modifies the tensor in-place
    print("\nTensor after in-place addition:\n", tensor_relu)
except RuntimeError as e:
    print("Error with in-place operation:", e)

Expected Output:

Original Tensor:
 tensor([[ 0.1234, -1.2345,  0.5678],
        [ 1.2345, -0.5678,  1.3456],
        [-0.9876,  0.6789, -1.4567]], device='cuda:0', requires_grad=True)

Tensor after ReLU:
 tensor([[0.1234, 0.0000, 0.5678],
        [1.2345, 0.0000, 1.3456],
        [0.0000, 0.6789, 0.0000]], device='cuda:0', grad_fn=<ReluBackward0>)

Tensor after in-place addition:
 tensor([[1.1234, 1.0000, 1.5678],
        [2.2345, 1.0000, 2.3456],
        [1.0000, 1.6789, 1.0000]], device='cuda:0', grad_fn=<AddBackward0>)

Caution:

  • In-place modifications can interfere with PyTorch's autograd mechanism, leading to unexpected behaviors during backpropagation.
  • Use in-place operations only when you are certain they won't affect gradient computations.
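The failure mode looks like this when autograd actually needs the overwritten values. A small sketch: sigmoid's backward pass reuses its output, so modifying that output in place causes backward() to raise an error:

import torch

x = torch.randn(3, requires_grad=True)
y = x.sigmoid()      # sigmoid's backward pass needs its output y
y.add_(1)            # in-place edit of a tensor autograd still needs

try:
    y.sum().backward()
except RuntimeError as e:
    print("Autograd error from in-place op:", e)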

8. Common Pitfalls and How to Avoid Them

While working with CUDA and GPUs in PyTorch, certain common pitfalls can hinder performance or cause errors. Being aware of these can save time and frustration.

8.1. Mismatched Devices

Attempting to perform operations on tensors located on different devices (CPU vs. GPU) will result in errors.

Example 13: Mismatched Devices

import torch

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Create tensors on different devices
tensor_cpu = torch.randn(2, 2)
tensor_gpu = torch.randn(2, 2).to(device)

# Attempt to add tensors
try:
    result = tensor_cpu + tensor_gpu
except RuntimeError as e:
    print("\nError with mismatched devices:", e)

Expected Output:

Error with mismatched devices: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Solution:

  • Ensure all tensors involved in operations are on the same device.
  • Use .to(device) to move tensors as needed.

Corrected Example:

# Move tensor_cpu to GPU
if device.type == 'cuda':
    tensor_cpu = tensor_cpu.to(device)

# Now add tensors
result = tensor_cpu + tensor_gpu
print("\nAddition Result:\n", result)

8.2. Forgetting to Move Models to GPU

Forgetting to transfer your model to the GPU will result in tensors being on different devices, causing errors during training.

Example 14: Forgetting to Move Model to GPU

import torch
import torch.nn as nn

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define a simple model
model = nn.Linear(10, 2)

# Create input tensor on GPU
input_tensor = torch.randn(5, 10).to(device)

# Attempt to pass input through model (model is on CPU)
try:
    output = model(input_tensor)
except RuntimeError as e:
    print("\nError due to model on CPU:", e)

Expected Output:

Error due to model on CPU: Input type (torch.cuda.FloatTensor) and weight type (torch.FloatTensor) should be the same

Solution:

  • Move the model to the desired device before training or inference.

Corrected Example:

# Move model to device
model.to(device)

# Now pass input through model
output = model(input_tensor)
print("\nModel Output:\n", output)

8.3. Excessive Data Transfers

Frequent data transfers between CPU and GPU can create significant overhead, negating the performance benefits of GPU acceleration.

Example 15: Excessive Data Transfers

import torch
import time

# Define device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Create a large tensor
large_tensor = torch.randn(1000, 1000)

# Start time
start_time = time.time()

# Move tensor to GPU and back to CPU repeatedly
for _ in range(100):
    tensor_gpu = large_tensor.to(device)
    tensor_cpu = tensor_gpu.to('cpu')

# End time
end_time = time.time()
print(f"\nTime taken for excessive data transfers: {end_time - start_time:.4f} seconds")

Expected Output (Example):

Using device: cuda

Time taken for excessive data transfers: 2.3456 seconds

Solution:

  • Move data to the GPU once and perform all necessary operations there.
  • Avoid moving data back to the CPU unless absolutely necessary.

Optimized Example:

# Move tensor to GPU once
if device.type == 'cuda':
    large_tensor = large_tensor.to(device)

# Start time
start_time = time.time()

# Perform operations on GPU without moving back and forth
for _ in range(100):
    result = large_tensor * 2  # Example operation

if device.type == 'cuda':
    torch.cuda.synchronize()  # Ensure all queued GPU work has finished before timing

# End time
end_time = time.time()
print(f"\nTime taken for operations on GPU: {end_time - start_time:.4f} seconds")

9. Exercises for Practice

Engaging with hands-on exercises will reinforce your understanding and ensure you can apply GPU acceleration techniques effectively.

9.1. Exercise 1: Tensor Operations on GPU

Task:

  1. Check if a GPU is available.
  2. Create a large tensor (e.g., 5000x5000) on the CPU.
  3. Move the tensor to the GPU.
  4. Perform a matrix multiplication operation on both CPU and GPU.
  5. Compare the execution times.

Solution:

import torch
import time

# Step 1: Check GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Step 2: Create a large tensor on CPU
size = 5000
tensor_a = torch.randn(size, size)
tensor_b = torch.randn(size, size)

# Step 3: Move tensors to GPU
if device.type == 'cuda':
    tensor_a_gpu = tensor_a.to(device)
    tensor_b_gpu = tensor_b.to(device)

# Step 4: Perform matrix multiplication on CPU
start_time = time.time()
result_cpu = torch.matmul(tensor_a, tensor_b)
end_time = time.time()
cpu_time = end_time - start_time
print(f"\nCPU Matrix Multiplication Time: {cpu_time:.4f} seconds")

if device.type == 'cuda':
    # Warm-up GPU
    torch.matmul(tensor_a_gpu, tensor_b_gpu)
    torch.cuda.synchronize()
    
    # Perform matrix multiplication on GPU
    start_time = time.time()
    result_gpu = torch.matmul(tensor_a_gpu, tensor_b_gpu)
    torch.cuda.synchronize()
    end_time = time.time()
    gpu_time = end_time - start_time
    print(f"GPU Matrix Multiplication Time: {gpu_time:.4f} seconds")
    
    # Compare results
    difference = torch.abs(result_cpu - result_gpu.cpu()).max()
    print(f"Maximum difference between CPU and GPU results: {difference.item()}")

Expected Output (Example):

Using device: cuda

CPU Matrix Multiplication Time: 12.3456 seconds
GPU Matrix Multiplication Time: 0.4567 seconds
Maximum difference between CPU and GPU results: 0.0000

9.2. Exercise 2: Moving a Model to GPU

Task:

  1. Define a simple neural network model.
  2. Check for GPU availability.
  3. Move the model to GPU if available.
  4. Create dummy input data and move it to the same device.
  5. Perform a forward pass and print the output device.

Solution:

import torch
import torch.nn as nn

# Step 1: Define a simple model
class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc = nn.Linear(100, 10)
    
    def forward(self, x):
        return self.fc(x)

# Instantiate the model
model = SimpleNet()

# Step 2: Check for GPU availability
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Step 3: Move the model to GPU
model.to(device)
print("Model is on device:", next(model.parameters()).device)

# Step 4: Create dummy input and move to device
input_tensor = torch.randn(5, 100).to(device)
print("Input tensor is on device:", input_tensor.device)

# Step 5: Perform forward pass
output = model(input_tensor)
print("Output tensor is on device:", output.device)

Expected Output (Example):

Using device: cuda
Model is on device: cuda:0
Input tensor is on device: cuda:0
Output tensor is on device: cuda:0

9.3. Exercise 3: Utilizing Multiple GPUs

Task:

  1. Check if multiple GPUs are available.
  2. Define a neural network model.
  3. Wrap the model with nn.DataParallel if multiple GPUs are available.
  4. Move the model to GPU.
  5. Create dummy input data and perform a forward pass.

Solution:

import torch
import torch.nn as nn

# Step 1: Check for multiple GPUs
cuda_available = torch.cuda.is_available()
num_gpus = torch.cuda.device_count()
print("CUDA Available:", cuda_available)
print("Number of GPUs:", num_gpus)

# Step 2: Define a model
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool2d(2)
        self.fc = nn.Linear(16*16*16, 10)  # Assuming input images are 32x32
    
    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        return x

# Instantiate the model
model = SimpleCNN()

# Step 3: Wrap with DataParallel if multiple GPUs
if cuda_available and num_gpus > 1:
    model = nn.DataParallel(model)
    print("Model wrapped with DataParallel")

# Step 4: Move the model to GPU
device = torch.device("cuda" if cuda_available else "cpu")
model.to(device)
print("Model is on device:", next(model.parameters()).device)

# Step 5: Create dummy input data
batch_size = 32
dummy_input = torch.randn(batch_size, 3, 32, 32).to(device)
print("Dummy input is on device:", dummy_input.device)

# Perform forward pass
output = model(dummy_input)
print("Output shape:", output.shape)
print("Output is on device:", output.device)

Expected Output (Example):

CUDA Available: True
Number of GPUs: 2
Model wrapped with DataParallel
Model is on device: cuda:0
Dummy input is on device: cuda:0
Output shape: torch.Size([32, 10])
Output is on device: cuda:0

Explanation:

  • DataParallel: Distributes the input batch across multiple GPUs, performs parallel computations, and gathers the results.
  • Model on cuda:0: The model is primarily on the first GPU, with DataParallel handling the distribution.

10. Summary

  • CUDA and GPU Acceleration:
    • CUDA enables parallel computing on NVIDIA GPUs.
    • GPUs offer significant speedups for tensor operations and deep learning tasks.
  • Checking GPU Availability:
    • Use torch.cuda.is_available() to verify CUDA support.
    • Use torch.cuda.device_count() and torch.cuda.get_device_name() to identify available GPUs.
  • Moving Tensors and Models:
    • Use .to(device), .cuda(), and .cpu() to transfer tensors and models between CPU and GPU.
    • Ensure all components (data and model) are on the same device to avoid errors.
  • Performance Benefits:
    • GPUs accelerate compute-intensive operations like matrix multiplications and convolutions.
    • Benchmarking demonstrates significant speedups on GPUs.
  • Handling Multiple GPUs:
    • Utilize nn.DataParallel for simple multi-GPU training.
    • For advanced use-cases, consider DistributedDataParallel.
  • Best Practices:
    • Minimize data transfers between CPU and GPU.
    • Choose between view and reshape based on tensor contiguity.
    • Use in-place operations judiciously to avoid disrupting gradients.
  • Common Pitfalls:
    • Mismatched device locations between tensors and models.
    • Excessive data transfers leading to performance bottlenecks.
    • Ignoring tensor contiguity requirements for certain operations.

By mastering these GPU acceleration techniques, you'll be well-equipped to optimize your deep learning workflows, achieve faster training times, and handle larger models and datasets efficiently.


11. Additional Resources

To further enhance your understanding of GPU acceleration and CUDA in PyTorch, the official PyTorch documentation is the best starting point, in particular the CUDA semantics notes and the torch.cuda API reference.

Tips for Learning:

  • Hands-On Practice: Implement the provided code examples and experiment with different tensor sizes and models.
  • Engage with the Community: Participate in forums, ask questions, and seek feedback on your implementations.
  • Build Projects: Apply GPU acceleration techniques in real-world projects to understand their practical applications and benefits.
  • Stay Updated: Follow PyTorch's official channels for the latest updates, best practices, and new features.

By leveraging these resources and actively practicing, you'll develop a robust understanding of GPU acceleration in PyTorch, enabling you to build and train efficient deep learning models.

Happy Coding!