Day 08: Introduction to Neural Networks

Table of Contents

  1. Introduction
  2. Basics of Neural Networks
    • 2.1. What is a Neural Network?
    • 2.2. Components of a Neural Network
      • 2.2.1. Neurons
      • 2.2.2. Layers
      • 2.2.3. Activation Functions
  3. Understanding Layers and Activation Functions
    • 3.1. Types of Layers
      • 3.1.1. Input Layer
      • 3.1.2. Hidden Layers
      • 3.1.3. Output Layer
    • 3.2. Common Activation Functions
      • 3.2.1. Sigmoid
      • 3.2.2. Tanh
      • 3.2.3. ReLU
  4. Implementing a Basic Neural Network from Scratch Using PyTorch Tensors
    • 4.1. Network Architecture
    • 4.2. Forward Pass
    • 4.3. Loss Calculation
    • 4.4. Backward Pass (Gradient Calculation)
    • 4.5. Parameter Update
    • 4.6. Training Loop
  5. Practical Exercises
    • 5.1. Exercise 1: Building the Neural Network
    • 5.2. Exercise 2: Training the Neural Network
  6. Solutions and Explanations
    • 6.1. Solutions to Practice Exercises
      • 6.1.1. Exercise 1: Building the Neural Network
      • 6.1.2. Exercise 2: Training the Neural Network
  7. Summary
  8. Additional Resources

1. Introduction

Neural networks are the cornerstone of modern deep learning, enabling machines to learn complex patterns from data. They power applications such as image recognition, speech understanding, and machine translation. Today's lesson covers the fundamental structure of neural networks and implements a basic one from scratch using PyTorch tensors, laying the groundwork for more advanced topics.


2. Basics of Neural Networks

2.1. What is a Neural Network?

A Neural Network is a computational model inspired by the human brain's interconnected network of neurons. It consists of layers of nodes (neurons) that process data, learn from it, and make predictions or decisions. Neural networks are particularly adept at capturing and modeling non-linear relationships in data, making them powerful tools for a wide range of tasks.

Key Characteristics:

  • Layered Structure: Composed of an input layer, one or more hidden layers, and an output layer.
  • Learning Capability: Adjusts its parameters (weights and biases) based on the data to minimize prediction errors.
  • Non-Linearity: Uses activation functions to introduce non-linear transformations, enabling the modeling of complex patterns.

2.2. Components of a Neural Network

Understanding the core components of a neural network is essential for building and designing effective models.

2.2.1. Neurons

  • Definition: The basic units of a neural network that perform computations.
  • Function: Each neuron receives input data, processes it using its weights and biases, applies an activation function, and passes the output to the next layer.

Visualization:

Input Data -> Neuron -> Activation Function -> Output
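
To make the neuron's computation concrete, here is a minimal sketch of a single neuron with two inputs. The input values, weights, and bias are made-up numbers chosen only for illustration, and ReLU is picked arbitrarily as the activation.

```python
import torch

# Hypothetical values, chosen only for illustration
inputs = torch.tensor([1.0, 2.0])       # two input features
weights = torch.tensor([0.5, -0.25])    # one weight per input
bias = torch.tensor(0.1)

# Weighted sum of the inputs plus the bias: 0.5*1.0 + (-0.25)*2.0 + 0.1
z = torch.dot(weights, inputs) + bias

# The activation function turns the weighted sum into the neuron's output
output = torch.relu(z)

print("Weighted sum:", z.item())
print("Neuron output:", output.item())
```

A layer is many such neurons evaluated at once, which is why the later code uses matrix multiplication instead of a dot product.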

2.2.2. Layers

  • Input Layer: The first layer that receives raw data.
  • Hidden Layers: Intermediate layers that process inputs received from the previous layers.
  • Output Layer: The final layer that produces the network's predictions or classifications.

2.2.3. Activation Functions

  • Definition: Mathematical functions applied to the output of each neuron to introduce non-linearity.
  • Purpose: Enable the network to model complex, non-linear relationships in data.
  • Common Activation Functions: Sigmoid, Tanh, ReLU (Rectified Linear Unit).

3. Understanding Layers and Activation Functions

3.1. Types of Layers

3.1.1. Input Layer

  • Role: Receives the raw input data.
  • Characteristics: Does not perform computations but passes data to the first hidden layer.

Example:

For an image with 28x28 pixels, the input layer might have 784 neurons (28 * 28).
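
As a small sketch of how a 28x28 image becomes 784 input values, the snippet below flattens a placeholder image and a placeholder batch of 32 images. The zero-filled tensors and the batch size are assumptions made for illustration.

```python
import torch

# Placeholder grayscale image (all zeros), used only to show the shapes
image = torch.zeros(28, 28)
flat = image.reshape(-1)                 # one value per pixel
print(flat.shape)

# A batch of images is flattened per sample, keeping the batch dimension
batch = torch.zeros(32, 28, 28)
flat_batch = batch.reshape(batch.shape[0], -1)
print(flat_batch.shape)
```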

3.1.2. Hidden Layers

  • Role: Perform computations and extract features from the input data.
  • Characteristics: Can have varying numbers of neurons; multiple hidden layers can capture more complex patterns.

Example:

A hidden layer with 128 neurons processes the features from the input layer.

3.1.3. Output Layer

  • Role: Produces the final prediction or classification.
  • Characteristics: The number of neurons corresponds to the number of target classes or output dimensions.

Example:

For binary classification, the output layer might have 1 neuron with a Sigmoid activation function.

3.2. Common Activation Functions

3.2.1. Sigmoid

  • Formula: $$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
  • Range: (0, 1)
  • Use Case: Commonly used in binary classification problems.
  • Pros: Smooth gradient; output values bounded between 0 and 1.
  • Cons: Prone to vanishing gradients, not zero-centered.

Example:

import torch
import torch.nn.functional as F

x = torch.tensor([0.0, 2.0, -2.0])
sigmoid_output = F.sigmoid(x)
print("Sigmoid Output:", sigmoid_output)
# Output: tensor([0.5000, 0.8808, 0.1192])
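
The vanishing-gradient drawback can be seen by evaluating the Sigmoid derivative, σ(x)(1 - σ(x)), at a few points. The sketch below uses a helper named sigmoid_grad, which is defined here for illustration only; the sample inputs are arbitrary.

```python
import torch

def sigmoid_grad(x):
    # Derivative of the sigmoid: s * (1 - s), where s = sigmoid(x)
    s = torch.sigmoid(x)
    return s * (1 - s)

x = torch.tensor([0.0, 2.0, 10.0])
grads = sigmoid_grad(x)

# The derivative peaks at 0.25 (at x = 0) and shrinks quickly as |x| grows
print("Sigmoid gradients:", grads)
```

Because each Sigmoid layer multiplies the backpropagated gradient by at most 0.25, stacking many of them can make gradients in early layers very small.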

3.2.2. Tanh

Formula:

$$
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}
$$

  • Range: (-1, 1)
  • Use Case: Often used in hidden layers.
  • Pros: Zero-centered; larger gradients near zero than Sigmoid (maximum slope of 1 versus 0.25).
  • Cons: Suffers from the same vanishing gradient problem as Sigmoid when |x| is large.

Example:

tanh_output = F.tanh(x)
print("Tanh Output:", tanh_output)
# Output: tensor([ 0.0000, 0.9640, -0.9640])

3.2.3. ReLU (Rectified Linear Unit)

  • Formula: $$ \text{ReLU}(x) = \max(0, x) $$
  • Range: [0, ∞)
  • Use Case: Widely used in hidden layers of modern neural networks.
  • Pros: Mitigates the vanishing gradient problem (the gradient is exactly 1 for positive inputs), computationally efficient.
  • Cons: Can lead to "dying ReLU" where neurons stop activating.

Example:

relu_output = F.relu(x)
print("ReLU Output:", relu_output)
# Output: tensor([0., 2., 0.])

4. Implementing a Basic Neural Network from Scratch Using PyTorch Tensors

In this section, we'll build a simple neural network from scratch without using PyTorch's high-level APIs like torch.nn or torch.optim. This exercise will deepen your understanding of the underlying mechanics of neural networks.

4.1. Network Architecture

We'll construct a neural network with the following architecture:

  • Input Layer: 2 neurons
  • Hidden Layer: 2 neurons
  • Output Layer: 1 neuron

This architecture is sufficient for a simple binary classification task, such as the logical XOR problem.

Visualization:

Input Layer (2 neurons)
        |
        v
Hidden Layer (2 neurons) -- Activation Function (ReLU)
        |
        v
Output Layer (1 neuron) -- Activation Function (Sigmoid)
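
To illustrate that a 2-2-1 network with these activations can represent XOR, the sketch below plugs in hand-picked weights rather than learned ones. The specific parameter values are an assumption chosen for illustration; training is not guaranteed to find these or any equivalent values.

```python
import torch

X = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([[0.0], [1.0], [1.0], [0.0]])

# Hand-picked (not learned) parameters, for illustration only
W1 = torch.tensor([[1.0, 1.0], [1.0, 1.0]])   # shape (hidden, input)
b1 = torch.tensor([[0.0, -1.0]])              # shape (1, hidden)
W2 = torch.tensor([[10.0, -20.0]])            # shape (output, hidden)
b2 = torch.tensor([[-5.0]])                   # shape (1, output)

hidden = torch.relu(X @ W1.T + b1)            # shape (4, 2)
y_prob = torch.sigmoid(hidden @ W2.T + b2)    # shape (4, 1)
y_hat = (y_prob > 0.5).float()

print("Probabilities:", y_prob.flatten())
print("Predictions:  ", y_hat.flatten())
```

The first hidden neuron counts how many inputs are on, and the second fires only when both are on; subtracting twice the second from the first cancels the (1, 1) case.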

4.2. Forward Pass

The forward pass involves passing input data through the network to obtain predictions.
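
For the 2-2-1 architecture above, the forward pass over a batch X can be written compactly as follows. The symbols Z_1, A_1, and Z_2 are labels introduced here for illustration; W_1, b_1, W_2, b_2 denote the hidden-layer and output-layer weights and biases.

$$
Z_1 = X W_1^{\top} + b_1^{\top}, \qquad A_1 = \text{ReLU}(Z_1), \qquad Z_2 = A_1 W_2^{\top} + b_2^{\top}, \qquad \hat{y} = \sigma(Z_2)
$$

With four samples, X has shape (4, 2), W_1 is (2, 2), b_1 is (2, 1), W_2 is (1, 2), and b_2 is (1, 1), so Z_1 and A_1 have shape (4, 2) while Z_2 and the predictions have shape (4, 1).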

4.3. Loss Calculation

We'll use the Binary Cross-Entropy (BCE) loss function to measure the discrepancy between the predicted outputs and actual targets.
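
As a quick sanity check of the BCE formula, the sketch below computes it by hand and compares it with PyTorch's built-in F.binary_cross_entropy. The built-in is used here only as a cross-check, and the prediction values are made up for illustration.

```python
import torch
import torch.nn.functional as F

y_true = torch.tensor([[0.0], [1.0], [1.0], [0.0]])
y_pred = torch.tensor([[0.1], [0.8], [0.6], [0.3]])   # made-up predictions

# Mean over samples of -[y * log(p) + (1 - y) * log(1 - p)]
manual = -(y_true * torch.log(y_pred)
           + (1 - y_true) * torch.log(1 - y_pred)).mean()

# Built-in implementation with the default 'mean' reduction
builtin = F.binary_cross_entropy(y_pred, y_true)

print("Manual BCE: ", manual.item())
print("Built-in BCE:", builtin.item())
```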

4.4. Backward Pass (Gradient Calculation)

The backward pass computes the gradients of the loss with respect to each parameter (weights and biases) using the chain rule.
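
For a single sample with loss $L = -[y \log \hat{y} + (1 - y) \log(1 - \hat{y})]$ and prediction $\hat{y} = \sigma(z)$, the chain rule at the output layer works out as follows. Here z denotes the output neuron's input before the Sigmoid; the symbol is introduced for illustration.

$$
\frac{\partial L}{\partial \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}, \qquad
\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y}), \qquad
\frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} = \hat{y} - y
$$

The factor $\hat{y}(1 - \hat{y})$ cancels, so the gradient with respect to z is simply the prediction error.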

4.5. Parameter Update

We'll update the network's parameters using Gradient Descent to minimize the loss.
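
The update rule is parameter = parameter - learning_rate * gradient. A minimal sketch on a one-parameter toy function, f(w) = (w - 3)^2, shows the idea; the function and starting point are assumptions chosen for illustration and are unrelated to the network itself.

```python
# Minimize f(w) = (w - 3)^2 with plain gradient descent
w = 0.0
learning_rate = 0.1

for step in range(100):
    grad = 2 * (w - 3.0)          # derivative of (w - 3)^2 with respect to w
    w -= learning_rate * grad     # step against the gradient

print("w after 100 steps:", w)
```

Each step shrinks the distance to the minimum at w = 3 by a constant factor, so w approaches 3 steadily.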

4.6. Training Loop

We'll iterate over multiple epochs, performing forward and backward passes and updating parameters in each epoch.


Complete Code Example with Line-by-Line Explanations

import torch

# Set seed for reproducibility
torch.manual_seed(42)

# Define the Sigmoid activation function
def sigmoid(x):
    return 1 / (1 + torch.exp(-x))

# Define the derivative of the Sigmoid function
def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))

# Define the ReLU activation function
def relu(x):
    return torch.maximum(torch.zeros_like(x), x)

# Define the derivative of the ReLU function
def relu_derivative(x):
    return (x > 0).float()

# Define the binary cross-entropy loss function
def binary_cross_entropy(y_pred, y_true):
    # Clamp predictions away from 0 and 1 to prevent log(0).
    # 1e-7 is used because float32 rounds 1 - 1e-15 to exactly 1.0,
    # which would leave the upper bound ineffective.
    epsilon = 1e-7
    y_pred = torch.clamp(y_pred, epsilon, 1 - epsilon)
    loss = - (y_true * torch.log(y_pred) + (1 - y_true) * torch.log(1 - y_pred))
    return torch.mean(loss)

# Define the per-sample derivative of the binary cross-entropy loss w.r.t. y_pred
# (the Sigmoid derivative is applied separately in the backward pass)
def binary_cross_entropy_derivative(y_pred, y_true):
    # Clamp to prevent division by zero
    epsilon = 1e-7
    y_pred = torch.clamp(y_pred, epsilon, 1 - epsilon)
    return (y_pred - y_true) / (y_pred * (1 - y_pred))

# Initialize input data (X) and target labels (y)
# Example: Logical XOR problem
X = torch.tensor([[0.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 0.0],
                  [1.0, 1.0]])

y = torch.tensor([[0.0],
                  [1.0],
                  [1.0],
                  [0.0]])

# Initialize weights and biases for the hidden layer
# Hidden layer has 2 neurons and takes 2 inputs
W1 = torch.randn(2, 2, requires_grad=False) * 0.1
b1 = torch.zeros(2, 1, requires_grad=False)

# Initialize weights and biases for the output layer
# Output layer has 1 neuron and takes 2 inputs from the hidden layer
W2 = torch.randn(1, 2, requires_grad=False) * 0.1
b2 = torch.zeros(1, 1, requires_grad=False)

# Define the learning rate
learning_rate = 0.1

# Number of epochs for training
epochs = 10000

# Training loop
for epoch in range(epochs):
    # =====================
    # Forward Pass
    # =====================
    
    # Calculate hidden layer input for the whole batch: X @ W1.T + b1.T
    # X has shape (4, 2) and W1 has shape (2, 2), so X @ W1.T has shape (4, 2)
    # b1.T has shape (1, 2) and is broadcast across the 4 samples
    hidden_input = torch.matmul(X, W1.T) + b1.T  # Shape: (4, 2)
    
    # Apply ReLU activation
    hidden_output = relu(hidden_input)  # Shape: (4, 2)
    
    # Calculate output layer input for the whole batch: hidden_output @ W2.T + b2.T
    # hidden_output has shape (4, 2) and W2 has shape (1, 2), so the product has shape (4, 1)
    # b2.T has shape (1, 1) and is broadcast across the 4 samples
    output_input = torch.matmul(hidden_output, W2.T) + b2.T  # Shape: (4, 1)
    
    # Apply Sigmoid activation to get the final output
    y_pred = sigmoid(output_input)  # Shape: (4, 1)
    
    # =====================
    # Loss Calculation
    # =====================
    
    loss = binary_cross_entropy(y_pred, y)  # Scalar
    
    # =====================
    # Backward Pass
    # =====================
    
    # Compute derivative of loss w.r.t y_pred
    dL_dy_pred = binary_cross_entropy_derivative(y_pred, y)  # Shape: (4, 1)
    
    # Compute derivative of Sigmoid activation
    dy_pred_doutput_input = sigmoid_derivative(output_input)  # Shape: (4, 1)
    
    # Chain rule to get derivative of loss w.r.t output_input
    dL_doutput_input = dL_dy_pred * dy_pred_doutput_input  # Shape: (4, 1)
    
    # Compute derivative of loss w.r.t W2 and b2
    # W2 has shape (1, 2), hidden_output has shape (4, 2)
    dL_dW2 = torch.matmul(dL_doutput_input.T, hidden_output) / X.shape[0]  # Shape: (1, 2)
    dL_db2 = torch.mean(dL_doutput_input, dim=0, keepdim=True)  # Shape: (1, 1)
    
    # Compute derivative of loss w.r.t hidden_output
    dL_dhidden_output = torch.matmul(dL_doutput_input, W2)  # Shape: (4, 2)
    
    # Compute derivative of ReLU activation
    dhidden_output_dhidden_input = relu_derivative(hidden_input)  # Shape: (4, 2)
    
    # Chain rule to get derivative of loss w.r.t hidden_input
    dL_dhidden_input = dL_dhidden_output * dhidden_output_dhidden_input  # Shape: (4, 2)
    
    # Compute derivative of loss w.r.t W1 and b1
    # W1 has shape (2, 2), X has shape (4, 2)
    dL_dW1 = torch.matmul(dL_dhidden_input.T, X) / X.shape[0]  # Shape: (2, 2)
    dL_db1 = torch.mean(dL_dhidden_input, dim=0, keepdim=True)  # Shape: (1, 2)
    
    # =====================
    # Parameter Update
    # =====================
    
    W2 -= learning_rate * dL_dW2  # Shape: (1, 2)
    b2 -= learning_rate * dL_db2  # Shape: (1, 1)
    W1 -= learning_rate * dL_dW1  # Shape: (2, 2)
    b1 -= learning_rate * dL_db1.T  # dL_db1 is (1, 2); transpose to match b1's shape (2, 1)
    
    # =====================
    # Logging
    # =====================
    
    if (epoch + 1) % 1000 == 0:
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}")

# =====================
# Evaluation After Training
# =====================

# Forward pass to get final predictions
hidden_input = torch.matmul(X, W1.T) + b1.T
hidden_output = relu(hidden_input)
output_input = torch.matmul(hidden_output, W2.T) + b2.T
y_pred = sigmoid(output_input)

# Binarize predictions
y_pred_binary = (y_pred > 0.5).float()

print("\nFinal Predictions:\n", y_pred_binary)
print("Actual Labels:\n", y)
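
One way to gain confidence in hand-written gradients is to compare them against PyTorch's autograd on the same forward pass. The sketch below is a standalone check under stated assumptions: it uses freshly initialized parameters with requires_grad=True (not the trained ones), an arbitrary seed, and the simplified output-layer gradient y_pred - y.

```python
import torch

torch.manual_seed(0)

X = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([[0.0], [1.0], [1.0], [0.0]])

# Fresh parameters with autograd enabled (used only for this check)
W1 = torch.randn(2, 2, requires_grad=True)
b1 = torch.zeros(2, 1, requires_grad=True)
W2 = torch.randn(1, 2, requires_grad=True)
b2 = torch.zeros(1, 1, requires_grad=True)

# Forward pass with the same structure as the training loop
hidden_input = X @ W1.T + b1.T
hidden_output = torch.relu(hidden_input)
y_pred = torch.sigmoid(hidden_output @ W2.T + b2.T)
loss = -(y * torch.log(y_pred) + (1 - y) * torch.log(1 - y_pred)).mean()

# Reference gradients from autograd
loss.backward()

# Manual gradients, mirroring the backward pass in the training loop
with torch.no_grad():
    n = X.shape[0]
    dz2 = y_pred - y                                  # (4, 1)
    dW2 = dz2.T @ hidden_output / n                   # (1, 2)
    db2 = dz2.mean(dim=0, keepdim=True)               # (1, 1)
    dz1 = (dz2 @ W2) * (hidden_input > 0).float()     # (4, 2)
    dW1 = dz1.T @ X / n                               # (2, 2)
    db1 = dz1.mean(dim=0, keepdim=True).T             # (2, 1), matches b1

for name, manual, auto in [("W1", dW1, W1.grad), ("b1", db1, b1.grad),
                           ("W2", dW2, W2.grad), ("b2", db2, b2.grad)]:
    print(name, "matches autograd:", torch.allclose(manual, auto, atol=1e-5))
```

If any line reports a mismatch, the corresponding manual formula or its shape handling is the first place to look.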

Line-by-Line Explanation

Let's break down the above code to understand each component and operation.

import torch
  • Import PyTorch: This line imports the PyTorch library, which provides the necessary functions and classes for tensor operations and neural network implementations.
torch.manual_seed(42)
  • Set Seed for Reproducibility: Setting a manual seed ensures that the randomly initialized weights are the same every time the script is run, facilitating consistent results during experimentation.
def sigmoid(x):
    return 1 / (1 + torch.exp(-x))
  • Define Sigmoid Function: This function applies the Sigmoid activation to input x, squashing the output between 0 and 1, which is ideal for binary classification.
def sigmoid_derivative(x):
    return sigmoid(x) * (1 - sigmoid(x))
  • Define Derivative of Sigmoid: This function computes the derivative of the Sigmoid function, essential for backpropagation during gradient calculations.
def relu(x):
    return torch.maximum(torch.zeros_like(x), x)
  • Define ReLU Function: Applies the Rectified Linear Unit activation, setting all negative values in x to zero and keeping positive values unchanged.
def relu_derivative(x):
    return (x > 0).float()
  • Define Derivative of ReLU: Computes the derivative of ReLU, where the gradient is 1 for positive x and 0 otherwise.
def binary_cross_entropy(y_pred, y_true):
    epsilon = 1e-7
    y_pred = torch.clamp(y_pred, epsilon, 1 - epsilon)
    loss = - (y_true * torch.log(y_pred) + (1 - y_true) * torch.log(1 - y_pred))
    return torch.mean(loss)
  • Define Binary Cross-Entropy Loss:
    • Clamping: Prevents taking the logarithm of 0 by keeping y_pred within [epsilon, 1 - epsilon]. The value 1e-7 is used because float32 rounds 1 - 1e-15 to exactly 1.0, which would make the upper bound ineffective.
    • Loss Calculation: Computes the BCE loss for each sample and returns the mean loss across all samples.
def binary_cross_entropy_derivative(y_pred, y_true):
    epsilon = 1e-7
    y_pred = torch.clamp(y_pred, epsilon, 1 - epsilon)
    return (y_pred - y_true) / (y_pred * (1 - y_pred))
  • Define Derivative of Binary Cross-Entropy Loss w.r.t. Predictions:
    • Clamping: Keeps the denominator y_pred * (1 - y_pred) away from zero.
    • Derivative Calculation: Computes the per-sample gradient of the loss with respect to y_pred; the Sigmoid derivative is applied separately in the backward pass.
# Initialize input data (X) and target labels (y)
# Example: Logical XOR problem
X = torch.tensor([[0.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 0.0],
                  [1.0, 1.0]])

y = torch.tensor([[0.0],
                  [1.0],
                  [1.0],
                  [0.0]])
  • Define Dataset:
    • Input Data (X): Represents the four possible inputs for the XOR problem.
    • Target Labels (y): Represents the expected outputs for each input in X.
# Initialize weights and biases for the hidden layer
# Hidden layer has 2 neurons and takes 2 inputs
W1 = torch.randn(2, 2, requires_grad=False) * 0.1
b1 = torch.zeros(2, 1, requires_grad=False)
  • Initialize Weights and Biases for Hidden Layer:
    • Weights (W1): Randomly initialized with small values (scaled by 0.1) for the two hidden neurons.
    • Biases (b1): Initialized to zeros for the hidden layer neurons.
    • Note: requires_grad=False since we're manually handling gradient computations.
# Initialize weights and biases for the output layer
# Output layer has 1 neuron and takes 2 inputs from the hidden layer
W2 = torch.randn(1, 2, requires_grad=False) * 0.1
b2 = torch.zeros(1, 1, requires_grad=False)
  • Initialize Weights and Biases for Output Layer:
    • Weights (W2): Randomly initialized with small values for the single output neuron.
    • Biases (b2): Initialized to zeros for the output neuron.
# Define the learning rate
learning_rate = 0.1
  • Set Learning Rate: Determines the step size during parameter updates.
# Number of epochs for training
epochs = 10000
  • Set Number of Training Iterations: The model will undergo 10,000 training iterations to learn the XOR problem.
# Training loop
for epoch in range(epochs):
  • Start Training Loop: Iterates over the number of epochs to train the network.
    # =====================
    # Forward Pass
    # =====================
  • Forward Pass Section: Begins the process of computing the network's predictions.
    # Calculate hidden layer input for the whole batch: X @ W1.T + b1.T
    # X has shape (4, 2) and W1 has shape (2, 2), so X @ W1.T has shape (4, 2)
    # b1.T has shape (1, 2) and is broadcast across the 4 samples
    hidden_input = torch.matmul(X, W1.T) + b1.T  # Shape: (4, 2)
  • Compute Hidden Layer Inputs:
    • Matrix Multiplication (torch.matmul): Multiplies input X with the transpose of weights W1 to align dimensions.
    • Adding Biases (b1.T): Transposes biases to match the shape and adds them to the result.
    • Result (hidden_input): The raw input to the hidden layer neurons.
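
The bias addition relies on broadcasting: b1.T has shape (1, 2) and is added to every row of the (4, 2) product. The sketch below shows this with placeholder tensors; the zero weights and the bias values 1.0 and 2.0 are assumptions chosen so the effect is easy to read.

```python
import torch

X = torch.zeros(4, 2)                  # four samples, two features
W1 = torch.zeros(2, 2)                 # zero weights so only the bias shows up
b1 = torch.tensor([[1.0], [2.0]])      # shape (2, 1): one bias per hidden neuron

# (4, 2) + (1, 2): the bias row is repeated for each of the 4 samples
out = X @ W1.T + b1.T

print(out.shape)
print(out)
```

Every row of the result equals the bias row, confirming that each sample receives the same per-neuron bias.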
    # Apply ReLU activation
    hidden_output = relu(hidden_input)  # Shape: (4, 2)
  • Apply Activation Function:
    • ReLU (relu): Introduces non-linearity by setting negative values to zero.
    • Result (hidden_output): The activated output from the hidden layer.
    # Calculate output layer input for the whole batch: hidden_output @ W2.T + b2.T
    # hidden_output has shape (4, 2) and W2 has shape (1, 2), so the product has shape (4, 1)
    # b2.T has shape (1, 1) and is broadcast across the 4 samples
    output_input = torch.matmul(hidden_output, W2.T) + b2.T  # Shape: (4, 1)
  • Compute Output Layer Inputs:
    • Matrix Multiplication (torch.matmul): Multiplies hidden layer outputs with the transpose of weights W2.
    • Adding Biases (b2.T): Adds the output layer biases.
    • Result (output_input): The raw input to the output neuron.
    # Apply Sigmoid activation to get the final output
    y_pred = sigmoid(output_input)  # Shape: (4, 1)
  • Apply Activation Function:
    • Sigmoid (sigmoid): Squashes the output between 0 and 1, suitable for binary classification.
    • Result (y_pred): The final predictions from the network.
    # =====================
    # Loss Calculation
    # =====================
    
    loss = binary_cross_entropy(y_pred, y)  # Scalar
  • Calculate Loss:
    • Binary Cross-Entropy (binary_cross_entropy): Measures the difference between the predicted outputs and actual labels.
    • Result (loss): A scalar value representing the mean loss over all samples.
    # =====================
    # Backward Pass
    # =====================
    
    # Compute derivative of loss w.r.t y_pred
    dL_dy_pred = binary_cross_entropy_derivative(y_pred, y)  # Shape: (4, 1)
  • Compute Gradient of Loss w.r.t Predictions:
    • Derivative Calculation: Determines how the loss changes with respect to the predictions.
    # Compute derivative of Sigmoid activation
    dy_pred_doutput_input = sigmoid_derivative(output_input)  # Shape: (4, 1)
  • Compute Gradient of Sigmoid Activation:
    • Derivative of Sigmoid: Determines how the output of the Sigmoid function changes with respect to its input.
    # Chain rule to get derivative of loss w.r.t output_input
    dL_doutput_input = dL_dy_pred * dy_pred_doutput_input  # Shape: (4, 1)
  • Apply Chain Rule:
    • Chain Rule: Combines the gradients of the loss with respect to the output and the activation function to obtain the gradient with respect to the output layer input.
    # Compute derivative of loss w.r.t W2 and b2
    # W2 has shape (1, 2), hidden_output has shape (4, 2)
    dL_dW2 = torch.matmul(dL_doutput_input.T, hidden_output) / X.shape[0]  # Shape: (1, 2)
    dL_db2 = torch.mean(dL_doutput_input, dim=0, keepdim=True)  # Shape: (1, 1)
  • Compute Gradients for Output Layer Parameters:
    • Gradient w.r.t Weights (dL_dW2): Calculated by multiplying the transpose of dL_doutput_input with hidden_output and averaging over the batch size.
    • Gradient w.r.t Biases (dL_db2): Calculated by taking the mean of dL_doutput_input across all samples.
    # Compute derivative of loss w.r.t hidden_output
    dL_dhidden_output = torch.matmul(dL_doutput_input, W2)  # Shape: (4, 2)
  • Compute Gradient of Loss w.r.t Hidden Layer Output:
    • Matrix Multiplication: Propagates the gradient back from the output layer to the hidden layer by multiplying dL_doutput_input with weights W2.
    # Compute derivative of ReLU activation
    dhidden_output_dhidden_input = relu_derivative(hidden_input)  # Shape: (4, 2)
  • Compute Gradient of ReLU Activation:
    • Derivative of ReLU: Determines where the gradient should flow based on whether the input was positive or not.
    # Chain rule to get derivative of loss w.r.t hidden_input
    dL_dhidden_input = dL_dhidden_output * dhidden_output_dhidden_input  # Shape: (4, 2)
  • Apply Chain Rule:
    • Element-wise Multiplication: Combines the gradients to obtain the gradient of the loss with respect to the hidden layer inputs.
    # Compute derivative of loss w.r.t W1 and b1
    # W1 has shape (2, 2), X has shape (4, 2)
    dL_dW1 = torch.matmul(dL_dhidden_input.T, X) / X.shape[0]  # Shape: (2, 2)
    dL_db1 = torch.mean(dL_dhidden_input, dim=0, keepdim=True)  # Shape: (1, 2)
  • Compute Gradients for Hidden Layer Parameters:
    • Gradient w.r.t Weights (dL_dW1): Calculated by multiplying the transpose of dL_dhidden_input with X and averaging over the batch size.
    • Gradient w.r.t Biases (dL_db1): Calculated by taking the mean of dL_dhidden_input across all samples.
    # =====================
    # Parameter Update
    # =====================
    
    W2 -= learning_rate * dL_dW2  # Shape: (1, 2)
    b2 -= learning_rate * dL_db2  # Shape: (1, 1)
    W1 -= learning_rate * dL_dW1  # Shape: (2, 2)
    b1 -= learning_rate * dL_db1.T  # dL_db1 is (1, 2); transpose to match b1's shape (2, 1)
  • Update Parameters Using Gradient Descent:
    • Weights and Biases Update: Subtract the product of the learning rate and the respective gradients from the current parameters to minimize the loss.
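
One detail worth checking in this step is shape agreement: an in-place update cannot change the shape of the tensor being updated, so the gradient must match the parameter's shape. The sketch below uses placeholder tensors with the same shapes as b1 and dL_db1; the values are assumptions for illustration.

```python
import torch

b1 = torch.zeros(2, 1)         # same shape as the hidden-layer bias
grad_b1 = torch.ones(1, 2)     # same shape as dL_db1 (placeholder values)
learning_rate = 0.1

try:
    # Broadcasting (2, 1) with (1, 2) would give (2, 2), which cannot be
    # written back into b1 in place
    b1 -= learning_rate * grad_b1
    in_place_ok = True
except RuntimeError:
    in_place_ok = False

# Transposing the gradient makes the shapes agree: (2, 1) -= (2, 1)
b1 -= learning_rate * grad_b1.T

print("Mismatched update succeeded:", in_place_ok)
print("b1 after update:", b1.flatten())
```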
    # =====================
    # Logging
    # =====================
    
    if (epoch + 1) % 1000 == 0:
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}")
  • Print Progress:
    • Logging: Every 1000 epochs, print the current epoch number and the loss to monitor training progress.
# =====================
# Evaluation After Training
# =====================

# Forward pass to get final predictions
hidden_input = torch.matmul(X, W1.T) + b1.T
hidden_output = relu(hidden_input)
output_input = torch.matmul(hidden_output, W2.T) + b2.T
y_pred = sigmoid(output_input)

# Binarize predictions
y_pred_binary = (y_pred > 0.5).float()

print("\nFinal Predictions:\n", y_pred_binary)
print("Actual Labels:\n", y)
  • Evaluate the Trained Model:
    • Forward Pass: Computes the final predictions using the trained weights and biases.
    • Binarization: Converts continuous predictions to binary outputs (0 or 1) based on a threshold of 0.5.
    • Print Predictions vs. Actual Labels: Displays the model's predictions alongside the true labels to assess performance.

Detailed Explanation of the Neural Network Implementation

Let's dissect the neural network implementation to understand how each part contributes to the learning process.

Initialization

# Initialize input data (X) and target labels (y)
# Example: Logical XOR problem
X = torch.tensor([[0.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 0.0],
                  [1.0, 1.0]])

y = torch.tensor([[0.0],
                  [1.0],
                  [1.0],
                  [0.0]])
  • Dataset: We use the XOR problem, a classic example where the relationship between inputs and outputs is non-linear and cannot be solved with a simple linear model.
# Initialize weights and biases for the hidden layer
# Hidden layer has 2 neurons and takes 2 inputs
W1 = torch.randn(2, 2, requires_grad=False) * 0.1
b1 = torch.zeros(2, 1, requires_grad=False)

# Initialize weights and biases for the output layer
# Output layer has 1 neuron and takes 2 inputs from the hidden layer
W2 = torch.randn(1, 2, requires_grad=False) * 0.1
b2 = torch.zeros(1, 1, requires_grad=False)
  • Weights and Biases:
    • Hidden Layer (W1, b1):
      • W1: Weight matrix connecting input neurons to hidden neurons.
      • b1: Bias vector for hidden neurons.
    • Output Layer (W2, b2):
      • W2: Weight matrix connecting hidden neurons to the output neuron.
      • b2: Bias vector for the output neuron.
    • Initialization: Small random values for weights to break symmetry and zero biases.
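
To see why random initialization matters, the sketch below deliberately starts every weight at the same constant and inspects the gradients with autograd. The constant 0.5 is an arbitrary choice for illustration, and autograd is used here only to keep the demonstration short.

```python
import torch

X = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([[0.0], [1.0], [1.0], [0.0]])

# Deliberately bad initialization: every weight has the same value
W1 = torch.full((2, 2), 0.5, requires_grad=True)
b1 = torch.zeros(2, 1, requires_grad=True)
W2 = torch.full((1, 2), 0.5, requires_grad=True)
b2 = torch.zeros(1, 1, requires_grad=True)

hidden = torch.relu(X @ W1.T + b1.T)
y_pred = torch.sigmoid(hidden @ W2.T + b2.T)
loss = -(y * torch.log(y_pred) + (1 - y) * torch.log(1 - y_pred)).mean()
loss.backward()

# Each row of W1.grad belongs to one hidden neuron
print(W1.grad)
print("Hidden neurons get identical gradients:",
      torch.allclose(W1.grad[0], W1.grad[1]))
```

Because both hidden neurons start identical and receive identical gradients, gradient descent keeps them identical, and the layer behaves like a single neuron.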
# Define the learning rate
learning_rate = 0.1

# Number of epochs for training
epochs = 10000
  • Hyperparameters:
    • Learning Rate (learning_rate): Determines the size of the steps taken during optimization.
    • Epochs (epochs): Number of times the entire dataset is passed through the network during training.

Training Loop

The training loop iteratively updates the network's parameters to minimize the loss function.

for epoch in range(epochs):
  • Loop Over Epochs: Repeats the training process for the specified number of epochs.
Forward Pass
    # Calculate hidden layer input: W1 * X + b1
    hidden_input = torch.matmul(X, W1.T) + b1.T  # Shape: (4, 2)
  • Compute Hidden Layer Inputs:
    • Matrix Multiplication (torch.matmul): Multiplies input X with the transpose of weights W1 to align dimensions for multiplication.
    • Adding Biases (b1.T): Adds the biases to each hidden neuron for each sample.
    • Result (hidden_input): The raw input to each hidden neuron before activation.
    # Apply ReLU activation
    hidden_output = relu(hidden_input)  # Shape: (4, 2)
  • Apply Activation Function:
    • ReLU (relu): Introduces non-linearity by zeroing out negative values.
    • Result (hidden_output): Activated outputs from the hidden layer.
    # Calculate output layer input: W2 * hidden_output + b2
    output_input = torch.matmul(hidden_output, W2.T) + b2.T  # Shape: (4, 1)
  • Compute Output Layer Inputs:
    • Matrix Multiplication (torch.matmul): Multiplies hidden layer outputs with the transpose of weights W2.
    • Adding Biases (b2.T): Adds the bias to the output neuron for each sample.
    • Result (output_input): The raw input to the output neuron before activation.
    # Apply Sigmoid activation to get the final output
    y_pred = sigmoid(output_input)  # Shape: (4, 1)
  • Apply Activation Function:
    • Sigmoid (sigmoid): Converts raw outputs to probabilities between 0 and 1.
    • Result (y_pred): Final predictions from the network.
Loss Calculation
    loss = binary_cross_entropy(y_pred, y)  # Scalar
  • Compute Loss:
    • Binary Cross-Entropy (binary_cross_entropy): Measures how well the predicted probabilities align with the actual labels.
    • Result (loss): A single scalar value representing the average loss across all samples.
Backward Pass
    # Compute derivative of loss w.r.t y_pred
    dL_dy_pred = binary_cross_entropy_derivative(y_pred, y)  # Shape: (4, 1)
  • Gradient of Loss w.r.t Predictions (dL_dy_pred):
    • Purpose: Determines how the loss changes with respect to changes in predictions.
    # Compute derivative of Sigmoid activation
    dy_pred_doutput_input = sigmoid_derivative(output_input)  # Shape: (4, 1)
  • Gradient of Sigmoid Activation (dy_pred_doutput_input):
    • Purpose: Determines how the Sigmoid function's output changes with respect to its input.
    # Chain rule to get derivative of loss w.r.t output_input
    dL_doutput_input = dL_dy_pred * dy_pred_doutput_input  # Shape: (4, 1)
  • Gradient of Loss w.r.t Output Layer Input (dL_doutput_input):
    • Chain Rule Application: Combines the gradients of the loss and the Sigmoid activation to compute the overall gradient.
    # Compute derivative of loss w.r.t W2 and b2
    dL_dW2 = torch.matmul(dL_doutput_input.T, hidden_output) / X.shape[0]  # Shape: (1, 2)
    dL_db2 = torch.mean(dL_doutput_input, dim=0, keepdim=True)  # Shape: (1, 1)
  • Gradients for Output Layer Parameters:
    • Weights (dL_dW2): Calculated by multiplying the transpose of dL_doutput_input with hidden_output and averaging over the batch size.
    • Biases (dL_db2): Calculated by taking the mean of dL_doutput_input across all samples.
    # Compute derivative of loss w.r.t hidden_output
    dL_dhidden_output = torch.matmul(dL_doutput_input, W2)  # Shape: (4, 2)
  • Gradient of Loss w.r.t Hidden Layer Output (dL_dhidden_output):
    • Purpose: Determines how the loss changes with respect to the hidden layer's outputs.
    # Compute derivative of ReLU activation
    dhidden_output_dhidden_input = relu_derivative(hidden_input)  # Shape: (4, 2)
  • Gradient of ReLU Activation (dhidden_output_dhidden_input):
    • Purpose: Determines where the gradient should flow based on ReLU's activation.
    # Chain rule to get derivative of loss w.r.t hidden_input
    dL_dhidden_input = dL_dhidden_output * dhidden_output_dhidden_input  # Shape: (4, 2)
  • Gradient of Loss w.r.t Hidden Layer Input (dL_dhidden_input):
    • Chain Rule Application: Combines the gradients to determine how the loss changes with respect to hidden layer inputs.
    # Compute derivative of loss w.r.t W1 and b1
    dL_dW1 = torch.matmul(dL_dhidden_input.T, X) / X.shape[0]  # Shape: (2, 2)
    dL_db1 = torch.mean(dL_dhidden_input, dim=0, keepdim=True)  # Shape: (1, 2)
  • Gradients for Hidden Layer Parameters:
    • Weights (dL_dW1): Calculated by multiplying the transpose of dL_dhidden_input with X and averaging over the batch size.
    • Biases (dL_db1): Calculated by taking the mean of dL_dhidden_input across all samples.
Parameter Update
    W2 -= learning_rate * dL_dW2  # Shape: (1, 2)
    b2 -= learning_rate * dL_db2  # Shape: (1, 1)
    W1 -= learning_rate * dL_dW1  # Shape: (2, 2)
    b1 -= learning_rate * dL_db1.T  # dL_db1 is (1, 2); transpose to match b1's shape (2, 1)
  • Update Parameters Using Gradient Descent:
    • Weights and Biases: Subtract the product of the learning rate and the respective gradients from the current parameters to move them in the direction that minimizes the loss.
Logging
    if (epoch + 1) % 1000 == 0:
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}")
  • Progress Logging:
    • Condition: Every 1000 epochs, print the current epoch number and the corresponding loss to monitor training progress.

Evaluation After Training

After training, we'll perform a forward pass to evaluate the network's predictions.

# Forward pass to get final predictions
hidden_input = torch.matmul(X, W1.T) + b1.T
hidden_output = relu(hidden_input)
output_input = torch.matmul(hidden_output, W2.T) + b2.T
y_pred = sigmoid(output_input)

# Binarize predictions
y_pred_binary = (y_pred > 0.5).float()

print("\nFinal Predictions:\n", y_pred_binary)
print("Actual Labels:\n", y)
  • Final Predictions:
    • Forward Pass: Recomputes the forward pass with the trained parameters to obtain predictions.
    • Binarization: Converts the continuous outputs to binary values (0 or 1) based on a threshold of 0.5.
    • Print Results: Displays the network's predictions alongside the actual labels for comparison.

Sample Output:

Epoch [1000/10000], Loss: 0.6842
Epoch [2000/10000], Loss: 0.6835
...
Epoch [10000/10000], Loss: 0.6820

Final Predictions:
 tensor([[0.0000],
        [1.0000],
        [1.0000],
        [0.0000]])
Actual Labels:
 tensor([[0.],
        [1.],
        [1.],
        [0.]])
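  • Interpretation:
    • Loss: The loss falls only from 0.6842 to 0.6820 and stays close to ln 2 ≈ 0.693, the loss of a model that outputs 0.5 for every input. A decrease this small means the predicted probabilities have barely moved away from 0.5.
    • Predictions vs. Labels: The thresholded predictions match the labels in this run, but with the loss this high each probability sits only just on the correct side of 0.5, so the match is fragile. Results depend on the random initialization; a network that has learned XOR well shows a loss much closer to 0. If training stalls, try a different seed, larger initial weights, or more hidden neurons.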
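
As a reference point for reading the loss values, the sketch below computes the BCE loss of a hypothetical model that outputs 0.5 for every XOR input. Such a model has learned nothing, so its loss, ln 2, is a useful baseline.

```python
import math
import torch

y = torch.tensor([[0.0], [1.0], [1.0], [0.0]])
y_pred = torch.full_like(y, 0.5)   # hypothetical model that always outputs 0.5

loss = -(y * torch.log(y_pred) + (1 - y) * torch.log(1 - y_pred)).mean()

print("Loss of constant 0.5 predictions:", loss.item())
print("ln 2:", math.log(2))
```

Loss values near this baseline suggest the network's outputs are still close to 0.5, whatever the thresholded predictions show.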

5. Practical Exercises

Engage with these exercises to reinforce your understanding of neural networks and PyTorch tensor manipulations.

5.1. Exercise 1: Building the Neural Network

Task:

  1. Initialize Weights and Biases:
    • Create weight matrices W1 and W2 with appropriate shapes and small random values.
    • Initialize bias vectors b1 and b2 with zeros.
  2. Define Activation Functions:
    • Implement Sigmoid and ReLU functions along with their derivatives.
  3. Implement Forward Pass:
    • Compute hidden layer inputs and apply ReLU.
    • Compute output layer inputs and apply Sigmoid.
  4. Compute Loss:
    • Use Binary Cross-Entropy loss to evaluate predictions.
  5. Implement Backward Pass:
    • Calculate gradients for all parameters using manual computations.
  6. Update Parameters:
    • Adjust weights and biases using Gradient Descent.

Instructions:

  • Follow the structure of the provided code example.
  • Ensure that each step is clearly commented and explained.
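Before wiring the derivatives into a backward pass, you can sanity-check them against a finite-difference approximation. This is a minimal sketch: `numerical_derivative` is a hypothetical helper, and `relu_derivative` casts to the input dtype (a small variation on the lesson's `.float()`) so that the comparison can run in float64.

```python
import torch

def sigmoid(x):
    return 1 / (1 + torch.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu(x):
    return torch.maximum(torch.zeros_like(x), x)

def relu_derivative(x):
    return (x > 0).to(x.dtype)

def numerical_derivative(f, x, h=1e-4):
    # Central difference: (f(x + h) - f(x - h)) / (2h)
    return (f(x + h) - f(x - h)) / (2 * h)

# float64 keeps the finite-difference error small. The test points avoid
# x = 0, where ReLU is not differentiable.
x = torch.tensor([-2.0, -0.5, 0.5, 2.0], dtype=torch.float64)

sigmoid_ok = torch.allclose(sigmoid_derivative(x), numerical_derivative(sigmoid, x), atol=1e-6)
relu_ok = torch.allclose(relu_derivative(x), numerical_derivative(relu, x), atol=1e-6)

print("Sigmoid derivative matches:", sigmoid_ok)
print("ReLU derivative matches:", relu_ok)
```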

5.2. Exercise 2: Training the Neural Network

Task:

  1. Set Hyperparameters:
    • Define learning rate and number of epochs.
  2. Implement Training Loop:
    • For each epoch, perform forward and backward passes.
    • Update parameters accordingly.
    • Log the loss at regular intervals.
  3. Evaluate the Model:
    • After training, perform a forward pass to obtain final predictions.
    • Binarize predictions and compare with actual labels.

Instructions:

  • Modify the number of epochs and learning rate to observe different training behaviors.
  • Experiment with different initializations for weights to see their impact on training.

6. Solutions and Explanations

6.1. Solutions to Practice Exercises

6.1.1. Exercise 1: Building the Neural Network

Solution:

import torch

# Set seed for reproducibility
torch.manual_seed(42)

# Define the Sigmoid activation function
def sigmoid(x):
    return 1 / (1 + torch.exp(-x))

# Define the derivative of the Sigmoid function
def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

# Define the ReLU activation function
def relu(x):
    return torch.maximum(torch.zeros_like(x), x)

# Define the derivative of the ReLU function
def relu_derivative(x):
    return (x > 0).float()

# Define the binary cross-entropy loss function
def binary_cross_entropy(y_pred, y_true):
    epsilon = 1e-7  # To prevent log(0); 1 - 1e-15 rounds to exactly 1.0 in float32
    y_pred = torch.clamp(y_pred, epsilon, 1 - epsilon)
    loss = - (y_true * torch.log(y_pred) + (1 - y_true) * torch.log(1 - y_pred))
    return torch.mean(loss)

# Define the derivative of the binary cross-entropy loss w.r.t. y_pred
# (the Sigmoid derivative is applied separately in the backward pass)
def binary_cross_entropy_derivative(y_pred, y_true):
    epsilon = 1e-7  # To prevent division by zero; 1e-15 is too small for float32
    y_pred = torch.clamp(y_pred, epsilon, 1 - epsilon)
    return (y_pred - y_true) / (y_pred * (1 - y_pred))

# Initialize input data (X) and target labels (y)
# Example: Logical XOR problem
X = torch.tensor([[0.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 0.0],
                  [1.0, 1.0]])

y = torch.tensor([[0.0],
                  [1.0],
                  [1.0],
                  [0.0]])

# Initialize weights and biases for the hidden layer
# Hidden layer has 2 neurons and takes 2 inputs
W1 = torch.randn(2, 2, requires_grad=False) * 0.1  # Shape: (2, 2)
b1 = torch.zeros(2, 1, requires_grad=False)        # Shape: (2, 1)

# Initialize weights and biases for the output layer
# Output layer has 1 neuron and takes 2 inputs from the hidden layer
W2 = torch.randn(1, 2, requires_grad=False) * 0.1  # Shape: (1, 2)
b2 = torch.zeros(1, 1, requires_grad=False)        # Shape: (1, 1)

# Define the learning rate
learning_rate = 0.1

# Number of epochs for training
epochs = 10000

# Training loop
for epoch in range(epochs):
    # =====================
    # Forward Pass
    # =====================
    
    # Calculate hidden layer input: X @ W1.T + b1.T (batched form of W1 x + b1)
    hidden_input = torch.matmul(X, W1.T) + b1.T  # Shape: (4, 2)
    
    # Apply ReLU activation
    hidden_output = relu(hidden_input)  # Shape: (4, 2)
    
    # Calculate output layer input: hidden_output @ W2.T + b2.T (batched form of W2 h + b2)
    output_input = torch.matmul(hidden_output, W2.T) + b2.T  # Shape: (4, 1)
    
    # Apply Sigmoid activation to get the final output
    y_pred = sigmoid(output_input)  # Shape: (4, 1)
    
    # =====================
    # Loss Calculation
    # =====================
    
    loss = binary_cross_entropy(y_pred, y)  # Scalar
    
    # =====================
    # Backward Pass
    # =====================
    
    # Compute derivative of loss w.r.t y_pred
    dL_dy_pred = binary_cross_entropy_derivative(y_pred, y)  # Shape: (4, 1)
    
    # Compute derivative of Sigmoid activation
    dy_pred_doutput_input = sigmoid_derivative(output_input)  # Shape: (4, 1)
    
    # Chain rule to get derivative of loss w.r.t output_input
    dL_doutput_input = dL_dy_pred * dy_pred_doutput_input  # Shape: (4, 1)
    
    # Compute derivative of loss w.r.t W2 and b2
    dL_dW2 = torch.matmul(dL_doutput_input.T, hidden_output) / X.shape[0]  # Shape: (1, 2)
    dL_db2 = torch.mean(dL_doutput_input, dim=0, keepdim=True)  # Shape: (1, 1)
    
    # Compute derivative of loss w.r.t hidden_output
    dL_dhidden_output = torch.matmul(dL_doutput_input, W2)  # Shape: (4, 2)
    
    # Compute derivative of ReLU activation
    dhidden_output_dhidden_input = relu_derivative(hidden_input)  # Shape: (4, 2)
    
    # Chain rule to get derivative of loss w.r.t hidden_input
    dL_dhidden_input = dL_dhidden_output * dhidden_output_dhidden_input  # Shape: (4, 2)
    
    # Compute derivative of loss w.r.t W1 and b1
    dL_dW1 = torch.matmul(dL_dhidden_input.T, X) / X.shape[0]  # Shape: (2, 2)
    dL_db1 = torch.mean(dL_dhidden_input, dim=0, keepdim=True)  # Shape: (1, 2)
    
    # =====================
    # Parameter Update
    # =====================
    
    W2 -= learning_rate * dL_dW2  # Shape: (1, 2)
    b2 -= learning_rate * dL_db2  # Shape: (1, 1)
    W1 -= learning_rate * dL_dW1  # Shape: (2, 2)
    b1 -= learning_rate * dL_db1.T  # dL_db1 is (1, 2); transpose to match b1's shape (2, 1)
    
    # =====================
    # Logging
    # =====================
    
    if (epoch + 1) % 1000 == 0:
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}")

Explanation:

  1. Activation Functions and Their Derivatives:
    • Sigmoid and ReLU: Defined as functions to apply non-linear transformations.
    • Derivatives: Essential for backpropagation to compute gradients.
  2. Loss Function:
    • Binary Cross-Entropy: Measures the difference between predicted probabilities and actual labels.
    • Derivative: Necessary for computing gradients during the backward pass.
  3. Data Initialization:
    • X and y: Defined for the XOR problem, which is non-linearly separable and requires a hidden layer to solve.
  4. Weights and Biases Initialization:
    • W1 and b1: For the hidden layer.
    • W2 and b2: For the output layer.
    • Random Initialization: Breaks symmetry, essential for effective learning.
  5. Training Loop:
    • Forward Pass:
      • Hidden Layer: Computes inputs and applies ReLU activation.
      • Output Layer: Computes inputs and applies Sigmoid activation.
    • Loss Calculation: Computes the average BCE loss across all samples.
    • Backward Pass:
      • Gradients: Manually computed using the chain rule.
    • Parameter Update: Adjusts weights and biases using Gradient Descent.
    • Logging: Prints loss every 1000 epochs to monitor training progress.
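  6. Final Evaluation:
    • Predictions: The solution code stops after the training loop. To obtain the predictions shown below, repeat the forward pass with the trained parameters and binarize at 0.5, as in the Evaluation After Training step. When training succeeds, the predictions match the XOR labels.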
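One way to gain confidence in manually derived gradients is to compare them with PyTorch's autograd on the same forward pass. The sketch below is an assumption-laden check, not part of the solution: it uses random parameters with `requires_grad=True` purely to obtain reference gradients, and it folds the BCE and Sigmoid derivatives into the single expression `y_pred - y`, which is what their product simplifies to.

```python
import torch

torch.manual_seed(0)

X = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([[0.0], [1.0], [1.0], [0.0]])

# Same shapes as the solution; requires_grad=True only so autograd can
# produce reference gradients for the comparison.
W1 = torch.randn(2, 2, requires_grad=True)
b1 = torch.randn(2, 1, requires_grad=True)
W2 = torch.randn(1, 2, requires_grad=True)
b2 = torch.randn(1, 1, requires_grad=True)

# Forward pass
hidden_input = X @ W1.T + b1.T
hidden_output = torch.relu(hidden_input)
output_input = hidden_output @ W2.T + b2.T
y_pred = torch.sigmoid(output_input)
loss = -(y * torch.log(y_pred) + (1 - y) * torch.log(1 - y_pred)).mean()

# Reference gradients from autograd
loss.backward()

# Manual gradients via the chain rule
with torch.no_grad():
    n = X.shape[0]
    dL_doutput_input = y_pred - y  # BCE and Sigmoid derivatives combined
    dL_dW2 = dL_doutput_input.T @ hidden_output / n
    dL_db2 = dL_doutput_input.mean(dim=0, keepdim=True)
    dL_dhidden_input = (dL_doutput_input @ W2) * (hidden_input > 0).float()
    dL_dW1 = dL_dhidden_input.T @ X / n
    dL_db1 = dL_dhidden_input.mean(dim=0, keepdim=True).T  # (2, 1), like b1

checks = {
    "W1": torch.allclose(dL_dW1, W1.grad, atol=1e-5),
    "b1": torch.allclose(dL_db1, b1.grad, atol=1e-5),
    "W2": torch.allclose(dL_dW2, W2.grad, atol=1e-5),
    "b2": torch.allclose(dL_db2, b2.grad, atol=1e-5),
}
for name, ok in checks.items():
    print(f"{name}: manual gradient matches autograd = {ok}")
```

If any entry prints False, the mismatch usually points to a missing transpose or a missing division by the batch size.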

Sample Output:

Epoch [1000/10000], Loss: 0.6842
Epoch [2000/10000], Loss: 0.6835
...
Epoch [10000/10000], Loss: 0.6820

Final Predictions:
 tensor([[0.0000],
        [1.0000],
        [1.0000],
        [0.0000]])
Actual Labels:
 tensor([[0.],
        [1.],
        [1.],
        [0.]])
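  • Interpretation: The binarized predictions match the XOR labels, but the loss is still close to ln 2 ≈ 0.693, so the raw outputs sit only just on the correct side of 0.5. A clearly learned XOR would show a loss near 0.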
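To see that a network with two ReLU hidden units is able to represent XOR at all, you can set the weights by hand. The values below are hand-picked for illustration. They are one possible solution, not the weights the training loop would find.

```python
import torch

X = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([[0.0], [1.0], [1.0], [0.0]])

# Hand-picked parameters (illustrative, not learned):
#   h1 = relu(x1 + x2), h2 = relu(x1 + x2 - 1)
#   z  = 10 * h1 - 20 * h2 - 5
W1 = torch.tensor([[1.0, 1.0], [1.0, 1.0]])   # (2, 2)
b1 = torch.tensor([[0.0], [-1.0]])            # (2, 1)
W2 = torch.tensor([[10.0, -20.0]])            # (1, 2)
b2 = torch.tensor([[-5.0]])                   # (1, 1)

hidden_output = torch.relu(X @ W1.T + b1.T)
output_input = hidden_output @ W2.T + b2.T     # -5, 5, 5, -5
y_pred = torch.sigmoid(output_input)

loss = -(y * torch.log(y_pred) + (1 - y) * torch.log(1 - y_pred)).mean()
y_pred_binary = (y_pred > 0.5).float()

print("Probabilities:", y_pred.flatten().tolist())
print("Binary predictions:", y_pred_binary.flatten().tolist())
print(f"Loss: {loss.item():.4f}")
```

With these weights the loss is far below ln 2, which is what a confidently solved XOR looks like. Whether gradient descent reaches such a solution from a small random start depends on the seed and hyperparameters.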

6.1.2. Exercise 2: Training the Neural Network

Solution:

The provided code in Exercise 1 already includes a training loop that performs forward and backward passes, updates parameters, and logs the loss every 1000 epochs. To extend this, you can experiment with different learning rates, number of epochs, or network architectures to observe their effects on training performance.

Example Modification:

# Change learning rate and epochs
learning_rate = 0.05
epochs = 20000

# Reinitialize weights and biases for fresh training
W1 = torch.randn(2, 2, requires_grad=False) * 0.1
b1 = torch.zeros(2, 1, requires_grad=False)
W2 = torch.randn(1, 2, requires_grad=False) * 0.1
b2 = torch.zeros(1, 1, requires_grad=False)

# Retrain with new hyperparameters
for epoch in range(epochs):
    # (Same training steps as before)
    # ...
    if (epoch + 1) % 5000 == 0:
        print(f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.4f}")

Explanation:

  • Learning Rate Adjustment: Reducing the learning rate may lead to more stable but slower convergence.
  • Increasing Epochs: Allows the network more iterations to learn, potentially achieving lower loss.
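To compare settings without copying the loop, the training steps can be wrapped in a function. This is a sketch under assumptions: `train_xor` and its `seed` and `init_scale` parameters are hypothetical names, and the backward pass uses the simplified output gradient `y_pred - y`.

```python
import torch

X = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([[0.0], [1.0], [1.0], [0.0]])

def train_xor(learning_rate, epochs, seed=42, init_scale=0.1):
    """Train the 2-2-1 network; return the loss at the first and last epoch."""
    torch.manual_seed(seed)
    W1 = torch.randn(2, 2) * init_scale
    b1 = torch.zeros(2, 1)
    W2 = torch.randn(1, 2) * init_scale
    b2 = torch.zeros(1, 1)
    n = X.shape[0]
    losses = []
    for _ in range(epochs):
        # Forward pass
        hidden_input = X @ W1.T + b1.T
        hidden_output = torch.relu(hidden_input)
        y_pred = torch.sigmoid(hidden_output @ W2.T + b2.T)
        y_safe = torch.clamp(y_pred, 1e-7, 1 - 1e-7)
        loss = -(y * torch.log(y_safe) + (1 - y) * torch.log(1 - y_safe)).mean()
        losses.append(loss.item())
        # Backward pass (BCE and Sigmoid derivatives combined)
        dz2 = y_pred - y
        dW2 = dz2.T @ hidden_output / n
        db2 = dz2.mean(dim=0, keepdim=True)
        dz1 = (dz2 @ W2) * (hidden_input > 0).float()
        dW1 = dz1.T @ X / n
        db1 = dz1.mean(dim=0, keepdim=True).T
        # Gradient descent update
        W2 -= learning_rate * dW2
        b2 -= learning_rate * db2
        W1 -= learning_rate * dW1
        b1 -= learning_rate * db1
    return losses[0], losses[-1]

for lr in (0.1, 0.05):
    first, last = train_xor(learning_rate=lr, epochs=2000)
    print(f"lr={lr}: first loss {first:.4f}, last loss {last:.4f}")
```

Varying `seed` or `init_scale` in the same way shows how sensitive this small network is to its starting point.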

Sample Output:

Epoch [5000/20000], Loss: 0.6801
Epoch [10000/20000], Loss: 0.6785
Epoch [15000/20000], Loss: 0.6770
Epoch [20000/20000], Loss: 0.6756

Final Predictions:
 tensor([[0.0000],
        [1.0000],
        [1.0000],
        [0.0000]])
Actual Labels:
 tensor([[0.],
        [1.],
        [1.],
        [0.]])
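  • Observation: With the adjusted hyperparameters the loss keeps falling slowly (0.6801 to 0.6756), and the binarized predictions match the labels. The loss remains near ln 2 ≈ 0.693, so the outputs are still close to the 0.5 threshold.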

7. Summary

Today, you've been introduced to the fundamental concepts of neural networks, including their structure, components, and activation functions. By implementing a basic neural network from scratch using PyTorch tensors, you've gained hands-on experience in:

  • Neural Network Architecture: Understood the roles of the input, hidden, and output layers.
  • Activation Functions: Learned how Sigmoid and ReLU introduce non-linearity into the model.
  • Manual Gradient Calculation: Calculated gradients without relying on PyTorch's automatic differentiation.
  • Parameter Updates: Applied Gradient Descent to optimize the network's parameters.
  • Training Process: Trained a simple network to solve the XOR problem, demonstrating the network's learning capability.

This exercise not only solidifies your understanding of neural networks but also provides insight into the inner workings of deep learning frameworks like PyTorch.


8. Additional Resources

The tips below suggest ways to further deepen your understanding of neural networks and PyTorch.

Tips for Continued Learning:

  • Hands-On Practice: Regularly implement neural networks using both high-level APIs (torch.nn) and low-level tensor operations to understand their functionalities deeply.
  • Experimentation: Modify network architectures, activation functions, and hyperparameters to observe their effects on training and performance.
  • Projects: Apply your knowledge to real-world datasets and problems, such as image classification, sentiment analysis, or time-series forecasting.
  • Stay Updated: Follow the latest developments in deep learning and PyTorch by subscribing to official channels and reading recent publications.
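As a starting point for the high-level route mentioned in the first tip, here is a rough torch.nn counterpart of the same 2-2-1 network. It is a sketch, not a drop-in replacement: `nn.Linear` uses its own default initialization rather than the small random values used in this lesson, and the epoch count is arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(42)

X = torch.tensor([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = torch.tensor([[0.0], [1.0], [1.0], [0.0]])

# 2 inputs -> 2 hidden units (ReLU) -> 1 output (Sigmoid)
model = nn.Sequential(
    nn.Linear(2, 2),
    nn.ReLU(),
    nn.Linear(2, 1),
    nn.Sigmoid(),
)
loss_fn = nn.BCELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(1000):
    optimizer.zero_grad()         # clear gradients from the previous step
    loss = loss_fn(model(X), y)   # forward pass and loss
    loss.backward()               # autograd replaces the manual backward pass
    optimizer.step()              # gradient descent update

with torch.no_grad():
    y_pred = model(X)

print(f"Final loss: {loss.item():.4f}")
print("Probabilities:", y_pred.flatten().tolist())
```

Comparing this with the manual version makes clear which steps autograd and the optimizer take over.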

Happy Learning and Coding!