Activation Functions: ReLU, Sigmoid, and Tanh Explained

Activation functions are a crucial component of neural networks, introducing non-linearity and enabling the network to learn complex patterns. This tutorial explores three common activation functions: ReLU, Sigmoid, and Tanh. We'll delve into their mathematical definitions, properties, advantages, disadvantages, and practical considerations for choosing the right activation function for your neural network.

What are Activation Functions?

Activation functions introduce non-linearity into the output of a neuron. Without them, a neural network, no matter how many layers it has, would collapse into a single linear transformation and could not learn complex relationships in data. The activation function decides whether, and how strongly, a neuron is 'activated', based on the weighted sum of its inputs plus a bias.
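
To see why, consider stacking two layers with no activation function at all: the composition of two linear maps is itself a single linear map. The following minimal NumPy sketch illustrates this with random weight matrices (biases omitted for brevity; the matrices are placeholders, not trained weights).

import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation in between: y = W2 @ (W1 @ x)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

two_linear_layers = W2 @ (W1 @ x)

# The same mapping collapses into one matrix: W = W2 @ W1
single_layer = (W2 @ W1) @ x

print(np.allclose(two_linear_layers, single_layer))  # True: no extra expressive power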

Sigmoid Activation Function

The sigmoid function outputs a value between 0 and 1. Its mathematical formula is: σ(x) = 1 / (1 + e^(-x)). It's often used in the output layer for binary classification problems, as its output can be interpreted as a probability.

Pros: Outputs a value between 0 and 1, making it suitable for probability prediction. Smooth gradient, preventing 'jumps' in output values.

Cons: Suffers from the vanishing gradient problem, especially for very large or very small input values, hindering learning in deep networks. Not zero-centered, which can slow down learning. Computationally expensive due to the exponential function.

import numpy as np

def sigmoid(x):
  return 1 / (1 + np.exp(-x))

# Example Usage
x = np.array([-2, -1, 0, 1, 2])
sigmoid_output = sigmoid(x)
print(f"Input: {x}")
print(f"Sigmoid Output: {sigmoid_output}")

Tanh Activation Function

The tanh (hyperbolic tangent) function outputs a value between -1 and 1. Its mathematical formula is: tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x)). It's similar to the sigmoid function but zero-centered, which often leads to faster convergence during training.

Pros: Zero-centered output, which can speed up learning compared to sigmoid. Smooth gradient.

Cons: Still suffers from the vanishing gradient problem, although less severely than sigmoid. Computationally expensive due to the exponential function.

import numpy as np

def tanh(x):
  return np.tanh(x)

# Example Usage
x = np.array([-2, -1, 0, 1, 2])
tanh_output = tanh(x)
print(f"Input: {x}")
print(f"Tanh Output: {tanh_output}")

ReLU Activation Function

The ReLU (Rectified Linear Unit) function outputs the input directly if it's positive, otherwise, it outputs zero. Its mathematical formula is: ReLU(x) = max(0, x). It's the most popular activation function for many types of neural networks due to its simplicity and efficiency.

Pros: Computationally efficient, as it only involves a comparison. Alleviates the vanishing gradient problem to a large extent compared to sigmoid and tanh, since its gradient is 1 for every positive input. Promotes sparse activations, because any neuron with a negative pre-activation outputs exactly zero.

Cons: Can suffer from the 'dying ReLU' problem, where neurons become inactive and stop learning if they consistently receive negative inputs. Not zero-centered.

import numpy as np

def relu(x):
  return np.maximum(0, x)

# Example Usage
x = np.array([-2, -1, 0, 1, 2])
relu_output = relu(x)
print(f"Input: {x}")
print(f"ReLU Output: {relu_output}")

When to Use Them

Sigmoid: Use in the output layer for binary classification problems where you need a probability-like output.

Tanh: Consider using tanh in hidden layers as an alternative to sigmoid, especially when you want zero-centered activations.

ReLU: Use ReLU as the default activation function for hidden layers in most neural networks. Consider alternatives like Leaky ReLU or ELU to address the dying ReLU problem if necessary.

Alternatives to ReLU

Leaky ReLU: Introduces a small slope for negative inputs, preventing the dying ReLU problem. Leaky ReLU(x) = x if x > 0 else alpha * x (where alpha is a small constant, e.g., 0.01).

ELU (Exponential Linear Unit): Similar to Leaky ReLU, but uses an exponential function for negative inputs. ELU(x) = x if x > 0 else alpha * (exp(x) - 1) (where alpha is a constant).

SELU (Scaled Exponential Linear Unit): A variant of ELU that is self-normalizing, meaning that it tends to push the activations towards zero mean and unit variance, which can improve training speed and stability.
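
The following are minimal NumPy sketches of Leaky ReLU and ELU based on the formulas above. The alpha values are common defaults, not required choices.

def leaky_relu(x, alpha=0.01):
  # Pass positive inputs through unchanged; scale negative inputs by a small slope
  return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
  # Pass positive inputs through unchanged; smooth exponential curve for negative inputs
  return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(f"Leaky ReLU: {leaky_relu(x)}")
print(f"ELU: {elu(x)}")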

Memory Footprint

ReLU is lighter-weight than sigmoid and tanh in both the forward and backward pass: the forward pass is a single comparison rather than an exponential, and during backpropagation its gradient can be reconstructed from a simple mask of which inputs were positive. Sigmoid and tanh must keep their activation values around to compute their derivatives (σ(x)(1 - σ(x)) and 1 - tanh²(x), respectively), so they cost somewhat more in both computation and memory.

Real-Life Use Case Section

ReLU: Commonly used in deep learning models for image recognition, object detection, and natural language processing due to its computational efficiency and ability to mitigate the vanishing gradient problem.

Sigmoid: Used in logistic regression and in the output layer of neural networks for binary classification tasks, such as spam detection or fraud detection.

Tanh: Can be used in recurrent neural networks (RNNs) and LSTMs, especially in the hidden states, although ReLU-based activations are becoming increasingly popular even there.

Best Practices

Start with ReLU for hidden layers and monitor for the dying ReLU problem. If it occurs, consider using Leaky ReLU or ELU.

Use Sigmoid in the output layer for binary classification problems.

Experiment with different activation functions to find the best one for your specific task and dataset.
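
To make these recommendations concrete, here is a minimal forward pass for a tiny binary classifier that reuses the relu and sigmoid functions defined above. The weights are random placeholders, not trained values; a real network would learn them via backpropagation.

rng = np.random.default_rng(42)

# One hidden layer (ReLU) and a single sigmoid output unit
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((1, 4)), np.zeros(1)

x = rng.standard_normal(3)                 # one example with 3 input features
hidden = relu(W1 @ x + b1)                 # ReLU in the hidden layer
probability = sigmoid(W2 @ hidden + b2)    # probability-like output for binary classification

print(f"Predicted probability: {probability[0]:.3f}")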

Interview Tip

When discussing activation functions in interviews, be prepared to explain their mathematical properties, advantages, disadvantages, and when you would choose one over another. Be sure to mention the vanishing gradient problem and the dying ReLU problem, and explain how alternatives like Leaky ReLU and ELU address them. Back this up with real-world examples of where each function is typically used.

FAQ

  • What is the vanishing gradient problem?

    The vanishing gradient problem occurs when the gradients during backpropagation become very small, making it difficult for the earlier layers of the network to learn. It is particularly prevalent in deep networks that use sigmoid or tanh activations, because their derivatives approach zero whenever the input has a large magnitude (the function saturates).

  • What is the dying ReLU problem?

    The dying ReLU problem occurs when a ReLU neuron becomes inactive and stops learning because its pre-activation is negative for essentially every input. Once that happens, its output and its gradient are both zero, so its weights receive no further updates.

  • Why is ReLU so popular?

    ReLU is popular due to its computational efficiency, ability to alleviate the vanishing gradient problem, and promotion of sparsity in the network. However, it's important to be aware of the dying ReLU problem and consider alternatives if it occurs.