Sentiment Analysis with Keras and IMDB Dataset
This snippet demonstrates sentiment analysis with Keras and the IMDB dataset. It walks through loading the data, preprocessing the text, building a model, training it, and evaluating its performance.
Loading the IMDB Dataset
This section loads the IMDB dataset, which consists of 50,000 movie reviews (25,000 for training, 25,000 for testing) labeled as positive or negative. `max_features` limits the vocabulary to the 10,000 most frequent words. `maxlen`, used in the padding step below, caps each review at 500 words. The `imdb.load_data` function downloads the data on first use and returns each review as a sequence of integer word indices, already split into training and test sets.
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
max_features = 10000 # Number of words to consider as features
maxlen = 500 # Cut texts after this number of words (among top max_features most common words)
batch_size = 32
print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')
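Each loaded review is a list of integers, not text. As a quick sanity check, a review can be decoded back into words with `imdb.get_word_index()`. This sketch assumes the default `index_from=3` convention of `imdb.load_data`, under which indices 0-2 are reserved for padding, start-of-sequence, and out-of-vocabulary tokens:
word_index = imdb.get_word_index()
# Shift by 3 because indices 0-2 are reserved (padding, start, unknown)
reverse_word_index = {index + 3: word for word, index in word_index.items()}
decoded = ' '.join(reverse_word_index.get(i, '?') for i in input_train[0])
print(decoded[:200]) # First 200 characters of the reconstructed review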
Padding Sequences
The reviews have variable lengths, so `sequence.pad_sequences` is used to pad or truncate them to a fixed length (`maxlen`). This ensures that all inputs to the neural network have the same shape.
print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)
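By default, `pad_sequences` pads and truncates at the front of each sequence ('pre'). A minimal illustration:
demo = sequence.pad_sequences([[1, 2, 3], [1, 2, 3, 4, 5, 6]], maxlen=5)
print(demo)
# [[0 0 1 2 3]   <- short sequence padded with zeros at the front
#  [2 3 4 5 6]]  <- long sequence truncated from the front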
Building the Model
This defines the neural network architecture. An `Embedding` layer converts word indices into dense vectors. An `LSTM` layer processes the resulting sequence of vectors. Dropout is applied inside the LSTM layer, via the `dropout` and `recurrent_dropout` arguments, to prevent overfitting. The final `Dense` layer outputs a single value between 0 and 1, representing the predicted sentiment. The model is compiled with the Adam optimizer and binary cross-entropy loss.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
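To inspect the architecture and per-layer parameter counts before training, call `model.summary()`:
model.summary() # Prints each layer's output shape and parameter count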
Training and Evaluating the Model
The model is trained with the `fit` method. The `validation_data` argument is given the test set here so performance can be monitored after each epoch; in practice a separate validation split is preferable, since tuning against the test set leaks information. After training, `model.evaluate` returns the loss (binary cross-entropy) and accuracy on the test set.
print('Train...')
model.fit(input_train, y_train,
          batch_size=batch_size,
          epochs=2,
          validation_data=(input_test, y_test))
score, acc = model.evaluate(input_test, y_test, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
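To score a brand-new review, the raw text must be encoded the same way the dataset was. The helper below is a hedged sketch (`encode_review` is not part of Keras): it mirrors the default `index_from=3` and `start_char=1` conventions of `imdb.load_data`, mapping unknown or out-of-vocabulary words to index 2:
word_index = imdb.get_word_index()

def encode_review(text):
    indices = [1] # imdb.load_data prepends a start token (index 1)
    for word in text.lower().split():
        idx = word_index.get(word, -1) + 3 # Missing words become 2 (out-of-vocabulary)
        indices.append(idx if 2 < idx < max_features else 2)
    return sequence.pad_sequences([indices], maxlen=maxlen)

sample = encode_review("this movie was a wonderful surprise")
print('Positive probability:', float(model.predict(sample)[0][0]))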
Concepts Behind the Snippet
This snippet utilizes an Embedding layer to represent words as vectors, an LSTM layer to capture sequential relationships in the text, and a dense layer to make the final sentiment prediction. The sigmoid activation function is used to output a probability between 0 and 1, representing the likelihood of positive sentiment.
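For intuition, the sigmoid is sigmoid(x) = 1 / (1 + e^(-x)); a tiny numeric check:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Large negative inputs map near 0, large positive inputs near 1
print(sigmoid(-2.0), sigmoid(0.0), sigmoid(2.0)) # ~0.12, 0.5, ~0.88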
Real-Life Use Case
Sentiment analysis has numerous real-world applications, including monitoring customer reviews, analyzing social media posts, and understanding public opinion on products or services. This helps businesses gain valuable insights into customer satisfaction and brand perception.
Interview Tip
Understand the purpose of each layer in the model and the role of recurrent neural networks in processing sequential data. Be prepared to discuss the challenges of training recurrent neural networks, such as vanishing gradients, and techniques to mitigate them.
When to Use Them
Sentiment analysis is appropriate when you want to understand the emotional tone of text data, such as customer reviews, social media posts, or survey responses. This is useful for gauging customer satisfaction, identifying potential issues, and tracking brand reputation.
Memory Footprint
The memory footprint of this model depends on the vocabulary size, sequence length, embedding dimension, and the number of LSTM units. Larger vocabularies and longer sequences require more memory. Using pre-trained, frozen embeddings does not shrink the weight matrix itself, but it does avoid storing gradients and optimizer state for the embedding layer during training.
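A back-of-the-envelope count for this particular architecture (using the standard LSTM parameter formula: 4 gates, each with input weights, recurrent weights, and a bias; sizes assume 32-bit floats):
embedding_params = max_features * 128           # 10,000 x 128 = 1,280,000
lstm_params = 4 * (128 * 128 + 128 * 128 + 128) # 4 gates x (input + recurrent + bias) = 131,584
dense_params = 128 * 1 + 1                      # weights + bias = 129
total = embedding_params + lstm_params + dense_params
print(total) # 1,411,713 parameters, roughly 5.4 MB of weights at float32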
FAQ
What is the purpose of dropout in LSTM layers?
Dropout is a regularization technique that helps prevent overfitting by randomly setting a fraction of the inputs to zero during training. This forces the network to learn more robust features.
How does the Embedding layer work?
The Embedding layer maps each word index to a dense vector of fixed size. These vectors are learned during training and represent the semantic meaning of the words (a short standalone demo appears after this FAQ).
Why is the sigmoid activation function used in the output layer?
The sigmoid activation function outputs a value between 0 and 1, which can be interpreted as the probability of the review being positive. This makes it suitable for binary classification problems like sentiment analysis.
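As promised above, a standalone illustration of the Embedding layer; the `input_dim` and `output_dim` values here are arbitrary, chosen just for the demo:
import numpy as np
from tensorflow.keras.layers import Embedding

embed = Embedding(input_dim=1000, output_dim=8)
vectors = embed(np.array([[4, 20, 7]])) # A batch of one 3-word sequence
print(vectors.shape) # (1, 3, 8): one 8-dimensional vector per word index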