Sentiment Analysis with Keras and IMDB Dataset
This snippet demonstrates sentiment analysis with Keras and the IMDB dataset. It walks through loading the data, preprocessing the text, building a model, training it, and evaluating its performance.
Loading the IMDB Dataset
This section loads the IMDB dataset, which consists of 50,000 movie reviews (25,000 for training, 25,000 for testing) labeled as positive or negative. `max_features` limits the vocabulary to the 10,000 most frequent words. `maxlen`, used in the padding step below, caps each review at 500 words. The `imdb.load_data` function downloads the data on first use and returns each review as a sequence of integer word indices, already split into training and test sets.
from tensorflow.keras.datasets import imdb
from tensorflow.keras.preprocessing import sequence
max_features = 10000 # Number of words to consider as features
maxlen = 500 # Cut texts after this number of words (among top max_features most common words)
batch_size = 32
print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')
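Each loaded review is a list of integers, not text. As a quick sanity check, a review can be decoded back into words with `imdb.get_word_index()`. This sketch assumes the default `index_from=3` convention of `imdb.load_data`, under which indices 0-2 are reserved for padding, start-of-sequence, and out-of-vocabulary tokens:
word_index = imdb.get_word_index()
# Shift by 3 because indices 0-2 are reserved (padding, start, unknown)
reverse_word_index = {index + 3: word for word, index in word_index.items()}
decoded = ' '.join(reverse_word_index.get(i, '?') for i in input_train[0])
print(decoded[:200]) # First 200 characters of the reconstructed review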
Padding Sequences
The reviews have variable lengths, so `sequence.pad_sequences` is used to pad or truncate them to a fixed length (`maxlen`). This ensures that all inputs to the neural network have the same shape.
print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)
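By default, `pad_sequences` pads and truncates at the front of each sequence ('pre'). A minimal illustration:
demo = sequence.pad_sequences([[1, 2, 3], [1, 2, 3, 4, 5, 6]], maxlen=5)
print(demo)
# [[0 0 1 2 3]   <- short sequence padded with zeros at the front
#  [2 3 4 5 6]]  <- long sequence truncated from the front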
Building the Model
This defines the neural network architecture. An `Embedding` layer converts word indices into dense vectors. An `LSTM` layer processes the resulting sequence of vectors. Dropout is applied inside the LSTM layer, via the `dropout` and `recurrent_dropout` arguments, to prevent overfitting. The final `Dense` layer outputs a single value between 0 and 1, representing the predicted sentiment. The model is compiled with the Adam optimizer and binary cross-entropy loss.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
model = Sequential()
model.add(Embedding(max_features, 128))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
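To inspect the architecture and per-layer parameter counts before training, call `model.summary()`:
model.summary() # Prints each layer's output shape and parameter count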
Training and Evaluating the Model
The model is trained with the `fit` method. The `validation_data` argument is given the test set here so performance can be monitored after each epoch; in practice a separate validation split is preferable, since tuning against the test set leaks information. After training, `model.evaluate` returns the loss (binary cross-entropy) and accuracy on the test set.
print('Train...')
model.fit(input_train, y_train,
          batch_size=batch_size,
          epochs=2,
          validation_data=(input_test, y_test))
score, acc = model.evaluate(input_test, y_test, batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)
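To score a brand-new review, the raw text must be encoded the same way the dataset was. The helper below is a hedged sketch (`encode_review` is not part of Keras): it mirrors the default `index_from=3` and `start_char=1` conventions of `imdb.load_data`, mapping unknown or out-of-vocabulary words to index 2:
word_index = imdb.get_word_index()

def encode_review(text):
    indices = [1] # imdb.load_data prepends a start token (index 1)
    for word in text.lower().split():
        idx = word_index.get(word, -1) + 3 # Missing words become 2 (out-of-vocabulary)
        indices.append(idx if 2 < idx < max_features else 2)
    return sequence.pad_sequences([indices], maxlen=maxlen)

sample = encode_review("this movie was a wonderful surprise")
print('Positive probability:', float(model.predict(sample)[0][0]))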
Concepts Behind the Snippet
This snippet utilizes an Embedding layer to represent words as vectors, an LSTM layer to capture sequential relationships in the text, and a dense layer to make the final sentiment prediction. The sigmoid activation function is used to output a probability between 0 and 1, representing the likelihood of positive sentiment.
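For intuition, the sigmoid is sigmoid(x) = 1 / (1 + e^(-x)); a tiny numeric check:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Large negative inputs map near 0, large positive inputs near 1
print(sigmoid(-2.0), sigmoid(0.0), sigmoid(2.0)) # ~0.12, 0.5, ~0.88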
Real-Life Use Case
Sentiment analysis has numerous real-world applications, including monitoring customer reviews, analyzing social media posts, and understanding public opinion on products or services. This helps businesses gain valuable insights into customer satisfaction and brand perception.
Interview Tip
Understand the purpose of each layer in the model and the role of recurrent neural networks in processing sequential data. Be prepared to discuss the challenges of training recurrent neural networks, such as vanishing gradients, and techniques to mitigate them.
When to Use Them
Sentiment analysis is appropriate when you want to understand the emotional tone of text data, such as customer reviews, social media posts, or survey responses. This is useful for gauging customer satisfaction, identifying potential issues, and tracking brand reputation.
Memory Footprint
The memory footprint of this model depends on the vocabulary size, sequence length, embedding dimension, and the number of LSTM units. Larger vocabularies and longer sequences require more memory. Using pre-trained, frozen embeddings does not shrink the weight matrix itself, but it does avoid storing gradients and optimizer state for the embedding layer during training.
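A back-of-the-envelope count for this particular architecture (using the standard LSTM parameter formula: 4 gates, each with input weights, recurrent weights, and a bias; sizes assume 32-bit floats):
embedding_params = max_features * 128           # 10,000 x 128 = 1,280,000
lstm_params = 4 * (128 * 128 + 128 * 128 + 128) # 4 gates x (input + recurrent + bias) = 131,584
dense_params = 128 * 1 + 1                      # weights + bias = 129
total = embedding_params + lstm_params + dense_params
print(total) # 1,411,713 parameters, roughly 5.4 MB of weights at float32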
FAQ
What is the purpose of dropout in LSTM layers?
Dropout is a regularization technique that helps prevent overfitting by randomly setting a fraction of the inputs to zero during training. This forces the network to learn more robust features.
How does the Embedding layer work?
The Embedding layer maps each word index to a dense vector of fixed size. These vectors are learned during training and represent the semantic meaning of the words (a short standalone demo appears after this FAQ).
Why is the sigmoid activation function used in the output layer?
The sigmoid activation function outputs a value between 0 and 1, which can be interpreted as the probability of the review being positive. This makes it suitable for binary classification problems like sentiment analysis.
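As promised above, a standalone illustration of the Embedding layer; the `input_dim` and `output_dim` values here are arbitrary, chosen just for the demo:
import numpy as np
from tensorflow.keras.layers import Embedding

embed = Embedding(input_dim=1000, output_dim=8)
vectors = embed(np.array([[4, 20, 7]])) # A batch of one 3-word sequence
print(vectors.shape) # (1, 3, 8): one 8-dimensional vector per word index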