Handling Missing Data with Pandas
Pandas provides robust tools for handling missing data, represented as NaN (Not a Number). This snippet demonstrates how to identify, fill, and drop missing values in a DataFrame.
Importing Pandas and Creating a DataFrame with Missing Values
This section imports the pandas and numpy libraries. It then creates a sample DataFrame df with one missing value (np.nan) in each of its three columns.
import pandas as pd
import numpy as np
data = {'A': [1, 2, np.nan, 4, 5],
        'B': [5, np.nan, 7, 8, 9],
        'C': [10, 11, 12, np.nan, 14]}
df = pd.DataFrame(data)
Identifying Missing Values
Here, df.isnull() creates a boolean mask indicating missing values. df.isnull().sum() then counts the number of missing values in each column.
print("Original DataFrame:\n", df)
print("\nMissing Values (Boolean Mask):\n", df.isnull())
print("\nNumber of Missing Values per Column:\n", df.isnull().sum())
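The same boolean mask can answer a few related questions, such as what fraction of each column is missing or which rows are affected. The following is a minimal sketch that rebuilds the sample DataFrame so it runs on its own; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5],
                   'B': [5, np.nan, 7, 8, 9],
                   'C': [10, 11, 12, np.nan, 14]})

# Fraction of missing values per column (mean of the boolean mask)
missing_fraction = df.isna().mean()

# Number of missing values in each row
missing_per_row = df.isna().sum(axis=1)

# Only the rows that contain at least one missing value
rows_with_missing = df[df.isna().any(axis=1)]

print(missing_fraction)
print(missing_per_row)
print(rows_with_missing)
```

isna() is an alias of isnull(), so either spelling gives the same mask.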
Filling Missing Values
This section demonstrates several methods for filling missing values:
- df.fillna(0): Fills all missing values with 0.
- df.fillna(df.mean()): Fills missing values with the mean of their respective columns.
- df.ffill(): Uses forward fill, propagating the last valid observation forward.
- df.bfill(): Uses backward fill, propagating the next valid observation backward.
The older spellings df.fillna(method='ffill') and df.fillna(method='bfill') are deprecated since pandas 2.1, so the dedicated methods are used here.
# Fill missing values with 0
df_filled_0 = df.fillna(0)
print("\nDataFrame with Missing Values Filled with 0:\n", df_filled_0)
# Fill missing values with the mean of the column
df_filled_mean = df.fillna(df.mean())
print("\nDataFrame with Missing Values Filled with Mean:\n", df_filled_mean)
# Forward fill (propagate last valid observation forward)
df_ffill = df.ffill()
print("\nDataFrame with Forward Fill:\n", df_ffill)
# Backward fill (propagate next valid observation backward)
df_bfill = df.bfill()
print("\nDataFrame with Backward Fill:\n", df_bfill)
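To make the differences between these strategies concrete, here is a small sketch that applies them to the same sample data and checks the values each one puts in the gaps. It also shows one edge case that is easy to miss: forward fill cannot fill a leading missing value, because there is no earlier observation to propagate. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5],
                   'B': [5, np.nan, 7, 8, 9],
                   'C': [10, 11, 12, np.nan, 14]})

# Column means skip NaN by default: A = 3.0, B = 7.25, C = 11.75
col_means = df.mean()
filled_mean = df.fillna(col_means)

# Forward fill copies the value from the row above the gap
filled_ffill = df.ffill()

# Backward fill copies the value from the row below the gap
filled_bfill = df.bfill()

print(filled_mean.loc[2, 'A'], filled_ffill.loc[2, 'A'], filled_bfill.loc[2, 'A'])
# prints: 3.0 2.0 4.0

# A leading NaN has nothing before it, so forward fill leaves it missing
leading = pd.Series([np.nan, 1.0, np.nan]).ffill()
print(leading.isna().tolist())
# prints: [True, False, False]
```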
Dropping Rows or Columns with Missing Values
This section demonstrates how to drop rows or columns containing missing values:
- df.dropna(): Drops rows containing any missing values.
- df.dropna(axis=1): Drops columns containing any missing values.
- df.dropna(thresh=2): Keeps only rows with at least 2 non-NA values.
# Drop rows with any missing values
df_dropped_rows = df.dropna()
print("\nDataFrame with Rows with Missing Values Dropped:\n", df_dropped_rows)
# Drop columns with any missing values
df_dropped_cols = df.dropna(axis=1)
print("\nDataFrame with Columns with Missing Values Dropped:\n", df_dropped_cols)
# Keep only rows with at least 2 non-NA values (drop the rest)
df_dropped_thresh = df.dropna(thresh=2)
print("\nDataFrame with Rows with at least 2 Non-NA values:\n", df_dropped_thresh)
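Because every row of the sample DataFrame has at most one missing value, thresh=2 keeps all five rows. The sketch below contrasts it with thresh=3 and with two other dropna parameters, subset and how, which are part of the pandas API but not used in the snippet above. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5],
                   'B': [5, np.nan, 7, 8, 9],
                   'C': [10, 11, 12, np.nan, 14]})

# Every row has at least 2 non-NA values, so nothing is dropped
kept_thresh2 = df.dropna(thresh=2)

# Requiring 3 non-NA values keeps only the fully populated rows
kept_thresh3 = df.dropna(thresh=3)

# Only consider column A when deciding which rows to drop
kept_subset = df.dropna(subset=['A'])

# Drop a row only if all of its values are missing (none are here)
kept_all = df.dropna(how='all')

print(len(kept_thresh2), kept_thresh3.index.tolist(), kept_subset.index.tolist())
# prints: 5 [0, 4] [0, 1, 3, 4]
```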
Real-Life Use Case
In a real-world scenario, you might encounter missing data in customer information, sensor readings, or financial records. Properly handling this missing data is crucial for accurate analysis and model building. For example, you could use mean imputation for filling in missing age values in a customer dataset.
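As a rough illustration of the age example, the sketch below imputes missing ages with the column mean. The customers table, its column names, and its values are made up for this example, and the age_was_missing flag is an optional extra so the imputed rows can still be identified later.

```python
import numpy as np
import pandas as pd

# Hypothetical customer data; names and values are invented for illustration
customers = pd.DataFrame({
    'customer_id': [101, 102, 103, 104],
    'age': [34, np.nan, 52, 28],
})

# Record which ages were missing before they are overwritten
customers['age_was_missing'] = customers['age'].isna()

# Mean of the known ages: (34 + 52 + 28) / 3 = 38.0
mean_age = customers['age'].mean()
customers['age'] = customers['age'].fillna(mean_age)

print(customers)
```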
Best Practices
When to Use Them
- fillna(0) when a missing value represents the absence of a quantity.
- fillna(df.mean()) for continuous data when you want to preserve the column mean. Note that mean imputation reduces the column's variance, so it does not preserve the full distribution.
- ffill or bfill for time-series data when values are likely to be similar to adjacent entries.
- dropna() when missing values are infrequent and removing them won't significantly reduce the dataset size.
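For the time-series case, a long gap can make forward-filled values stale. One possible approach, sketched below on made-up daily sensor readings, is the limit parameter of ffill, which caps how many consecutive missing values are filled. The series name, dates, and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sensor readings with a three-day gap
index = pd.date_range('2024-01-01', periods=6, freq='D')
readings = pd.Series([20.0, np.nan, np.nan, np.nan, 21.5, np.nan], index=index)

# Fill at most 2 consecutive missing values after each valid reading
filled = readings.ffill(limit=2)

print(filled)
# The third day of the gap (2024-01-04) stays NaN; the last day becomes 21.5
```

Leaving the remainder of a long gap as NaN keeps it visible for a later decision, such as dropping or interpolating those entries.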
Alternatives
Pros and Cons
fillna
dropna
FAQ
- What does NaN stand for?
NaN stands for 'Not a Number'. It's a special floating-point value used to represent missing or undefined numerical data.
- How do I replace specific missing values with different values?
You can use the fillna() method with a dictionary to specify different replacement values for different columns. For example: df.fillna({'A': 0, 'B': df['B'].mean()}).
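The dictionary form can be tried on the sample data from this snippet. In this sketch, columns that are not listed in the dictionary (here C) keep their missing values.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5],
                   'B': [5, np.nan, 7, 8, 9],
                   'C': [10, 11, 12, np.nan, 14]})

# Column A gets 0, column B gets its own mean (7.25), column C is untouched
filled = df.fillna({'A': 0, 'B': df['B'].mean()})

print(filled)
print(filled.isna().sum())  # only column C still has a missing value
```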