Data Type Conversion in Pandas
This snippet demonstrates how to convert the data types of columns in a Pandas DataFrame. Converting to the correct data type is important for efficient memory usage and accurate analysis.
Creating a DataFrame
This section imports the pandas library and creates a sample DataFrame df in which every column is initially of type object (string).
import pandas as pd
data = {'col1': ['1', '2', '3'],
'col2': ['4.5', '5.6', '6.7'],
'col3': ['True', 'False', 'True']}
df = pd.DataFrame(data)
Checking Data Types
This code prints the original DataFrame and then displays the data type of each column using df.dtypes. You'll see that all columns are initially 'object'.
print("Original DataFrame:\n", df)
print("\nData Types:\n", df.dtypes)
Converting Data Types
This section demonstrates how to convert the data types of the columns:
- pd.to_numeric(df['col1'], errors='coerce').astype('Int64'): Converts 'col1' to integer. With errors='coerce', pd.to_numeric replaces values it cannot parse with NaN instead of raising an error. We use .astype('Int64') instead of int so that missing values can still be represented (as <NA>).
- pd.to_numeric(df['col2'], errors='coerce'): Converts 'col2' to float, also using errors='coerce'.
- df['col3'].map({'True': True, 'False': False}): Converts 'col3' to boolean. Calling .astype('bool') on an object column of strings does not parse the text: every non-empty string, including 'False', is truthy and becomes True, so the values are mapped explicitly instead.
The code then prints the DataFrame and the updated data types.
# Convert 'col1' to a nullable integer
df['col1'] = pd.to_numeric(df['col1'], errors='coerce').astype('Int64')
# Convert 'col2' to float
df['col2'] = pd.to_numeric(df['col2'], errors='coerce')
# Convert 'col3' to boolean by mapping the string labels explicitly
df['col3'] = df['col3'].map({'True': True, 'False': False})
print("\nDataFrame after Conversion:\n", df)
print("\nData Types after Conversion:\n", df.dtypes)
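The sample data above contains only valid numbers, so the effect of errors='coerce' never shows up. The following is a minimal sketch of what happens when a value cannot be parsed; the Series raw and its contents are made up for illustration.

```python
import pandas as pd

# Hypothetical column containing one value that is not a number
raw = pd.Series(['1', '2', 'abc'])

# Default behaviour (errors='raise'): the unparseable value raises ValueError
try:
    pd.to_numeric(raw)
    raised = True if False else False
except ValueError:
    raised = True

# errors='coerce': the unparseable value becomes NaN instead
coerced = pd.to_numeric(raw, errors='coerce')

# Casting to the nullable 'Int64' dtype keeps the gap as <NA>
as_int = coerced.astype('Int64')

print(raised)
print(coerced)
print(as_int)
```

Counting the missing values after coercion (for example with coerced.isna().sum()) is a quick way to see how many entries failed to parse.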
Real-Life Use Case
In data analysis, it's common to receive data where numeric columns are stored as strings. Converting these columns to numeric types (int or float) is essential for performing mathematical operations and building statistical models. Similarly, converting string representations of booleans to actual boolean values enables logical operations.
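As a rough illustration of why the conversion matters, the sketch below sums a hypothetical amount column before and after converting it. The DataFrame sales and its values are invented, and the column is created with an explicit object dtype to mirror text data as described in this article.

```python
import pandas as pd

# Hypothetical sales export where the amounts arrived as text
sales = pd.DataFrame({'amount': ['10.5', '20.0', '4.5']}, dtype=object)

# Summing the text column concatenates the strings instead of adding numbers
text_total = sales['amount'].sum()

# After conversion the same call performs real arithmetic
sales['amount'] = pd.to_numeric(sales['amount'], errors='coerce')
numeric_total = sales['amount'].sum()

print(repr(text_total))
print(numeric_total)
```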
Best Practices
- Use errors='coerce' when converting to numeric types to handle potential errors gracefully.
- Use the Categorical data type for columns with a limited number of unique values to save memory.
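The memory saving from the categorical type can be checked directly. This is a small sketch on an invented Series named cities; the actual saving depends on how many distinct values the column holds and how long the strings are.

```python
import pandas as pd

# Hypothetical column with many rows but only three distinct labels
cities = pd.Series(['Paris', 'London', 'Tokyo'] * 1000)

# Convert the repeated strings to the categorical dtype
cities_cat = cities.astype('category')

# deep=True also counts the bytes used by the string data itself
bytes_before = cities.memory_usage(deep=True)
bytes_after = cities_cat.memory_usage(deep=True)

print(bytes_before, bytes_after)
```

A categorical column stores each distinct label once and keeps small integer codes per row, which is why it pays off mainly when values repeat often.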
When to Use Them
Pros and Cons
pd.to_numeric
astype
FAQ
- What happens if I try to convert a string that cannot be converted to a number?
If you don't use errors='coerce', the conversion will raise a ValueError. If you use errors='coerce', the invalid values will be replaced with NaN.
- Why use 'Int64' instead of 'int'?
'Int64' is a nullable integer type, meaning it can hold missing values, which it displays as <NA>. The standard 'int' type cannot represent missing values at all. If your data has missing values and you want to keep an integer column, use a nullable integer type.
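The difference between the two integer types can be seen in a short sketch. The Series values below is a made-up example with one missing entry.

```python
import pandas as pd

# Hypothetical column with one missing entry
values = pd.Series([1, 2, None])

# With a missing value present, pandas falls back to float64
default_dtype = str(values.dtype)

# The plain integer dtype cannot store the gap and raises an error
try:
    values.astype('int64')
    plain_int_ok = True
except (ValueError, TypeError):
    plain_int_ok = False

# The nullable 'Int64' dtype keeps integers and marks the gap as <NA>
nullable = values.astype('Int64')

print(default_dtype, plain_int_ok)
print(nullable)
```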