Pandas Boolean Indexing for Data Selection
This snippet demonstrates how to use boolean indexing in Pandas to select data from a DataFrame based on one or more conditions. It shows how to create boolean masks and apply them to filter rows.
Introduction to Boolean Indexing
Boolean indexing (also known as boolean masking) is a powerful technique in Pandas for selecting rows from a DataFrame that meet specific criteria. It involves creating a boolean array (a mask) that indicates which rows should be selected (True) and which rows should be excluded (False). This mask is then used to index the DataFrame, returning only the rows where the mask is True.
Creating a Sample DataFrame
This code creates a sample Pandas DataFrame with columns 'Name', 'Age', 'City', and 'Salary'. This DataFrame will be used in the following examples to demonstrate boolean indexing.
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 22, 28, 24],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
    'Salary': [60000, 75000, 55000, 80000, 65000]
}
df = pd.DataFrame(data)
print(df)
Creating a Boolean Mask
This code creates a boolean mask based on the condition df['Age'] > 25. The mask will be a Pandas Series of boolean values, where each value corresponds to a row in the DataFrame. A value of True indicates that the age in that row is greater than 25, and a value of False indicates that it is not.
mask = df['Age'] > 25
print(mask)
Applying the Boolean Mask
This code applies the boolean mask to the DataFrame. The result, filtered_df, will contain only the rows where the corresponding value in the mask is True. In this case, it will contain the rows for Bob and David, as their ages are greater than 25.
filtered_df = df[mask]
print(filtered_df)
Combining Multiple Conditions
This code combines two conditions using the & (and) operator. The first condition is df['Age'] > 25, and the second condition is df['City'] != 'London' (city is not London). The resulting mask will be True only for rows that satisfy both conditions. In this case, it will only contain the row for David.
mask = (df['Age'] > 25) & (df['City'] != 'London')
filtered_df = df[mask]
print(filtered_df)
Using the OR Operator
This code combines two conditions using the | (or) operator. The first condition is df['Age'] < 23, and the second condition is df['Salary'] > 70000. The resulting mask will be True for rows that satisfy at least one of the conditions. In this case, it will contain the rows for Bob (Salary > 70000), Charlie (Age < 23), and David (Salary > 70000).
mask = (df['Age'] < 23) | (df['Salary'] > 70000)
filtered_df = df[mask]
print(filtered_df)
Using the .isin() Method
This code uses the .isin() method to check if the value in the 'City' column is present in the list ['New York', 'Paris']. The resulting mask will be True for rows where the city is either 'New York' or 'Paris'. In this case, it will contain the rows for Alice and Charlie.
mask = df['City'].isin(['New York', 'Paris'])
filtered_df = df[mask]
print(filtered_df)
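A related pattern that the snippet above does not show is inverting a mask with the ~ (not) operator, for example to keep the rows whose city is not in the list. The following is a minimal sketch on a reduced version of the sample DataFrame; the variable names in_cities and others are chosen here for illustration.

```python
import pandas as pd

# Reduced version of the sample DataFrame from the earlier examples.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'City': ['New York', 'London', 'Paris', 'Tokyo', 'Sydney'],
})

# True where the city is one of the listed values.
in_cities = df['City'].isin(['New York', 'Paris'])

# ~ flips every value in the mask, so this keeps the remaining rows.
others = df[~in_cities]
print(others)
```

Here others should contain the rows for Bob, David, and Eve.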
Real-Life Use Case
Imagine you are analyzing sales data. You can use boolean indexing to select orders placed in a specific region, with a total value above a certain threshold, and delivered before a specific date.
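The scenario above could be sketched as follows. The orders DataFrame, its column names (order_id, region, total, delivered), and all values are hypothetical and made up for illustration only.

```python
import pandas as pd

# Hypothetical order data; the columns and values are invented for this sketch.
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'region': ['EMEA', 'APAC', 'EMEA', 'EMEA'],
    'total': [250.0, 900.0, 1200.0, 1500.0],
    'delivered': pd.to_datetime(
        ['2024-01-05', '2024-01-10', '2024-01-12', '2024-02-20']
    ),
})

# Three conditions combined with &, each wrapped in its own parentheses.
mask = (
    (orders['region'] == 'EMEA')
    & (orders['total'] > 1000)
    & (orders['delivered'] < pd.Timestamp('2024-02-01'))
)

selected = orders[mask]
print(selected)
```

With this sample data, only order 103 satisfies all three conditions.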
Best Practices
- Wrap each condition in parentheses when combining conditions with & (and) or | (or). This improves readability and prevents unexpected behavior due to operator precedence.
- Avoid chaining filters (e.g., df[df['A'] > 0][df['B'] > 1]) as this can create temporary copies of the DataFrame and reduce performance. Instead, combine the conditions into a single mask: df[(df['A'] > 0) & (df['B'] > 1)].
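The precedence issue can be seen in a small experiment. This is a sketch with a made-up DataFrame using the column names 'A' and 'B' from the example above; the error_raised flag is only there to record what happened.

```python
import pandas as pd

# Small made-up DataFrame for the demonstration.
df = pd.DataFrame({'A': [-1, 2, 3], 'B': [5, 0, 4]})

# Each comparison is wrapped in parentheses, so & combines two boolean Series.
combined = df[(df['A'] > 0) & (df['B'] > 1)]
print(combined)

# Without parentheses, & binds more tightly than >, so Python reads this as
# df['A'] > (0 & df['B']) > 1, a chained comparison that needs the truth
# value of a whole Series and therefore raises a ValueError.
error_raised = False
try:
    df[df['A'] > 0 & df['B'] > 1]
except ValueError as exc:
    error_raised = True
    print('ValueError:', exc)
```

Only the last row has both A > 0 and B > 1, so combined holds that single row.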
Interview Tip
Be prepared to explain how boolean indexing works and to provide examples of how you would use it to solve a data filtering problem. Be able to discuss the use of the & and | operators and the .isin() method.
When to Use Them
Use boolean indexing when you need to select rows from a DataFrame that meet specific criteria. It is particularly useful when you have multiple conditions that need to be combined.
Memory Footprint
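Boolean indexing returns a new DataFrame holding a copy of the selected rows, not a view of the original. Modifying the filtered result therefore does not change the original DataFrame, although some pandas versions may emit a SettingWithCopyWarning when you assign to it. Call the .copy() method on the filtered DataFrame to make its independence explicit, and assign through .loc with the mask when you do want to update the original. Because the selected rows are copied, filtering a large DataFrame temporarily needs memory for both the original and the result.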
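One way to check how a filtered result relates to the original DataFrame is to modify it and compare. The following is a minimal sketch on a small made-up DataFrame; the name older is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'David'], 'Age': [25, 30, 28]})
mask = df['Age'] > 25

# An explicit copy makes it clear that the filtered result is independent.
older = df[mask].copy()
older['Age'] = 0
print(df['Age'].tolist())  # the original ages are unchanged

# To change the matching rows in the original, assign through .loc.
df.loc[mask, 'Age'] = 99
print(df['Age'].tolist())  # the rows selected by the mask are updated
```

Assigning through .loc with the mask is the usual way to update the original rows in place, while the filtered copy can be changed freely.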
Alternatives
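While boolean indexing is a powerful technique, you can also use .loc (label-based) and .iloc (position-based) for data selection. .loc also accepts a boolean mask, which lets you filter rows and pick columns in one step. However, plain boolean indexing is often more concise and readable when dealing with complex filtering conditions.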
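For comparison, here is a short sketch of selection with .loc and .iloc on a reduced version of the sample DataFrame. The variable names names_over_23 and first_two are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 22],
    'City': ['New York', 'London', 'Paris'],
})

# .loc takes a row selector (here a boolean mask) and a list of column labels.
names_over_23 = df.loc[df['Age'] > 23, ['Name', 'City']]
print(names_over_23)

# .iloc selects by integer position: here, the first two rows.
first_two = df.iloc[:2]
print(first_two)
```

Passing the mask to .loc is handy when only some columns are needed, since rows and columns are selected in a single call.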
Pros
Cons
FAQ
- How do I combine multiple conditions?
  Use the & (and) and | (or) operators to combine multiple conditions. Remember to wrap each condition in parentheses so that the conditions are grouped correctly.
- How do I check if a value is in a list?
  Use the .isin() method to check if a value in a column is present in a list of values.
- Does boolean indexing create a copy of the DataFrame?
  Yes. Boolean indexing returns a copy of the selected rows, not a view, so changes to the result do not affect the original DataFrame. Calling the .copy() method on the result makes this explicit.