Creating and Manipulating Pandas DataFrames
This snippet demonstrates how to create, access, and manipulate Pandas DataFrames, a fundamental data structure for tabular data analysis in Python.
Creating a Pandas DataFrame
This code shows three ways to create a Pandas DataFrame: from a dictionary, from a list of dictionaries, and from a list of lists. When created from a dictionary, keys become the column names. When created from a list of dictionaries, each dictionary represents a row. When created from a list of lists, the `columns` parameter specifies the column names.
import pandas as pd
# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 28],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}
df1 = pd.DataFrame(data)
print("DataFrame from a dictionary:\n", df1)
# Creating a DataFrame from a list of dictionaries
data_list = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'London'},
    {'Name': 'Charlie', 'Age': 22, 'City': 'Paris'},
    {'Name': 'David', 'Age': 28, 'City': 'Tokyo'}
]
df2 = pd.DataFrame(data_list)
print("\nDataFrame from a list of dictionaries:\n", df2)
# Creating a DataFrame from a list of lists
data_lol = [["Alice", 25, "New York"], ["Bob", 30, "London"], ["Charlie", 22, "Paris"], ["David", 28, "Tokyo"]]
df3 = pd.DataFrame(data_lol, columns=["Name", "Age", "City"])
print("\nDataFrame from a list of lists:\n", df3)
Accessing Data in a DataFrame
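This demonstrates how to access data within a DataFrame: whole columns by name, rows by label (`.loc`) or by position (`.iloc`), and single elements by row and column. `.loc` uses index labels, while `.iloc` uses integer positions. With the default integer index used here the two coincide, so `df.loc[1]` and `df.iloc[1]` both return Bob's row.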
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 28],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)
# Accessing a column
print("Name column:\n", df['Name'])
# Accessing a row using .loc (label-based)
print("\nRow at index 1 (Bob):\n", df.loc[1])
# Accessing a row using .iloc (integer-based)
print("\nRow at position 2 (Charlie):\n", df.iloc[2])
# Accessing a specific element
print("\nAge of Alice:", df.loc[0, 'Age'])
print("\nAge of Bob:", df.iloc[1, 1])
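The difference between `.loc` and `.iloc` is easier to see when the index is not the default 0, 1, 2, ... The following is a minimal sketch, assuming the same people are indexed by name instead; that choice of index is an illustration, not part of the original snippet.

```python
import pandas as pd

# Same data as above, but indexed by name rather than the default integers.
df = pd.DataFrame(
    {"Age": [25, 30, 22, 28],
     "City": ["New York", "London", "Paris", "Tokyo"]},
    index=["Alice", "Bob", "Charlie", "David"],
)

# .loc looks rows up by label.
bob_by_label = df.loc["Bob"]

# .iloc looks rows up by position; Bob is the second row, so position 1.
bob_by_position = df.iloc[1]

print(bob_by_label.equals(bob_by_position))  # True

# Slices differ too: .loc includes the end label, .iloc excludes the end position.
print(len(df.loc["Alice":"Charlie"]))  # 3
print(len(df.iloc[0:2]))               # 2
```

With this index, `df.loc[1]` would raise a `KeyError` because no row carries the label 1, while `df.iloc[1]` still works.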
Modifying a DataFrame
This illustrates how to modify existing elements, add columns and rows, and delete columns and rows. The new row is added with `pd.concat`, which combines the existing DataFrame with a one-row DataFrame; the older `DataFrame.append` method was removed in pandas 2.0. The column is deleted with the `del` keyword, and the row with the `drop` method, which returns a new DataFrame rather than changing the original in place.
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 28],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)
# Modifying an element
df.loc[1, 'Age'] = 31
print("DataFrame after modifying Bob's age:\n", df)
# Adding a new column
df['Salary'] = [60000, 70000, 55000, 65000]
print("\nDataFrame after adding a salary column:\n", df)
# Adding a new row
new_row = {'Name': 'Eve', 'Age': 24, 'City': 'Sydney', 'Salary': 58000}
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
print("\nDataFrame after adding a new row:\n", df)
# Deleting a column
del df['City']
print("\nDataFrame after deleting the City column:\n", df)
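# Deleting a row
# After concat with ignore_index=True the rows are labeled 0-4, so label 5 does not exist;
# df.index[-1] picks the label of the last row (Eve).
df = df.drop(df.index[-1])
print("\nDataFrame after deleting the last row:\n", df)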
Real-Life Use Case: Analyzing Sales Data
Imagine you have sales data with columns like 'Product', 'Date', 'Quantity', and 'Price'. A Pandas DataFrame can store this data, allowing you to calculate total sales, identify best-selling products, and analyze sales trends over time.
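As a rough sketch of that scenario, the example below builds a small DataFrame with the four columns named above and derives the three results mentioned. The records themselves are invented for illustration, and the `Revenue` column and the variable names are assumptions, not part of the original.

```python
import pandas as pd

# Hypothetical sales records; the values are made up for illustration.
sales = pd.DataFrame({
    "Product": ["Widget", "Gadget", "Widget", "Gadget", "Widget"],
    "Date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-14", "2024-02-28"]
    ),
    "Quantity": [10, 5, 8, 12, 4],
    "Price": [2.50, 10.00, 2.50, 10.00, 2.50],
})

# Revenue per line item, then the overall total.
sales["Revenue"] = sales["Quantity"] * sales["Price"]
total_sales = sales["Revenue"].sum()

# Units sold per product, best seller first.
units_by_product = (
    sales.groupby("Product")["Quantity"].sum().sort_values(ascending=False)
)

# Revenue per calendar month as a simple trend.
monthly_revenue = sales.groupby(sales["Date"].dt.to_period("M"))["Revenue"].sum()

print("Total sales:", total_sales)
print("Best seller:", units_by_product.index[0])
print(monthly_revenue)
```

Grouping by `dt.to_period("M")` is one of several ways to bucket dates by month; `resample` on a datetime index would be another option.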
Best Practices
Interview Tip
Be prepared to discuss common DataFrame operations such as filtering, sorting, grouping, and merging data. These are essential for data analysis tasks.
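The four operations named in this tip can be sketched in a few lines. The example reuses the people data from the snippets above; the `continents` lookup table is hypothetical and exists only to give the merge and the grouping something to work with.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 22, 28],
    "City": ["New York", "London", "Paris", "Tokyo"],
})

# Hypothetical lookup table, introduced only to demonstrate merging and grouping.
continents = pd.DataFrame({
    "City": ["New York", "London", "Paris", "Tokyo"],
    "Continent": ["North America", "Europe", "Europe", "Asia"],
})

# Filtering: keep rows where the condition is True.
over_25 = people[people["Age"] > 25]

# Sorting: oldest first.
by_age = people.sort_values("Age", ascending=False)

# Merging: attach the continent to each person via the shared City column.
merged = people.merge(continents, on="City", how="left")

# Grouping: average age per continent.
mean_age_by_continent = merged.groupby("Continent")["Age"].mean()

print(over_25)
print(by_age)
print(mean_age_by_continent)
```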
Concepts Behind the Snippet
This code demonstrates the basic operations on Pandas DataFrames: creating them, accessing data, modifying values, and adding or deleting rows and columns. Understanding these operations is crucial for effective data manipulation and analysis using Pandas.
Alternatives
Alternatives to Pandas DataFrames include NumPy arrays (for numerical data), dictionaries of lists, and SQL databases. The choice depends on the size of the data, the complexity of the analysis, and performance requirements.
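To make the comparison concrete, here is a small sketch that computes the same average age with a plain dictionary of lists, a NumPy array, and a DataFrame. It is meant only to show how the same task reads in each form, not to measure performance.

```python
import numpy as np
import pandas as pd

data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 22, 28],
}

# Dictionary of lists: no dependencies, but every operation is written by hand.
mean_plain = sum(data["Age"]) / len(data["Age"])

# NumPy array: fast numerical operations, but no column labels or mixed types.
mean_numpy = np.array(data["Age"]).mean()

# DataFrame: labeled columns with mixed types and built-in aggregation.
mean_pandas = pd.DataFrame(data)["Age"].mean()

print(mean_plain, mean_numpy, mean_pandas)
```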
FAQ
- What is the difference between `.loc` and `.iloc`?
  `.loc` is label-based, meaning it uses the index labels to access rows and columns. `.iloc` is integer-based, meaning it uses integer positions to access rows and columns.
- How do I filter a DataFrame based on a condition?
  You can use boolean indexing to filter a DataFrame. For example, `df[df['Age'] > 25]` will return a DataFrame containing only rows where the 'Age' column is greater than 25.
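The boolean-indexing answer can be extended to combined conditions. The sketch below reuses the people data from the earlier snippets; the particular conditions are chosen for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [25, 30, 22, 28],
    "City": ["New York", "London", "Paris", "Tokyo"],
})

# Single condition: the comparison yields a boolean Series used as a row mask.
older = df[df["Age"] > 25]

# Combined conditions: use & (and) or | (or), with each condition in parentheses.
older_not_london = df[(df["Age"] > 25) & (df["City"] != "London")]

# Membership test against a list of values.
in_europe = df[df["City"].isin(["London", "Paris"])]

# The same single-condition filter written with DataFrame.query.
older_query = df.query("Age > 25")

print(older["Name"].tolist())             # ['Bob', 'David']
print(older_not_london["Name"].tolist())  # ['David']
print(in_europe["Name"].tolist())         # ['Bob', 'Charlie']
```

Python's `and` and `or` keywords do not work on a Series; the element-wise operators `&` and `|` are needed, and the parentheses matter because those operators bind more tightly than comparisons.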