Creating and Manipulating Pandas DataFrames

This snippet demonstrates how to create, access, and manipulate Pandas DataFrames, a fundamental data structure for tabular data analysis in Python.

Creating a Pandas DataFrame

This code shows three ways to create a Pandas DataFrame: from a dictionary, from a list of dictionaries, and from a list of lists. When created from a dictionary, keys become the column names. When created from a list of dictionaries, each dictionary represents a row. When created from a list of lists, the `columns` parameter specifies the column names.

import pandas as pd

# Creating a DataFrame from a dictionary
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 28],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}
df1 = pd.DataFrame(data)
print("DataFrame from a dictionary:\n", df1)

# Creating a DataFrame from a list of dictionaries
data_list = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'London'},
    {'Name': 'Charlie', 'Age': 22, 'City': 'Paris'},
    {'Name': 'David', 'Age': 28, 'City': 'Tokyo'}
]
df2 = pd.DataFrame(data_list)
print("\nDataFrame from a list of dictionaries:\n", df2)

# Creating a DataFrame from a list of lists
data_lol = [["Alice", 25, "New York"], ["Bob", 30, "London"], ["Charlie", 22, "Paris"], ["David", 28, "Tokyo"]]
df3 = pd.DataFrame(data_lol, columns=["Name", "Age", "City"])
print("\nDataFrame from a list of lists:\n", df3)

Accessing Data in a DataFrame

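This demonstrates how to access data within a DataFrame by column name, by row label (`.loc`), by row position (`.iloc`), and by a specific row/column pair. `.loc` uses index labels, while `.iloc` uses zero-based integer positions. Because this DataFrame has the default integer index (0, 1, 2, 3), labels and positions happen to coincide here; with a custom index they would not.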

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 28],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)

# Accessing a column
print("Name column:\n", df['Name'])

# Accessing a row using .loc (label-based)
print("\nRow at index 1 (Bob):\n", df.loc[1])

# Accessing a row using .iloc (integer-based)
print("\nRow at position 2 (Charlie):\n", df.iloc[2])

# Accessing a specific element
print("\nAge of Alice:", df.loc[0, 'Age'])
print("\nAge of Bob:", df.iloc[1, 1])

Modifying a DataFrame

This illustrates how to modify existing elements, add new columns, add new rows, and delete columns and rows from a Pandas DataFrame. A new row is added here with `pd.concat`, which combines the existing DataFrame with a one-row DataFrame (`DataFrame.append` was removed in pandas 2.0). The column is deleted in place with the `del` keyword. The row is deleted with the `drop` method, which returns a new DataFrame rather than modifying the original, so the result is assigned back to `df`.

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
        'Age': [25, 30, 22, 28],
        'City': ['New York', 'London', 'Paris', 'Tokyo']}
df = pd.DataFrame(data)

# Modifying an element
df.loc[1, 'Age'] = 31
print("DataFrame after modifying Bob's age:\n", df)

# Adding a new column
df['Salary'] = [60000, 70000, 55000, 65000]
print("\nDataFrame after adding a salary column:\n", df)

# Adding a new row
new_row = {'Name': 'Eve', 'Age': 24, 'City': 'Sydney', 'Salary': 58000}
df = pd.concat([df, pd.DataFrame([new_row])], ignore_index=True)
print("\nDataFrame after adding a new row:\n", df)

# Deleting a column
del df['City']
print("\nDataFrame after deleting the City column:\n", df)

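# Deleting a row
# After the concat above the row labels are 0..4, so label 5 does not exist
# and df.drop(5) would raise a KeyError. Drop the last label instead.
df = df.drop(df.index[-1])
print("\nDataFrame after deleting the last row:\n", df)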
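
The row and column operations above have a few common variants. The following is a minimal sketch, assuming a small three-row DataFrame with the default integer index; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 22],
                   'City': ['New York', 'London', 'Paris']})

# drop returns a new DataFrame; the original df is left unchanged
without_city = df.drop(columns='City')
without_bob = df.drop(index=1)

# Remove the last row by position, whatever its label happens to be
without_last = df.drop(df.index[-1])

# Append a row in place by assigning to a label that is not yet used.
# This assumes the labels are 0..n-1, so len(df) is a free label.
df.loc[len(df)] = ['David', 28, 'Tokyo']
print(df)
```

Unlike `del`, `drop` leaves the original object untouched unless the result is assigned back, which makes it easier to chain with other operations.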

Real-Life Use Case: Analyzing Sales Data

Imagine you have sales data with columns like 'Product', 'Date', 'Quantity', and 'Price'. A Pandas DataFrame can store this data, allowing you to calculate total sales, identify best-selling products, and analyze sales trends over time.
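
As a rough sketch of that use case, the example below builds a tiny sales table and computes the three quantities mentioned. The records are made up for illustration, and the `Revenue` column and variable names are assumptions, not part of any real dataset.

```python
import pandas as pd

# Made-up sales records, for illustration only
sales = pd.DataFrame({
    'Product': ['Widget', 'Gadget', 'Widget', 'Gizmo', 'Gadget'],
    'Date': pd.to_datetime(['2024-01-05', '2024-01-17', '2024-02-02',
                            '2024-02-10', '2024-02-21']),
    'Quantity': [3, 1, 5, 2, 4],
    'Price': [10.0, 25.0, 10.0, 40.0, 25.0],
})

# Revenue for each line item
sales['Revenue'] = sales['Quantity'] * sales['Price']

# Total sales across all rows
total_sales = sales['Revenue'].sum()

# Units sold per product, best seller first
units_by_product = (sales.groupby('Product')['Quantity']
                         .sum()
                         .sort_values(ascending=False))

# Revenue per calendar month, as a simple view of the trend over time
monthly_revenue = sales.groupby(sales['Date'].dt.to_period('M'))['Revenue'].sum()

print(total_sales)
print(units_by_product)
print(monthly_revenue)
```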

Best Practices

  • Use descriptive column names: Clear and meaningful column names make your data easier to understand and work with.
  • Set an appropriate index: Choose a column to serve as the index if it provides meaningful information for accessing rows.
  • Handle missing data: Be aware of missing data (NaN) and use appropriate methods to handle it.
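
The last two practices can be sketched as follows, assuming the same kind of people table used earlier with two values deliberately left missing. Filling `Age` with the column mean and `City` with `'Unknown'` is just one possible policy.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, None, 22, 28],                      # Bob's age is missing
    'City': ['New York', 'London', None, 'Tokyo'],  # Charlie's city is missing
})

# Use Name as the index so rows can be looked up by name
people = people.set_index('Name')
alice_city = people.loc['Alice', 'City']

# Count missing values per column
missing_counts = people.isna().sum()

# Option 1: drop every row that has at least one missing value
complete_rows = people.dropna()

# Option 2: fill missing values with a per-column default
filled = people.fillna({'Age': people['Age'].mean(), 'City': 'Unknown'})

print(missing_counts)
print(filled)
```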

Interview Tip

Be prepared to discuss common DataFrame operations such as filtering, sorting, grouping, and merging data. These are essential for data analysis tasks.
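
A compact sketch of those four operations is shown below. It reuses the people table from the earlier examples, except that David's city is changed to London so the grouping has a group with more than one row. The `countries` lookup table is hypothetical.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28],
    'City': ['New York', 'London', 'Paris', 'London'],
})

# Hypothetical lookup table used only to demonstrate merging
countries = pd.DataFrame({
    'City': ['New York', 'London', 'Paris'],
    'Country': ['USA', 'UK', 'France'],
})

# Filtering: keep rows where Age is greater than 25
older = people[people['Age'] > 25]

# Sorting: oldest first
by_age = people.sort_values('Age', ascending=False)

# Grouping: mean age per city
mean_age_by_city = people.groupby('City')['Age'].mean()

# Merging: attach the country for each city (left join keeps every person)
merged = people.merge(countries, on='City', how='left')

print(merged)
```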

Concepts Behind the Snippet

This code demonstrates the basic operations on Pandas DataFrames, including creation, data access, modification, adding rows and columns, and deleting rows and columns. Understanding these operations is crucial for effective data manipulation and analysis using Pandas.

Alternatives

Alternatives to Pandas DataFrames include NumPy arrays (for numerical data), dictionaries of lists, and SQL databases. The choice depends on the size of the data, the complexity of the analysis, and performance requirements.
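
Moving between a DataFrame and two of these alternatives is straightforward. The sketch below assumes an all-numeric DataFrame, so the NumPy array has a single numeric dtype; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 22, 28],
                   'Salary': [60000, 70000, 55000, 65000]})

# DataFrame -> NumPy array (column names and index are dropped)
as_array = df.to_numpy()

# DataFrame -> dictionary of lists (index is dropped)
as_dict = df.to_dict(orient='list')

# Dictionary of lists -> DataFrame again
round_trip = pd.DataFrame(as_dict)

print(as_array.shape)
print(as_dict)
```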

FAQ

  • What is the difference between `.loc` and `.iloc`?

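    `.loc` is label-based, meaning it uses the index labels to access rows and columns. `.iloc` is integer-based, meaning it uses zero-based integer positions to access rows and columns. Slicing also differs: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position.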
  • How do I filter a DataFrame based on a condition?

    You can use boolean indexing to filter a DataFrame. For example, `df[df['Age'] > 25]` will return a DataFrame containing only rows where the 'Age' column is greater than 25.
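
Both answers can be illustrated with one small sketch. It assumes a custom integer index (10, 20, 30, 40) chosen so that labels and positions no longer coincide.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [25, 30, 22, 28]},
    index=[10, 20, 30, 40],  # labels deliberately differ from positions
)

# Same cell, reached by label and by position
by_label = df.loc[20, 'Name']     # row labelled 20, column 'Name'
by_position = df.iloc[1, 0]       # second row, first column

# .loc slices include the end label; .iloc slices exclude the end position
loc_slice = df.loc[10:30]         # rows labelled 10, 20 and 30
iloc_slice = df.iloc[0:2]         # rows at positions 0 and 1

# Boolean indexing with a single condition
over_25 = df[df['Age'] > 25]

# Combine conditions with & or |, wrapping each condition in parentheses
between = df[(df['Age'] > 22) & (df['Age'] < 30)]

print(by_label, by_position)
print(between)
```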