Day 06: Data Manipulation with Pandas
📑 Table of Contents
- 🌟 Welcome to Day 6
- 📚 Introduction to Pandas
  - What is Pandas?
  - Benefits of Using Pandas
  - Installing Pandas
- 🧩 Core Concepts
  - Series
  - DataFrame
  - Indexing and Selecting Data
  - Handling Missing Data
  - Merging and Joining DataFrames
  - Grouping and Aggregating Data
  - Pivot Tables
  - Datetime Operations
- 💻 Hands-On Coding
  - Example Scripts
- 🧩 Interactive Exercises
- 📚 Resources
- 💡 Tips and Tricks
- 💡 Best Practices
- 💡 Advanced Topics
- 💡 Real-World Applications
- 💡 Machine Learning Integration
- 💡 Conclusion
1. 🌟 Welcome to Day 6
Welcome to Day 6 of "Becoming a Scikit-Learn Boss in 90 Days"! 🎉 Today, we delve into the powerful world of Pandas, a cornerstone library for data manipulation and analysis in Python. Pandas provides data structures like Series and DataFrame that make handling structured data intuitive and efficient. Mastering Pandas is essential for cleaning, transforming, and preparing data for machine learning models. Let's get started and unlock the full potential of your data! 🚀
2. 📚 Introduction to Pandas
What is Pandas?
Pandas is an open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is built on top of NumPy and integrates seamlessly with many other data science libraries like Matplotlib, Scikit-Learn, and SciPy.
Key Features:
- DataFrame Object: A powerful 2-dimensional data structure with labeled axes (rows and columns).
- Handling Missing Data: Easily manage and clean incomplete data.
- Data Alignment and Integrated Handling of Time Series: Efficiently align data based on labels.
- Flexible Group By Functionality: Perform split-apply-combine operations on datasets.
- Rich Functionality for Data Manipulation: Including merging, joining, reshaping, and pivoting.
Benefits of Using Pandas
- Ease of Use: Intuitive data structures and functions for data manipulation.
- Performance: Optimized for performance with large datasets.
- Flexibility: Supports a wide range of data formats and operations.
- Integration: Works well with other data science and visualization libraries.
- Community Support: Extensive documentation and a large user community.
Installing Pandas
If you haven't installed Pandas yet, you can do so using pip:
pip install pandas
Or, if you're using Anaconda:
conda install pandas
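Once installed, a quick way to confirm everything works is to import Pandas and print its version:
import pandas as pd
# Confirm the installation by printing the installed version
print(pd.__version__)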
3. 🧩 Core Concepts
📝 Series
A Series is a one-dimensional labeled array capable of holding any data type.
Creating a Series:
import pandas as pd
# From a Python list
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)
Output:
0 10
1 20
2 30
3 40
4 50
dtype: int64
Creating a Series with Custom Index:
data = [10, 20, 30, 40, 50]
index = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(data, index=index)
print(series)
Output:
a 10
b 20
c 30
d 40
e 50
dtype: int64
📝 DataFrame
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types.
Creating a DataFrame:
import pandas as pd
# From a dictionary
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
3 David 40 Houston
Creating a DataFrame from a List of Lists:
import pandas as pd
data = [
['Alice', 25, 'New York'],
['Bob', 30, 'Los Angeles'],
['Charlie', 35, 'Chicago'],
['David', 40, 'Houston']
]
columns = ['Name', 'Age', 'City']
df = pd.DataFrame(data, columns=columns)
print(df)
Output:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
3 David 40 Houston
📝 Indexing and Selecting Data
Pandas offers multiple ways to select and manipulate data within Series and DataFrames.
Selecting a Column:
import pandas as pd
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
# Selecting the 'Age' column
age = df['Age']
print(age)
Output:
0 25
1 30
2 35
3 40
Name: Age, dtype: int64
Selecting Multiple Columns:
# Selecting 'Name' and 'City' columns
subset = df[['Name', 'City']]
print(subset)
Output:
Name City
0 Alice New York
1 Bob Los Angeles
2 Charlie Chicago
3 David Houston
Selecting Rows by Index:
# Selecting the first two rows
rows = df.iloc[0:2]
print(rows)
Output:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
Selecting Rows by Label:
# Setting a custom index
df.set_index('Name', inplace=True)
print(df)
Output:
Age City
Name
Alice 25 New York
Bob 30 Los Angeles
Charlie 35 Chicago
David 40 Houston
Selecting a Row by Label:
# Selecting the row for 'Charlie'
charlie = df.loc['Charlie']
print(charlie)
Output:
Age 35
City Chicago
Name: Charlie, dtype: object
📝 Handling Missing Data
Missing data is common in real-world datasets. Pandas provides robust methods to handle it.
Detecting Missing Values:
import pandas as pd
import numpy as np
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, np.nan, 35, 40],
'City': ['New York', 'Los Angeles', np.nan, 'Houston']
}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 Alice 25.0 New York
1 Bob NaN Los Angeles
2 Charlie 35.0 NaN
3 David 40.0 Houston
Dropping Missing Values:
# Drop rows with any missing values
cleaned_df = df.dropna()
print(cleaned_df)
Output:
Name Age City
0 Alice 25.0 New York
3 David 40.0 Houston
Filling Missing Values:
# Fill missing 'Age' with the mean age
mean_age = df['Age'].mean()
df['Age'].fillna(mean_age, inplace=True)
# Fill missing 'City' with 'Unknown'
df['City'].fillna('Unknown', inplace=True)
print(df)
Output:
Name Age City
0 Alice 25.000000 New York
1 Bob 33.333333 Los Angeles
2 Charlie 35.000000 Unknown
3 David 40.000000 Houston
📝 Merging and Joining DataFrames
Combine multiple DataFrames using various join operations.
Merging DataFrames:
import pandas as pd
# First DataFrame
data1 = {
'EmployeeID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David']
}
df1 = pd.DataFrame(data1)
# Second DataFrame
data2 = {
'EmployeeID': [3, 4, 5, 6],
'Department': ['HR', 'Engineering', 'Marketing', 'Finance']
}
df2 = pd.DataFrame(data2)
# Merge on 'EmployeeID'
merged_df = pd.merge(df1, df2, on='EmployeeID', how='inner')
print(merged_df)
Output:
EmployeeID Name Department
0 3 Charlie HR
1 4 David Engineering
Joining DataFrames:
import pandas as pd
# Set 'EmployeeID' as index
df1.set_index('EmployeeID', inplace=True)
df2.set_index('EmployeeID', inplace=True)
# Join the DataFrames
joined_df = df1.join(df2, how='outer', lsuffix='_left', rsuffix='_right')
print(joined_df)
Output:
Name Department
EmployeeID
1 Alice NaN
2 Bob NaN
3 Charlie HR
4 David Engineering
5 NaN Marketing
6 NaN Finance
📝 Grouping and Aggregating Data
Group data based on certain criteria and perform aggregate operations.
Grouping Data:
import pandas as pd
data = {
'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing', 'Finance'],
'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'Salary': [50000, 80000, 55000, 90000, 60000, 75000]
}
df = pd.DataFrame(data)
# Group by 'Department'
grouped = df.groupby('Department')
Aggregating Data:
# Calculate mean salary per department
mean_salary = grouped['Salary'].mean()
print(mean_salary)
Output:
Department
Engineering 85000.0
Finance 75000.0
HR 52500.0
Marketing 60000.0
Name: Salary, dtype: float64
Multiple Aggregations:
# Calculate mean and sum of salaries per department
agg_salary = grouped['Salary'].agg(['mean', 'sum'])
print(agg_salary)
Output:
mean sum
Department
Engineering 85000 170000
Finance 75000 75000
HR 52500 105000
Marketing 60000 60000
📝 Pivot Tables
Create pivot tables to summarize data.
Creating a Pivot Table:
import pandas as pd
data = {
'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing', 'Finance', 'HR'],
'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
'Gender': ['F', 'M', 'M', 'M', 'F', 'M', 'F'],
'Salary': [50000, 80000, 55000, 90000, 60000, 75000, 58000]
}
df = pd.DataFrame(data)
# Pivot table: Average salary by Department and Gender
pivot = pd.pivot_table(df, values='Salary', index='Department', columns='Gender', aggfunc='mean')
print(pivot)
Output:
Gender F M
Department
Engineering NaN 85000.0
Finance NaN 75000.0
HR 54000.0 55000.0
Marketing 60000.0 NaN
📝 Datetime Operations
Handle and manipulate datetime data efficiently.
Converting to Datetime:
import pandas as pd
data = {
'Event': ['Start', 'Middle', 'End'],
'Date': ['2024-01-01', '2024-06-15', '2024-12-31']
}
df = pd.DataFrame(data)
# Convert 'Date' to datetime
df['Date'] = pd.to_datetime(df['Date'])
print(df)
Output:
Event Date
0 Start 2024-01-01
1 Middle 2024-06-15
2 End 2024-12-31
Extracting Date Components:
# Extract year, month, and day
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
print(df)
Output:
Event Date Year Month Day
0 Start 2024-01-01 2024 1 1
1 Middle 2024-06-15 2024 6 15
2 End 2024-12-31 2024 12 31
Filtering by Date Range:
# Filter events after June 1, 2024
filtered = df[df['Date'] > '2024-06-01']
print(filtered)
Output:
Event Date Year Month Day
1 Middle 2024-06-15 2024 6 15
2 End 2024-12-31 2024 12 31
4. 💻 Hands-On Coding
🎉 Example Scripts
📝 Script 1: Basic DataFrame Operations
# basic_dataframe_operations.py
import pandas as pd
# Creating a DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
# Display the DataFrame
print("Original DataFrame:")
print(df)
# Adding a new column
df['Salary'] = [70000, 80000, 90000, 100000]
print("\nDataFrame after adding 'Salary' column:")
print(df)
# Dropping a column
df.drop('City', axis=1, inplace=True)
print("\nDataFrame after dropping 'City' column:")
print(df)
Output:
Original DataFrame:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
3 David 40 Houston
DataFrame after adding 'Salary' column:
Name Age City Salary
0 Alice 25 New York 70000
1 Bob 30 Los Angeles 80000
2 Charlie 35 Chicago 90000
3 David 40 Houston 100000
DataFrame after dropping 'City' column:
Name Age Salary
0 Alice 25 70000
1 Bob 30 80000
2 Charlie 35 90000
3 David 40 100000
📝 Script 2: Handling Missing Data
# handling_missing_data.py
import pandas as pd
import numpy as np
# Creating a DataFrame with missing values
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, np.nan, 35, 40],
'City': ['New York', 'Los Angeles', np.nan, 'Houston']
}
df = pd.DataFrame(data)
print("Original DataFrame with missing values:")
print(df)
# Dropping rows with any missing values
cleaned_df = df.dropna()
print("\nDataFrame after dropping missing values:")
print(cleaned_df)
# Filling missing values
mean_age = df['Age'].mean()
df['Age'].fillna(mean_age, inplace=True)
df['City'].fillna('Unknown', inplace=True)
print("\nDataFrame after filling missing values:")
print(df)
Output:
Original DataFrame with missing values:
Name Age City
0 Alice 25.0 New York
1 Bob NaN Los Angeles
2 Charlie 35.0 NaN
3 David 40.0 Houston
DataFrame after dropping missing values:
Name Age City
0 Alice 25.0 New York
3 David 40.0 Houston
DataFrame after filling missing values:
Name Age City
0 Alice 25.000000 New York
1 Bob 33.333333 Los Angeles
2 Charlie 35.000000 Unknown
3 David 40.000000 Houston
📝 Script 3: Merging DataFrames
# merging_dataframes.py
import pandas as pd
# First DataFrame
data1 = {
'EmployeeID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David']
}
df1 = pd.DataFrame(data1)
# Second DataFrame
data2 = {
'EmployeeID': [3, 4, 5, 6],
'Department': ['HR', 'Engineering', 'Marketing', 'Finance']
}
df2 = pd.DataFrame(data2)
# Merge DataFrames on 'EmployeeID'
merged_df = pd.merge(df1, df2, on='EmployeeID', how='inner')
print("Merged DataFrame (Inner Join):")
print(merged_df)
# Outer Join
outer_merged = pd.merge(df1, df2, on='EmployeeID', how='outer', indicator=True)
print("\nMerged DataFrame (Outer Join with Indicator):")
print(outer_merged)
Output:
Merged DataFrame (Inner Join):
EmployeeID Name Department
0 3 Charlie HR
1 4 David Engineering
Merged DataFrame (Outer Join with Indicator):
EmployeeID Name Department _merge
0 1 Alice NaN left_only
1 2 Bob NaN left_only
2 3 Charlie HR both
3 4 David Engineering both
4 5 NaN Marketing right_only
5 6 NaN Finance right_only
📝 Script 4: Grouping and Aggregating Data
# grouping_aggregating.py
import pandas as pd
# Creating a DataFrame
data = {
'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing', 'Finance'],
'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
'Salary': [50000, 80000, 55000, 90000, 60000, 75000]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Group by 'Department'
grouped = df.groupby('Department')
# Calculate mean salary per department
mean_salary = grouped['Salary'].mean()
print("\nMean Salary per Department:")
print(mean_salary)
# Multiple aggregations
agg_salary = grouped['Salary'].agg(['mean', 'sum', 'count'])
print("\nAggregated Salary Statistics:")
print(agg_salary)
Output:
Original DataFrame:
Department Employee Salary
0 HR Alice 50000
1 Engineering Bob 80000
2 HR Charlie 55000
3 Engineering David 90000
4 Marketing Eve 60000
5 Finance Frank 75000
Mean Salary per Department:
Department
Engineering 85000.0
Finance 75000.0
HR 52500.0
Marketing 60000.0
Name: Salary, dtype: float64
Aggregated Salary Statistics:
mean sum count
Department
Engineering 85000 170000 2
Finance 75000 75000 1
HR 52500 105000 2
Marketing 60000 60000 1
📝 Script 5: Creating Pivot Tables
# pivot_tables.py
import pandas as pd
# Creating a DataFrame
data = {
'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing', 'Finance', 'HR'],
'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
'Gender': ['F', 'M', 'M', 'M', 'F', 'M', 'F'],
'Salary': [50000, 80000, 55000, 90000, 60000, 75000, 58000]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Creating a pivot table: Average salary by Department and Gender
pivot = pd.pivot_table(df, values='Salary', index='Department', columns='Gender', aggfunc='mean')
print("\nPivot Table - Average Salary by Department and Gender:")
print(pivot)
Output:
Original DataFrame:
Department Employee Gender Salary
0 HR Alice F 50000
1 Engineering Bob M 80000
2 HR Charlie M 55000
3 Engineering David M 90000
4 Marketing Eve F 60000
5 Finance Frank M 75000
6 HR Grace F 58000
Pivot Table - Average Salary by Department and Gender:
Gender F M
Department
Engineering NaN 85000.0
Finance NaN 75000.0
HR 54000.0 55000.0
Marketing 60000.0 NaN
📝 Script 6: Datetime Operations
# datetime_operations.py
import pandas as pd
# Creating a DataFrame with date information
data = {
'Event': ['Start', 'Middle', 'End'],
'Date': ['2024-01-01', '2024-06-15', '2024-12-31']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Converting 'Date' to datetime
df['Date'] = pd.to_datetime(df['Date'])
print("\nDataFrame after converting 'Date' to datetime:")
print(df)
# Extracting year, month, and day
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
print("\nDataFrame with extracted date components:")
print(df)
# Filtering events after June 1, 2024
filtered = df[df['Date'] > '2024-06-01']
print("\nFiltered DataFrame (Events after June 1, 2024):")
print(filtered)
Output:
Original DataFrame:
Event Date
0 Start 2024-01-01
1 Middle 2024-06-15
2 End 2024-12-31
DataFrame after converting 'Date' to datetime:
Event Date
0 Start 2024-01-01
1 Middle 2024-06-15
2 End 2024-12-31
DataFrame with extracted date components:
Event Date Year Month Day
0 Start 2024-01-01 2024 1 1
1 Middle 2024-06-15 2024 6 15
2 End 2024-12-31 2024 12 31
Filtered DataFrame (Events after June 1, 2024):
Event Date Year Month Day
1 Middle 2024-06-15 2024 6 15
2 End 2024-12-31 2024 12 31
5. 🧩 Interactive Exercises
📝 Exercise 1: Creating and Manipulating DataFrames
Task: Create a DataFrame from a dictionary containing student names, ages, and grades. Then perform the following operations:
- Add a new column for passing status based on grades (pass if grade >= 60).
- Filter the students who are passing.
- Calculate the average age of passing students.
import pandas as pd
# Create DataFrame
data = {
'Name': ['John', 'Emma', 'Liam', 'Olivia', 'Noah'],
'Age': [20, 21, 19, 22, 20],
'Grade': [85, 58, 90, 76, 45]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Add 'Passed' column
df['Passed'] = df['Grade'] >= 60
print("\nDataFrame after adding 'Passed' column:")
print(df)
# Filter passing students
passing_students = df[df['Passed']]
print("\nPassing Students:")
print(passing_students)
# Calculate average age of passing students
average_age = passing_students['Age'].mean()
print(f"\nAverage Age of Passing Students: {average_age}")
Output:
Original DataFrame:
Name Age Grade
0 John 20 85
1 Emma 21 58
2 Liam 19 90
3 Olivia 22 76
4 Noah 20 45
DataFrame after adding 'Passed' column:
Name Age Grade Passed
0 John 20 85 True
1 Emma 21 58 False
2 Liam 19 90 True
3 Olivia 22 76 True
4 Noah 20 45 False
Passing Students:
Name Age Grade Passed
0 John 20 85 True
2 Liam 19 90 True
3 Olivia 22 76 True
Average Age of Passing Students: 20.333333333333332
📝 Exercise 2: Handling Missing Data
Task: Given a DataFrame with missing values in the 'Age' and 'City' columns, perform the following:
- Identify missing values.
- Fill missing 'Age' with the median age.
- Fill missing 'City' with the mode of the 'City' column.
- Drop any remaining rows with missing values.
import pandas as pd
import numpy as np
# Create DataFrame with missing values
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, np.nan, 35, 40, np.nan],
'City': ['New York', 'Los Angeles', np.nan, 'Houston', 'Chicago']
}
df = pd.DataFrame(data)
print("Original DataFrame with missing values:")
print(df)
# Identify missing values
print("\nMissing Values:")
print(df.isnull())
# Fill missing 'Age' with median age
median_age = df['Age'].median()
df['Age'].fillna(median_age, inplace=True)
print(f"\nMedian Age: {median_age}")
print("DataFrame after filling missing 'Age':")
print(df)
# Fill missing 'City' with mode of 'City'
mode_city = df['City'].mode()[0]
df['City'].fillna(mode_city, inplace=True)
print(f"\nMode City: {mode_city}")
print("DataFrame after filling missing 'City':")
print(df)
# Drop any remaining rows with missing values
cleaned_df = df.dropna()
print("\nDataFrame after dropping remaining missing values:")
print(cleaned_df)
Output:
Original DataFrame with missing values:
Name Age City
0 Alice 25.0 New York
1 Bob NaN Los Angeles
2 Charlie 35.0 NaN
3 David 40.0 Houston
4 Eve NaN Chicago
Missing Values:
Name Age City
0 False False False
1 False True False
2 False False True
3 False False False
4 False True False
Median Age: 30.0
DataFrame after filling missing 'Age':
Name Age City
0 Alice 25.0 New York
1 Bob 30.0 Los Angeles
2 Charlie 35.0 NaN
3 David 40.0 Houston
4 Eve 30.0 Chicago
Mode City: Chicago
DataFrame after filling missing 'City':
Name Age City
0 Alice 25.0 New York
1 Bob 30.0 Los Angeles
2 Charlie 35.0 Chicago
3 David 40.0 Houston
4 Eve 30.0 Chicago
DataFrame after dropping remaining missing values:
Name Age City
0 Alice 25.0 New York
1 Bob 30.0 Los Angeles
2 Charlie 35.0 Chicago
3 David 40.0 Houston
4 Eve 30.0 Chicago
📝 Exercise 3: Merging DataFrames
Task: Given two DataFrames containing product information and sales data, perform the following:
- Merge the DataFrames on 'ProductID'.
- Perform an inner join and display the result.
- Perform a left join and display the result.
import pandas as pd
# Product Information DataFrame
products = {
'ProductID': [101, 102, 103, 104],
'ProductName': ['Laptop', 'Smartphone', 'Tablet', 'Monitor'],
'Price': [1200, 800, 300, 200]
}
df_products = pd.DataFrame(products)
print("Products DataFrame:")
print(df_products)
# Sales Data DataFrame
sales = {
'SaleID': [1, 2, 3, 4, 5],
'ProductID': [101, 102, 105, 103, 101],
'Quantity': [5, 10, 2, 7, 3]
}
df_sales = pd.DataFrame(sales)
print("\nSales DataFrame:")
print(df_sales)
# Inner Join
inner_join = pd.merge(df_sales, df_products, on='ProductID', how='inner')
print("\nInner Join Result:")
print(inner_join)
# Left Join
left_join = pd.merge(df_sales, df_products, on='ProductID', how='left')
print("\nLeft Join Result:")
print(left_join)
Output:
Products DataFrame:
ProductID ProductName Price
0 101 Laptop 1200
1 102 Smartphone 800
2 103 Tablet 300
3 104 Monitor 200
Sales DataFrame:
SaleID ProductID Quantity
0 1 101 5
1 2 102 10
2 3 105 2
3 4 103 7
4 5 101 3
Inner Join Result:
SaleID ProductID Quantity ProductName Price
0 1 101 5 Laptop 1200
1 2 102 10 Smartphone 800
2 4 103 7 Tablet 300
3 5 101 3 Laptop 1200
Left Join Result:
SaleID ProductID Quantity ProductName Price
0 1 101 5 Laptop 1200.0
1 2 102 10 Smartphone 800.0
2 3 105 2 NaN NaN
3 4 103 7 Tablet 300.0
4 5 101 3 Laptop 1200.0
📝 Exercise 4: Grouping and Aggregating Data
Task: Given a DataFrame containing sales data, perform the following:
- Group the data by 'Region' and calculate the total sales.
- Group the data by 'Product' and calculate the average sales.
- Find the region with the highest total sales.
import pandas as pd
# Sales DataFrame
data = {
'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
'Product': ['Laptop', 'Smartphone', 'Laptop', 'Tablet', 'Tablet', 'Laptop', 'Smartphone', 'Laptop'],
'Sales': [5000, 7000, 6000, 4000, 3000, 8000, 6500, 7200]
}
df = pd.DataFrame(data)
print("Sales DataFrame:")
print(df)
# Group by 'Region' and calculate total sales
total_sales_region = df.groupby('Region')['Sales'].sum()
print("\nTotal Sales by Region:")
print(total_sales_region)
# Group by 'Product' and calculate average sales
average_sales_product = df.groupby('Product')['Sales'].mean()
print("\nAverage Sales by Product:")
print(average_sales_product)
# Find the region with the highest total sales
top_region = total_sales_region.idxmax()
print(f"\nRegion with Highest Total Sales: {top_region}")
Output:
Sales DataFrame:
Region Product Sales
0 North Laptop 5000
1 South Smartphone 7000
2 East Laptop 6000
3 West Tablet 4000
4 North Tablet 3000
5 South Laptop 8000
6 East Smartphone 6500
7 West Laptop 7200
Total Sales by Region:
Region
East 12500
North 8000
South 15000
West 11200
Name: Sales, dtype: int64
Average Sales by Product:
Product
Laptop 6550.0
Smartphone 6750.0
Tablet 3500.0
Name: Sales, dtype: float64
Region with Highest Total Sales: South
📝 Exercise 5: Creating Pivot Tables
Task: Given a DataFrame with sales data, create a pivot table to summarize the total sales for each product in each region.
import pandas as pd
# Sales DataFrame
data = {
'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
'Product': ['Laptop', 'Smartphone', 'Laptop', 'Tablet', 'Tablet', 'Laptop', 'Smartphone', 'Laptop'],
'Sales': [5000, 7000, 6000, 4000, 3000, 8000, 6500, 7200]
}
df = pd.DataFrame(data)
print("Sales DataFrame:")
print(df)
# Create pivot table
pivot = pd.pivot_table(df, values='Sales', index='Region', columns='Product', aggfunc='sum', fill_value=0)
print("\nPivot Table - Total Sales by Region and Product:")
print(pivot)
Output:
Sales DataFrame:
Region Product Sales
0 North Laptop 5000
1 South Smartphone 7000
2 East Laptop 6000
3 West Tablet 4000
4 North Tablet 3000
5 South Laptop 8000
6 East Smartphone 6500
7 West Laptop 7200
Pivot Table - Total Sales by Region and Product:
Product Laptop Smartphone Tablet
Region
East 6000 6500 0
North 5000 0 3000
South 8000 7000 0
West 7200 0 4000
📝 Exercise 6: Datetime Operations
Task: Given a DataFrame with event dates, perform the following:
- Convert the 'Date' column to datetime.
- Extract the month from the 'Date' column.
- Filter events that occurred in the second half of the year.
import pandas as pd
# Event DataFrame
data = {
'Event': ['Start', 'Middle', 'End', 'Review'],
'Date': ['2024-01-15', '2024-07-20', '2024-12-05', '2024-06-30']
}
df = pd.DataFrame(data)
print("Original Event DataFrame:")
print(df)
# Convert 'Date' to datetime
df['Date'] = pd.to_datetime(df['Date'])
print("\nDataFrame after converting 'Date' to datetime:")
print(df)
# Extract month
df['Month'] = df['Date'].dt.month
print("\nDataFrame with extracted 'Month':")
print(df)
# Filter events in the second half of the year (July to December)
second_half = df[df['Month'] >= 7]
print("\nEvents in the Second Half of the Year:")
print(second_half)
Output:
Original Event DataFrame:
Event Date
0 Start 2024-01-15
1 Middle 2024-07-20
2 End 2024-12-05
3 Review 2024-06-30
DataFrame after converting 'Date' to datetime:
Event Date
0 Start 2024-01-15
1 Middle 2024-07-20
2 End 2024-12-05
3 Review 2024-06-30
DataFrame with extracted 'Month':
Event Date Month
0 Start 2024-01-15 1
1 Middle 2024-07-20 7
2 End 2024-12-05 12
3 Review 2024-06-30 6
Events in the Second Half of the Year:
Event Date Month
1 Middle 2024-07-20 7
2 End 2024-12-05 12
📝 Exercise 7: Filtering and Sorting Data
Task: Given a DataFrame with employee data, perform the following:
- Filter employees with a salary greater than $75,000.
- Sort the filtered data by salary in descending order.
- Reset the index of the sorted DataFrame.
import pandas as pd
# Employee DataFrame
data = {
'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing'],
'Salary': [70000, 80000, 90000, 85000, 60000]
}
df = pd.DataFrame(data)
print("Original Employee DataFrame:")
print(df)
# Filter employees with salary > 75000
high_earners = df[df['Salary'] > 75000]
print("\nEmployees with Salary > $75,000:")
print(high_earners)
# Sort by salary in descending order
sorted_high_earners = high_earners.sort_values(by='Salary', ascending=False)
print("\nSorted High Earners (Descending Salary):")
print(sorted_high_earners)
# Reset index
sorted_high_earners.reset_index(drop=True, inplace=True)
print("\nSorted High Earners with Reset Index:")
print(sorted_high_earners)
Output:
Original Employee DataFrame:
Employee Department Salary
0 Alice HR 70000
1 Bob Engineering 80000
2 Charlie HR 90000
3 David Engineering 85000
4 Eve Marketing 60000
Employees with Salary > $75,000:
Employee Department Salary
1 Bob Engineering 80000
2 Charlie HR 90000
3 David Engineering 85000
Sorted High Earners (Descending Salary):
Employee Department Salary
2 Charlie HR 90000
3 David Engineering 85000
1 Bob Engineering 80000
Sorted High Earners with Reset Index:
Employee Department Salary
0 Charlie HR 90000
1 David Engineering 85000
2 Bob Engineering 80000
📝 Exercise 8: Creating Visualizations with Pandas
Task: Given a DataFrame with sales data, create a bar chart showing total sales per department.
import pandas as pd
import matplotlib.pyplot as plt
# Sales DataFrame
data = {
'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing', 'Finance'],
'Sales': [50000, 80000, 55000, 90000, 60000, 75000]
}
df = pd.DataFrame(data)
print("Sales DataFrame:")
print(df)
# Group by 'Department' and sum sales
sales_summary = df.groupby('Department')['Sales'].sum()
print("\nTotal Sales by Department:")
print(sales_summary)
# Plotting the bar chart
sales_summary.plot(kind='bar', color='skyblue')
plt.title('Total Sales per Department')
plt.xlabel('Department')
plt.ylabel('Sales')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
Output:
Sales DataFrame:
Department Sales
0 HR 50000
1 Engineering 80000
2 HR 55000
3 Engineering 90000
4 Marketing 60000
5 Finance 75000
Total Sales by Department:
Department
Engineering 170000
Finance 75000
HR 105000
Marketing 60000
Name: Sales, dtype: int64
A bar chart displaying total sales per department will be shown.
6. 📚 Resources
Enhance your learning with these excellent resources:
- Official Pandas Documentation
- Pandas Tutorial by W3Schools
- Real Python Pandas Tutorials
- Pandas Cookbook by Julia Evans
- Codecademy Pandas Course
- Kaggle Pandas Tutorials
- Python for Data Analysis by Wes McKinney
- DataCamp Pandas Courses
- LeetCode Pandas Problems
7. 💡 Tips and Tricks
💡 Pro Tip
Leverage Pandas Vectorization: Similar to NumPy, Pandas operations are optimized for performance. Avoid using Python loops for data manipulation tasks. Utilize Pandas' built-in functions and vectorized operations to perform operations on entire DataFrames or Series at once.
import pandas as pd
# Inefficient loop
df = pd.DataFrame({'A': range(1000)})
df['B'] = 0
for i in range(len(df)):
    df.at[i, 'B'] = df.at[i, 'A'] * 2
# Efficient vectorized operation
df['B'] = df['A'] * 2
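As a rough illustration of the difference, you can time both approaches with time.perf_counter (the exact numbers will vary by machine; this is only a sketch):
import time
import pandas as pd
df = pd.DataFrame({'A': range(100_000)})
# Time the row-by-row loop
start = time.perf_counter()
df['B'] = 0
for i in range(len(df)):
    df.at[i, 'B'] = df.at[i, 'A'] * 2
loop_seconds = time.perf_counter() - start
# Time the vectorized equivalent
start = time.perf_counter()
df['B'] = df['A'] * 2
vectorized_seconds = time.perf_counter() - start
print(f"Loop: {loop_seconds:.3f}s, vectorized: {vectorized_seconds:.4f}s")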
🛠️ Recommended Tools
- Jupyter Notebook: Ideal for interactive data analysis and visualization.
- Visual Studio Code: A versatile code editor with excellent Pandas support.
- PyCharm: An IDE with powerful features for Python development.
- Spyder: An IDE tailored for scientific computing and data analysis.
- Google Colab: An online Jupyter notebook environment that doesn't require setup.
🚀 Speed Up Your Coding
Use Chaining Carefully: While method chaining can make your code more concise, excessive chaining can make debugging difficult. Break down complex operations into simpler steps when necessary.
import pandas as pd
df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': [5, 6, 7, 8]
})
# Chained operations
result = df[df['A'] > 2].sort_values(by='B').reset_index(drop=True)
print(result)
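When a chained expression misbehaves, the same pipeline can be unrolled into named intermediate steps so each result can be inspected (this reuses the df defined above):
# Equivalent to the chained version, one step at a time
filtered = df[df['A'] > 2]                  # keep rows where A > 2
sorted_df = filtered.sort_values(by='B')    # order by column B
result = sorted_df.reset_index(drop=True)   # rebuild a clean 0..n index
print(result)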
Understand Indexing and Selection: Mastering .loc[], .iloc[], and boolean indexing can help you access and manipulate data more efficiently.
import pandas as pd
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]
})
# Using .loc[] for label-based indexing
subset = df.loc[0:1, ['A', 'C']]
print(subset)
# Using .iloc[] for position-based indexing
subset = df.iloc[0:2, [0, 2]]
print(subset)
# Boolean indexing
filtered = df[df['A'] > 1]
print(filtered)
Use Built-in Functions: Familiarize yourself with Pandas' extensive library of functions to perform complex operations effortlessly.
import pandas as pd
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
})
# Sum of columns
column_sum = df.sum()
print(column_sum) # Output: A 6, B 15
# Mean of columns
column_mean = df.mean()
print(column_mean) # Output: A 2.0, B 5.0
🔍 Debugging Tips
- Use Debuggers: Tools like the built-in debugger in VS Code or PyCharm allow you to step through your code, inspect variables, and understand the flow of execution.
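For example, dropping into Python's built-in debugger lets you inspect a DataFrame at the exact point where a transformation goes wrong (a minimal sketch; the clean() function and sample data are just placeholders):
import pandas as pd
def clean(df):
    df = df.dropna()
    breakpoint()  # execution pauses here; inspect df, df.shape, df.dtypes at the pdb prompt
    return df.reset_index(drop=True)
clean(pd.DataFrame({'A': [1, None, 3]}))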
Use Pandas Profiling: Tools like Pandas Profiling (now maintained as ydata-profiling) can generate detailed reports of your DataFrame, helping you identify issues quickly.
import pandas as pd
import pandas_profiling  # newer releases are published as ydata-profiling
df = pd.read_csv('data.csv')  # any DataFrame you want to profile
profile = df.profile_report(title='Pandas Profiling Report')
profile.to_file("your_report.html")
Leverage Print Statements and .head(): Use print statements and the .head() method to inspect intermediate results and understand how your DataFrames are being transformed.
print(df.head())
Use Assertions: Incorporate assertions to ensure your DataFrames have the expected dimensions and data types.
assert df.shape[0] > 0, "DataFrame is empty"
assert 'A' in df.columns, "'A' column is missing"
Check DataFrame Shapes: Always verify the shape of your DataFrames when performing operations to ensure they align as expected.
print(df.shape)
8. 💡 Best Practices
💡 Write Readable Code
Comment Your Code: Explain complex operations or the purpose of certain blocks of code.
import pandas as pd
# Calculate the mean of the dataset
mean_value = df['Salary'].mean()
Use Descriptive Variable Names: Choose names that clearly describe the purpose of the variable.
import pandas as pd
temperatures_celsius = pd.Series([22.5, 23.0, 21.5, 24.0])
temperatures_fahrenheit = temperatures_celsius * 9/5 + 32
💡 Avoid Common Pitfalls
Data Type Mismatch: Ensure that operations are performed on compatible data types to avoid unexpected results.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df['C'] = df['A'].astype(float) / df['B']
Immutable Operations: Remember that some operations return new DataFrames instead of modifying in place.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3]})
df = df + 10 # Correct: reassign to modify
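For instance, sort_values returns a new DataFrame by default, so the result must be reassigned for the change to stick (a small sketch):
import pandas as pd
df = pd.DataFrame({'A': [3, 1, 2]})
df.sort_values(by='A')       # returns a new sorted DataFrame; the result is discarded here
df = df.sort_values(by='A')  # correct: reassign to keep the sorted result
print(df)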
💡 Optimize Performance
Profile Your Code: Use profiling tools like cProfile to identify bottlenecks.
import pandas as pd
import cProfile
def compute():
    df = pd.read_csv('large_dataset.csv')
    return df.describe()
cProfile.run('compute()')
Leverage In-Place Operations: Passing inplace=True modifies a DataFrame without returning a new object, which can avoid keeping an extra reference around (note that many operations still copy data internally, so the memory savings are not guaranteed).
import pandas as pd
df.drop('B', axis=1, inplace=True)
Minimize Memory Footprint: Use appropriate data types and avoid unnecessary DataFrame copies.
import pandas as pd
df = pd.DataFrame({
'A': pd.Series([1, 2, 3], dtype='int8'),
'B': pd.Series([4.0, 5.0, 6.0], dtype='float32')
})
💡 Stay Updated
Pandas is continuously evolving. Keep up with the latest updates and best practices by following the official Pandas Release Notes.
9. 💡 Advanced Topics
💡 Working with Time Series Data
Pandas provides extensive functionality for handling time series data, including date range generation, frequency conversion, and moving window statistics.
Example:
import pandas as pd
import numpy as np
# Create a date range
dates = pd.date_range(start='2024-01-01', periods=100, freq='D')
df = pd.DataFrame({'Date': dates, 'Value': np.random.randn(100)})
# Set 'Date' as the index
df.set_index('Date', inplace=True)
# Resample to weekly frequency and calculate mean
weekly_mean = df.resample('W').mean()
print(weekly_mean)
Output:
Value
Date
2024-01-07 0.123456
2024-01-14 -0.654321
...
2024-04-14 0.789012
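Moving-window statistics follow the same pattern. For example, a 7-day rolling mean over the same daily data (a short sketch reusing the df defined above):
# 7-day rolling mean of the daily values; the first 6 rows are NaN until the window fills
df['Rolling_Mean_7'] = df['Value'].rolling(window=7).mean()
print(df.head(10))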
💡 MultiIndex and Hierarchical Indexing
Pandas allows for multiple levels of indexing, enabling more complex data representations and manipulations.
Example:
import pandas as pd
# Creating a MultiIndex DataFrame
arrays = [
['Bar', 'Bar', 'Baz', 'Baz', 'Foo', 'Foo', 'Qux', 'Qux'],
['One', 'Two', 'One', 'Two', 'One', 'Two', 'One', 'Two']
]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['First', 'Second'])
df = pd.DataFrame({'A': range(8), 'B': range(8, 16)}, index=index)
print("MultiIndex DataFrame:")
print(df)
# Accessing data using MultiIndex
print("\nAccessing data for ('Bar', 'One'):")
print(df.loc[('Bar', 'One')])
Output:
MultiIndex DataFrame:
A B
First Second
Bar One 0 8
Two 1 9
Baz One 2 10
Two 3 11
Foo One 4 12
Two 5 13
Qux One 6 14
Two 7 15
Accessing data for ('Bar', 'One'):
A 0
B 8
Name: (Bar, One), dtype: int64
💡 Advanced Merging Techniques
Pandas offers advanced merging capabilities, including merging on multiple keys and using different join types.
Example:
import pandas as pd
# First DataFrame
data1 = {
'EmployeeID': [1, 2, 3, 4],
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'DeptID': [10, 20, 10, 30]
}
df1 = pd.DataFrame(data1)
# Second DataFrame
data2 = {
'DeptID': [10, 20, 30, 40],
'Department': ['HR', 'Engineering', 'Marketing', 'Finance']
}
df2 = pd.DataFrame(data2)
# Merge on 'DeptID'
merged_df = pd.merge(df1, df2, on='DeptID', how='left')
print("Merged DataFrame:")
print(merged_df)
Output:
Merged DataFrame:
EmployeeID Name DeptID Department
0 1 Alice 10 HR
1 2 Bob 20 Engineering
2 3 Charlie 10 HR
3 4 David 30 Marketing
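Merging on multiple keys works the same way: pass a list of column names to on. A small sketch with made-up budget data (the column names here are only for illustration):
import pandas as pd
# Budgets and actual spend, keyed by department and year
budgets = pd.DataFrame({
    'DeptID': [10, 10, 20, 20],
    'Year': [2023, 2024, 2023, 2024],
    'Budget': [100, 120, 200, 210]
})
actuals = pd.DataFrame({
    'DeptID': [10, 10, 20],
    'Year': [2023, 2024, 2024],
    'Spent': [90, 125, 205]
})
# Rows match only when both DeptID and Year agree
combined = pd.merge(budgets, actuals, on=['DeptID', 'Year'], how='left')
print(combined)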
💡 Pivoting Data
Pivoting allows you to reshape data based on column values, making it easier to analyze.
Example:
import pandas as pd
# Sales DataFrame
data = {
'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
'Product': ['Laptop', 'Smartphone', 'Laptop', 'Tablet', 'Tablet', 'Laptop', 'Smartphone', 'Laptop'],
'Sales': [5000, 7000, 6000, 4000, 3000, 8000, 6500, 7200]
}
df = pd.DataFrame(data)
print("Original Sales DataFrame:")
print(df)
# Create pivot table
pivot = df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='sum', fill_value=0)
print("\nPivot Table - Total Sales by Region and Product:")
print(pivot)
Output:
Original Sales DataFrame:
Region Product Sales
0 North Laptop 5000
1 South Smartphone 7000
2 East Laptop 6000
3 West Tablet 4000
4 North Tablet 3000
5 South Laptop 8000
6 East Smartphone 6500
7 West Laptop 7200
Pivot Table - Total Sales by Region and Product:
Product Laptop Smartphone Tablet
Region
East 6000 6500 0
North 5000 0 3000
South 8000 7000 0
West 7200 0 4000
💡 Advanced Datetime Operations
Perform complex datetime manipulations, such as shifting dates, calculating differences, and resampling.
Example:
import pandas as pd
# Creating a DataFrame with datetime
data = {
'Date': pd.date_range(start='2024-01-01', periods=5, freq='D'),
'Value': [100, 200, 150, 300, 250]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Shifting dates by 2 days
df['Date_Shifted'] = df['Date'] + pd.Timedelta(days=2)
print("\nDataFrame after shifting dates by 2 days:")
print(df)
# Calculating difference between shifted and original dates
df['Date_Diff'] = df['Date_Shifted'] - df['Date']
print("\nDataFrame with date differences:")
print(df)
# Resampling data monthly and calculating sum
df.set_index('Date', inplace=True)
monthly_sum = df.resample('M').sum()
print("\nMonthly Sum of Values:")
print(monthly_sum)
Output:
Original DataFrame:
Date Value
0 2024-01-01 100
1 2024-01-02 200
2 2024-01-03 150
3 2024-01-04 300
4 2024-01-05 250
DataFrame after shifting dates by 2 days:
Date Value Date_Shifted
0 2024-01-01 100 2024-01-03
1 2024-01-02 200 2024-01-04
2 2024-01-03 150 2024-01-05
3 2024-01-04 300 2024-01-06
4 2024-01-05 250 2024-01-07
DataFrame with date differences:
Date Value Date_Shifted Date_Diff
0 2024-01-01 100 2024-01-03 2 days
1 2024-01-02 200 2024-01-04 2 days
2 2024-01-03 150 2024-01-05 2 days
3 2024-01-04 300 2024-01-06 2 days
4 2024-01-05 250 2024-01-07 2 days
Monthly Sum of Values:
Value
Date
2024-01-31 1000
10. 💡 Real-World Applications
💡 Data Analysis
Pandas is a foundational tool for data analysis, enabling efficient data manipulation and computation.
import pandas as pd
# Load data from a CSV file
data = pd.read_csv('data.csv')
# Display the first few rows
print(data.head())
# Compute statistics
mean = data.mean()
median = data.median()
std_dev = data.std()
print("\nMean of each column:")
print(mean)
print("\nMedian of each column:")
print(median)
print("\nStandard Deviation of each column:")
print(std_dev)
Output:
Column1 Column2 Column3
0 1 4 7
1 2 5 8
2 3 6 9
3 4 7 10
4 5 8 11
Mean of each column:
Column1 3.0
Column2 6.0
Column3 9.0
dtype: float64
Median of each column:
Column1 3.0
Column2 6.0
Column3 9.0
dtype: float64
Standard Deviation of each column:
Column1 1.581139
Column2 1.581139
Column3 1.581139
dtype: float64
(Note: Replace 'data.csv' with your actual data file path.)
💡 Image Processing
Pandas can be used in conjunction with libraries like OpenCV or PIL for image processing tasks, such as analyzing pixel data or metadata.
import pandas as pd
from PIL import Image
import numpy as np
# Load an image
image = Image.open('image.jpg').convert('RGB')
arr = np.array(image)
# Create a DataFrame with pixel data
pixels = pd.DataFrame(arr.reshape(-1, 3), columns=['Red', 'Green', 'Blue'])
print("Pixel DataFrame:")
print(pixels.head())
# Analyze color distribution
color_counts = pixels.groupby(['Red', 'Green', 'Blue']).size().reset_index(name='Count')
print("\nColor Distribution:")
print(color_counts.head())
Output:
Pixel DataFrame:
Red Green Blue
0 34 67 89
1 34 67 89
2 34 67 89
3 35 68 90
4 35 68 90
Color Distribution:
Red Green Blue Count
0 34 67 89 3
1 35 68 90 2
...
(Note: Replace 'image.jpg' with your actual image file path.)
💡 Financial Modeling
Perform complex financial calculations and simulations using Pandas' data manipulation capabilities.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create a DataFrame with stock prices
dates = pd.date_range(start='2024-01-01', periods=100, freq='D')
prices = np.random.lognormal(mean=0.001, sigma=0.02, size=100).cumprod() * 100
df = pd.DataFrame({'Date': dates, 'Price': prices})
df.set_index('Date', inplace=True)
print("Stock Prices DataFrame:")
print(df.head())
# Calculate daily returns
df['Daily_Return'] = df['Price'].pct_change()
print("\nDataFrame with Daily Returns:")
print(df.head())
# Calculate moving average
df['Moving_Avg_10'] = df['Price'].rolling(window=10).mean()
print("\nDataFrame with 10-Day Moving Average:")
print(df.head(15))
# Plotting the moving average
df[['Price', 'Moving_Avg_10']].plot(figsize=(10, 5))
plt.title('Stock Prices and 10-Day Moving Average')
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()
Output:
Stock Prices DataFrame:
Price
Date
2024-01-01 101.234567
2024-01-02 102.345678
2024-01-03 103.456789
2024-01-04 104.567890
2024-01-05 105.678901
DataFrame with Daily Returns:
Price Daily_Return
Date
2024-01-01 101.234567 NaN
2024-01-02 102.345678 0.010000
2024-01-03 103.456789 0.010000
2024-01-04 104.567890 0.010000
2024-01-05 105.678901 0.010000
DataFrame with 10-Day Moving Average:
Price Daily_Return Moving_Avg_10
Date
2024-01-01 101.234567 NaN NaN
2024-01-02 102.345678 0.010000 NaN
2024-01-03 103.456789 0.010000 NaN
2024-01-04 104.567890 0.010000 NaN
2024-01-05 105.678901 0.010000 NaN
2024-01-06 106.789012 0.010000 NaN
2024-01-07 107.890123 0.010000 NaN
2024-01-08 108.901234 0.010000 NaN
2024-01-09 109.012345 0.010000 NaN
2024-01-10 110.123456 0.010000 105.000000
2024-01-11 111.234567 0.010000 106.111111
2024-01-12 112.345678 0.010000 107.222222
2024-01-13 113.456789 0.010000 108.333333
2024-01-14 114.567890 0.010000 109.444444
2024-01-15 115.678901 0.010000 110.555556
The plot displaying stock prices and the 10-day moving average will be shown.
11. 💡 Machine Learning Integration
💡 Feature Engineering with Pandas
Pandas is essential for creating and transforming features needed for machine learning models.
import pandas as pd
import numpy as np
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, 35, 40, 28],
'Salary': [70000, 80000, 90000, 100000, 75000],
'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Creating new features
df['Age_Squared'] = df['Age'] ** 2
df['Salary_Log'] = np.log(df['Salary'])
df['Is_Engineering'] = df['Department'] == 'Engineering'
print("\nDataFrame with New Features:")
print(df)
Output:
Original DataFrame:
Name Age Salary Department
0 Alice 25 70000 HR
1 Bob 30 80000 Engineering
2 Charlie 35 90000 HR
3 David 40 100000 Engineering
4 Eve 28 75000 Marketing
DataFrame with New Features:
Name Age Salary Department Age_Squared Salary_Log Is_Engineering
0 Alice 25 70000 HR 625 11.156252 False
1 Bob 30 80000 Engineering 900 11.289781 True
2 Charlie 35 90000 HR 1225 11.407569 False
3 David 40 100000 Engineering 1600 11.512925 True
4 Eve 28 75000 Marketing 784 11.223251 False
💡 Data Preprocessing with Pandas
Pandas plays a crucial role in preprocessing steps such as handling missing values, encoding categorical variables, and scaling features.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Sample DataFrame with missing values and categorical data
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [25, 30, np.nan, 40, 28],
'Salary': [70000, 80000, 90000, np.nan, 75000],
'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Handling missing values
df['Age'].fillna(df['Age'].median(), inplace=True)
df['Salary'].fillna(df['Salary'].mean(), inplace=True)
print("\nDataFrame after handling missing values:")
print(df)
# Encoding categorical variables
le = LabelEncoder()
df['Department_Encoded'] = le.fit_transform(df['Department'])
print("\nDataFrame after encoding 'Department':")
print(df)
# Feature scaling
scaler = StandardScaler()
df[['Age_Scaled', 'Salary_Scaled']] = scaler.fit_transform(df[['Age', 'Salary']])
print("\nDataFrame after feature scaling:")
print(df)
Output:
Original DataFrame:
Name Age Salary Department
0 Alice 25.0 70000.0 HR
1 Bob 30.0 80000.0 Engineering
2 Charlie NaN 90000.0 HR
3 David 40.0 NaN Engineering
4 Eve 28.0 75000.0 Marketing
DataFrame after handling missing values:
Name Age Salary Department
0 Alice 25.0 70000.0 HR
1 Bob 30.0 80000.0 Engineering
2 Charlie 28.0 90000.0 HR
3 David 40.0 78750.0 Engineering
4 Eve 28.0 75000.0 Marketing
DataFrame after encoding 'Department':
Name Age Salary Department Department_Encoded
0 Alice 25.0 70000.0 HR 2
1 Bob 30.0 80000.0 Engineering 0
2 Charlie 28.0 90000.0 HR 2
3 David 40.0 78750.0 Engineering 0
4 Eve 28.0 75000.0 Marketing 1
DataFrame after feature scaling:
Name Age Salary Department Department_Encoded Age_Scaled Salary_Scaled
0 Alice 25.0 70000.0 HR 2 -1.341641 -0.801784
1 Bob 30.0 80000.0 Engineering 0 -0.447214 0.000000
2 Charlie 28.0 90000.0 HR 2 -0.894427 1.414214
3 David 40.0 78750.0 Engineering 0 1.341641 0.530330
4 Eve 28.0 75000.0 Marketing 1 -0.894427 -0.235702
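As an alternative to LabelEncoder, pd.get_dummies produces one-hot indicator columns directly, which is often preferable for nominal categories (a sketch reusing the df above):
# One-hot encode 'Department'; each category becomes its own indicator column
df_encoded = pd.get_dummies(df, columns=['Department'], prefix='Dept')
print(df_encoded)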
💡 Handling High-Dimensional Data
Efficiently manage and manipulate high-dimensional datasets using Pandas' powerful indexing and reshaping capabilities.
import pandas as pd
import numpy as np
# Creating a high-dimensional DataFrame
arrays = [
['A', 'A', 'B', 'B', 'C', 'C'],
['One', 'Two', 'One', 'Two', 'One', 'Two']
]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['Category', 'Subcategory'])
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50, 60]}, index=index)
print("Original MultiIndex DataFrame:")
print(df)
# Unstacking the DataFrame
unstacked = df.unstack(level='Subcategory')
print("\nUnstacked DataFrame:")
print(unstacked)
# Stacking the DataFrame back
stacked = unstacked.stack(level='Subcategory')
print("\nStacked DataFrame:")
print(stacked)
Output:
Original MultiIndex DataFrame:
Values
Category Subcategory
A One 10
Two 20
B One 30
Two 40
C One 50
Two 60
Unstacked DataFrame:
Values
Subcategory One Two
Category
A 10 20
B 30 40
C 50 60
Stacked DataFrame:
Values
Category Subcategory
A One 10
Two 20
B One 30
Two 40
C One 50
Two 60
12. 💡 Additional Tips
💡 Optimize Memory Usage
Downcasting Numeric Columns: Use pd.to_numeric() with the downcast parameter to convert larger numeric types to smaller ones where possible.
import pandas as pd
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [100000, 200000, 300000]
})
df['A'] = pd.to_numeric(df['A'], downcast='unsigned')
df['B'] = pd.to_numeric(df['B'], downcast='integer')
print(df.dtypes)
Output:
A uint8
B int32
dtype: object
Choose Appropriate Data Types: Selecting the right data types can significantly reduce memory consumption.
import pandas as pd
import numpy as np
# Using smaller integer types
df = pd.DataFrame({
'A': pd.Series([1, 2, 3], dtype='int8'),
'B': pd.Series([100, 200, 300], dtype='int16')
})
print(df.dtypes)
Output:
A int8
B int16
dtype: object
💡 Utilize Advanced Indexing
Boolean Masking: Use boolean arrays to filter DataFrame rows.
import pandas as pd
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [5, 4, 3, 2, 1]
})
mask = df['A'] > 2
print(df[mask])
Output:
A B
2 3 3
3 4 2
4 5 1
Fancy Indexing: Use lists or arrays of indices to access multiple elements at once.
import pandas as pd
df = pd.DataFrame({
'A': [10, 20, 30, 40, 50],
'B': [5, 4, 3, 2, 1]
})
indices = [0, 2, 4]
print(df.loc[indices])
Output:
A B
0 10 5
2 30 3
4 50 1
💡 Understand Memory Layout
Row-major vs. Column-major: The NumPy array backing a DataFrame may be laid out in row-major (C) or column-major (Fortran) order, depending on how the DataFrame was built. Understanding memory layout can help optimize performance for certain operations.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': np.random.rand(1000),
'B': np.random.rand(1000)
})
# Check whether the underlying array is C-contiguous (the result depends on how the DataFrame was constructed)
print(df.values.flags['C_CONTIGUOUS'])
# Convert to Fortran-order if needed
df_f = pd.DataFrame(np.asfortranarray(df.values), columns=df.columns)
print(df_f.values.flags['F_CONTIGUOUS']) # Output: True
💡 Utilize Advanced Functions
Polynomial Operations: Use Pandas in conjunction with NumPy for polynomial computations.
import pandas as pd
import numpy as np
# Creating a DataFrame
df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
df['x_squared'] = df['x'] ** 2
df['x_cubed'] = df['x'] ** 3
print("DataFrame with Polynomial Features:")
print(df)
Output:
x x_squared x_cubed
0 1 1 1
1 2 4 8
2 3 9 27
3 4 16 64
4 5 25 125
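NumPy's polynomial routines combine naturally with DataFrame columns. For example, fitting a quadratic to noisy data stored in a DataFrame (an illustrative sketch with generated data):
import pandas as pd
import numpy as np
# Noisy samples of y = 2x^2 + 3x + 1
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': np.linspace(0, 5, 20)})
df['y'] = 2 * df['x']**2 + 3 * df['x'] + 1 + rng.normal(0, 0.5, len(df))
# Fit a degree-2 polynomial and recover the coefficients
coeffs = np.polyfit(df['x'], df['y'], deg=2)
print("Fitted coefficients (highest degree first):", coeffs)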
Fourier Transforms: Perform Fourier transforms using Pandas and NumPy.
import pandas as pd
import numpy as np
# Creating a DataFrame
df = pd.DataFrame({'Signal': np.random.rand(8)})
print("Original Signal:")
print(df)
# Compute FFT
df['FFT'] = np.fft.fft(df['Signal'])
print("\nSignal after FFT:")
print(df)
Output:
Original Signal:
Signal
0 0.548814
1 0.715189
2 0.602763
3 0.544883
4 0.423655
5 0.645894
6 0.437587
7 0.891773
Signal after FFT:
Signal FFT
0 0.548814 (5.097800+0.000000j)
1 0.715189 (-0.105409-0.405331j)
2 0.602763 (-0.105409+0.405331j)
3 0.544883 (-0.300701+0.000000j)
4 0.423655 (-0.105409+0.405331j)
5 0.645894 (-0.105409-0.405331j)
6 0.437587 (-0.300701+0.000000j)
7 0.891773 (5.097800-0.000000j)
Linear Algebra: Use Pandas in conjunction with NumPy's linear algebra functions for complex computations.
import pandas as pd
import numpy as np
# Creating a DataFrame
df = pd.DataFrame({
'A': [1, 2],
'B': [3, 4]
})
# Convert to NumPy array
matrix = df.values
# Matrix inversion
inv_matrix = np.linalg.inv(matrix)
print("Inverse Matrix:")
print(inv_matrix)
Output:
Inverse Matrix:
[[-2.   1.5]
 [ 1.  -0.5]]
13. 💡 Advanced Tips
💡 Utilize Structured Arrays
In Pandas, a DataFrame fills the role of a structured array: each column is a named field that can hold a different data type.
import pandas as pd
# Define records with multiple named fields of different types
data = {
'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'Weight': [55.0, 85.5, 68.2]
}
df = pd.DataFrame(data)
print(df)
# Access individual columns
print("\nNames:")
print(df['Name'])
print("\nAges:")
print(df['Age'])
print("\nWeights:")
print(df['Weight'])
Output:
Name Age Weight
0 Alice 25 55.0
1 Bob 30 85.5
2 Charlie 35 68.2
Names:
0 Alice
1 Bob
2 Charlie
Name: Name, dtype: object
Ages:
0 25
1 30
2 35
Name: Age, dtype: int64
Weights:
0 55.0
1 85.5
2 68.2
Name: Weight, dtype: float64
💡 Use Memory-Mapped Files
For large CSV files, combine pd.read_csv() with memory_map=True and explicit dtype choices to reduce read overhead and memory usage.
import pandas as pd
# Read a large CSV file with memory optimization
df = pd.read_csv('large_dataset.csv', dtype={'Column1': 'int32', 'Column2': 'float32'}, memory_map=True)
print(df.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
# Column Dtype
--- ------ -----
0 Column1 int32
1 Column2 float32
dtypes: float32(1), int32(1)
memory usage: 7.6 MB
None
(Note: Replace 'large_dataset.csv' with your actual large data file path.)
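If the parsed data is still too large to hold in memory at once, reading the file in chunks and aggregating as you go keeps memory bounded (a sketch; the file and column names are placeholders carried over from the example above):
import pandas as pd
total = 0
row_count = 0
# Process the CSV 100,000 rows at a time instead of loading it whole
for chunk in pd.read_csv('large_dataset.csv', chunksize=100_000):
    total += chunk['Column2'].sum()
    row_count += len(chunk)
print("Mean of Column2:", total / row_count)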
💡 Explore Advanced Indexing Techniques
Slicing with Step: Use steps in slicing to access elements at regular intervals.
import pandas as pd
df = pd.DataFrame({
'A': range(10),
'B': range(10, 20)
})
sliced = df.iloc[::2]
print("Sliced DataFrame with Step 2:")
print(sliced)
Output:
Sliced DataFrame with Step 2:
A B
0 0 10
2 2 12
4 4 14
6 6 16
8 8 18
Indexing with Multiple Arrays: Use multiple arrays to index elements.
import pandas as pd
df = pd.DataFrame({
'A': [1, 2, 3],
'B': [4, 5, 6]
})
row_indices = [0, 1, 2]
col_indices = ['A', 'B', 'A']
# DataFrame.lookup was deprecated and removed in recent pandas; get_indexer plus NumPy indexing does the same job
selected = df.to_numpy()[row_indices, df.columns.get_indexer(col_indices)]
print("Selected Elements:", selected)
Output:
Selected Elements: [1 5 3]
💡 Utilize Masked Arrays
Masked arrays allow you to handle invalid or missing data effectively.
import pandas as pd
import numpy as np
# Creating a DataFrame with invalid data
data = {
'Temperature': [22.5, 23.0, -999, 24.0, 25.0],
'Humidity': [55, 60, 65, -999, 70]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Masking invalid data (-999)
masked_df = df.mask((df == -999))
print("\nMasked DataFrame:")
print(masked_df)
# Filling masked data with column mean
filled_df = masked_df.fillna(df.replace(-999, np.nan).mean())
print("\nDataFrame after filling masked data with column mean:")
print(filled_df)
Output:
Original DataFrame:
Temperature Humidity
0 22.5 55
1 23.0 60
2 -999.0 65
3 24.0 -999
4 25.0 70
Masked DataFrame:
Temperature Humidity
0 22.5 55.0
1 23.0 60.0
2 NaN 65.0
3 24.0 NaN
4 25.0 70.0
DataFrame after filling masked data with column mean:
Temperature Humidity
0 22.500 55.0
1 23.000 60.0
2 23.625 65.0
3 24.000 62.5
4 25.000 70.0
14. 💡 Machine Learning Integration
💡 Feature Engineering
Creating new features by performing mathematical operations on existing features can improve the performance of machine learning models.
import pandas as pd
import numpy as np
# Sample DataFrame
data = {
'Height_cm': [150, 160, 170, 180, 190],
'Weight_kg': [50, 60, 70, 80, 90]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Calculate BMI
df['BMI'] = df['Weight_kg'] / (df['Height_cm'] / 100) ** 2
print("\nDataFrame with BMI:")
print(df)
# Create interaction feature
df['Height_Weight'] = df['Height_cm'] * df['Weight_kg']
print("\nDataFrame with Interaction Feature:")
print(df)
Output:
Original DataFrame:
Height_cm Weight_kg
0 150 50
1 160 60
2 170 70
3 180 80
4 190 90
DataFrame with BMI:
Height_cm Weight_kg BMI
0 150 50 22.222222
1 160 60 23.437500
2 170 70 24.221453
3 180 80 24.691358
4 190 90 24.930748
DataFrame with Interaction Feature:
Height_cm Weight_kg BMI Height_Weight
0 150 50 22.222222 7500
1 160 60 23.437500 9600
2 170 70 24.221453 11900
3 180 80 24.691358 14400
4 190 90 24.930748 17100
💡 Handling High-Dimensional Data
Efficiently manage and manipulate high-dimensional datasets using Pandas' powerful indexing and reshaping capabilities.
import pandas as pd
import numpy as np
# Creating a MultiIndex DataFrame
arrays = [
['A', 'A', 'B', 'B', 'C', 'C'],
['One', 'Two', 'One', 'Two', 'One', 'Two']
]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['Category', 'Subcategory'])
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50, 60]}, index=index)
print("Original MultiIndex DataFrame:")
print(df)
# Unstacking the DataFrame
unstacked = df.unstack(level='Subcategory')
print("\nUnstacked DataFrame:")
print(unstacked)
# Stacking the DataFrame back
stacked = unstacked.stack(level='Subcategory')
print("\nStacked DataFrame:")
print(stacked)
Output:
Original MultiIndex DataFrame:
Values
Category Subcategory
A One 10
Two 20
B One 30
Two 40
C One 50
Two 60
Unstacked DataFrame:
Values
Subcategory One Two
Category
A 10 20
B 30 40
C 50 60
Stacked DataFrame:
Values
Category Subcategory
A One 10
Two 20
B One 30
Two 40
C One 50
Two 60
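Once a DataFrame carries a MultiIndex, cross-sections can be selected directly without reshaping first. A small sketch continuing the same structure as above:
import pandas as pd
arrays = [['A', 'A', 'B', 'B', 'C', 'C'],
          ['One', 'Two', 'One', 'Two', 'One', 'Two']]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=['Category', 'Subcategory'])
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50, 60]}, index=index)
# All rows belonging to category 'B'
print(df.loc['B'])
# All rows with subcategory 'One', across every category
print(df.xs('One', level='Subcategory'))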
💡 Advanced Datetime Operations
Perform complex datetime manipulations, such as shifting dates, calculating differences, and resampling.
import pandas as pd
# Creating a DataFrame with datetime
data = {
'Date': pd.date_range(start='2024-01-01', periods=5, freq='D'),
'Value': [100, 200, 150, 300, 250]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Shifting dates by 2 days
df['Date_Shifted'] = df['Date'] + pd.Timedelta(days=2)
print("\nDataFrame after shifting dates by 2 days:")
print(df)
# Calculating difference between shifted and original dates
df['Date_Diff'] = df['Date_Shifted'] - df['Date']
print("\nDataFrame with date differences:")
print(df)
# Resampling data monthly and calculating the sum of the numeric column
df.set_index('Date', inplace=True)
monthly_sum = df[['Value']].resample('M').sum()
print("\nMonthly Sum of Values:")
print(monthly_sum)
Output:
Original DataFrame:
Date Value
0 2024-01-01 100
1 2024-01-02 200
2 2024-01-03 150
3 2024-01-04 300
4 2024-01-05 250
DataFrame after shifting dates by 2 days:
Date Value Date_Shifted
0 2024-01-01 100 2024-01-03
1 2024-01-02 200 2024-01-04
2 2024-01-03 150 2024-01-05
3 2024-01-04 300 2024-01-06
4 2024-01-05 250 2024-01-07
DataFrame with date differences:
Date Value Date_Shifted Date_Diff
0 2024-01-01 100 2024-01-03 2 days
1 2024-01-02 200 2024-01-04 2 days
2 2024-01-03 150 2024-01-05 2 days
3 2024-01-04 300 2024-01-06 2 days
4 2024-01-05 250 2024-01-07 2 days
Monthly Sum of Values:
Value
Date
2024-01-31 1000
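The .dt accessor is also useful for pulling calendar components out of a datetime column, which often feeds time-based grouping or feature engineering. A brief sketch:
import pandas as pd
df = pd.DataFrame({
    'Date': pd.date_range(start='2024-01-01', periods=5, freq='D'),
    'Value': [100, 200, 150, 300, 250]
})
# Extract calendar components from the datetime column
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Weekday'] = df['Date'].dt.day_name()
print(df)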
15. 💡 Additional Tips
💡 Optimize Memory Usage
Avoid Unnecessary Copies: Be cautious with operations that create copies of DataFrames. Use in-place operations when possible.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3]})
df += 10 # In-place addition
print(df)
Output:
A
0 11
1 12
2 13
Data Types: Choose appropriate data types to save memory. For example, use int32 instead of the default int64 if the range of values permits.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [100, 200, 300]}, dtype='int32')
print(df.dtypes)
Output:
A int32
B int32
dtype: object
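To see what a dtype change actually buys you, compare memory_usage before and after; downcasting integers and converting repetitive string columns to the category dtype are two common wins. A small sketch (the column names and sizes here are made up for illustration):
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'id': np.arange(100_000, dtype='int64'),
    'city': np.random.choice(['New York', 'Chicago', 'Houston'], size=100_000)
})
print("Before:", df.memory_usage(deep=True).sum(), "bytes")
# Downcast the integer column and store the repetitive strings as categories
df['id'] = pd.to_numeric(df['id'], downcast='integer')
df['city'] = df['city'].astype('category')
print("After: ", df.memory_usage(deep=True).sum(), "bytes")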
💡 Utilize Advanced Indexing
Boolean Masking: Use boolean arrays to filter DataFrame rows.
import pandas as pd
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [5, 4, 3, 2, 1]
})
mask = df['A'] > 3
print("Filtered DataFrame:")
print(df[mask])
Output:
A B
3 4 2
4 5 1
Fancy Indexing: Use lists or arrays of indices to access multiple DataFrame elements at once.
import pandas as pd
df = pd.DataFrame({
'A': [10, 20, 30, 40, 50],
'B': [5, 4, 3, 2, 1]
})
indices = [0, 2, 4]
selected = df.loc[indices]
print("Selected Rows:")
print(selected)
Output:
A B
0 10 5
2 30 3
4 50 1
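For more readable filters, query and isin cover the same ground as boolean masks in a compact form. A quick sketch:
import pandas as pd
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1]
})
# Equivalent to df[(df['A'] > 2) & (df['B'] < 4)]
print(df.query('A > 2 and B < 4'))
# Keep rows whose 'A' value is in a given set
print(df[df['A'].isin([1, 3, 5])])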
💡 Understand Memory Layout
Row-major vs. Column-major: NumPy arrays are row-major (C-style) by default, but Pandas stores DataFrame columns in blocks, so the array returned by df.values is often Fortran-ordered rather than C-contiguous. Checking the contiguity flags helps avoid surprises (and unnecessary copies) when passing data to NumPy routines.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': np.random.rand(1000),
'B': np.random.rand(1000)
})
# Check the memory layout of the underlying array
print(df.values.flags['C_CONTIGUOUS'])  # Often False: .values on a multi-column frame is typically Fortran-ordered
# Convert to Fortran-order if needed
df_f = pd.DataFrame(np.asfortranarray(df.values), columns=df.columns)
print(df_f.values.flags['F_CONTIGUOUS']) # Output: True
💡 Utilize Advanced Functions
Polynomial Operations: Use Pandas in conjunction with NumPy for polynomial computations.
import pandas as pd
import numpy as np
# Creating a DataFrame
df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
df['x_squared'] = df['x'] ** 2
df['x_cubed'] = df['x'] ** 3
print("DataFrame with Polynomial Features:")
print(df)
Output:
x x_squared x_cubed
0 1 1 1
1 2 4 8
2 3 9 27
3 4 16 64
4 5 25 125
Fourier Transforms: Perform Fourier transforms using Pandas and NumPy.
import pandas as pd
import numpy as np
# Creating a DataFrame with a reproducible random signal
np.random.seed(0)
df = pd.DataFrame({'Signal': np.random.rand(8)})
print("Original Signal:")
print(df)
# Compute FFT
df['FFT'] = np.fft.fft(df['Signal'])
print("\nSignal after FFT:")
print(df)
Output:
Original Signal:
Signal
0 0.548814
1 0.715189
2 0.602763
3 0.544883
4 0.423655
5 0.645894
6 0.437587
7 0.891773
Signal after FFT:
(The FFT column now holds complex frequency components: the zero-frequency term in row 0 equals the sum of the signal, roughly 4.81 for this seed, and the remaining terms come in complex-conjugate pairs because the input signal is real.)
Linear Algebra: Use NumPy's linear algebra functions for complex computations alongside Pandas.
import pandas as pd
import numpy as np
# Creating a DataFrame
df = pd.DataFrame({
'A': [1, 2],
'B': [3, 4]
})
# Convert to NumPy array
matrix = df.values
# Matrix inversion
inv_matrix = np.linalg.inv(matrix)
print("Inverse Matrix:")
print(inv_matrix)
Output:
Inverse Matrix:
[[-2.   1.5]
 [ 1.  -0.5]]
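When the goal is to solve a linear system rather than inspect the inverse itself, np.linalg.solve is usually preferred because it avoids forming an explicit inverse. A tiny sketch (the right-hand side vector is made up for illustration):
import pandas as pd
import numpy as np
A = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}).values  # coefficient matrix [[1, 3], [2, 4]]
b = np.array([5, 6])  # hypothetical right-hand side
# Solve A x = b directly instead of computing inv(A) @ b
x = np.linalg.solve(A, b)
print(x)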
16. 💡 Real-World Applications
💡 Data Analysis
Pandas is a foundational tool for data analysis, enabling efficient data manipulation and computation.
import pandas as pd
# Load data from a CSV file
data = pd.read_csv('data.csv')
# Display the first few rows
print(data.head())
# Compute statistics
mean = data.mean()
median = data.median()
std_dev = data.std()
print("\nMean of each column:")
print(mean)
print("\nMedian of each column:")
print(median)
print("\nStandard Deviation of each column:")
print(std_dev)
Output:
Column1 Column2 Column3
0 1 4 7
1 2 5 8
2 3 6 9
3 4 7 10
4 5 8 11
Mean of each column:
Column1 3.0
Column2 6.0
Column3 9.0
dtype: float64
Median of each column:
Column1 3.0
Column2 6.0
Column3 9.0
dtype: float64
Standard Deviation of each column:
Column1 1.581139
Column2 1.581139
Column3 1.581139
dtype: float64
(Note: Replace 'data.csv' with your actual data file path.)
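Beyond per-column statistics, a quick look at the overall summary and pairwise correlations is a common next step in exploratory analysis. A brief sketch on the same (hypothetical) file:
import pandas as pd
data = pd.read_csv('data.csv')  # replace with your actual data file path
# Summary statistics for every numeric column
print(data.describe())
# Pairwise correlations between numeric columns
print(data.corr(numeric_only=True))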
💡 Image Processing
Pandas can be used in conjunction with libraries like OpenCV or PIL for image processing tasks, such as analyzing pixel data or metadata.
import pandas as pd
from PIL import Image
import numpy as np
# Load an image
image = Image.open('image.jpg').convert('RGB')
arr = np.array(image)
# Create a DataFrame with pixel data
pixels = pd.DataFrame(arr.reshape(-1, 3), columns=['Red', 'Green', 'Blue'])
print("Pixel DataFrame:")
print(pixels.head())
# Analyze color distribution
color_counts = pixels.groupby(['Red', 'Green', 'Blue']).size().reset_index(name='Count')
print("\nColor Distribution:")
print(color_counts.head())
Output:
Pixel DataFrame:
Red Green Blue
0 34 67 89
1 34 67 89
2 34 67 89
3 35 68 90
4 35 68 90
Color Distribution:
Red Green Blue Count
0 34 67 89 3
1 35 68 90 2
...
(Note: Replace 'image.jpg' with your actual image file path.)
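With the pixels in a DataFrame, per-channel summaries come almost for free; for example, the mean intensity of each colour channel (again using a hypothetical image path):
import pandas as pd
import numpy as np
from PIL import Image
image = Image.open('image.jpg').convert('RGB')  # replace with your actual image file path
pixels = pd.DataFrame(np.array(image).reshape(-1, 3), columns=['Red', 'Green', 'Blue'])
# Average intensity of each colour channel
print(pixels.mean())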
💡 Financial Modeling
Perform complex financial calculations and simulations using Pandas' data manipulation capabilities.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# Create a DataFrame with stock prices
dates = pd.date_range(start='2024-01-01', periods=100, freq='D')
prices = np.random.lognormal(mean=0.001, sigma=0.02, size=100).cumprod() * 100
df = pd.DataFrame({'Date': dates, 'Price': prices})
df.set_index('Date', inplace=True)
print("Stock Prices DataFrame:")
print(df.head())
# Calculate daily returns
df['Daily_Return'] = df['Price'].pct_change()
print("\nDataFrame with Daily Returns:")
print(df.head())
# Calculate moving average
df['Moving_Avg_10'] = df['Price'].rolling(window=10).mean()
print("\nDataFrame with 10-Day Moving Average:")
print(df.head(15))
# Plotting the moving average
df[['Price', 'Moving_Avg_10']].plot(figsize=(10, 5))
plt.title('Stock Prices and 10-Day Moving Average')
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()
Output:
Stock Prices DataFrame:
Price
Date
2024-01-01 101.234567
2024-01-02 102.345678
2024-01-03 103.456789
2024-01-04 104.567890
2024-01-05 105.678901
DataFrame with Daily Returns:
Price Daily_Return
Date
2024-01-01 101.234567 NaN
2024-01-02 102.345678 0.010000
2024-01-03 103.456789 0.010000
2024-01-04 104.567890 0.010000
2024-01-05 105.678901 0.010000
DataFrame with 10-Day Moving Average:
Price Daily_Return Moving_Avg_10
Date
2024-01-01 101.234567 NaN NaN
2024-01-02 102.345678 0.010000 NaN
2024-01-03 103.456789 0.010000 NaN
2024-01-04 104.567890 0.010000 NaN
2024-01-05 105.678901 0.010000 NaN
2024-01-06 106.789012 0.010000 NaN
2024-01-07 107.890123 0.010000 NaN
2024-01-08 108.901234 0.010000 NaN
2024-01-09 109.012345 0.010000 NaN
2024-01-10 110.123456 0.010000 105.000000
2024-01-11 111.234567 0.010000 106.111111
2024-01-12 112.345678 0.010000 107.222222
2024-01-13 113.456789 0.010000 108.333333
2024-01-14 114.567890 0.010000 109.444444
2024-01-15 115.678901 0.010000 110.555556
A plot of the stock prices and the 10-day moving average is displayed. (Note: the prices are randomly generated, so your output will differ from the illustrative values above.)
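Daily returns are often turned into a cumulative growth curve with a running product, a natural follow-up to the analysis above. A short, self-contained sketch (the simulated prices are again random, so your numbers will differ):
import pandas as pd
import numpy as np
dates = pd.date_range(start='2024-01-01', periods=100, freq='D')
prices = np.random.lognormal(mean=0.001, sigma=0.02, size=100).cumprod() * 100
df = pd.DataFrame({'Price': prices}, index=dates)
# Growth of one unit invested at the start
df['Daily_Return'] = df['Price'].pct_change()
df['Cumulative_Return'] = (1 + df['Daily_Return'].fillna(0)).cumprod()
print(df.tail())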
17. 💡 Performance Optimization
💡 Utilize Efficient Data Types
Choosing the right data type can lead to significant memory and performance improvements.
import pandas as pd
# Using int8 instead of int64
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, dtype='int8')
print(df.dtypes)
Output:
A int8
B int8
dtype: object
💡 Minimize Data Copies
Be aware of operations that create copies of DataFrames and minimize them to save memory and increase speed.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3]})
df += 10  # Augmented assignment updates the existing DataFrame; df = df + 10 would build a new one first
print(df)
Output:
A
0 11
1 12
2 13
💡 Leverage Just-In-Time Compilation
Use libraries like Numba to compile Pandas operations into optimized machine code.
import pandas as pd
import numpy as np
from numba import njit
@njit
def compute_sum(arr):
total = 0.0
for i in range(arr.size):
total += arr[i]
return total
df = pd.DataFrame({'A': np.random.rand(1000000)})
total_sum = compute_sum(df['A'].values)
print(f"Total Sum: {total_sum}")
Output:
Total Sum: 500000.123456
(Note: Output will vary based on random numbers.)
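Another option for speeding up compound column arithmetic is to hand the whole expression to pandas' eval, which (when the numexpr package is installed) evaluates it in one pass without materialising intermediate arrays. A small sketch:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'A': np.random.rand(1_000_000),
    'B': np.random.rand(1_000_000),
    'C': np.random.rand(1_000_000)
})
# Evaluated as a single expression instead of creating a temporary for A * B
df['D'] = df.eval('A * B + C')
print(df['D'].head())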
💡 Profile Your Code
Identify bottlenecks using profiling tools to optimize critical sections of your code.
import pandas as pd
import cProfile
def compute():
df = pd.read_csv('large_dataset.csv')
return df.describe()
cProfile.run('compute()')
Output:
4 function calls in 0.035 seconds
Ordered by: standard name
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.035 0.035 0.035 0.035 <ipython-input-1-...>:1(compute)
1 0.000 0.000 0.035 0.035 {built-in method builtins.exec}
1 0.000 0.000 0.000 0.000 {built-in method builtins.print}
(Note: Replace 'large_dataset.csv' with your actual large data file path.)
💡 Use In-Place Operations
Modify DataFrames in place to save memory and reduce execution time.
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
df.drop('A', axis=1, inplace=True)
print(df)
Output:
Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]
💡 Optimize Memory Layout
Understanding and optimizing the memory layout can lead to performance gains, especially for large DataFrames.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': np.random.rand(1000),
'B': np.random.rand(1000)
})
# Check the memory layout of the underlying array
print(df.values.flags['C_CONTIGUOUS'])  # Often False: .values on a multi-column frame is typically Fortran-ordered
# Convert to Fortran-order if needed
df_f = pd.DataFrame(np.asfortranarray(df.values), columns=df.columns)
print(df_f.values.flags['F_CONTIGUOUS']) # Output: True
💡 Utilize Parallel Processing
Leverage parallel processing capabilities with libraries like joblib to perform operations on DataFrames concurrently. Keep in mind that for simple element-wise arithmetic a vectorized expression is far faster; parallelism pays off when the per-row work is genuinely expensive (see the sketch after this example).
import pandas as pd
from joblib import Parallel, delayed
def process_row(row):
return row['A'] * row['B']
df = pd.DataFrame({
'A': range(1000000),
'B': range(1000000, 2000000)
})
# Parallel computation
results = Parallel(n_jobs=-1)(delayed(process_row)(row) for index, row in df.iterrows())
df['C'] = results
print(df.head())
Output:
A B C
0 0 1000000 0
1 1 1000001 1000001
2 2 1000002 2000004
3 3 1000003 3000009
4 4 1000004 4000016
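As a point of comparison, here is a sketch of the vectorized version of the same computation; on a frame of this size it typically finishes in milliseconds, whereas spawning workers and iterating over rows adds substantial overhead:
import pandas as pd
import time
df = pd.DataFrame({
    'A': range(1_000_000),
    'B': range(1_000_000, 2_000_000)
})
start = time.perf_counter()
df['C'] = df['A'] * df['B']  # one vectorized multiplication over the whole column
print(f"Vectorized multiply took {time.perf_counter() - start:.4f} seconds")
print(df.head())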
18. 💡 Conclusion
Pandas is an indispensable tool in the Python ecosystem, providing the foundational structures and functions required for efficient data manipulation and analysis. Its seamless integration with other data science libraries and its performance optimizations make it a preferred choice for data scientists, machine learning engineers, and analysts. By mastering Pandas, you're well-equipped to handle complex data preprocessing, perform insightful analyses, and prepare data for robust machine learning models. Continue exploring its vast capabilities and integrate Pandas into your daily data tasks to unlock new levels of efficiency and productivity.