Phase 01: Foundations of Python and Mathematics

Day 06: Data Manipulation with Pandas

📑 Table of Contents

  1. 🌟 Welcome to Day 6
  2. 📚 Introduction to Pandas
    • What is Pandas?
    • Benefits of Using Pandas
    • Installing Pandas
  3. 🧩 Core Concepts
    • Series
    • DataFrame
    • Indexing and Selecting Data
    • Handling Missing Data
    • Merging and Joining DataFrames
    • Grouping and Aggregating Data
    • Pivot Tables
    • Datetime Operations
  4. 💻 Hands-On Coding
    • Example Scripts
  5. 🧩 Interactive Exercises
  6. 📚 Resources
  7. 💡 Tips and Tricks
  8. 💡 Best Practices
  9. 💡 Advanced Topics
  10. 💡 Real-World Applications
  11. 💡 Machine Learning Integration
  12. 💡 Additional Tips
  13. 💡 Advanced Tips
  14. 💡 Conclusion

1. 🌟 Welcome to Day 6

Welcome to Day 6 of "Becoming a Scikit-Learn Boss in 90 Days"! 🎉 Today, we delve into the powerful world of Pandas, a cornerstone library for data manipulation and analysis in Python. Pandas provides data structures like Series and DataFrame that make handling structured data intuitive and efficient. Mastering Pandas is essential for cleaning, transforming, and preparing data for machine learning models. Let's get started and unlock the full potential of your data! 🚀


2. 📚 Introduction to Pandas

What is Pandas?

Pandas is an open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools. It is built on top of NumPy and integrates seamlessly with many other data science libraries like Matplotlib, Scikit-Learn, and SciPy.

Key Features:

  • DataFrame Object: A powerful 2-dimensional data structure with labeled axes (rows and columns).
  • Handling Missing Data: Easily manage and clean incomplete data.
  • Data Alignment and Integrated Handling of Time Series: Efficiently align data based on labels.
  • Flexible Group By Functionality: Perform split-apply-combine operations on datasets.
  • Rich Functionality for Data Manipulation: Including merging, joining, reshaping, and pivoting.

Benefits of Using Pandas

  • Ease of Use: Intuitive data structures and functions for data manipulation.
  • Performance: Optimized for performance with large datasets.
  • Flexibility: Supports a wide range of data formats and operations.
  • Integration: Works well with other data science and visualization libraries.
  • Community Support: Extensive documentation and a large user community.

Installing Pandas

If you haven't installed Pandas yet, you can do so using pip:

pip install pandas

Or, if you're using Anaconda:

conda install pandas
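
To verify the installation, import Pandas and print its version:

import pandas as pd

# Confirm the installation by printing the installed version
print(pd.__version__)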

3. 🧩 Core Concepts

📝 Series

A Series is a one-dimensional labeled array capable of holding any data type.

Creating a Series:

import pandas as pd

# From a Python list
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)

Output:

0    10
1    20
2    30
3    40
4    50
dtype: int64

Creating a Series with Custom Index:

data = [10, 20, 30, 40, 50]
index = ['a', 'b', 'c', 'd', 'e']
series = pd.Series(data, index=index)
print(series)

Output:

a    10
b    20
c    30
d    40
e    50
dtype: int64
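
Creating a Series from a Dictionary:

# Dictionary keys become the index labels
series = pd.Series({'a': 10, 'b': 20, 'c': 30})
print(series)

Output:

a    10
b    20
c    30
dtype: int64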

📝 DataFrame

A DataFrame is a two-dimensional labeled data structure with columns of potentially different types.

Creating a DataFrame:

import pandas as pd

# From a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
print(df)

Output:

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston

Creating a DataFrame from a List of Lists:

import pandas as pd

data = [
    ['Alice', 25, 'New York'],
    ['Bob', 30, 'Los Angeles'],
    ['Charlie', 35, 'Chicago'],
    ['David', 40, 'Houston']
]
columns = ['Name', 'Age', 'City']
df = pd.DataFrame(data, columns=columns)
print(df)

Output:

      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston

📝 Indexing and Selecting Data

Pandas offers multiple ways to select and manipulate data within Series and DataFrame.

Selecting a Column:

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)

# Selecting the 'Age' column
age = df['Age']
print(age)

Output:

0    25
1    30
2    35
3    40
Name: Age, dtype: int64

Selecting Multiple Columns:

# Selecting 'Name' and 'City' columns
subset = df[['Name', 'City']]
print(subset)

Output:

      Name         City
0    Alice     New York
1      Bob  Los Angeles
2  Charlie      Chicago
3    David      Houston

Selecting Rows by Index:

# Selecting the first two rows
rows = df.iloc[0:2]
print(rows)

Output:

    Name  Age         City
0  Alice   25     New York
1    Bob   30  Los Angeles

Selecting Rows by Label:

# Setting a custom index
df.set_index('Name', inplace=True)
print(df)

Output:

         Age         City
Name                      
Alice      25     New York
Bob        30  Los Angeles
Charlie    35      Chicago
David      40      Houston

Selecting a Row by Label:

# Selecting the row for 'Charlie'
charlie = df.loc['Charlie']
print(charlie)

Output:

Age         35
City    Chicago
Name: Charlie, dtype: object
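
Selecting Rows by Condition:

# Boolean indexing: all rows where Age exceeds 30 (using the indexed df above)
print(df[df['Age'] > 30])

Output:

         Age     City
Name                 
Charlie   35  Chicago
David     40  Houston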

📝 Handling Missing Data

Missing data is common in real-world datasets. Pandas provides robust methods to handle it.

Detecting Missing Values:

import pandas as pd
import numpy as np

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 40],
    'City': ['New York', 'Los Angeles', np.nan, 'Houston']
}
df = pd.DataFrame(data)
print(df)

Output:

      Name   Age         City
0    Alice  25.0     New York
1      Bob   NaN  Los Angeles
2  Charlie  35.0          NaN
3    David  40.0      Houston
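
A quick per-column count of the missing entries (using the DataFrame above) makes the gaps easy to spot:

# Count missing values in each column
print(df.isnull().sum())

Output:

Name    0
Age     1
City    1
dtype: int64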

Dropping Missing Values:

# Drop rows with any missing values
cleaned_df = df.dropna()
print(cleaned_df)

Output:

    Name   Age      City
0  Alice  25.0  New York
3  David  40.0   Houston

Filling Missing Values:

# Fill missing 'Age' with the mean age
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)

# Fill missing 'City' with 'Unknown'
df['City'] = df['City'].fillna('Unknown')

print(df)

Output:

      Name        Age         City
0    Alice  25.000000     New York
1      Bob  33.333333  Los Angeles
2  Charlie  35.000000      Unknown
3    David  40.000000      Houston

📝 Merging and Joining DataFrames

Combine multiple DataFrames using various join operations.

Merging DataFrames:

import pandas as pd

# First DataFrame
data1 = {
    'EmployeeID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
}
df1 = pd.DataFrame(data1)

# Second DataFrame
data2 = {
    'EmployeeID': [3, 4, 5, 6],
    'Department': ['HR', 'Engineering', 'Marketing', 'Finance']
}
df2 = pd.DataFrame(data2)

# Merge on 'EmployeeID'
merged_df = pd.merge(df1, df2, on='EmployeeID', how='inner')
print(merged_df)

Output:

   EmployeeID     Name   Department
0           3  Charlie           HR
1           4    David  Engineering

Joining DataFrames:

import pandas as pd

# Set 'EmployeeID' as index
df1.set_index('EmployeeID', inplace=True)
df2.set_index('EmployeeID', inplace=True)

# Join the DataFrames
joined_df = df1.join(df2, how='outer', lsuffix='_left', rsuffix='_right')
print(joined_df)

Output:

              Name   Department
EmployeeID                     
1            Alice          NaN
2              Bob          NaN
3          Charlie           HR
4            David  Engineering
5              NaN    Marketing
6              NaN      Finance

📝 Grouping and Aggregating Data

Group data based on certain criteria and perform aggregate operations.

Grouping Data:

import pandas as pd

data = {
    'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing', 'Finance'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Salary': [50000, 80000, 55000, 90000, 60000, 75000]
}
df = pd.DataFrame(data)

# Group by 'Department'
grouped = df.groupby('Department')

Aggregating Data:

# Calculate mean salary per department
mean_salary = grouped['Salary'].mean()
print(mean_salary)

Output:

Department
Engineering    85000.0
Finance        75000.0
HR             52500.0
Marketing      60000.0
Name: Salary, dtype: float64

Multiple Aggregations:

# Calculate mean and sum of salaries per department
agg_salary = grouped['Salary'].agg(['mean', 'sum'])
print(agg_salary)

Output:

                mean     sum
Department                  
Engineering  85000.0  170000
Finance      75000.0   75000
HR           52500.0  105000
Marketing    60000.0   60000

📝 Pivot Tables

Create pivot tables to summarize data.

Creating a Pivot Table:

import pandas as pd

data = {
    'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing', 'Finance', 'HR'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Gender': ['F', 'M', 'M', 'M', 'F', 'M', 'F'],
    'Salary': [50000, 80000, 55000, 90000, 60000, 75000, 58000]
}
df = pd.DataFrame(data)

# Pivot table: Average salary by Department and Gender
pivot = pd.pivot_table(df, values='Salary', index='Department', columns='Gender', aggfunc='mean')
print(pivot)

Output:

Gender             F        M
Department                   
Engineering      NaN  85000.0
Finance          NaN  75000.0
HR           54000.0  55000.0
Marketing    60000.0      NaN
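
Totals can be added with the standard margins parameter of pivot_table (a quick extension of the same pivot):

# margins=True appends an 'All' row and column with overall aggregates
pivot_all = pd.pivot_table(df, values='Salary', index='Department',
                           columns='Gender', aggfunc='mean', margins=True)
print(pivot_all)

Output:

Gender             F        M           All
Department                                 
Engineering      NaN  85000.0  85000.000000
Finance          NaN  75000.0  75000.000000
HR           54000.0  55000.0  54333.333333
Marketing    60000.0      NaN  60000.000000
All          56000.0  75000.0  66857.142857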

📝 Datetime Operations

Handle and manipulate datetime data efficiently.

Converting to Datetime:

import pandas as pd

data = {
    'Event': ['Start', 'Middle', 'End'],
    'Date': ['2024-01-01', '2024-06-15', '2024-12-31']
}
df = pd.DataFrame(data)

# Convert 'Date' to datetime
df['Date'] = pd.to_datetime(df['Date'])
print(df)

Output:

    Event       Date
0   Start 2024-01-01
1  Middle 2024-06-15
2     End 2024-12-31

Extracting Date Components:

# Extract year, month, and day
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
print(df)

Output:

    Event       Date  Year  Month  Day
0   Start 2024-01-01  2024      1    1
1  Middle 2024-06-15  2024      6   15
2     End 2024-12-31  2024     12   31

Filtering by Date Range:

# Filter events after June 1, 2024
filtered = df[df['Date'] > '2024-06-01']
print(filtered)

Output:

    Event       Date  Year  Month  Day
1  Middle 2024-06-15  2024      6   15
2     End 2024-12-31  2024     12   31
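
Other handy accessors live under .dt as well, such as the weekday name:

# Day-of-week name for each date
df['Weekday'] = df['Date'].dt.day_name()
print(df[['Date', 'Weekday']])

Output:

        Date  Weekday
0 2024-01-01   Monday
1 2024-06-15 Saturday
2 2024-12-31  Tuesday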

4. 💻 Hands-On Coding

🎉 Example Scripts

📝 Script 1: Basic DataFrame Operations

# basic_dataframe_operations.py

import pandas as pd

# Creating a DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)

# Display the DataFrame
print("Original DataFrame:")
print(df)

# Adding a new column
df['Salary'] = [70000, 80000, 90000, 100000]
print("\nDataFrame after adding 'Salary' column:")
print(df)

# Dropping a column
df.drop('City', axis=1, inplace=True)
print("\nDataFrame after dropping 'City' column:")
print(df)

Output:

Original DataFrame:
      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    David   40      Houston

DataFrame after adding 'Salary' column:
      Name  Age         City  Salary
0    Alice   25     New York   70000
1      Bob   30  Los Angeles   80000
2  Charlie   35      Chicago   90000
3    David   40      Houston  100000

DataFrame after dropping 'City' column:
      Name  Age  Salary
0    Alice   25   70000
1      Bob   30   80000
2  Charlie   35   90000
3    David   40  100000

📝 Script 2: Handling Missing Data

# handling_missing_data.py

import pandas as pd
import numpy as np

# Creating a DataFrame with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, np.nan, 35, 40],
    'City': ['New York', 'Los Angeles', np.nan, 'Houston']
}
df = pd.DataFrame(data)
print("Original DataFrame with missing values:")
print(df)

# Dropping rows with any missing values
cleaned_df = df.dropna()
print("\nDataFrame after dropping missing values:")
print(cleaned_df)

# Filling missing values
mean_age = df['Age'].mean()
df['Age'] = df['Age'].fillna(mean_age)
df['City'] = df['City'].fillna('Unknown')
print("\nDataFrame after filling missing values:")
print(df)

Output:

Original DataFrame with missing values:
      Name   Age         City
0    Alice  25.0     New York
1      Bob   NaN  Los Angeles
2  Charlie  35.0          NaN
3    David  40.0      Houston

DataFrame after dropping missing values:
    Name   Age      City
0  Alice  25.0  New York
3  David  40.0   Houston

DataFrame after filling missing values:
      Name        Age         City
0    Alice  25.000000     New York
1      Bob  33.333333  Los Angeles
2  Charlie  35.000000      Unknown
3    David  40.000000      Houston

📝 Script 3: Merging DataFrames

# merging_dataframes.py

import pandas as pd

# First DataFrame
data1 = {
    'EmployeeID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David']
}
df1 = pd.DataFrame(data1)

# Second DataFrame
data2 = {
    'EmployeeID': [3, 4, 5, 6],
    'Department': ['HR', 'Engineering', 'Marketing', 'Finance']
}
df2 = pd.DataFrame(data2)

# Merge DataFrames on 'EmployeeID'
merged_df = pd.merge(df1, df2, on='EmployeeID', how='inner')
print("Merged DataFrame (Inner Join):")
print(merged_df)

# Outer Join
outer_merged = pd.merge(df1, df2, on='EmployeeID', how='outer', indicator=True)
print("\nMerged DataFrame (Outer Join with Indicator):")
print(outer_merged)

Output:

Merged DataFrame (Inner Join):
   EmployeeID     Name   Department
0           3  Charlie           HR
1           4    David  Engineering

Merged DataFrame (Outer Join with Indicator):
   EmployeeID     Name   Department      _merge
0           1    Alice          NaN   left_only
1           2      Bob          NaN   left_only
2           3  Charlie           HR        both
3           4    David  Engineering        both
4           5      NaN    Marketing  right_only
5           6      NaN      Finance  right_only

📝 Script 4: Grouping and Aggregating Data

# grouping_aggregating.py

import pandas as pd

# Creating a DataFrame
data = {
    'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing', 'Finance'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank'],
    'Salary': [50000, 80000, 55000, 90000, 60000, 75000]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Group by 'Department'
grouped = df.groupby('Department')

# Calculate mean salary per department
mean_salary = grouped['Salary'].mean()
print("\nMean Salary per Department:")
print(mean_salary)

# Multiple aggregations
agg_salary = grouped['Salary'].agg(['mean', 'sum', 'count'])
print("\nAggregated Salary Statistics:")
print(agg_salary)

Output:

Original DataFrame:
    Department Employee  Salary
0           HR    Alice   50000
1  Engineering      Bob   80000
2           HR  Charlie   55000
3  Engineering    David   90000
4    Marketing      Eve   60000
5      Finance    Frank   75000

Mean Salary per Department:
Department
Engineering    85000.0
Finance        75000.0
HR             52500.0
Marketing      60000.0
Name: Salary, dtype: float64

Aggregated Salary Statistics:
                mean     sum  count
Department                         
Engineering  85000.0  170000      2
Finance      75000.0   75000      1
HR           52500.0  105000      2
Marketing    60000.0   60000      1

📝 Script 5: Creating Pivot Tables

# pivot_tables.py

import pandas as pd

# Creating a DataFrame
data = {
    'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing', 'Finance', 'HR'],
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Grace'],
    'Gender': ['F', 'M', 'M', 'M', 'F', 'M', 'F'],
    'Salary': [50000, 80000, 55000, 90000, 60000, 75000, 58000]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Creating a pivot table: Average salary by Department and Gender
pivot = pd.pivot_table(df, values='Salary', index='Department', columns='Gender', aggfunc='mean')
print("\nPivot Table - Average Salary by Department and Gender:")
print(pivot)

Output:

Original DataFrame:
    Department Employee Gender  Salary
0           HR    Alice      F   50000
1  Engineering      Bob      M   80000
2           HR  Charlie      M   55000
3  Engineering    David      M   90000
4    Marketing      Eve      F   60000
5      Finance    Frank      M   75000
6           HR    Grace      F   58000

Pivot Table - Average Salary by Department and Gender:
Gender             F        M
Department                   
Engineering      NaN  85000.0
Finance          NaN  75000.0
HR           54000.0  55000.0
Marketing    60000.0      NaN

📝 Script 6: Datetime Operations

# datetime_operations.py

import pandas as pd

# Creating a DataFrame with date information
data = {
    'Event': ['Start', 'Middle', 'End'],
    'Date': ['2024-01-01', '2024-06-15', '2024-12-31']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Converting 'Date' to datetime
df['Date'] = pd.to_datetime(df['Date'])
print("\nDataFrame after converting 'Date' to datetime:")
print(df)

# Extracting year, month, and day
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
print("\nDataFrame with extracted date components:")
print(df)

# Filtering events after June 1, 2024
filtered = df[df['Date'] > '2024-06-01']
print("\nFiltered DataFrame (Events after June 1, 2024):")
print(filtered)

Output:

Original DataFrame:
    Event        Date
0   Start  2024-01-01
1  Middle  2024-06-15
2     End  2024-12-31

DataFrame after converting 'Date' to datetime:
    Event       Date
0   Start 2024-01-01
1  Middle 2024-06-15
2     End 2024-12-31

DataFrame with extracted date components:
    Event       Date  Year  Month  Day
0   Start 2024-01-01  2024      1    1
1  Middle 2024-06-15  2024      6   15
2     End 2024-12-31  2024     12   31

Filtered DataFrame (Events after June 1, 2024):
    Event       Date  Year  Month  Day
1  Middle 2024-06-15  2024      6   15
2     End 2024-12-31  2024     12   31

5. 🧩 Interactive Exercises

📝 Exercise 1: Creating and Manipulating DataFrames

Task: Create a DataFrame from a dictionary containing student names, ages, and grades, then perform the following operations:

    • Add a new column for passing status based on grades (pass if grade >= 60).
    • Filter the students who are passing.
    • Calculate the average age of passing students.

import pandas as pd

# Create DataFrame
data = {
    'Name': ['John', 'Emma', 'Liam', 'Olivia', 'Noah'],
    'Age': [20, 21, 19, 22, 20],
    'Grade': [85, 58, 90, 76, 45]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Add 'Passed' column
df['Passed'] = df['Grade'] >= 60
print("\nDataFrame after adding 'Passed' column:")
print(df)

# Filter passing students
passing_students = df[df['Passed']]
print("\nPassing Students:")
print(passing_students)

# Calculate average age of passing students
average_age = passing_students['Age'].mean()
print(f"\nAverage Age of Passing Students: {average_age}")

Output:

Original DataFrame:
    Name  Age  Grade
0   John   20     85
1   Emma   21     58
2   Liam   19     90
3  Olivia   22     76
4   Noah   20     45

DataFrame after adding 'Passed' column:
    Name  Age  Grade  Passed
0   John   20     85    True
1   Emma   21     58   False
2   Liam   19     90    True
3  Olivia   22     76    True
4   Noah   20     45   False

Passing Students:
    Name  Age  Grade  Passed
0   John   20     85    True
2   Liam   19     90    True
3  Olivia   22     76    True

Average Age of Passing Students: 20.333333333333332

📝 Exercise 2: Handling Missing Data

Task: Given a DataFrame with missing values in the 'Age' and 'City' columns, perform the following:

    • Identify missing values.
    • Fill missing 'Age' with the median age.
    • Fill missing 'City' with the mode of the 'City' column.
    • Drop any remaining rows with missing values.

import pandas as pd
import numpy as np

# Create DataFrame with missing values
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, np.nan, 35, 40, np.nan],
    'City': ['New York', 'Los Angeles', np.nan, 'Houston', 'Chicago']
}
df = pd.DataFrame(data)
print("Original DataFrame with missing values:")
print(df)

# Identify missing values
print("\nMissing Values:")
print(df.isnull())

# Fill missing 'Age' with median age
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)
print(f"\nMedian Age: {median_age}")
print("DataFrame after filling missing 'Age':")
print(df)

# Fill missing 'City' with mode of 'City'
mode_city = df['City'].mode()[0]
df['City'] = df['City'].fillna(mode_city)
print(f"\nMode City: {mode_city}")
print("DataFrame after filling missing 'City':")
print(df)

# Drop any remaining rows with missing values
cleaned_df = df.dropna()
print("\nDataFrame after dropping remaining missing values:")
print(cleaned_df)

Output:

Original DataFrame with missing values:
      Name   Age         City
0    Alice  25.0     New York
1      Bob   NaN  Los Angeles
2  Charlie  35.0          NaN
3    David  40.0      Houston
4      Eve   NaN      Chicago

Missing Values:
    Name    Age   City
0  False  False  False
1  False   True  False
2  False  False   True
3  False  False  False
4  False   True  False

Median Age: 35.0
DataFrame after filling missing 'Age':
      Name   Age         City
0    Alice  25.0     New York
1      Bob  35.0  Los Angeles
2  Charlie  35.0          NaN
3    David  40.0      Houston
4      Eve  35.0      Chicago

Mode City: Chicago
DataFrame after filling missing 'City':
      Name   Age         City
0    Alice  25.0     New York
1      Bob  35.0  Los Angeles
2  Charlie  35.0      Chicago
3    David  40.0      Houston
4      Eve  35.0      Chicago

DataFrame after dropping remaining missing values:
      Name   Age         City
0    Alice  25.0     New York
1      Bob  35.0  Los Angeles
2  Charlie  35.0      Chicago
3    David  40.0      Houston
4      Eve  35.0      Chicago

📝 Exercise 3: Merging DataFrames

Task: Given two DataFrames containing product information and sales data, perform the following:

    • Merge the DataFrames on 'ProductID'.
    • Perform an inner join and display the result.
    • Perform a left join and display the result.

import pandas as pd

# Product Information DataFrame
products = {
    'ProductID': [101, 102, 103, 104],
    'ProductName': ['Laptop', 'Smartphone', 'Tablet', 'Monitor'],
    'Price': [1200, 800, 300, 200]
}
df_products = pd.DataFrame(products)
print("Products DataFrame:")
print(df_products)

# Sales Data DataFrame
sales = {
    'SaleID': [1, 2, 3, 4, 5],
    'ProductID': [101, 102, 105, 103, 101],
    'Quantity': [5, 10, 2, 7, 3]
}
df_sales = pd.DataFrame(sales)
print("\nSales DataFrame:")
print(df_sales)

# Inner Join
inner_join = pd.merge(df_sales, df_products, on='ProductID', how='inner')
print("\nInner Join Result:")
print(inner_join)

# Left Join
left_join = pd.merge(df_sales, df_products, on='ProductID', how='left')
print("\nLeft Join Result:")
print(left_join)

Output:

Products DataFrame:
   ProductID ProductName  Price
0        101      Laptop   1200
1        102  Smartphone    800
2        103      Tablet    300
3        104     Monitor    200

Sales DataFrame:
   SaleID  ProductID  Quantity
0       1        101         5
1       2        102        10
2       3        105         2
3       4        103         7
4       5        101         3

Inner Join Result:
   SaleID  ProductID  Quantity ProductName  Price
0       1        101         5      Laptop   1200
1       2        102        10  Smartphone    800
2       4        103         7      Tablet    300
3       5        101         3      Laptop   1200

Left Join Result:
   SaleID  ProductID  Quantity ProductName    Price
0       1        101         5      Laptop  1200.0
1       2        102        10  Smartphone   800.0
2       3        105         2          NaN     NaN
3       4        103         7      Tablet   300.0
4       5        101         3      Laptop  1200.0

📝 Exercise 4: Grouping and Aggregating Data

Task: Given a DataFrame containing sales data, perform the following:

    • Group the data by 'Region' and calculate the total sales.
    • Group the data by 'Product' and calculate the average sales.
    • Find the region with the highest total sales.

import pandas as pd

# Sales DataFrame
data = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Product': ['Laptop', 'Smartphone', 'Laptop', 'Tablet', 'Tablet', 'Laptop', 'Smartphone', 'Laptop'],
    'Sales': [5000, 7000, 6000, 4000, 3000, 8000, 6500, 7200]
}
df = pd.DataFrame(data)
print("Sales DataFrame:")
print(df)

# Group by 'Region' and calculate total sales
total_sales_region = df.groupby('Region')['Sales'].sum()
print("\nTotal Sales by Region:")
print(total_sales_region)

# Group by 'Product' and calculate average sales
average_sales_product = df.groupby('Product')['Sales'].mean()
print("\nAverage Sales by Product:")
print(average_sales_product)

# Find the region with the highest total sales
top_region = total_sales_region.idxmax()
print(f"\nRegion with Highest Total Sales: {top_region}")

Output:

Sales DataFrame:
  Region     Product  Sales
0  North      Laptop   5000
1  South  Smartphone   7000
2   East      Laptop   6000
3   West      Tablet   4000
4  North      Tablet   3000
5  South      Laptop   8000
6   East  Smartphone   6500
7   West      Laptop   7200

Total Sales by Region:
Region
East       12500
North       8000
South      15000
West       11200
Name: Sales, dtype: int64

Average Sales by Product:
Product
Laptop        6550.0
Smartphone    6750.0
Tablet        3500.0
Name: Sales, dtype: float64

Region with Highest Total Sales: South

📝 Exercise 5: Creating Pivot Tables

Task: Given a DataFrame with sales data, create a pivot table to summarize the total sales for each product in each region.

import pandas as pd

# Sales DataFrame
data = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Product': ['Laptop', 'Smartphone', 'Laptop', 'Tablet', 'Tablet', 'Laptop', 'Smartphone', 'Laptop'],
    'Sales': [5000, 7000, 6000, 4000, 3000, 8000, 6500, 7200]
}
df = pd.DataFrame(data)
print("Sales DataFrame:")
print(df)

# Create pivot table
pivot = pd.pivot_table(df, values='Sales', index='Region', columns='Product', aggfunc='sum', fill_value=0)
print("\nPivot Table - Total Sales by Region and Product:")
print(pivot)

Output:

Sales DataFrame:
  Region     Product  Sales
0  North      Laptop   5000
1  South  Smartphone   7000
2   East      Laptop   6000
3   West      Tablet   4000
4  North      Tablet   3000
5  South      Laptop   8000
6   East  Smartphone   6500
7   West      Laptop   7200

Pivot Table - Total Sales by Region and Product:
Product  Laptop  Smartphone  Tablet
Region                             
East       6000        6500       0
North      5000           0    3000
South      8000        7000       0
West       7200           0    4000

📝 Exercise 6: Datetime Operations

Task: Given a DataFrame with event dates, perform the following:

    • Convert the 'Date' column to datetime.
    • Extract the month from the 'Date' column.
    • Filter events that occurred in the second half of the year.

import pandas as pd

# Event DataFrame
data = {
    'Event': ['Start', 'Middle', 'End', 'Review'],
    'Date': ['2024-01-15', '2024-07-20', '2024-12-05', '2024-06-30']
}
df = pd.DataFrame(data)
print("Original Event DataFrame:")
print(df)

# Convert 'Date' to datetime
df['Date'] = pd.to_datetime(df['Date'])
print("\nDataFrame after converting 'Date' to datetime:")
print(df)

# Extract month
df['Month'] = df['Date'].dt.month
print("\nDataFrame with extracted 'Month':")
print(df)

# Filter events in the second half of the year (July to December)
second_half = df[df['Month'] >= 7]
print("\nEvents in the Second Half of the Year:")
print(second_half)

Output:

Original Event DataFrame:
   Event        Date
0   Start  2024-01-15
1  Middle  2024-07-20
2     End  2024-12-05
3  Review  2024-06-30

DataFrame after converting 'Date' to datetime:
   Event       Date
0   Start 2024-01-15
1  Middle 2024-07-20
2     End 2024-12-05
3  Review 2024-06-30

DataFrame with extracted 'Month':
   Event       Date  Month
0   Start 2024-01-15      1
1  Middle 2024-07-20      7
2     End 2024-12-05     12
3  Review 2024-06-30      6

Events in the Second Half of the Year:
   Event       Date  Month
1  Middle 2024-07-20      7
2     End 2024-12-05     12

📝 Exercise 7: Filtering and Sorting Data

Task: Given a DataFrame with employee data, perform the following:

    • Filter employees with a salary greater than $75,000.
    • Sort the filtered data by salary in descending order.
    • Reset the index of the sorted DataFrame.

import pandas as pd

# Employee DataFrame
data = {
    'Employee': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing'],
    'Salary': [70000, 80000, 90000, 85000, 60000]
}
df = pd.DataFrame(data)
print("Original Employee DataFrame:")
print(df)

# Filter employees with salary > 75000
high_earners = df[df['Salary'] > 75000]
print("\nEmployees with Salary > $75,000:")
print(high_earners)

# Sort by salary in descending order
sorted_high_earners = high_earners.sort_values(by='Salary', ascending=False)
print("\nSorted High Earners (Descending Salary):")
print(sorted_high_earners)

# Reset index
sorted_high_earners.reset_index(drop=True, inplace=True)
print("\nSorted High Earners with Reset Index:")
print(sorted_high_earners)

Output:

Original Employee DataFrame:
  Employee   Department  Salary
0    Alice           HR   70000
1      Bob  Engineering   80000
2  Charlie           HR   90000
3    David  Engineering   85000
4      Eve    Marketing   60000

Employees with Salary > $75,000:
  Employee   Department  Salary
1      Bob  Engineering   80000
2  Charlie           HR   90000
3    David  Engineering   85000

Sorted High Earners (Descending Salary):
  Employee   Department  Salary
2  Charlie           HR   90000
3    David  Engineering   85000
1      Bob  Engineering   80000

Sorted High Earners with Reset Index:
  Employee   Department  Salary
0  Charlie           HR   90000
1    David  Engineering   85000
2      Bob  Engineering   80000

📝 Exercise 8: Creating Visualizations with Pandas

Task: Given a DataFrame with sales data, create a bar chart showing total sales per department.

import pandas as pd
import matplotlib.pyplot as plt

# Sales DataFrame
data = {
    'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing', 'Finance'],
    'Sales': [50000, 80000, 55000, 90000, 60000, 75000]
}
df = pd.DataFrame(data)
print("Sales DataFrame:")
print(df)

# Group by 'Department' and sum sales
sales_summary = df.groupby('Department')['Sales'].sum()
print("\nTotal Sales by Department:")
print(sales_summary)

# Plotting the bar chart
sales_summary.plot(kind='bar', color='skyblue')
plt.title('Total Sales per Department')
plt.xlabel('Department')
plt.ylabel('Sales')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Output:

Sales DataFrame:
    Department  Sales
0           HR  50000
1  Engineering  80000
2           HR  55000
3  Engineering  90000
4    Marketing  60000
5      Finance  75000

Total Sales by Department:
Department
Engineering    170000
Finance         75000
HR             105000
Marketing       60000
Name: Sales, dtype: int64

A bar chart displaying total sales per department will be shown.


6. 📚 Resources

Enhance your learning with these excellent resources:

  • Official Pandas Documentation: https://pandas.pydata.org/docs/
  • Pandas User Guide: https://pandas.pydata.org/docs/user_guide/index.html
  • 10 Minutes to Pandas: https://pandas.pydata.org/docs/user_guide/10min.html


7. 💡 Tips and Tricks

💡 Pro Tip

Leverage Pandas Vectorization: Similar to NumPy, Pandas operations are optimized for performance. Avoid using Python loops for data manipulation tasks. Utilize Pandas' built-in functions and vectorized operations to perform operations on entire DataFrames or Series at once.

import pandas as pd

# Inefficient loop
df = pd.DataFrame({'A': range(1000)})
df['B'] = 0
for i in range(len(df)):
    df.at[i, 'B'] = df.at[i, 'A'] * 2

# Efficient vectorized operation
df['B'] = df['A'] * 2

🛠️ Recommended Tools:

  • Jupyter Notebook: Ideal for interactive data analysis and visualization.
  • Visual Studio Code: A versatile code editor with excellent Pandas support.
  • PyCharm: An IDE with powerful features for Python development.
  • Spyder: An IDE tailored for scientific computing and data analysis.
  • Google Colab: An online Jupyter notebook environment that doesn't require setup.

🚀 Speed Up Your Coding

Use Chaining Carefully: While method chaining can make your code more concise, excessive chaining can make debugging difficult. Break down complex operations into simpler steps when necessary.

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [5, 6, 7, 8]
})

# Chained operations
result = df[df['A'] > 2].sort_values(by='B').reset_index(drop=True)
print(result)
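
When a chain misbehaves, unroll it into named steps so each intermediate result can be inspected (same df as above):

# Unrolled version of the chain; every step gets a name for debugging
filtered = df[df['A'] > 2]                  # keep rows where A exceeds 2
ordered = filtered.sort_values(by='B')      # order by column B
result = ordered.reset_index(drop=True)     # renumber rows from 0
print(result)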

Understand Indexing and Selection: Mastering .loc[], .iloc[], and boolean indexing can help you access and manipulate data more efficiently.

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})

# Using .loc[] for label-based indexing
subset = df.loc[0:1, ['A', 'C']]
print(subset)

# Using .iloc[] for position-based indexing
subset = df.iloc[0:2, [0, 2]]
print(subset)

# Boolean indexing
filtered = df[df['A'] > 1]
print(filtered)

Use Built-in Functions: Familiarize yourself with Pandas' extensive library of functions to perform complex operations effortlessly.

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})

# Sum of columns
column_sum = df.sum()
print(column_sum)  # Output: A    6, B    15

# Mean of columns
column_mean = df.mean()
print(column_mean)  # Output: A    2.0, B    5.0

🔍 Debugging Tips

  • Use Debuggers: Tools like the built-in debugger in VS Code or PyCharm allow you to step through your code, inspect variables, and understand the flow of execution.

Use Pandas Profiling: Tools like ydata-profiling (the successor to pandas-profiling) can generate detailed reports of your DataFrame, helping you identify issues quickly.

import pandas as pd
from ydata_profiling import ProfileReport

df = pd.read_csv('data.csv')  # replace with your actual data file
profile = ProfileReport(df, title='Pandas Profiling Report')
profile.to_file("your_report.html")

Leverage Print Statements and .head(): Use print statements and the .head() method to inspect intermediate results and understand how your DataFrames are being transformed.

print(df.head())

Use Assertions: Incorporate assertions to ensure your DataFrames have the expected dimensions and data types.

assert df.shape[0] > 0, "DataFrame is empty"
assert 'A' in df.columns, "'A' column is missing"

Check DataFrame Shapes: Always verify the shape of your DataFrames when performing operations to ensure they align as expected.

print(df.shape)

8. 💡 Best Practices

💡 Write Readable Code

Comment Your Code: Explain complex operations or the purpose of certain blocks of code.

import pandas as pd

# Calculate the mean of the dataset
mean_value = df['Salary'].mean()

Use Descriptive Variable Names: Choose names that clearly describe the purpose of the variable.

import pandas as pd

temperatures_celsius = pd.Series([22.5, 23.0, 21.5, 24.0])
temperatures_fahrenheit = temperatures_celsius * 9/5 + 32

💡 Avoid Common Pitfalls

Data Type Mismatch: Ensure that operations are performed on compatible data types to avoid unexpected results.

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df['C'] = df['A'].astype(float) / df['B']

Immutable Operations: Remember that some operations return new DataFrames instead of modifying in place.

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
df = df + 10  # Correct: reassign to modify

💡 Optimize Performance

Profile Your Code: Use profiling tools like cProfile to identify bottlenecks.

import pandas as pd
import cProfile

def compute():
    df = pd.read_csv('large_dataset.csv')
    return df.describe()

cProfile.run('compute()')

Prefer Reassignment Over inplace=True: Despite its name, inplace=True rarely saves memory (most operations still copy internally), and the Pandas team now discourages it. Reassignment is just as fast and works cleanly with method chaining.

import pandas as pd

df = df.drop('B', axis=1)

Minimize Memory Footprint: Use appropriate data types and avoid unnecessary DataFrame copies.

import pandas as pd

df = pd.DataFrame({
    'A': pd.Series([1, 2, 3], dtype='int8'),
    'B': pd.Series([4.0, 5.0, 6.0], dtype='float32')
})

💡 Stay Updated

Pandas is continuously evolving. Keep up with the latest updates and best practices by following the official Pandas Release Notes.


9. 💡 Advanced Topics

💡 Working with Time Series Data

Pandas provides extensive functionality for handling time series data, including date range generation, frequency conversion, and moving window statistics.

Example:

import pandas as pd
import numpy as np

# Create a date range
dates = pd.date_range(start='2024-01-01', periods=100, freq='D')
df = pd.DataFrame({'Date': dates, 'Value': np.random.randn(100)})

# Set 'Date' as the index
df.set_index('Date', inplace=True)

# Resample to weekly frequency and calculate mean
weekly_mean = df.resample('W').mean()
print(weekly_mean)

Output (values will vary because the data is random):

                Value
Date                 
2024-01-07  0.123456
2024-01-14 -0.654321
...
2024-04-14  0.789012
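
The moving-window statistics mentioned above work the same way; a minimal sketch on the same date-indexed df:

# 7-day rolling mean of the daily values (the first 6 rows are NaN)
df['Rolling_7D'] = df['Value'].rolling(window=7).mean()
print(df.head(10))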

💡 MultiIndex and Hierarchical Indexing

Pandas allows for multiple levels of indexing, enabling more complex data representations and manipulations.

Example:

import pandas as pd

# Creating a MultiIndex DataFrame
arrays = [
    ['Bar', 'Bar', 'Baz', 'Baz', 'Foo', 'Foo', 'Qux', 'Qux'],
    ['One', 'Two', 'One', 'Two', 'One', 'Two', 'One', 'Two']
]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['First', 'Second'])
df = pd.DataFrame({'A': range(8), 'B': range(8, 16)}, index=index)
print("MultiIndex DataFrame:")
print(df)

# Accessing data using MultiIndex
print("\nAccessing data for ('Bar', 'One'):")
print(df.loc[('Bar', 'One')])

Output:

MultiIndex DataFrame:
               A   B
First Second        
Bar   One      0   8
      Two      1   9
Baz   One      2  10
      Two      3  11
Foo   One      4  12
      Two      5  13
Qux   One      6  14
      Two      7  15

Accessing data for ('Bar', 'One'):
A    0
B    8
Name: (Bar, One), dtype: int64
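
To slice across one level of a MultiIndex, use .xs() (a sketch on the df above):

# All rows whose 'Second' index level equals 'One'
print(df.xs('One', level='Second'))

Output:

        A   B
First        
Bar     0   8
Baz     2  10
Foo     4  12
Qux     6  14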

💡 Advanced Merging Techniques

Pandas offers advanced merging capabilities, including merging on multiple keys and using different join types.

Example:

import pandas as pd

# First DataFrame
data1 = {
    'EmployeeID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'DeptID': [10, 20, 10, 30]
}
df1 = pd.DataFrame(data1)

# Second DataFrame
data2 = {
    'DeptID': [10, 20, 30, 40],
    'Department': ['HR', 'Engineering', 'Marketing', 'Finance']
}
df2 = pd.DataFrame(data2)

# Merge on 'DeptID'
merged_df = pd.merge(df1, df2, on='DeptID', how='left')
print("Merged DataFrame:")
print(merged_df)

Output:

Merged DataFrame:
   EmployeeID     Name  DeptID   Department
0           1    Alice      10           HR
1           2      Bob      20  Engineering
2           3  Charlie      10           HR
3           4    David      30    Marketing
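
The example above merges on a single key; merging on multiple keys simply takes a list. A short sketch with hypothetical quarterly data:

import pandas as pd

left = pd.DataFrame({'Year': [2023, 2023, 2024],
                     'Quarter': [1, 2, 1],
                     'Revenue': [100, 120, 130]})
right = pd.DataFrame({'Year': [2023, 2024, 2024],
                      'Quarter': [2, 1, 2],
                      'Costs': [80, 90, 95]})

# Merge on the combination of 'Year' and 'Quarter'
combined = pd.merge(left, right, on=['Year', 'Quarter'], how='inner')
print(combined)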

💡 Pivoting Data

Pivoting allows you to reshape data based on column values, making it easier to analyze.

Example:

import pandas as pd

# Sales DataFrame
data = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Product': ['Laptop', 'Smartphone', 'Laptop', 'Tablet', 'Tablet', 'Laptop', 'Smartphone', 'Laptop'],
    'Sales': [5000, 7000, 6000, 4000, 3000, 8000, 6500, 7200]
}
df = pd.DataFrame(data)
print("Original Sales DataFrame:")
print(df)

# Create pivot table
pivot = df.pivot_table(values='Sales', index='Region', columns='Product', aggfunc='sum', fill_value=0)
print("\nPivot Table - Total Sales by Region and Product:")
print(pivot)

Output:

Original Sales DataFrame:
  Region     Product  Sales
0  North      Laptop   5000
1  South  Smartphone   7000
2   East      Laptop   6000
3   West      Tablet   4000
4  North      Tablet   3000
5  South      Laptop   8000
6   East  Smartphone   6500
7   West      Laptop   7200

Pivot Table - Total Sales by Region and Product:
Product  Laptop  Smartphone  Tablet
Region                             
East       6000        6500       0
North      5000           0    3000
South      8000        7000       0
West       7200           0    4000

💡 Advanced Datetime Operations

Perform complex datetime manipulations, such as shifting dates, calculating differences, and resampling.

Example:

import pandas as pd

# Creating a DataFrame with datetime
data = {
    'Date': pd.date_range(start='2024-01-01', periods=5, freq='D'),
    'Value': [100, 200, 150, 300, 250]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Shifting dates by 2 days
df['Date_Shifted'] = df['Date'] + pd.Timedelta(days=2)
print("\nDataFrame after shifting dates by 2 days:")
print(df)

# Calculating difference between shifted and original dates
df['Date_Diff'] = df['Date_Shifted'] - df['Date']
print("\nDataFrame with date differences:")
print(df)

# Resampling the 'Value' column monthly and calculating the sum
df.set_index('Date', inplace=True)
monthly_sum = df.resample('M')[['Value']].sum()
print("\nMonthly Sum of Values:")
print(monthly_sum)

Output:

Original DataFrame:
        Date  Value
0 2024-01-01    100
1 2024-01-02    200
2 2024-01-03    150
3 2024-01-04    300
4 2024-01-05    250

DataFrame after shifting dates by 2 days:
        Date  Value Date_Shifted
0 2024-01-01    100    2024-01-03
1 2024-01-02    200    2024-01-04
2 2024-01-03    150    2024-01-05
3 2024-01-04    300    2024-01-06
4 2024-01-05    250    2024-01-07

DataFrame with date differences:
        Date  Value Date_Shifted Date_Diff
0 2024-01-01    100    2024-01-03    2 days
1 2024-01-02    200    2024-01-04    2 days
2 2024-01-03    150    2024-01-05    2 days
3 2024-01-04    300    2024-01-06    2 days
4 2024-01-05    250    2024-01-07    2 days

Monthly Sum of Values:
            Value
Date             
2024-01-31    1000

10. 💡 Real-World Applications

💡 Data Analysis

Pandas is a foundational tool for data analysis, enabling efficient data manipulation and computation.

import pandas as pd

# Load data from a CSV file
data = pd.read_csv('data.csv')

# Display the first few rows
print(data.head())

# Compute statistics
mean = data.mean()
median = data.median()
std_dev = data.std()

print("\nMean of each column:")
print(mean)
print("\nMedian of each column:")
print(median)
print("\nStandard Deviation of each column:")
print(std_dev)

Output:

   Column1  Column2  Column3
0        1        4        7
1        2        5        8
2        3        6        9
3        4        7       10
4        5        8       11

Mean of each column:
Column1    3.0
Column2    6.0
Column3    9.0
dtype: float64

Median of each column:
Column1    3.0
Column2    6.0
Column3    9.0
dtype: float64

Standard Deviation of each column:
Column1    1.581139
Column2    1.581139
Column3    1.581139
dtype: float64

(Note: Replace 'data.csv' with your actual data file path.)

💡 Image Processing

Pandas can be used in conjunction with libraries like OpenCV or PIL for image processing tasks, such as analyzing pixel data or metadata.

import pandas as pd
from PIL import Image
import numpy as np

# Load an image
image = Image.open('image.jpg').convert('RGB')
arr = np.array(image)

# Create a DataFrame with pixel data
pixels = pd.DataFrame(arr.reshape(-1, 3), columns=['Red', 'Green', 'Blue'])
print("Pixel DataFrame:")
print(pixels.head())

# Analyze color distribution
color_counts = pixels.groupby(['Red', 'Green', 'Blue']).size().reset_index(name='Count')
print("\nColor Distribution:")
print(color_counts.head())

Output:

Pixel DataFrame:
   Red  Green  Blue
0   34     67    89
1   34     67    89
2   34     67    89
3   35     68    90
4   35     68    90

Color Distribution:
   Red  Green  Blue  Count
0   34     67    89      3
1   35     68    90      2
...

(Note: Replace 'image.jpg' with your actual image file path.)

💡 Financial Modeling

Perform complex financial calculations and simulations using Pandas' data manipulation capabilities.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a DataFrame with stock prices
dates = pd.date_range(start='2024-01-01', periods=100, freq='D')
prices = np.random.lognormal(mean=0.001, sigma=0.02, size=100).cumprod() * 100
df = pd.DataFrame({'Date': dates, 'Price': prices})
df.set_index('Date', inplace=True)
print("Stock Prices DataFrame:")
print(df.head())

# Calculate daily returns
df['Daily_Return'] = df['Price'].pct_change()
print("\nDataFrame with Daily Returns:")
print(df.head())

# Calculate moving average
df['Moving_Avg_10'] = df['Price'].rolling(window=10).mean()
print("\nDataFrame with 10-Day Moving Average:")
print(df.head(15))

# Plotting the moving average
df[['Price', 'Moving_Avg_10']].plot(figsize=(10, 5))
plt.title('Stock Prices and 10-Day Moving Average')
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()

Output:

Stock Prices DataFrame:
                Price
Date                 
2024-01-01  101.234567
2024-01-02  102.345678
2024-01-03  103.456789
2024-01-04  104.567890
2024-01-05  105.678901

DataFrame with Daily Returns:
                Price  Daily_Return
Date                                 
2024-01-01  101.234567           NaN
2024-01-02  102.345678      0.010000
2024-01-03  103.456789      0.010000
2024-01-04  104.567890      0.010000
2024-01-05  105.678901      0.010000

DataFrame with 10-Day Moving Average:
                Price  Daily_Return  Moving_Avg_10
Date                                               
2024-01-01  101.234567           NaN            NaN
2024-01-02  102.345678      0.010000            NaN
2024-01-03  103.456789      0.010000            NaN
2024-01-04  104.567890      0.010000            NaN
2024-01-05  105.678901      0.010000            NaN
2024-01-06  106.789012      0.010000            NaN
2024-01-07  107.890123      0.010000            NaN
2024-01-08  108.901234      0.010000            NaN
2024-01-09  109.012345      0.010000            NaN
2024-01-10  110.123456      0.010000      105.000000
2024-01-11  111.234567      0.010000      106.111111
2024-01-12  112.345678      0.010000      107.222222
2024-01-13  113.456789      0.010000      108.333333
2024-01-14  114.567890      0.010000      109.444444
2024-01-15  115.678901      0.010000      110.555556

The plot displaying stock prices and the 10-day moving average will be shown. (Note: The printed values above are illustrative; your numbers will differ because the prices are randomly generated.)


11. 💡 Machine Learning Integration

💡 Feature Engineering with Pandas

Pandas is essential for creating and transforming features needed for machine learning models.

import pandas as pd
import numpy as np

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 40, 28],
    'Salary': [70000, 80000, 90000, 100000, 75000],
    'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Creating new features
df['Age_Squared'] = df['Age'] ** 2
df['Salary_Log'] = np.log(df['Salary'])
df['Is_Engineering'] = df['Department'] == 'Engineering'
print("\nDataFrame with New Features:")
print(df)

Output:

Original DataFrame:
      Name  Age  Salary   Department
0    Alice   25   70000           HR
1      Bob   30   80000  Engineering
2  Charlie   35   90000           HR
3    David   40  100000  Engineering
4      Eve   28   75000    Marketing

DataFrame with New Features:
      Name  Age  Salary   Department  Age_Squared  Salary_Log  Is_Engineering
0    Alice   25   70000           HR          625    11.156252            False
1      Bob   30   80000  Engineering          900    11.289781             True
2  Charlie   35   90000           HR         1225    11.407569            False
3    David   40  100000  Engineering         1600    11.512925             True
4      Eve   28   75000    Marketing          784    11.223251            False
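
One-hot encoding categorical columns is another feature-engineering staple, and Pandas has it built in (a sketch on the df above):

# One-hot encode 'Department' into indicator columns
dummies = pd.get_dummies(df['Department'], prefix='Dept')
df = pd.concat([df, dummies], axis=1)
print(df.head())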

💡 Data Preprocessing with Pandas

Pandas plays a crucial role in preprocessing steps such as handling missing values, encoding categorical variables, and scaling features.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Sample DataFrame with missing values and categorical data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, np.nan, 40, 28],
    'Salary': [70000, 80000, 90000, np.nan, 75000],
    'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Handling missing values
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print("\nDataFrame after handling missing values:")
print(df)

# Encoding categorical variables
le = LabelEncoder()
df['Department_Encoded'] = le.fit_transform(df['Department'])
print("\nDataFrame after encoding 'Department':")
print(df)

# Feature scaling
scaler = StandardScaler()
df[['Age_Scaled', 'Salary_Scaled']] = scaler.fit_transform(df[['Age', 'Salary']])
print("\nDataFrame after feature scaling:")
print(df)

Output:

Original DataFrame:
      Name   Age   Salary   Department
0    Alice  25.0  70000.0           HR
1      Bob  30.0  80000.0  Engineering
2  Charlie   NaN  90000.0           HR
3    David  40.0      NaN  Engineering
4      Eve  28.0  75000.0    Marketing

DataFrame after handling missing values:
      Name   Age   Salary   Department
0    Alice  25.0  70000.0           HR
1      Bob  30.0  80000.0  Engineering
2  Charlie  29.0  90000.0           HR
3    David  40.0  78750.0  Engineering
4      Eve  28.0  75000.0    Marketing

DataFrame after encoding 'Department':
      Name   Age   Salary   Department  Department_Encoded
0    Alice  25.0  70000.0           HR                   1
1      Bob  30.0  80000.0  Engineering                   0
2  Charlie  29.0  90000.0           HR                   1
3    David  40.0  78750.0  Engineering                   0
4      Eve  28.0  75000.0    Marketing                   2

DataFrame after feature scaling:
      Name   Age   Salary   Department  Department_Encoded  Age_Scaled  Salary_Scaled
0    Alice  25.0  70000.0           HR                   1   -1.062301      -1.322876
1      Bob  30.0  80000.0  Engineering                   0   -0.078689       0.188982
2  Charlie  29.0  90000.0           HR                   1   -0.275411       1.700840
3    David  40.0  78750.0  Engineering                   0    1.888535       0.000000
4      Eve  28.0  75000.0    Marketing                   2   -0.472134      -0.566947

💡 Handling High-Dimensional Data

Efficiently manage and manipulate high-dimensional datasets using Pandas' powerful indexing and reshaping capabilities.

import pandas as pd
import numpy as np

# Creating a high-dimensional DataFrame
arrays = [
    ['A', 'A', 'B', 'B', 'C', 'C'],
    ['One', 'Two', 'One', 'Two', 'One', 'Two']
]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['Category', 'Subcategory'])
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50, 60]}, index=index)
print("Original MultiIndex DataFrame:")
print(df)

# Unstacking the DataFrame
unstacked = df.unstack(level='Subcategory')
print("\nUnstacked DataFrame:")
print(unstacked)

# Stacking the DataFrame back
stacked = unstacked.stack(level='Subcategory')
print("\nStacked DataFrame:")
print(stacked)

Output:

Original MultiIndex DataFrame:
                      Values
Category Subcategory        
A        One              10
         Two              20
B        One              30
         Two              40
C        One              50
         Two              60

Unstacked DataFrame:
           Values          
Subcategory    One Two
Category                
A               10  20
B               30  40
C               50  60

Stacked DataFrame:
                      Values
Category Subcategory        
A        One              10
         Two              20
B        One              30
         Two              40
C        One              50
         Two              60


12. 💡 Additional Tips

💡 Optimize Memory Usage

Downcasting Numeric Columns: Use pd.to_numeric() with the downcast parameter to convert larger numeric types to smaller ones where possible.

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [100000, 200000, 300000]
})

df['A'] = pd.to_numeric(df['A'], downcast='unsigned')
df['B'] = pd.to_numeric(df['B'], downcast='integer')
print(df.dtypes)

Output:

A     uint8
B     int32
dtype: object

Choose Appropriate Data Types: Selecting the right data types can significantly reduce memory consumption.

import pandas as pd
import numpy as np

# Using smaller integer types
df = pd.DataFrame({
    'A': pd.Series([1, 2, 3], dtype='int8'),
    'B': pd.Series([100, 200, 300], dtype='int16')
})
print(df.dtypes)

Output:

A      int8
B     int16
dtype: object
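
For repetitive string columns, the category dtype is another large memory win (a sketch with a hypothetical label column):

import pandas as pd

df = pd.DataFrame({'City': ['NY', 'LA', 'NY', 'NY', 'LA'] * 1000})

# Categorical storage keeps each unique label once plus small integer codes
df['City'] = df['City'].astype('category')
print(df['City'].dtype)
print(df.memory_usage(deep=True))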

💡 Utilize Advanced Indexing

Boolean Masking: Use boolean arrays to filter DataFrame rows.

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [5, 4, 3, 2, 1]
})
mask = df['A'] > 2
print(df[mask])

Output:

   A  B
2  3  3
3  4  2
4  5  1

Fancy Indexing: Use lists or arrays of indices to access multiple elements at once.

import pandas as pd

df = pd.DataFrame({
    'A': [10, 20, 30, 40, 50],
    'B': [5, 4, 3, 2, 1]
})
indices = [0, 2, 4]
print(df.loc[indices])

Output:

    A  B
0  10  5
2  30  3
4  50  1

💡 Understand Memory Layout

Column-major Under the Hood: A homogeneous DataFrame stores its data column-wise, so the NumPy array returned by .values is typically Fortran-contiguous (column-major) rather than C-contiguous. Knowing the layout helps when handing data to NumPy routines that prefer a specific order.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': np.random.rand(1000),
    'B': np.random.rand(1000)
})

# A homogeneous DataFrame's .values view is typically Fortran-contiguous
print(df.values.flags['F_CONTIGUOUS'])  # Output: True

# Convert to C-order (row-major) if an algorithm requires it
arr_c = np.ascontiguousarray(df.values)
print(arr_c.flags['C_CONTIGUOUS'])  # Output: True

💡 Utilize Advanced Functions

Polynomial Operations: Use Pandas in conjunction with NumPy for polynomial computations.

import pandas as pd
import numpy as np

# Creating a DataFrame
df = pd.DataFrame({'x': [1, 2, 3, 4, 5]})
df['x_squared'] = df['x'] ** 2
df['x_cubed'] = df['x'] ** 3
print("DataFrame with Polynomial Features:")
print(df)

Output:

   x  x_squared  x_cubed
0  1          1        1
1  2          4        8
2  3          9       27
3  4         16       64
4  5         25      125

Fourier Transforms: Perform Fourier transforms using Pandas and NumPy.

import pandas as pd
import numpy as np

# Creating a DataFrame with a reproducible random signal
np.random.seed(0)
df = pd.DataFrame({'Signal': np.random.rand(8)})
print("Original Signal:")
print(df)

# Compute FFT
df['FFT'] = np.fft.fft(df['Signal'])
print("\nSignal after FFT:")
print(df)

Output:

Original Signal:
     Signal
0  0.548814
1  0.715189
2  0.602763
3  0.544883
4  0.423655
5  0.645894
6  0.437587
7  0.891773

Signal after FFT:
     Signal                  FFT
0  0.548814   4.810559+0.000000j
1  0.715189   0.419446+0.031123j
2  0.602763  -0.067882+0.075573j
3  0.544883  -0.169130+0.361475j
4  0.423655  -0.784921+0.000000j
5  0.645894  -0.169130-0.361475j
6  0.437587  -0.067882-0.075573j
7  0.891773   0.419446-0.031123j

Linear Algebra: Use Pandas in conjunction with NumPy's linear algebra functions for complex computations.

import pandas as pd
import numpy as np

# Creating a DataFrame
df = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
})

# Convert to NumPy array
matrix = df.values

# Matrix inversion
inv_matrix = np.linalg.inv(matrix)
print("Inverse Matrix:")
print(inv_matrix)

Output:

Inverse Matrix:
[[-2.   1.5]
 [ 1.  -0.5]]
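
Inverting a matrix explicitly is rarely needed in practice; solving a linear system directly is faster and numerically safer. A small sketch reusing the frame above (the right-hand side b is an arbitrary illustrative choice):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
})
b = np.array([1, 1])

# Solve matrix @ x = b without forming the inverse
x = np.linalg.solve(df.values, b)
print(x)  # [-0.5  0.5]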

13. 💡 Advanced Tips

💡 Utilize Structured Arrays

A DataFrame plays the role of a NumPy structured array: each column is a named field with its own data type, and fields can be accessed individually.

import pandas as pd

# Columns of different types, analogous to the fields of a structured array
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Weight': [55.0, 85.5, 68.2]
}
df = pd.DataFrame(data)
print(df)

# Access individual columns
print("\nNames:")
print(df['Name'])

print("\nAges:")
print(df['Age'])

print("\nWeights:")
print(df['Weight'])

Output:

      Name  Age  Weight
0    Alice   25    55.0
1      Bob   30    85.5
2  Charlie   35    68.2

Names:
0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

Ages:
0    25
1    30
2    35
Name: Age, dtype: int64

Weights:
0    55.0
1    85.5
2    68.2
Name: Weight, dtype: float64
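
If a library expects an actual NumPy structured array, to_records() converts the DataFrame directly:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Weight': [55.0, 85.5, 68.2]
})

# Each row becomes a record with named, typed fields
records = df.to_records(index=False)
print(records.dtype.names)  # ('Name', 'Age', 'Weight')
print(records[0])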

💡 Use Memory-Mapped Files

When reading large files, pass memory_map=True to pd.read_csv() so the parser works from a memory-mapped file, and combine it with explicit dtypes to shrink the result. Note that the loaded DataFrame still lives in RAM; for data that genuinely does not fit, read in chunks, as sketched after this example.

import pandas as pd

# Read a large CSV file with memory optimization
df = pd.read_csv('large_dataset.csv', dtype={'Column1': 'int32', 'Column2': 'float32'}, memory_map=True)
print(df.info())

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
 #   Column   Dtype  
---  ------   -----  
 0   Column1  int32  
 1   Column2  float32
dtypes: float32(1), int32(1)
memory usage: 7.6 MB
None

(Note: Replace 'large_dataset.csv' with your actual large data file path.)
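
If the data genuinely does not fit into memory, stream it in chunks and aggregate incrementally; a sketch against the same hypothetical file:

import pandas as pd

total = 0.0
# Each iteration yields a DataFrame of up to 100,000 rows
for chunk in pd.read_csv('large_dataset.csv', chunksize=100_000):
    total += chunk['Column2'].sum()
print(f"Sum of Column2: {total}")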

💡 Explore Advanced Indexing Techniques

Slicing with Step: Use steps in slicing to access elements at regular intervals.

import pandas as pd

df = pd.DataFrame({
    'A': range(10),
    'B': range(10, 20)
})
sliced = df.iloc[::2]
print("Sliced DataFrame with Step 2:")
print(sliced)

Output:

Sliced DataFrame with Step 2:
   A   B
0  0  10
2  2  12
4  4  14
6  6  16
8  8  18

Indexing with Multiple Arrays: Use paired row and column indexers to pick out individual elements. (DataFrame.lookup was removed in pandas 2.0, so index the underlying NumPy array instead.)

import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6]
})
row_indices = [0, 1, 2]
col_indices = ['A', 'B', 'A']
# Translate column labels to positions, then fancy-index the array
selected = df.to_numpy()[row_indices, df.columns.get_indexer(col_indices)]
print("Selected Elements:", selected)

Output:

Selected Elements: [1 5 3]

💡 Utilize Masked Arrays

Pandas' mask() method plays the role of a NumPy masked array: it hides invalid entries so they can be treated as missing data.

import pandas as pd
import numpy as np

# Creating a DataFrame with invalid data
data = {
    'Temperature': [22.5, 23.0, -999, 24.0, 25.0],
    'Humidity': [55, 60, 65, -999, 70]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Masking invalid data (-999)
masked_df = df.mask((df == -999))
print("\nMasked DataFrame:")
print(masked_df)

# Filling masked data with column mean
filled_df = masked_df.fillna(df.replace(-999, np.nan).mean())
print("\nDataFrame after filling masked data with column mean:")
print(filled_df)

Output:

Original DataFrame:
   Temperature  Humidity
0         22.5        55
1         23.0        60
2       -999.0        65
3         24.0      -999
4         25.0        70

Masked DataFrame:
   Temperature  Humidity
0         22.5      55.0
1         23.0      60.0
2          NaN      65.0
3         24.0       NaN
4         25.0      70.0

DataFrame after filling masked data with column mean:
   Temperature  Humidity
0       22.500      55.0
1       23.000      60.0
2       23.625      65.0
3       24.000      62.5
4       25.000      70.0
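
Filling with the column mean is one choice; for ordered data such as sensor readings, interpolating from neighboring values is often more faithful:

import pandas as pd
import numpy as np

temps = pd.Series([22.5, 23.0, np.nan, 24.0, 25.0])

# Linear interpolation estimates the gap from its neighbors
print(temps.interpolate())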

14. 💡 Machine Learning Integration

💡 Data Preprocessing with Pandas

Pandas is extensively used for data preprocessing steps such as handling missing values, encoding categorical variables, and scaling features.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Sample DataFrame with missing values and categorical data
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, np.nan, 40, 28],
    'Salary': [70000, 80000, 90000, np.nan, 75000],
    'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing']
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Handling missing values (assign back instead of chaining inplace=True,
# which is deprecated on a column selection)
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())
print("\nDataFrame after handling missing values:")
print(df)

# Encoding categorical variables
le = LabelEncoder()
df['Department_Encoded'] = le.fit_transform(df['Department'])
print("\nDataFrame after encoding 'Department':")
print(df)

# Feature scaling
scaler = StandardScaler()
df[['Age_Scaled', 'Salary_Scaled']] = scaler.fit_transform(df[['Age', 'Salary']])
print("\nDataFrame after feature scaling:")
print(df)

Output:

Original DataFrame:
      Name   Age   Salary   Department
0    Alice  25.0  70000.0           HR
1      Bob  30.0  80000.0  Engineering
2  Charlie   NaN  90000.0           HR
3    David  40.0      NaN  Engineering
4      Eve  28.0  75000.0    Marketing

DataFrame after handling missing values:
      Name   Age   Salary   Department
0    Alice  25.0  70000.0           HR
1      Bob  30.0  80000.0  Engineering
2  Charlie  29.0  90000.0           HR
3    David  40.0  78750.0  Engineering
4      Eve  28.0  75000.0    Marketing

DataFrame after encoding 'Department':
      Name   Age   Salary   Department  Department_Encoded
0    Alice  25.0  70000.0           HR                   1
1      Bob  30.0  80000.0  Engineering                   0
2  Charlie  29.0  90000.0           HR                   1
3    David  40.0  78750.0  Engineering                   0
4      Eve  28.0  75000.0    Marketing                   2

DataFrame after feature scaling:
      Name   Age   Salary   Department  Department_Encoded  Age_Scaled  Salary_Scaled
0    Alice  25.0  70000.0           HR                   1   -1.062301      -1.322876
1      Bob  30.0  80000.0  Engineering                   0   -0.078689       0.188982
2  Charlie  29.0  90000.0           HR                   1   -0.275411       1.700840
3    David  40.0  78750.0  Engineering                   0    1.888535       0.000000
4      Eve  28.0  75000.0    Marketing                   2   -0.472134      -0.566947
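
For preprocessing that must be reapplied identically to new data at prediction time, wrapping these steps in a scikit-learn pipeline is usually safer than mutating the DataFrame by hand. A minimal sketch over the same columns (the particular steps are one reasonable choice, not the only one):

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric = Pipeline([
    ('impute', SimpleImputer(strategy='median')),  # fill NaNs per column
    ('scale', StandardScaler())
])
preprocess = ColumnTransformer([
    ('num', numeric, ['Age', 'Salary']),
    ('cat', OneHotEncoder(), ['Department'])
])

df = pd.DataFrame({
    'Age': [25, 30, np.nan, 40, 28],
    'Salary': [70000, 80000, 90000, np.nan, 75000],
    'Department': ['HR', 'Engineering', 'HR', 'Engineering', 'Marketing']
})
X = preprocess.fit_transform(df)
print(X.shape)  # (5, 5): two scaled numerics plus three one-hot columns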

💡 Feature Engineering

Creating new features by performing mathematical operations on existing features can improve the performance of machine learning models.

import pandas as pd
import numpy as np

# Sample DataFrame
data = {
    'Height_cm': [150, 160, 170, 180, 190],
    'Weight_kg': [50, 60, 70, 80, 90]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)

# Calculate BMI
df['BMI'] = df['Weight_kg'] / (df['Height_cm'] / 100) ** 2
print("\nDataFrame with BMI:")
print(df)

# Create interaction feature
df['Height_Weight'] = df['Height_cm'] * df['Weight_kg']
print("\nDataFrame with Interaction Feature:")
print(df)

Output:

Original DataFrame:
   Height_cm  Weight_kg
0        150         50
1        160         60
2        170         70
3        180         80
4        190         90

DataFrame with BMI:
   Height_cm  Weight_kg        BMI
0        150         50  22.222222
1        160         60  23.437500
2        170         70  24.221453
3        180         80  24.691358
4        190         90  24.930748

DataFrame with Interaction Feature:
   Height_cm  Weight_kg        BMI  Height_Weight
0        150         50  22.222222            7500
1        160         60  23.437500            9600
2        170         70  24.221453           11900
3        180         80  24.691358           14400
4        190         90  24.930748           17100
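
Binning a continuous feature into categories is another common transformation; a short sketch (the bin edges are arbitrary illustrative choices):

import pandas as pd

df = pd.DataFrame({'Height_cm': [150, 160, 170, 180, 190]})

# Discretize heights into labeled bins, then one-hot encode the result
df['Height_Band'] = pd.cut(df['Height_cm'], bins=[0, 160, 180, 250],
                           labels=['short', 'medium', 'tall'])
dummies = pd.get_dummies(df['Height_Band'], prefix='Band')
print(pd.concat([df, dummies], axis=1))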

💡 Handling High-Dimensional Data

Efficiently manage and manipulate high-dimensional datasets using Pandas' powerful indexing and reshaping capabilities.

import pandas as pd
import numpy as np

# Creating a MultiIndex DataFrame
arrays = [
    ['A', 'A', 'B', 'B', 'C', 'C'],
    ['One', 'Two', 'One', 'Two', 'One', 'Two']
]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['Category', 'Subcategory'])
df = pd.DataFrame({'Values': [10, 20, 30, 40, 50, 60]}, index=index)
print("Original MultiIndex DataFrame:")
print(df)

# Unstacking the DataFrame
unstacked = df.unstack(level='Subcategory')
print("\nUnstacked DataFrame:")
print(unstacked)

# Stacking the DataFrame back
stacked = unstacked.stack(level='Subcategory')
print("\nStacked DataFrame:")
print(stacked)

Output:

Original MultiIndex DataFrame:
                      Values
Category Subcategory        
A        One             10
         Two             20
B        One             30
         Two             40
C        One             50
         Two             60

Unstacked DataFrame:
            Values    
Subcategory    One Two
Category              
A               10  20
B               30  40
C               50  60

Stacked DataFrame:
                      Values
Category Subcategory        
A        One             10
         Two             20
B        One             30
         Two             40
C        One             50
         Two             60

15. 💡 Additional Memory Tips

💡 Avoid Unnecessary Copies

Be cautious with operations that create copies of DataFrames. Use in-place operations when possible.

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
df += 10  # In-place addition
print(df)

Output:

   A
0 11
1 12
2 13

16. 💡 Real-World Applications

💡 Data Analysis

Pandas is a foundational tool for data analysis, enabling efficient data manipulation and computation.

import pandas as pd

# Load data from a CSV file
data = pd.read_csv('data.csv')

# Display the first few rows
print(data.head())

# Compute statistics
mean = data.mean()
median = data.median()
std_dev = data.std()

print("\nMean of each column:")
print(mean)
print("\nMedian of each column:")
print(median)
print("\nStandard Deviation of each column:")
print(std_dev)

Output:

   Column1  Column2  Column3
0        1        4        7
1        2        5        8
2        3        6        9
3        4        7       10
4        5        8       11

Mean of each column:
Column1    3.0
Column2    6.0
Column3    9.0
dtype: float64

Median of each column:
Column1    3.0
Column2    6.0
Column3    9.0
dtype: float64

Standard Deviation of each column:
Column1    1.581139
Column2    1.581139
Column3    1.581139
dtype: float64

(Note: Replace 'data.csv' with your actual data file path.)
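
Pairwise relationships are a natural next step after per-column statistics; a one-line sketch on the same hypothetical file:

import pandas as pd

data = pd.read_csv('data.csv')

# Pearson correlation between every pair of numeric columns
print(data.corr(numeric_only=True))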

💡 Image Processing

Pandas can be used in conjunction with libraries like OpenCV or PIL for image processing tasks, such as analyzing pixel data or metadata.

import pandas as pd
from PIL import Image
import numpy as np

# Load an image
image = Image.open('image.jpg').convert('RGB')
arr = np.array(image)

# Create a DataFrame with pixel data
pixels = pd.DataFrame(arr.reshape(-1, 3), columns=['Red', 'Green', 'Blue'])
print("Pixel DataFrame:")
print(pixels.head())

# Analyze color distribution
color_counts = pixels.groupby(['Red', 'Green', 'Blue']).size().reset_index(name='Count')
print("\nColor Distribution:")
print(color_counts.head())

Output:

Pixel DataFrame:
   Red  Green  Blue
0   34     67    89
1   34     67    89
2   34     67    89
3   35     68    90
4   35     68    90

Color Distribution:
   Red  Green  Blue  Count
0   34     67    89      3
1   35     68    90      2
...

(Note: Replace 'image.jpg' with your actual image file path.)

💡 Financial Modeling

Perform complex financial calculations and simulations using Pandas' data manipulation capabilities.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a DataFrame with stock prices
dates = pd.date_range(start='2024-01-01', periods=100, freq='D')
prices = np.random.lognormal(mean=0.001, sigma=0.02, size=100).cumprod() * 100
df = pd.DataFrame({'Date': dates, 'Price': prices})
df.set_index('Date', inplace=True)
print("Stock Prices DataFrame:")
print(df.head())

# Calculate daily returns
df['Daily_Return'] = df['Price'].pct_change()
print("\nDataFrame with Daily Returns:")
print(df.head())

# Calculate moving average
df['Moving_Avg_10'] = df['Price'].rolling(window=10).mean()
print("\nDataFrame with 10-Day Moving Average:")
print(df.head(15))

# Plotting the moving average
df[['Price', 'Moving_Avg_10']].plot(figsize=(10, 5))
plt.title('Stock Prices and 10-Day Moving Average')
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()

Output:

Stock Prices DataFrame:
                   Price
Date                    
2024-01-01    101.234567
2024-01-02    102.345678
2024-01-03    103.456789
2024-01-04    104.567890
2024-01-05    105.678901

DataFrame with Daily Returns:
                   Price  Daily_Return
Date                                    
2024-01-01    101.234567           NaN
2024-01-02    102.345678      0.010000
2024-01-03    103.456789      0.010000
2024-01-04    104.567890      0.010000
2024-01-05    105.678901      0.010000

DataFrame with 10-Day Moving Average:
                   Price  Daily_Return  Moving_Avg_10
Date                                               
2024-01-01    101.234567           NaN            NaN
2024-01-02    102.345678      0.010000            NaN
2024-01-03    103.456789      0.010000            NaN
2024-01-04    104.567890      0.010000            NaN
2024-01-05    105.678901      0.010000            NaN
2024-01-06    106.789012      0.010000            NaN
2024-01-07    107.890123      0.010000            NaN
2024-01-08    108.901234      0.010000            NaN
2024-01-09    109.012345      0.010000            NaN
2024-01-10    110.123456      0.010000      105.000000
2024-01-11    111.234567      0.010000      106.111111
2024-01-12    112.345678      0.010000      107.222222
2024-01-13    113.456789      0.010000      108.333333
2024-01-14    114.567890      0.010000      109.444444
2024-01-15    115.678901      0.010000      110.555556

The plot of stock prices and the 10-day moving average is then displayed. (Note: Output will vary based on random numbers; the tables above are illustrative.)
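
Once prices are indexed by date, other common metrics follow the same vectorized pattern. A sketch (the simulated prices stand in for real market data):

import pandas as pd
import numpy as np

dates = pd.date_range(start='2024-01-01', periods=100, freq='D')
prices = pd.Series(np.random.lognormal(mean=0.001, sigma=0.02, size=100).cumprod() * 100,
                   index=dates)

cumulative = prices / prices.iloc[0] - 1  # return since the first day
drawdown = prices / prices.cummax() - 1   # drop from the running peak
print(f"Final cumulative return: {cumulative.iloc[-1]:.2%}")
print(f"Maximum drawdown: {drawdown.min():.2%}")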


17. 💡 Performance Optimization

💡 Utilize Efficient Data Types

Choosing the right data type can lead to significant memory and performance improvements.

import pandas as pd

# Using int8 instead of int64
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, dtype='int8')
print(df.dtypes)

Output:

A    int8
B    int8
dtype: object
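
You can verify the savings directly with memory_usage(); a quick check comparing the int8 frame above against the int64 default:

import pandas as pd

df8 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}, dtype='int8')
df64 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})  # int64 by default

# Total bytes per DataFrame (index included)
print(df8.memory_usage().sum(), 'vs', df64.memory_usage().sum())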

💡 Minimize Data Copies

Be aware of operations that create copies of DataFrames and minimize them to save memory and increase speed.

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]})
df += 10  # Augmented assignment updates df in place instead of binding a new copy
print(df)

Output:

    A
0  11
1  12
2  13

💡 Leverage Just-In-Time Compilation

Use libraries like Numba to compile Pandas operations into optimized machine code.

import pandas as pd
import numpy as np
from numba import njit

@njit
def compute_sum(arr):
    total = 0.0
    for i in range(arr.size):
        total += arr[i]
    return total

df = pd.DataFrame({'A': np.random.rand(1000000)})
total_sum = compute_sum(df['A'].values)
print(f"Total Sum: {total_sum}")

Output:

Total Sum: 500000.123456

(Note: Output will vary based on random numbers.)

💡 Profile Your Code

Identify bottlenecks using profiling tools to optimize critical sections of your code.

import pandas as pd
import cProfile

def compute():
    df = pd.read_csv('large_dataset.csv')
    return df.describe()

cProfile.run('compute()')

Output (illustrative; exact call counts and timings depend on your data and environment):

         4 function calls in 0.035 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.035    0.035    0.035    0.035 <ipython-input-1-...>:1(compute)
        1    0.000    0.000    0.035    0.035 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.print}

(Note: Replace 'large_dataset.csv' with your actual large data file path.)

💡 Use In-Place Operations

Modify DataFrames in place to avoid binding intermediate copies. (In recent Pandas versions, inplace=True often still copies data internally, so treat it as a readability choice as much as an optimization.)

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
df.drop('A', axis=1, inplace=True)
print(df)

Output:

Empty DataFrame
Columns: []
Index: [0, 1, 2, 3, 4]

💡 Optimize Memory Layout

Understanding and optimizing the memory layout can lead to performance gains, especially for large DataFrames.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': np.random.rand(1000),
    'B': np.random.rand(1000)
})

# Check if DataFrame is C-contiguous
print(df.values.flags['C_CONTIGUOUS'])  # Output: True

# Convert to Fortran-order if needed
df_f = pd.DataFrame(np.asfortranarray(df.values), columns=df.columns)
print(df_f.values.flags['F_CONTIGUOUS'])  # Output: True

💡 Utilize Parallel Processing

Leverage parallel processing capabilities with libraries like joblib to perform operations on DataFrames concurrently.

import pandas as pd
from joblib import Parallel, delayed

def process_row(row):
    # Stand-in for an expensive per-row computation
    return row['A'] * row['B']

df = pd.DataFrame({
    'A': range(1000000),
    'B': range(1000000, 2000000)
})

# Parallel computation: worthwhile only when the per-row work is expensive;
# for simple arithmetic like this, the vectorized df['A'] * df['B'] is far faster
results = Parallel(n_jobs=-1)(delayed(process_row)(row) for _, row in df.iterrows())
df['C'] = results
print(df.head())

Output:

   A        B        C
0  0  1000000        0
1  1  1000001  1000001
2  2  1000002  2000004
3  3  1000003  3000009
4  4  1000004  4000016
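
Dispatching one task per row, as above, carries heavy scheduling overhead. A more practical pattern is to split the DataFrame into a handful of large chunks and parallelize over those; a sketch, where process_chunk stands in for your real per-chunk logic:

import pandas as pd
from joblib import Parallel, delayed

def process_chunk(chunk):
    # Hypothetical expensive computation, vectorized within the chunk
    return chunk['A'] * chunk['B']

df = pd.DataFrame({
    'A': range(1000000),
    'B': range(1000000, 2000000)
})

# One task per chunk keeps joblib's overhead negligible
chunk_size = 250000
chunks = [df.iloc[i:i + chunk_size] for i in range(0, len(df), chunk_size)]
parts = Parallel(n_jobs=-1)(delayed(process_chunk)(chunk) for chunk in chunks)
df['C'] = pd.concat(parts)  # indices are preserved, so alignment is automatic
print(df.head())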

18. 💡 Conclusion

Pandas is an indispensable tool in the Python ecosystem, providing the foundational structures and functions required for efficient data manipulation and analysis. Its seamless integration with other data science libraries and its performance optimizations make it a preferred choice for data scientists, machine learning engineers, and analysts. By mastering Pandas, you're well-equipped to handle complex data preprocessing, perform insightful analyses, and prepare data for robust machine learning models. Continue exploring its vast capabilities and integrate Pandas into your daily data tasks to unlock new levels of efficiency and productivity.