End-to-End Data Analytics Project

Python + SQL Data Analysis Pipeline

Project Overview

This project demonstrates an end-to-end data analytics pipeline using Python for data processing and SQL Server for data analysis. The workflow involves:

  • Data acquisition from Kaggle
  • Cleaning and preprocessing with Python
  • Loading data into SQL Server
  • Performing advanced SQL queries to extract business insights

Data Pipeline Diagram

Methodology

1. Data Collection

Downloaded the retail orders dataset via the Kaggle API; it contains sales data spanning multiple years.

2. Data Cleaning

Used Python (pandas) to handle missing values, standardize column names and data types, and derive new fields for analysis.

3. Data Loading

Loaded the processed data into a SQL Server database for efficient querying and analysis.

4. SQL Analysis

Wrote advanced SQL queries to answer specific business questions and extract insights.

Key Findings

Top 10 Revenue-Generating Products

Identified the products that contributed most significantly to overall revenue using SQL aggregation and sorting.


-- Find top 10 highest revenue generating products
SELECT TOP 10 product_id, SUM(sale_price) AS sales
FROM df_orders
GROUP BY product_id
ORDER BY sales DESC;
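
As a sanity check, the same top-N aggregation can be reproduced in pandas before the data ever reaches SQL Server. The mini DataFrame below is hypothetical stand-in data, not the Kaggle dataset:

```python
import pandas as pd

# Hypothetical stand-in for df_orders (column names match the project)
df_orders = pd.DataFrame({
    "product_id": ["P1", "P2", "P1", "P3", "P2", "P1"],
    "sale_price": [100.0, 50.0, 200.0, 75.0, 25.0, 10.0],
})

# Equivalent of: SELECT TOP 10 product_id, SUM(sale_price) AS sales ... ORDER BY sales DESC
top10 = (df_orders.groupby("product_id")["sale_price"]
         .sum()
         .nlargest(10)          # TOP 10 + ORDER BY ... DESC in one step
         .rename("sales")
         .reset_index())
```

Running the pandas version against the raw CSV and the SQL version against the loaded table is a cheap way to confirm the load step did not drop or mangle rows.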
            

Regional Top Sellers

Analyzed the top 5 highest-selling products across different regions using window functions.


-- Find top 5 highest selling products in each region
WITH cte AS (
    SELECT region, product_id, SUM(sale_price) AS sales
    FROM df_orders
    GROUP BY region, product_id
)
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER(PARTITION BY region ORDER BY sales DESC) AS rn
    FROM cte
) A
WHERE rn <= 5;
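
The ROW_NUMBER window pattern has a pandas analogue via a grouped rank; a sketch on made-up data (not the real dataset):

```python
import pandas as pd

# Invented rows standing in for df_orders
df_orders = pd.DataFrame({
    "region":     ["East", "East", "East", "West", "West"],
    "product_id": ["P1",   "P2",   "P3",   "P1",   "P2"],
    "sale_price": [100.0,  300.0,  50.0,   80.0,   20.0],
})

# GROUP BY region, product_id
sales = (df_orders.groupby(["region", "product_id"], as_index=False)["sale_price"]
         .sum()
         .rename(columns={"sale_price": "sales"}))

# ROW_NUMBER() OVER (PARTITION BY region ORDER BY sales DESC)
sales["rn"] = (sales.groupby("region")["sales"]
               .rank(method="first", ascending=False)
               .astype(int))

top5 = sales[sales["rn"] <= 5].sort_values(["region", "rn"])
```

`rank(method="first")` mirrors ROW_NUMBER (ties broken by row order) rather than RANK, which would assign equal ranks to ties.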
            

Month-over-Month Growth (2022-2023)

Compared sales for each month of 2022 against the same month of 2023 using conditional aggregation.


-- Find month over month growth comparison for 2022 and 2023 sales
WITH cte AS (
    SELECT YEAR(order_date) AS order_year,
           MONTH(order_date) AS order_month,
           SUM(sale_price) AS sales
    FROM df_orders
    GROUP BY YEAR(order_date), MONTH(order_date)
)
SELECT order_month
     , SUM(CASE WHEN order_year = 2022 THEN sales ELSE 0 END) AS sales_2022
     , SUM(CASE WHEN order_year = 2023 THEN sales ELSE 0 END) AS sales_2023
FROM cte
GROUP BY order_month
ORDER BY order_month;
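
The conditional-aggregation pivot can be mirrored in pandas with `pivot_table`; the few rows below are illustrative only:

```python
import pandas as pd

# Illustrative stand-in for df_orders
df_orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2022-01-05", "2022-02-10",
                                  "2023-01-20", "2023-02-15"]),
    "sale_price": [100.0, 200.0, 150.0, 260.0],
})

# One row per month, one column per year: the same shape the CASE WHEN query produces
monthly = (df_orders
           .assign(order_year=df_orders["order_date"].dt.year,
                   order_month=df_orders["order_date"].dt.month)
           .pivot_table(index="order_month", columns="order_year",
                        values="sale_price", aggfunc="sum", fill_value=0))
```

`fill_value=0` plays the role of `ELSE 0` in the SQL, so months present in only one year still compare cleanly.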
            

Monthly Category Performance

Determined which months had the highest sales for each product category using window functions.


-- For each category which month had highest sales
WITH cte AS (
    SELECT category, FORMAT(order_date, 'yyyyMM') AS order_ymonth,
           SUM(sale_price) AS sales
    FROM df_orders
    GROUP BY category, FORMAT(order_date, 'yyyyMM')
)
SELECT * FROM (
    SELECT *,
           ROW_NUMBER() OVER(PARTITION BY category ORDER BY sales DESC) AS rn
    FROM cte
) a
WHERE rn = 1;
            

Subcategory Profit Growth

Identified which subcategories experienced the highest profit growth from 2022 to 2023.


-- Which sub category had highest growth by profit in 2023 compared to 2022
WITH cte AS (
    SELECT sub_category,
           YEAR(order_date) AS order_year,
           SUM(profit) AS profit
    FROM df_orders
    GROUP BY sub_category, YEAR(order_date)
),
cte2 AS (
    SELECT sub_category
         , SUM(CASE WHEN order_year = 2022 THEN profit ELSE 0 END) AS profit_2022
         , SUM(CASE WHEN order_year = 2023 THEN profit ELSE 0 END) AS profit_2023
    FROM cte
    GROUP BY sub_category
)
SELECT TOP 1 *
     , (profit_2023 - profit_2022) * 100 / profit_2022 AS growth_percent
FROM cte2
ORDER BY (profit_2023 - profit_2022) * 100 / profit_2022 DESC;
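
Once the yearly totals exist, the growth calculation itself is simple arithmetic; a pandas sketch on invented subcategory totals:

```python
import pandas as pd

# Invented yearly totals per subcategory (not real results)
yearly = pd.DataFrame({
    "sub_category": ["Chairs", "Phones"],
    "total_2022":   [1000.0,  400.0],
    "total_2023":   [1100.0,  600.0],
})

# Percent growth from 2022 to 2023, then TOP 1 ... ORDER BY growth DESC
yearly["growth_percent"] = ((yearly["total_2023"] - yearly["total_2022"])
                            * 100 / yearly["total_2022"])
top = yearly.nlargest(1, "growth_percent")
```

Note that percent growth rewards a small base: a subcategory going from 400 to 600 (+50%) outranks one going from 1000 to 1100 (+10%) despite a smaller absolute gain.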
            

Code Samples

Data Preprocessing with Python


# Install required packages
!pip install kaggle pandas sqlalchemy pyodbc

# Download dataset from Kaggle
import kaggle
# https://www.kaggle.com/datasets/ankitbansal06/retail-orders
!kaggle datasets download ankitbansal06/retail-orders

# Unzip the downloaded file
import zipfile
with zipfile.ZipFile('retail-orders.zip', 'r') as zip_ref:
    zip_ref.extractall('retail-orders')

# Read the data
import pandas as pd
df = pd.read_csv('retail-orders/orders.csv',
                 na_values=['Not Available', 'unknown'])

# Check data and unique values
df.head(20)
df['Ship Mode'].unique()
df.isna().sum()

# Clean column names - convert to lowercase and replace spaces with underscores
df.columns = df.columns.str.lower()
df.columns = df.columns.str.replace(' ', '_')

# Create derived columns for analysis
df['discount'] = df['list_price']*df['discount_percent']*.01
df['sale_price'] = df['list_price'] - df['discount']
df['profit'] = df['sale_price'] - df['cost_price']

# Convert date datatype
df['order_date'] = pd.to_datetime(df['order_date'], format='%Y-%m-%d')

# Drop unnecessary columns
df.drop(columns = ['list_price', 'cost_price', 'discount_percent'], inplace=True)

# Connect to SQL Server via pyodbc (local instance with Windows authentication;
# replace the server name with your own)
import sqlalchemy as sal
engine = sal.create_engine('mssql+pyodbc://Foxy_Saravana/master?driver=ODBC+DRIVER+17+FOR+SQL+SERVER')
conn = engine.connect()

# Append into the pre-created df_orders table so its explicit column types are preserved
df.to_sql('df_orders', con=conn, index=False, if_exists='append')
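
One way to sanity-check the load step is a read-back row count. The sketch below uses an in-memory SQLite engine as a stand-in for the SQL Server instance, since the to_sql/read_sql pattern is identical and only the connection URL differs:

```python
import pandas as pd
import sqlalchemy as sal

# In-memory SQLite stands in for SQL Server purely to illustrate the pattern;
# in the real pipeline, reuse the mssql+pyodbc engine instead.
engine = sal.create_engine("sqlite://")

# Tiny hypothetical frame standing in for the cleaned orders data
df = pd.DataFrame({"order_id": [1, 2], "sale_price": [10.0, 20.0]})
df.to_sql("df_orders", con=engine, index=False, if_exists="append")

# Read back a row count and compare it with len(df) before trusting the table
count = pd.read_sql("SELECT COUNT(*) AS n FROM df_orders", con=engine)["n"].iloc[0]
```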
            

SQL Analysis Queries


-- Create table schema
CREATE TABLE df_orders(
    order_id INT PRIMARY KEY,
    order_date DATE,
    ship_mode VARCHAR(20),
    segment VARCHAR(20),
    country VARCHAR(20),
    city VARCHAR(20),
    state VARCHAR(20),
    postal_code VARCHAR(20),
    region VARCHAR(20),
    category VARCHAR(20),
    sub_category VARCHAR(20),
    product_id VARCHAR(20),
    quantity INT,
    discount DECIMAL(7, 2),
    sale_price DECIMAL(7, 2),
    profit DECIMAL(7, 2)
);


About This Project

This project was developed by following a step-by-step YouTube tutorial on building an end-to-end data analytics pipeline.

Technologies used: Python (pandas), SQL Server, SQLAlchemy, pyodbc, Kaggle API