Database Design · Analytics · ETL

Sales Data Warehouse &
Analytics Pipeline

End-to-end data warehouse built in PostgreSQL using a star schema. Python ETL pipeline. 20+ analytical SQL queries. Tableau KPI dashboard.

Tables

SQL Queries

20+

Rows Processed

500K+

Schema Normal Form

3NF / BCNF

Tech Stack

PostgreSQLPython 3.11SQLAlchemypandasTableauETL PipelineStar SchemaWindow FunctionsCTEs

Star Schema Design

Central fact table surrounded by dimension tables — optimized for OLAP queries, not OLTP operations. All foreign keys indexed. Surrogate PKs used throughout to decouple from source system IDs.

fact_orders (fact)

PKorder_idBIGINT

FKcustomer_idINT

FKproduct_idINT

FKdate_idINT

FKregion_idINT

revenueNUMERIC(12,2)

quantityINT

discount_pctNUMERIC(5,2)

dim_customers

PKcustomer_idINT

full_nameVARCHAR(100)

segmentVARCHAR(50)

join_dateDATE

lifetime_valueNUMERIC(12,2)

dim_products

PKproduct_idINT

product_nameVARCHAR(150)

categoryVARCHAR(80)

unit_costNUMERIC(10,2)

unit_priceNUMERIC(10,2)

dim_date

PKdate_idINT

full_dateDATE

yearINT

quarterINT

monthINT

weekINT

is_weekendBOOLEAN

Python ETL Pipeline

Automated pipeline: extract raw CSVs → transform & validate → load into PostgreSQL.

python

import pandas as pd
from sqlalchemy import create_engine
from datetime import datetime

engine = create_engine("postgresql://user:pass@localhost/sales_dw")

def transform_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Clean, validate, and conform raw order data."""
    df = df.dropna(subset=["order_id", "customer_id", "revenue"])
    df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
    df["order_date"] = pd.to_datetime(df["order_date"])
    df["discount_pct"] = df["discount_pct"].fillna(0).clip(0, 100)

    # Derive date_id from dim_date surrogate key
    df["date_id"] = df["order_date"].dt.strftime("%Y%m%d").astype(int)
    return df

def load(df: pd.DataFrame, table: str):
    df.to_sql(table, engine, if_exists="append", index=False,
              method="multi", chunksize=5000)
    print(f"  Loaded {len(df):,} rows → {table}")

if __name__ == "__main__":
    raw = pd.read_csv("raw/orders_2024.csv")
    clean = transform_orders(raw)
    load(clean, "fact_orders")
    print("ETL complete:", datetime.now())

Analytical SQL — Sample Queries

Year-over-Year Revenue Growth

sql

WITH yearly AS (
  SELECT
    d.year,
    SUM(f.revenue)                              AS total_revenue,
    LAG(SUM(f.revenue)) OVER (ORDER BY d.year)  AS prev_revenue
  FROM fact_orders f
  JOIN dim_date d ON f.date_id = d.date_id
  GROUP BY d.year
)
SELECT
  year,
  total_revenue,
  ROUND(
    (total_revenue - prev_revenue) / prev_revenue * 100, 2
  ) AS yoy_growth_pct
FROM yearly
ORDER BY year;

90-Day Customer Cohort Retention

sql

WITH first_order AS (
  SELECT customer_id, MIN(order_date) AS cohort_date
  FROM fact_orders f
  JOIN dim_date d ON f.date_id = d.date_id
  GROUP BY customer_id
),
retention AS (
  SELECT
    DATE_TRUNC('month', fo.cohort_date)  AS cohort_month,
    COUNT(DISTINCT fo.customer_id)       AS cohort_size,
    COUNT(DISTINCT CASE
      WHEN d.full_date BETWEEN fo.cohort_date + 1
                           AND fo.cohort_date + 90
      THEN f.customer_id END)            AS returned_90d
  FROM first_order fo
  JOIN fact_orders f  USING (customer_id)
  JOIN dim_date d     ON f.date_id = d.date_id
  GROUP BY cohort_month
)
SELECT
  cohort_month,
  cohort_size,
  returned_90d,
  ROUND(returned_90d::numeric / cohort_size * 100, 1) AS retention_rate
FROM retention
ORDER BY cohort_month;

Top Products by Margin

sql

SELECT
  p.category,
  p.product_name,
  SUM(f.revenue)                                     AS total_revenue,
  SUM(f.quantity * p.unit_cost)                      AS total_cost,
  ROUND(
    (SUM(f.revenue) - SUM(f.quantity * p.unit_cost))
    / SUM(f.revenue) * 100, 2
  )                                                  AS margin_pct,
  RANK() OVER (
    PARTITION BY p.category ORDER BY
    SUM(f.revenue) - SUM(f.quantity * p.unit_cost) DESC
  )                                                  AS rank_in_category
FROM fact_orders f
JOIN dim_products p ON f.product_id = p.product_id
GROUP BY p.category, p.product_id, p.product_name, p.unit_cost
ORDER BY margin_pct DESC
LIMIT 20;

Key Findings

Q4 consistently contributes 38% of annual revenue — flagged for seasonal staffing & inventory planning.
Top 20% of customers (by LTV) account for 67% of total revenue (80/20 rule validated).
90-day retention rate averages 41%, with the Electronics category retaining at 58%.
Products with >40% discount have a negative net margin — discount policy recommendation submitted.
Query runtime reduced 3.2× after adding composite indexes on fact_orders(customer_id, date_id).

← All Projects Hire me →

Sales Data Warehouse &Analytics Pipeline