← Projects
Database Design · Analytics · ETL

Sales Data Warehouse &
Analytics Pipeline

End-to-end data warehouse built in PostgreSQL using a star schema. Python ETL pipeline. 20+ analytical SQL queries. Tableau KPI dashboard.

Tables
8
SQL Queries
20+
Rows Processed
500K+
Schema Normal Form
3NF / BCNF

Tech Stack

PostgreSQLPython 3.11SQLAlchemypandasTableauETL PipelineStar SchemaWindow FunctionsCTEs

Star Schema Design

Central fact table surrounded by dimension tables — optimized for OLAP queries, not OLTP operations. All foreign keys indexed. Surrogate PKs used throughout to decouple from source system IDs.

fact_orders (fact)
PKorder_idBIGINT
FKcustomer_idINT
FKproduct_idINT
FKdate_idINT
FKregion_idINT
revenueNUMERIC(12,2)
quantityINT
discount_pctNUMERIC(5,2)
dim_customers
PKcustomer_idINT
full_nameVARCHAR(100)
segmentVARCHAR(50)
join_dateDATE
lifetime_valueNUMERIC(12,2)
dim_products
PKproduct_idINT
product_nameVARCHAR(150)
categoryVARCHAR(80)
unit_costNUMERIC(10,2)
unit_priceNUMERIC(10,2)
dim_date
PKdate_idINT
full_dateDATE
yearINT
quarterINT
monthINT
weekINT
is_weekendBOOLEAN

Python ETL Pipeline

Automated pipeline: extract raw CSVs → transform & validate → load into PostgreSQL.

python
import pandas as pd
from sqlalchemy import create_engine
from datetime import datetime

engine = create_engine("postgresql://user:pass@localhost/sales_dw")

def transform_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Clean, validate, and conform raw order data."""
    df = df.dropna(subset=["order_id", "customer_id", "revenue"])
    df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
    df["order_date"] = pd.to_datetime(df["order_date"])
    df["discount_pct"] = df["discount_pct"].fillna(0).clip(0, 100)

    # Derive date_id from dim_date surrogate key
    df["date_id"] = df["order_date"].dt.strftime("%Y%m%d").astype(int)
    return df

def load(df: pd.DataFrame, table: str):
    df.to_sql(table, engine, if_exists="append", index=False,
              method="multi", chunksize=5000)
    print(f"  Loaded {len(df):,} rows → {table}")

if __name__ == "__main__":
    raw = pd.read_csv("raw/orders_2024.csv")
    clean = transform_orders(raw)
    load(clean, "fact_orders")
    print("ETL complete:", datetime.now())

Analytical SQL — Sample Queries

Year-over-Year Revenue Growth

sql
WITH yearly AS (
  SELECT
    d.year,
    SUM(f.revenue)                              AS total_revenue,
    LAG(SUM(f.revenue)) OVER (ORDER BY d.year)  AS prev_revenue
  FROM fact_orders f
  JOIN dim_date d ON f.date_id = d.date_id
  GROUP BY d.year
)
SELECT
  year,
  total_revenue,
  ROUND(
    (total_revenue - prev_revenue) / prev_revenue * 100, 2
  ) AS yoy_growth_pct
FROM yearly
ORDER BY year;

90-Day Customer Cohort Retention

sql
WITH first_order AS (
  SELECT customer_id, MIN(order_date) AS cohort_date
  FROM fact_orders f
  JOIN dim_date d ON f.date_id = d.date_id
  GROUP BY customer_id
),
retention AS (
  SELECT
    DATE_TRUNC('month', fo.cohort_date)  AS cohort_month,
    COUNT(DISTINCT fo.customer_id)       AS cohort_size,
    COUNT(DISTINCT CASE
      WHEN d.full_date BETWEEN fo.cohort_date + 1
                           AND fo.cohort_date + 90
      THEN f.customer_id END)            AS returned_90d
  FROM first_order fo
  JOIN fact_orders f  USING (customer_id)
  JOIN dim_date d     ON f.date_id = d.date_id
  GROUP BY cohort_month
)
SELECT
  cohort_month,
  cohort_size,
  returned_90d,
  ROUND(returned_90d::numeric / cohort_size * 100, 1) AS retention_rate
FROM retention
ORDER BY cohort_month;

Top Products by Margin

sql
SELECT
  p.category,
  p.product_name,
  SUM(f.revenue)                                     AS total_revenue,
  SUM(f.quantity * p.unit_cost)                      AS total_cost,
  ROUND(
    (SUM(f.revenue) - SUM(f.quantity * p.unit_cost))
    / SUM(f.revenue) * 100, 2
  )                                                  AS margin_pct,
  RANK() OVER (
    PARTITION BY p.category ORDER BY
    SUM(f.revenue) - SUM(f.quantity * p.unit_cost) DESC
  )                                                  AS rank_in_category
FROM fact_orders f
JOIN dim_products p ON f.product_id = p.product_id
GROUP BY p.category, p.product_id, p.product_name, p.unit_cost
ORDER BY margin_pct DESC
LIMIT 20;

Key Findings

  • Q4 consistently contributes 38% of annual revenue — flagged for seasonal staffing & inventory planning.
  • Top 20% of customers (by LTV) account for 67% of total revenue (80/20 rule validated).
  • 90-day retention rate averages 41%, with the Electronics category retaining at 58%.
  • Products with >40% discount have a negative net margin — discount policy recommendation submitted.
  • Query runtime reduced 3.2× after adding composite indexes on fact_orders(customer_id, date_id).
← All ProjectsHire me →
Home
Projects
Experience
Skills
Contact