Database Design · Analytics · ETLSales Data Warehouse &
Sales Data Warehouse &
Analytics Pipeline
End-to-end data warehouse built in PostgreSQL using a star schema. Python ETL pipeline. 20+ analytical SQL queries. Tableau KPI dashboard.
Tables
8
SQL Queries
20+
Rows Processed
500K+
Schema Normal Form
3NF / BCNF
Tech Stack
PostgreSQLPython 3.11SQLAlchemypandasTableauETL PipelineStar SchemaWindow FunctionsCTEs
Star Schema Design
Central fact table surrounded by dimension tables — optimized for OLAP queries, not OLTP operations. All foreign keys indexed. Surrogate PKs used throughout to decouple from source system IDs.
fact_orders (fact)
PKorder_idBIGINT
FKcustomer_idINT
FKproduct_idINT
FKdate_idINT
FKregion_idINT
revenueNUMERIC(12,2)
quantityINT
discount_pctNUMERIC(5,2)
dim_customers
PKcustomer_idINT
full_nameVARCHAR(100)
segmentVARCHAR(50)
join_dateDATE
lifetime_valueNUMERIC(12,2)
dim_products
PKproduct_idINT
product_nameVARCHAR(150)
categoryVARCHAR(80)
unit_costNUMERIC(10,2)
unit_priceNUMERIC(10,2)
dim_date
PKdate_idINT
full_dateDATE
yearINT
quarterINT
monthINT
weekINT
is_weekendBOOLEAN
Python ETL Pipeline
Automated pipeline: extract raw CSVs → transform & validate → load into PostgreSQL.
python
import pandas as pd
from sqlalchemy import create_engine
from datetime import datetime
engine = create_engine("postgresql://user:pass@localhost/sales_dw")
def transform_orders(df: pd.DataFrame) -> pd.DataFrame:
"""Clean, validate, and conform raw order data."""
df = df.dropna(subset=["order_id", "customer_id", "revenue"])
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")
df["order_date"] = pd.to_datetime(df["order_date"])
df["discount_pct"] = df["discount_pct"].fillna(0).clip(0, 100)
# Derive date_id from dim_date surrogate key
df["date_id"] = df["order_date"].dt.strftime("%Y%m%d").astype(int)
return df
def load(df: pd.DataFrame, table: str):
df.to_sql(table, engine, if_exists="append", index=False,
method="multi", chunksize=5000)
print(f" Loaded {len(df):,} rows → {table}")
if __name__ == "__main__":
raw = pd.read_csv("raw/orders_2024.csv")
clean = transform_orders(raw)
load(clean, "fact_orders")
print("ETL complete:", datetime.now())Analytical SQL — Sample Queries
Year-over-Year Revenue Growth
sql
WITH yearly AS (
SELECT
d.year,
SUM(f.revenue) AS total_revenue,
LAG(SUM(f.revenue)) OVER (ORDER BY d.year) AS prev_revenue
FROM fact_orders f
JOIN dim_date d ON f.date_id = d.date_id
GROUP BY d.year
)
SELECT
year,
total_revenue,
ROUND(
(total_revenue - prev_revenue) / prev_revenue * 100, 2
) AS yoy_growth_pct
FROM yearly
ORDER BY year;90-Day Customer Cohort Retention
sql
WITH first_order AS (
SELECT customer_id, MIN(order_date) AS cohort_date
FROM fact_orders f
JOIN dim_date d ON f.date_id = d.date_id
GROUP BY customer_id
),
retention AS (
SELECT
DATE_TRUNC('month', fo.cohort_date) AS cohort_month,
COUNT(DISTINCT fo.customer_id) AS cohort_size,
COUNT(DISTINCT CASE
WHEN d.full_date BETWEEN fo.cohort_date + 1
AND fo.cohort_date + 90
THEN f.customer_id END) AS returned_90d
FROM first_order fo
JOIN fact_orders f USING (customer_id)
JOIN dim_date d ON f.date_id = d.date_id
GROUP BY cohort_month
)
SELECT
cohort_month,
cohort_size,
returned_90d,
ROUND(returned_90d::numeric / cohort_size * 100, 1) AS retention_rate
FROM retention
ORDER BY cohort_month;Top Products by Margin
sql
SELECT
p.category,
p.product_name,
SUM(f.revenue) AS total_revenue,
SUM(f.quantity * p.unit_cost) AS total_cost,
ROUND(
(SUM(f.revenue) - SUM(f.quantity * p.unit_cost))
/ SUM(f.revenue) * 100, 2
) AS margin_pct,
RANK() OVER (
PARTITION BY p.category ORDER BY
SUM(f.revenue) - SUM(f.quantity * p.unit_cost) DESC
) AS rank_in_category
FROM fact_orders f
JOIN dim_products p ON f.product_id = p.product_id
GROUP BY p.category, p.product_id, p.product_name, p.unit_cost
ORDER BY margin_pct DESC
LIMIT 20;Key Findings
- Q4 consistently contributes 38% of annual revenue — flagged for seasonal staffing & inventory planning.
- Top 20% of customers (by LTV) account for 67% of total revenue (80/20 rule validated).
- 90-day retention rate averages 41%, with the Electronics category retaining at 58%.
- Products with >40% discount have a negative net margin — discount policy recommendation submitted.
- Query runtime reduced 3.2× after adding composite indexes on
fact_orders(customer_id, date_id).