
ETL Pipeline Studio

A full-stack ETL/ELT tool for defining data sources and streams, modeling them two ways (dimensional and Data Vault), and loading the results into ClickHouse from S3.

Role
Solo — Django + React
Timeframe
2025
Status
Working
Stack
Django, React, ClickHouse, S3

What it is

A pipeline-management tool that takes data from messy API sources to query-ready tables. You register sources (with various auth methods), define streams with pagination and schema inference, assemble streams into data packages, and materialize those into models — supporting both dimensional and Data Vault modeling.
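To make the schema-inference step concrete, here is a minimal sketch of how column types might be inferred from a sample of stream records. It is not the project's actual code: the function names (`infer_value_type`, `merge_types`, `infer_schema`), the widening rules, and the choice of ClickHouse type names as the output vocabulary are all assumptions made for illustration.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, Optional


def infer_value_type(value: Any) -> Optional[str]:
    """Map one JSON value to a type name; None means 'no evidence yet'."""
    if value is None:
        return None
    if isinstance(value, bool):  # bool is a subclass of int, so test it first
        return "Bool"
    if isinstance(value, int):
        return "Int64"
    if isinstance(value, float):
        return "Float64"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)
            return "DateTime"
        except ValueError:
            return "String"
    return "String"  # assumption: nested objects/arrays are stored as text


def merge_types(a: Optional[str], b: Optional[str]) -> Optional[str]:
    """Widen two observed types into one that can hold both."""
    if a is None:
        return b
    if b is None or a == b:
        return a
    if {a, b} == {"Int64", "Float64"}:
        return "Float64"
    return "String"  # assumption: any other conflict falls back to String


def infer_schema(records: Iterable[Dict[str, Any]]) -> Dict[str, str]:
    """Infer one type per column; null or missing values make it Nullable."""
    rows = list(records)
    inferred: Dict[str, Optional[str]] = {}
    for row in rows:
        for column, value in row.items():
            inferred[column] = merge_types(inferred.get(column), infer_value_type(value))

    schema: Dict[str, str] = {}
    for column, type_name in inferred.items():
        base = type_name or "String"  # column was null in every sampled row
        has_gap = any(row.get(column) is None for row in rows)
        schema[column] = f"Nullable({base})" if has_gap else base
    return schema
```

Calling `infer_schema` on a sample page of API records returns a column-to-type mapping that a user could then review and adjust before the stream is added to a data package. Inferring from a sample is a trade-off: it saves hand-writing columns, but a later page can still contain a value the sample never showed.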

What it does

  • Schema inference on streams, so you don’t hand-write every column.
  • Two modeling styles. Star-schema dimensional models for query ergonomics, and Data Vault for auditable loads that tolerate late-arriving data, including hash transformations for Data Vault business keys.
  • ClickHouse integration. Tables are created automatically from model definitions, and data is loaded from S3 into ClickHouse using S3 virtualization to keep loads efficient.
  • JWT auth, a React front end for managing it all, and release CI.
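The Data Vault and ClickHouse items above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the helper names (`hub_hash_key`, `create_table_sql`, `load_from_s3_sql`), the MD5 choice, the `||` delimiter, and the MergeTree engine are all hypothetical. It also assumes that "S3 virtualization" refers to something like ClickHouse's `s3` table function, which the source does not confirm.

```python
import hashlib
from typing import Dict, List


def hub_hash_key(*business_key_parts: object, delimiter: str = "||") -> str:
    """Hash a (possibly composite) business key into a fixed-width hex key.

    Parts are trimmed and upper-cased first so that cosmetic differences in
    the source do not produce different keys. MD5 is an assumption here;
    SHA-1 and SHA-256 are also common choices.
    """
    normalized = [
        ("" if part is None else str(part)).strip().upper()
        for part in business_key_parts
    ]
    return hashlib.md5(delimiter.join(normalized).encode("utf-8")).hexdigest()


def create_table_sql(table: str, columns: Dict[str, str], order_by: List[str]) -> str:
    """Render a ClickHouse CREATE TABLE statement from a model definition.

    Identifiers are interpolated directly, so this sketch assumes they come
    from trusted model metadata, never from raw user input.
    """
    column_lines = ",\n    ".join(f"`{name}` {type_}" for name, type_ in columns.items())
    keys = ", ".join(f"`{key}`" for key in order_by)
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        f"    {column_lines}\n"
        f") ENGINE = MergeTree\n"
        f"ORDER BY ({keys})"
    )


def load_from_s3_sql(table: str, s3_url: str, file_format: str = "Parquet") -> str:
    """Render an INSERT that reads files in place through the s3 table function.

    Credentials are omitted on the assumption that the server is configured
    to access the bucket.
    """
    return f"INSERT INTO {table} SELECT * FROM s3('{s3_url}', '{file_format}')"
```

For example, `hub_hash_key(" acme ", 42)` and `hub_hash_key("ACME", "42")` yield the same key. The delimiter matters: without it, the composite keys `("a", "bc")` and `("ab", "c")` would hash identically. Reading through a table function lets ClickHouse pull files directly from the bucket instead of routing rows through the application.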

Why it matters

It concentrates core data-modeling and pipeline skills in one project: sources, streams, schema inference, dimensional vs. Data Vault modeling, and a columnar analytical target. The same shape shows up anywhere messy upstream data has to become trustworthy reporting.