A Lean Datawarehouse Toolkit

Signalflow
3 min readFeb 29, 2020

Building a lean DWH with Python, SQLAlchemy, PostgreSQL & Cron

One of my pet peeves the last couple of years while working with startups was; SMEs relies to much on manual data collection tasks. At my last company there was a guy that dedicated at least a couple of days a month for manual data collection tasks (i.e. updating core KPIs). This is precious time that could have been dedicated to other tasks to drive value for the company (i.e. setting up marketing or A/B testing campaigns). This peeve has pushed me to learn more about automation, which the main-stream folks typically call “Data Engineering”. Sure there will always be trade-offs between manual- vs automation tasks, but in general I feel there are enough free and open source technologies available on the table to at least discuss it as an option.

https://xkcd.com/1205/

My Proposal

In this post I will only list several concepts I wish to write about in the future. I believe these concepts (together with a possible code toolkit) could help you and your company set-up a lean datawarehouse at a minimum cost-to-company. Yes, off course one can start dockerizing everything and switch to GOLANG (or other state-of-the-art technologies), but I want to really keep this process as simple as possible. The simpler the approach, the more repeatable it will become for any company to adopt it.

Concepts To Cover

It is my hope to point you directly to a github repo in the future where you can just read my code and adapt it towards your company’s metrics, but before we get to that stage, we need to first revisit fundamental database concepts (which have been around for several years). These concepts are not rocket science by any means, but will require revision in order to be used productively for your own lean datawarehouse.

#### Primary Concepts:
* The basics of SQL
* Database Design & Entity Relationships
* Datawarehouse Star Schema Models
* Database ORMs
* SQLAlchemy Basics
* Database Migrations with Alemblic
* Unit Testing (or at least some form of script-testing)
* Version Control with Gitlab / Github
* Estimating infrastructure costs
* Setting up local, staging, production environments
* Automating ETL-scripts with python and cron
#### Secondary Concepts:
* Automating ETL-scripts with python and Apache Airflow
* Building A Flask UI for your lean dwh / crons / scripts

Q: Have I missed any concepts? Please add it in the comment-section below.

Who in this world does not like feedback?

  • ⚡Claps: If you loved my post, consider showing some support by clapping once or twice. If you really liked the post, consider clapping until you can’t clap no more :)
  • ⚡Comments: Is there something I could improve or perhaps write about in a next post? Let your fingers do the talking and please drop me a comment below.
  • ⚡Followers: If you loved this post, please consider following me on Medium and Twitter for future posts on various indie tasks, workflows, and open source tech.
  • ⚡Readers: If you found this article helpful, please feel free to also scan over some of my other workflows.

--

--

Signalflow

Insightfull tasks to optimise your startup’s worflows.