Dev/Stage/Prod is an Anti-Pattern for Data Pipelines
The primary benefit of maintaining distinct dev/stage/prod environments is isolation, though even that isolation becomes hard to maintain and scale when three environments must serve N engineers.
The underlying premise is the ability to test the impact of our changes in one environment without affecting the production environment our customers use. However, for data pipelines of more than moderate complexity, achieving isolation via dev/stage/prod environments leads to predictable problems, not least the cost of managing and maintaining many copies of the data (particularly in big data shops).
But keep calm!
In this talk we will present a different pattern we call Pipeline Sandboxing, which provides isolation benefits similar to dev/stage/prod for big data operations, and demonstrate how to apply it to build efficient data infrastructure, reduce cost, and achieve much higher quality than before.
We will cover best practices for leveraging this approach in your data infrastructure, alongside the tools that make it possible, through an example built on Dagster, Databricks, lakeFS, and Git that you can apply to your own data stack as well.
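To make the pattern concrete, here is a minimal sketch of the branch-per-run sandbox flow over a lakeFS-style versioned data lake. The `LakeClient` class below is a hypothetical in-memory stand-in, not the real lakeFS SDK; in practice you would drive `lakectl` or the lakeFS API, and the repository/branch names are illustrative.

```python
# Sketch of Pipeline Sandboxing: each pipeline run gets an isolated,
# zero-copy branch of the data lake; output is merged to main only if
# validation passes, and the sandbox branch is always cleaned up.

class LakeClient:
    """Toy in-memory model of branch create / merge / delete (hypothetical)."""
    def __init__(self):
        self.branches = {"main"}

    def create_branch(self, name, source="main"):
        assert source in self.branches
        self.branches.add(name)
        return f"lakefs://repo/{name}"          # URI the pipeline reads/writes

    def merge(self, source, into="main"):
        assert source in self.branches and into in self.branches

    def delete_branch(self, name):
        self.branches.discard(name)


def run_in_sandbox(client, run_id, pipeline, validate):
    """Run a pipeline against an isolated branch; promote output only if checks pass."""
    branch = f"sandbox-{run_id}"
    uri = client.create_branch(branch)          # snapshot of main, no data copy
    try:
        result = pipeline(uri)                  # pipeline sees only its branch
        if validate(result):
            client.merge(branch, into="main")   # atomically promote the output
            return True
        return False
    finally:
        client.delete_branch(branch)            # sandbox never outlives the run


client = LakeClient()
ok = run_in_sandbox(
    client,
    run_id="42",
    pipeline=lambda uri: {"rows": 100, "uri": uri},
    validate=lambda r: r["rows"] > 0,
)
```

The key design choice is that isolation is created per run rather than per environment, so N engineers get N cheap sandboxes instead of contending for three shared copies.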