Building a self-service data pipeline with Apache Spark
At ZipRecruiter, we are building our next-generation, in-house streaming data platform: a self-service system that lets our 10-person data services team support 20 distinct dev teams.
In this post I'll share the architecture we designed, the trade-offs we considered, and the choices we've made.
Building a data pipeline for stats and analysis is a big job. There is a cornucopia of open-source tools to choose from, and many decisions to make regarding:
- storage formats
- streaming compute
- SQL integration
- data ingress and egress
- job vetting
- data integrity