How To Take Prometheus Planet Scale - Massively Large Scale Metrics Installations
Observability at eBay has been on an exponential growth curve. A modest time series ingest rate of 2M/sec in 2017 has grown to roughly 40M/sec, with close to 3 billion active time series. Our current Cortex-inspired Prometheus architecture builds sharding and clustering on top of the Prometheus TSDB. Sharding and replicating tenants' data across centralized clusters is relatively simple. However, large clusters become less useful as cardinality grows, because query latencies degrade considerably. In 2020, Google published a paper on Monarch, its time series database, which is dubbed a planet-scale TSDB. The paper gave us useful hints on how we could decentralize our installation and go fully planet scale.
What started as a humble prototype that federated queries from a centralized query instance to TSDBs deployed in Sydney, Amsterdam and the US is now a living, breathing entity. It lets us deploy our TSDBs anywhere in the world using simple Kubernetes operators, GitOps and intelligence built on top of the Prometheus TSDB.
This talk focuses on:
* the development of field hint indices that fingerprint time series, and their use for pointed (targeted) query fanout
* functional query push down on top of Prometheus storage
* the struggles of managing a planet scale deployment and using GitOps to ease the pain
* other lessons learned
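To make the first item concrete, here is a minimal sketch of how a field hint index could drive pointed fanout. It is an illustration under assumptions, not the eBay implementation: the `FieldHintIndex` class, the `fingerprint` helper, the choice of hint labels and the region names are all hypothetical, and it only handles equality matchers with non-empty values.

```python
import hashlib
from collections import defaultdict


def fingerprint(label: str, value: str) -> int:
    """Stable 64-bit fingerprint of one label/value pair (hypothetical scheme)."""
    digest = hashlib.sha256(f"{label}\x00{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


class FieldHintIndex:
    """Maps fingerprints of selected label values to the regions holding them."""

    def __init__(self, hint_labels):
        self.hint_labels = set(hint_labels)
        self._regions_by_hint = defaultdict(set)
        self._all_regions = set()

    def register(self, region, series_labels):
        """Record the hints of one series stored in `region`."""
        self._all_regions.add(region)
        for label, value in series_labels.items():
            if label in self.hint_labels:
                self._regions_by_hint[fingerprint(label, value)].add(region)

    def candidate_regions(self, equality_matchers):
        """Regions that may hold matching series (a superset, never a miss).

        Matchers on labels that are not hinted cannot prune anything,
        so they leave the candidate set unchanged.
        """
        candidates = set(self._all_regions)
        for label, value in equality_matchers.items():
            if label in self.hint_labels:
                hint = fingerprint(label, value)
                candidates &= self._regions_by_hint.get(hint, set())
        return candidates


# Assumed example data: three regional TSDBs, two hinted labels.
index = FieldHintIndex(hint_labels=["__name__", "namespace"])
index.register("sydney", {"__name__": "http_requests_total", "namespace": "checkout"})
index.register("amsterdam", {"__name__": "http_requests_total", "namespace": "search"})
index.register("us", {"__name__": "node_cpu_seconds_total", "namespace": "infra"})

# Only the regions returned here would receive the federated query.
targets = index.candidate_regions(
    {"__name__": "http_requests_total", "namespace": "checkout"}
)
```

The index stores fingerprints rather than raw label values, which keeps it small enough to hold centrally. A region is skipped only when a hint it never registered is required, so pruning can return extra regions but should not drop a region that has matching data.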
Conventional centralized systems, however well they handle clustering, replication and automation, still have vertical limits that become harder and harder to overcome. Our planet scale deployment shows a clear path to decentralizing storage while still providing a centralized query experience.
This presentation is useful for observability practitioners in several ways. It provides the audience with:
* solutions to common problems encountered on the way to a planet scale deployment
* insights into optimizations, such as changes to the wire protocol, that improve the performance of PromQL queries
* a working example of effectively leveraging push down for PromQL queries
* statistics on how higher scale can be achieved by going planet scale
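As a rough illustration of the push down idea in the list above, the sketch below evaluates something like `avg by (namespace) (...)` across regions. Each region reduces its own series to a per-group sum and count, and the central querier merges those partials instead of pulling raw series over the wire. The function names and sample data are hypothetical, and the sketch assumes each series lives in exactly one region.

```python
from collections import defaultdict


def regional_partial(samples, by):
    """Runs next to a regional TSDB: reduce raw series to per-group (sum, count)."""
    partial = defaultdict(lambda: [0.0, 0])
    for labels, value in samples:
        key = tuple(labels.get(name, "") for name in by)
        partial[key][0] += value
        partial[key][1] += 1
    return {key: (total, count) for key, (total, count) in partial.items()}


def merge_avg(partials):
    """Runs in the central querier: combine regional partials into averages."""
    totals = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for key, (total, count) in partial.items():
            totals[key][0] += total
            totals[key][1] += count
    return {key: total / count for key, (total, count) in totals.items()}


# Assumed instant-vector samples held by two regions.
sydney = [
    ({"namespace": "checkout", "pod": "a"}, 2.0),
    ({"namespace": "checkout", "pod": "b"}, 4.0),
]
us = [
    ({"namespace": "checkout", "pod": "c"}, 6.0),
    ({"namespace": "search", "pod": "d"}, 10.0),
]

result = merge_avg(
    [regional_partial(sydney, ["namespace"]), regional_partial(us, ["namespace"])]
)
```

Shipping sums and counts rather than regional averages is what keeps the merged result correct: averaging the two regional averages for `checkout` would give 4.5, while the true average over its three series is 4.0.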
This presentation is unique in showing that it is possible to decentralize storage and scale out while still providing a centralized-like experience.