Our story begins with a puzzling slowdown reported by one of our largest customers—no obvious spikes, no crashes, just degraded observability. On paper, our OpenTelemetry Collector pipeline looked healthy. But the metrics weren’t telling the whole story. Behind the scenes, we were silently dropping requests, and our Collector—burdened with over 300 filters and 50+ transformers—was stretched to its limit.
In this talk, we’ll walk through how we diagnosed and resolved the bottlenecks. We started by introducing an HAProxy sidecar to surface pre-ingest metrics and protect availability. That gave us the visibility we needed to go deeper. What we found was a performance cliff caused by expensive OTTL expressions and excessive memory allocations. The real fix came when we cracked open the OTTL engine itself, built a custom code-generation layer, and rethought how pipelines should scale, resulting in near-zero memory allocations and a jaw-dropping 40x performance improvement.
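
To give a feel for the kind of work the Collector was doing per record, here is a minimal, illustrative sketch of an OTTL-based filter and transform pipeline. The processor names, attributes, and regular expressions are hypothetical stand-ins, not our production configuration, but they show the style of expression that becomes expensive at hundreds of statements per pipeline.

```yaml
# Hypothetical Collector config: OTTL conditions and statements evaluated per record.
receivers:
  otlp:
    protocols:
      http:

processors:
  filter/drop-noise:
    logs:
      log_record:
        # Regex matching on every log body is one of the costlier OTTL patterns.
        - 'IsMatch(body, ".*health-check.*")'
  transform/normalize:
    log_statements:
      - context: log
        statements:
          - set(attributes["env"], "production") where resource.attributes["deployment.environment"] == "prod"
          - replace_pattern(attributes["url"], "token=[^&]+", "token=REDACTED")

exporters:
  debug:

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [filter/drop-noise, transform/normalize]
      exporters: [debug]
```

Multiply statements like these by several hundred, and the per-record parsing, regex evaluation, and allocation overhead is exactly the cliff the code-generation work was built to eliminate.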
This session shares the architecture, tools, and engineering decisions that made it possible to scale OpenTelemetry in high-throughput environments. It’s a practical deep dive into what breaks under pressure—and how to push the Collector well beyond its default limits.



