The presentation will take place in Ballroom DE on Sunday, March 8, 2026, from 11:45 to 12:45.

A client came to us with a problem we’re seeing more and more: their large language model (LLM) was deployed, but inference was painfully slow, GPU usage was unpredictable, and costs were spiraling out of control. Kubernetes alone wasn’t enough; they needed a production-ready, efficient, and scalable stack.

In this talk, we’ll walk through how we diagnosed and solved these issues using open-source CNCF tools, turning a chaotic deployment into a well-oiled inference machine.

You’ll learn how to:
1. Use KServe and Kubeflow to serve LLMs reliably (see the first sketch after this list).
2. Benchmark and auto-scale workloads using Volcano and KEDA while optimizing resource usage and latency (second sketch).
3. Track model performance and drift with Prometheus, Grafana, and OpenTelemetry (third sketch).
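
To make item 1 concrete, here is a minimal sketch of deploying an LLM as a KServe InferenceService via the KServe Python SDK. The namespace, model name, storage URI, and resource figures are placeholders, and the `huggingface` model format assumes a matching serving runtime is installed in your cluster; treat this as an illustration, not the exact setup from the talk.

```python
from kubernetes import client as k8s
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1ModelFormat,
    V1beta1ModelSpec,
    V1beta1PredictorSpec,
)

# All names below are placeholders: adjust the namespace, model name,
# and storage URI for your own cluster.
isvc = V1beta1InferenceService(
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata=k8s.V1ObjectMeta(name="llm-demo", namespace="llm-serving"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            model=V1beta1ModelSpec(
                # Assumes a HuggingFace-compatible serving runtime is installed.
                model_format=V1beta1ModelFormat(name="huggingface"),
                storage_uri="s3://models/example-llm",  # hypothetical bucket
                resources=k8s.V1ResourceRequirements(
                    requests={"cpu": "4", "memory": "16Gi"},
                    limits={"nvidia.com/gpu": "1"},  # one GPU per replica
                ),
            )
        )
    ),
)

# Submits the InferenceService; KServe then provisions the predictor pods.
KServeClient().create(isvc)
```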
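
For item 2, a sketch of the KEDA side: a ScaledObject that scales the predictor on a Prometheus query, created through the standard Kubernetes custom-objects API. The target Deployment, Prometheus address, metric name, and threshold are all hypothetical; Volcano would sit alongside this to schedule the batch benchmark jobs and is not shown here.

```python
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod

# Hypothetical names throughout: replace the target Deployment, Prometheus
# address, metric, and threshold with your own.
scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "llm-autoscaler", "namespace": "llm-serving"},
    "spec": {
        "scaleTargetRef": {"name": "llm-demo-predictor"},
        "minReplicaCount": 1,
        "maxReplicaCount": 8,
        "triggers": [
            {
                "type": "prometheus",
                "metadata": {
                    "serverAddress": "http://prometheus.monitoring.svc:9090",
                    # KEDA sizes the workload to roughly query value / threshold
                    # replicas, so threshold acts as a per-replica request target.
                    "query": "sum(rate(llm_requests_total[1m]))",
                    "threshold": "10",
                },
            }
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="keda.sh",
    version="v1alpha1",
    namespace="llm-serving",
    plural="scaledobjects",
    body=scaled_object,
)
```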
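
For item 3, a sketch of instrumenting an inference path with `prometheus_client` (feeding Grafana dashboards) and an OpenTelemetry span. The metric names and the `run_model` and `score_drift` helpers are stand-ins invented for the example, and the span is a no-op until an OTel SDK and exporter are configured.

```python
import random
import time

from opentelemetry import trace
from prometheus_client import Gauge, Histogram, start_http_server

# Hypothetical metric names; Grafana dashboards and drift alerts query these.
LATENCY = Histogram(
    "llm_inference_latency_seconds",
    "End-to-end inference latency",
    buckets=(0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0),
)
DRIFT = Gauge(
    "llm_output_drift_score",
    "Distance between recent outputs and a reference window",
)

tracer = trace.get_tracer("llm-serving")  # no-op until an OTel SDK is wired up


def run_model(prompt: str) -> str:
    # Stand-in for the real model call.
    time.sleep(random.uniform(0.05, 0.2))
    return prompt.upper()


def score_drift(reply: str) -> float:
    # Stand-in scorer; a real one compares output distributions over time.
    return random.random()


def serve_request(prompt: str) -> str:
    with tracer.start_as_current_span("inference"):  # trace each request
        with LATENCY.time():                         # record wall-clock latency
            reply = run_model(prompt)
    DRIFT.set(score_drift(reply))
    return reply


if __name__ == "__main__":
    start_http_server(9100)  # exposes /metrics for Prometheus to scrape
    while True:
        serve_request("hello")
        time.sleep(1)
```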

We’ll share benchmarks, architectures, and lessons from the field, all based on open-source tooling you can try today. Whether you’re running LLMs at scale or just exploring GenAI, this talk is packed with real-world solutions to help you do more with less.