Many engineers have stared at a production dashboard and wondered what a “duration” metric really measures. Does it include failed requests and retries, or only successes? Too often, the only way to find out is to read the source code - or worse, to discover that errors aren’t recorded at all.

This talk addresses a widespread but under-discussed problem: poorly defined telemetry. Ambiguous metrics breed tribal knowledge, slow down onboarding, and lead to false alerts and observability gaps. The session demonstrates practical approaches to designing clear, consistent telemetry and automating its validation.

The talk draws on work from the OpenTelemetry Semantic Conventions project - a shared language for defining metrics, traces, logs, and entities. It covers common instrumentation patterns (naming, error handling, network attributes) and features a live demo of Weaver, a tool that validates and enforces those conventions in practice.
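As a preview of the instrumentation style the talk advocates, here is a minimal sketch in Python using the OpenTelemetry metrics API. The instrument, unit, and attribute names come from the stable HTTP semantic conventions; the service name and the `process` handler are hypothetical stand-ins:

```python
import time
from opentelemetry import metrics

# The instrument name, unit, and attributes follow the stable OpenTelemetry
# HTTP semantic conventions. The description states explicitly that failed
# requests are included, so nobody has to read the source to find out.
meter = metrics.get_meter("checkout-service")  # hypothetical service name
request_duration = meter.create_histogram(
    name="http.server.request.duration",
    unit="s",
    description="Duration of inbound HTTP requests, successful or failed.",
)

def process(request) -> int:
    """Stand-in for real business logic; returns an HTTP status code."""
    return 200

def handle_request(request) -> None:
    start = time.monotonic()
    status_code, error_type = 200, None
    try:
        status_code = process(request)
    except TimeoutError:
        status_code, error_type = 504, "timeout"
        raise
    finally:
        # Recorded for every request, success or failure; error.type
        # distinguishes the failures, per the semantic conventions.
        attributes = {
            "http.request.method": "GET",  # would come from the real request
            "http.response.status_code": status_code,
            "network.protocol.version": "1.1",
        }
        if error_type is not None:
            attributes["error.type"] = error_type
        request_duration.record(time.monotonic() - start, attributes)

handle_request(object())  # records nothing unless an SDK MeterProvider is set
```

The description and the conditionally recorded `error.type` attribute answer the opening question directly: failures are included, and they are distinguishable.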

The session also highlights schema validation, policy enforcement using Rego, and live telemetry validation as ways to automate the feedback loop between developers and operations teams, improving reliability and compliance across services.
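The Rego policies and Weaver’s registry format are shown live in the session; as a rough illustration of the underlying idea, the toy Python check below (the registry entries and the failing sample are invented, and this is not Weaver’s API) mimics what automated enforcement catches:

```python
# A toy stand-in for policy enforcement: Weaver works from a YAML registry
# and Rego policies; this sketch hard-codes an equivalent check in Python.
REGISTRY = {
    # metric name -> expected unit (invented entries for illustration)
    "http.server.request.duration": "s",
    "http.server.active_requests": "{request}",
}

def validate(metric_name: str, unit: str) -> list[str]:
    """Return the list of violations for one emitted metric."""
    violations = []
    if metric_name not in REGISTRY:
        violations.append(f"{metric_name}: not declared in the registry")
    elif REGISTRY[metric_name] != unit:
        violations.append(
            f"{metric_name}: unit {unit!r}, registry says {REGISTRY[metric_name]!r}"
        )
    return violations

# In CI this would run over telemetry captured from a test deployment and
# fail the build on any violation.
assert validate("http.server.request.duration", "ms") == [
    "http.server.request.duration: unit 'ms', registry says 's'"
]
```

Running such checks in CI moves convention drift from a production surprise to a failed build.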

While rooted in OpenTelemetry, these practices apply equally to Prometheus, legacy metrics pipelines, and custom systems. The goal is to show how consistent, validated telemetry enables more reliable operations and reduces cognitive load across teams.