Small Language Models are finally practical for everyday applications: they're cheaper to run, fast enough for real-time tasks, and easy to ship. This session shows how to build one end-to-end with open tooling: selecting a training corpus, reusing a well-established tokenizer, defining a model architecture, and training with knowledge distillation. We'll walk through the whole training process: how to fine-tune, how to compare outputs between the teacher and student, and how to use standard tricks like batching, learning-rate tuning, and checkpointing to keep training stable and efficient. You'll learn how to warm-start your smaller student network by copying over the teacher model's weights, then compress it with dynamic int8 quantization so it runs faster on standard CPUs.
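
To make that concrete, here is roughly what the warm-start step can look like. This is a minimal sketch rather than the session's exact code: it assumes the teacher and student are plain PyTorch modules exposing hypothetical `embeddings` and `blocks` attributes with identically shaped layers, and that you choose which teacher blocks the smaller student keeps.

```python
import torch.nn as nn

def init_student_from_teacher(student: nn.Module, teacher: nn.Module,
                              layer_map: dict[int, int]) -> None:
    """Warm-start the student by copying weights from selected teacher blocks.

    `layer_map` sends each student block index to the teacher block it copies,
    e.g. {0: 0, 1: 2, 2: 4, 3: 6} keeps every other block of an 8-block teacher.
    """
    # Token embeddings transfer directly when both models share a tokenizer.
    student.embeddings.load_state_dict(teacher.embeddings.state_dict())
    for s_idx, t_idx in layer_map.items():
        student.blocks[s_idx].load_state_dict(teacher.blocks[t_idx].state_dict())
```

Copying alternate blocks like this only works when the student keeps the teacher's hidden size; otherwise the tensors won't match and the student has to train from scratch.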
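The distillation objective itself can be sketched as a temperature-softened KL term against the teacher's logits, blended with ordinary cross-entropy on the labels. Again, this is illustrative rather than verbatim session code: it assumes both models return raw next-token logits of shape (batch, seq, vocab), and the `alpha` and `temperature` defaults are placeholders to tune.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend soft teacher targets with the usual hard-label cross-entropy."""
    # Soften both distributions with the same temperature; the T^2 factor keeps
    # the soft-target gradients on the same scale as the hard-label term.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard


def distillation_step(student, teacher, optimizer, input_ids, labels,
                      temperature: float = 2.0, alpha: float = 0.5) -> float:
    """One training step: the frozen teacher provides targets, the student gets the gradient."""
    with torch.no_grad():                      # the teacher is never updated
        teacher_logits = teacher(input_ids)
    student_logits = student(input_ids)
    # Flatten (batch, seq, vocab) -> (batch*seq, vocab) for the token-level losses.
    loss = distillation_loss(
        student_logits.flatten(0, 1),
        teacher_logits.flatten(0, 1),
        labels.flatten(),
        temperature=temperature,
        alpha=alpha,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```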
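Once the student is trained, the CPU-friendly compression step is essentially a one-liner with PyTorch's dynamic quantization, which stores linear-layer weights as int8 and quantizes activations on the fly at inference time. The helper name below is ours; only the `quantize_dynamic` call is PyTorch's.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

def quantize_for_cpu(student: nn.Module) -> nn.Module:
    """Return a copy of the trained student with nn.Linear weights quantized to int8."""
    student.eval()  # quantize the inference-mode model, after training is done
    return quantize_dynamic(student, {nn.Linear}, dtype=torch.qint8)
```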

We'll also talk about what breaks in practice and how to avoid it: tokenizer mismatches, unstable learning rates, poor distillation-temperature choices, over-checkpointing, and memory pressure on single-GPU rigs. Expect concrete code samples, reproducible scripts, and tips for running the pipeline locally or in your homelab. You'll leave with a working recipe for training and deploying a compact language model that delivers strong accuracy while running faster and cheaper than its larger counterparts.
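
As one example of catching the first of those failure modes early, a quick pre-flight check can confirm that the teacher and student really share a tokenizer before any GPU time is spent. A sketch, assuming Hugging Face tokenizers and placeholder model names:

```python
from transformers import AutoTokenizer

def assert_tokenizers_match(teacher_repo: str, student_repo: str,
                            probe: str = "The quick brown fox jumps over 13 lazy dogs.") -> None:
    """Fail fast if the teacher and student tokenize text differently."""
    t_tok = AutoTokenizer.from_pretrained(teacher_repo)
    s_tok = AutoTokenizer.from_pretrained(student_repo)
    if t_tok.get_vocab() != s_tok.get_vocab():
        raise ValueError("Vocabularies differ: distillation logits will not line up.")
    if t_tok.encode(probe) != s_tok.encode(probe):
        raise ValueError("Same vocab, different tokenization: check normalizers and merges.")
```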