Data Processing for Site Reliability Engineers
It is virtually impossible to provide your customers with a cost-effective and stable service without proper monitoring. An overprovisioned service wastes resources that could otherwise be invested in improvements; an underprovisioned service is prone to saturation and degrades the customer experience. Finding the right balance can be challenging given the complexity of modern cloud-based computing systems. Furthermore, the load on most customer-facing services changes dramatically between peak and low-activity hours, and backend services may experience surges of activity around the clock: heavy, long-running data processing tasks, for example, are likely to be scheduled during off-hours to avoid competing for resources with external customers.
Of course, cloud computing technology is evolving, and cloud service providers offer products for collecting telemetry on your services. These solutions, however, come with several problems that may or may not affect your company:
* you get cookie-cutter monitoring that may lack the features and metrics you are interested in;
* since the cloud provider is interested in cost optimization, there may be hard sampling and time limits obscuring what actually happens during sharp spikes of customer activity;
* and the main problem: the cloud provider is not really interested in running your service efficiently. The decision of how much to overpay to ensure the stability of your service is on you.
This is why a Site Reliability team should always look for alternative sources of information. Developments in machine learning and data processing open new possibilities for log analysis and data aggregation. Apache Spark and similar frameworks provide powerful tools for building the metrics that service SLOs can lean on.
During this presentation I will talk about my experience using Apache Spark to analyse request logs of Python-based services in order to improve their stability and optimize costs.
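To make the idea concrete, here is a minimal plain-Python sketch of the kind of per-endpoint aggregation such a pipeline computes; the log format and field names are hypothetical, and in practice Spark would apply the same logic in parallel across large distributed log files.

```python
import re
from collections import defaultdict
from math import ceil

# Hypothetical access-log format: "<method> <path> <status> <latency_ms>"
LOG_LINE = re.compile(r"^(\w+) (\S+) (\d{3}) (\d+(?:\.\d+)?)$")

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = ceil(pct / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

def latency_report(lines, pct=99):
    """Aggregate request counts and a latency percentile per endpoint."""
    latencies = defaultdict(list)
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if not m:
            continue  # skip malformed entries
        _method, path, _status, latency_ms = m.groups()
        latencies[path].append(float(latency_ms))
    return {
        path: {"count": len(vals), f"p{pct}": percentile(vals, pct)}
        for path, vals in latencies.items()
    }

logs = [
    "GET /api/users 200 12.5",
    "GET /api/users 200 840.0",
    "POST /api/orders 201 55.0",
]
report = latency_report(logs)
```

A report like this, computed over a time window, yields exactly the kind of tail-latency and traffic-volume metrics an SLO can be defined against, without the sampling limits of provider-supplied telemetry.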