Millions of Devices, Billions of Metrics
The OpenNMS network management platform has been around for 20 years. During that time monitoring needs have changed dramatically. Companies that once used their web page only for basic sales information now rely on it and other network services to generate their revenue, and the amount of data available about how the network is performing has increased by several orders of magnitude.
OpenNMS has grown with those needs. Traditional monitoring methods such as data collection via SNMP have grown in scale, so OpenNMS created Newts (https://newts.io) to handle the resulting storage needs. Modern monitoring has shifted to telemetry data for both performance metrics and "flow" data (who is talking to whom and for how long) using technologies such as sFlow and NetFlow. OpenNMS Drift (http://docs.opennms.org/opennms/branches/develop/guide-admin/guide-admin...) can now gather that data and store it in Elasticsearch.
The question then became how to analyze such large amounts of information. OpenNMS was integrated with the open source visualization tool Grafana to "roll up" metrics into useful reports for operators, but when it came to dealing with alarm conditions, operators were still overwhelmed.
Thus the Architecture for Learning Enabled Correlation (ALEC) was created (https://alec.opennms.com/alec/2.0.0-snapshot/). This framework can automatically create an inventory model from the OpenNMS database and then associate alarms within that model. This information is then fed to a correlation engine, such as one based on DBSCAN clustering or a deep learning engine based on TensorFlow.
Once properly trained, ALEC will cluster large numbers of alarms into "situations" which can then be presented to operators for corrective action.
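To make the clustering idea concrete, here is a minimal sketch of DBSCAN applied to a handful of alarms. It is not ALEC's implementation: the alarm tuples, the device names, the hop-count table, and the distance function (minutes apart plus hops apart in the inventory model) are all hypothetical choices made for illustration, as are the `eps` and `min_pts` values.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Alarm = Tuple[int, str]  # (timestamp in seconds, device name), both hypothetical
NOISE = -1

# Hypothetical hop counts between devices in the inventory model.
HOPS: Dict[frozenset, int] = {
    frozenset(("switch-a", "router-1")): 1,
    frozenset(("switch-b", "router-1")): 1,
    frozenset(("switch-a", "switch-b")): 2,
}

def hops(a: str, b: str) -> int:
    if a == b:
        return 0
    return HOPS.get(frozenset((a, b)), 10)  # unknown pairs count as far apart

def alarm_distance(x: Alarm, y: Alarm) -> float:
    # Assumed metric: minutes between alarms plus topological hop count.
    return abs(x[0] - y[0]) / 60 + hops(x[1], y[1])

def dbscan(points: Sequence[Alarm], eps: float, min_pts: int,
           dist: Callable[[Alarm, Alarm], float]) -> List[int]:
    """Return one cluster label per point; NOISE marks unclustered points."""
    labels: List[Optional[int]] = [None] * len(points)

    def neighbors(i: int) -> List[int]:
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # core point: keep expanding the cluster
                queue.extend(nbrs)
    return [label if label is not None else NOISE for label in labels]

alarms: List[Alarm] = [
    (0, "switch-a"), (20, "switch-a"), (45, "router-1"),
    (3600, "switch-b"), (3630, "switch-b"), (3650, "router-1"),
    (9000, "switch-c"),
]
labels = dbscan(alarms, eps=2.0, min_pts=2, dist=alarm_distance)

# Group clustered alarms into "situations"; noise alarms stay on their own.
situations: Dict[int, List[Alarm]] = defaultdict(list)
for alarm, label in zip(alarms, labels):
    if label != NOISE:
        situations[label].append(alarm)
```

With these made-up inputs, the first three alarms and the next three each form one situation, and the isolated `switch-c` alarm is left unclustered. A real deployment would derive the distance from the inventory model rather than a hand-written table.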
This presentation will demonstrate how these features of OpenNMS work and how they can be applied to meet complex monitoring needs. OpenNMS is 100% open source and all the features presented are available as free software.