Background

The seeds of modern Artificial Intelligence (AI) emerged in the 1940s with programmable computers and in the 1950s with dedicated AI research. In the early days of AI, there were predictions of machines that would be as intelligent as humans.

Those predictions have not eventuated, leading to waves of investment and disillusionment. Efforts have fallen short for two reasons: the goal is much harder than anyone imagined, and rules-based programming is both complex and slow relative to the scale of the task.

While the scale of the task has not diminished, approaches have changed. In the last couple of decades, an approach to AI based on learning directly from data, instead of coding rules, has yielded game-changing results in multiple areas. While rules continue to have a place, even in data-centric solutions, the current focus of Information Technology (IT) is on the significant gains being realized through machine learning (ML): algorithms that improve with experience and with data. Hence, AI/ML is now a central focus for all data-rich environments, for example IT operations, including SRE/NetOps.

Examples of where machine learning has led to significant gains include image recognition and natural language processing. These approaches are not without their challenges, though. For example, they often require well-labeled data and lengthy training cycles before the models can be applied. Neither of these is desirable in an operations environment where agility and flexibility are key. There are ways of approaching AI/ML for SRE/NetOps that avoid these issues.

Self-Supervised Learning

As data became abundant and accessible in the post-Internet world, supervised learning emerged as a way of learning a mapping from an input object to a desired output (the supervisory signal). This learning is based on training data consisting of labeled examples: this is a cat, this is a dog, etc. While this approach has proven powerful compared to writing rules that describe what a cat or dog looks like, it is not without its challenges. Well-labeled data is not always easy to come by, does not always have sufficient examples, and sometimes contains bias based on what examples it does have. While the application of Graphics Processing Units (GPUs) and specialized training silicon has created a new age of practicality and applicability for training, training still requires significant time and resources, neither of which is always abundant in operations environments.

Self-supervised learning is now emerging as an alternative. In this approach, training, a rule, or some other method is used to create a supervisory signal for later analysis. A good example would be training on real-time data to determine “normal”, and then using “normal” as the threshold for later anomaly detection. Whereas traditional training may take over a day, training in this approach can be done in less than a minute. Less training time and fewer resources are required, and time to first value is rapid. Thousands to millions of thresholds no longer have to be manually set and maintained, or based on imprecise heuristics that create a flood of false anomalies. This is an example where learning from the data, using innovative new approaches, is a better, more scalable, and more sustainable approach than rules.
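
The idea of learning “normal” from unlabeled data and reusing it as a threshold can be illustrated with a minimal sketch. It is not a description of any particular product's algorithm. It assumes a robust statistic (the median plus a multiple of the median absolute deviation) as the learned threshold, and the function names, the multiplier k, and the sample latency values are all hypothetical.

```python
from statistics import median

def learn_threshold(samples, k=3.0):
    """Learn an upper anomaly threshold from unlabeled samples.

    Assumption for illustration: "normal" is the median, and the threshold
    is median + k * MAD, where MAD is the median absolute deviation scaled
    by 1.4826 so that it approximates a standard deviation.
    """
    center = median(samples)
    mad = median([abs(x - center) for x in samples])
    return center + k * 1.4826 * mad

def find_anomalies(values, threshold):
    """Return (index, value) pairs for values above the learned threshold."""
    return [(i, v) for i, v in enumerate(values) if v > threshold]

# Hypothetical one-minute window of latency samples (ms), one every 5 seconds.
baseline_window = [20, 22, 21, 19, 23, 20, 21, 22, 20, 21, 19, 22]
threshold = learn_threshold(baseline_window)

# Later samples are checked against the learned threshold; no labels needed.
later = [21, 20, 48, 22]
print(find_anomalies(later, threshold))  # [(2, 48)]
```

No one had to label the data or choose the threshold by hand: the supervisory signal is derived from the data itself. A perfectly flat window yields a deviation of zero, so a practical version would likely enforce a minimum margin.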

Learning Over Different Time Periods

While self-supervised learning in less than a minute can radically transform thresholding, not all insights can be attained so quickly. For this reason, a good AIOps solution will have multiple periods over which insights are gathered: one minute, five minutes, a “season”.

For example, a one-minute period may establish baselines and thresholds. A five-minute period may reveal important contextual information about what was happening before and after an anomaly, connecting the dots as to root causes. Seasonal analysis provides insight into when a deviation from a baselined normal is itself a normal, recurring pattern.
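
As a rough sketch of the seasonal case, the example below keeps a separate baseline per hour of day, so that a value that would look anomalous against a single global baseline is accepted when it recurs at the same time each day. The hour-of-day bucketing, the 50% tolerance, and the traffic figures are assumptions chosen for illustration only.

```python
from collections import defaultdict
from statistics import median

def seasonal_baseline(history):
    """Build a per-hour baseline from (hour_of_day, value) pairs."""
    buckets = defaultdict(list)
    for hour, value in history:
        buckets[hour].append(value)
    return {hour: median(values) for hour, values in buckets.items()}

def is_seasonal_anomaly(hour, value, baseline, tolerance=0.5):
    """Flag a value that deviates from its own hour's baseline by more
    than `tolerance` (a fraction of the expected value)."""
    expected = baseline[hour]
    return abs(value - expected) > tolerance * expected

# Hypothetical traffic (Mbps) over three days; a backup job runs at 02:00.
history = [(2, 900), (2, 950), (2, 920), (14, 100), (14, 110), (14, 105)]
baseline = seasonal_baseline(history)

print(is_seasonal_anomaly(2, 930, baseline))   # False: normal for 02:00
print(is_seasonal_anomaly(14, 930, baseline))  # True: unusual for 14:00
```

The same 930 Mbps reading is normal during the nightly backup and anomalous in the afternoon, which is the distinction a single static threshold cannot make.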

While monitoring thresholds is one key aspect of operations tools, the ultimate goal is to resolve trouble spots rapidly and return network conditions to their normal state. This requires examining many different data types, then filtering, correlating, clustering, and ranking the relevance of connections, some of which only become apparent over different periods of time. One of the characteristics of next-generation AIOps solutions is the ability to tell an insightful story by examining a variety of different data types: time-series, metrics, events, telemetry, logs, config, etc.
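
A toy sketch of the filtering and ranking step follows. It gathers events of different types that occurred within five minutes of an anomaly and orders them by how close in time they were. Ranking purely by temporal proximity is a deliberate simplification standing in for real correlation and clustering, and the Event type, the context_for function, and the sample events are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Event:
    timestamp: float  # seconds since epoch
    source: str       # data type, e.g. "log", "config", "metric"
    message: str

def context_for(anomaly_ts, events, window_s=300):
    """Return events within `window_s` seconds of the anomaly, closest first."""
    nearby = [e for e in events if abs(e.timestamp - anomaly_ts) <= window_s]
    return sorted(nearby, key=lambda e: abs(e.timestamp - anomaly_ts))

# Hypothetical events drawn from different data types.
events = [
    Event(1000.0, "config", "routing policy changed on router-1"),
    Event(1090.0, "log", "interface flap on router-1"),
    Event(5000.0, "log", "unrelated login"),
]

# An anomaly detected at t=1100 pulls in the two nearby events only.
ranked = context_for(1100.0, events)
for e in ranked:
    print(e.source, e.message)
```

Here the interface flap and the earlier configuration change surface together as candidate context for the anomaly, while the unrelated event is filtered out.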

Learning from the data over different time periods connects the dots in ways that rules-based systems may miss, particularly when an association is of a new, previously unknown type. The data points SRE/NetOps teams in the direction of root cause and, ultimately, resolution and remediation.

Conclusion

While next-generation AIOps solutions will make use of rules where needed, their transformative power is in unveiling the not-always-obvious insights contained in a variety of data types. These insights, accessible and understandable in a collaborative environment, leverage the experience and knowledge of SRE/NetOps professionals to rapidly decide on the best resolution to an anomaly, the best remediation, or both.

