Over the last couple of decades, a revolution in artificial intelligence (AI) has occurred: a shift from handcrafted rules to machine-learned models. While there remains a role for both, the momentum behind the latter is clear. It defines the age we live in and the approaches that are transforming AI Operations (AIOps) for all IT Operations (ITOps), and specifically for Site Reliability Engineering (SRE) and Network Operations (NetOps).
Rules are precise, which makes them an excellent choice for some decision processes. A rule can also express experience. However, a rule can degrade into an imprecise heuristic, as often happens with static thresholds. In addition, as the number and variety of endpoints that have to be assured explodes, setting up and maintaining rules becomes burdensome and complex.
In many aspects of AIOps, learning directly from the data is a better approach than setting and maintaining thousands to millions of rules. Data-centric approaches to AIOps are now emerging, forming one of the two dramatic shifts in SRE/NetOps tools, the other being immersive collaboration, where human experience and intelligence can be blended with data-centric insights.
Initial approaches to AIOps had grand visions and complex operational procedures – high on expectations, low on results. The next generation of AIOps tools will be more pragmatic in their goals, and easier to use. This paper examines the AI/ML approaches that are characteristic of this next generation of tools, and the benefits that are realized in SRE/NetOps.
Collect from a variety of sources
Ingest a variety of data sources in various formats
Correlate metrics and events with machine learning
Machine Learning based data analytics for automated anomaly detection
Collaborate across boundaries
Instant actionable insights on collaborative platforms using Natural Language Queries (NLQ)
The seeds of modern Artificial Intelligence (AI) emerged in the 1940s with programmable computers and in the 1950s with dedicated AI research. In the early days of AI, there were predictions of machines that would be as intelligent as humans.
Those predictions have not eventuated, leading to waves of investment and disillusionment. Efforts have fallen short for two reasons: the goal is much harder than anyone imagined, and rules-based programming is both complex and slow relative to the enormity of the task.
While the enormity has not diminished, approaches have changed. In the last couple of decades, an approach to AI based on learning directly from data, instead of coding rules, has yielded game-changing results in multiple areas. While rules continue to have a place, even in data-centric solutions, the current focus of Information Technology (IT) is on the significant gains being realized through machine learning: algorithms that improve with experience and with data. Hence, AI/ML is now a central focus for all data-rich environments, such as IT operations, including SRE/NetOps.
Examples of where machine learning has led to significant gains include image recognition and natural language processing. These approaches are not without their challenges, though: they often require well-labeled data and lengthy training cycles before the models can be applied. Neither is desirable in an operations environment where agility and flexibility are key. There are ways of approaching AI/ML for SRE/NetOps that avoid these issues.
24×7 product support
A fully interactive web portal with on-demand dashboards
Selector AI can be deployed in three possible modes – public, private, and cloud VPCs
As data became abundant and accessible in the post-Internet world, supervised learning emerged as a way of mapping an input object to an output object (a supervisory signal). This learning is based on training data consisting of labeled examples: this is a cat, this is a dog, etc. While this approach has proven powerful compared to writing rules that describe what a cat or dog looks like, it is not without its challenges. Well-labeled data is not always easy to come by, does not always contain sufficient examples, and sometimes carries bias from the examples it does contain. While Graphics Processing Units (GPUs) and specialized training silicon have made training far more practical and widely applicable, training still requires significant time and resources, neither of which is always abundant in operations environments.
Self-supervised learning is now emerging as an alternative. In this approach, training, a rule, or some other method is used to create a supervisory signal for later analysis. A good example would be training on real-time data to determine “normal”, and then using “normal” as the threshold for later anomaly detection. Whereas traditional training may take over a day, training in this approach can be done in less than a minute. Less training time and fewer resources are required, and time to first value is rapid. Thousands to millions of thresholds no longer have to be manually set and maintained, or based on imprecise heuristics that create a flood of false anomalies. This is an example where learning from the data, using innovative new approaches, is a better, more scalable, and more sustainable approach than rules.
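The idea of learning “normal” from unlabeled data and then using it as a threshold can be illustrated with a minimal Python sketch. This is not the method of any particular product: the function names, the latency readings, and the choice of a mean plus or minus three standard deviations band are all assumptions made for illustration.

```python
from statistics import mean, stdev

def learn_baseline(samples, k=3.0):
    """Derive a 'normal' band from unlabeled samples.

    Assumption: normal is modeled as mean +/- k standard deviations.
    Real systems may use more robust or adaptive estimators.
    """
    mu = mean(samples)
    sigma = stdev(samples)
    return mu - k * sigma, mu + k * sigma

def find_anomalies(values, band):
    """Return (index, value) pairs that fall outside the learned band."""
    low, high = band
    return [(i, v) for i, v in enumerate(values) if v < low or v > high]

# Hypothetical latency readings (ms) collected over a short training window.
training = [20.1, 19.8, 20.4, 20.0, 19.9, 20.3, 20.2, 19.7, 20.0, 20.1]
band = learn_baseline(training)

# Later readings are checked against the learned band, not a hand-set rule.
later = [20.2, 19.9, 35.0, 20.1]
anomalies = find_anomalies(later, band)
print(anomalies)
```

No labels and no manually chosen threshold are involved: the band is derived entirely from the observed data, so the same two functions could be applied per metric and per endpoint without per-endpoint tuning.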
Learning Over Different Time Periods
While self-supervised learning in less than a minute can radically transform thresholding, not all insights can be attained so quickly. For this reason, a good AIOps solution will gather insights over multiple periods: one minute, five minutes, and a “season”.
For example, a one-minute period may yield baselines and thresholds. A five-minute period may yield important contextual information about what was happening before and after an anomaly, connecting the dots as to root causes. Seasonal analysis provides insight into when a deviation from a baselined normal is itself a normal, recurring pattern.
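To make the seasonal case concrete, the following Python sketch learns a separate band for each position within a season, so that a recurring peak is treated as normal for its slot. The function names, the four-sample season, and the traffic values are hypothetical, and the mean plus or minus three standard deviations band is an assumption rather than a prescribed method.

```python
from collections import defaultdict
from statistics import mean, stdev

def learn_seasonal_bands(history, period, k=3.0):
    """Learn one (low, high) band per slot within a season.

    history: values sampled at regular intervals, covering several seasons.
    period:  number of samples per season (assumed known here).
    """
    slots = defaultdict(list)
    for i, value in enumerate(history):
        slots[i % period].append(value)
    bands = {}
    for slot, values in slots.items():
        mu, sigma = mean(values), stdev(values)
        bands[slot] = (mu - k * sigma, mu + k * sigma)
    return bands

def is_anomalous(index, value, bands, period):
    """Check a new sample against the band for its slot in the season."""
    low, high = bands[index % period]
    return value < low or value > high

# Hypothetical traffic metric: three seasons of four samples each,
# with a recurring peak in the third slot of every season.
history = [10, 10, 50, 10,
           11,  9, 52, 10,
           10, 11, 48, 11]
bands = learn_seasonal_bands(history, period=4)

print(is_anomalous(14, 51, bands, period=4))  # peak arriving in its usual slot
print(is_anomalous(13, 51, bands, period=4))  # same value in an off-peak slot
```

The same value of 51 is accepted in the slot where the peak recurs and flagged in a slot where it does not, which is the distinction a single global threshold cannot make.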
While monitoring thresholds is one key aspect of operations tools, the ultimate goal is to resolve trouble spots rapidly and return network conditions to their normal state. This requires examining many different data types, then filtering, correlating, clustering, and ranking the relevance of connections, some of which only become apparent over different periods of time. One of the characteristics of next-generation AIOps solutions is the ability to tell an insightful story by examining a variety of different data types: time-series, metrics, events, telemetry, logs, config, etc.
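One simple way to picture the correlating and ranking steps is to score candidate metrics by how strongly they move with an anomalous metric. The Python sketch below uses Pearson correlation for this purpose. That choice, along with the metric names and values, is an assumption for illustration only; a strong correlation suggests where to look first and does not by itself establish a root cause.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def rank_related(target, candidates):
    """Rank candidate series by absolute correlation with the target series."""
    scored = [(name, pearson(target, series)) for name, series in candidates.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical metrics sampled at the same six points in time.
latency = [20, 21, 20, 45, 47, 21]
candidates = {
    "interface_errors": [0, 1, 0, 9, 10, 1],
    "cpu_percent": [30, 31, 29, 30, 32, 31],
    "fan_speed": [5, 5, 5, 5, 5, 5],
}

ranking = rank_related(latency, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Running the same ranking over windows of different lengths is one way connections that are invisible at one time scale could surface at another.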
Learning from the data over different time periods connects the dots in ways that rules-based systems may miss, particularly when a new, previously unknown type of association appears. The data points SRE/NetOps teams in the direction of root cause and, ultimately, resolution and remediation.
While next-generation AIOps solutions will make use of rules where needed, their transformative power lies in unveiling the not-always-obvious insights contained in a variety of data types. These insights, accessible and understandable in a collaborative environment, build on the experience and knowledge of SRE/NetOps professionals so they can rapidly decide on the best resolution to an anomaly, the best remediation, or both.