
Everything You Know About Observability is Wrong

Growing up in America meant being bombarded with marketing messages that border on the ridiculous. You can’t go one full day without seeing an ad for some anti-aging cream, a yogurt containing particular bacteria to boost your immune system, or an energy drink promising you will grow wings.

The world of tech is no different. Everyone seeks an edge and will stretch the truth with marketing messages. Look no further than the phrase “full-stack observability” used by companies whose products don’t offer network monitoring. Leaving the most critical piece of infrastructure out of your platform and still calling it “full-stack” is a bold move. 

In simple terms, observability tools are an evolution of traditional legacy monitoring platforms and a must-have for anyone responsible for maintaining and monitoring a modern, globally distributed architecture. A plethora of observability tools is on the market today, each with its own set of features and capabilities, and each with its own definition of “observability.” And each definition comes wrapped in messaging that stretches the truth around usability, affordability, and scalability.

While a decent observability platform will help you monitor and troubleshoot complex systems, it is crucial to understand that every observability platform is flawed. To understand where and why observability platforms fail, we must start at the beginning.

A Brief History of Network Monitoring

At some point during the previous century, network monitoring was born. It’s hard to say exactly when, but it evolved in the same timeframe as ARPANET. Engineers collected metrics and logs to help respond to events and to debug and troubleshoot network performance.

Fast forward a few decades to when internet usage was increasing exponentially month over month, thanks to everyone getting hundreds of hours online for free. In time, networked systems grew in complexity, shifting from servers self-hosted in closets to globally distributed systems hosted by cloud providers. The result was a shift in monitoring methods, from plain metric collection to something marketed as “visibility.” This technique groups similar metrics to find correlations between events and systems.

In time, we shifted marketing messages again to “observability.” The word itself is borrowed from control theory, which offers a classical definition for observability: “…the ability to infer the internal state of a system based upon the external outputs.”

If you search for a definition of observability today, you will uncover dozens of candidates, as each software vendor has sought to leverage the latest buzzword in its marketing materials. Since each company gets to define what observability means for its customers, the result is market confusion. Apply observability to observability itself, and the internal state you infer from all these external outputs is “many marketing messages.”

But those messages often fail to help their target audience — system administrators, who keep corporate IT running daily. Sysadmins want to demonstrate their value, be productive, and not feel like a failure every time Adam in Accounting opens a ticket because something is “running slow.”

Observability Demystified

Let’s think briefly about what observability is and what it is not. Observability is a superset of monitoring, as you cannot infer a system’s health without measuring the outputs somehow. Let’s take my favorite example — a trash bin.

A trash bin, by default, lacks many external outputs. The most critical metric a trash bin should communicate externally is “I’m full.” But you often only know a trash bin is full by opening the lid or door (a physical constraint). So, we need a way to monitor for fullness. A sensor to measure weight would work, except not all trash weighs the same or has the same shape. So, we need additional metrics for fullness. We can add these metrics and claim the trash bin is observable, but what we have really accomplished is making the trash bin observable for our specific requirements, not necessarily for anyone else’s.
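
To make that concrete, here is a minimal sketch in Python of inferring “fullness” from two external outputs. The sensor names and thresholds are invented for illustration, not a real product API:

```python
from dataclasses import dataclass

@dataclass
class TrashBinReadings:
    """External outputs the bin exposes; field names are hypothetical."""
    weight_kg: float        # load-cell reading
    fill_height_pct: float  # ultrasonic distance sensor, 0-100

def is_full(readings: TrashBinReadings) -> bool:
    # Neither signal alone is reliable: dense trash trips the weight
    # threshold early, fluffy trash trips the height threshold early.
    # Infer the internal state ("full") from both external outputs.
    return readings.weight_kg >= 10.0 or readings.fill_height_pct >= 85.0

print(is_full(TrashBinReadings(weight_kg=3.2, fill_height_pct=91.0)))  # True
```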

Simply stated, observability is a way to infer a system’s health through the outputs and metrics provided. But you also need the ability to adapt the metrics collected. In modern observability platforms, this is often done by applying tags to your systems, allowing you to slice the output data and return a filtered view. 
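
As a rough illustration of tag-based slicing, here is a Python sketch with invented tag names rather than any particular vendor’s API:

```python
# Each metric sample carries tags; a "filtered view" is just a tag match.
samples = [
    {"metric": "cpu_pct", "value": 42, "tags": {"site": "nyc", "role": "edge"}},
    {"metric": "cpu_pct", "value": 97, "tags": {"site": "sjc", "role": "core"}},
    {"metric": "cpu_pct", "value": 88, "tags": {"site": "sjc", "role": "edge"}},
]

def view(samples, **tags):
    """Return only the samples whose tags match every requested key/value."""
    return [s for s in samples if all(s["tags"].get(k) == v for k, v in tags.items())]

print(view(samples, site="sjc"))                # slice by site
print(view(samples, site="sjc", role="core"))   # narrow the view further
```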

However, observability by itself is not the solution to all your problems. Anyone telling you otherwise is trying to sell you something.

The Fundamental Problem with Observability

Data without context is useless. That’s the problem.

It’s not a new problem, either. The original network monitoring tools also collected metrics directly from devices. It was easy to know if a router or switch was having a spike in CPU usage. What was more complex to understand was why the spike was happening in the first place. That’s the context missing from simply observing an output from a system.

Shifting to a monitoring platform offering “visibility” into your systems allowed you to have context, but you needed to add the metadata yourself. Adding metadata was done by manually correlating groups of metrics together. Today, we use tagging to provide context for an observability platform. Lack of context has always been an issue and, thus, is the fundamental problem with any monitoring platform.

Think of all the ways your observability tools have failed you:

  • Blind spots: Observability tools may not capture data and events occurring inside modern networks, especially cloud connections, VPN tunnels, and database bottlenecks.
  • Alerting: Some observability tools can’t alert on what they can’t see, and even when they have visibility, they may not have sufficient capabilities to notify users of issues, forcing customers to rely on third-party integrations. Many such tools are also reactive, requiring an event to occur before analysis is performed, which leads to alert fatigue.
  • Root cause analysis: Other observability tools require 27 different dashboards to help users perform root cause analysis. Often, the tools are not collecting a specific metric by default, resulting in users deciding to collect as much data as possible, slowing down analysis even more.
  • Complexity: Some observability tools reach the upper bounds of their scalability sooner than anticipated. Modern globally distributed networks are complex and growing, and your tools need to scale to match. Many observability tools fail to account for the fact that data collected from different sources comes at different scales (aggregated versus raw), leading to poor analysis.
  • Cost: Observability tools promise a low price to get started, but costs quickly add up as additional packages become necessary and more extensions are needed, not to mention the extra storage costs due to the need to collect as much data as possible because you don’t know what you are looking for.

In short, the “unknown unknowns” are the Achilles’ heel of these systems. They require a lot of manual configuration and updates to observe the correct things, which change with time. While these systems promise the ability to be proactive in your monitoring, you spend too much time in reactive mode first.

Therefore, the next generation of network monitoring platforms must include contextual observability as a core offering. Providing an emphasis on context is the most significant shift in modern network observability. Collecting data is no longer enough; understanding the context in which data is collected is crucial. 

NextGen Network Observability Platforms

The next generation of network observability platforms must (1) be genuinely data-centric and (2) provide the context necessary for actionable insights. 

Being data-centric is crucial: your observability platform must be able to collect any data from any system, anywhere. The collected data is a permanent asset as applications and systems come and go over time. The data is transformed through advanced analytical techniques, such as machine learning classification models, that allow the platform to find correlations otherwise undiscovered by traditional monitoring. In other words, with a bit of math, we can uncover some unknown unknowns. No advanced knowledge is necessary.
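
As a hedged sketch of what “a bit of math” can look like, here is an unsupervised outlier pass over per-device metric vectors using scikit-learn. (The text above mentions classification models; an isolation forest is a simpler unsupervised cousin that suits “unknown unknowns.” The features and data are synthetic, invented for illustration.)

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# One row per device: [cpu_pct, error_rate, request_rate] over some window.
# Values are synthetic; a real platform would build these from collected data.
rng = np.random.default_rng(7)
normal = rng.normal(loc=[40, 0.5, 1000], scale=[5, 0.1, 100], size=(200, 3))
odd = np.array([[85, 4.0, 4000]])  # a device behaving unlike its peers
X = np.vstack([normal, odd])

model = IsolationForest(random_state=7).fit(X)
flags = model.predict(X)            # -1 marks outliers, 1 marks inliers
print(np.where(flags == -1)[0])     # the "unknown unknown" rows worth a look
```

No one told the model what “bad” looks like; the odd device surfaces purely because its outputs don’t resemble its peers’.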

Context is provided through the same analytical techniques, which infuse metadata to correlate different data points and find the true root cause. For example, a traditional monitoring tool might report that a network switch is at 80% CPU utilization. A next-generation network observability platform will not only report the 80% CPU utilization but will correlate it with a spike in application request rates and provide historical context for anomaly detection.
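
A toy version of that correlation-plus-history check, assuming both series are already being collected (all numbers invented):

```python
import numpy as np

# Five-minute samples over the past few hours (synthetic data).
cpu = np.array([41, 43, 40, 44, 42, 45, 43, 61, 74, 80], dtype=float)
requests = np.array([1.0, 1.1, 1.0, 1.2, 1.1, 1.2, 1.1, 2.4, 3.1, 3.5]) * 1000

# Historical context: how unusual is the latest CPU reading?
baseline, spread = cpu[:-3].mean(), cpu[:-3].std()
z = (cpu[-1] - baseline) / spread
print(f"CPU z-score vs. baseline: {z:.1f}")  # far above ~3 => anomalous

# Correlation: does the CPU spike move with the request rate?
r = np.corrcoef(cpu, requests)[0, 1]
print(f"CPU/request correlation: {r:.2f}")  # near 1.0 => likely related
```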

But these NextGen tools will use analytics to go even further. They can build forecast models, helping to predict network bottlenecks before they happen. They will automatically create dependency mappings and show you the root cause along with all the affected nodes. They will also provide insight into the business impact of an incident, helping you understand an outage’s effect on revenue, customer satisfaction, and overall operations.
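
For the forecasting piece, even a crude linear trend can hint at when a link will saturate. A real platform would use far richer models, but the shape of the idea looks like this (hypothetical numbers):

```python
import numpy as np

# Daily peak utilization (%) of a link over the past two weeks (synthetic).
days = np.arange(14)
util = 60 + 1.8 * days + np.random.default_rng(1).normal(0, 1.5, 14)

slope, intercept = np.polyfit(days, util, 1)   # fit a straight-line trend
days_to_90 = (90 - intercept) / slope - days[-1]
print(f"~{days_to_90:.0f} days until this link crosses 90% at current growth")
```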

The result will be the breakdown of corporate data silos between different monitoring and observability platforms. You won’t need twenty-seven unique dashboards to discover your root cause.

Summary

Anyone who has built their own monitoring system will tell you it is not fun to be in a meeting where you are asked, “Why didn’t your system catch the issue?” The reality is that no one, not you and not the legacy monitoring vendors, will capture everything you need at the precise time you need it, no matter how many tags or customizations you add.

Even if your legacy tools could capture every piece of data possible, correlation and root cause remain exercises for the user to complete — an activity that involves jumping from one dashboard to the next, hoping each time that the next dashboard will be the one showing the root cause.

The next generation of monitoring and observability tools will fundamentally differ from the legacy tools and their years of false promises. These new platforms will federate data from all available sources — formatting, normalizing, and automatically labeling incoming information. They will allow for real-time analytics where anomalies are detected and correlations are discovered, and provide inputs to automation tools for alerts and service management. 
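
A minimal sketch of that federate-and-normalize step, with a made-up common schema and source formats (not any specific vendor’s pipeline):

```python
# Raw records arrive in per-source shapes (hypothetical examples).
snmp = {"host": "sw-nyc-01", "oid_cpu": 81, "ts": 1700000000}
syslog = {"device": "sw-nyc-01", "msg": "OSPF neighbor down", "time": 1700000003}

def normalize(record: dict) -> dict:
    """Map heterogeneous inputs onto one schema and auto-label the source."""
    if "oid_cpu" in record:
        return {"entity": record["host"], "ts": record["ts"],
                "kind": "metric", "name": "cpu_pct", "value": record["oid_cpu"],
                "labels": {"source": "snmp"}}
    return {"entity": record["device"], "ts": record["time"],
            "kind": "event", "name": "log", "value": record["msg"],
            "labels": {"source": "syslog"}}

events = [normalize(r) for r in (snmp, syslog)]
print(events)  # one schema, ready for correlation across sources
```

Once everything shares one schema and one set of labels, correlation and anomaly detection can run across sources instead of within each silo.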

Modern observability platforms will finally be vertically integrated, delivering the single pane of glass we’ve all been waiting for and paving the way for a revolution in observability.

