AI for Network Leaders — Powered by Selector

Join us in NYC on March 25th


Navigating External Outages: How Selector Cuts Through the Cloudflare Noise

Yesterday’s widespread Cloudflare outage reminds us how crucial external dependencies are to the stability of our own applications. When a key edge provider like Cloudflare goes down, the impact on your internal monitoring systems can look like a catastrophic internal system failure, triggering a massive storm of alerts and sending engineering teams into frantic, misdirected debugging sessions.

The difference between knowing and guessing during an outage isn’t just about response time. It’s about maintaining customer trust and making informed decisions when every second counts. Selector is designed to cut through this noise, rapidly identifying the true root cause as external and drastically reducing the time it takes to restore sanity. It turns a potential internal panic into a confident, swift response.

How Selector Specifically Assists During a Cloudflare Outage

When Cloudflare goes offline, your internal monitoring dashboards light up with red. The outage appears to be a total system failure because traffic has dropped to zero or error rates have spiked across the board. Selector uses AIOps, correlation, and synthetic monitoring to separate internal health from external failure.

1. Rapid Root Cause Isolation (Mean Time to Innocence)

When an edge failure occurs, the first instinct is to check internal servers. Selector provides an immediate answer, establishing your “Mean Time to Innocence.”

  • The Symptom: Your internal dashboards show alerts for zero traffic or high error rates across your entire application stack, suggesting an application crash.
  • What Selector Does: It intelligently correlates two critical data points:
    • Internal Metrics: Your server health checks are green/healthy.
    • Edge Metrics: Your ingress traffic has dropped to zero.
  • The Result: Selector immediately identifies that the App-to-Infrastructure path is healthy, but the Ingress (incoming) path is broken. It flags the issue as upstream at the edge (Cloudflare). This prevents your engineers from wasting hours debugging perfectly working internal code and shifts their focus to external dependency management.
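The correlation behind “Mean Time to Innocence” can be sketched as a simple decision over two signals: internal health checks and edge ingress volume. This is a minimal illustrative heuristic, not Selector’s actual model; the 90% drop threshold and function name are assumptions for the example.

```python
def classify_outage(internal_health_ok: bool, ingress_rps: float,
                    baseline_rps: float) -> str:
    """Classify an incident from two signals: server health checks and
    edge ingress traffic. Thresholds here are illustrative only."""
    traffic_drop = 1.0 - (ingress_rps / baseline_rps) if baseline_rps else 0.0
    if internal_health_ok and traffic_drop > 0.9:
        # Servers are green but almost no traffic is arriving:
        # the ingress (edge/CDN) path is the likely fault domain.
        return "external-edge"
    if not internal_health_ok:
        return "internal"
    return "healthy"

# Health checks green, traffic at zero against a 12k rps baseline:
print(classify_outage(True, 0.0, 12_000.0))  # -> external-edge
```

The key design point is that neither signal alone is conclusive: zero traffic with failing health checks points inward, while zero traffic with green health checks points at the edge.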

2. Noise Suppression (End Alert Storms)

A widespread external outage generates a massive wave of cascading alerts. Load balancers report health check failures, synthetic tests fail, and every application microservice reports an error spike because they are starved of traffic.

  • The Symptom: Your Network Operations Center (NOC) or Site Reliability Engineering (SRE) team is flooded with hundreds of individual alerts across every monitored component.
  • What Selector Does: Its powerful correlation engine groups these hundreds of low-level, symptomatic alerts into a single, high-fidelity event.
  • The Result: Instead of receiving 500 notifications for every failed microservice, the team receives a single, correlated insight indicating a massive drop in ingress traffic. This dramatically reduces alert fatigue and keeps the team focused on the single external root cause.
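The core idea of collapsing an alert storm can be sketched as grouping alerts that fire close together in time into one event. A real correlation engine also reasons over topology and causality; this time-window-only version, with hypothetical `Alert` fields, is just to make the 500-to-1 reduction concrete.

```python
from typing import NamedTuple

class Alert(NamedTuple):
    source: str   # e.g. "svc-checkout" (illustrative)
    symptom: str  # e.g. "error_spike"
    ts: float     # epoch seconds

def correlate(alerts, window_s=120):
    """Group alerts whose timestamps fall within one window into a single
    event. Illustrative: production correlation also uses topology."""
    alerts = sorted(alerts, key=lambda a: a.ts)
    events, current = [], []
    for a in alerts:
        if current and a.ts - current[0].ts > window_s:
            events.append(current)
            current = []
        current.append(a)
    if current:
        events.append(current)
    return events

# 500 microservice alerts firing within the same minute:
alerts = [Alert(f"svc-{i}", "error_spike", 1000.0 + i * 0.1) for i in range(500)]
events = correlate(alerts)
print(len(alerts), "alerts ->", len(events), "correlated event(s)")  # 500 alerts -> 1 correlated event(s)
```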

3. Synthetic & Path Monitoring

Selector can leverage data from existing synthetic monitoring tools (or utilize its own capabilities if configured) to perform active reachability testing.

  • What Selector Does: It can detect that synthetic probes bypassing Cloudflare (direct-to-origin) are succeeding, while corresponding probes routing through Cloudflare are failing.
  • The Result: This provides definitive, empirical proof of the outage source. This level of certainty allows your teams to confidently communicate the true status to internal and external stakeholders without waiting for a public confirmation from the provider.
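The probe comparison reduces to a truth table over two paths: direct-to-origin and via the CDN. This sketch, with assumed path names and labels, shows how the combination of results isolates the fault domain.

```python
def diagnose_paths(direct_ok: bool, via_cdn_ok: bool) -> str:
    """Interpret synthetic probe results for two paths (labels illustrative).
    direct_ok:  probe bypassing the CDN, straight to origin
    via_cdn_ok: probe routed through the CDN (e.g. Cloudflare)"""
    if direct_ok and not via_cdn_ok:
        # Origin reachable, CDN path broken: edge provider is the source.
        return "edge-provider-outage"
    if not direct_ok and not via_cdn_ok:
        return "origin-or-network-outage"
    return "healthy"

print(diagnose_paths(direct_ok=True, via_cdn_ok=False))  # -> edge-provider-outage
```

It is the disagreement between the two paths, not either result alone, that gives the definitive proof described above.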

4. Automated Remediation & ChatOps

Once the root cause is isolated, the incident response needs to be fast and decisive.

  • What Selector Does: It integrates with collaboration platforms like Slack and Microsoft Teams to push a natural language summary directly to your incident channel. For example: “Traffic anomaly detected: 100% drop in ingress. Internal services green. Correlated with Cloudflare reachability errors.”
  • The Result: This instant, accurate communication facilitates immediate decision-making. If your organization has a multi-CDN setup or a break-glass DNS strategy, the platform can either directly trigger or prompt an engineer to initiate a DNS swing to route traffic away from the failed provider to a backup or origin directly.
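The natural language summary pushed to the incident channel can be thought of as templating the correlated facts into a webhook payload. This is a hedged sketch: the `{"text": ...}` shape follows the common Slack incoming-webhook convention, but field names for your platform may differ, and the function is hypothetical.

```python
import json

def build_chatops_summary(drop_pct: float, internal_ok: bool,
                          correlated_with: str) -> dict:
    """Compose a chat payload from the correlated incident facts.
    Payload schema is illustrative; check your platform's webhook docs."""
    status = "green" if internal_ok else "degraded"
    text = (f"Traffic anomaly detected: {drop_pct:.0f}% drop in ingress. "
            f"Internal services {status}. Correlated with {correlated_with}.")
    return {"text": text}

payload = build_chatops_summary(100, True, "Cloudflare reachability errors")
print(json.dumps(payload))
```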

5. Automated Incident Creation and Ticketing

A critical step in managing any major outage is the creation of a formal incident record. Selector automates this process to ensure no time is wasted in documentation.

  • Intelligent Ticketing: Selector automatically creates incident tickets in popular IT Service Management (ITSM) platforms like ServiceNow, Jira Service Management, or PagerDuty as soon as the correlated external outage event is confirmed.
  • Pre-populated Details: The tickets are not empty; they are pre-populated with crucial, contextual information, including:
    • The single correlated root cause (e.g., “External Ingress Failure – Cloudflare Outage Detected”).
    • The scope of impact (which services are affected).
    • A link to the ChatOps channel where the automated summary was posted.
    • The initial severity level based on the traffic drop percentage.
  • Eliminating Manual Handoffs: This automation removes the delay and potential for error that come with an operator manually creating a ticket, accelerating the formal start of the incident response process.
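Pre-populating a ticket amounts to mapping the correlated event onto the ITSM record’s fields, including deriving an initial severity from the traffic drop. The severity bands and field names below are assumptions for illustration, not Selector’s or ServiceNow’s actual schema.

```python
def severity_from_drop(drop_pct: float) -> int:
    """Map ingress traffic drop to an initial severity (illustrative bands)."""
    if drop_pct >= 90:
        return 1  # critical: near-total loss of ingress
    if drop_pct >= 50:
        return 2
    if drop_pct >= 10:
        return 3
    return 4

def build_ticket(drop_pct: float, affected_services: list, chat_url: str) -> dict:
    """Assemble a pre-populated incident record (field names illustrative)."""
    return {
        "title": "External Ingress Failure - Cloudflare Outage Detected",
        "severity": severity_from_drop(drop_pct),
        "impact_scope": affected_services,
        "chatops_link": chat_url,
    }

ticket = build_ticket(100, ["checkout", "search", "api-gateway"],
                      "https://example.chat/incident-123")
print(ticket["title"], "sev", ticket["severity"])
```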

Integrated Incident Workflow and Tracking

Once the incident is created, Selector maintains its role as the source of truth, centralizing information flow and tracking progress.

  • Real-Time Status Updates: Selector continues to monitor the external provider’s status and the ingress metrics. Any change (e.g., a partial recovery, or a full return to normal) is used to automatically update the incident ticket’s status and the related ChatOps message.
  • Timeline Generation: The platform automatically logs key events and actions within the incident record—for instance, the time the correlation occurred, the time the automated remediation was suggested, and the time the external provider announced restoration. This is invaluable for generating accurate post-incident reviews (PIRs).
  • Team Handoff Management: By integrating with alerting tools, Selector ensures the appropriate on-call personnel (e.g., NOC, SRE, or even Communications) are notified specifically about the external nature of the outage, allowing for faster task delegation and minimizing misdirected escalations.
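The timeline generation described above is essentially an append-only log of timestamped incident events that can be replayed for the post-incident review. A minimal sketch, with a hypothetical class name:

```python
import time

class IncidentTimeline:
    """Append-only event log for post-incident reviews (illustrative)."""
    def __init__(self):
        self.entries = []

    def log(self, event, ts=None):
        # Record the event at an explicit timestamp, or "now" by default.
        self.entries.append((ts if ts is not None else time.time(), event))

    def report(self):
        # Chronological, human-readable timeline for the PIR.
        return [f"{ts:.0f}: {event}" for ts, event in sorted(self.entries)]

t = IncidentTimeline()
t.log("correlation identified external ingress failure", ts=10)
t.log("automated remediation suggested", ts=95)
t.log("provider announced restoration", ts=3600)
print("\n".join(t.report()))
```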

🔑 How Selector Helps Reduce Pain and Alerts for Teams

By leveraging AIOps and advanced correlation, Selector transforms a chaotic, internal-looking incident into a controlled, externally focused response.

  • Reduce “Mean Time to Innocence” (MTTI) from Hours to Minutes: Engineers spend less time debugging internal code that is working fine.
  • Suppress Alert Storms: Hundreds of cascading alerts are consolidated into a single, actionable event, preventing alert fatigue.
  • Shift Focus from Internal Debugging to External Mitigation: Teams are immediately focused on managing the external dependency rather than hunting ghosts in their own code.
  • Provide Definitive Proof of Outage Source: Synthetic monitoring data gives clear evidence, boosting confidence in stakeholder communications.
  • Enable Automated/Prompted Remediation: Facilitates fast DNS changes or traffic shifts to a backup provider through ChatOps integrations.
  • Maintain Sanity and Focus: The clear, concise communication prevents panic and ensures the right people are working on the right solution immediately.

Would you like to see a demonstration of how Selector can ingest your current monitoring data to provide this kind of correlated insight? Get a demo here

