Network Health and Routing Analytics

By Javier Antich, Selector

It is 7:00pm on Thursday…

Your customers are mostly at home, finishing dinner and ready to watch their favorite Netflix series. Your network is what makes that possible, and you have redundant designs to cover for eventualities, so it should be an easy evening. One of the aggregation routers in Chicago fails, but you have the other one, and it is dimensioned to handle all the traffic, so there is no reason to worry. Then the phone starts ringing: customers complain that they cannot watch Netflix and that the Internet is down. How come? Everything seems to be OK. BGP sessions are up, yet you see a drop in traffic. What is going on? It is no longer an easy evening; get ready to find the needle in the haystack.

I have certainly made the above a bit dramatic, but I suspect it looks familiar to many of you. How many times was redundancy not there when you needed it? By the time you find out, it is too late. Redundancy is not just about redundant physical components or paths. It is also about having the right state, so that those redundant components or paths are effective when they are needed.

Active mechanisms are needed to make sure the state of your network, including the active and redundant components, is consistent and honors your network intent.

At Selector, we are entirely focused on making the life of an operations engineer easier. It is not just about helping you find the needle in the haystack (which can be challenging) but about making sure you never get to the point where that search is necessary.

With the Network Health and Routing Analytics package in Selector Analytics, network operations engineers can now activate proactive mechanisms that closely track the network state that is critical for service delivery:

  • Is there reachability to the critical locations and services?
  • Is there any latency and/or jitter issue in the connectivity towards those critical locations or services?
  • Has the path, or the number of hops, towards those critical locations changed?
  • Is there any anomaly in my BGP state towards critical locations or services?
  • Is there consistency in my BGP state towards critical locations or services across sites or redundant resources?

These are some of the questions that Selector Analytics can answer proactively to help identify connectivity issues before they turn into a service outage. Our synthetic testing agents can be deployed anywhere in the infrastructure, in public cloud locations, or embedded in network elements such as routers or switches. With them, Selector Analytics proactively verifies connectivity and uses our ML-based anomaly detection algorithms to flag connectivity anomalies before they have a service impact.
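To make the idea of flagging a latency anomaly concrete, here is a minimal sketch in Python. It is not Selector's algorithm: it swaps in a simple rolling z-score as a stand-in for ML-based detection, and the function name, window size, threshold, and sample values are all assumptions chosen for illustration.

```python
from statistics import mean, stdev

def latency_anomalies(samples_ms, window=10, threshold=3.0):
    """Return indices of samples that deviate strongly from the preceding window.

    Hypothetical helper: a rolling z-score, used here only as a simple
    stand-in for a real anomaly detection model.
    """
    anomalies = []
    for i in range(window, len(samples_ms)):
        baseline = samples_ms[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # Flat baseline: any deviation at all counts as an anomaly.
            if samples_ms[i] != mu:
                anomalies.append(i)
            continue
        if abs(samples_ms[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies

# Made-up round-trip times (ms) from a synthetic probe, with one spike.
rtt = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7, 20.0, 20.2, 20.1, 45.0, 20.0]
print(latency_anomalies(rtt))
```

A fixed threshold like this ignores daily or weekly seasonality, which is one reason a production system would rely on learned baselines instead.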

On top of that, Selector Analytics can now ingest and analyze routing information using BMP (BGP Monitoring Protocol). By analyzing BGP data, we can proactively detect routing consistency anomalies and enable policies to track the state of critical services.
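As an illustration of what a routing consistency check can look like, the sketch below compares the prefixes known to two redundant routers against a list of critical prefixes. It assumes the per-router route tables have already been extracted from the routing feed; the router names, prefixes (taken from documentation ranges), and the `missing_prefixes` helper are hypothetical and do not represent Selector's API.

```python
def missing_prefixes(ribs, critical_prefixes):
    """Map each router to the critical prefixes absent from its route table.

    ribs: dict of router name -> set of prefixes that router currently holds.
    Routers that hold every critical prefix are left out of the report.
    """
    report = {}
    for router, prefixes in ribs.items():
        missing = sorted(set(critical_prefixes) - set(prefixes))
        if missing:
            report[router] = missing
    return report

# Hypothetical snapshot: the backup aggregation router lacks one critical route.
ribs = {
    "chi-agg-1": {"203.0.113.0/24", "198.51.100.0/24"},
    "chi-agg-2": {"203.0.113.0/24"},
}
critical = ["203.0.113.0/24", "198.51.100.0/24"]
print(missing_prefixes(ribs, critical))
```

Run continuously, a check like this would surface the missing route on the backup router while the primary is still healthy, rather than at 7:00pm on a Thursday.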

With Selector Analytics, network operations engineers can be proactively notified of anomalies or inconsistencies in the routing state that could lead to service degradation or impact. Such inconsistencies may result from router misconfigurations or other types of network issues.

Frequently, especially when it comes to BGP, reachability issues in external networks may impact our ability to reach certain cloud services like Netflix, Facebook, or others (remember the recent Facebook outage). In those cases there is not much you can do to solve the problem, but you are likely to receive calls from your customers anyway, so the sooner you can identify what is happening, the faster you will be able to articulate a response plan.

You may need to go back in time to evaluate when the problem started and what happened. Our Routing Analytics enables forensic analysis by allowing you to review routing state changes alongside the associated network telemetry.

Network and IT operations will always be a game of both prevention and reaction. At Selector, we are developing capabilities to strengthen both dimensions. MTTD and MTTR are key performance indicators for a network operation, but they are primarily associated with the reactive side of the equation. We also need to consider the MTBF: proactive mechanisms such as the ones introduced in Selector Analytics help customers prevent failures and, consequently, increase the MTBF, which directly improves customer satisfaction and reduces churn.