By Javier Antich, Selector
No, I’m not advocating that we jump into a new society where there are no rules; that would be catastrophic.
We know what “rules” are in network and IT operations. They are all the things we need to program to define the operational boundaries of our infrastructure. We use rules to set acceptable and unacceptable thresholds on metrics, to parse log messages and extract information from them, and to classify what we find. We also use rules to determine whether configurations are right or wrong and to define correlation conditions. Rules are everywhere. So far, they have been our only means of capturing operational knowledge and putting it to work.
There is nothing inherently wrong with rules, but they come with extra baggage that makes them very difficult to maintain over time:
- The time and effort it takes to create them in the first place.
- The time and effort it takes to understand existing rules that someone else created a year ago.
- The time and effort it takes to update them.
- The incredible complexity of the rules needed for multi-cloud and highly dynamic microservices IT operations.
Sometimes rules are created on the assumption that we know what a good or bad state is. Frequently, what is good or bad is so dynamic and contextual that the rule fails to capture it, leading to many false positives and alert fatigue.
Very frequently, we waste a lot of time reverse-engineering the goal of a particular rule because no one documented it adequately, so it is unclear what the rule does and what the impact would be if it were changed or removed. Snowflakes multiply, and no one wants to touch them.
How often does a regex rule written to process a particular type of log become outdated because the log format changed slightly without prior notice from the vendor? We are wasting a lot of time statically setting operational boundaries that we could set dynamically using a different approach. There is a lot of operational knowledge embedded in our infrastructure telemetry. We need techniques that allow us to extract that knowledge and put it to work.
- We do not need rules to define static thresholds that differentiate good from bad. We can infer from the data itself what is expected and what is not in a specific context, given the hour of the day, the day of the week, or the month of the year, and automatically determine the proper thresholds. There is no need to waste our time on this.
- We do not need rules anymore to parse log messages. Natural language processing techniques allow us to extract critical information from a log regardless of whether its syntax changes over time. Say goodbye to regex rules.
- We do not need rules anymore to classify messages or alerts. Deep learning techniques allow us to understand the scope of a message and classify it accordingly. Say goodbye to regex rules here as well.
- We do not need rules anymore to correlate events. If-this-then-that correlation rules belong to the past. Multi-cloud infrastructure is multidimensional, and so must be the correlation. Context-based correlation allows us to retire legacy correlation techniques that are fundamentally linear and were not designed to cope with multidimensional, multilayer infrastructures.
- And there is more; we do not need rules anymore to evaluate if a configuration change is right or wrong. We can use deep learning techniques to automatically understand what is right or wrong in a network or IT infrastructure and raise alerts when a configuration change does not fit.
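To make the first point in the list above concrete, here is a minimal sketch of a threshold inferred from the data rather than written by hand. It is only an illustration under stated assumptions, not a description of any particular product: the function names (`build_baseline`, `is_anomalous`, `synthetic_history`) are hypothetical, the telemetry is synthetic, and a plain mean and standard deviation per (day of week, hour) bucket stands in for whatever model a real system would use.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, stdev


def build_baseline(samples):
    """Group (timestamp, value) samples by (weekday, hour) and return
    {(weekday, hour): (mean, stdev)} for buckets with at least two samples."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {
        key: (mean(vals), stdev(vals))
        for key, vals in buckets.items()
        if len(vals) >= 2
    }


def is_anomalous(baseline, ts, value, k=3.0):
    """Flag a value more than k standard deviations away from what is
    normal for this weekday and hour. Unknown contexts are not flagged."""
    key = (ts.weekday(), ts.hour)
    if key not in baseline:
        return False
    mu, sigma = baseline[key]
    # Guard against a zero standard deviation in perfectly flat buckets.
    return abs(value - mu) > k * max(sigma, 1e-9)


def synthetic_history(weeks=4):
    """Synthetic data (an assumption for this sketch): traffic is about 100
    during weekday business hours and about 10 at all other times."""
    start = datetime(2024, 1, 1)  # a Monday
    samples = []
    for week in range(weeks):
        for day in range(7):
            for hour in range(24):
                ts = start + timedelta(weeks=week, days=day, hours=hour)
                base = 100 if day < 5 and 9 <= hour < 17 else 10
                samples.append((ts, base + (week % 3 - 1)))  # small jitter
    return samples
```

With this baseline, the same reading of 100 is normal on a Monday at 10:00 and anomalous on a Monday at 03:00, which is exactly the kind of context a single static threshold cannot express.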
What I’m describing here is not only possible; it is real, enabled by AI/ML techniques. It is time for a new “rule-free” era for network and IT operations. Do not waste your time defining rules or reverse-engineering someone else’s rules. The future is about you investing your time in how to optimize, evolve and scale your infrastructure to deliver better services, not defining rules to set operational boundaries.