Overview
Incident Management (IcM) describes the activities an organization undertakes to identify, analyze, and correct hazards or outages and to prevent their recurrence.
If not managed, an incident can escalate into an emergency, crisis or a disaster.
The primary goal of the incident management process is to restore normal service operation as quickly as possible and to minimize the impact on business operations, so that the best possible levels of service quality and availability are maintained.
Introducing machine learning for pattern recognition and event correlation can significantly reduce the time and resources required to diagnose and mitigate incidents. It can also draw on audit trails from thousands of incidents to recommend and simulate solutions that account for an incident's many dependencies.
Example Scenario
Audit trail from an incident log:
- Issue was mitigated once the uplinks were shut and nodes isolated.
- 86 subscriptions were identified as impacted.
- Impacted customers would have experienced loss of connectivity to and from a subset of their VMs on this cluster.
- AzComm has toasted the impacted subscriptions via their management portal.
- 1 CRI [mitigated] / No Social - RCA would be owned by <Name>Team.
- <Incident ID URL> has been created for tracking purposes.
Kindly investigate the issue with the TOR.
Currently, mitigation steps like these are not written as a series of actions that someone else can go and execute. They are a summary of a few things the engineers tried, and this varies per incident.
I am a new DRI (Directly Responsible Individual), and this incident has been shown to me as similar to mine. I want to know how the issue was solved, but the mitigation steps as written are not actions I can take.
Upon closer inspection, we see that multiple commands were run to isolate the nodes. This information is scattered within the discussion forum and is hard to find for any DRI who needs the steps fast.
Suppose we had a Writing Assistant that pre-populated a basic template, for example:
- I used <command> for <function>
- We have the commands, which can be drawn from the knowledge base.
- We need the user to describe what the command does so we can map the two.
- Once we see the same command being used again, we can auto-fill more of the mitigation step template for them.
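The template idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the product's implementation: the `WritingAssistant` class, its method names, and the example command are all hypothetical, and the knowledge base is modeled as a plain in-memory dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict

# Template from the scenario: "I used <command> for <function>".
TEMPLATE = "I used {command} for {function}"
PLACEHOLDER = "<function>"


@dataclass
class WritingAssistant:
    # Hypothetical knowledge base: command text -> what the command does.
    knowledge_base: Dict[str, str] = field(default_factory=dict)

    def needs_description(self, command: str) -> bool:
        """True when no description has been mapped to this command yet."""
        return command.strip() not in self.knowledge_base

    def describe(self, command: str, function: str) -> None:
        """Store the DRI's description so the command and its purpose are mapped."""
        self.knowledge_base[command.strip()] = function.strip()

    def draft_step(self, command: str) -> str:
        """Pre-populate one mitigation step; unknown commands keep the placeholder."""
        command = command.strip()
        function = self.knowledge_base.get(command, PLACEHOLDER)
        return TEMPLATE.format(command=command, function=function)


assistant = WritingAssistant()
cmd = "isolate-node --cluster C1"  # made-up command, for illustration only

# First use: the assistant can only fill in the command.
print(assistant.draft_step(cmd))  # I used isolate-node --cluster C1 for <function>

# The DRI explains the command once...
assistant.describe(cmd, "isolating the affected nodes")

# ...and the next time the same command appears, the step is auto-filled.
print(assistant.draft_step(cmd))  # I used isolate-node --cluster C1 for isolating the affected nodes
```

Keeping the description step explicit mirrors the scenario: the system can extract commands on its own, but it relies on the engineer to say what each one is for.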
What if the algorithm could also simulate these steps and advise on the best solution based on all the dependencies related to that incident?
The Challenge
Microsoft Azure has a few thousand services that are linked in a complex hierarchy within its organization across the globe. These services are used both internally and externally, and their outages can cause major disruptions for customers.
Current IcM solutions have a few issues:
- Lack of context about the incident.
- No suggestions on how to fix the issues based on similar incidents.
- No prioritization or severity levels.
- No way to escalate the incident to a crisis.
- No visibility for the leadership teams.
- Ineffective mitigation methodology, with many resources tied up in redundant tasks.
The Solution
- Improving communication between teams, and using machine learning to make the information needed to resolve major incidents discoverable.
- Managing on-call solutions across multiple teams with ease.
- Enhanced analytics to identify bottlenecks in mitigating incidents and to build reusable solution patterns for fixing outages.
- Setting up operational rules that monitor for and report potential outages in real time, giving users a way to resolve them pre-emptively, before they occur.
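To make the last point concrete, an operational rule could be as simple as "warn when a metric stays above a threshold for several samples in a row". The Python sketch below is an assumption-laden illustration, not the actual IcM rules engine: the `Rule` fields, the function names, and the `packet_loss_pct` metric are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Rule:
    name: str          # human-readable rule name
    metric: str        # hypothetical metric key to watch
    threshold: float   # warn when the metric is above this value...
    consecutive: int   # ...for this many samples in a row

    def __post_init__(self) -> None:
        if self.consecutive < 1:
            raise ValueError("consecutive must be at least 1")


def breached(rule: Rule, samples: Sequence[float]) -> bool:
    """True if the most recent `rule.consecutive` samples all exceed the threshold."""
    if len(samples) < rule.consecutive:
        return False
    recent = samples[-rule.consecutive:]
    return all(value > rule.threshold for value in recent)


def evaluate(rules: Sequence[Rule], metrics: Dict[str, Sequence[float]]) -> List[str]:
    """Return one warning message per breached rule."""
    return [
        f"{rule.name}: {rule.metric} above {rule.threshold} for {rule.consecutive} samples"
        for rule in rules
        if breached(rule, metrics.get(rule.metric, []))
    ]


rules = [Rule("Uplink packet loss", "packet_loss_pct", threshold=5.0, consecutive=3)]
metrics = {"packet_loss_pct": [0.4, 6.1, 7.8, 9.3]}  # made-up samples

for warning in evaluate(rules, metrics):
    print(warning)  # Uplink packet loss: packet_loss_pct above 5.0 for 3 samples
```

Requiring several consecutive breaches is one simple way to avoid warning on a single noisy sample; a real rule set would likely need richer conditions.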
My Role
- IcM incident creation, transfer, mitigation, post-mortem, on-call, smart assistant, rules management, outages.
- I was the lead designer for this product and employed the following methods:
  - Contextual walkthroughs to learn how Directly Responsible Individuals (DRIs) resolve incidents.
  - Evaluation of ITIL methodologies to identify best practices and UX opportunities.
  - Collaboration with key stakeholders and input from SMEs to identify Microsoft service processes and third-party integration points.
  - Creation of complex interactive prototypes to simulate system interactions, incident resolutions, and configuration.
  - Work towards a new visual language (Fluent), informed by existing design patterns.
  - Rigorous user research and testing to validate visual metaphors and conceptual frameworks.