Incident Management

Incident: "an event that causes disruption to or a reduction in the quality of a service which requires an immediate response."

This may include (but is not limited to):

Developer notices unexpected behaviour/degraded service.
Alert email from GCP, Gitlab, or any other alerting service received.
External user raises an issue via email, Teams, or the service desk.
A deployment pipeline fails.

The incident protocol needs to be followed as soon as possible after alerts are received. This does not necessarily mean that the problem must be entirely resolved straight away, but the incident should be tracked and handled immediately.

The process to handle incidents is as follows:

Raise an incident¹ in GitLab in the iam/admin project using the "Incident" issue template, then post in the Identity Teams channel.
1. The incident does not need to be fleshed out; a basic issue description and any initial findings are enough.
2. All members of the team should avoid investigating or posting any logs/observations in the teams channel.
3. Add the incident to the current iteration, and label with workflow::In Progress.
One developer takes the incident, decided via discussion in the Teams channel, and assigns themselves. Announce ownership of the incident in the Teams channel.
The assigned developer processes the initial incident issue.
1. Assess the incident status
  1. Not urgent - Business can proceed as usual OR workaround available OR incident has self-resolved - move directly to step 4.
  2. Resolve internally - Prevents more than 1 user performing non-critical work OR prevents further development work - continue with steps below.
  3. Urgent - Prevents more than 100 users performing non-critical work OR prevents more than 1 user performing critical work - organise a brief meeting with the team to discuss whether this needs to be raised to the UIS Major Incident Process or whether it should be resolved internally.
2. Investigate and attempt to resolve the incident.
  1. The assigned developer remains the main responsible developer and should involve others as needed, reaching out to subject experts whenever it is sensible.
  2. Document findings in incident comments as much as possible.
  3. While the incident is live, keep the incident ticket up to date with observations, actions taken and the reasoning behind actions, as well as the current status of the service(s).
  4. Raise MRs for code changes if needed, and request reviews in the Teams channel thread.
  5. For any click-ops changes (gitops style changes are preferred) and manual maintenance ensure at least two developers are involved and are directly communicating (via a Teams call or similar) to verify any decisions made.
3. The issue is resolved (but not closed) when:
  1. Issue is no longer occurring.
  2. The cause of the incident is identified (or the limit of understanding reached).
  3. If rollbacks have occurred, trunk/main/master of affected repos are safe to re-deploy (i.e. the incident won’t be retriggered when pushing changes).
  4. The teams channel thread should then be notified of the status of the incident.
4. Remove the incident from the iteration, and label with workflow::Review Required to be processed in the incident review meeting.
Raise secondary issues as required. All of these should be linked to the initial incident issue and labelled with issuetype::Maintenance and incident-remedial. These should be tagged for refinement, and not be added directly to a sprint. The incident will be discussed in the next incident backlog.
1. If the issue has been hotfixed, but a more stable long-term solution would be preferable, raise an issue to do this.
2. If the issue has self-resolved or is not urgent, raise an issue to investigate and resolve. Tag this issue as a bug.
3. If the incident was determined to not actually be an incident, consider an issue to improve alerting (if source was alert).
4. If the issue was a result of poor documentation, or errors resulting from protocol not being followed, raise an issue to provide or improve the documentation.
5. If the issue was completely resolved during step 3, and none of the above items are relevant, do not raise a follow-on issue.
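
The status assessment described in the process above can be sketched as a small function. This is only an illustration under stated assumptions: the `Impact` fields and the `assess` function are hypothetical names, and the order in which the criteria are checked (urgent first, then not urgent, then resolve internally) is an assumption, since the protocol does not say which category wins when several apply.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    NOT_URGENT = "Not urgent"
    RESOLVE_INTERNALLY = "Resolve internally"
    URGENT = "Urgent"


@dataclass
class Impact:
    # Hypothetical field names mirroring the criteria in the protocol.
    users_blocked_non_critical: int = 0   # users unable to do non-critical work
    users_blocked_critical: int = 0       # users unable to do critical work
    blocks_development: bool = False      # further development work is prevented
    workaround_available: bool = False
    self_resolved: bool = False


def assess(impact: Impact) -> Status:
    """Map an observed impact onto one of the three incident statuses."""
    # Urgent: more than 100 users blocked on non-critical work,
    # or more than 1 user blocked on critical work.
    if impact.users_blocked_non_critical > 100 or impact.users_blocked_critical > 1:
        return Status.URGENT
    # Not urgent: a workaround exists or the incident has self-resolved.
    # (Assumption: this takes precedence over "resolve internally".)
    if impact.workaround_available or impact.self_resolved:
        return Status.NOT_URGENT
    # Resolve internally: more than 1 user blocked on non-critical work,
    # or further development work is prevented.
    if impact.users_blocked_non_critical > 1 or impact.blocks_development:
        return Status.RESOLVE_INTERNALLY
    # Otherwise business can proceed as usual.
    return Status.NOT_URGENT
```

For example, `assess(Impact(users_blocked_non_critical=5))` returns `Status.RESOLVE_INTERNALLY`, while the same impact with `workaround_available=True` returns `Status.NOT_URGENT`. The sketch is an aid to discussion, not a replacement for the assigned developer's judgement.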

Incident review

Incidents are closed during the fortnightly incident review meeting. As part of this meeting we assess the overall status of the incident, discuss any additional concerns or problems as a team, and finalise any closing actions (usually raising any further maintenance tickets as in item 4 above).

The board of incidents to be reviewed is available in our administration wiki.
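
As a complement to the board, the same review queue could be listed through the GitLab REST API. The sketch below only builds the request URL for the "list project issues" endpoint, filtering on the `workflow::Review Required` label. The host `gitlab.example.com` and the function name `review_queue_url` are placeholders, and it assumes that incidents awaiting review are still open and were raised with the "incident" issue type.

```python
from urllib.parse import quote, urlencode

GITLAB_URL = "https://gitlab.example.com"  # placeholder host, not the real instance
PROJECT_PATH = "iam/admin"                 # project named in the protocol


def review_queue_url(base_url: str = GITLAB_URL, project_path: str = PROJECT_PATH) -> str:
    """Build the API URL listing open incidents that await review."""
    # The API accepts a URL-encoded project path in place of a numeric ID.
    project_id = quote(project_path, safe="")
    query = urlencode(
        {
            "issue_type": "incident",
            "labels": "workflow::Review Required",
            "state": "opened",
        },
        quote_via=quote,  # encode spaces as %20 rather than '+'
    )
    return f"{base_url}/api/v4/projects/{project_id}/issues?{query}"


if __name__ == "__main__":
    print(review_queue_url())
```

An authenticated GET request to the resulting URL (for instance with a personal access token) would return the matching issues as JSON; authentication is left out of the sketch deliberately.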

Purpose of the protocol

The incident protocol exists and is written as above so that:

We separate immediate critical work from less-critical follow-on work, reducing the impact on normal sprint work while still allowing us to deal with problems when they arise.
We reduce duplicate effort, i.e. we want to avoid the entire team stopping work to investigate an incident.
We respond to problems quickly.
We share skills and knowledge of systems between team members.
We have clear and (relatively) compact rules to follow to handle these problems.

¹ The issue should be raised as an "incident" type rather than a standard issue. See the GitLab documentation.