Incident and support management¶

This page describes Team Wilson's approach to handling both incidents and support calls.

The rota (accessible to Team Wilson only) specifies who is on call for support & incident management each day. If a team member is on call for these, they should be:

Monitoring the GCP and other alerts coming in via email, following the steps for incident management.
Monitoring the Wilson HALO queue and handling first line support triage, following the steps for support management.
Monitoring the "API Gateway - Notifications" Teams channel, and actioning the access requests posted there.
Monitoring other open communication channels with separate UIS teams.

They are responsible for ensuring that every incident & support call is being handled by a member of the team.

They are not responsible for handling every incident & support call personally.

Team members who are not on call should not feel they have to wait for the on-call team member to raise incidents. If you spot a potentially serious alert, start the incident management process immediately: it is always better to raise an incident than to ignore a potentially serious problem.

Incident management¶

Incident: "an event that causes disruption to or a reduction in the quality of a service which requires an immediate response."

This may include (but is not limited to):

A developer notices unexpected behaviour or degraded service.
An alert email is received from GCP, GitLab, or any other alerting service.
An external user raises an issue via email, Teams, or the service desk.
A deployment pipeline fails.

The incident protocol needs to be followed as soon as possible after alerts are received. This does not necessarily mean that the problem needs to be entirely resolved straight away, but the incident should be tracked and handled immediately.

The process to handle incidents is as follows:

Raise an incident¹ in GitLab in the iam/admin project using the "Incident" issue template, then post in the "Wilson - Devs" Teams channel.
1. The incident does not need to be fleshed out; a basic issue description and any initial findings are enough.
2. All members of the team should avoid investigating or posting any logs/observations in the teams channel.
3. Add the incident to the current iteration, and label with workflow::In Progress.
One developer takes the incident, decided via discussion in the Teams channel, and assigns themselves. Announce ownership of the incident in the Teams channel.
The assigned developer processes the initial incident issue.
1. Assess the incident status
  1. Not urgent - Business can proceed as usual OR workaround available OR incident has self-resolved - move directly to step 4.
  2. Resolve internally - Prevents more than 1 user performing non-critical work OR prevents further development work - continue with steps below.
  3. Urgent - Prevents more than 100 users performing non-critical work OR prevents more than 1 user performing critical work - organise a brief meeting with the team to discuss whether this needs to be raised to the UIS Major Incident Process or should be resolved internally.
2. Investigate and attempt to resolve the incident.
  1. The assigned developer involves others as needed and, as the main responsible developer, should reach out to subject experts whenever it is sensible.
  2. Document findings in incident comments as much as possible.
  3. While the incident is live, keep the incident ticket up to date with observations, actions taken and the reasoning behind actions, as well as the current status of the service(s).
  4. Raise MRs for code changes if needed, and request reviews in the Teams channel thread.
  5. For any click-ops changes (GitOps-style changes are preferred) and manual maintenance, ensure at least two developers are involved and are directly communicating (via a Teams call or similar) to verify any decisions made.
3. The incident is resolved (but not closed) when:
  1. Issue is no longer occurring.
  2. The cause of the incident is identified (or the limit of understanding reached).
  3. If rollbacks have occurred, trunk/main/master of affected repos are safe to re-deploy (i.e. the incident won’t be retriggered when pushing changes).
  4. The teams channel thread should then be notified of the status of the incident.
4. Remove the incident from the iteration, and label with workflow::Review Required to be processed in the incident review meeting.
Raise secondary issues as required. All of these should be linked to the initial incident issue and labelled with issuetype::Maintenance and incident-remedial. These should be tagged for refinement, and not be added directly to a sprint. The incident will be discussed in the next incident backlog.
1. If the issue has been hotfixed, but a more stable long-term solution would be preferable, raise an issue to do this.
2. If the issue has self-resolved or is not urgent, raise an issue to investigate and resolve. Tag this issue as a bug.
3. If the incident was determined to not actually be an incident, consider an issue to improve alerting (if source was alert).
4. If the issue was a result of poor documentation, or errors resulting from protocol not being followed, raise an issue to provide or improve the documentation.
5. If the issue was completely resolved during step 3 and none of the above items are relevant, do not raise a follow-on issue.
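
The status assessment in step 1 can be sketched as a small function. This is only an illustration of the thresholds listed above, not a tool the team uses: the names `Severity` and `assess_incident` are hypothetical, and the order in which the criteria are checked (the "not urgent" conditions first, then "urgent", then "resolve internally") is an assumption, since the protocol does not say which wins when several apply.

```python
from enum import Enum


class Severity(Enum):
    NOT_URGENT = "not urgent"
    RESOLVE_INTERNALLY = "resolve internally"
    URGENT = "urgent"


def assess_incident(
    users_blocked_non_critical: int,
    users_blocked_critical: int,
    blocks_development: bool = False,
    workaround_available: bool = False,
    self_resolved: bool = False,
) -> Severity:
    """Map the step 1 criteria onto a status (hypothetical helper)."""
    # Assumption: a workaround or self-resolution takes precedence.
    if workaround_available or self_resolved:
        return Severity.NOT_URGENT
    # Urgent: more than 100 users blocked on non-critical work,
    # or more than 1 user blocked on critical work.
    if users_blocked_non_critical > 100 or users_blocked_critical > 1:
        return Severity.URGENT
    # Resolve internally: more than 1 user blocked on non-critical
    # work, or further development work is blocked.
    if users_blocked_non_critical > 1 or blocks_development:
        return Severity.RESOLVE_INTERNALLY
    # Otherwise business can proceed as usual.
    return Severity.NOT_URGENT


if __name__ == "__main__":
    print(assess_incident(users_blocked_non_critical=150, users_blocked_critical=0))
```

In practice the counts are estimates, so the result is a starting point for the discussion in the Teams channel rather than a final decision.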

Incident review¶

Incidents are closed during the fortnightly incident review meeting. As part of this meeting we assess the overall status of the incident, discuss any additional concerns or problems as a team, and finalise any closing actions (usually raising any further maintenance tickets as in item 4 above).

The board of incidents to be reviewed is available in our administration wiki.

Purpose of the protocol¶

The incident protocol exists and is written as above so that:

We separate immediate critical work from less-critical follow-on work, reducing the impact on normal sprint work while allowing us to deal with problems when they arise.
We reduce duplicate effort, i.e. we want to avoid the entire team stopping work to investigate an incident.
We respond to problems quickly.
We share skills and knowledge of systems between team members.
We have clear and (relatively) compact rules to follow to handle these problems.

Support management¶

Support calls for our services are directed to us by the service desk team via the HALO platform.

If you are in doubt about any of the steps below, ask the rest of the team in the "Wilson - Devs" Teams channel, or message your team lead directly. The nature of support calls is that the steps to resolve them are often very different. People can ask anything, with varying degrees of clarity, and so a rigorous set of steps to follow exactly is impossible. Use your own judgement and ask the team for input whenever things are unclear.

Important

In communication with end-users and members of the service desk, always remain respectful and considerate.

The process to handle a support ticket in HALO is as follows:

Read through the email exchange in HALO and try to understand what the user is asking.
1. If the support call is unclear or lacking detail, assign the support call to yourself and ask some clarifying questions before continuing.
Assess whether the support call is for one of our services².
1. If the ticket appears to be for something unrelated to DevOps, assign the ticket back to the service desk user who passed it down with a note explaining where you think this might need to go, or why it's not related to our services.
2. If the ticket appears to be for a different DevOps team, assign the ticket to that team, and tag the original service desk member in a note explaining why this belongs with that DevOps team instead.
3. If the ticket appears to be for one of our services, continue to the next step.
Assess what needs to happen for the support call to be resolved, and assign someone to the ticket.
1. Most support calls are requests for information on our services.
  1. If you are comfortable answering the question, assign the support ticket to yourself, and communicate with the user via HALO, answering their question(s).
  2. If you do not know the answer to the question, ask in the "Wilson - Devs" channel. Either you (if the question is answered by another dev) or another member of the team with the relevant knowledge should then be assigned the support ticket and communicate with the user via HALO.
2. If the support call is a request for a feature or enhancement to a service, assign the support call to your team lead and let them know via a teams message or email.
3. Some support calls fall under one of our "standard procedures", where we have a known business process that is handled by us; see the list of standard procedures below.
4. If the support call is reporting a bug or issue with the service, assign the ticket to yourself and then begin the incident management process.
Once the ticket has been resolved, it can be closed.
1. It is not always 100% clear when a support ticket is "done". Look out for the following signs:
  1. The user replies confirming the problem is resolved.
  2. The user has not replied for >10 days.
  3. The question asked was clear, and has been succinctly answered.
2. It is easy for a user to re-open closed support tickets, so it is not necessary to seek permission from a user to close a ticket.
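
The closing criteria above can be expressed as a simple check. This is a minimal sketch under stated assumptions: `ticket_looks_done` and its parameters are hypothetical names, HALO is not queried in any way, and the ">10 days" rule is assumed to count from the date we started waiting on the user.

```python
from datetime import date, timedelta
from typing import Optional

# ">10 days" without a reply from the user.
INACTIVITY_LIMIT = timedelta(days=10)


def ticket_looks_done(
    today: date,
    user_confirmed_resolved: bool = False,
    clear_question_answered: bool = False,
    awaiting_reply_since: Optional[date] = None,
) -> bool:
    """Return True if any one of the three closing signs applies."""
    if user_confirmed_resolved:
        return True
    if clear_question_answered:
        return True
    # Assumption: the clock starts when we last asked the user something.
    if awaiting_reply_since is not None:
        return today - awaiting_reply_since > INACTIVITY_LIMIT
    return False


if __name__ == "__main__":
    print(ticket_looks_done(date(2024, 1, 12), awaiting_reply_since=date(2024, 1, 1)))
```

Any one sign is treated as sufficient here, which matches the note that users can easily re-open a closed ticket.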

Standard support procedures¶

There are a couple of standard requests and business processes that have to be handled by our team.

Access to Google Drive files that were shared, but whose owner has left.
Request to add or remove a user from the password app top-level admins list.

For access to Google Drive files, take the following steps:

Assign the ticket in HALO to yourself.
Raise an issue in the IAM admin project tagged with the issuetype::Support and team::Identity labels, assign it to yourself and add it to the current iteration. Include a link to the HALO ticket, and the CRSId(s) of users removed³.
Follow the process outlined in the operational documentation.
Inform the user via HALO that permissions have been restored, and that they must make a copy of the files as the permissions will be removed again in two weeks' time.
Once the two weeks are up, remove the permissions, and close both the HALO ticket and the issue in GitLab.
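
Because the permissions have to be removed again by hand, it can help to record the removal date in the GitLab issue when access is restored. A minimal sketch of that date calculation follows; `removal_due_date` is a hypothetical helper, not part of any existing tooling.

```python
from datetime import date, timedelta

# Access is restored for two weeks, as described in the steps above.
ACCESS_WINDOW = timedelta(weeks=2)


def removal_due_date(restored_on: date) -> date:
    """Return the date on which the restored permissions should be removed."""
    return restored_on + ACCESS_WINDOW


if __name__ == "__main__":
    # Example: access restored on 1 March 2024.
    print(removal_due_date(date(2024, 3, 1)).isoformat())
```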

For a request to add or remove a user from the password app top-level admins, take the following steps:

Assign the ticket in HALO to yourself.
Confirm that the requesting user is authorised to make this request: they must be a member of the User Administration team. If you receive a request to add a user to the top-level admins list from a CRSId you do not recognise, reach out to your team lead and/or the current User Admin head (Michelle Hollins, mar82).
Raise an issue in the IAM admin project tagged with the issuetype::Support and team::Identity labels, assign it to yourself and add it to the current iteration. Include a link to the HALO ticket, and the CRSId(s) of users being added or removed³.
Make changes to the Ansible deployment to add or remove the user's CRSId from the top-level admins list. (There are many example MRs to draw from for this activity; use these as a template.)
Deploy the changes to the password app.
Reply to the HALO ticket confirming the user has been added or removed, then close both the HALO ticket and the issue in GitLab.

API Gateway access requests¶

The process to deal with API gateway access requests is as follows:

Raise an issue in the IAM admin project tagged with the issuetype::Support and team::Identity labels, assign it to yourself and add it to the current iteration. Include details of the requesting user, and the app id requesting access.
Check that the created app conforms with the guidance in the getting started guide (i.e. belongs to a sensibly named team, not a personal account).
1. If the app does not belong to a team, email the requesting user advising them of the requirements for apps to be approved.
If the access request includes access to personal information, check that the requesting user is authorised to have it (they should be an IT or computer officer member of staff).
1. If the user is not authorised, email them asking why they need access to these APIs. Pass any legitimate-seeming requests over to your team lead to handle.
Add the app to the relevant permissions lists if required, and deploy this update.
1. This varies by API. For the Card API, apps must be added to the list of CARD_READER applications in the infrastructure, but this will vary by project. Refer to the README documentation.
Approve the app for the API access in the GCP console.
1. Note that there is no need to email the user to inform them this has happened; the API Gateway will inform them automatically.
Close the GitLab issue.
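
The two checks in the process above (team ownership of the app, and authorisation for personal information) can be summarised as a small decision function. This is a sketch only: `AccessRequest`, `NextAction` and `next_action` are hypothetical names, and the three boolean fields stand in for judgements that a person makes by looking at the request, not for values read from the API Gateway.

```python
from dataclasses import dataclass
from enum import Enum


class NextAction(Enum):
    EMAIL_APP_REQUIREMENTS = "email the user the requirements for app approval"
    EMAIL_ASK_JUSTIFICATION = "email the user asking why they need access"
    UPDATE_PERMISSIONS_AND_APPROVE = "update permissions lists if required, then approve"


@dataclass
class AccessRequest:
    app_owned_by_team: bool           # app belongs to a sensibly named team
    requests_personal_info: bool      # request covers personal information
    requester_is_it_staff: bool       # IT or computer officer member of staff


def next_action(request: AccessRequest) -> NextAction:
    """Pick the next step for an access request, in the order listed above."""
    if not request.app_owned_by_team:
        return NextAction.EMAIL_APP_REQUIREMENTS
    if request.requests_personal_info and not request.requester_is_it_staff:
        return NextAction.EMAIL_ASK_JUSTIFICATION
    return NextAction.UPDATE_PERMISSIONS_AND_APPROVE


if __name__ == "__main__":
    print(next_action(AccessRequest(True, True, False)).value)
```

A request that reaches `EMAIL_ASK_JUSTIFICATION` and turns out to be legitimate is still passed to the team lead, as described above, rather than approved directly.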

¹ The issue should be raised as an "incident" type rather than a standard issue. See the GitLab documentation.
² The service desk team will occasionally mis-triage tickets onto our team; this is normal and to be expected. It is very difficult for the service desk to always interpret support calls accurately, and mistakes are unavoidable.
³ It is often convenient to re-use the same issue for multiple requests if many arrive while you are on support.