How to respond to billing alerts¶
This guide shows you how to respond to our Google Cloud Platform (GCP) email billing alerts. These alerts are generally handled by the person on the Cloud Team monitoring rota.
Convert the email alert to a GitLab issue¶
At the time of writing, our GCP billing alerts are sent via email to the cloud@uis.cam.ac.uk email address. The first task is to manually raise a corresponding GitLab issue against the gcp-product-factory project, including the details from the email alert. This issue can then be worked on using our standard issue lifecycle workflow.
Alert: 100% of budget expected to be reached¶
As detailed in the Budgets and Alerts section, this is the first of two default billing alerts that we configure. When this alert is received for a particular product the steps below should be followed.
Identify any recent increase in usage¶
The first stage is to gather information for the product in question relating to any increased usage over recent months using the following steps.
- Log into the Google Cloud Billing Console and select the University of Cambridge - Information Services (UIS) - New billing account.
- Select Reports from the Cost management menu.
- In the Filters section on the right hand side of the console set the following:
  - Time range: Last 90 days
  - Group by: Date > Service
  - Folder and organizations: Select the folder corresponding to the product in question. For example, if the Card System product had triggered the alert you would select the Card System (855378831331) folder.
- Using the data in the table and graph you should now be able to identify any services which have increased their usage by an unusual amount over the last few months.
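If you prefer to check the figures rather than judge them by eye, the comparison can be sketched in a few lines of Python. This is only an illustration: the function name, the 1.5x threshold and the example figures are all hypothetical, and the costs are assumed to have been copied by hand from the Reports table as one value per month, oldest first.

```python
from statistics import mean


def flag_unusual_increases(monthly_costs, threshold=1.5):
    """Return services whose latest monthly cost is at least `threshold`
    times the mean of the preceding months.

    `monthly_costs` maps a service name to a list of monthly costs,
    oldest first. The threshold is an arbitrary illustrative value.
    """
    flagged = {}
    for service, costs in monthly_costs.items():
        if len(costs) < 2:
            continue  # not enough history to compare against
        baseline = mean(costs[:-1])
        latest = costs[-1]
        if baseline == 0:
            # A service with no previous spend that now has cost is worth a look.
            if latest > 0:
                flagged[service] = float("inf")
            continue
        ratio = latest / baseline
        if ratio >= threshold:
            flagged[service] = round(ratio, 2)
    return flagged


# Hypothetical figures for one product over the last three months.
example = {
    "Cloud SQL": [120.0, 125.0, 118.0],
    "Compute Engine": [200.0, 210.0, 480.0],
    "Cloud Storage": [0.0, 0.0, 15.0],
}
print(flag_unusual_increases(example))
```

With these example figures, Compute Engine and Cloud Storage are flagged and Cloud SQL is not. A flagged service is only a prompt for further investigation with the product owner/team, not a conclusion.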
Escalate a potential security breach¶
If you suspect a security breach, for example hundreds of compute instances suddenly being deployed or terabytes of data suddenly appearing in storage buckets, you must escalate the situation using the university's process for reporting security incidents. You should also contact the following people to discuss any immediate steps that can be taken to protect the university.
If you do not suspect a security breach, you should discuss the increased usage with the product owner/team to determine which of the following courses of action is required.
Assess any potential infrastructure deployment improvements¶
Work with the product owner/team to determine if there are any areas of the cloud infrastructure deployment that can be improved to reduce the monthly costs, especially in non-production environments. Some examples of things to consider are as follows.
- Are there any scaling improvements that can be implemented?
- Are SQL/compute instances sized correctly?
- Are Cloud Storage buckets using the most appropriate storage classes?
- Do Cloud Storage buckets have lifecycle rules configured to remove stale data?
- Are Google Kubernetes Engine clusters configured optimally? Ideally we should be running Autopilot.
If you identify any improvements that can be made you should raise a new GitLab issue against the relevant infrastructure repository to track the proposed work. The new issue should be linked to the original billing alert issue created at the beginning of this guide.
Amend a product's budget in line with legitimate increased usage¶
If you identify a legitimate increase in usage for a particular service, and you have exhausted all potential infrastructure optimisations, you should raise a merge request (MR) against the gcp-product-factory project to propose an increase to the product's budget variable in the relevant .tfvars file. The MR should:
- Be linked to the GitLab issue created at the beginning of this guide.
- Be reviewed by either Adam Deacon or Abraham Martin.
- Increase the relevant product's budget variable to allow for approximately 20% headroom given the product's current usage.
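As a worked example of the headroom calculation, the following Python sketch computes a candidate budget value from the product's current monthly usage. The function name is hypothetical, and rounding up to the nearest 10 is an assumption made here for tidy figures rather than a team convention; use the same currency and period as the existing budget variable.

```python
import math


def proposed_budget(current_monthly_usage, headroom=0.20, round_to=10):
    """Return current usage plus headroom, rounded up to a multiple of `round_to`.

    The 20% default reflects the guidance above; `round_to` is an
    illustrative assumption.
    """
    if current_monthly_usage < 0:
        raise ValueError("usage must not be negative")
    # Round to 2 decimal places first to avoid floating point noise
    # pushing the value over a rounding boundary.
    target = round(current_monthly_usage * (1 + headroom), 2)
    return math.ceil(target / round_to) * round_to


# Hypothetical usage of 430 per month: 430 * 1.2 = 516, rounded up to 520.
print(proposed_budget(430))
```

Because the guidance asks for approximately 20% headroom, the result is a starting point for the MR discussion rather than a fixed rule.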
Accept a temporary increase in usage with no action required¶
In some situations the increased usage may be legitimate but not warrant an increase to the budget amount, for example if a service is undergoing heavy development to release a new feature. If the product owner/team expects the usage to return to previous levels in the very near future, it is acceptable to temporarily ignore the billing alerts. However, the GitLab issue raised at the beginning of this guide should be updated to explain the situation and moved to a future iteration with a due date (for example, in two weeks' time), so that we review the situation and confirm that usage has returned to previous levels.
Alert: 100% of budget reached¶
This is the second of the two default billing alerts that we configure. Ideally this alert is never triggered, because we should already have responded to the "100% of budget expected to be reached" alert in a way that resolved the issue. These alerts should therefore be treated as a priority, following the steps detailed above to reach a resolution as soon as possible.