How to respond to billing alerts

This guide shows you how to respond to our Google Cloud Platform (GCP) email billing alerts. These alerts are generally handled by the person on the Cloud Team monitoring rota.

Convert the email alert to a GitLab issue

At the time of writing, our GCP billing alerts are sent via email to the email address. The first task is to manually raise a corresponding GitLab issue against the gcp-product-factory project, including the details from the email alert. This issue can then be worked on using our standard issue lifecycle workflow.

Example billing alert

An example GitLab issue being raised with details from the billing alert email.

Alert: 100% of budget expected to be reached

As detailed in the Budgets and Alerts section, this is the first of two default billing alerts that we configure. When this alert is received for a particular product the steps below should be followed.

Identify any recent increase in usage

The first stage is to gather information for the product in question relating to any increased usage over recent months using the following steps.

  1. Log into the Google Cloud Billing Console and select the University of Cambridge - Information Services (UIS) - New billing account.
  2. Select Reports from the Cost management menu.
  3. In the Filters section on the right-hand side of the console, set the following:

    • Time range: Last 90 days
    • Group by: Date > Service
    • Folder and organizations: Select the folder corresponding to the product in question. For example, if the Card System product had triggered the alert you would select the Card System (855378831331) folder.
  4. Using the data in the table and graph you should now be able to identify any services which have increased their usage by an unusual amount over the last few months.

    Example billing report
    An example billing report view showing a spike in Cloud Storage usage for the product in question.
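
Once you have the report data, the comparison in step 4 can also be done programmatically. The following is a minimal sketch, assuming the report has been copied into rows of (month, service, cost). The function name `flag_cost_increases`, the 1.5 threshold, and the figures are all hypothetical and used only for illustration; they are not part of our tooling.

```python
from collections import defaultdict


def flag_cost_increases(rows, threshold=1.5):
    """Return services whose latest-month cost exceeds `threshold` times
    the average of their earlier months.

    `rows` is an iterable of (month, service, cost) tuples, where month is a
    sortable "YYYY-MM" string. The default threshold is an arbitrary example.
    """
    # Sum the cost per service per month.
    by_service = defaultdict(dict)
    for month, service, cost in rows:
        by_service[service][month] = by_service[service].get(month, 0.0) + cost

    flagged = {}
    for service, monthly in by_service.items():
        months = sorted(monthly)
        if len(months) < 2:
            # No earlier month to compare against. A brand new service may
            # still deserve a manual look in the console.
            continue
        latest = monthly[months[-1]]
        baseline = sum(monthly[m] for m in months[:-1]) / (len(months) - 1)
        if baseline > 0 and latest > threshold * baseline:
            # Record how many times larger the latest month is.
            flagged[service] = round(latest / baseline, 2)
    return flagged


# Made-up figures for illustration only.
example_rows = [
    ("2024-01", "Cloud Storage", 100.0),
    ("2024-02", "Cloud Storage", 110.0),
    ("2024-03", "Cloud Storage", 420.0),
    ("2024-01", "Cloud SQL", 200.0),
    ("2024-02", "Cloud SQL", 210.0),
    ("2024-03", "Cloud SQL", 205.0),
]
print(flag_cost_increases(example_rows))  # {'Cloud Storage': 4.0}
```

A flagged service is only a starting point for the conversation with the product owner/team; the graph in the console remains the quickest way to see when the increase began.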

Escalate a potential security breach

If you suspect a security breach, for example hundreds of compute instances suddenly being deployed or terabytes of data suddenly appearing in storage buckets, you must escalate the situation using the university's process for reporting security incidents. You should also contact the following people to discuss any immediate steps that can be taken to protect the university.

If you do not suspect a security breach, you should discuss the increased usage with the product owner/team to determine which of the following courses of action is required.

Assess any potential infrastructure deployment improvements

Work with the product owner/team to determine if there are any areas of the cloud infrastructure deployment that can be improved to reduce the monthly costs, especially in non-production environments. Some examples of things to consider are as follows.

  • Are there any scaling improvements that can be implemented?
  • Are SQL/compute instances sized correctly?
  • Are Cloud Storage buckets using the most appropriate storage classes?
  • Do Cloud Storage buckets have lifecycle rules configured to remove stale data?
  • Are Google Kubernetes Engine clusters configured optimally? Ideally we should be running Autopilot.

If you identify any improvements that can be made you should raise a new GitLab issue against the relevant infrastructure repository to track the proposed work. The new issue should be linked to the original billing alert issue created at the beginning of this guide.

Amend a product's budget in line with legitimate increased usage

If you identify a legitimate increase in usage for a particular service, and you have exhausted all potential infrastructure optimisations, you should raise a merge request (MR) against the gcp-product-factory project to propose an increase to the product's budget variable in the relevant .tfvars file. The MR should:

  • Be linked to the GitLab issue created at the beginning of this guide.
  • Be reviewed by either Adam Deacon or Abraham Martin.
  • Increase the relevant product's budget variable to allow for approximately 20% headroom given the product's current usage.
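
The headroom calculation can be sketched as follows. This is an illustration only: it assumes "20% headroom" means a budget 20% above the current monthly usage, and the function name `proposed_budget` and the rounding up to the nearest 10 are hypothetical choices rather than an agreed convention.

```python
import math


def proposed_budget(current_monthly_spend, headroom_percent=20, round_to=10):
    """Return current spend plus headroom, rounded up to a multiple of `round_to`.

    Assumes headroom is measured as a percentage above current usage.
    """
    if current_monthly_spend < 0:
        raise ValueError("spend cannot be negative")
    target = current_monthly_spend * (100 + headroom_percent) / 100
    # Round up so the budget never falls short of the intended headroom.
    return math.ceil(target / round_to) * round_to


print(proposed_budget(830))  # 830 + 20% = 996, rounded up to 1000
```

The resulting figure is what you would propose for the product's budget variable in the MR, subject to the reviewer's agreement.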

Accept a temporary increase in usage with no action required

In some situations, the increased usage may be legitimate but may not warrant an approved increase to the budget amount, for example if a service is undergoing heavy development to release a new feature. If the product owner/team expects the usage to return to previous levels in the very near future, it is acceptable to temporarily ignore the billing alerts. However, the GitLab issue raised at the beginning of this guide should be updated to explain the situation, and it should be moved to a future iteration with a due date (for example in two weeks' time) to ensure that we review the situation and confirm that usage has returned to previous levels.

Alert: 100% of budget reached

This is the second of the two default billing alerts that we configure. Ideally, this alert will never be triggered, as we should have already responded appropriately to the "100% of budget expected to be reached" alert and resolved the issue. These alerts should therefore be treated as a priority, following the steps detailed above to reach a resolution as soon as possible.
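
The difference between the two alerts can be sketched as below. This assumes the first alert is evaluated against Google's forecasted spend for the budget period and the second against actual spend so far. The function name `triggered_alerts` and the figures are hypothetical, and the forecast itself is treated as an input because it is calculated by Google.

```python
def triggered_alerts(actual_spend, forecast_spend, budget):
    """Return the names of the default alerts that would fire.

    Assumes the "expected" alert compares forecasted spend to the budget
    and the "reached" alert compares actual spend to the budget.
    """
    alerts = []
    if forecast_spend >= budget:
        alerts.append("100% of budget expected to be reached")
    if actual_spend >= budget:
        alerts.append("100% of budget reached")
    return alerts


# Made-up figures: spend is below budget, but forecast to exceed it.
print(triggered_alerts(actual_spend=800, forecast_spend=1050, budget=1000))
```

In this example only the first alert fires, which is the point at which the steps above should already be under way.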
