Our Approach to Monitoring and Alerting¶
Overview¶
Our approach to monitoring and alerting has evolved over time as both our usage of GCP and our internal tooling have matured. The goal is to provide consistent, cost-effective observability across all workspace projects by applying a reliable baseline of platform-level monitoring, while still allowing each system to define additional, targeted alerting where required.
This page outlines how we design and implement monitoring and alerting for GCP resources at the cloud platform level. It explains the principles that guide our approach, the recommended patterns for creating effective alert policies, and how our Terraform modules support these patterns. It does not cover application-specific checks such as HTTP uptime checks; those are documented in the App Deployment reference section.
Our key aims are to:
- ensure all UIS DevOps products benefit from a standardised set of monitoring aligned with the Systems Management Policy requirements,
- encourage standardisation across all monitoring and alerting, whether at the platform level or at the individual system level,
- apply alerting in a way that scales across all resources of a given type, without requiring per-system threshold overrides, ensuring consistent coverage and reducing maintenance overhead, and
- keep operational costs manageable by using aggregation patterns that follow Cloud Monitoring best practices.
By applying the baseline provided by our opinionated ucam-minimal-gcp-monitoring module and extending it where necessary using the lower-level gcp-monitoring module, teams can maintain a consistent foundation of observability while tailoring alerts to their specific system needs.
Meta project vs workspace project¶
Historically, we created most alerting policies at the meta project level, primarily due to early limitations in Cloud Monitoring’s ability to aggregate and target resources across project boundaries. These technical constraints no longer apply, and we now recommend that alert policies are defined within each workspace project.
This shift aims to remove a long-standing point of confusion for both new and existing developers. It has never been immediately obvious why alert policies lived in a different project from the resources they were monitoring. By standardising on workspace-level monitoring and alerting going forward, we hope to eliminate this ambiguity and reduce friction for teams working with the platform.
To support this transition, the gcp-product-factory has been updated so that notification channels
defined in a product’s .tfvars file are created in all relevant projects — the meta project
and all workspace projects. These notification channels are also exposed via the config_v1 object,
which is consumed by gcp-deploy-boilerplate projects.
This means that, in many cases, moving an existing alert policy from the meta project to a workspace
project should be relatively straightforward. Updating the project variable and switching to the
workspace-specific notification_channels is often all that is required, for example:
resource "google_monitoring_alert_policy" "main" {
project = local.project
notification_channels = local.workspace_config.notification_channels
# ...
However, each migration should still be assessed on a case-by-case basis as metrics, queries, and other configuration details may need adjusting.
Can I keep using the meta project for my alerts?¶
Yes. For now, we will continue to support alert policies defined in the meta project as well as in workspace projects. There is no immediate need to move existing alert policies.
However, all new monitoring and alerting work for the Unified DevOps Platform will focus on deploying resources in the workspace projects. In time, we may deprecate the ability to use the meta project for alerting, but if we reach that point the Cloud Team will communicate any proposed deprecations well in advance via the UIS DevOps General channel in Microsoft Teams.
Alerting¶
Determining when to create an alert (and when to notify someone about it) is challenging. Alert fatigue is real, and excessive notifications can quickly become noise. As a general rule, if an alert does not require immediate human action, it should never trigger an email or a Teams message. Alerts that need attention within a few days belong in a service dashboard or report instead. Finally, if an alert is consistently ignored, it’s a strong indicator that the alert is unnecessary and should be removed.
When designing alerts, we try to follow Google’s SRE principles where possible, which are also recommended in the DORA Monitoring and Observability capability. For more information on this topic, Google has published a number of books. In particular, the Monitoring distributed systems chapter of the Site Reliability Engineering book and the Alerting on SLOs chapter of the Site Reliability Workbook should be of interest.
Authoring alert policies¶
The ucam-minimal-gcp-monitoring documentation
provides detailed guidance on how alert policies should be authored for that module. We recommend
following the same guidance when creating any additional system-level alert policies, as this helps
ensure consistency and predictability across all UIS DevOps products.
In addition to the guidance in ucam-minimal-gcp-monitoring, developers should familiarise
themselves with Google's Alerting overview and
Manage alerting costs pages. Both
offer valuable insight into how alerting behaves in Google Cloud Monitoring and provide best
practice recommendations for defining effective, cost-efficient alert policies.
Key principles¶
While the detailed guidance lives in the ucam-minimal-gcp-monitoring module, the following
high-level principles should be applied when authoring any alert policy:
- Target all resources of a given type within the project. For example, a policy monitoring Cloud SQL instance CPU usage should apply to every Cloud SQL instance in the workspace (see the sketch after this list).
- Aggregate conditions at the individual resource level. This approach ensures that each resource is monitored without needing to be listed explicitly, and that alerts fire for the specific resource affected.
- Avoid using chargeable custom metrics when suitable built-in Google Cloud metrics already provide the necessary information. Standard metrics cover most common use cases and help keep monitoring costs predictable.
- Be mindful of upcoming Cloud Monitoring pricing changes. From May 2026, each condition within an alert policy will incur a charge. As a result, alert policies should ideally only be deployed long-term to production workspaces, with temporary or ad-hoc deployment to development and staging environments while policies are being developed or tested.
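As a minimal sketch of the first three principles, the policy below targets every Cloud SQL instance in a workspace project using a built-in metric and aggregates per database_id, so each incident identifies the specific instance affected. The display names, threshold, and durations are illustrative only, and the local.project and local.workspace_config.notification_channels references follow the migration example above; adjust all of these to suit your system.
resource "google_monitoring_alert_policy" "cloudsql_cpu" {
  project      = local.project
  display_name = "Cloud SQL instance CPU utilisation"
  combiner     = "OR"

  conditions {
    display_name = "CPU utilisation above 80% for 10 minutes"

    condition_threshold {
      # Built-in metric, matched by resource type rather than by naming
      # individual instances, so every Cloud SQL instance in the project
      # is covered automatically.
      filter = join(" AND ", [
        "resource.type = \"cloudsql_database\"",
        "metric.type = \"cloudsql.googleapis.com/database/cpu/utilization\"",
      ])
      comparison      = "COMPARISON_GT"
      threshold_value = 0.8
      duration        = "600s"

      aggregations {
        alignment_period     = "300s"
        per_series_aligner   = "ALIGN_MEAN"
        cross_series_reducer = "REDUCE_MEAN"
        # Group by database_id so the condition is evaluated per instance
        # and alerts fire for the specific resource affected.
        group_by_fields = ["resource.label.database_id"]
      }
    }
  }

  notification_channels = local.workspace_config.notification_channels
}
Because the condition filters on the resource and metric type rather than on named instances, new Cloud SQL instances created in the workspace are picked up by the policy without any further Terraform changes.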
Terraform modules¶
We currently use two main Terraform modules to implement monitoring and alerting.
ucam-minimal-gcp-monitoring¶
The ucam-minimal-gcp-monitoring
opinionated module is included by default in the gcp-deploy-boilerplate. Its purpose is to provide
a centralised, consistent baseline of logging and monitoring across all projects. It aims to
implement as much of the generic monitoring required to meet the Systems Management Policy
requirements as possible, reducing the need for teams to reinvent common alerting configurations.
gcp-monitoring¶
The gcp-monitoring
foundation module provides a lower-level set of building blocks for defining Cloud Monitoring
resources. It is also the module that the opinionated ucam-minimal-gcp-monitoring module is built
on. Teams can use the foundation module directly when they need to define additional system-specific
alerting policies beyond the ucam-minimal-gcp-monitoring baseline. It supports more flexible
composition of alert policies and conditions without imposing defaults or conventions.
Using both modules together allows teams to rely on the baseline coverage provided by
ucam-minimal-gcp-monitoring, while adding further monitoring and alerting logic through the
foundation module where appropriate.
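As a rough sketch of this composition, a workspace's Terraform configuration might include the baseline module alongside a system-specific policy defined with the foundation module. The module source addresses and input names below are illustrative placeholders only, not the modules' actual interfaces; consult each module's documentation for the real variables.
# Baseline platform-level monitoring provided by the opinionated module
# (already included by default in gcp-deploy-boilerplate projects).
# Source and inputs shown here are placeholders, not the real interface.
module "baseline_monitoring" {
  source = "..." # ucam-minimal-gcp-monitoring

  project               = local.project
  notification_channels = local.workspace_config.notification_channels
}

# System-specific extension composed with the lower-level foundation module.
# Source and inputs shown here are placeholders, not the real interface.
module "task_queue_alerting" {
  source = "..." # gcp-monitoring

  project               = local.project
  notification_channels = local.workspace_config.notification_channels
  # ...system-specific alert policies and conditions...
}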