DevSecOps standards for Continuous Deployment

Introduction

This document outlines a strategy for extending the principle of DevSecOps to the infrastructure and application deployment processes in the UIS DevOps division. It follows on from the DevSecOps for Ongoing Development paper and focuses specifically on the deployment and operations aspects of DevSecOps.

Applicability

This standard should be implemented for all services that the University develops, whether internally or through third parties. It is particularly relevant to services that handle sensitive data or support important processes, where a security breach could have significant negative impacts.

Exceptions can be requested by following the exemption process within the Systems Management Policy, informing the Tech Leads Forum and seeking its approval first.

Standards

Infrastructure as code

Infrastructure as code (IaC) is key to enabling many of the practices described in this document. By defining our deployments using IaC, we are able to create secure and compliant infrastructure configurations which can be stored in version control.

Templates

UIS DevOps uses Terraform as our standard IaC tool. We have developed internal templates around the standard Terraform workflow such as our Terraform CI/CD template and our Google Cloud Platform (GCP) Deployment project template. These templates are configured in accordance with the standards defined in this document and should be the basis for all new and existing deployments. Where there is a need to extend these templates, care should be taken to ensure that standards continue to be adhered to.

Reusable Terraform modules

In addition to our IaC templates, we also maintain a number of shared Terraform modules. These modules define our approved methods for deploying specific groups of resources. The modules are the building blocks of our GCP Deployment template and should be used when extending the template’s base configuration. The modules are primarily maintained by the Cloud Team, however, contributions are welcome from all teams. If a module is missing functionality, the ideal scenario is to add the functionality to the module instead of deviating from the standard path.

Configuration as code

Configuration as code is the practice of defining application/service configuration as code. Much like with IaC, this code is centrally managed using our existing version control practices, helping us to ensure that our configurations are peer-reviewed and compliant with our security policies. We currently use this practice to manage our GitLab configuration, via the gitlab-project-factory repository, and our API Gateway configuration, via the api/gateway-ops repository. We aim to implement configuration as code for as many of our services as possible providing that the relevant APIs/Terraform providers exist.

Container security

Many of our services involve the deployment of one or more container images. As such, ensuring the security of these images is critical to the overall security of our applications and services.

Base images

Every new container is built upon a pre-built, stable image called the base image. It is crucial to ensure that the base image used for creating any custom images is up to date and patched with the latest security updates. To help with this, we have a Docker Images repository which hosts many of our standard base images. This repository contains scheduled CI/CD jobs which automate the routine rebuilding of these images, pulling in the latest security updates and publishing new versions for projects to use. These base images should be used wherever possible. If a specific image doesn’t yet exist it should be added to the repository.

Multi-stage builds

A multi-stage build uses multiple base images in a single Dockerfile. Initial stages are used to perform the various different build tasks as required, with the final stage containing only the artefacts which are absolutely necessary for the production image. Using multi-stage builds not only reduces the resulting size of our images, but, more importantly, it reduces the potential attack surface of our deployed services.
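As a sketch, a multi-stage Dockerfile for a hypothetical Node.js web application might look like the following (the image tags and build commands are illustrative):

```dockerfile
# Build stage: contains compilers, dev dependencies, and build tooling
FROM node:20 AS build
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Final stage: contains only the static artefacts needed in production
FROM nginx:alpine
COPY --from=build /app/dist /usr/share/nginx/html
```

Only the final stage is shipped, so the compilers and development dependencies used in the build stage never reach the production image.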

Run as a non-privileged user

By default, Docker containers run as the root user, which can pose security risks if the container becomes compromised. To protect against this, our containers should configure a non-root user using the USER instruction in the Dockerfile unless there is absolutely no other alternative.
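A minimal sketch of this pattern, assuming a hypothetical Python application:

```dockerfile
FROM python:3.12-slim
# Create an unprivileged system user and group for the application
RUN groupadd --system app && useradd --system --gid app app
COPY --chown=app:app . /app
WORKDIR /app
# All subsequent instructions and the container entrypoint run as "app"
USER app
CMD ["python", "main.py"]
```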

Regular image builds and deployments

In addition to the automated rebuilding of our base images, services should ensure that their customised images are regularly rebuilt and redeployed. This ensures that software libraries and dependencies are also kept up to date with the latest security fixes. This goes hand-in-hand with the automated dependency update practice detailed in the DevSecOps for Ongoing Development paper.

As a minimum, services should aim to deploy updated images at least once a month. For services under active development this should be a relatively easy target to meet; for other services, extra care should be taken to ensure that images are regularly updated. For newly discovered vulnerabilities, services should follow the schedule defined in the Vulnerability and Patch Management standard.

IaC security

As discussed in the Continuous Integration & Continuous Delivery standards paper, deployment automation is crucial to continuous delivery workflows. In terms of DevSecOps, we can build on the CI/CD deployment pipeline to include industry standard security checks, ultimately ensuring that each deployed environment is as secure as possible.

trivy

Our standard terraform-pipeline CI template includes a trivy job by default. This is a static security analysis job which detects many different IaC misconfigurations and security risks. It comes pre-configured with hundreds of checks for multiple IaC tools and cloud providers, and must be included in our IaC deployment pipelines.

KICS

KICS is another static analysis tool for finding security vulnerabilities and misconfigurations in IaC projects. It is included in the GitLab AutoDevOps templates and can be added to any GitLab pipeline using the include keyword.

Although there is some crossover between the trivy and KICS tools, between them they support a variety of different IaC products with many different rulesets. In general, both of these jobs must be enabled in our IaC repositories to give us maximum coverage.
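As a sketch, enabling KICS in a GitLab pipeline is a one-line include of the GitLab-maintained IaC scanning template (the template path below is taken from GitLab's documentation; verify it against the GitLab version in use):

```yaml
include:
  # GitLab-maintained IaC scanning template; runs KICS against the repository
  - template: Security/SAST-IaC.gitlab-ci.yml
```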

Identity and access management

Identity and Access Management (IAM) is a crucial element of DevSecOps and the principles of least trust and zero trust should be at the forefront of our minds when designing and implementing services. For example, our Cloud Platform configuration for GCP lays the foundation for these principles by creating environment specific deployment service accounts with restricted permissions, and ensuring only the required team members have the ability to impersonate these accounts. The GCP deployment boilerplate then builds on this by ensuring that cloud services are configured to run in the context of dedicated, task-specific service accounts, again, with just the right permissions to perform the required tasks.

Least privilege

The principle of least privilege is a security concept that limits an IAM principal’s access to the minimum level required to perform its job. It allows a principal to access only the resources it needs, thereby reducing the potential damage that can be caused by malicious attacks or accidental errors. The principle of least privilege must be followed as closely as possible for all of our services.

Zero trust

Zero trust is a security model that assumes no trust between different entities and establishes strict access controls. It requires verification for every access request to ensure that only authorised users are granted access to resources. This is generally how most Google Cloud services communicate with each other. For example, Cloud SQL databases can use IAM database authentication which allows the use of short-lived access tokens to authenticate database sessions, rather than using built-in username and password authentication.

The principle of zero trust is currently being scaled up across the DevOps division and, as mentioned previously, it should certainly be a key consideration when designing and implementing our services.

Google service accounts

A Google service account represents a non-human user/identity. Service accounts are usually associated with a specific cloud resource and are granted permissions to allow that resource to access required services. Service accounts themselves should be managed as a resource, meaning that they should be created and destroyed in code, and with the same lifecycle as their associated resource.

Create task-specific service accounts

Service accounts should be created for a single, task-specific purpose. This allows us to grant a narrow set of permissions to each service account, ensuring that we adhere to the principle of least privilege. This usually means that a deployment will consist of multiple service accounts, each tied to a specific resource.
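A minimal Terraform sketch of this pattern (the account ID, role, and variable names are illustrative):

```hcl
# Dedicated service account for a single Cloud Run service
resource "google_service_account" "webapp" {
  account_id   = "webapp-run"
  display_name = "Web application Cloud Run service account"
}

# Grant only the narrow role this service actually needs
resource "google_project_iam_member" "webapp_sql" {
  project = var.project
  role    = "roles/cloudsql.client"
  member  = "serviceAccount:${google_service_account.webapp.email}"
}
```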

Service account keys and impersonation

Service accounts can have one or more key pairs associated with them. These keys are downloadable as JSON files and can pose a significant security risk if they are not managed appropriately. With this in mind, the creation of service account keys must not happen unless absolutely necessary.

Instead, users (and other service accounts in some situations) should be granted permission to impersonate the required service account(s). This allows the user to generate a temporary token to be able to act on behalf of a service account to perform a particular task. This is the default method for authenticating as service accounts in our current deployments.
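For example, the Terraform Google provider supports impersonation natively, so a deployment can run under a service account's identity using short-lived tokens rather than a downloaded key (the account email below is illustrative):

```hcl
provider "google" {
  # Generate short-lived credentials for the deployment service account
  # instead of authenticating with a long-lived JSON key file.
  impersonate_service_account = "deploy@example-project.iam.gserviceaccount.com"
}
```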

Regularly review role recommendations

Google provides a built-in role recommendation feature to help identify which permissions a service account is actually using, and which permissions might be unused. Teams must regularly review these recommendations and adjust the roles granted to their service accounts as required. These recommendations are displayed in the Security Insights column of the default IAM table in the Google Cloud Console.

IAM table in the Google Cloud Console showing the excess permissions recommendations.

Secrets management

One of the most difficult areas of DevOps deployments is secrets management. With so many interconnecting systems, most of our services require multiple pieces of sensitive information and keeping these secrets secure is paramount when thinking about DevSecOps.

1Password and Google Secret Manager

The DevOps division primarily uses two secret managers, 1Password and Google Secret Manager (GSM). 1Password is the canonical source of truth for any of our manually provided secrets. For secrets that are dynamically generated (for example via IaC) GSM should be considered the source of truth, with the full lifecycle of the secret resource being managed via automation.

Our standard Terraform deployments cannot access 1Password securely at runtime. Therefore, for manually provided secret values we should use our in-house tool, Sanctuary, to synchronise the required 1Password secrets with environment-specific GSM secret objects prior to deployment.

Providing secret values to Google Cloud resources

As mentioned above, generally we synchronise a service's required secrets from 1Password to GSM secret objects before we initiate a deployment. This allows us to configure task-specific service accounts with the required secret accessor role to be able to read secret values at runtime. Many of Google Cloud’s resources support this workflow, for example Cloud Run and Cloud Functions.

Environment variables

A common configuration pattern for services such as Cloud Run and Cloud Functions is to store secret values as environment variables. This can be made somewhat secure by the fact that the APIs for these services allow specifying a GSM secret object by ID, the value of which is then loaded into the environment at runtime, avoiding any hardcoding of secret information in IaC for example. However, as explained by Seth Vargo (Google engineer) in his blog post Secrets in Serverless:

While this approach is simple and straightforward, it comes with considerable security drawbacks - the secrets exist in plaintext in the environment. Any other process, library, or dependency running inside the process has access to the environment which has already been exploited multiple times. Unfortunately, it is trivial for a malicious library author to inject this type of vulnerability into an otherwise helpful utility package.

To be absolutely, unequivocally clear, you should not store secret or sensitive information in environment variables in plaintext.

With this in mind, you should either (a) access secrets directly at runtime, or (b) mount secrets as volumes.

Accessing secrets directly at runtime

Fetch GSM secret objects at runtime as they are needed. This approach is only possible for applications that can use a supported Google Cloud client library. Secrets can be fetched dynamically, ensuring that your application always uses the most up-to-date credentials without needing to restart or redeploy. Direct access also allows us to better leverage Google Cloud's audit logging to monitor and track access to secrets.

Mounting secrets as volumes

Occasionally we won’t be able to access secrets directly at runtime, for example because a third-party application or library doesn’t support the Google Cloud client libraries. In these cases, mount GSM secret objects as volumes accessible only by the application user, and configure the application to read the mounted file(s) from disk to ingest the secret values. Note that mounted secrets are static and require a container restart to pick up new versions, which can be less flexible, and secrets managed as volumes require careful handling to ensure they are not exposed.
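As a Terraform sketch, a GSM secret object can be mounted into a Cloud Run service like this (the service, secret, and path names are illustrative):

```hcl
resource "google_cloud_run_v2_service" "app" {
  name     = "app"
  location = "europe-west2"

  template {
    volumes {
      name = "db-credentials"
      secret {
        secret = google_secret_manager_secret.db_password.secret_id
        items {
          version = "latest"
          path    = "db-password"
        }
      }
    }
    containers {
      image = "europe-docker.pkg.dev/example/app:latest"
      # The secret value appears as the file /secrets/db-password
      volume_mounts {
        name       = "db-credentials"
        mount_path = "/secrets"
      }
    }
  }
}
```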

Accessing secrets in GitLab pipelines

The DevOps division’s GKE GitLab runner platform configures GitLab pipeline jobs with a Google IAM identity. This gives us the ability to grant a job’s identity access to specific GSM secret objects, allowing us to access the secret value at runtime. This removes the need to store secrets in GitLab’s CI/CD variables, which is a potential security risk and something we want to avoid. The guidebook has some useful guides on how to run CI/CD jobs on a GKE-hosted runner and how to access a GSM secret value using service account impersonation in a CI/CD job.

Secrets in Terraform

Terraform configurations often need to configure resources with sensitive information, for example providing a secret value to a Cloud Run service. However, by default, Terraform state files are not encrypted. This means that actions such as generating secrets using the random provider, or reading secrets from a GSM secret object using a data source cause the secret value to be stored in plain text in the state file.

In our standard deployments we protect against this by storing state files in Google Cloud Storage buckets which are encrypted at rest by default. We also apply restricted IAM permissions to the buckets to ensure that only a small number of users/service accounts are able to access the state files. However, even with these protections in place, our recommendation is to avoid generating secrets in Terraform configurations, or reading secrets into Terraform configurations using data sources wherever possible.
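To illustrate the risk, both patterns below cause the secret value to be written in plain text to the Terraform state file (the resource and secret names are illustrative):

```hcl
# The generated password is recorded in the state file in plain text
resource "random_password" "db" {
  length = 32
}

# Reading a secret via a data source also copies its value into state
data "google_secret_manager_secret_version" "api_key" {
  secret = "api-key"
}
```

Where possible, prefer having the consuming resource reference the GSM secret object by ID so that the value itself never passes through Terraform.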

Continuous monitoring

Continuous monitoring is the process of automating the collection and analysis of data and producing relevant reports and alerts. This helps to provide real-time insights into system performance and to identify and resolve potential security threats and vulnerabilities. For example, an increase in 4xx (as opposed to 5xx) responses from an API, while not indicating an issue with the API itself, may be indicative of an automated attack.

The Google Cloud Platform automatically collects and stores performance data for most Google Cloud services by default. It is the responsibility of our product teams to configure the relevant reports and alerts depending on the requirements of each service. Our GCP deployment boilerplate includes our shared Site Monitoring Terraform module which configures default SSL expiry and uptime checks for Cloud Run services. This should be expanded to include all required elements from the Security Logging Technical Standard.

Backups and data retention

Ransomware attacks, data leakage, and data loss are unfortunately becoming commonplace in the modern IT landscape. As such, we must ensure that our services and data are protected by appropriate backup and data retention policies. It is impossible to detail in this document the exact strategy for each individual service, as this depends on what data each service handles and how it stores it. As a general rule, we need offline, ransomware-resilient backups, a disaster recovery plan, and data retention policies that follow the University's master records retention schedule.

We will discuss some processes which are common across our services.

GitLab backups

All our code and build artefacts are stored in the University GitLab service in accordance with the DevSecOps standards for Ongoing Development. As such, we must ensure that this service is properly protected so that the business impact of any potential outage is kept to a minimum.

The Cloud team configures and manages a nightly automated backup of the entire GitLab service which is based on the documentation provided by GitLab. These automated backups include the main GitLab database, all repository data, CI build logs and artefacts, pages site data, and all container images and published package data. Backup data is retained for a minimum of 60 days and the restore process is tested at least once every 12 months.

Database backups

Many of our services rely on some form of database. For example, our GCP deployment boilerplate configures a standard PostgreSQL instance using the Google Cloud SQL service. Databases should be configured to perform automated backups at least once a day with backup data being retained for a minimum of 7 days.
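A Terraform sketch of a Cloud SQL instance meeting this standard (daily backups, 7 retained), with illustrative instance name, tier, and region:

```hcl
resource "google_sql_database_instance" "db" {
  name             = "app-db"
  database_version = "POSTGRES_15"
  region           = "europe-west2"

  settings {
    tier = "db-custom-1-3840"

    backup_configuration {
      enabled                        = true  # daily automated backups
      point_in_time_recovery_enabled = true
      backup_retention_settings {
        retained_backups = 7  # minimum retention per this standard
      }
    }
  }
}
```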

Google Cloud SQL

With regards to Google Cloud SQL database instances, the Cloud team manages an automated process to export all of our production Cloud SQL databases to Cloud Storage buckets in a central Google Cloud project. This process runs nightly and ensures that we have an “off-site” copy of our production databases in addition to the Cloud SQL automated backups/snapshots. This provides us with the ability to recover from potential major issues such as a Cloud SQL instance being completely deleted or a Google project becoming compromised or deleted.

Bucket and object storage backup and retention

Google Cloud Storage buckets are a common place for our services to store business critical data. As such, we must ensure that buckets are configured with the appropriate backups and retention policies.

Object versioning

Cloud Storage object versioning can be enabled on a bucket in order to retain older versions of objects. This allows you to retain previous versions of an object and to restore objects from accidental deletion. This is simple to configure and must be used in all situations where a bucket contains important data as a first line of defence. However, it should be noted that this does not protect against bucket deletion.
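A Terraform sketch of a versioned bucket (the bucket name, location, and version cap are illustrative):

```hcl
resource "google_storage_bucket" "data" {
  name     = "example-critical-data"
  location = "EUROPE-WEST2"

  # Retain noncurrent object versions on overwrite or delete
  versioning {
    enabled = true
  }

  # Keep at most 5 noncurrent versions of each object
  lifecycle_rule {
    action {
      type = "Delete"
    }
    condition {
      num_newer_versions = 5
    }
  }
}
```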

Bucket and object retention locks

The Cloud Storage service provides a number of different options to configure retention period locks for both buckets and bucket objects. These locks ensure that buckets and/or bucket objects have a “retain-until” date and time configured which cannot be modified once set. During this time the targeted data cannot be deleted or replaced/modified.

Bucket and object retention locks can provide protection against threats such as ransomware attacks and must be configured for any business critical data. Retention periods will vary between services depending on which data they handle; they must follow the University's master records retention schedule.
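A Terraform sketch of a bucket-level retention lock (the bucket name and one-year period are illustrative; the appropriate period for each service comes from the University's master records retention schedule):

```hcl
resource "google_storage_bucket" "records" {
  name     = "example-business-records"
  location = "EUROPE-WEST2"

  retention_policy {
    retention_period = 60 * 60 * 24 * 365  # one year, in seconds
    is_locked        = true                # locking is irreversible once applied
  }
}
```

Note that once `is_locked` is set, the retention period cannot be removed or reduced, so the period should be agreed before the lock is applied.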

Bucket data exports

In some situations, object versioning and bucket locks still do not provide enough protection for our business critical data. If this is the case, you must configure an automated export/sync of the bucket data to “air-gapped” buckets in a secondary Google project.

Networking

Load Balancing

Many of our deployments make use of one or more Google Cloud Load Balancers. The following sections detail some of the changes to the defaults that must be applied when configuring these load balancers.

SSL Policies

By default, Cloud Load Balancers are configured with a “COMPATIBLE” SSL policy which supports legacy versions of TLS and RSA signatures to provide wide compatibility with many different clients. However, you must use the more restrictive “MODERN” SSL policy unless a requirement to support older clients is justified. This configuration is the default in our templates and can be seen in our Cloud Run App shared Terraform module.
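As a Terraform sketch, an SSL policy enforcing the MODERN profile (the policy name is illustrative) looks like this; the policy is then referenced from the load balancer's HTTPS target proxy:

```hcl
resource "google_compute_ssl_policy" "modern" {
  name            = "modern-tls"
  profile         = "MODERN"
  min_tls_version = "TLS_1_2"
}
```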

Cloud Armor

Google Cloud Armor helps to protect our deployments from multiple types of threats, including distributed denial-of-service (DDoS) attacks. By default, all Google Cloud projects that include an external Cloud Load Balancer are automatically enrolled into Cloud Armor Standard protection. This includes always-on DDoS protection and access to the Cloud Armor web application firewall (WAF) rules capabilities, including preconfigured WAF rules for OWASP Top 10 protection.

This default protection is very useful and is often enough for our standard deployments. However, we should continually review our services as they are developed to determine if additional WAF rules are required to improve our security position.

DNSSEC

The Domain Name System Security Extensions (DNSSEC) is a DNS feature that protects against DNS record spoofing by authenticating responses to domain name lookups. DNSSEC must be enabled on all services. All DNS zones that are created and managed by the gcp-product-factory have DNSSEC configured by default.
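For zones managed outside the gcp-product-factory, DNSSEC can be enabled in Terraform as follows (the zone and domain names are illustrative):

```hcl
resource "google_dns_managed_zone" "zone" {
  name     = "example-zone"
  dns_name = "example.cam.ac.uk."

  # Sign the zone so resolvers can authenticate lookup responses
  dnssec_config {
    state = "on"
  }
}
```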

DevSecOps standards for Ongoing Development

DevOps Continuous Integration and Continuous Delivery (CI/CD) standards