DevOps standards for Change Management¶
Preface¶
The following document has been co-created and/or reviewed by UIS DevOps tech leads, UIS DevOps Service Managers, and UIS Product Managers.
The document requires some previous knowledge of what DevOps is; for this reason we have included the definition from the ISO standard ISO/IEC/IEEE 32675:2022(E) “Information technology — DevOps — Building reliable and secure systems including application build, package and deployment”. We *strongly encourage* reading the following citation before the rest of this document.
The term DevOps evolved from the availability of fully automated application build, package, and deployment tools, along with the recognition that information technology (IT) organizations were not prepared to use those tools effectively. [...] DevOps aims to satisfy a dynamic and competitive marketplace that favors products that balance the V requirements (volume, velocity, variety, veracity, value, and others). DevOps seeks to achieve a balance between velocity and system reliability and stability. DevOps was created to provide solutions to constantly changing complex problems, where reducing organizational risk and improving security and reliability are critical requirements.[...]
DevOps focuses on business and organizational goals ahead of procedural and technical considerations. DevOps utilizes information-rich feedback loops to understand progress and threats to attaining business and mission goals. Taking a business or mission first view helps to balance the concerns of risk and the activities which provide the most value to the customer. [...] DevOps takes a customer-centric view, prioritizing and designing work to deliver value to the customer, as well as identifying and managing risk. In short, if it makes sense for the customer and meets a customer need, then it is likely to be the right approach from a DevOps perspective. [...]
DevOps relies on keeping stakeholders informed and aware of changes that can impact them, by means of automation when practicable. [...]
Left-shift and continuous everything¶
The normal DevOps practice is information-driven, risk-based, continuous everything. DevOps continuous everything means using the same practices in development as in operations and sustainment. DevOps practices are founded on automation for continuous integration, delivery and deployment, and operations and sustainment. The approach to DevOps in this document is to build systems to be secure and verifiable from the very beginning. [...]
The continuous delivery, testing, and QA practices are shifted left (earlier in the workflow) to be planned and executed at the same time as design and development. For example, in DevOps, automated tests are built along with the product from inception. Most efforts begin with effective reviews of requirements, test strategies, and coding standards. Common methods include test-driven development (TDD), automatic code scanning (on build), automated regression testing, and functional and non-functional (e.g., performance) testing. Continuous QA and testing are essential in any DevOps-centric effort. Reducing rework and waste contributes to the achievement of improved velocities, a hallmark of well-implemented DevOps.
Similarly, information security cannot be tacked on to the end of a development effort. The DevOps view of security is sometimes referred to as DevSecOps, but in reality, there is no DevOps without a continuous focus on security. This includes building systems to be secure from the very beginning of the systems and applications life cycle and continuing throughout their life cycle, including code that is deemed ready to be safely deprecated.
Left-shift is particularly valuable for improving the reliability of methods for production software release and deployment.[...] In DevOps, “left-shifting” of deployment procedures means lower-risk deployment using the same methods for all environments in the continuous delivery pipeline.
To accomplish continuous delivery more securely and reliably, many firms have turned away from traditional monolithic, sequential, and mostly manual development and operational approaches to one that integrates an ever-growing number of market-proven external solution components (i.e., frameworks, libraries, application programming interfaces [APIs], and software as service solutions) with a more manageable set of targeted custom solution components. There has also been a significant shift to cloud-based hosting that can easily and efficiently scale up and down as needed to satisfy the dynamic load demands of users. To enable this, DevOps requires the use of tailored processes and specialized pipeline tools able to leverage automation wherever practicable, across the entire system’s life cycle.
Systems thinking¶
Systems thinking counters a myopic approach of utilizing specialists—such as networking professionals, database administrators, and systems administrators—who rarely communicate with either the development or operations teams and lack understanding of the system as a whole. In DevOps, taking a comprehensive view encourages technology professionals to fully understand the system from end to end. Systems thinking can enable resolution of complex and emergent problems that are not easily traceable to a single flaw. Systems thinking should apply to a consistent architecture for the enterprise tools used for DevOps as well as to the system under development.
[...]Consistent systems thinking among DevOps stakeholders can be challenging when there are differing or even incompatible philosophies, policies, procedures, activities, tasks, and tooling.[...]
Leadership¶
While DevOps involves cooperation at all levels of an organization, it is most successful when leadership is viewed internally and externally as fully supporting the letter and the spirit of the DevOps policies, principles, and practices. When those in leadership roles exert their authority and influence, the entire DevOps team is better able to interact with stakeholders; uphold the power of organizational procedures; sustain a sophisticated, complex, and effective pipeline; and engage capable human resources.
Addressing DevOps vulnerabilities requires commitment to investments to establish, sustain, and improve capabilities so that the ecosystem remains aligned with evolving policies, processes, mechanisms, practices, and tools. Without clear and lasting support from and continued active engagement of leadership, the benefits of investments in procedures and tools often fade and disappear. It takes time, the long-term leadership commitment of considerable resources, and a level of dedication to fully deploy and realize the continuous improvement of the DevOps principles, practices, and processes[...] To achieve velocity and other quality goals for DevOps, proper utilization of proven forms of automation depends on the involvement of humans receiving and acting on statistically valid information. These mechanisms and processes are particularly challenging to establish, utilize, sustain, and improve due to the differing baseline cultures of the stakeholders. Proactive and engaged leadership can set the example for stakeholders[...]
Leadership benefits from clear and timely procedures for escalation and dynamic information-driven risk management and issue resolution. Decision-making at all levels empowers the team and the organizational leaders with the capability to handle emerging opportunities and risks inherent in the DevOps ecosystem. This approach helps avoid the overload and burnout of leadership and other personnel by involving, empowering, and authorizing the right people with the right capabilities at the right time with the right information to do the right thing. [...]
DevOps values effective and collaborative communication and consequently thrives in organizations where transparency and collaboration are expected and enjoyed.[...]
When dealing with new tools supporting “continuous everything,” unforeseen issues can arise. DevOps values communication and feedback loops, allowing the organization to experiment with varying approaches. A feedback loop is a virtuous cycle where action produces information which is used in future action to improve results. Team members regularly communicate what is going well and what needs to be improved. DevOps helps implement systems and software life cycle processes with regular checkpoints and rapid course correction when warranted.[...] Effective adaptation of DevOps continuously invests in transforming work into automated services that improve velocity, support sporadic acceleration, and enhance other quality attributes, thereby allowing leadership and stakeholders to be more strategic. [...]
DevOps and life cycle processes¶
[...]DevOps is a full life cycle endeavor which gives equal consideration to each stage. DevOps is a set of principles and practices which enable better communication and collaboration between relevant stakeholders for the purpose of specifying, developing, continuously improving, and operating software and systems products and services. It is not just a matter of technical practices affecting other life cycle processes.
Teams using DevOps typically start a systems or applications effort by creating a continuous delivery pipeline (set of tools and procedures) that takes the code from the source code management system and automates the complete application build, package, deployment (including transitions to other environments), operations, and sustainment workflow. Contributors often start with a simple program, write the pipeline, and then iteratively (and rapidly) develop their code. In development, multiple teams often integrate code continuously, automatically deliver the code to a test automation framework, and on to subsequent workflow participants. [...]
DevOps is suitable for most life cycle process models, and particularly appropriate when teams adopt agile methodologies. DevOps can be just as valuable in an iterative waterfall approach.
Introduction¶
This document provides an implementation of the ISO standard ISO/IEC/IEEE 32675:2022(E) on change management and follows standard practice in DevOps software development.
Not all of our services follow this model. Sometimes this is because they have been inherited from elsewhere with large amounts of technical debt or because they are just emerging from the Discovery process. Some products may be too small or too specialised in nature to fit this model.
What is “change”?¶
At its base, a “change” is any modification to the state of a product or service. We can group change as a whole into three categories:
- Change to the data held within a product. This change may be administrator-led or user-supplied. Ordinarily such change is seen as “Business as Usual” for our products. For example, a user electing to select their preferred title from a configured list in an identity system is not seen as a “change” per se. There are exceptions. For example, the removal of a title from this configured list may be classified as a “change”.
- Change to the internal operation of a product. This change may be far-reaching and profound but if it is not user-visible, it is generally seen as a “change” from the point of view of internal change-management processes but not from an external perspective.
- Change to the documented interface of a product. This change may be minor but it is user-visible. Consideration should be given to the effect of the change on our users and whether this change requires advance communication.
Any given change may further be classified by its nature. For example, a change may be a modification to a published Application Programming Interface (API), a bug fix, a mitigation for a security issue or simply a change to functionality. In this document we’ll try to keep a high-level view of change and describe processes which apply across many kinds of change although, inevitably, some processes will have a greater affinity for changes of a particular nature.
Guiding principles¶
We use the following principles when managing change:
- Change is undramatic, frequent and regular.
- Change of implementation need not be externally visible.
- Communications with users should be proportionate; otherwise they will be ignored.
- Inputs and outputs are known and they are documented as contracts.
- There is defence in depth.
- Stakeholders are kept informed.
- State is documented in code and is replicable.
- Automate what can be automated and streamline the rest.
We’ll reference these principles in the sections below.
Change cadence¶
We assert that change should be undramatic, frequent and regular. Frequent change means that each change is likely to be well understood, limited in scope and consequently lower in risk. Having a regular cadence of change means that those delivering a service get into a rhythm which is appropriate to the particular service.
While not directly related to change management per se, we hope that change is driven by a desire to deliver value for users of the service. By having small, frequent and regular improvements to value, we want to encourage users to view change as an exciting thing to look forward to, not something to dread.
Communicating change cadence¶
As change is undramatic, frequent and regular, it follows that changes are numerous. A typical week in November 2023 saw 94 changes applied to our 10 most-updated products and services. In the diagram below, a “change” is an individual Merge Request (MR). An MR may consist of multiple sub-changes, known as commits. Multiple MRs may be combined into a release and, finally, a release may be atomically and idempotently deployed to an environment.
The deployment is the moment at which changes take effect; for services following Continuous Integration and Continuous Deployment (CI/CD) practices, this happens immediately after an MR is merged. Each MR represents a distinct “change” which is proposed, implemented, reviewed and tested (in multiple aspects).
A prototype dashboard showing code-changes proposed, rejected and merged in the course of one week in DevOps for the 10 most-updated products.
We are trialling this dashboard as a means for those interested to have a real-time view of which services have recently been changed. Each project links to its repository, where a more in-depth “Changelog” links to the Merge Requests and commits behind each change.
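As a rough illustration of how such a view can be assembled, the sketch below counts Merge Requests merged across a GitLab group during the previous week using the GitLab REST API. The instance URL, group id and token variable are assumptions for the example, not details of our actual dashboard.

```python
"""Sketch: count Merge Requests merged in the last week, per project.

Assumptions (not part of this document): a GitLab group id, an access
token in the GITLAB_TOKEN environment variable and the instance URL.
"""
import os
from collections import Counter
from datetime import datetime, timedelta, timezone

import requests

GITLAB_URL = "https://gitlab.example.com"   # assumed instance URL
GROUP_ID = 1234                             # assumed group id

since = (datetime.now(timezone.utc) - timedelta(days=7)).isoformat()
headers = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}

counts: Counter[str] = Counter()
page = 1
while True:
    resp = requests.get(
        f"{GITLAB_URL}/api/v4/groups/{GROUP_ID}/merge_requests",
        headers=headers,
        params={"state": "merged", "updated_after": since,
                "per_page": 100, "page": page},
        timeout=30,
    )
    resp.raise_for_status()
    merge_requests = resp.json()
    if not merge_requests:
        break
    for mr in merge_requests:
        # "group/project!123" -> "group/project"
        counts[mr["references"]["full"].split("!")[0]] += 1
    page += 1

for project, n in counts.most_common(10):
    print(f"{project}: {n} changes")
```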
Delivery management¶
The DevOps division has a strong focus on technology but not all roles within service management are engineering roles. Although this document focuses on technical approaches to change management, we also recognise the importance of stakeholder and relationship management, delivery management and governance as part of a coordinated set of practices.
We use an Agile approach to development where features are proposed, refined, estimated, scheduled, implemented, reviewed, tested, merged and deployed. The management of our feature backlog is an essential part of our change management process. Our use of two-week sprints to schedule change is a keystone of our approach to continual improvement.
The scheduling, implementation and release of changes over the course of a sprint help us ensure that change is undramatic, frequent and regular. We deploy our services using a high degree of automation. Changes are naturally designed to be incremental rather than revolutionary. With reference to the ITIL Deployment Management practice, our choice is the “Continuous Delivery” model. Our delivery framework is incompatible with models such as “Big Bang” or “Phased Delivery”. The “Pull Deployment” model is usually not applicable to our services.
Change of implementation should not be externally visible. Although there is undoubtedly some concept of the “current version” of Office 365 within Microsoft, this is not visible to users. Word Online may be updated multiple times per day with bug fixes, additional features, A/B testing, etc. In the same way, we strive to design our services in such a way that the implementation may be changed with no external effect and that features may be added incrementally without breaking existing workflows.
When features are deployed, communications should be proportionate, to avoid users ignoring too many communications which are not relevant to them. Returning to the example of Word Online, the appearance of a new menu option one day does not require months of communications work. On the other hand, a major feature change does necessitate Microsoft communicating with Office 365 administrators, who then cascade information internally.
Deployment and release management¶
Our deployment and release management strategy focuses on automated packaging, immutable naming, strong versioning and automation throughout.
Server-side software is generally packaged as “container images”. These images contain the software packaged in such a way that they can be deployed “as is” to container-hosting infrastructure. For software we develop ourselves, we use immutable “version tags” for an image. As such, if we deploy version “x.y.z” of our software, we know bit-for-bit the image which is deployed. It follows that a version of the software deployed to our staging environment will be bit-for-bit identical when subsequently deployed to production.
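To illustrate what an immutable version tag means in practice, the following sketch resolves a tag to the content digest stored by a container registry using the Registry HTTP API v2. The registry URL, repository and tag are placeholders and authentication is omitted; comparing digests (rather than tags) between staging and production confirms that the same image is deployed in both.

```python
"""Sketch: confirm that an image tag identifies one immutable image.

The registry host and repository below are placeholders. The digest
returned by the Registry HTTP API v2 identifies the image bit-for-bit,
so staging and production can be compared by digest rather than by tag.
"""
import requests

REGISTRY = "https://registry.example.com"   # assumed registry
REPOSITORY = "devops/example-service"       # assumed repository
TAG = "1.4.2"                               # assumed version tag


def image_digest(repository: str, tag: str) -> str:
    """Return the content digest the registry stores for this tag."""
    resp = requests.get(
        f"{REGISTRY}/v2/{repository}/manifests/{tag}",
        headers={"Accept": "application/vnd.docker.distribution.manifest.v2+json"},
        timeout=30,
    )
    resp.raise_for_status()
    # Registries return the digest of the manifest in this header.
    return resp.headers["Docker-Content-Digest"]


print(f"{REPOSITORY}:{TAG} -> {image_digest(REPOSITORY, TAG)}")
```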
Our release management process has recently been completely automated. Now, as changes are merged in GitLab, a “next release” Merge Request is maintained automatically. Once approved and merged, this will automatically generate a Changelog, build and test a packaged version of the software and upload it to Google Cloud ready for deployment.
A Changelog generated by our release automation tooling for a recent release of the Ballots of the Regent House application.
Container images uploaded by our release automation to Google Cloud ready for deployment. Note the rapid release cadence.
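As a hedged sketch of the changelog step described above, the example below groups recently merged Merge Requests into changelog sections by label via the GitLab REST API. The project id, label names and section headings are illustrative; our real release automation is described in the guidebook.

```python
"""Sketch: group recently merged MRs into changelog sections by label.

The instance URL, project id, labels and headings are illustrative only.
"""
import os
from collections import defaultdict

import requests

GITLAB_URL = "https://gitlab.example.com"   # assumed instance URL
PROJECT_ID = 42                             # assumed project id
SECTIONS = {"bug": "Bug fixes", "feature": "Features", "security": "Security"}

resp = requests.get(
    f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/merge_requests",
    headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
    params={"state": "merged", "target_branch": "main", "per_page": 100},
    timeout=30,
)
resp.raise_for_status()

changelog = defaultdict(list)
for mr in resp.json():
    # Place each MR under the first section whose label it carries.
    section = next((SECTIONS[label] for label in mr["labels"] if label in SECTIONS), "Other")
    changelog[section].append(f"- {mr['title']} (!{mr['iid']})")

for section, entries in changelog.items():
    print(f"## {section}")
    print("\n".join(entries), end="\n\n")
```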
The packaging of software as container images is not an unusual practice and software which we do not develop ourselves, such as GitLab, is similarly packaged and versioned. The use of container images for software packaging and distribution is now well established within our profession.
Using release version tags, we can deploy the exact same container image, first to a staging environment in order to run integration tests and then to production. Since the container images are identical and our staging and production environments are configured from the same underlying description, we have a high degree of confidence that code working in staging will work in production.
For deployment, we are heavy users of infrastructure-as-code tooling such as terraform. Our deployments are described in code and that code lives in GitLab. Changes to deployments are therefore planned, reviewed, tested and merged like any other code changes. Merging changes will automatically deploy the change to the staging environment and provide an interface in GitLab to trigger deployment to production once any manual integration or approval tests have been performed in the staging environment. Wherever possible, service configuration also lives in GitLab and is automatically applied.
Code within GitLab describing the precise version of software deployed to each environment.
Code within GitLab specifying access control configuration which determines who may access the software and what roles they have.
Production, staging and development environments deployed automatically via GitLab CI pipelines.
As a technical measure, we are currently developing our approach to “rolling release” whereby a change is made available to a small set of users initially with that set growing over time to encompass all users. Such an approach naturally includes the ability to reduce the set of users exposed to a change, ultimately setting the proportion to be zero. This leads to the ability to have both gradual rollout and gradual rollback of changes.
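A common way to implement such a gradual rollout is to place each user in a stable bucket derived from a hash of their identifier and compare that bucket against the current rollout percentage. The sketch below shows this idea only; the feature name and user identifiers are invented and this is not a description of our production tooling.

```python
"""Sketch: deterministic percentage-based rollout of a change.

A stable hash of the user identifier places each user in a bucket, so a
rollout percentage can be raised gradually (or dropped back to zero)
without users flapping between old and new behaviour.
"""
import hashlib


def in_rollout(user_id: str, feature: str, percentage: float) -> bool:
    """Return True if this user should see the feature at this rollout level."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100   # stable bucket in [0, 100)
    return bucket < percentage


# Gradual rollout: 5% of users today, 50% next week, 100% when confident;
# setting the percentage back to 0 is a gradual rollback.
for user in ["abc123", "def456", "ghi789"]:
    print(user, in_rollout(user, "new-dashboard", 5))
```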
Our release and deployment management pipeline is an example of “automate what can be automated and streamline the rest”. Making a new release of our software involves hitting the “Approve” and “Merge” buttons in GitLab. Re-deploying a service, or deploying an update to production, involves hitting a “Play” button on a pipeline, and the most recent deployment can be seen on the environments page in GitLab. Manual triggering of deployments allows teams to deploy at a cadence which suits their product’s needs.
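For completeness, pressing that “Play” button can also be done programmatically. The sketch below finds a manual job on a pipeline and runs it via the GitLab REST API; the instance URL, project id, pipeline id and job name (“deploy-production”) are placeholders.

```python
"""Sketch: trigger the manual "deploy to production" job on a pipeline.

Equivalent to pressing the "Play" button in the GitLab UI; all ids and
names below are placeholders.
"""
import os

import requests

GITLAB_URL = "https://gitlab.example.com"   # assumed instance URL
PROJECT_ID = 42                             # assumed project id
PIPELINE_ID = 98765                         # assumed pipeline id
headers = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}

# Find the manual job in the pipeline that deploys to production.
jobs = requests.get(
    f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/pipelines/{PIPELINE_ID}/jobs",
    headers=headers, params={"scope": "manual"}, timeout=30,
)
jobs.raise_for_status()
deploy_job = next(j for j in jobs.json() if j["name"] == "deploy-production")

# "Press play": run the job, which deploys the already-tested release.
requests.post(
    f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/jobs/{deploy_job['id']}/play",
    headers=headers, timeout=30,
).raise_for_status()
```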
In conclusion, the current deployment process is heavily tested and documented and uses common reusable patterns and libraries, which reduces risks significantly, as all components are used several times a week by many applications.
Database schema migrations¶
Usually schema migrations happen automatically as part of releases. The DORA Core model recommends that schema changes be decoupled from application changes so that they may be rolled forward and backward over some number of releases without breaking compatibility. Although we have been using this model for schema changes, we are still working on formalising automated testing of schema migrations and on automating a database snapshot before a migration is applied. All Google Cloud deployments have automated database snapshotting and backups.
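As an illustration of what a decoupled, backwards-compatible schema change can look like, the sketch below shows the “expand” half of an expand/contract migration written with Alembic. The table and column names and revision identifiers are illustrative rather than taken from any of our products.

```python
"""Sketch: the "expand" half of a backwards-compatible schema change.

The new column is added as nullable, so both the current and the next
application release can run against the same schema; the "contract" step
(enforcing NOT NULL, dropping old columns) ships in a later release once
no deployed version needs the old shape. Names are illustrative.
"""
from alembic import op
import sqlalchemy as sa

# Revision identifiers used by Alembic (placeholders).
revision = "20240101_add_preferred_title"
down_revision = "20231201_previous"


def upgrade() -> None:
    # Expand: old application code simply ignores the new column.
    op.add_column("user_profile", sa.Column("preferred_title", sa.Text(), nullable=True))


def downgrade() -> None:
    # Rolling back the migration does not break the previous release.
    op.drop_column("user_profile", "preferred_title")
```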
Security and governance¶
By using GitLab as the primary source of truth for change management and deployment, we naturally keep an audit log. For each change, the change itself is recorded along with the originator of the change, the discussion which preceded it, the discussion of the change itself, who approved the change, when it was merged, when it was released, which deployments contain that release, who added a release to a deployment, who approved the addition of the release, which environments it is deployed to and who approved that deployment.
Our desire for automation is also motivated by the desire to keep all state related to a service recorded and hence have changes to that state audited. Aside from governance, this can also help with root cause analysis as we determine which of our risk management barriers described below failed and why.
As part of a separate document, we will be describing DevOps’ approach to security and vulnerability management (DevSecOps) but it is worth mentioning here that the focus on strong naming for release versions and infrastructure-as-code for deployment means that we can automatically determine which versions of our software in which environments have known vulnerabilities.
Supply-chain vulnerability reporting and management tooling within GitLab.
GitLab provides a number of security-related policies which can be added to change management processes such as requiring approval from a narrower set of people if a security vulnerability is introduced or periodic scanning of deployed releases to alert about new vulnerabilities.
Risk management and software development¶
The Swiss Cheese model of accident causation notes that any one safety barrier will inevitably be permeable given a specific set of circumstances. “Defence in depth” is the concept of stacking different barriers together so that any one accident requires multiple barriers to be penetrated. We prefer our processes to be stacked atop one another to ensure the safety of change and to manage risk. We want these processes to be automated where possible so as not to needlessly impede agility or velocity while also increasing repeatability and reliability. No process is perfect, and this is why all our processes are constantly being refined.
A defining aspect of DevOps is the “shift left”. This is the practice of moving activities which used to happen at the end of the change process, such as testing and quality assurance, to early in the development process, often before any code is written. This also applies to other aspects of the change process: for example, risk analysis is done at the time of planning, not at the time of release. The impact of the change on the service, and whether the change warrants wider broadcasting, is also assessed at the time of planning, which allows communications work to happen while the feature is being developed, not after it has been finished.
As an example of defence in depth, consider the barriers a change to a typical back-end service must pass:
- The change will have been discussed and refined prior to scheduling in a sprint. This is normally led by a Product Manager in conjunction with a Technical Lead after work from Business Analysts and/or User Experience Researchers and in consultation with stakeholders and/or the Service Owner. That discussion will be stored in GitLab and is available to the person implementing the change.
- Code being changed or added must be tested via an automated testing suite. This suite ordinarily includes unit and regression testing.
- Within the GitLab interface, changes to code are highlighted depending on whether they have been covered by the automated testing. This aids reviewers in determining if changes which implement new features are tested.
- GitLab requires all tests to pass for changes to be merged.
- The software is packaged into a “container image”. Additional automated tests (including security tests) are run against the container image so that the exact packaged version of code is tested.
- The change is manually reviewed by another team member.
- The change may not be merged in GitLab until approved by another team member.
- We deploy changes into a staging environment which is a close analogue of the production environment. This provides a final opportunity for full integration testing in advance of deployment to production. For products where it is feasible to do so, we may run automated integration tests against the staging environment alongside manual testing (a minimal sketch of such a check follows this list). Since we use infrastructure-as-code, differences between production and staging are explicit and documented, as well as tested. The deployment to staging happens automatically with each new release.
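The following is a minimal sketch of the kind of automated integration check that can run against staging after deployment. The STAGING_URL value and the endpoints are illustrative; real suites are product-specific.

```python
"""Sketch: a post-deployment smoke test run against the staging environment.

The STAGING_URL environment variable and the endpoints are illustrative.
"""
import os

import requests

STAGING_URL = os.environ.get("STAGING_URL", "https://staging.example.cam.ac.uk")


def test_service_is_up_and_serving() -> None:
    resp = requests.get(f"{STAGING_URL}/healthz", timeout=10)
    assert resp.status_code == 200


def test_unauthenticated_access_is_refused() -> None:
    # A protected endpoint should demand authentication rather than leak data.
    resp = requests.get(f"{STAGING_URL}/api/v1/profile", timeout=10)
    assert resp.status_code in (401, 403)
```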
Our focus on automated code review, testing and immutable naming for packaged code means that when a change is reviewed, we can have confidence that the basic safety checks have been completed. We can track that change from inception to deployment and have confidence in what changes are running in production.
When we do have an incident, we take time to learn the root cause and re-evaluate our barriers in the Swiss Cheese model to prevent a recurrence. This re-evaluation should ordinarily lead to the amendment of one or more barriers and/or the addition of a new one. An example of this would be the additional technical “fail safe” protections added to our GitLab backup process in the wake of an incident early in the life of that service.
Communicating changes¶
Although changes should be undramatic, frequent and regular, this is not always the case. For certain changes an announcement is required. For the small number of services where UIS is our sole customer, we use media such as the CAB to broadcast change information. For the majority of our services, we need to broadcast change more widely.
The communication channel will depend on the type of product. For many of our products, UIS is not seen as the department responsible for them. For example, communicating changes to our admissions process will be led by the relevant business unit, in this case the Admissions Office. Equally, user support will also be led and owned by them, as they will be the ones receiving user queries or facing the impact of any changes to these products.
We often also include communication channels within our applications. For example, we use GitLab functionality which displays a message to all of its users to advise them of upcoming maintenance.
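Such a banner can itself be scheduled through automation. The sketch below posts a maintenance broadcast message via the GitLab REST API; it requires administrator access, and the instance URL and maintenance window shown are placeholders.

```python
"""Sketch: schedule an in-application maintenance banner via the GitLab API.

Broadcast messages require administrator access; the instance URL and the
maintenance window below are placeholders.
"""
import os

import requests

GITLAB_URL = "https://gitlab.example.com"   # assumed instance URL

requests.post(
    f"{GITLAB_URL}/api/v4/broadcast_messages",
    headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
    data={
        "message": "GitLab will be unavailable on Sunday 02:00-04:00 for maintenance.",
        "starts_at": "2024-01-14T01:30:00Z",
        "ends_at": "2024-01-14T04:00:00Z",
    },
    timeout=30,
).raise_for_status()
```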
Monitoring and event management¶
Our standard Google Cloud deployment configurations include monitoring and alerting. Out of the box, we get:
- worldwide availability checking, and
- Transport Layer Security (TLS) certificate validity checking (a minimal sketch of such a check follows this list).
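For illustration, the sketch below reproduces the essence of a TLS certificate validity check using only the Python standard library. The hostname is a placeholder; in practice Google Cloud performs these checks for us out of the box.

```python
"""Sketch: the kind of TLS certificate validity check our monitoring performs.

The hostname below is a placeholder.
"""
import socket
import ssl
from datetime import datetime, timezone

HOSTNAME = "example.apps.cam.ac.uk"   # assumed hostname


def days_until_certificate_expiry(hostname: str, port: int = 443) -> int:
    """Connect over TLS and report how many days remain on the certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (expires - datetime.now(timezone.utc)).days


print(f"{HOSTNAME}: certificate expires in {days_until_certificate_expiry(HOSTNAME)} days")
```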
The majority of our products deployed to Google Cloud include auto-scaling functionality which scales the application according to the load it is experiencing at any given moment.
Alerts are received and acted upon by the team managing and developing the service.
Google Cloud allows additional alerting such as increased error rate, abnormal traffic spikes or impending exhaustion of storage, memory or processing resources. Alerts may be raised by email and/or Microsoft Teams message. The precise set of alerts is usually product-specific based on the needs of the management team.
Cloud infrastructure updates¶
Google Cloud, like other major cloud providers, provides many of the components of our application architectures as Platform as a Service (PaaS): SQL databases, object storage, Kubernetes clusters, serverless containers, load balancers, etc.
Many of these PaaS services are used across all our cloud deployments. Upgrades to the operating system (OS), to system versions (e.g. the PostgreSQL or Kubernetes version), and other changes required to keep the underlying infrastructure of these PaaS services up to date are all managed by the cloud provider, in this case Google Cloud. This means that we have very little control over when these updates and changes will happen. We can only specify maintenance windows (e.g. Sundays from 2 AM to 4 AM) which the cloud provider uses to decide when to apply those changes. We are not notified of when these changes are going to happen.
With the majority of our services deployed to the cloud and run using many PaaS services, changes to our applications’ underlying infrastructure are happening all the time even if we do not make any changes to the applications directly.
Suppliers and customers¶
In this section we use “supplier” and “customer” in the SIPOC sense.
When we design a service which is interacting with some other service, we document which contract we are assuming. For example, when we deploy applications, we use a hosting platform which publishes a runtime contract. So long as we package our software in accordance with that contract, we can change the details of implementation at will. This is an output contract. Similarly, we publish a machine-readable specification for our inputs. An example of this would be the University Card API contract. So long as we continue to accept inputs according to that specification, we are free to change details of implementation.
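As an illustration of checking inputs against a machine-readable contract, the sketch below validates a document against a JSON Schema fragment. The schema and field names are invented for the example and are not the real University Card API specification.

```python
"""Sketch: checking that an input document honours a published contract.

The schema below is a made-up fragment, not the real University Card API
specification; the point is that inputs are validated against a
machine-readable contract, so implementation details behind it can change.
"""
from jsonschema import ValidationError, validate

CARD_HOLDER_SCHEMA = {   # illustrative fragment of an input contract
    "type": "object",
    "required": ["crsid", "card_status"],
    "properties": {
        "crsid": {"type": "string", "pattern": "^[a-z]+[0-9]+$"},
        "card_status": {"type": "string", "enum": ["ISSUED", "REVOKED", "EXPIRED"]},
    },
}

# Accepted: matches the documented contract, however we implement it internally.
validate({"crsid": "spqr1", "card_status": "ISSUED"}, CARD_HOLDER_SCHEMA)

# Rejected: a consumer relying on undocumented behaviour gets a clear error.
try:
    validate({"crsid": "spqr1", "card_status": "LOST"}, CARD_HOLDER_SCHEMA)
except ValidationError as exc:
    print(f"Contract violation: {exc.message}")
```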
This approach does not guarantee that behaviour will never change inadvertently because of bugs or errors, but it does delineate the interfaces we do not intend to change, or interfaces where we would advise consumers of upcoming change. If an unannounced change is observed, we can be alerted and, if necessary, roll it back. Similarly, changes to the documented interfaces of supplier systems are changes we may need to be made aware of, but we do not necessarily need to be aware of changes to their internal implementation.
Some intra-UIS systems are currently ad hoc and/or undocumented. We prefer to reduce the number of such interfaces over time and move to a model where our systems are based on documented interfaces and runtime contracts, whether the suppliers and/or customers of these are UIS or external.
Dealing with technical debt¶
Although this document describes our best practices, some of our products do not yet align with this model because of technical debt. Technical debt arises for several reasons: sometimes because applications depend on other applications through ad hoc or undocumented interfaces; sometimes because of the use of frameworks or technologies that the organisation never standardised on; and sometimes because services have been manually deployed to infrastructure and no documentation exists.
The use of automated testing at multiple points within the development lifecycle helps us have confidence that refactoring changes intended to pay off technical debt do not adversely affect functionality.
By keeping our newer services and products cohesive, orthogonal, loosely coupled and relying on documented interfaces and contracts we hope to reduce technical debt wherever possible. This will not always be possible.
Conclusion¶
This document has outlined some aspects of DevOps’ approach to change. We covered the guiding principles of design which we use and some of the processes we follow, aligned against ITIL 4 management practices. Our guidebook provides more in-depth technical information.
Our main challenges at the moment are dealing with technical debt within inherited services, gaining a view of the dependencies between these inherited services and other services, and the lack of documentation of the interface contracts between them. This currently prevents us from applying the DevOps change management best practices described in this document.
Bibliography¶
Those interested in more details of our workflows, processes and use of technology may find the following pages from our guidebook useful.
- DevOps’ approach to release automation.
- The use of GitLab labels for project management.
- DevOps’ culture and values.
- How we encourage peer review of changes.
- Day-to-day working in DevOps.
- The lifecycle of a feature in GitLab.
- Technical process for proposing and integrating changes.
- The use in DevOps of terraform for deployment automation.
- A tutorial on creating a Python package which demonstrates some of DevOps’ automated quality assurance processes.