Hibri Marzook Musings on technology and systems thinking

Terraform Code Smells - The Middleman

Infrastructure as Code has code smells (its code too !!) and as we work more with it, it’s important to be aware of code smells that indicate a deeper problem.

The Middleman doesn’t do anything other than delegate to a native resource

Let’s look at an example;

module storage_account {

	source = "./module/resource-group"

	name = var.storage_account_name

	resource_group_name = var.resource_group_name

	location = var.location

	account_tier = var.account_tier

	account_replication_type = var.account_replication_type

	tags = {

		environment = "staging"

	}

}

Here we see a module that creates a storage account. Let’s look inside the module itself.

  
resource "azurerm_storage_account" "storage_account" {

	name = var.storage_account_name

	resource_group_name = var.resource_group_name

	location = var.location

	account_tier = var.account_tier

	account_replication_type = var.account_replication_type

	tags = var.tags

}

The module has a single resource. The variables are delegated to the resource. There are no collaborating resources inside the module.

Why is this a code smell?

The module itself does not add any value over using the native resource directly, and is a thin wrapper around the single resource. A Terraform module is a group of resources that work together cohesively.

This code smell appears because of the need to control what types of resources are used in the organisation’s context. We can use Azure Policy and AWS Guardrails to achieve the same result

Another reason is the need to deploy resources consistently across the organisation. For example, to apply tags consistently. Again we can use policies to enforce consistent rules.

Does it satisfy the five CUPID properties?

  • Unix philosophy: Does it do one thing well? it does, but all it does is delegate. The work is done by the resource that is wrapped in the module
  • Predictable: Does it do what you expect? It’s unclear why a module is used than using the resource directly
  • Idiomatic: Does it feel natural? It introduces an extra level of in-direction compared to using the resource directly
  • Domain-based: does the solution model the problem domain? The module doesn’t model the domain, it duplicates what Terraform does

Treatment

Avoid wrapping single resources with modules. Terraform modules are composed of resources that work together to provide a cohesive building block. Get rid of the middleman.

Terraform Code Smells - The Confused Module

Infrastructure as Code has code smells (its code too !!) and as we work more with it, it’s important to be aware of code smells that indicate a deeper problem.

A Confused Terraform module has parameters, that change the behaviour of the module. Different parameter value combinations allow the behaviour to change. The module is confused about its identity. Let’s look at an example;

module "secure_vpc" {

	source            = "./aws_vpc"
	name              = "mysupersecurevpc"
	range             = "192.168.0.0/24"
	region            = "us-east-1"
	allow_egress      = true
	enable_monitoring = true
}

This is an example of a module that creates a VPC. The module declaration has parameters that change its behaviour. There is the option for the VPC to allow egress outside the network, and the option to turn monitoring on or off. This is a simple example, but modules tend to have long lists of parameters, with switches that change the module’s behaviour. A consumer of the module has to specify the right combination of switches for the behaviour they want.

switches.png

Why is this a code smell?

The default behaviour of the module isn’t clear, without reading the documentation (or the code for the module). It’s easy to accidentally allow egress and deploy the VPC to an insecure state. If allowing egress implies that monitoring is turned on (to make sure that traffic that goes out of the network is monitored), the optional ‘enable_monitoring’ flag needs to be set. The onus is on the consumer of the module to ensure that the VPC is deployed in a compliant state.

The switches in the parameter list proliferate into the code, and each time we add a new capability we have to guard against breaking existing capabilities.

Does it satisfy the five CUPID properties?

  • Unix philosophy: Does it do one thing well? The module has multiple behaviour depending on the combination of switches
  • Predictable: Does it do what you expect? It’s easy to accidentally deploy the module in a non-compliant state
  • Domain-based: does the solution model the problem domain? It’s not clear what this module offers over a the out of the box aws vpc resource

Treatment

Create modules that encapsulate the behaviour rather than make it optional. For example;

module "secure_vpc_with_egress" {
	source            = "./aws_vpc_with_egress"
	name              = "mysupersecurevpc_with_egress"
	range             = "192.168.0.0/24"
	region            = "us-east-1"
}


module "locked_down_vpc" {
	source            = "./aws_locked_down_vpc"
	name              = "mysupersecurevpc"
	range             = "192.168.0.0/24"
	region            = "us-east-1"
}

In the example, we’ve created a module that is explicit about its behaviour. A consumer of the module is clear about what it does and doesn’t have a choice of turning monitoring on or off. There isn’t the option to deploy the module to an insecure state.

Each module now does one thing well, and it’s behaviour is predictable. The name of the module describes the behaviour in the context that it’s being used.

Both modules can now evolve without having to worry about breaking the functionality of the other.

How to create a secure and postive Developer Experience on Azure

An anti-pattern I’ve seen in security-conscious organisations is, access to the Public Cloud provider’s console is restricted in development environments. In a recent project I worked on, developers did not have access to the Azure portal to view, debug and test changes in a non-production environment. The restricted access impeded the developer experience (DevEx), negating the Public Cloud’s productivity benefits.

Let’s look at how we can safely improve the developer workflow and mitigate the risks associated with giving developers more autonomy.

The problem - restricted developer experience

Access to the portal was given only to the Ops team. The team did not have access to the Azure subscription through the Azure portal to verify changes or debug problems. The team had to engage the Ops team to look at the problem, usually via raising a ticket and then explaining the problem and waiting for a response.

This mode of working is a DevOps anti-pattern. A developer should have the freedom to resolve their issues without having to depend on another team.

The handoff creates a significant delay in the development workflow. It increases the development cycle from minutes to hours and days. The long debugging cycle and handoffs incur costs on both the development team and the Ops team.

The developer is blocked, and the wait-time increases development costs. An Ops team member has to be interrupted to un-block a developer. Resources are diverted from making platform improvements, creating a negative loop. I call this the Wall of Confusion.

Restricted Developer Experience - Wall of Confusion

What should an ideal developer experience look like?

The developer writes infrastructure as code (IaC) to create and configure Azure. The developer checks in code and tests to source control. The code commit triggers a pipeline, which runs the IaC to make the intended changes first in the development (Dev) environment.

The pipeline then executes the tests automatically to test for regression and that the new functionality works. When the tests pass (the build is green), the changes are propagated to the following environment.

Unrestricted Developer Experience

In the Dev environment, a developer will need to eyeball the changes made to Azure resources to verify that the resources were created and configured as expected. A successful run of the pipeline may not always indicate a successful deployment.

If a test fails, the developer will need to use the Azure portal to check the state of resources related to the failure. The developer should also have access to manually create new resources to test the latest changes before applying the fix to code and committing the change to source control, which triggers the pipeline.

The developer goes through the development workflow described above many times (minimum of 50 to 100 times) a day and runs through the whole workflow in minutes. The quality of the system is improved when the developer has a fast feedback loop.

The impact of a poor developer experience

Cochran (2021) shows a simple representation of how developers use feedback loops and a comparison of the time taken for developer activities in the restricted (low-effective) environment and un-restricted (high-effective) environment.

Feedback Loops during feature development (Cochran, 2021)

The important observations are;

  • Developers will run the feedback loops more often if they are shorter
  • Developers will run tests and builds more often and take action on the result if they are seen as valuable to the developer and not purely bureaucratic overhead
  • Getting validation earlier and more often reduces the rework later on
  • Feedback loops that are simple to interpret results, reduce back and forth communications and cognitive overhead

How to improve the developer experience safely?

So how do we improve the developer experience whilst keeping production data and sensitive configuration secure?

We need to start with some. Restrictions. Developers will only have permissions to view, manage and debug resources via the Azure Portal in lower environments such as Dev and Test.

Developers will only have read-only access in higher environments, such as Pre-Prod and Prod, for the following;

  • Monitor application health and alerts
  • View resource costs and budget limits
  • Error messages and logs

Developers will be not be given permissions to view the following in higher environments;

  • Sensitive configuration settings
  • Data in databases
  • Key vault secrets
  • Logs with sensitive data

We leverage Role-Based Access Control (RBAC) to manage access. Here’s an example using Azure’s built-in roles.

Levels of access across environments

The built-in Azure roles, which will be assigned to the user group that the developer is in. It’s expected that the RBAC assignments will be at the management group level and not at the subscription level.

To make this work, subscriptions are associated with product team-specific Azure management groups to ensure proper configuration and control of subscriptions.

The only path to a Production environment will be via pipelines in Azure DevOps, using code from source control. Changes made via the portal will not be propagated to higher environments and can be overwritten on the next pipeline run.

Mitigating the risks associated with the approach

There are common arguments against the approach described above and the risks associated with it. However, there are effective mitigations.

  1. A developer can accidentally destroy resources belonging to other teams. We mitigate this risk by scoping Contributor access only to the product team-specific Azure development subscription, and mistakes are not propagated to other teams. The team should be able to re-create resources from source control.
  2. Shared platform components can be deleted. Again, this risk is mitigated by having dedicated subscriptions for shared platform components. Only the teams responsible for those components have access.
  3. A developer can accidentally expose data to the Internet. We mitigate this risk by having guardrails using Azure Policy, which are applied at the management group level. The policy prohibits the creation of resources that will allow data egress to the Internet. We allow ingress and egress to only via shared services, which have the appropriate governance built into them. There should also be no live data in a development environment.
  4. Developers will be able to modify production infrastructure. Azure Portal access in higher environments is read-only and restricted to observability. Temporary break-glass access to Prod is catered to through Azure Privileged Identity Management.
  5. Costs could increase if developers can create resources. Subscriptions are assigned to product teams, who will own the cost of running the subscription. When developers have access to the billing dashboard, they can self manage expenses. The cost of lost developer productivity is often greater than the cost of infrastructure.

What do we get when we unblock the developer experience?

Adopting the approach has positive benefits, even though managing RBAC at scale requires extensive automation.

  • The developer experience is positive and creates a high effective environment. A highly effective environment is where there is a feeling of momentum; everything is easy and efficient, and developers encounter little friction
  • Product teams can launch new services quickly
  • The developer can observe the application in a production environment, creating a feedback loop that reinforces a positive culture
  • The developer can take advantage of existing Azure knowledge and familiarity with the Azure Portal to diagnose problems independently

References

Cochran, T. (2021) Maximizing Developer Effectiveness. Available at: https://martinfowler.com/articles/developer-effectiveness.html (Accessed: January 11, 2021)

Is DevOps adoption a 'wicked problem'?

Horst Rittel and Melvin Webber coined the term’ wicked problems’ in the article ‘Dilemmas in a General Theory of Planning’ (Rittel and Webber, 1973).

Wicked problems are described as;

‘ill-defined, ambitious and associated with strong moral, political and professional issues. Since they are strongly stakeholder dependent, there is often little consensus about what the problem is, let alone how to resolve it. Furthermore, wicked problems won’t keep still: they are sets of complex, interacting issues evolving in a dynamic social context. Often, new forms of wicked problems emerge as a result of trying to understand and solve one of them.’

Rittel and Webber describe the following features of wicked problems in comparison to tame problems.

  1. There is no definitive formulation of a wicked problem
  2. Wicked problems have no stopping rule
  3. Solutions to wicked problems are not true or false, but good or bad
  4. There is no immediate and no ultimate test of a solution to a wicked problem
  5. Every solution to a wicked problem is a “one-shot operation”; because there is no opportunity to learn by trial and error, every attempt counts significantly
  6. Wicked problems do not have an enumerable (or an exhaustively describable) set of potential solutions, nor is there a well-described set of permissible operations that may be incorporated into the plan
  7. Every wicked problem is essentially unique
  8. Every wicked problem can be considered to be a symptom of another problem
  9. The existence of a discrepancy representing a wicked problem can be explained in numerous ways. The choice of explanation determines the nature of the problem’s resolution
  10. The planner has no right to be wrong

When I look at the common problems associated with DevOps adoption in large organisations;

1. There is no definitive formulation of a wicked problem

There isn’t a clear definition of the problem the organisation is trying to solve. DevOps adoption isn’t always tied to a goal or purpose of the organisation. However, sometimes and eventually, the organisation can define the problem they are trying to solve.

2. Wicked problems have no stopping rule

DevOps adoption is an on-going change for large organisations. There rarely is an end-state. There is a failure to recognise that this is on-going continuous change (‘We have transformed’).

3. Solutions to wicked problems are not true or false, but good or bad

Depending on the organisational context, approaches to DevOps adoption differ. What worked for Spotify might not work for a government organisation. However, it’s good enough and works in that context.

4. There is no immediate and no ultimate test of a solution to a wicked problem

Finding a solution to one aspect of adopting DevOps can have repercussions elsewhere. For example, using the Public Cloud creates challenges in the organisation’s approach to governance. Employees find that they need to learn new skills and change their behaviour.

5. Every solution to a wicked problem is a “one-shot operation”; because there is no opportunity to learn by trial and error, every attempt counts significantly

The approach to DevOps adoption is unique to each organisation. I see organisations struggle when they blindly adopt an approach that worked somewhere else. Practitioners also fall into this trap of using a templated approach. The problem has to be solved within a limited timeframe to get a return on investment in DevOps. There is resistance to experimentation and making mistakes.

6. Wicked problems do not have an enumerable (or an exhaustively describable) set of potential solutions, nor is there a well-described set of permissible operations that may be incorporated into the plan

There is no single definitive playbook for DevOps adoption. However, there are specific solutions in the DevOps toolbox to each of the unique problems. The sequence of steps needed to apply the particular solutions in each organisation is unique.

7. Every wicked problem is essentially unique

Each organisation’s approach to adopting DevOps is unique. It’s not repeatable across organisations (it’s not even repeatable within departments inside the same org). The adoption approach has to be reformulated each time. It can’t be planned up-front.

8. Every wicked problem can be considered to be a symptom of another problem

The barriers to DevOps adoption are elsewhere. Recruitment, org structure, culture and lack of diversity, to name a few. The root causes lie outside of the immediate situation of adopting DevOps.

9. The existence of a discrepancy representing a wicked problem can be explained in numerous ways. The choice of explanation determines the nature of the problem’s resolution.

It’s a multi-faceted problem, and people will have various explanations for the problem. Some may not even see it as a problem, and it’s hard to get a single view of the problem.

10. The planner has no right to be wrong

DevOps adoption defies structured planning, yet organisations demand plans. The planner is held to account when ‘DevOps’ still hasn’t been adopted after 6 months. There is no linear path to DevOps adoption.

A paper by the Australian Public Service Commission looking at the public policy challenges to tackling wicked problems (Australian Public Service Commission, 2018) expands on Rittel and Webber’s list of features by adding.

Wicked problems involve changing behaviour. The solutions to many wicked problems involve changing the behaviour and/or gaining the commitment of individual citizens. The range of traditional levers used to influence citizen behaviour—legislation, fines, taxes, other sanctions—is often part of the solution, but these may not be sufficient. More innovative, personalised approaches are likely to be necessary to motivate individuals to actively cooperate in achieving sustained behavioural change.

DevOps adoption, at its core, requires individuals to change their behaviour. For example, developers need to learn how to write tests first and make time for learning. Project managers need to learn how to deal with uncertainty. Senior leadership needs to find out what incentives can be used to motivate individuals to cooperate in achieving sustained behavioural change. Organisations that have previously relied on hierarchy to function have to find new ways to work in a collaborative manner.

Why do I ask?

The reason I ask; Is DevOps adoption a ‘wicked problem’ in large organisations?

If the answer is yes, we must contextualise our approach to DevOps adoption in large organisations. We should also avoid misclassifying the problem as simple or ‘tame’.

Correctly identifying the problem is half the battle, allowing us to use creative and holistic approaches to engage with the situation. It helps to set realistic expectations of change and not go into it blindly with a 14 sprint plan.

References

Australian Public Service Commission (2018) Tackling wicked problems : A public policy perspective, . Australian Government. Available at: https://www.apsc.gov.au/tackling-wicked-problems-public-policy-perspective (Accessed: 21 December 2020).

Rittel, H. W. J. and Webber, M. M. (1973) “Dilemmas in a general theory of planning,” Policy Sciences, 4(2), pp. 155–169. doi: 10.1007/bf01405730.

Embracing Uncertainty - Notes from Lean Agile Scotland 2019

The first keynote “In defence of Uncertainty” by Abeba Birhane set the tone for the conference. She made the case for why searching for certainty is futile. Ilya Prigogine and Isabelle Stengers in their book Order out of Chaos, say

Traditional science in the Age of the Machine tended to emphasize stability, order, uniformity, and equilibrium. Whereas most of reality, instead of being orderly, stable, and equilibrial, is seething and bubbling with change, disorder, and process.

By using technology we reduce reality, into simpler and manageable pieces. However, this reductive process blinds us to the complex nature of reality. This can and does cause great harm. The systems we build have biases. People are being disadvantaged by technology and they don’t know it. Our blindfolds of certainty cause more harm than good.

Abeba makes us aware of the harm caused by biased and ill thought out solutions. Examples are Amazon’s AI recruiting tool, a blockchain based app for sexual consent and biased ML software to predict future criminals.

We alter reality with technology - Abeba Brihane

Abeba suggests a different approach.

Its not about the ‘right answers’ but asking meaningful questions

Push more towards understanding and less towards prediction

In place of final, and fixed answers, we aspire for continual negotiation and revision

In place of certainty, we emphasise in-determinability, partial-openness and unfinalizability

Partial-openness leaves room continual dialogue, negotiation, reiteration and revision

The first keynote set the stage for the second keynote “People are the adaptable element of Complex Systems” by John Allspaw. I read Web Operations and the Art Capacity Planning a while back, and it’s interesting to see how John helps the software development community understand Resilience Engineering. At the start of the keynote, he asks;

“If you remove the people keeping systems running, how many systems would keep running?”

Allspaw presents a model which can be used as a lens to look at how we interact with systems

Above the line representation

We can use incidents to find out what happens above the line. We need to look deeper into incidents to figure out what happens and how people adapt. Incidents are too complex to be filed under post-mortems. We should learn from incidents. Use incidents as a rich seam of information for research instead of recording incidents to be filed and forgotten.

Magdalena Kiaca and Monica Madrid Costa ran a Visual Faciliation Workshop run by. I’ve been trying to create sketch notes with varying degrees of failure. The workshop helped with a methodical approach.

Sketchnote Guides

Attended my first event storming workshop after hearing good things about it from Nick, but this was my first opportunity to see how it’s done by Kenny Baas-Schwegler

Kenny pointed me to a cheat sheet he published.

Finally, Adarsh Shah and I gave a talk on “Using the Toyota Improvement Kata to solve hard problems”

Lean Agile Scotland is going on the annual conference list. Reminds me of attending XP Days London in the early days of my Agile learning.