Hibri Marzook Musings on technology and systems thinking

Terraform for grownups - A model for scaling Terraform workflow in a large organisation

It’s very easy to run terraform from a laptop and go wild managing your infrastructure. But what do you do when you start working on more complex infrastructure. How do you avoid running into each other ?

This is not about scaling infrastructure, but about scaling working practices and architecture so that you can support a large landscape of teams and applications.

More importantly, how do you scale the workflow to support working with more of your team and other teams. This is a model used in a recent project for a large organisation to support ~200 developers in multiple teams.

Nicki Watt coined the term “Terraservices” at HashiDays London 2017. This view of Terraform modules is key to scaling.


A Terraservice is an opinionated Terraform module that provides an architectural building block. A Terraservice encapsulates how your organisation wants to build a particular component.

For example, this is a Terraservice to build a Web Application Service on Azure.

module "frontend" {
  source   = "git::github.com/contino/module-webapp?ref=master"
  product  = "${var.product}-frontend"
  location = "${var.location}"
  env      = "${var.env}"
  app_settings = {
    REDIS_HOST                   = "${module.redis-cache.host_name}"
    RECIPE_BACKEND_URL           = "${module.recipe-backend.url}"

This Terraservice encapsulates how the Web Application should be configured. Scaling parameters, logging, alerting and sane defaults are set inside the Terraservice. The Terraservice exposes an interface that doesn’t require the consumer to know too much about the details of the underlying infrastructure and is opinionated towards productivity.

In this instance, the Terraservice is consumed by an application development team, to build out their product stack.

Terraservices have their own pipelines

A Terraservice has it’s own CI/CD pipeline. The pipeline runs unit, integration and compliance tests. It can even autogenerate documentation.

The integration and compliance tests run in a sandbox environment, and spin up their own test harnesses to exercise changes the Terraservice.

At the end of the pipeline, when tests are green, the Terraservice is versioned and tagged.

This ensures that a consumer picks a specific published version or stays on master to be on the bleeding edge. Keeping master stable and use Trunk Based Development practices.

Terraservices and Terraform Projects

Terraservices are used to compose infrastructure in Terraform projects. The Terraform projects each have their own pipelines. Terraform projects use remote state to share outputs, which are consumed by other Terraform projects. This is how it all flows together.

It’s the Terraform project pipelines that build actual infrastructure. They are decoupled from the application pipelines intentionally. The pipelines create empty environments that code can be deployed to. The master branch of a Terraform project reflects the current state of infrastructure. Each environment is built the same way. Environment specific variables can be tuned to change the behaviour, but once a change is committed to master, the pipelines apply the change to all environments as long as each stage in the pipeline stays green.

There can be more automated quality gates between pipeline stages but it’s important to ensure that no infrastructure used, is built manually. Building infrastructure via delivery pipelines early on, helps drive out issues that can hit your workflow later when more people work on it. An automated process is easier to improve, than a manual one.

Pick the right architecture

This is a delicate balance to achieve and there is no right answer. Pay attention to the automation pain points. Some infrastructure components can take a long time to provision. Split these out into their own Terraform projects so that they can be pre-provisioned, to be consumed by other downstream Terraform projects.

Keep core network and components that deal with core security away from Terraform projects that deal with application stacks. The application infrastructure should have dependencies on core components and use what has already been created but not be able to change it.

The infrastructure should be architected in a way that an application infrastructure can be destroyed without affecting other applications or core components.

Pick the right team structure

A crucial part of this is how the teams are organised and use Terraservices. A platform engineering team is needed to shepherd Terraservices and continuously drive the right ways of working. This team has to look across the organisational landscape and define how the Terraservices are built, ensure they help teams build their applications quickly and provide enough so that teams don’t have the need to build their own infrastructure differently.

The Terraform projects which drive the application stacks are under the control of the application teams. This means that they can iterate rapidly, and use components from an approved library of Terraservices to compose their infrastructure.

The Terraservices themselves, are maintained by a platform engineering team, but can accept pull requests from anyone in the organisation, using an internal OSS model. The role of this platform engineering team is to provide enough tooling for teams to self-serve.

Will it work for us?

This model provides a way for large organisations to allow individual product teams to build their own infrastructure, whilst providing control over how they are built.

What this doesn’t cover is, what if a team decides to use Terraform primitives instead of the available Terraservices?

This can be solved via peer reviews and the use of a Janitor monkey, which ensures that only components built the right way are allowed to exist.

In addition, Terraservices provide a happy path for teams to get up and running quickly, and are less likely to build their own.

Further reading

  1. Evolving Your Infrastructure with Terraform
  2. How to use Terraform as a team
  3. Terraform Recommended Practices

Walking the Skeleton

The Walking Skeleton pattern, encourages early delivery, of a minimal system to an environment where a user could potentially use it.

Alistair Cockburn describes the walking skeleton, in the context of software as follows;

A Walking Skeleton is a tiny implementation of the system that performs a small end-to-end function. It need not use the final architecture, but it should link together the main architectural components. The architecture and the functionality can then evolve in parallel

This post is not about the walking skeleton. It’s about using the delivery of the walking skeleton to surface the dysfunctions an organisation has, in delivering software.

Many teams struggle with delivery of their first feature. Even when the feature is built, it languishes for days, even weeks before someone uses it and gives feedback.

Why should we incur such a huge initial startup cost ?

I’d like to take the analogy of “walking” the skeleton, all the way from a developer’s work station to the user. Here are some roadblocks that stand in the way of walking the skeleton to the user.

Some common roadblocks are;

The shared test environment: This is the only environment available for testing. New and old services are dumped here, because at some point in the past it was a copy of the production environment, and everything has to be tested here.

This dysfunction is a result of not having an automated way to spin up an isolated infrastructure for testing.

The DNS Czar: A single person is responsible for creating DNS entries. Forms have to be filled in triplicate, and a few meetings had before a service can be hosted. This leads to hard coded IP addresses scattered around configuration files.

Again this is a dysfunction of not having an automation first approach.

Fear of the new: A new way of automation or a change to an existing process finds resistance. If the metaphorical skeleton dons a Jet pack, the request is denied. This causes much antagonism between devs and ops.

This leads to a suboptimal deployment processes, as development teams think of workarounds to get things deployed.

This dysfunction, is a result of a lack of communication between teams, and shared ownership of common code. Teams tend to focus only on their own deliverables.

Lack of a shared deployment pattern: There isn’t a shared deployment pattern across the organisation. There will be isolated examples of great and good enough patterns, but these aren’t shared. Each time something new is deployed the wheel has to be reinvented and a shared delivery framework does not emerge.

In the projects I’ve observed, the first week or more of a new project (a new idea) is spent dealing with these dysfunctions. That is one week without customer feedback loop. It is also one week of wasted productivity.

Being sensitive to and removing the roadblocks that slow down the delivery of a walking skeleton, will help reduce the cost of launching a new idea.

The ideal goal is to reduce the time and cost of launching a new idea as close to zero as possible.

Continuous Delivery - Going from rags to riches

My last year and a half has been spent guiding a team, working on a legacy codebase, to a level of maturity where they can deliver reliably.

What we took were a series of small steps, working within the constraints we had, and slowly working our way through to a higher degree of continuous delivery maturity.

My aim is to show the first few small steps we took, and to show that applying CD practices can be done in an iterative manner. With each iteration building on what was done before.


The team was in a chaotic stage. They were responsible for delivering an API for the main mobile app, building on the Microsoft .Net stack. They had just migrated their code away from TFS, to Git, and were starting to use JIRA for rudimentary tracking.

There was no CI, or automation. One of the things that I observed during the early days was the chaos around deployments. Every other deployment was rolled back. The testing team would take a couple of days to test a release. The definition of done was when a release was thrown over the wall to the testing team.

Introduce a basic level of monitoring

Introducing monitoring early on helps the team quantify the response to outages. Not every outage needs the whole team to drop what they are doing and work on the outage. The response can be quantified. The team becomes proactive in dealing with outages. This gives the team a bit more breathing room.

We introduced NewRelic, which gave us out of the box performance and error monitoring and alerting. We hooked it up to PagerDuty to get alerted as soon as issues occur.

Before the introduction of NewRelic we had no visibility of how the API was performing in production, and didn’t even know if was working properly. The only visibility we had of production issues was when customer support calls increased above the usual level.

This gave us a tiny start on CD. The team knew what was going on in production, and were eager to take ownership, without waiting for someone to assign a task to them. We were able to quantify what we were dealing with. We added more monitoring tools later on.

Visibility and limiting work in progress.

Get the team to work on only one thing at a time. A team in a chaotic state needs to get in to the rhythm of finishing work in progress and delivering it. It’s important to stress on a “Ship it” mindset early on and increase the WIP limits when the team can deliver reliably.

When a team is in a chaotic stage, they are already juggling multiple priorities, and don’t have the time to focus on doing one thing well. Limiting work in process helps the team to focus on what they are doing. I still recommend keeping WIP limits very low even when the team is doing well.

Visualise the work the team is doing. Put up the classic card wall, use electronic tracking tools such as JIRA purely for tracking.

A physical Kanban board in the team area empowers the team to take ownership of what they do. They can show what they are working on, and the act of simply working generates tangible artefacts.

Electronic task boards are a hinderance at this stage of a team’s maturity. Usually the electronic task boards are owned by someone else outside the team. The team doesn’t have the sense of ownership of their own process.

Visibility and limiting work in progress allows the team to be clear to everyone else involved on what they are dealing with.

Continuous Integration

Introduce CI as soon as possible. It’s not necessary to build the whole pipeline and an automated pipeline can co-exist with manual steps. The aim should be to convert existing manual steps into automated steps run by a CI server, whilst pulling code from version control.

The first step on our CI pipeline was only a compile step. Getting to this stage was tough. It involved chasing down dependencies that were not checked in or documented. We then used the artefacts generated by the build server to do manual deploys.

We did this first because, even though the team was using version control, deployment artefacts were built on a developer’s machine. Moreover, the artefacts could only be built on a couple of key developers’ machines. Builds had to wait for someone to generate them.

When you don’t have anything else put everyone together

Communication is key during the early stages of helping a team. Good communication builds trust between different members of the team. Focus on building trust between team members. A team in a dysfunctional state, has low trust communication between team members.

Encourage pairs between testers and developers as this creates a tight feedback loop between a developer and a tester, and helps test code even before it is committed. Keep this tight feedback loop till an automated suite of tests is in place. I recommend keeping this close collaboration even after.

What helped us in the early days was sitting together in the same area. We had a tester on the team, who had deep knowledge about the product, and knew all the quirks to test for when doing regression tests. We didn’t have the communication overhead of waiting for someone outside of the team to do a task. I encouraged the team to talk to each other, and move away from using JIRA tickets as the primary communication channel.

Amplify the good things

Even when a team is in a chaotic state there are good practices. Learn to leverage these good practices as a building block for better things. They may not be perfect, but it’s a foundation to build on.

We were lucky to have a meticulous tester, who had built his own suite of tests, even when the developers did not have any reliable tests. The tester used to run his suite of tests after every release. We used these tests as the basis for our first automated regression test suite.

We converted the rudimentary tests into a simple suite of BDD style tests. The tests weren’t perfect, but these were the tests that gave us a little bit more confidence that our system wasn’t broken after a release.

Focus on learning

The practices above, should serve one purpose. To give enough slack time for a team to learn. This is where the real change towards Continuous Delivery maturity happens. Encourage pair programming early. Talk about books, show examples of how things can be done better.

It’s easy to fall into the trap of focussing on the code, but keep in mind that the code is an expression of the thought process of how the individuals on a team think.

Give the team space to experiment and support them even when experiments fail.


Starting with these small steps, almost a year later, we were able to deliver reliably. We didn’t fix all the problems in the code. It was still gnarly in places.

We had an automated regression test suite which covered all our key scenarios that ran on every commit. We were able to commit to master and have that change deployed to production within the hour and we rarely had a broken build.

Think of all the CD practices as a toolbox. You can’t have a CD pipeline from day one, nor should the teams focus should be on building the perfect CD pipeline. Focus on educating the team to have a quality and delivery focussed mindset.

Iterate on what you already have. The small initial steps can be force multipliers.

Here are a selection of books that have helped me.

Fearless Change : Patterns for introducing new ideas - Linda Rising and Mary Lynn Manns

The Five Dysfunctions of a Team - Patrick Lencioni

Creativity Inc.: Overcoming the Unseen Forces That Stand in the Way of True Inspiration - Ed Catmull

The Nature of Software Development: Keep It Simple, Make It Valuable, Build It Piece by Piece - Ron Jefferies

The Goal: A Process of Ongoing Improvement - Eliyahu M Goldratt

Getting started with Windows Containers

Earlier this year Microsoft announced Windows 2016 Server TP-5 with support for Windows Containers.

Windows Containers allows a server to act as  container host for containers that can be managed with tools like Docker.  However, a Windows container host can run only Windows containers, and not Linux containers.

I work on a Mac, and I want to use the Docker client on OSX to build Windows Containers. Here is what I went through to set up my environment to start playing with it.

Step 1

Build a virtual machine with the latest Windows 2016 Technical Preview (TP5 at the time of writing).

The usual way is to download, mount the iso in VirtualBox or VMware Fusion and install. After the installation, follow the quick start instructions to configure Windows Containers.

The quick start guide is at https://msdn.microsoft.com/en-us/virtualization/windowscontainers/quick_start/quick_start_configure_host

My preferred method is to create a Vagrant box, so that I can have a repeatable base build to experiment on.

Packer is the tool to build Vagrant boxes. There exists a Packer Windows project to build vagrant boxes. The main Packer Windows doesn’t contain templates for Windows Server 2016 yet. Stefan Scherer has been working on supporting it, and has built templates to provision container hosts as well. Clone the Packer templates from https://github.com/StefanScherer/packer-windows, and run

packer build windows_2016_docker.json

This tells packer to build a Vagrant box with Windows 2016 TP5 as a container host.

Once the box has been built, copy it to a place where it can be reused. My preferred place is a private Dropbox. You’ve now got a Vagrant box acting as a Windows Container Host, ready to experiment with.

Step 2

Spin up the Vagrant, Windows Container Host box by creating a Vagrant file

vagrant init -mf  windows_2016_docker <url to your vagrant box>

Start the Vagrant machine by running

vagrant up

Wait till the Vagrant machine starts up.

Step 3

Connect the Docker client to the Vagrant machine running the Windows Container Host.

When the Vagrant machine starts up, it will display the IP address of the Vagrant machine. Use this IP address to set the DOCKER_HOST environment variable to tcp://:2375

In my environment it’s done by running,

export DOCKER_HOST=tcp://
Then run  docker version and look at the output. It should be something like the following.
 Version: 1.11.1
 API version: 1.23
 Go version: go1.5.4
 Git commit: 5604cbe
 Built: Wed Apr 27 00:34:20 2016
 OS/Arch: darwin/amd64

 Version: 1.12.0-dev
 API version: 1.24
 Go version: go1.5.3
 Git commit: 2b97201
 Built: Tue Apr 26 23:51:36 2016
 OS/Arch: windows/amd64


The OS/Arch value should tell you that the Docker client on OSX, is connects to a Windows Host.

Step 4

Create a Dockerfile, with the following content.

FROM windowsservercore
RUN powershell.exe Install-WindowsFeature web-server

This creates a Windows Docker Container image, using Windows Server Core as the base image, installs IIS, and exposes port 80

Build the Dockerfile.

docker build -t iis .

And run the image with

docker run --name iisdemo -d -p 80:80 iis

That’s it. You now have a container running IIS. Visit http:// to see the familiar IIS start page.


You’ve now got an environment to start experimenting with Windows Containers and Docker. You can start writing Docker files for your Windows only applications, and even start  porting .Net services to run on Windows Containers.


Coney Island – Dead and Alive

View of Coney Island

It was my first trip across the Atlantic to the States and it was exciting to be in New York. The storms had passed, and the autumnal sunshine was in it’s peak. I wanted to capture the palette of New York, just as Saul Leiter would have. Yet, 3 days later on a quiet Halloween morning, I find myself on Coney Island. I see a different palette of colours. Bright yellows, and eye catching reds. Not the fast fleeting yellows of the NYC taxi cabs, but a more sauntering yellow, of a slow pace of life.


Coney Island is an anachronism. It seems to want to go back to the post World War II era, of it’s first decline, resisting any attempts to bring it into the modern age. These tensions between dying and living again, are seen, whilst sauntering along the boardwalk.


The old folks, from the neighbouring senior homes, greeting each other during their morning walks and the old man showing off on his classic Schwinn bike. Coney Island felt like a place that resisted youth, even when a gang of 6 year olds, dressed in Halloween outfits stormed Nathan’s Hotdogs.




All photos taken with a Fuji X100T