Gap analysis with Terraform

We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns – the ones we don't know we don't know.

– Donald Rumsfeld

It may seem obvious, but it is often overlooked, that in order to manage something – a cloud asset like a VM, a GKE cluster, or even a DNS entry – you must first know that it exists. Gap analysis is a technique for finding out what you have but don't manage.

In terraform, we are used to declaring resources as resource blocks, each of which has a type and a name by which terraform knows the resource.

resource "dns_a_record_set" "www" {
  zone = "example.com."
  name = "www"
  addresses = [
    "192.168.0.1",
    "192.168.0.2",
    "192.168.0.3",
  ]
  ttl = 300
}

In this trivial example, a DNS record set, terraform knows the resource by the address "dns_a_record_set.www", which is the concatenation of the terraform resource type and the name.

Once we have run terraform apply, this record exists in our DNS provider. Let us assume, for the sake of argument, that it is Google Cloud Platform (GCP) Cloud DNS. The DNS record will also have a name by which GCP knows it, a URL like: /projects/project/managedZones/managedZone

GCP allows us to list all these URLs in a project with the Cloud Asset API. In addition, the terraform GCP provider stores the URL of the cloud asset in the terraform state. We may thus write a small application which retrieves all the cloud assets from the terraform state and all the cloud assets from the Cloud Asset API, and which produces the set difference between the two. That set difference is the set of cloud assets which exist but are not managed by terraform! Now we know what we don't know, and with that knowledge we can write the missing terraform and import the resources into it.
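The core of such an application is trivially small. A minimal sketch, assuming both sides have already been fetched and normalized to the same naming scheme (the asset names below are illustrative; in practice one list would come from the Cloud Asset API and the other from the terraform state, e.g. via terraform show -json):

```python
def unmanaged_assets(cloud_assets, state_assets):
    """Return the assets that exist in the cloud but are absent from
    the terraform state -- i.e. the gap."""
    return sorted(set(cloud_assets) - set(state_assets))


# Illustrative inputs: asset names as the Cloud Asset API might report them.
cloud = [
    "//dns.googleapis.com/projects/p/managedZones/zone1",
    "//dns.googleapis.com/projects/p/managedZones/zone2",
]
state = [
    "//dns.googleapis.com/projects/p/managedZones/zone1",
]

# zone2 exists but is not in the state: it is the gap we want to report.
print(unmanaged_assets(cloud, state))
```

The only real work in practice is normalization: the two sources must agree on the exact string form of each asset name before the set difference is meaningful.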

There are a couple of subtle gotchas around this. One is that some assets should not be managed by terraform, so we need to whitelist those. For example, the nodes in a GKE nodepool are managed by the nodepool resource, which is what we should manage with terraform. The individual nodes should not be managed by terraform, as they are managed by the nodepool – in fact by the Managed Instance Group (MIG) which backs the nodepool – so we exclude those nodes from our report.
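The whitelist can be as simple as a list of regular expressions applied to asset names before the diff. A sketch; the patterns are illustrative assumptions (GKE nodes and their backing MIGs typically have names starting with "gke-", but the exact patterns depend on your naming conventions):

```python
import re

# Assets matching any of these patterns are excluded from the gap report
# because they are managed indirectly rather than by terraform directly.
WHITELIST = [
    r"//compute\.googleapis\.com/.*/instances/gke-",              # GKE nodes
    r"//compute\.googleapis\.com/.*/instanceGroupManagers/gke-",  # MIGs backing nodepools
]


def is_whitelisted(asset_name):
    """True if this asset should be excluded from the report."""
    return any(re.search(p, asset_name) for p in WHITELIST)


def filter_assets(assets):
    """Drop whitelisted assets before computing the set difference."""
    return [a for a in assets if not is_whitelisted(a)]
```

Keeping the whitelist as data rather than code makes it easy to review and extend as new indirectly-managed asset types turn up.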

Another is that you need to make sure you are paginating the output from the APIs correctly – otherwise you will silently miss every asset beyond the first page.
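Google-style list APIs paginate with a nextPageToken. A sketch of the collection loop; fetch_page here is a stand-in for the real API call, assumed to return a dict shaped like the Cloud Asset API's list responses:

```python
def list_all_assets(fetch_page):
    """Collect every asset across all pages.

    `fetch_page(page_token)` is assumed to return a dict like
    {"assets": [...], "nextPageToken": "..."} where the token is
    absent or empty on the final page.
    """
    assets, token = [], None
    while True:
        page = fetch_page(token)
        assets.extend(page.get("assets", []))
        token = page.get("nextPageToken")
        if not token:  # no token on the last page: we are done
            return assets
```

Forgetting this loop is an easy mistake to make, and it produces a gap report that looks plausible but is incomplete – the worst failure mode for a tool whose whole purpose is completeness.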

This technique is very powerful and can also be applied to tools like helm, ArgoCD, and GKE – anywhere you have IaC and an environment containing assets managed by a tool such as terraform or ArgoCD, you can diff what is declared against what actually exists.