Dan's Musings

That Does It. Ansible Wins.

Started writing this 2022-12-13T08:46:50-0700

Overview

Writing Ansible for infrastructure (i.e. using it instead of Terraform, not using it instead of Puppet) feels dirty. It feels like you're some script kiddie just kludging things together. This is in sharp contrast to Terraform, which has a much more elegant model for infrastructure management, supports a declarative style, and even supports dependency management to some extent via git module URLs. Ansible is also less visible than CloudFormation for AWS, which provides "Stacks" that are clearly visible, auditable, and GUI-friendly right from the AWS console.

But because of past and recent experiences, some of which I will share here, Ansible DESTROYS the competition for stability and refactorability over time. Compared to its ability to remain stable come storm or high water, Terraform is brittle and CloudFormation is absolutely unusable.

This is because Ansible doesn't maintain state. Yes, you heard me right: that "feature" that Terraform and CFT support has caused all my problems, ever.

CloudFormation is Bad

First, an attack on the more fully-featured foe, CloudFormation Template, aka CFT.

My previous company demanded that, because of CFT's visibility and auditability, we use CFT for our infrastructure, please. They also tried to sweeten the deal by providing a bunch of library CFT modules and asked us to use them so that our infrastructure was compliant.

I didn't particularly enjoy using CFT, but I like to play nice with other IT groups because I've been one of them and know how it feels to just try and get everyone on the same page.

So I played along. I did everything with CFT, and used their modules.

BUT THEN.

Some innocent contractor decided he didn't like the way that one of the modules was written. This module was the company's module for spinning up S3 buckets. It would spin up the bucket, apply tags, etc. This kid was just trying to do an innocent refactor, to clean things up. His only mistake was not clearly communicating the (later revealed to be breaking) change. But in his defence, it didn't look like it would break anything.

His refactor of the S3 bucket module meant that my entire stack with the S3 bucket in it was marked by the ever-cautious CFT system as in need of an "update" operation. But my stack also included an RDS database. RDS databases do not support the Update operation in CFT.

If I changed my code in any way, the Update operation would likewise be forced, so that wouldn't help.

It got to the point where I realized the only way to make CFT happy would be to delete the database and re-create. UGH.

Ansible would have handled this refactor cleanly. Why, then, could CFT not do so?

This is because CFT maintains state. You tell it "give me a database" and it makes one and then remembers that it made it. Then you say "I'mma update this stack" and CFT doesn't "get" that you're not updating the database, just that you are updating the stack. It fails early.

Ansible doesn't maintain state. You can only tell it, "Create a database named 'Roger'". Then, when the playbook runs, it asks, "Does an RDS database named 'Roger' exist?" If Roger doesn't exist, Ansible creates it. If Roger is already there, Ansible says "Cool, works as coded."
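Here is a rough sketch of that check-then-create loop in Python. It is a toy model, not how Ansible's RDS modules are actually implemented: the "cloud" is just a dict, and ensure_database is a name I made up for illustration.

```python
# Toy model: the "cloud" is a dict of resources keyed by name.
# ensure_database and cloud are hypothetical names, not an Ansible API.

def ensure_database(cloud: dict, name: str) -> str:
    """Create the database only if nothing with that name exists yet."""
    if name in cloud:
        return "ok"          # already there: nothing to do
    cloud[name] = {"type": "rds"}
    return "changed"         # created on this run


cloud = {}
first_run = ensure_database(cloud, "Roger")    # Roger is missing, so create it
second_run = ensure_database(cloud, "Roger")   # Roger exists, so do nothing

print(first_run, second_run)  # prints: changed ok
```

The only thing the function consults is the cloud itself. There is no second record of "what I created last time" that could drift out of sync with reality.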

Contrast this with CFT. If CFT is asked to create a database named "Roger", and one already exists outside of its state, it panics. This is almost never what the operator wants. The operator probably created Roger, but CFT always acts like some threat actor made it. That's why names exist -- to delineate things.
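For comparison, a sketch of the stateful behavior, again as a toy model rather than CloudFormation's real internals. The tool keeps its own set of names it has created, and stateful_create and AlreadyExistsError are hypothetical names for illustration.

```python
# Toy model of a stateful tool: it trusts its own state over the cloud.
# stateful_create and AlreadyExistsError are hypothetical names.

class AlreadyExistsError(Exception):
    """The resource exists in the cloud but not in the tool's state."""


def stateful_create(cloud: dict, state: set, name: str) -> str:
    if name in state:
        return "ok"  # the tool remembers creating this one
    if name in cloud:
        # Exists for real, but the tool has no record of it, so it refuses.
        raise AlreadyExistsError(f"{name} already exists")
    cloud[name] = {"type": "rds"}
    state.add(name)
    return "changed"


cloud = {"Roger": {"type": "rds"}}  # the operator made Roger by hand
state = set()                        # the tool has never heard of Roger

try:
    outcome = stateful_create(cloud, state, "Roger")
except AlreadyExistsError as err:
    outcome = f"failed: {err}"

print(outcome)  # prints: failed: Roger already exists
```

Same cloud, same request, but the extra bookkeeping turns "Roger is already there" from a success into an error.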

Terraform Is Only Slightly Better

Imagine my relief when, upon switching from my previous employer to my current one, the chosen IaC tool for my team was Terraform. "Finally!", I thought. "A sane tool."

Alas.

Yes, Terraform is designed well. Yes, the git-repo-as-modules thing is awesome, the orthogonal design between the plugins of TF and TF itself is a work of art, etc. etc. But Terraform made a big mistake: it decided to support state.

As I found out, it therefore suffers the same problems with refactors as CFT. I would find myself deleting and re-creating swaths of big infrastructure just so that I could satisfy Terraform plans after making refactor changes to the code.

Yes, Terraform is better, because it lets you change the state using the terraform state command and friends. However, this means the operator must manually edit, either through terraform state commands or in an actual text editor, a state file that was generated by a machine. Gross.
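To make the chore concrete, here is a toy Python model of what a state move (the job that terraform state mv does) amounts to: renaming an address in the tool's records while the real resource stays put. The state layout, the resource addresses, and the module name "db" are all invented for illustration, and real Terraform state files hold far more than this.

```python
# Toy model of moving an address inside a state file.
# state_mv, the addresses, and the module name "db" are hypothetical.

def state_mv(state: dict, old: str, new: str) -> None:
    """Rename an address in the state; the real resource is untouched."""
    if old not in state:
        raise KeyError(f"no such address in state: {old}")
    if new in state:
        raise KeyError(f"address already in state: {new}")
    state[new] = state.pop(old)


# Address -> real resource ID, as recorded before the refactor.
state = {"aws_db_instance.main": "roger"}
# After the refactor, the code declares the same database inside a module.
code_addresses = {"module.db.aws_db_instance.main"}

out_of_sync = set(state) != code_addresses   # state still has the old address
state_mv(state, "aws_db_instance.main", "module.db.aws_db_instance.main")
in_sync = set(state) == code_addresses       # same database, new address

print(out_of_sync, in_sync)  # prints: True True
```

One move is easy. A real refactor can rename dozens of addresses, and every one of them needs its own hand-written move.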

It got to the point where I knew about the state commands, but I still preferred to delete real infrastructure rather than fix things. I also tried pulling tricks like editing modules in non-standard ways, copying and pasting things everywhere trying to make Terraform happy during a refactor, and I still failed to avoid deleting and re-creating infrastructure. It's too easy to do it; it's the default.

Stateless IaC Is Better

Ansible playbooks are much easier to refactor. Going back to the CFT example, if the contractor had refactored the S3 bucket module and everything had been in Ansible, I would not have noticed.

The stateful IaC tools are brittle, because in order to refactor your code, you also have to refactor the state to reflect it. Sure, the infrastructure requirements remained the same, but Terraform doesn't know that. It just sees a bunch of infra addresses in its state that are no longer in the code, and a bunch of new ones in the code that weren't there before.
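A small Python sketch of why the refactor looks like destruction to a stateful tool. This is a simplified model under my own assumptions, not Terraform's actual planner: plan_by_address diffs the addresses recorded in state against the addresses in code, while missing_by_name does the name-based check a stateless tool would do. All function names, addresses, and resource names are made up.

```python
# Toy comparison: diffing by code address versus checking by real name.
# plan_by_address, missing_by_name, and all addresses are hypothetical.

def plan_by_address(state: dict, config: dict) -> dict:
    """Stateful view: anything whose address changed is destroyed and re-created."""
    return {
        "destroy": sorted(set(state) - set(config)),
        "create": sorted(set(config) - set(state)),
    }


def missing_by_name(existing_names: set, config: dict) -> list:
    """Stateless view: only create what does not exist under its real name."""
    return sorted(set(config.values()) - existing_names)


# Address -> real resource name, as recorded before the refactor.
state = {
    "aws_s3_bucket.logs": "my-logs",
    "aws_db_instance.main": "roger",
}
# After the refactor: same real resources, now declared inside modules.
config = {
    "module.storage.aws_s3_bucket.logs": "my-logs",
    "module.db.aws_db_instance.main": "roger",
}

stateful_plan = plan_by_address(state, config)
stateless_todo = missing_by_name(set(state.values()), config)

print(stateful_plan["destroy"])  # ['aws_db_instance.main', 'aws_s3_bucket.logs']
print(stateful_plan["create"])   # ['module.db.aws_db_instance.main', 'module.storage.aws_s3_bucket.logs']
print(stateless_todo)            # []
```

The real names never changed, so the name-based check has nothing to do. The address-based diff wants to tear down the database and the bucket and build them again.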

At least Terraform lets you mess around with state via the import and state commands. That is, if you're willing to go through the pain. CFT doesn't let you change stacks on-the-fly, at least not to my knowledge.

Ansible feels quick and dirty, like there should be a better tool for this. But a shiny axe head and an ergonomic handle do not a better axe make, if the steel in the axe head is of poor quality. Like the Millennium Falcon, Ansible doesn't look/feel like much, but "she's got it where it counts".