Torque for MechCloud Academy

The Great IaC Tradeoff: Authoring Experience vs API Synchronization

For any Infrastructure as Code tool vendor, the most critical choice is balancing the template authoring experience against keeping the tool in sync with the target provider APIs. You want users to type less and understand their templates easily, but you also need to support new cloud features the moment they ship. Achieving both at the same time is close to impossible: if you choose an authoring experience whose domain-specific language schema does not map one to one onto the underlying REST API request schema, you cannot achieve rapid synchronization.

This fundamental friction defines the current landscape of cloud automation. Engineering teams are constantly torn between writing clean code and accessing the latest features their cloud provider just released. The abstraction layer that is supposed to make life easier often becomes a bottleneck.

In this deep dive, we will explore exactly why this tradeoff exists and how major ecosystems like Microsoft Azure and Amazon Web Services attempt to solve it. We will also look at how a stateless approach to Infrastructure as Code offers a completely different path forward, one that eliminates these compromises.

The Core Dilemma of State and Synchronization

Let us first examine the root cause of this problem. When a cloud provider releases a new service, it exposes a set of REST API endpoints. These endpoints have their own JSON schemas, validation rules, and lifecycle behaviors. An Infrastructure as Code tool must translate a user-defined template into these exact API calls.

If the tool vendor decides to create a beautiful, highly abstracted template language, they must write custom mapping logic. This logic translates the simplified user input into the complex API payload. Every time the cloud provider changes the API, the vendor must manually update this mapping logic. This creates a massive maintenance burden and guarantees that the tool will always lag behind official API releases.

Conversely, if the tool vendor auto-generates its provider directly from the API specifications, it achieves immediate synchronization. However, the resulting template language is usually incredibly verbose and difficult for humans to read or write. The schema directly reflects the API payload, which often includes deeply nested objects and unintuitive property names.

There are several core challenges that arise from this dynamic.

  • Schema maintenance requires massive engineering effort from the open source community or the tool vendor to keep up with daily cloud provider updates.
  • Feature lag becomes a daily reality for platform engineering teams who want to use a newly announced cloud capability but find that their automation tool does not support it yet.
  • Complex validation rules are often undocumented by the cloud provider, which forces the automation tool to guess whether a change will result in a simple update or a destructive recreation of the resource.
  • Cognitive load increases for the developer who has to constantly reference both the cloud provider API documentation and the automation tool documentation to figure out how to configure a simple resource.

The Microsoft Azure Conundrum

We can see this struggle clearly when looking at how Terraform handles Microsoft Azure. The Terraform ecosystem attempts to solve this problem by offering two entirely different providers for the Azure Resource Manager API. These providers are known as AzureRM and AzAPI.

The AzureRM provider focuses heavily on the desired state authoring experience. It is hand coded and heavily abstracted to make the developer's life easier. To specify an instance size for a virtual machine in a standard Azure Resource Manager template, you need to go three levels deep into the configuration hierarchy: properties, then hardwareProfile, then vmSize. The Terraform AzureRM virtual machine resource type captures this at the root level with a simple attribute called vm_size. This clean authoring experience means very little typing for end users.
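
To make that concrete, here is a trimmed AzureRM sketch (most required arguments, such as network interfaces, OS disk, and OS profile, are omitted for brevity) showing the flattened attribute:

```hcl
# Trimmed AzureRM sketch: most required arguments are omitted
# to highlight the flattened attribute.
resource "azurerm_virtual_machine" "example" {
  name                = "demo-vm"
  location            = "eastus"
  resource_group_name = "demo-rg"

  # The nested ARM path properties.hardwareProfile.vmSize surfaces here
  # as a single root-level attribute.
  vm_size = "Standard_D2s_v3"
}
```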

However this comes with a massive downside. Terraform needs to maintain the mapping between its custom schema and the actual Azure API schema. Keeping this provider in sync with the latest Azure API changes close to a new feature launch is almost impossible due to the sheer volume of manual updates required.

The AzAPI provider takes the opposite approach. It is a thin layer on top of the Azure REST APIs and is all about defining resources using the exact API payload. It captures the REST API endpoint contract directly, so translating it into API invocation code is trivial. Azure uses the PUT method for both creating and updating a resource, which makes this mapping straightforward.
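
A minimal AzAPI sketch of the same virtual machine looks like this, assuming a recent azapi provider release that accepts body as a native HCL object (older releases required jsonencode); the API version string and the omitted payload sections are illustrative:

```hcl
# Trimmed AzAPI sketch: the body mirrors the raw ARM request payload,
# so the instance size sits nested under properties.hardwareProfile.
resource "azapi_resource" "example" {
  type      = "Microsoft.Compute/virtualMachines@2023-09-01" # illustrative API version
  name      = "demo-vm"
  location  = "eastus"
  parent_id = "/subscriptions/<subscription-id>/resourceGroups/demo-rg"

  body = {
    properties = {
      hardwareProfile = {
        vmSize = "Standard_D2s_v3"
      }
      # osProfile, storageProfile, networkProfile, and other required
      # sections of the raw payload are omitted for brevity.
    }
  }
}
```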

The AzAPI approach introduces severe validation challenges. The tool struggles to validate and calculate the deployment plan accurately. Figuring out if a change in desired state will result in a simple update or a destructive recreate becomes incredibly difficult because a property in Azure may be conditionally immutable.

There are distinct characteristics that define both approaches.

  • AzureRM abstracts complexity by managing API versions on your behalf and providing intuitive property names which reduces the need to consult external documentation.
  • AzureRM suffers from delayed feature support because every new Azure service requires community members to write new Go code to support it.
  • AzAPI gives you day zero access to all new Azure features and preview services because it dynamically maps directly to the underlying REST API without requiring hand coded updates.
  • AzAPI requires deep knowledge of the raw Azure JSON payload structures which makes writing the templates much more cumbersome and less readable for the average developer.

The Amazon Web Services Parallel

This problem is not unique to Microsoft Azure. We see the exact same architectural split in the Amazon Web Services ecosystem. Terraform maintains two distinct providers for AWS which are the classic AWS provider and the newer AWSCC provider.

The classic AWS provider has been around for over a decade and is almost entirely hand coded. It offers an incredible authoring experience with over a thousand meticulously crafted resources. When you want to create a storage bucket or a compute instance, the template schema is logical and well documented. But just like AzureRM, this provider suffers from the maintenance trap. When AWS announces a new service, developers often have to wait weeks or months for the community to write, test, and merge the code required to support that service in the classic provider.
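
For example, a simple storage bucket in the classic provider reads almost like documentation (a minimal sketch):

```hcl
# Classic hand-coded provider: short, intention-revealing attribute names.
resource "aws_s3_bucket" "logs" {
  bucket = "example-log-bucket"

  tags = {
    Environment = "dev"
  }
}
```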

To combat this, HashiCorp and AWS partnered to create the AWSCC provider. This provider is built on top of the AWS Cloud Control API, a standardized set of endpoints that AWS uses to expose all new services uniformly. The AWSCC provider is automatically generated from these Cloud Control API specifications.

This means the AWSCC provider achieves day zero support for new AWS features. The moment AWS updates the Cloud Control API, the Terraform AWSCC provider can manage that resource. But just like AzAPI, this speed comes at a high cost to the authoring experience.
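
A sketch of the same bucket through the auto-generated provider shows how the attribute names and shapes mirror the raw Cloud Control schema; the exact names shown here follow the AWS::S3::Bucket resource type and should be checked against the provider documentation:

```hcl
# Auto-generated provider: names and shapes mirror the AWS::S3::Bucket
# Cloud Control schema, e.g. tags as a list of key/value objects.
resource "awscc_s3_bucket" "logs" {
  bucket_name = "example-log-bucket"

  tags = [{
    key   = "Environment"
    value = "dev"
  }]
}
```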

There are several clear parallels between the AWS and Azure ecosystems regarding this tradeoff.

  • The classic AWS provider guarantees stability and provides a highly refined developer experience but leaves you waiting for the open source community to implement new features.
  • The AWSCC provider eliminates feature lag by auto generating its schema from the Cloud Control API but it forces you to write code that mirrors the raw AWS API payload.
  • Documentation is heavily fragmented because the auto generated providers usually lack the rich detailed examples found in the hand coded providers.
  • Users are forced to mix and match providers within the same project, which means they might use the classic provider for older resources and the Cloud Control provider for newly released services, as the configuration sketch after this list illustrates.
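
A single configuration can declare both providers side by side, roughly like this:

```hcl
# Both providers declared in one configuration: mature resources go through
# the classic provider, brand-new services through awscc.
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
    }
    awscc = {
      source = "hashicorp/awscc"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

provider "awscc" {
  region = "us-east-1"
}
```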

Why Traditional IaC Struggles with Validation

Regardless of whether you use AWS, Azure, or Google Cloud, validation of desired state remains a monumental challenge for traditional stateful Infrastructure as Code tools.

These tools rely on comparing your local code against a stored state file and then comparing that state file against the live cloud environment. To generate an accurate execution plan the tool needs to know exactly how the cloud provider will react to a specific API call.

This is incredibly difficult because cloud provider OpenAPI schemas and official documentation rarely capture all the necessary details. They often fail to document which parameters are truly mandatory or what the default values are for fields that are required for provisioning but not explicitly marked as mandatory.

Furthermore, the concept of conditional immutability plagues cloud APIs. A property might be updatable under certain conditions but immutable under others. If the automation tool does not have this specific logic hardcoded into its provider, it cannot accurately warn you whether a change will destroy and recreate your database or simply update a label.
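
Terraform itself offers only blunt safeguards here; one common pattern, regardless of provider, is to mark critical resources so that any plan proposing their replacement fails outright. The resource and identifiers below are purely illustrative:

```hcl
# prevent_destroy makes Terraform abort any plan that would destroy and
# recreate this resource, surfacing a surprise replacement before it happens.
resource "azurerm_mssql_database" "orders" {
  name      = "orders-db"
  server_id = "/subscriptions/<subscription-id>/resourceGroups/demo-rg/providers/Microsoft.Sql/servers/demo-sql"

  lifecycle {
    prevent_destroy = true
  }
}
```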

The Stateless IaC Revolution

This is exactly why I started looking for alternatives and discovered MechCloud. They made a deliberate choice to solve this fundamental tradeoff rather than forcing users to compromise. They decided to keep their platform in sync with target cloud provider APIs at all times without the massive maintenance overhead that plagues traditional hand coded providers.

What is the point of an automation tool that implements a cloud provider feature days or weeks after release? In a fast-paced DevOps environment, that delay is unacceptable. By focusing on a stateless IaC architecture, MechCloud approaches the problem from a completely different angle.

Their templates for Azure and AWS remain close to the API so they can guarantee immediate feature support. However, they have simplified a massive number of configuration elements to make sure you write less to express the desired state. You get the power of the raw API without the verbosity of an auto generated wrapper.

I recently tested the updated desired state editing experience on their stateless IaC page. It now beautifully matches the intuitive YAML editing experience you expect in a modern IDE.

[Screenshot: the MechCloud stateless IaC desired state editor]

The platform handles the heavy lifting of complex validation, so you do not have to fight with undocumented API constraints or state file sync issues.

By moving to a stateless model this approach unlocks several major advantages for platform engineering teams.

  • It eliminates the schema mapping burden which means the tool never falls behind the official cloud provider API releases.
  • It provides a refined authoring experience that feels natural and concise without requiring you to memorize deeply nested JSON structures.
  • It solves the complex validation challenges centrally so you can deploy with confidence without worrying about unexpected resource destruction.
  • It removes the need to juggle multiple providers for a single cloud platform which radically simplifies your project configuration and reduces cognitive load.

Conclusion

The future of DevOps relies on tools that remove friction rather than adding abstraction layers that require constant maintenance. You should be able to enjoy a platform that handles complex validation and planning logic for you without sacrificing immediate access to the latest cloud features.

The days of choosing between a good developer experience and day zero API synchronization are over. Stateless Infrastructure as Code proves that you can indeed have the best of both worlds.
