60 min · 5 chapters · Advanced · HashiCorp Terraform

Terraform for MLOps Engineers: Provisioning and Managing ML Infrastructure as Code

Senior ML/MLOps engineers who manage cloud infrastructure and want to enforce reproducibility, auditability, and safe change management on GPU clusters, model-serving endpoints, and data pipeline resources using Terraform 1.15.

What you'll learn
  • Write and apply HCL configurations using the terraform init/plan/apply/destroy cycle
  • Configure remote state backends with locking for safe multi-engineer collaboration
  • Author reusable Terraform modules for repeatable ML infrastructure components
  • Wire Terraform into a CI/CD pipeline with plan-then-approve gating on pull requests
  • Detect, evaluate, and remediate infrastructure drift in a live ML environment
Chapters in this course
  1. Writing and Applying Your First HCL Configuration with Terraform 1.15 (12 min)
  2. Managing Shared Remote State with Locking for Multi-Engineer ML Teams (13 min)
  3. Building Reusable Modules for ML Infrastructure Components (12 min)
  4. Integrating Terraform into a CI/CD Review Workflow with Plan-Then-Approve Gates (10 min)
  5. Detecting and Remediating Infrastructure Drift in a Live ML Environment (13 min)
Chapter 1 · 12 min

Writing and Applying Your First HCL Configuration with Terraform 1.15


Two Primitives: Arguments and Blocks

HCL has exactly two syntax constructs. Once you recognize them, every Terraform configuration becomes predictable.

An argument assigns a value to a name:

region = "us-central1"

A block is a named container for related configuration:

resource "google_storage_bucket" "training_data" {
  name          = "my-ml-project-training-data-2026"
  location      = "US"
  force_destroy = true
}

Every top-level construct in Terraform — resource, provider, terraform, variable — is a specific block type. The resource block requires exactly two labels: the resource type (google_storage_bucket) and a local name (training_data). The local name exists only inside Terraform; it never appears in your cloud provider's API. To reference this resource elsewhere in the configuration, write google_storage_bucket.training_data.name — resource type, then local name, then the attribute.
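
As a minimal sketch of that reference syntax, the fragment below reads the bucket's name attribute in two places. The output name training_bucket_name and the second bucket training_logs are hypothetical and not part of the course configuration.

```hcl
# Hypothetical output: exposes the bucket name after apply.
output "training_bucket_name" {
  value = google_storage_bucket.training_data.name
}

# Hypothetical second bucket whose name is derived from the first.
# The reference also tells Terraform to create training_data first.
resource "google_storage_bucket" "training_logs" {
  name     = "${google_storage_bucket.training_data.name}-logs"
  location = "US"
}
```

Because the local name training_data never reaches the cloud API, renaming it changes only how the rest of the configuration refers to the bucket.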

Identifiers can contain letters, digits, underscores, and hyphens, and the first character must not be a digit. Files use the .tf extension, and Terraform reads all .tf files in the working directory as a single, unified configuration. According to the Terraform Configuration Syntax docs, both # and // are valid single-line comment delimiters (/* */ delimits multi-line comments), but # is the default style and formatting tools may convert // to #, so use # from the start.

Knowledge check (1 of 1)
A teammate writes 'location = US' without quotes in their .tf file. What kind of syntax construct is this, and what is wrong with it?

Provider Block and Version Constraints

The terraform block sits at the top of your configuration and declares two things: the minimum Terraform core version your code requires, and every external provider it depends on.

```hcl
terraform {
  required_version = "~> 1.15"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 6.0"
    }
  }
}

provider "google" {
  project = "travel-ml-prod-314159"
  region  = "asia-south1"
}
```

The source field uses a three-part registry address: hostname/namespace/type. The hostname defaults to registry.terraform.io, so hashicorp/google expands to registry.terraform.io/hashicorp/google. The Google Cloud provider has accumulated over 600 million cumulative downloads, reflecting its weight in ML and data-engineering stacks.

The ~> operator is the pessimistic constraint and the most common choice for root modules. ~> 6.0 allows any 6.x minor version but blocks 7.0. ~> 6.0.1 is stricter: it permits 6.0.2 and 6.0.3 but rejects 6.1.0. Pick ~> 6.0 when you want new minor releases and their security patches automatically; pick ~> 6.0.1 when you need patch-level stability for a sensitive production pipeline. Seven constraint operators are available (=, !=, >, >=, <, <=, ~>); for reusable modules shared across teams, >= 6.0, < 7.0 provides explicit upper and lower bounds without pessimistic shorthand. Source: Terraform Version Constraints.
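
A possible shape for the explicit-bounds style inside a shared module is sketched below. The file name versions.tf is a common convention rather than a requirement, and the core-version floor shown here is an assumption for illustration.

```hcl
# versions.tf inside a shared module (sketch).
terraform {
  required_version = ">= 1.15" # assumption: the module needs 1.15 or newer

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 6.0, < 7.0" # same range as "~> 6.0", spelled out
    }
  }
}
```

Spelling out both bounds makes the accepted range visible to reviewers who are less familiar with the pessimistic operator.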

The Core Workflow: init → plan → apply → destroy

Terraform's four-command cycle is deliberate. Each step is safe to re-run and progressively more consequential.

`terraform init` downloads provider binaries into .terraform/ and writes .terraform.lock.hcl. It is idempotent — running it again never deletes your configuration or state. Re-run it after any change to required_providers or after cloning the repository.

`terraform plan` reads current state, compares it to your configuration, and proposes change actions. No cloud resources are created or modified. Treat it as a mandatory draft review: run terraform plan -out=tfplan, read the proposed actions, and then run terraform apply tfplan so that exactly the reviewed plan is applied. Applying a saved plan file skips the confirmation prompt, because the review has already happened.

`terraform apply`, run without a saved plan file, creates a fresh plan, presents it, and prompts for confirmation. After you type yes, Terraform creates, updates, or destroys resources and writes the result to terraform.tfstate. The state file maps every resource block to its real cloud object; never edit it by hand.

`terraform destroy` is a convenience alias for terraform apply -destroy. It runs the same apply engine in destroy-planning mode. Always preview with terraform plan -destroy first to confirm exactly which resources will be removed.

<Callout type="warning"> The -auto-approve flag skips the interactive confirmation prompt on apply and destroy. Only use it in locked CI environments where no out-of-band infrastructure changes are possible. Using -auto-approve locally against a production GCS bucket holding 50 GB of training data is a fast path to an unrecoverable data loss event. </Callout>

After terraform destroy completes, verify removal in two places: the cloud console (the bucket should be absent from the GCS browser) and terraform.tfstate (the "resources" array should be empty). If state still lists the resource but the console confirms deletion, you have encountered drift — covered in 05-drift-detection-remediation.

Reading Terraform Plan Output

Four symbols appear in plan output, and misreading them is the most common source of production surprises:

| Symbol | Meaning | What actually happens |
|--------|---------|-----------------------|
| + | Create | New resource provisioned |
| - | Destroy | Existing resource deleted |
| ~ | Update in-place | Attributes changed, same cloud object |
| -/+ | Replace | Destroy old, create new (replacement) |

-/+ demands the closest scrutiny. It appears when an argument that cannot be changed in place is modified, for example a GCS bucket's name. Because GCS bucket names are globally unique and immutable once created, Terraform must delete the old bucket and create a new one. For an ML team storing training events, that means data loss unless the objects were copied elsewhere first: with force_destroy = true, Terraform deletes the bucket's contents along with the bucket, and without it the destroy step fails on a non-empty bucket. For a Vertex AI training instance, -/+ means any in-flight training jobs are terminated.
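
One guard worth considering for buckets that hold irreplaceable data is Terraform's prevent_destroy lifecycle argument. The sketch below applies it to the chapter's bucket; whether it suits your workflow is an assumption, since it also blocks intentional teardown until the argument is removed.

```hcl
resource "google_storage_bucket" "training_data" {
  name     = "my-ml-project-training-data-2026"
  location = "US"

  # Assumption: this bucket holds data you cannot regenerate.
  # With prevent_destroy set, any plan that would destroy or replace
  # the bucket fails with an error instead of proceeding.
  lifecycle {
    prevent_destroy = true
  }
}
```

With this in place, a rename that would otherwise show up as -/+ stops at plan time, which forces an explicit decision about the data.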

A plan ending with Plan: 0 to add, 0 to change, 0 to destroy is Terraform's confirmation that your running infrastructure exactly matches your configuration — the "clean slate" signal. According to the terraform plan CLI Reference, plan performs a three-step read-compare-propose: it reads current remote object state, compares it to your configuration, and proposes the minimal set of changes needed to converge them.

Locking Versions with .terraform.lock.hcl

After terraform init, examine the generated lock file:

provider "registry.terraform.io/hashicorp/google" {
  version     = "6.14.1"
  constraints = "~> 6.0"
  hashes = [
    "h1:ABC123xyz...",
    "zh:DEF456uvw...",
  ]
}

version records the exact provider binary installed — 6.14.1 — selected within the ~> 6.0 range. constraints echoes the range from required_providers (informational only; Terraform enforces version). The h1: hash checksums the package contents and is the current preferred scheme; zh: is a legacy archive checksum retained for compatibility with older tooling.

Commit `.terraform.lock.hcl` to Git. Add `.terraform/` to `.gitignore`. Provider binaries reach 100 MB or more and are re-downloaded by terraform init using the versions and hashes recorded in the lock file. Without the lock file on a fresh clone, init queries the Terraform Registry and may silently select a newer release that still satisfies the constraint, diverging from what your team tested. Standard Terraform .gitignore templates correctly exclude .terraform/, but some older templates also exclude *.lock.hcl by mistake, so verify yours. Source: Terraform Dependency Lock File (.terraform.lock.hcl).


Hands-On Exercise

Goal: Provision a GCS training-data bucket and walk the complete init → plan → apply → destroy lifecycle.

  1. Create main.tf with the configuration from the worked example in this chapter (substitute your GCP project ID).
  2. Run terraform init. Confirm .terraform.lock.hcl was created and lists hashicorp/google with an exact version.
  3. Run terraform plan. Verify exactly one + resource in the output and note the planned bucket name.
  4. Run terraform apply. At the prompt, type yes. Confirm "Apply complete! Resources: 1 added, 0 changed, 0 destroyed."
  5. Open the GCS console — the bucket must be visible with the name from step 3.
  6. Run terraform plan -destroy. Read the - symbol on the bucket entry and confirm no other resources appear.
  7. Run terraform destroy. Type yes. Confirm "Destroy complete! Resources: 1 destroyed."
  8. Inspect terraform.tfstate in a text editor — the "resources" array must be empty ([]). Cross-check in the GCS console: the bucket must be absent.

Success criteria: Plan output shows 1 to add, 0 to change, 0 to destroy before apply; apply completes with 1 added; terraform.tfstate lists zero resources after destroy; GCS console shows no bucket under your project.

Your first configuration is running end-to-end — next, learn how to share state safely across a multi-engineer ML team. 02-remote-state-locking-ml-teams

Chapter 1 check
1 / 5
Which two syntax constructs form the complete basis of HCL?
Chapter 2 · 13 min

Managing Shared Remote State with Locking for Multi-Engineer ML Teams


Why Local State Breaks Multi-Engineer ML Teams

When three engineers share a local terraform.tfstate committed to git, concurrent applies are a silent data race. Priya commits a lifecycle-policy change to the model-artifact S3 bucket; Karan applies a cluster scale-out five minutes later from a stale local copy. Karan's apply completes last and overwrites Priya's write — the lifecycle change is gone, and stale model artifacts start accumulating. No error, no warning.

Local state has two structural problems: no locking and no single source of truth. A remote backend solves both. The state file lives in a shared object store (S3 or GCS), and every write operation acquires a lock before touching it.

Remote Backend Configuration: S3 and GCS

AWS S3 backend. Terraform 1.10 and later support S3 native locking via use_lockfile = true, which creates a .tflock object next to the state file in the same bucket using S3 conditional writes, so no DynamoDB table is required. The older DynamoDB locking approach is deprecated and scheduled for removal in a future minor version.

terraform {
  backend "s3" {
    bucket       = "ml-infra-tfstate-prod"
    key          = "training-cluster/terraform.tfstate"
    region       = "ap-south-1"
    use_lockfile = true
    encrypt      = true
  }
}

encrypt = true enables AES-256 server-side encryption. Per Gruntwork's state management guide, S3 object durability is 99.999999999% — state file loss is not a realistic concern; corruption from concurrent writes is, which is exactly what locking prevents.

One critical constraint: backend blocks cannot reference Terraform variables. Writing bucket = var.state_bucket_name causes a parse error at init time because backend configuration is evaluated before variable processing. Use literal strings, or supply values via terraform init -backend-config="bucket=ml-infra-tfstate-prod" (partial configuration).
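
A minimal sketch of partial configuration follows, split across two files. The file name prod.s3.tfbackend is hypothetical; any file of key = value pairs passed to -backend-config should work the same way.

```hcl
# main.tf: the backend block omits the environment-specific settings.
terraform {
  backend "s3" {
    key          = "training-cluster/terraform.tfstate"
    use_lockfile = true
    encrypt      = true
  }
}
```

```hcl
# prod.s3.tfbackend (hypothetical file name), loaded with:
#   terraform init -backend-config=prod.s3.tfbackend
bucket = "ml-infra-tfstate-prod"
region = "ap-south-1"
```

Keeping the bucket and region out of main.tf lets the same root module initialise against different state buckets without editing committed code.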

GCS backend. GCS makes locking even simpler — it is enabled automatically:

terraform {
  backend "gcs" {
    bucket = "ml-infra-tfstate-prod"
    prefix = "terraform/training-cluster"
  }
}

GCS creates .tflock files alongside each state file with no additional configuration. No external service is required. Enable object versioning on the bucket for state recovery. Per the GCS backend docs, workspace states are stored at <prefix>/<workspace_name>.tfstate.
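
The versioning recommendation could look like the sketch below, kept in a separate bootstrap configuration rather than in the configuration that uses the bucket as its backend. The location and the uniform access setting are assumptions for illustration.

```hcl
# Bootstrap sketch for the state bucket.
resource "google_storage_bucket" "tfstate" {
  name     = "ml-infra-tfstate-prod"
  location = "ASIA-SOUTH1" # assumption: co-located with the workloads

  uniform_bucket_level_access = true

  versioning {
    enabled = true # keeps earlier state generations for recovery
  }
}
```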

Knowledge check (1 of 1)
On GCS, what enables Terraform state locking?

Migrating Existing State to a Remote Backend

Adding a backend block to an existing config requires a one-time migration. The terraform init -migrate-state command reads state from the current backend (local file or previous remote) and writes it to the new destination. Terraform prompts for confirmation before copying.

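```bash
# 1. Add the backend block to main.tf
terraform init -migrate-state   # 2. Copy existing state to the new backend; confirm when prompted
```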

The counterpart flag is -reconfigure, which ignores any existing backend configuration and skips state migration, configuring the backend as if starting fresh. Never use -reconfigure on an existing workspace: the old state is left behind in its previous location, so your infrastructure is still live but Terraform no longer knows about it, and the next terraform plan shows everything as new creates. Per the init command reference, -migrate-state is the safe path; -reconfigure is for genuinely blank-slate setups only.

One bootstrapping trap: the S3 bucket must exist before init -migrate-state can use it as a backend. Don't declare the bucket resource in the same config that uses it as a backend — create it with a separate bootstrap config that uses local state, then add the backend block to the main config.
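
A bootstrap configuration along these lines might look like the following sketch. The bootstrap/ directory name is hypothetical, and the configuration deliberately has no backend block so that it keeps its own state locally.

```hcl
# bootstrap/main.tf (hypothetical path): applied once with local state.
provider "aws" {
  region = "ap-south-1"
}

resource "aws_s3_bucket" "tfstate" {
  bucket = "ml-infra-tfstate-prod"
}

# Versioning lets you roll back to an earlier state object if needed.
resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  versioning_configuration {
    status = "Enabled"
  }
}
```

Once this has been applied, the main configuration can declare the S3 backend and run terraform init -migrate-state against a bucket that already exists.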

How State Locking Works in Practice

When Karan runs terraform apply, Terraform acquires a lock on the state file before writing a single byte. If Priya attempts a concurrent apply, she sees:

```
Error: Error acquiring the state lock

Lock Info:
  ID:        0071b31e-4d15-17dd-78b2-d24f117a2c35
  Operation: OperationTypeApply
  Who:       karan@ml-team (terraform 1.15.0 on linux_amd64)
  Created:   2026-06-11T07:42:10.123Z
```

Locking is automatic for every operation that could write state, including plan, apply, and destroy. Priya's only action is to wait; when Karan's apply completes, the lock releases. Per the state locking docs, the six metadata fields (ID, Operation, Who, Version, Created, Path) give enough context to determine whether a lock is stale. terraform force-unlock <LOCK_ID> manually releases a stuck lock; use it only after confirming the holding process has genuinely terminated (check CI logs, confirm with the team). Running force-unlock while the original apply is still in progress lets a second writer in and can corrupt state.

<Callout type="warning"> Force-unlock is not a shortcut for impatience. Verify the holding process is dead using the Who and Created fields before releasing. A live apply that loses its lock mid-write produces partial state. </Callout>

Workspace vs. Directory-per-Environment Isolation

CLI workspaces create named state isolation units within a single Terraform directory. The team creates dev, staging, and prod workspaces with terraform workspace new <name> and switches between them with terraform workspace select. Inside HCL, terraform.workspace returns the active name, enabling environment-specific sizing:

instance_type = terraform.workspace == "prod" ? "p4d.24xlarge" : "g4dn.xlarge"
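
When more than two environments need distinct sizes, the ternary gets hard to read. One alternative, sketched here with a hypothetical sizing table, is a map keyed by workspace name with a fallback for unlisted workspaces.

```hcl
locals {
  # Hypothetical sizing table keyed by workspace name.
  instance_types = {
    dev     = "g4dn.xlarge"
    staging = "g4dn.xlarge"
    prod    = "p4d.24xlarge"
  }

  # Falls back to the smallest size for any unlisted workspace.
  instance_type = lookup(local.instance_types, terraform.workspace, "g4dn.xlarge")
}
```

Resources would then set instance_type = local.instance_type instead of repeating the conditional.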

The hard limit: all workspaces in a directory share the same backend configuration and the same IAM credentials. Selecting workspace dev does not prevent an accidental apply from reaching production infrastructure if both workspaces share an IAM role with prod write access. Per the Managing Workspaces docs and Gruntwork's analysis, this makes CLI workspaces unsuitable for credential-isolated environments.

The directory-per-environment pattern solves this. Each environment gets its own root module directory (envs/dev/, envs/staging/, envs/prod/) with an independent backend {} block pointing to a separate IAM role. A misconfigured dev/ config is bounded to dev; it cannot reach prod. The trade-off is moderate code duplication — shared child modules address that (covered in 03-reusable-modules-ml-infrastructure.md).
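
A prod root module under this pattern might look like the sketch below. The role ARN, the module path, and the module's single input are all hypothetical, and the role is attached through the provider's assume_role block here as one possible way to scope credentials per directory.

```hcl
# envs/prod/main.tf: a root module with its own backend and credentials.
terraform {
  backend "s3" {
    bucket       = "ml-infra-tfstate-prod"
    key          = "training-cluster/terraform.tfstate"
    region       = "ap-south-1"
    use_lockfile = true
    encrypt      = true
  }
}

provider "aws" {
  region = "ap-south-1"

  # Hypothetical role ARN: only the prod directory assumes the prod role.
  assume_role {
    role_arn = "arn:aws:iam::111111111111:role/terraform-prod"
  }
}

# Shared child module keeps duplication low (hypothetical path and input).
module "training_cluster" {
  source      = "../../modules/training-cluster"
  environment = "prod"
}
```

The dev and staging directories would repeat this file with their own bucket, key, and role, which is the duplication the shared module keeps small.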

Use workspaces for short-lived feature branches and ephemeral test environments. Use directories for long-lived, credential-isolated ML environments.

Knowledge check (1 of 1)
Which is the strongest reason to use directory-per-environment instead of CLI workspaces for ML prod/staging/dev?

Detecting and Recovering from State Drift

State drift occurs when infrastructure changes outside Terraform — an engineer scales a training cluster from the AWS console, a lifecycle policy expires a model artifact bucket, or an autoscaler adjusts instance counts. The state file reflects the last apply; live infrastructure has moved.

The modern drift-detection workflow replaces the deprecated terraform refresh command:

```bash
# Safe probe: reads live infra, shows drift, changes nothing
terraform plan -refresh-only
```

plan -refresh-only is non-destructive and safe to run at any cadence. The standalone terraform refresh command is deprecated: it auto-approves state updates without a confirmation prompt. Per the refresh command docs, the replacement is terraform apply -refresh-only, which prompts before writing updated state.

After reviewing drift, choose one of two paths. Accept the drift (the change was intentional): run terraform apply -refresh-only to update the state file to match live reality. Reject the drift (the change was wrong): run terraform apply to re-enforce the configuration, overwriting the manual change. Handling unmanaged resources and import-based remediation is covered in 05-drift-detection-remediation.md.


Hands-on exercise: migrate local state and simulate a lock conflict

  1. Create an S3 bucket with object versioning using a separate bootstrap config with local state.
  2. Add an S3 backend block (use_lockfile = true) to your main ML config.
  3. Run terraform init -migrate-state and confirm the state file appears in S3.
  4. In two terminal sessions, run terraform apply simultaneously against the remote backend. Observe the lock error in the second session — note the Who and ID fields.
  5. After the first apply completes, run terraform plan -refresh-only to confirm no drift.

Success criteria: S3 contains terraform.tfstate; a .tflock file is visible in S3 during the active apply; the second session shows a lock error with a valid Who and Lock ID; plan -refresh-only reports no changes.

Next: learn how to extract reusable infrastructure patterns into parameterized child modules — 03-reusable-modules-ml-infrastructure.md.

Chapter 3 · 12 min

Building Reusable Modules for ML Infrastructure Components


The Case for Modules

Your fare-prediction ML pipeline runs in three environments — dev, staging, and prod. Without a module, you maintain three near-identical copies of a SageMaker training configuration, roughly 200 HCL lines each. After every AWS API change, someone updates one copy and forgets the others. A module collapses those 600 lines into three short module blocks and three .tfvars files, each holding only what differs by environment.

A Terraform module is any directory containing .tf files. The directory where you run terraform apply is the root module; every other directory you call from it is a child module. Child modules expose a clean contract — inputs in, outputs out — so the root never needs to know how the resources inside are implemented.

The Four-File Module Layout

The Standard Module Structure page in the Terraform Language docs recommends main.tf, variables.tf, and outputs.tf as the minimal module; this course adds locals.tf by convention, giving four files:

modules/fare-pred-training/
├── main.tf        # resource definitions
├── variables.tf   # all input variable declarations
├── locals.tf      # shared name prefixes and tag maps
└── outputs.tf     # everything the caller can read

Add iam.tf as a fifth file when IAM resources exceed 150 lines — a threshold from AWS Prescriptive Guidance that keeps privilege-boundary resources in one reviewable file, separate from compute.

Do not add a providers.tf or backend.tf inside a shared module. Provider and backend configuration belongs in the root module only — provider pinning syntax is covered in 01-hcl-configuration-core-workflow and backend setup in 02-remote-state-locking-ml-teams; placing either inside a child module forces a specific region, credential profile, or state location onto every caller. A versions.tf that declares only required_providers version constraints is the sole acceptable provider-related file in a shared module.

Authoring Input Variables

Every knob the caller needs to turn is a variable block in variables.tf. Terraform does not require the type argument, but treat it as mandatory in a shared module: it lets terraform validate catch type mismatches before a plan touches your cloud account.

```hcl
variable "environment" {
  type        = string
  description = "Deployment tier: dev | staging | prod."

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be dev, staging, or prod."
  }
}

variable "instance_type" {
  type    = string
  default = "ml.m5.xlarge"
}
```

A variable without default is required — Terraform errors rather than guessing. A variable with default is optional. Use required variables for values that differ meaningfully across environments (environment, training_image_uri). Use optional variables for values safe to share most of the time: instance_type = "ml.m5.xlarge" serves dev and staging; prod overrides it via .tfvars. The validation block runs at plan time and surfaces a targeted error message instead of a cryptic downstream provider error, per Input Variables — Terraform Language.
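
The remaining inputs this chapter refers to (project_name, training_image_uri, and tags) could be declared along the following lines. The descriptions and the empty-map default for tags are assumptions for illustration.

```hcl
# Required: no default, so every caller must supply it.
variable "project_name" {
  type        = string
  description = "Short project identifier used in resource names."
}

# Required: differs per environment, so no safe default exists.
variable "training_image_uri" {
  type        = string
  description = "Container image URI for the training job."
}

# Optional: an empty map is a safe default for extra tags.
variable "tags" {
  type        = map(string)
  description = "Additional tags merged into the module's common tags."
  default     = {}
}
```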

Knowledge check (1 of 1)
A `variable` block declared without a `default` argument is:

Local Values — The Internal DRY Layer

Local values are the module's internal shorthand — defined once, referenced many times, and never overridable by the caller.

```hcl
locals {
  name_prefix = "${var.project_name}-${var.environment}"

  common_tags = merge(
    {
      Project     = var.project_name
      Environment = var.environment
      ManagedBy   = "terraform"
    },
    var.tags
  )
}
```

local.name_prefix stamps every resource name consistently — SageMaker job, IAM role, CloudWatch log group — without repeating the concatenation. local.common_tags builds the full tag map once; every resource calls tags = local.common_tags.

The distinction from input variables is fundamental: if a value should vary across instantiations, use a variable. If a value is always derived from other expressions inside the module, use a local. The most common mistake is hard-coding an environment-specific value in locals.tf and then being unable to override it without touching the module source. Per Local Values — Terraform Language, locals can reference resource attributes, data sources, and functions — not just variables.
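
To illustrate a local derived from a data source rather than from caller input, the sketch below builds a bucket name from the current AWS account ID. The local artifact_bucket_name is hypothetical and is not used elsewhere in this course.

```hcl
# Looks up the account that the provider credentials belong to.
data "aws_caller_identity" "current" {}

locals {
  # Hypothetical bucket name derived from the name prefix and account ID.
  artifact_bucket_name = "${local.name_prefix}-${data.aws_caller_identity.current.account_id}-artifacts"
}
```

The caller cannot override this value, which is the point: it is always computed from facts the module already has.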

Declaring and Consuming Module Outputs

Outputs expose module internals to the caller via output blocks in outputs.tf:

```hcl
output "sagemaker_role_arn" {
  description = "ARN of the IAM role used by SageMaker training jobs."
  value       = aws_iam_role.sagemaker.arn
}

output "training_job_name_prefix" {
  description = "Consistent name prefix applied to all training jobs in this module."
  value       = local.name_prefix
}
```

The calling root reads outputs as module.training.sagemaker_role_arn. This reference also registers an implicit dependency — Terraform knows that a CloudWatch log group using the ARN must wait for the IAM role to exist. Best Practices for Reusable Terraform Modules — Google Cloud states explicitly: every resource in a shared module should have at least one output so callers can declare dependencies without resorting to depends_on hacks.
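
On the root side, consuming these outputs might look like the sketch below. The root output name, the log group name pattern, and the tag key are hypothetical.

```hcl
# Re-export the role ARN from the root module (hypothetical output name).
output "training_role_arn" {
  description = "Role ARN re-exported from the training module."
  value       = module.training.sagemaker_role_arn
}

# Referencing module outputs creates the implicit dependency:
# Terraform creates the IAM role before this log group.
resource "aws_cloudwatch_log_group" "training" {
  name = "/ml/${module.training.training_job_name_prefix}"
  tags = { SageMakerRole = module.training.sagemaker_role_arn }
}
```

No depends_on is needed here, because the references alone tell Terraform the ordering.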

<Callout type="warning"> Never add provider or backend blocks inside a shared module. Doing so binds every caller to one specific region, credential profile, or state location and silently breaks reuse. Shared modules declare only required_providers version constraints — never a full provider configuration. </Callout>

Knowledge check (1 of 1)
After calling a module labeled `training`, how does the root configuration read its `sagemaker_role_arn` output?

Calling the Module with Environment-Specific .tfvars

The root main.tf calls the module with a source pointing to its directory:

```hcl
module "training" {
  source = "./modules/fare-pred-training"

  project_name       = var.project_name
  environment        = var.environment
  instance_type      = var.instance_type # matches the instance_type key set in each .tfvars file
  training_image_uri = var.training_image_uri
  s3_data_bucket     = var.s3_data_bucket
}
```

Per-environment values live in separate .tfvars files and are applied with -var-file:

```hcl
# envs/dev/terraform.tfvars
instance_type      = "ml.m5.xlarge"
training_image_uri = "123456789.dkr.ecr.ap-south-1.amazonaws.com/fare-pred-train:dev-latest"
```

```bash
terraform apply -var-file=envs/prod/terraform.tfvars
```

Variable loading precedence from lowest to highest: variable default → TF_VAR_* environment variables → terraform.tfvars (auto-loaded) → *.auto.tfvars (lexicographic order) → -var and -var-file flags (in the order given on the command line, where later flags win). An explicit -var-file=envs/prod/terraform.tfvars overrides any terraform.tfvars present in the working directory. A -var flag overrides a -var-file only when it appears after it on the command line.

Keeping IAM, Tagging, and Logging Inside the Module Boundary

IAM roles, tagging, and logging configuration belong inside the module, not in the root. Co-locating them with the resources they govern means a reviewer auditing modules/fare-pred-training/iam.tf sees the complete privilege boundary in one file. Letting IAM leak into the root creates scattered ad-hoc policies disconnected from the compute they control and makes security audits much harder.

When IAM resources inside the module exceed 150 lines, break them into a dedicated iam.tf inside the module directory. The module still owns them — the separate file is purely a readability split, not a boundary change.

Terraform Validate — Pre-Plan Type Checking

Run terraform validate inside the module directory after every edit:

cd modules/fare-pred-training
terraform init -backend=false
terraform validate
# Success: The configuration is valid.

terraform validate checks syntax correctness, attribute names, and type constraints against the provider schema that was downloaded during init — no cloud API calls, no state reads. It catches a misspelled assume_role_polciy attribute or a list(string) value passed to a string variable in under a second. What it does not catch: whether the instance type exists in your target region, whether the IAM role has the right permissions, or whether the ECR image URI resolves — those require a live terraform plan with valid credentials. CI/CD gating on plan output is covered in 04-cicd-plan-approve-gates. Run validate as a fast pre-save check; run plan as the gate before merge.


Hands-On Exercise

Task: Extract a minimal SageMaker training module and call it from a root configuration with separate dev and prod .tfvars files.

  1. Create modules/fare-pred-training/ with variables.tf, locals.tf, main.tf, outputs.tf, and iam.tf.
  2. In variables.tf, declare project_name (required, string), environment (required, string with validation restricting to dev/staging/prod), instance_type (optional, default "ml.m5.xlarge"), and training_image_uri (required, string).
  3. In locals.tf, define name_prefix = "${var.project_name}-${var.environment}" and a common_tags map merging standard keys with var.tags.
  4. In outputs.tf, export sagemaker_role_arn (referencing the IAM role in iam.tf) and training_job_name_prefix (from local.name_prefix).
  5. Run terraform init -backend=false && terraform validate inside the module directory — confirm exit code 0.
  6. Create envs/dev/terraform.tfvars with instance_type = "ml.m5.xlarge" and envs/prod/terraform.tfvars with instance_type = "ml.p3.2xlarge".
  7. In the root main.tf, call the module and reference module.training.sagemaker_role_arn in an aws_cloudwatch_log_group resource.

Success criteria:
  • terraform validate exits 0 with "The configuration is valid."
  • Setting environment = "qa" in a .tfvars file triggers the validation error message, not a generic provider error.
  • terraform plan -var-file=envs/prod/terraform.tfvars shows ml.p3.2xlarge as the instance type.
  • Removing training_image_uri from the .tfvars file causes Terraform to error with "no value for required variable."


Next: 04-cicd-plan-approve-gates — wiring terraform validate and plan into a GitHub Actions pipeline with manual approval gates before every apply.

Chapter 4 · 10 min

Integrating Terraform into a CI/CD Review Workflow with Plan-Then-Approve Gates


The Four-Stage Validation Funnel

Every PR touching Terraform configuration should pass four sequential checks before a human reviewer opens the diff. Think of them as a funnel: each stage is faster and cheaper than the next, and each catches a different failure class.

`terraform fmt -check -recursive` reads every .tf file in the directory tree and exits non-zero if any file deviates from canonical Terraform style. It makes zero network calls, completes in under a second, and never modifies files — it only reports. If the check fails, the PR is blocked; the engineer fixes locally by running terraform fmt -recursive, then pushes again.

terraform validate parses the configuration and verifies type constraints, attribute names, and module structure. It needs the provider schemas installed, so run terraform init -backend=false first in a fresh checkout. No provider API calls, no backend access, no credentials required, so it is safe to run on every fork and external PR. It exits non-zero when it finds a problem, and with the -json flag it emits a machine-readable report.

terraform init -input=false initialises providers and the backend, the first step that may need credentials to reach the remote state backend. (Remote state backend setup is covered in 02-remote-state-locking-ml-teams.)

terraform plan -no-color -input=false contacts the provider APIs and produces the actual resource diff. This is the most expensive step; the three earlier stages catch the majority of issues before you pay for it.

The -input=false flag is required on every CI plan and apply. Without it, Terraform blocks on stdin waiting for missing variable values. On a non-interactive CI runner there is no stdin, so the job hangs until the runner timeout kills it.

Knowledge check (1 of 1)
Which of the four pipeline stages makes no network calls AND requires no credentials?

Posting Plan Output as a PR Comment

CI logs require pipeline read access to view. A PR comment is visible to every collaborator with repository read permission. That asymmetry is the reason the canonical pattern posts plan output as a PR comment: reviewers must see exactly what will change before deciding whether to approve the merge.

The `hashicorp/setup-terraform` action installs a shim — enabled by default via terraform_wrapper: true — that captures stdout, stderr, and exitcode as step outputs. Your comment-posting step accesses ${{ steps.plan.outputs.stdout }} and embeds the full diff inside a collapsible <details> block, giving reviewers a clean summary table at the top and the full diff on demand.

One constraint to plan around: GitHub PR comments are capped at 65,536 characters. Large ML infrastructure (GPU clusters, dozens of S3 buckets, dozens of IAM policies) routinely exceeds this limit. The safe fallback is to write full output to $GITHUB_STEP_SUMMARY and post only a resource-count summary (Plan: 3 to add, 1 to change, 0 to destroy) as the PR comment body.

<Callout type="warning">A plan file written by terraform plan -out=planfile in CI is never safe to log or commit. The binary plan file stores sensitive variable values in cleartext, so committing or logging a .tfplan file exposes credentials. If you need a saved plan for a two-step apply, store it as an encrypted CI artifact with restricted access.</Callout>

Placing the Approval Gate Correctly

A GitHub Actions environment with required reviewers pauses any job that references it until at least one of up to six authorised reviewers clicks Approve and deploy. Configure the gate by creating a production environment under Settings → Environments → New environment, adding required reviewers, and setting environment: production on the apply job. According to the GitHub Docs, only one of the configured reviewers must approve for the job to proceed.

The gate belongs on the apply job only. Three wrong placements are common:

  • Before plan: Reviewers see nothing yet. They are approving a blank cheque.
  • After apply: The change already happened. The gate is now a post-mortem notification.
  • On the plan job: Any job that references an environment with required reviewers will pause for approval — including a plan job. Setting environment: { name: production, deployment: false } suppresses creation of a deployment record but does not bypass required reviewers; the plan job still waits for approval. To let the plan stage run unblocked, reference a separate environment without required reviewers, or inject credentials as repository-level secrets.

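The canonical job dependency graph: the plan job runs on every PR, reads credentials from repository-level secrets or a gate-free environment, and posts the diff as a comment. The apply job declares needs: plan, runs only on push to main, and pauses for reviewer approval before executing. Because GitHub Actions skips a job when a job it needs was skipped, the plan job must also run on the push event (without posting a comment) for the apply job to start.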

Storing Secrets Safely in CI

Cloud credentials and Terraform input variables such as database passwords or ML platform API keys must never appear in .tf files, .tfvars files, or workflow YAML values committed to the repository.

The correct injection path uses two layers. First, GitHub secrets store encrypted values at rest; any attempt to echo a secret in a workflow step is automatically masked as *** in the runner output. Second, TF_VAR_<name> environment variables map those secrets to Terraform input variables without a .tfvars file. The mapping TF_VAR_db_password: ${{ secrets.DB_PASSWORD }} sets var.db_password in your HCL. The name after TF_VAR_ is case-sensitive and must match the variable block declaration exactly.

Pair this with sensitive = true on the HCL variable declaration and Terraform replaces the value with (sensitive value) in all plan and apply CLI output, which keeps the secret out of the PR comment and CI logs.

Important caveat: `sensitive = true` suppresses CLI output only. The actual value is still persisted in the state file, and if the value is exposed through a sensitive output, terraform output -raw <output_name> prints it in plaintext, bypassing redaction. For credentials that must never touch state at all, Terraform 1.10+ offers ephemeral = true. State encryption and access control are covered in 02-remote-state-locking-ml-teams.
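As a minimal sketch of the declarations this section describes: the db_password name comes from the TF_VAR_db_password mapping above, while ml_platform_api_key is a hypothetical variable added only to illustrate the ephemeral variant.

```hcl
# Receives its value from the TF_VAR_db_password environment variable in CI.
variable "db_password" {
  type        = string
  description = "Database password injected from the DB_PASSWORD GitHub secret."
  sensitive   = true # shown as (sensitive value) in plan/apply output, but still written to state
}

# Hypothetical variable illustrating ephemeral = true (Terraform 1.10+).
# Ephemeral values are not persisted to state or plan files.
variable "ml_platform_api_key" {
  type        = string
  description = "API key that should never be stored in state."
  ephemeral   = true
}
```

Neither block sets a default, so a missing TF_VAR_ entry fails the run instead of silently falling back to a placeholder. An ephemeral variable can only be referenced where Terraform allows ephemeral values, such as provider configuration or write-only arguments, so it is not a drop-in replacement for every sensitive input.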

Diagnosing Missing-Variable Failures in CI

Error: No value for required variable
 The root module input variable "db_password" is not set, and has no default value.

This error means a Terraform input variable declared without a default received no value from any source. In CI it almost always traces to a missing or misspelled secret. Follow this diagnostic path:

  1. Find the variable name in variables.tf — e.g., variable "db_password".
  2. Locate the corresponding TF_VAR_db_password entry in the workflow env: block.
  3. Confirm the referenced secret (e.g., secrets.DB_PASSWORD) exists under Settings → Secrets and variables → Actions.
  4. Check case: TF_VAR_db_Password ≠ TF_VAR_db_password. The name after TF_VAR_ is case-sensitive.

A second variant appears on forks: forked repositories do not inherit parent repository secrets. PRs from external contributors will always fail for any required secret variable. Address this with fork-aware workflow conditions (if: github.event.pull_request.head.repo.full_name == github.repository) or by declaring safe defaults only for non-sensitive configuration variables.


Hands-On Exercise: Wire a Plan-Then-Approve Pipeline

Goal: Create a two-job GitHub Actions workflow for a single S3 bucket with a plan-then-approve gate wired end-to-end.

Steps:

  1. Create .github/workflows/terraform.yml. The plan job runs on pull_request to main with no environment attribute; credentials are injected directly from repository-level secrets so the job never pauses for approval. The apply job runs on push to main, declares needs: plan, and sets environment: production. Let the plan job run on push to main as well, because a job whose needs dependency was skipped is skipped too.
  2. Create a GitHub environment named production under Settings → Environments and add yourself as a required reviewer.
  3. Add a Terraform variable bucket_suffix (type string, no default) to variables.tf. Store a value as a GitHub Actions secret named BUCKET_SUFFIX. Wire TF_VAR_bucket_suffix: ${{ secrets.BUCKET_SUFFIX }} in both job env: blocks.
  4. Open a PR with a trivial change. Confirm the plan job posts a PR comment showing the resource diff with no secret values visible.
  5. Merge the PR. Confirm the apply job enters a paused state requiring your approval, then approve and verify the bucket is created.
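The Terraform side of the exercise could look roughly like the sketch below. The resource's local name and the bucket naming scheme are hypothetical, the AWS provider is assumed to be configured elsewhere, and marking bucket_suffix as sensitive is an assumption made so that the plan output redacts it.

```hcl
# variables.tf
variable "bucket_suffix" {
  type        = string
  description = "Suffix supplied in CI through TF_VAR_bucket_suffix."
  sensitive   = true # assumption: redacts the value as (sensitive value) in the plan
}

# main.tf
resource "aws_s3_bucket" "exercise" {
  # Hypothetical naming scheme; S3 bucket names must be globally unique.
  bucket = "ml-plan-approve-demo-${var.bucket_suffix}"
}
```

Because the variable has no default, the plan job fails with the "No value for required variable" error if the BUCKET_SUFFIX secret or its TF_VAR_ mapping is missing, which exercises the diagnostic path from the previous section.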

Success criteria:

  • The PR comment displays Plan: 1 to add, 0 to change, 0 to destroy with (sensitive value) replacing any secret variable.
  • The apply job pauses at the approval gate and does not execute terraform apply until you click Approve and deploy.
  • The Terraform state shows the bucket after apply completes, with no secret value visible in the PR comment thread.

Next: 05-drift-detection-remediation.

Chapter 5 · 13 min

Detecting and Remediating Infrastructure Drift in a Live ML Environment

Why Drift Happens in ML Infrastructure

Infrastructure drift is the gap between what Terraform believes exists — the state file — and what actually runs in your cloud account. In ML environments, it is endemic: according to a Firefly 2024 vendor survey, 90% of large-scale IaC deployments experience drift, and roughly half of those incidents go unnoticed without active detection tooling.

The trigger is almost always urgency. A training job OOMs at 2 AM. A data scientist logs into the AWS Console and resizes the GPU node from p3.2xlarge to p3.8xlarge to meet a deadline. The job finishes; the console tab closes. Terraform's state file still records p3.2xlarge. The gap is invisible until the next terraform plan run — which, without a scheduled CI pipeline, might be days away and hundreds of dollars in unexpected GPU spend later.

Reading Drift in terraform plan Output

terraform plan is your primary drift detector. Every time it runs, Terraform refreshes resource attributes from the cloud provider and compares them against the state file. When it finds a discrepancy, it prints a dedicated header before the change summary:

Note: Objects have changed outside of Terraform

Below that header, drifted resources appear with the ~ modifier, showing the state-file value on the left and the current cloud value on the right:

~ resource "aws_instance" "gpu_trainer" {
      id            = "i-0ab123cd456ef7890"
    ~ instance_type = "p3.2xlarge" -> "p3.8xlarge"
  }

The left side is what Terraform recorded; the right side is what the cloud provider reports now. The plan body that follows describes what Terraform intends to do — which, by default, is to revert the drift back to the configuration. Read both sections before acting; the header tells you what happened, the plan body tells you what will happen next.

Knowledge check (1 of 1)
You run terraform plan and see 'Note: Objects have changed outside of Terraform'. Which symbol marks an attribute that was modified externally?

Making the Reconcile Decision

When drift is detected, you face a binary choice.

Re-apply the Terraform definition. Leave the configuration as-is and run terraform apply. Terraform reverts the cloud resource to match the configuration. Use this path when the manual change was a mistake, an unauthorized shortcut, or a temporary hotfix that is no longer needed.

Accept the manual change. Update the .tf file to match what is currently running in the cloud, then run terraform apply -refresh-only to sync the state file without touching the resource. Use this path when the change was intentional and must persist.

For the GPU-resize scenario above, if retraining is complete and p3.8xlarge is no longer needed, re-apply is the right call: run terraform apply and the instance resizes back to p3.2xlarge. If the model genuinely requires more GPU memory going forward, update the config to p3.8xlarge, sync state with terraform apply -refresh-only, and open a PR to record the decision.
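For the accept path, the configuration change might be as small as the sketch below. It assumes the instance type is set literally on the resource and that the data.aws_ami.deep_learning data source is defined elsewhere; if the type comes from a variable, the same edit would go into the variable's value instead.

```hcl
# Accepting the manual resize: the configuration now matches the running instance.
resource "aws_instance" "gpu_trainer" {
  ami           = data.aws_ami.deep_learning.id
  instance_type = "p3.8xlarge" # was "p3.2xlarge" before the console change
}
```

Once this is merged, terraform apply -refresh-only records the new instance type in state, and a subsequent terraform plan should propose no change to this attribute.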

<Callout type="warning">Never use the old terraform refresh subcommand. It silently overwrites the state file with no review or approval gate and is deprecated in favor of the -refresh-only option of terraform plan and terraform apply. The -refresh-only mode shows you exactly what would change in state and requires an explicit apply confirmation before writing anything.</Callout>

Bringing Unmanaged Resources Under State Control

Sometimes drift is not a changed attribute but an absent entry: a resource exists in the cloud but Terraform has never tracked it. This is common when teams prototype in the console and later want to codify what they built. `terraform import` resolves this by reading the existing cloud resource and recording it in the state file at a specified address.

The workflow has three required steps — skip any one and the import either fails or produces a plan that immediately destroys what you just imported.

Step 1 — Write the matching resource block first. terraform import does not generate configuration; you must author it.

resource "aws_sagemaker_endpoint" "inference" {
  name                 = "bert-sentiment-prod"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.bert.name
}

Step 2 — Import by cloud ID.

terraform import aws_sagemaker_endpoint.inference bert-sentiment-prod

Step 3 — Verify zero drift. Run terraform plan. If the configuration matches the real resource, the output reads "No changes." If attributes differ, Terraform proposes modifications — adjust the config until the plan is clean.

Since Terraform 1.5, the declarative import block lets you declare imports inside .tf files and preview them through the normal plan-apply cycle, making it safer for CI/CD pipelines than the imperative CLI command.
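A minimal sketch of the declarative form, reusing the endpoint address and ID from the CLI example above (the referenced aws_sagemaker_endpoint_configuration.bert resource is assumed to exist elsewhere in the configuration):

```hcl
# Declarative equivalent of:
#   terraform import aws_sagemaker_endpoint.inference bert-sentiment-prod
import {
  to = aws_sagemaker_endpoint.inference
  id = "bert-sentiment-prod"
}

# The matching resource block is still required alongside the import block.
resource "aws_sagemaker_endpoint" "inference" {
  name                 = "bert-sentiment-prod"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.bert.name
}
```

With this in place, terraform plan reports the pending import in its summary, so reviewers see it in the PR comment like any other change. After a successful apply the import block has done its job and can be removed.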

Knowledge check (1 of 1)
You run terraform import aws_sagemaker_endpoint.inference bert-sentiment-prod without first writing the resource block. What happens on the next terraform plan?

Inspecting State with terraform state list and terraform state show

terraform state list returns every resource address currently tracked in the state file. Use it to audit what Terraform owns without triggering a plan or modifying anything. Pass the -id flag to look up a resource by its cloud provider ID — essential when a monitoring alert fires on an instance ID and you need to find its Terraform address quickly:

$ terraform state list -id=i-0ab123cd456ef7890
module.training_cluster.aws_instance.gpu_trainer[2]

Once you have the address, terraform state show dumps all stored attributes in human-readable HCL format:

$ terraform state show 'module.training_cluster.aws_instance.gpu_trainer[2]'
# module.training_cluster.aws_instance.gpu_trainer[2]:
resource "aws_instance" "gpu_trainer" {
    id            = "i-0ab123cd456ef7890"
    instance_type = "p3.2xlarge"
    ...
}

Both commands are strictly read-only — they never modify state or cloud resources. For machine-readable output, use terraform show -json. The -id lookup pattern is especially useful during incident response when you know a resource's cloud ID from an alert but not the module path or index it lives under in the configuration.

Blocking Unauthorized Instance Types with lifecycle Preconditions

Detection after the fact is useful; prevention is better. A lifecycle block's precondition sub-block runs during terraform plan, before any resource changes are proposed. If the condition expression evaluates to false, Terraform emits Error: Resource precondition failed and halts the entire plan — no apply is possible until the condition is satisfied:

```hcl
resource "aws_instance" "gpu_trainer" {
  ami           = data.aws_ami.deep_learning.id
  instance_type = var.instance_type

  lifecycle {
    precondition {
      condition = contains(
        ["p3.2xlarge", "p3.8xlarge", "p3.16xlarge", "p4d.24xlarge"],
        var.instance_type
      )
      error_message = "ML training requires a GPU instance. Got '${var.instance_type}'; must be one of: p3.2xlarge, p3.8xlarge, p3.16xlarge, p4d.24xlarge."
    }
  }
}
```

With this guard in place, any attempt to apply with a CPU instance type, whether from a misconfigured .tfvars file or a mistaken variable override, fails at plan time with a clear error before any cloud resource changes. The precondition validates the configuration only, so it cannot stop someone from resizing an instance in the console; that kind of drift is still caught by the next terraform plan run.
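The precondition reads var.instance_type, whose declaration is not shown. One possible sketch of that declaration follows, with an optional validation block that applies the same allow-list at the variable level; the description text and the choice to duplicate the list here are assumptions, not part of the original configuration.

```hcl
variable "instance_type" {
  type        = string
  description = "EC2 instance type for GPU training nodes."

  # Optional: rejects a bad value as soon as the variable is evaluated.
  validation {
    condition = contains(
      ["p3.2xlarge", "p3.8xlarge", "p3.16xlarge", "p4d.24xlarge"],
      var.instance_type
    )
    error_message = "Instance type must be one of: p3.2xlarge, p3.8xlarge, p3.16xlarge, p4d.24xlarge."
  }
}
```

A validation block fits when the rule concerns the input alone. The lifecycle precondition is the better fit when the check needs other values, such as data sources or attributes of other resources.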

Hands-On Exercise

Scenario: Your travel-ranking model's training cluster experienced a weekend drift event. Three p3.2xlarge GPU nodes were manually resized to p3.8xlarge in the AWS Console to unblock a failing retraining job. The job has since completed.

Task: Work through the full drift lifecycle in a sandbox account.

  1. Run terraform plan and locate the "Objects have changed" header. Identify which instances drifted and record their instance_type change.
  2. Use terraform state list -id=<instance-id> to find the Terraform resource address for one of the drifted instances.
  3. Run terraform state show on that address and confirm the stored instance_type still reads p3.2xlarge.
  4. Retraining is done and p3.8xlarge is no longer needed. Run terraform apply and verify all three instances revert to p3.2xlarge.
  5. Add a lifecycle precondition to aws_instance.gpu_trainer that only permits p3.2xlarge and p3.8xlarge. Attempt terraform plan -var instance_type=t3.large and confirm the plan halts with Error: Resource precondition failed.

Success criteria: terraform plan shows "No changes" after re-apply; terraform plan -var instance_type=t3.large prints your precondition error message and halts without producing a plan.


This is the final chapter of Terraform for MLOps Engineers. You now have the complete toolkit: write and apply HCL configurations (ch1), manage shared remote state with locking (ch2), build reusable modules (ch3), gate changes through CI/CD review pipelines (ch4), and detect and remediate infrastructure drift in production (ch5).
