Mohammad Gufran Jahangir, February 12, 2026

Terraform is fantastic… until state goes wrong.

If you’ve ever seen:

  • “Why is Terraform trying to recreate everything?”
  • “Who ran apply last?”
  • “It worked on my machine, but prod is broken”
  • “We changed it in the console and now Terraform is angry”

…you’ve met the real boss of Terraform: the state file.

This guide will make you dangerously confident with state management—remote state, locking, drift, and recovery—with step-by-step workflows and real examples you can use immediately.



1) What Terraform state actually is (in plain English)

Terraform state is Terraform’s memory.

It answers two critical questions:

  1. What did I create?
  2. What is the real ID of that thing in the cloud?

When Terraform creates a resource (say an AWS EC2 instance), the cloud assigns a real ID like i-0abc123.... Terraform stores that mapping in state so it can later update or destroy the correct thing.

What’s inside state?

  • Resource addresses (like aws_instance.app)
  • Real cloud identifiers
  • Last-known attribute values
  • Dependency graph metadata
  • Often sensitive values (yes, state can contain secrets)
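You can inspect all of this without touching any infrastructure. Here is a small sketch, wrapped as shell helpers so they drop into scripts (the resource address `aws_instance.app` is just the article's example; assumes an already-initialised working directory):

```shell
# Read-only state inspection; neither command modifies infrastructure.
list_tracked_resources() {
  terraform state list            # every resource address Terraform tracks
}

show_resource() {
  terraform state show "$1"       # last-known attributes, incl. real cloud IDs
}
```

Running `show_resource aws_instance.app` is often the fastest way to confirm which real instance ID Terraform thinks it owns.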

What state is NOT

  • It’s not your desired configuration (that’s your .tf files)
  • It’s not a “nice-to-have”
  • It’s not safe to treat casually

State is the source of truth for Terraform operations. Lose it or corrupt it, and you can’t trust plan or apply.


2) Local state vs remote state (why remote is the default for teams)

Local state (default)

Terraform stores terraform.tfstate on your laptop.

Okay for:

  • learning
  • personal sandboxes
  • short-lived experiments

Bad for:

  • teams
  • production
  • compliance
  • long-running infra

Because the moment two people run Terraform, local state becomes a time bomb.

Remote state (recommended)

Terraform stores state in a shared backend (like object storage or Terraform Cloud/HCP).

You get:

  • one shared source of truth
  • collaboration without chaos
  • easier recovery via versioning
  • locking support (in many backends)
  • access control & auditing

If you run Terraform in a team and you’re not using remote state, it’s not “if” you’ll have a problem—it’s “when.”


3) Remote state: how to set it up safely (step-by-step)

Step A — Choose a backend (simple decision rules)

A good backend should provide:

  • Durability (state must not disappear)
  • Encryption at rest
  • Versioning / history (for recovery)
  • Locking (or a strong alternative)
  • Access control (least privilege)

Common choices:

  • AWS: S3 (state) + DynamoDB (locking)
  • GCP: GCS (state) with built-in state locking
  • Azure: Azure Blob Storage (state) with native locking via blob leases
  • Terraform Cloud / HCP Terraform: managed remote state + locking + runs

Step B — Use a separate state per environment (avoid mega-state)

A “mega-state” (dev+stage+prod in one file) becomes fragile and slow.

Recommended layout:

  • prod has its own state
  • stage has its own state
  • dev has its own state

This reduces blast radius: a mistake in dev shouldn’t even touch prod state.

Step C — Configure the backend (example)

Example: AWS S3 backend + DynamoDB locking

terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "mycompany-terraform-locks"
    encrypt        = true
  }
}

Then run:

terraform init -migrate-state

That -migrate-state is important. It tells Terraform:
“Move my existing local state to the remote backend.”

Step D — Protect the state bucket/container

Treat your state storage like a production database:

  • enable versioning
  • restrict access tightly
  • enable encryption
  • log access
  • prevent public access
  • consider “delete protection” via policies

Because state is not just metadata—it can contain sensitive infrastructure details.
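For S3 specifically, most of that checklist can be applied with a few AWS CLI calls. A hypothetical hardening sketch (the bucket name is the article's example; `AES256` is a stand-in if you are not using a KMS key):

```shell
# Sketch: harden an S3 state bucket. Run once with admin credentials.
harden_state_bucket() {
  bucket="$1"
  # 1. Keep every state version so you can roll back after a bad write
  aws s3api put-bucket-versioning --bucket "$bucket" \
    --versioning-configuration Status=Enabled
  # 2. Encrypt at rest (swap in aws:kms plus a key ID if policy requires CMK)
  aws s3api put-bucket-encryption --bucket "$bucket" \
    --server-side-encryption-configuration \
    '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'
  # 3. Block all public access
  aws s3api put-public-access-block --bucket "$bucket" \
    --public-access-block-configuration \
    BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true
}
```

Call it as `harden_state_bucket "mycompany-terraform-state"` from a shell with admin credentials.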


4) Locking: the feature that prevents silent disasters

Here’s the nightmare scenario:

  • Engineer A runs terraform apply
  • Engineer B runs terraform apply 30 seconds later
  • Both read the same “old” state
  • Both make changes
  • Last writer wins → state becomes inconsistent with reality

You might not notice immediately.
You’ll notice later when Terraform wants to destroy/replace something “mysteriously.”

What locking does

Locking ensures only one Terraform operation that writes state can run at a time.

If someone else tries:

  • they’ll get an error saying state is locked
  • they can wait or stop

Real-world example (what locking prevents)

Without locking:
Two applies happen. One creates resources. Another overwrites state without those resources. Terraform now “forgets” resources exist.

With locking:
Second apply is blocked. State stays consistent.

What if a lock gets stuck?

Sometimes a crashed run leaves a lock behind.

You can remove it only if you’re sure nothing is running:

terraform force-unlock <LOCK_ID>

Use this like a fire extinguisher:

  • useful in emergencies
  • dangerous if used casually
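One cheap safety net is to wrap the command so the operator must re-type the lock ID before anything happens. The guard below is my own sketch, not a Terraform feature (`-force` skips Terraform's own interactive confirmation, so the guard replaces it):

```shell
# Guarded force-unlock: only proceeds if the operator re-types the lock ID.
confirm_force_unlock() {
  lock_id="$1"
  printf 'Re-type the lock ID to confirm force-unlock: '
  read -r typed
  if [ "$typed" = "$lock_id" ]; then
    terraform force-unlock -force "$lock_id"
  else
    echo "Lock ID mismatch; aborting." >&2
    return 1
  fi
}
```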

5) Drift: when reality changes but Terraform didn’t do it

Drift is the difference between:

  • what Terraform thinks exists (state + config)
  • what actually exists in the cloud right now

Common causes of drift

  • manual console changes (“quick fix”)
  • auto-scaling or platform automation changing settings
  • someone runs a script outside Terraform
  • provider/API defaults change
  • resources modified by other tools (Kubernetes controllers, CI pipelines)

How drift shows up

You run:

terraform plan

…and Terraform says:

  • “I’m going to change X”
  • “I’m going to recreate Y”
  • “I detected changes made outside Terraform”

This is Terraform telling you: “Your world and my memory are diverging.”


6) Drift detection: a simple workflow that actually works

Step 1 — Make drift detection a scheduled habit

Do not wait until release day.

Run a drift plan daily (or every few hours for critical infra) using:

terraform plan -detailed-exitcode

Exit codes:

  • 0 = no changes
  • 2 = changes detected (possible drift or desired change)
  • 1 = error

This is perfect for CI.
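In a pipeline, that exit-code contract maps cleanly onto an alerting decision. A minimal sketch (the alert action is a placeholder comment; assumes the working directory is already initialised):

```shell
# Scheduled drift check: translate plan's exit code into a status string.
drift_status() {
  terraform plan -detailed-exitcode -input=false >/dev/null 2>&1
  case $? in
    0) echo "clean" ;;   # state, config, and reality agree
    2) echo "drift" ;;   # changes pending: page/alert your team here
    *) echo "error" ;;   # auth failure, provider bug, etc.
  esac
}
```

A cron or CI job can then fail the run whenever `drift_status` prints anything but `clean`.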

Step 2 — Use refresh-only when you want to “sync memory” safely

Sometimes you only want Terraform to update state with reality without changing real resources.

Use:

terraform plan -refresh-only
terraform apply -refresh-only

This is a clean way to:

  • acknowledge external changes
  • update Terraform’s view
  • then decide what to do next

Step 3 — Decide: accept drift or revert drift

When drift is detected, you have two choices:

A) Accept drift (Terraform should match reality)

  • update your .tf to reflect the new desired state
  • then apply normally

B) Revert drift (reality should match Terraform)

  • do a normal terraform apply to restore to declared config

Real example: drift in an EC2 instance type

Someone changes t3.medium → t3.large in the console.

Terraform plan shows:

  • “instance_type will be changed back to t3.medium”

Now decide:

  • Was the console change intentional? Update code.
  • Was it a temporary “panic fix”? Revert via apply.

Either way, make code match the decision. That’s the point.
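"Accepting" the drift in code is just editing the attribute. A sketch for the instance-type example (the resource name and AMI variable are assumptions, not from a real config):

```hcl
# Accepting the console change: desired config now matches reality.
resource "aws_instance" "app" {
  ami           = var.app_ami   # placeholder; keep your existing AMI reference
  instance_type = "t3.large"    # was "t3.medium" before the console edit
}
```

After this edit, `terraform plan` shows no diff for the instance type, and the decision is captured in version control.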


7) Recovery: what to do when state goes wrong (your playbook)

State issues feel scary because the blast radius can be huge. The calm truth is: most state disasters are recoverable if you follow a disciplined approach.

The 4 most common state emergencies

  1. State file deleted or overwritten
  2. Terraform wants to recreate resources that already exist
  3. Someone removed/renamed resources in code without state migration
  4. A partial apply happened (some resources created, others failed)

Let’s fix them one by one.


Recovery Scenario 1: “State is gone” (or empty)

Symptoms

  • Terraform wants to create everything from scratch
  • remote backend key points to missing/blank file

Best recovery path

  1. Stop all runs immediately (no more applies)
  2. Restore from backend versioning (that’s why we enabled it)
  3. Re-run:
terraform init
terraform plan

If versioning is not available, you can rebuild state by importing resources (painful, but possible):

terraform import aws_instance.app i-0abc123...

You do this for each important resource until Terraform plan stabilizes.

Rule: when rebuilding from imports, start with foundational resources (networking, IAM, clusters), then services.
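On Terraform 1.5 and newer you can also declare imports in configuration instead of running the CLI once per resource, which makes a rebuild reviewable in a pull request. A sketch reusing the article's placeholder ID:

```hcl
# Declarative import (Terraform >= 1.5): plan/apply performs the import.
import {
  to = aws_instance.app
  id = "i-0abc123..."   # the real cloud ID, as in the CLI example above
}
```

`terraform plan` then shows the import as part of the plan, so a teammate can review it before apply.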


Recovery Scenario 2: “Terraform wants to destroy/recreate everything”

This often happens after:

  • refactors
  • moved modules
  • renamed resources
  • changed for_each keys
  • workspace/state mixups

The fix is usually state mapping, not actual recreation.

Step 1 — Confirm you’re in the correct state/environment

Check workspace (if used):

terraform workspace show

Step 2 — If resources were renamed or moved, use moved blocks (preferred)

When you rename something, tell Terraform:

moved {
  from = aws_security_group.old_name
  to   = aws_security_group.new_name
}

This preserves identity without destroying resources.

Step 3 — For more complex refactors, move state addresses

terraform state mv 'module.old.aws_s3_bucket.logs' 'module.new.aws_s3_bucket.logs'

Then:

terraform plan

If the plan now looks sane, you avoided a disaster.


Recovery Scenario 3: “State is corrupted” (or contains bad data)

Symptoms

  • errors reading state
  • provider serialization issues
  • weird plan behavior after upgrades

Safe path

  1. Pull a copy of current state:
terraform state pull > backup.tfstate
  2. If you have a known good version in remote history, restore it first.
  3. If you must push a corrected state (advanced, risky):
terraform state push fixed.tfstate

Only do state pushes when:

  • you fully understand the consequences
  • you have backups
  • no one else is running Terraform

Recovery Scenario 4: Partial apply (some resources created, then failure)

Symptoms

  • cloud shows resources exist
  • Terraform state may not include them
  • next plan wants to create duplicates

Fix

  • If resources exist but aren’t in state: import them
terraform import aws_lb.app arn:aws:elasticloadbalancing:...
  • If resources exist and Terraform wants replacements, confirm whether it’s safe.
    Sometimes it’s just a naming mismatch; other times it’s a real diff.

Pro move: run plan first, and import only what’s missing.


8) The “state safety rules” every team should adopt

Rule 1: Remote state is mandatory for shared environments

Local state is for learning and personal sandboxes only.

Rule 2: Never run apply from laptops for production (if you can avoid it)

Use CI/CD or controlled runners so you get:

  • consistent Terraform version
  • consistent credentials
  • audit trail
  • reduced human mistakes

Rule 3: Keep state small by splitting stacks

Separate states for:

  • networking
  • clusters/platform
  • apps
  • data stores

Smaller state = faster plans, safer changes.

Rule 4: Use locking (or enforce single-run discipline)

If backend supports locking, turn it on.
If not, enforce one-writer workflow using CI.

Rule 5: Treat state as sensitive

State can contain:

  • internal IPs
  • resource metadata
  • sometimes secrets

Restrict access like you would for production credentials.

Rule 6: Make drift detection routine, not reactive

Schedule drift plans and alert on exit code 2.


9) A complete “gold standard” workflow (copy this)

Daily (automated)

  • terraform init
  • terraform plan -detailed-exitcode
  • alert if exit code = 2

Every change (PR pipeline)

  • lint/format/validate
  • plan and attach output to PR
  • require review for large cost/risk changes

Apply (controlled)

  • apply only from CI runner
  • state locking enabled
  • approvals for prod

Monthly (maintenance)

  • review untagged/orphaned resources (cost + drift)
  • verify state backend versioning and access controls
  • rotate credentials if needed

10) Final takeaway (the line to remember)

If Terraform is your infrastructure brain, state is its memory.

  • Remote state makes memory shared and durable
  • Locking prevents two people from rewriting memory at once
  • Drift is reality changing behind your back
  • Recovery is having a calm, tested plan when memory breaks

Master state, and Terraform becomes a power tool instead of a roulette wheel.


Implementation: Remote State + Locking + Drift Checks + Recovery (AWS / Azure / GCP, Laptops or CI/CD)

The goal (what “good” looks like)

A production-grade Terraform setup usually has:

  • Remote state (durable + encrypted + versioned)
  • Locking (prevents concurrent applies)
  • Per-environment state (dev/stage/prod separated)
  • Drift detection (scheduled plan and alert)
  • Recovery plan (restore previous state or rebuild safely)

Option 1: AWS (S3 Remote State + DynamoDB Locking)

A) Create your state storage (best-practice checklist)

S3 bucket

  • versioning: ON
  • encryption: ON
  • public access: BLOCKED
  • access logs: ON (optional but great)
  • lifecycle: keep versions (don’t aggressively expire)

DynamoDB table

  • used only for Terraform locking
  • partition key: LockID (string)
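The storage itself can be bootstrapped with a tiny, separate Terraform stack (which keeps its own local or manually-managed state, the classic chicken-and-egg). A sketch using the article's example names:

```hcl
# Bootstrap stack for remote-state storage (names from the article's example).
resource "aws_s3_bucket" "tf_state" {
  bucket = "mycompany-terraform-state"
}

resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_public_access_block" "tf_state" {
  bucket                  = aws_s3_bucket.tf_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "tf_locks" {
  name         = "mycompany-terraform-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"   # the key the S3 backend's locking expects

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Apply this stack once, then point every other stack's backend block at the bucket and table it created.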

B) Terraform backend config (example)

terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "mycompany-terraform-locks"
    encrypt        = true
  }
}

Then:

terraform init -migrate-state

C) How to structure state keys (avoid mega-state)

Use separate keys per environment and per stack:

  • dev/network/terraform.tfstate
  • dev/platform/terraform.tfstate
  • dev/apps/terraform.tfstate
  • prod/network/terraform.tfstate
  • prod/platform/terraform.tfstate
  • prod/apps/terraform.tfstate

This keeps plans fast and reduces blast radius.


Option 2: Azure (Azure Blob Remote State)

A) Create your state storage

  • Storage account (private)
  • Container for state (e.g., tfstate)
  • Encryption enabled (default)
  • Soft delete/versioning (if available in your policy; highly recommended)

B) Terraform backend config (example)

terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "mystatetfstore"
    container_name       = "tfstate"
    key                  = "prod/network/terraform.tfstate"
  }
}

Then:

terraform init -migrate-state

Option 3: GCP (GCS Remote State)

A) Create your state storage

  • GCS bucket (private)
  • versioning: ON
  • uniform access: ON
  • encryption: default or CMEK (org policy dependent)

B) Terraform backend config (example)

terraform {
  backend "gcs" {
    bucket = "mycompany-terraform-state"
    prefix = "prod/network"
  }
}

Then:

terraform init -migrate-state

Deploying Terraform: Laptops vs CI/CD (choose your maturity level)

Approach A — Terraform CLI on laptops (OK for early stage)

This is workable if you enforce discipline:

Minimum rules (non-negotiable)

  1. Remote state only (never local for shared envs)
  2. Locking enabled (AWS is easiest)
  3. One apply at a time (respect locks)
  4. No manual console edits (or document and reconcile immediately)
  5. Use the same Terraform version everywhere (pin it via required_version)

“Safe laptop apply” workflow

terraform fmt -recursive
terraform validate
terraform plan -out tfplan
terraform apply tfplan

Handling a stuck lock (only if you’re sure no run is active)

terraform force-unlock <LOCK_ID>

When laptops become risky:
The moment you have multiple engineers touching prod weekly, move to CI/CD.


Approach B — CI/CD (recommended for production)

This is the best practice for reliability and auditability.

What CI/CD gives you

  • consistent Terraform version and environment
  • predictable credentials
  • approval gates (especially for prod)
  • audit logs
  • reduced “works on my machine” failures

Recommended pipeline shape

On Pull Request

  • fmt / validate
  • plan
  • publish plan output as build artifact or PR comment

On Merge (or manual approval)

  • re-run plan (optional but best)
  • apply only with approval for prod

Practical environment protection

  • dev: auto-apply on merge
  • stage: auto-apply on merge (optional)
  • prod: manual approval required

Drift detection (works for any cloud)

The simplest drift detector

Run this daily (or every few hours for critical stacks):

terraform init
terraform plan -detailed-exitcode

Interpretation:

  • exit 0 → no drift / no changes
  • exit 2 → drift or pending changes detected
  • exit 1 → error (broken auth, provider issues, etc.)

Best practice: “refresh-only” when you want to sync state safely

If you suspect the state is stale but you’re not ready to change infra:

terraform plan -refresh-only
terraform apply -refresh-only

This updates Terraform’s “memory” without changing real resources.


Recovery playbook (copy/paste into your internal wiki)

Scenario 1: State deleted or overwritten

Best fix: restore a previous version from your backend (this is why versioning matters).

Then:

terraform init
terraform plan

Scenario 2: Terraform wants to recreate everything after refactor

Usually you renamed/moved resources.

Preferred fix: add moved blocks:

moved {
  from = aws_security_group.old_name
  to   = aws_security_group.new_name
}

Or move state addresses:

terraform state mv 'module.old.x' 'module.new.x'

Scenario 3: Resources exist but are missing from state

Import them:

terraform import <address> <id>

Then re-run:

terraform plan

Scenario 4: Stuck lock

Only if you confirm no apply is running:

terraform force-unlock <LOCK_ID>

What I recommend for you (simple upgrade path)

If you’re on AWS, do this (fastest and strongest):

  1. S3 + DynamoDB locking
  2. Move applies to CI/CD for prod (laptops okay for dev/stage initially)
  3. Schedule daily drift detection per stack
  4. Split state into network, platform, apps
