Terraform is fantastic… until state goes wrong.
If you’ve ever seen:
- “Why is Terraform trying to recreate everything?”
- “Who ran apply last?”
- “It worked on my machine, but prod is broken”
- “We changed it in the console and now Terraform is angry”
…you’ve met the real boss of Terraform: the state file.
This guide will make you dangerously confident with state management—remote state, locking, drift, and recovery—with step-by-step workflows and real examples you can use immediately.
1) What Terraform state actually is (in plain English)
Terraform state is Terraform’s memory.
It answers two critical questions:
- What did I create?
- What is the real ID of that thing in the cloud?
When Terraform creates a resource (say an AWS EC2 instance), the cloud assigns a real ID like i-0abc123.... Terraform stores that mapping in state so it can later update or destroy the correct thing.
What’s inside state?
- Resource addresses (like `aws_instance.app`)
- Real cloud identifiers
- Last-known attribute values
- Dependencies graph metadata
- Often sensitive values (yes, state can contain secrets)
What state is NOT
- It’s not your desired configuration (that’s your `.tf` files)
- It’s not a “nice-to-have”
- It’s not safe to treat casually
State is the source of truth for Terraform operations. Lose it or corrupt it, and you can’t trust plan or apply.
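To make the “memory” idea concrete, here is a minimal sketch that reads the resource-to-ID mapping out of a state file. The JSON below is a trimmed, illustrative example of the version-4 state format, not a complete state file; real ones also carry lineage, provider blocks, dependency metadata, and often sensitive attribute values.

```python
import json

# Trimmed, illustrative version-4 state file (real files have many more fields).
state_json = """
{
  "version": 4,
  "serial": 7,
  "resources": [
    {
      "mode": "managed",
      "type": "aws_instance",
      "name": "app",
      "instances": [
        {"attributes": {"id": "i-0abc123def456", "instance_type": "t3.medium"}}
      ]
    }
  ]
}
"""

def resource_ids(state: dict) -> dict:
    """Map each resource address (type.name) to its real cloud ID."""
    mapping = {}
    for res in state.get("resources", []):
        if res.get("mode") != "managed":
            continue  # skip data sources; they track reads, not owned resources
        address = f"{res['type']}.{res['name']}"
        for inst in res.get("instances", []):
            mapping[address] = inst["attributes"].get("id")
    return mapping

state = json.loads(state_json)
print(resource_ids(state))  # {'aws_instance.app': 'i-0abc123def456'}
```

This mapping is exactly what Terraform consults on every plan: lose it, and Terraform no longer knows that `aws_instance.app` is `i-0abc123def456`.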
2) Local state vs remote state (why remote is the default for teams)
Local state (default)
Terraform stores terraform.tfstate on your laptop.
Okay for:
- learning
- personal sandboxes
- short-lived experiments
Bad for:
- teams
- production
- compliance
- long-running infra
Because the moment two people run Terraform, local state becomes a time bomb.
Remote state (recommended)
Terraform stores state in a shared backend (like object storage or Terraform Cloud/HCP).
You get:
- one shared source of truth
- collaboration without chaos
- easier recovery via versioning
- locking support (in many backends)
- access control & auditing
If you run Terraform in a team and you’re not using remote state, it’s not “if” you’ll have a problem—it’s “when.”
3) Remote state: how to set it up safely (step-by-step)
Step A — Choose a backend (simple decision rules)
A good backend should provide:
- Durability (state must not disappear)
- Encryption at rest
- Versioning / history (for recovery)
- Locking (or a strong alternative)
- Access control (least privilege)
Common choices:
- AWS: S3 (state) + DynamoDB (locking); recent Terraform versions can also lock natively in S3
- GCP: GCS (state + built-in locking)
- Azure: Azure Blob Storage (state + locking via blob leases)
- Terraform Cloud / HCP Terraform: managed remote state + locking + runs
Step B — Use a separate state per environment (avoid mega-state)
A “mega-state” (dev+stage+prod in one file) becomes fragile and slow.
Recommended layout:
- `prod` has its own state
- `stage` has its own state
- `dev` has its own state
This reduces blast radius: a mistake in dev shouldn’t even touch prod state.
Step C — Configure the backend (example)
Example: AWS S3 backend + DynamoDB locking
```hcl
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "mycompany-terraform-locks"
    encrypt        = true
  }
}
```
Then run:
terraform init -migrate-state
That -migrate-state is important. It tells Terraform:
“Move my existing local state to the remote backend.”
Step D — Protect the state bucket/container
Treat your state storage like a production database:
- enable versioning
- restrict access tightly
- enable encryption
- log access
- prevent public access
- consider “delete protection” via policies
Because state is not just metadata—it can contain sensitive infrastructure details.
4) Locking: the feature that prevents silent disasters
Here’s the nightmare scenario:
- Engineer A runs `terraform apply`
- Engineer B runs `terraform apply` 30 seconds later
- Both read the same “old” state
- Both make changes
- Last writer wins → state becomes inconsistent with reality
You might not notice immediately.
You’ll notice later when Terraform wants to destroy/replace something “mysteriously.”
What locking does
Locking ensures only one Terraform operation that writes state can run at a time.
If someone else tries:
- they’ll get an error saying state is locked
- they can wait or stop
Real-world example (what locking prevents)
Without locking:
Two applies happen. One creates resources. Another overwrites state without those resources. Terraform now “forgets” resources exist.
With locking:
Second apply is blocked. State stays consistent.
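The race itself has nothing to do with Terraform internals; it is a plain read-modify-write hazard. A toy Python model (not real Terraform, and the class and names are invented for illustration) makes both outcomes concrete:

```python
# Toy model of the read-modify-write race on a state backend.
# Not Terraform internals -- just the failure mode locking prevents.

class Backend:
    def __init__(self):
        self.resources = []
        self.locked_by = None

    def lock(self, who):
        if self.locked_by is not None:
            raise RuntimeError(f"Error acquiring the state lock: held by {self.locked_by}")
        self.locked_by = who

    def unlock(self):
        self.locked_by = None

# Without locking: both engineers read the same "old" state,
# then the last writer wins.
backend = Backend()
a_view = list(backend.resources)      # A reads state
b_view = list(backend.resources)      # B reads state 30 seconds later
a_view.append("aws_instance.app")     # A applies and writes
backend.resources = a_view
b_view.append("aws_s3_bucket.logs")   # B applies and writes over A
backend.resources = b_view
print(backend.resources)              # ['aws_s3_bucket.logs'] -- the instance is "forgotten"

# With locking: B's run fails fast instead of corrupting state.
backend = Backend()
backend.lock("engineer-a")
try:
    backend.lock("engineer-b")
except RuntimeError as err:
    print(err)                        # Error acquiring the state lock: held by engineer-a
```

The unlocked run ends with state that no longer mentions `aws_instance.app` even though it exists in the cloud, which is exactly the “Terraform forgets resources” failure described above.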
What if a lock gets stuck?
Sometimes a crashed run leaves a lock behind.
You can remove it only if you’re sure nothing is running:
terraform force-unlock <LOCK_ID>
Use this like a fire extinguisher:
- useful in emergencies
- dangerous if used casually
5) Drift: when reality changes but Terraform didn’t do it
Drift is the difference between:
- what Terraform thinks exists (state + config)
- what actually exists in the cloud right now
Common causes of drift
- manual console changes (“quick fix”)
- auto-scaling or platform automation changing settings
- someone runs a script outside Terraform
- provider/API defaults change
- resources modified by other tools (Kubernetes controllers, CI pipelines)
How drift shows up
You run:
terraform plan
…and Terraform says:
- “I’m going to change X”
- “I’m going to recreate Y”
- “I detected changes made outside Terraform”
This is Terraform telling you: “Your world and my memory are diverging.”
6) Drift detection: a simple workflow that actually works
Step 1 — Make drift detection a scheduled habit
Do not wait until release day.
Run a drift plan daily (or every few hours for critical infra) using:
terraform plan -detailed-exitcode
Exit codes:
- `0` = no changes
- `2` = changes detected (possible drift or desired change)
- `1` = error
This is perfect for CI.
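A minimal CI sketch around those exit codes might look like the following. The `classify` and `check_stack` helpers are illustrative names, not a real tool, and the demo uses stand-in `sh` commands so the sketch runs without Terraform installed; in a pipeline you would pass the real plan command.

```python
import subprocess

def classify(exit_code: int) -> str:
    """Translate a `terraform plan -detailed-exitcode` result for alerting."""
    if exit_code == 0:
        return "clean"   # no changes, no drift
    if exit_code == 2:
        return "drift"   # changes detected -- alert a human
    return "error"       # auth/provider/config failure -- also alert

def check_stack(plan_cmd) -> str:
    # In CI, plan_cmd would be:
    #   ["terraform", "plan", "-detailed-exitcode", "-input=false"]
    result = subprocess.run(plan_cmd, capture_output=True)
    return classify(result.returncode)

# Demo with stand-in commands that just exit with the code of interest:
print(check_stack(["sh", "-c", "exit 0"]))  # clean
print(check_stack(["sh", "-c", "exit 2"]))  # drift
```

The key design point is that exit code 2 is not a failure: your pipeline should treat it as “needs human review,” not as a broken build.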
Step 2 — Use refresh-only when you want to “sync memory” safely
Sometimes you only want Terraform to update state with reality without changing real resources.
Use:
terraform plan -refresh-only
terraform apply -refresh-only
This is a clean way to:
- acknowledge external changes
- update Terraform’s view
- then decide what to do next
Step 3 — Decide: accept drift or revert drift
When drift is detected, you have two choices:
A) Accept drift (Terraform should match reality)
- update your `.tf` files to reflect the new desired state
- then apply normally
B) Revert drift (reality should match Terraform)
- run a normal `terraform apply` to restore the declared config
Real example: drift in an EC2 instance type
Someone changes t3.medium → t3.large in console.
Terraform plan shows:
- “instance_type will be changed back to t3.medium”
Now decide:
- Was the console change intentional? Update code.
- Was it a temporary “panic fix”? Revert via apply.
Either way, make code match the decision. That’s the point.
7) Recovery: what to do when state goes wrong (your playbook)
State issues feel scary because the blast radius can be huge. The calm truth is: most state disasters are recoverable if you follow a disciplined approach.
The 4 most common state emergencies
- State file deleted or overwritten
- Terraform wants to recreate resources that already exist
- Someone removed/renamed resources in code without state migration
- A partial apply happened (some resources created, others failed)
Let’s fix them one by one.
Recovery Scenario 1: “State is gone” (or empty)
Symptoms
- Terraform wants to create everything from scratch
- remote backend key points to missing/blank file
Best recovery path
- Stop all runs immediately (no more applies)
- Restore from backend versioning (that’s why we enabled it)
- Re-run:
terraform init
terraform plan
If versioning is not available, you can rebuild state by importing resources (painful, but possible):
terraform import aws_instance.app i-0abc123...
You do this for each important resource until Terraform plan stabilizes.
Rule: when rebuilding from imports, start with foundational resources first (networking, IAM, clusters), then services.
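One way to honor that ordering rule is to tag each surviving resource with a tier and generate the `terraform import` commands foundational-first. The inventory, tier numbers, and IDs below are hypothetical placeholders for whatever your cloud inventory actually shows:

```python
# Hypothetical inventory of surviving resources, tagged with a tier:
# lower tier = more foundational, import first.
inventory = [
    ("aws_instance.app", "i-0abc123", 3),
    ("aws_vpc.main", "vpc-0aaa111", 1),
    ("aws_iam_role.app", "app-role", 1),
    ("aws_eks_cluster.main", "main-cluster", 2),
]

def import_commands(resources):
    """Emit `terraform import` commands, foundational tiers first."""
    ordered = sorted(resources, key=lambda r: r[2])  # stable sort by tier
    return [f"terraform import {addr} {rid}" for addr, rid, _ in ordered]

# VPC and IAM role come out before the cluster and the instance.
for cmd in import_commands(inventory):
    print(cmd)
```

Run the emitted commands one tier at a time, re-running `terraform plan` between tiers until the plan stabilizes.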
Recovery Scenario 2: “Terraform wants to destroy/recreate everything”
This often happens after:
- refactors
- moved modules
- renamed resources
- changed `for_each` keys
- workspace/state mixups
The fix is usually state mapping, not actual recreation.
Step 1 — Confirm you’re in the correct state/environment
Check workspace (if used):
terraform workspace show
Step 2 — If resources were renamed or moved, use moved blocks (preferred)
When you rename something, tell Terraform:
```hcl
moved {
  from = aws_security_group.old_name
  to   = aws_security_group.new_name
}
```
This preserves identity without destroying resources.
Step 3 — For more complex refactors, move state addresses
terraform state mv 'module.old.aws_s3_bucket.logs' 'module.new.aws_s3_bucket.logs'
Then:
terraform plan
If the plan now looks sane, you avoided a disaster.
Recovery Scenario 3: “State is corrupted” (or contains bad data)
Symptoms
- errors reading state
- provider serialization issues
- weird plan behavior after upgrades
Safe path
- Pull a copy of current state:
terraform state pull > backup.tfstate
- If you have a known good version in remote history, restore it first.
- If you must push a corrected state (advanced, risky):
terraform state push fixed.tfstate
Only do state pushes when:
- you fully understand the consequences
- you have backups
- no one else is running Terraform
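Before a risky push, a small sanity check on the candidate file can catch the worst mistakes (empty state, wrong format version, truncated JSON). This sketch inspects a few illustrative fields of the version-4 format; it is not an exhaustive validation, and the field checks are assumptions you should adapt to your own state files.

```python
import json

def sanity_check(text: str) -> list:
    """Return a list of problems in a candidate state file (empty list = looks plausible)."""
    problems = []
    try:
        state = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if state.get("version") != 4:
        problems.append(f"unexpected state format version: {state.get('version')}")
    if not isinstance(state.get("serial"), int):
        problems.append("missing or invalid 'serial'")
    if not state.get("lineage"):
        problems.append("missing 'lineage'")
    if not state.get("resources"):
        problems.append("no resources -- pushing this would make Terraform forget everything")
    return problems

good = '{"version": 4, "serial": 12, "lineage": "abc-123", "resources": [{"type": "aws_vpc", "name": "main"}]}'
bad  = '{"version": 4, "serial": 12, "lineage": "abc-123", "resources": []}'
print(sanity_check(good))  # []
print(sanity_check(bad))   # ['no resources -- pushing this would make Terraform forget everything']
```

An empty problem list does not make a push safe; it only rules out the obvious self-inflicted disasters before you take the irreversible step.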
Recovery Scenario 4: Partial apply (some resources created, then failure)
Symptoms
- cloud shows resources exist
- Terraform state may not include them
- next plan wants to create duplicates
Fix
- If resources exist but aren’t in state: import them
terraform import aws_lb.app arn:aws:elasticloadbalancing:...
- If resources exist and Terraform wants replacements, confirm whether it’s safe.
Sometimes it’s just a naming mismatch; other times it’s a real diff.
Pro move: run plan first, and import only what’s missing.
8) The “state safety rules” every team should adopt
Rule 1: Remote state is mandatory for shared environments
Local state is for learning and personal sandboxes only.
Rule 2: Never run apply from laptops for production (if you can avoid it)
Use CI/CD or controlled runners so you get:
- consistent Terraform version
- consistent credentials
- audit trail
- reduced human mistakes
Rule 3: Keep state small by splitting stacks
Separate states for:
- networking
- clusters/platform
- apps
- data stores
Smaller state = faster plans, safer changes.
Rule 4: Use locking (or enforce single-run discipline)
If backend supports locking, turn it on.
If not, enforce one-writer workflow using CI.
Rule 5: Treat state as sensitive
State can contain:
- internal IPs
- resource metadata
- sometimes secrets
Restrict access like you would for production credentials.
Rule 6: Make drift detection routine, not reactive
Schedule drift plans and alert on exit code 2.
9) A complete “gold standard” workflow (copy this)
Daily (automated)
- `terraform init`
- `terraform plan -detailed-exitcode`
- alert if exit code = 2
Every change (PR pipeline)
- lint/format/validate
- plan and attach output to PR
- require review for large cost/risk changes
Apply (controlled)
- apply only from CI runner
- state locking enabled
- approvals for prod
Monthly (maintenance)
- review untagged/orphaned resources (cost + drift)
- verify state backend versioning and access controls
- rotate credentials if needed
10) Final takeaway (the line to remember)
If Terraform is your infrastructure brain, state is its memory.
- Remote state makes memory shared and durable
- Locking prevents two people from rewriting memory at once
- Drift is reality changing behind your back
- Recovery is having a calm, tested plan when memory breaks
Master state, and Terraform becomes a power tool instead of a roulette wheel.
Implementation: Remote State + Locking + Drift Checks + Recovery (AWS / Azure / GCP, Laptops or CI/CD)
The goal (what “good” looks like)
A production-grade Terraform setup usually has:
- Remote state (durable + encrypted + versioned)
- Locking (prevents concurrent applies)
- Per-environment state (dev/stage/prod separated)
- Drift detection (scheduled `plan` and alert)
- Recovery plan (restore previous state or rebuild safely)
Option 1: AWS (S3 Remote State + DynamoDB Locking)
A) Create your state storage (best-practice checklist)
S3 bucket
- versioning: ON
- encryption: ON
- public access: BLOCKED
- access logs: ON (optional but great)
- lifecycle: keep versions (don’t aggressively expire)
DynamoDB table
- used only for Terraform locking
- partition key: `LockID` (string)
B) Terraform backend config (example)
```hcl
terraform {
  backend "s3" {
    bucket         = "mycompany-terraform-state"
    key            = "prod/network/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "mycompany-terraform-locks"
    encrypt        = true
  }
}
```
Then:
terraform init -migrate-state
C) How to structure state keys (avoid mega-state)
Use separate keys per environment and per stack:
- dev/network/terraform.tfstate
- dev/platform/terraform.tfstate
- dev/apps/terraform.tfstate
- prod/network/terraform.tfstate
- prod/platform/terraform.tfstate
- prod/apps/terraform.tfstate
This keeps plans fast and reduces blast radius.
Option 2: Azure (Azure Blob Remote State)
A) Create your state storage
- Storage account (private)
- Container for state (e.g., `tfstate`)
- Encryption enabled (default)
- Soft delete/versioning (if available in your policy; highly recommended)
B) Terraform backend config (example)
```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state"
    storage_account_name = "mystatetfstore"
    container_name       = "tfstate"
    key                  = "prod/network/terraform.tfstate"
  }
}
```
Then:
terraform init -migrate-state
Option 3: GCP (GCS Remote State)
A) Create your state storage
- GCS bucket (private)
- versioning: ON
- uniform access: ON
- encryption: default or CMEK (org policy dependent)
B) Terraform backend config (example)
```hcl
terraform {
  backend "gcs" {
    bucket = "mycompany-terraform-state"
    prefix = "prod/network"
  }
}
```
Then:
terraform init -migrate-state
Deploying Terraform: Laptops vs CI/CD (choose your maturity level)
Approach A — Terraform CLI on laptops (OK for early stage)
This is workable if you enforce discipline:
Minimum rules (non-negotiable)
- Remote state only (never local for shared envs)
- Locking enabled (AWS is easiest)
- One apply at a time (respect locks)
- No manual console edits (or document and reconcile immediately)
- Use same Terraform version (pin version)
“Safe laptop apply” workflow
```shell
terraform fmt -recursive
terraform validate
terraform plan -out tfplan
terraform apply tfplan
```
Handling a stuck lock (only if you’re sure no run is active)
terraform force-unlock <LOCK_ID>
When laptops become risky:
The moment you have multiple engineers touching prod weekly, move to CI/CD.
Approach B — CI/CD (recommended for production)
This is the best practice for reliability and auditability.
What CI/CD gives you
- consistent Terraform version and environment
- predictable credentials
- approval gates (especially for prod)
- audit logs
- reduced “works on my machine” failures
Recommended pipeline shape
On Pull Request
- fmt/validate
- plan
- publish plan output as build artifact or PR comment
On Merge (or manual approval)
- re-run `plan` (optional but best)
- `apply` only with approval for prod
Practical environment protection
- dev: auto-apply on merge
- stage: auto-apply on merge (optional)
- prod: manual approval required
Drift detection (works for any cloud)
The simplest drift detector
Run this daily (or every few hours for critical stacks):
terraform init
terraform plan -detailed-exitcode
Interpretation:
- exit 0 → no drift / no changes
- exit 2 → drift or pending changes detected
- exit 1 → error (broken auth, provider issues, etc.)
Best practice: “refresh-only” when you want to sync state safely
If you suspect the state is stale but you’re not ready to change infra:
terraform plan -refresh-only
terraform apply -refresh-only
This updates Terraform’s “memory” without changing real resources.
Recovery playbook (copy/paste into your internal wiki)
Scenario 1: State deleted or overwritten
Best fix: restore a previous version from your backend (this is why versioning matters).
Then:
terraform init
terraform plan
Scenario 2: Terraform wants to recreate everything after refactor
Usually you renamed/moved resources.
Preferred fix: add moved blocks:
```hcl
moved {
  from = aws_security_group.old_name
  to   = aws_security_group.new_name
}
```
Or move state addresses:
terraform state mv 'module.old.x' 'module.new.x'
Scenario 3: Resources exist but are missing from state
Import them:
terraform import <address> <id>
Then re-run:
terraform plan
Scenario 4: Stuck lock
Only if you confirm no apply is running:
terraform force-unlock <LOCK_ID>
What I recommend for you (simple upgrade path)
If you’re on AWS, do this (fastest and strongest):
- S3 + DynamoDB locking
- Move applies to CI/CD for prod (laptops okay for dev/stage initially)
- Schedule daily drift detection per stack
- Split state into `network`, `platform`, and `apps` stacks