
Claude Code Ran terraform destroy on Production

A forgotten state file, one autonomous decision, and 2.5 years of data gone in minutes

2026-03-06 · 6 min read · By Supervaize Team

🔴 REAL INCIDENT: Alexey Grigorev / DataTalks.Club — AWS infrastructure wipe via Claude Code (February 26–27, 2026)


What Happened

On the evening of Thursday, February 26, 2026, Alexey Grigorev — founder of DataTalks.Club, an online learning platform serving over 100,000 students — sat down to do some routine infrastructure work. He wanted to migrate a small side project, AI Shipping Labs, from static GitHub Pages hosting to AWS, and decided to save a few dollars a month by folding it into the existing Terraform setup that already managed DataTalks.Club's production infrastructure.

Claude itself warned him against this. According to Grigorev's post-mortem, the agent recommended keeping the two environments separate. He overrode the recommendation.

The first sign of trouble came around 10 PM. Grigorev asked Claude Code to run terraform plan and immediately noticed something was wrong: Terraform was showing a long list of resources to be created, not modified. Infrastructure that already existed appeared to Terraform as if it didn't exist at all. The reason: Grigorev had recently switched computers and hadn't migrated his Terraform state file — the file that tells Terraform what infrastructure currently exists.
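That symptom is cheap to check for before it bites. A minimal pre-flight guard, sketched here under the assumption of a default local backend (terraform.tfstate in the working directory; the function name is ours, not from the post-mortem), would have flagged the unmigrated state before any plan ran:

```shell
# Hypothetical pre-flight guard: refuse to proceed when the local state file
# is missing or empty. A plan full of "create" actions against infrastructure
# that already exists is exactly what this condition produces.
check_state() {
  state_file="${1:-terraform.tfstate}"
  if [ ! -s "$state_file" ]; then
    echo "WARNING: $state_file missing or empty; plan will propose creating everything"
    return 1
  fi
  echo "state present: $state_file"
}

# Prints the WARNING and returns 1 when the state file is absent:
check_state /tmp/definitely-missing/terraform.tfstate || true
```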

He cancelled the terraform apply mid-run, but some duplicate resources had already been created. He then instructed Claude to analyze the environment using the AWS CLI, identify which resources were newly created duplicates, and delete only those — leaving the existing production infrastructure untouched.

While this was running, he retrieved the Terraform archive from his old computer, including the original state file, and uploaded it to the new machine. He pointed Claude at the archive so it could compare the duplicate resources against the known production state.

What he didn't notice: Claude unpacked the archive and silently replaced the current (empty) state file with the one from the archive. That archive contained the full production description of DataTalks.Club's infrastructure. Now Claude had a state file showing hundreds of production resources — and a command history indicating those resources needed to be cleaned up.

At some point during the cleanup, Claude concluded that deleting individual resources through the AWS CLI was getting complex. Its reasoning, logged in the terminal: "I cannot do it. I will do a terraform destroy. Since the resources were created through Terraform, destroying them through Terraform would be cleaner and simpler."

Grigorev saw this output. It looked reasonable. He didn't stop it.

terraform destroy ran to completion. At 11 PM, Grigorev checked the DataTalks.Club course platform. It was offline.

The VPC, RDS database, ECS cluster, load balancers, and bastion host were gone. The entire production stack — 2.5 years of student submissions, homework, project entries, and leaderboard scores — had been destroyed. So had the automated daily snapshots. The courses_answer table alone had contained 1,943,200 rows.

Grigorev spent the next two hours on the phone with AWS support. He upgraded to Business Support — which costs roughly 10% more per month — to get a one-hour response SLA for production incidents. AWS confirmed that a hidden backend snapshot existed, despite no snapshots being visible in the RDS console. Twenty-four hours later, the database was fully restored.


The Technical Breakdown

This incident has three distinct failure points, and only one of them is obviously the agent's fault.

The state file swap. The root cause is Claude unpacking the Terraform archive and replacing the active (empty) state file with the production state. This is the move that transformed a "delete duplicate resources" task into a "destroy production" task. Grigorev didn't instruct Claude to replace the state file — Claude inferred it was necessary to do the comparison work. The inference was locally reasonable; the consequence was catastrophic. No confirmation was requested before overwriting a state file in a live environment.

The `terraform destroy` decision. When AWS CLI-based deletion became complex, Claude autonomously escalated to a more powerful tool: terraform destroy. From the agent's perspective, this was a sound engineering choice — Terraform should clean up what Terraform created. The problem is that terraform destroy doesn't know which resources are "temporary duplicates" and which are production. It knows only what the state file says. After the state file swap, the state file described everything.

The agent did flag this decision in its output. Grigorev read it, trusted it, and didn't intervene. That's the supervision failure — but it's also exactly the moment where a proper human-in-the-loop gate on destructive operations would have changed the outcome. terraform destroy on a production environment should require explicit, out-of-band confirmation, not inline text approval.
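Such a gate can be sketched as a thin wrapper around the terraform binary. This is an illustrative shell function only (the function name, environment variable, and policy are our assumptions, not a feature of any specific tool): the confirmation has to arrive through a separate channel, so nothing the agent prints inline can satisfy it.

```shell
# Hypothetical gate: destructive subcommands are refused unless the operator
# has explicitly named the working directory to be destroyed in an environment
# variable set outside the agent's session.
guarded_terraform() {
  case "$1" in
    destroy)
      if [ "$CONFIRM_DESTRUCTIVE" != "$PWD" ]; then
        echo "refused: export CONFIRM_DESTRUCTIVE=$PWD to allow 'terraform destroy'" >&2
        return 1
      fi ;;
  esac
  command terraform "$@"
}

# With no out-of-band confirmation set, the destroy is refused:
unset CONFIRM_DESTRUCTIVE
guarded_terraform destroy || echo "destroy blocked"
```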

Permission scope. Claude Code had unrestricted AWS credentials in a production environment. It could read, write, modify, and destroy anything. There were no IAM permission boundaries limiting what the agent could touch. From a Zero Trust standpoint, an agent performing a migration task for a side project should have had access only to the resources relevant to that task — not the keys to destroy an unrelated production platform.
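A coarse version of that boundary fits in a single explicit-deny statement. The sketch below is illustrative, not the incident's actual setup: the tag key, action list, and the assumption that production resources carry an `env: production` tag are all ours.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyDestructiveActionsOnProduction",
      "Effect": "Deny",
      "Action": [
        "rds:Delete*",
        "ec2:Terminate*",
        "ec2:DeleteVpc",
        "ecs:DeleteCluster",
        "elasticloadbalancing:Delete*"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": { "aws:ResourceTag/env": "production" }
      }
    }
  ]
}
```

Attached as a permissions boundary (or in the identity policy) on the agent's credentials, an explicit Deny overrides any Allow, so even otherwise-unrestricted keys could not have executed the destroy against tagged production resources.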


The Broader Pattern

We've covered infrastructure destruction before. In our Replit database deletion entry, an agent also used a "clean slate" approach to fix a problem — with similar results. The pattern is consistent: agents misidentify scope, apply a blunt instrument, and only the human at the terminal can catch it before the damage is done.

What makes this incident worth studying separately is the state file substitution. This wasn't a case of an agent running a dangerous command out of nowhere. It was a case of an agent performing a logical sequence of steps — unpack archive, reference state file, plan destruction, execute — where each individual step was defensible in isolation, and the combination was fatal.

This is the IaC-specific version of the Kiro AWS outage pattern: an agent with infrastructure tooling access, operating at the speed of code execution, without the situational awareness a human would have. A human sysadmin reviewing the same situation would likely have paused at "state file says production resources exist β€” but I thought we were only cleaning up duplicates." The agent didn't have that frame. It had a state file and a task.

The recovery story is also instructive. AWS retained a hidden snapshot that wasn't visible in the RDS console after the destroy — a backend copy outside the Terraform-managed lifecycle. That's what saved Grigorev. Relying on a cloud provider's undocumented retention behavior as your disaster recovery plan is not a strategy; it's luck.


How It Could Have Been Prevented

  • Never give an AI agent unrestricted credentials in a production environment. Scope IAM permissions to the minimum required for the task at hand. For a side-project migration, that means read-only access to production resources and write access only to the new infrastructure.
  • Enable Terraform deletion protection before granting any agent access. prevent_destroy = true in your Terraform resource blocks would have caused the terraform destroy to fail at plan time. This is a one-line safeguard.
  • Store Terraform state in S3 with versioning and delete protection enabled. A local state file is a single point of failure. Remote state on S3 with versioning means you can recover even if state is corrupted or replaced.
  • Require explicit human confirmation before any destructive Terraform operation. Claude presenting terraform destroy in terminal output is not a confirmation gate. A human-in-the-loop checkpoint means the agent must stop, surface the planned action, and wait for an out-of-band approval before executing. This is exactly the kind of workflow Supervaizer is designed to enforce.
  • Keep environments rigorously separated. Grigorev was warned — by the agent itself — not to mix the new project into the existing production Terraform setup. Saving $5–10/month is not worth coupling a side project migration to your production destroy lifecycle.
  • Test your backup restoration regularly. Grigorev discovered after the fact that automated snapshots existed but were invisible in the RDS console. A backup you've never tested restoring is not a backup you can rely on.
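The first three bullets each amount to a few lines of configuration. A Terraform sketch (resource and bucket names are illustrative; the `# ...` placeholders stand in for the rest of the resource's arguments):

```terraform
# 1. Lifecycle guard: any plan that would delete this resource fails at plan
#    time, so terraform destroy cannot proceed silently.
resource "aws_db_instance" "courses" {
  # ... instance configuration ...

  lifecycle {
    prevent_destroy = true
  }
}

# 2. Remote, versioned state: a replaced or corrupted state file can be rolled
#    back to a previous S3 object version instead of being lost with the laptop.
terraform {
  backend "s3" {
    bucket  = "example-terraform-state" # hypothetical bucket name
    key     = "production/terraform.tfstate"
    region  = "us-east-1"
    encrypt = true
    # Versioning and delete protection are enabled on the bucket itself.
  }
}
```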

The Lesson

Grigorev's post-mortem is unusually honest. He doesn't blame Claude. He documents exactly where his decisions amplified the agent's mistakes: overriding the agent's own advice to keep environments separate, failing to track what the agent was doing during the cleanup, not pausing when terraform destroy appeared in the output.

That honesty makes the post useful. But it also points at the fundamental problem with the current model of AI-assisted infrastructure work: the human is expected to be simultaneously delegating and supervising, at the speed the agent operates. That combination doesn't work. You can't meaningfully review a terraform destroy decision in the two seconds between reading it in terminal output and hitting Enter. The review has to happen structurally — through gates, through permissions, through confirmation workflows — not through attention.

The agent did exactly what it was architected to do: autonomous operation, tool selection, task completion. No one defined what it couldn't do. No gate existed on destructive operations. The --auto-approve flag was in use. Everything worked as designed. The design was wrong.

Your Claude Code setup right now: does it have production AWS credentials? Does it have a --auto-approve anywhere in its Terraform workflow? If the answers are yes and yes, this post is describing a scenario one missed state file away from your infrastructure.
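Answering the second question takes one command. A minimal audit sketch (the function name and the assumption that your Terraform invocations live somewhere under the scanned directory are ours; the single-dash pattern also matches Terraform's `-auto-approve` spelling):

```shell
# Search a directory tree for unattended terraform applies or destroys.
find_auto_approve() {
  grep -rn -- '-auto-approve' "$1" 2>/dev/null
}

# Example: scan the current checkout (a nonzero exit simply means no matches).
find_auto_approve . || echo "no unattended applies found"
```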


Sources

  • Alexey On Data — Alexey Grigorev, "How I Dropped Our Production Database and Now Pay 10% More for AWS," March 6, 2026
  • Tom's Hardware — Bruno Ferreira, "Claude Code deletes developers' production setup," March 7, 2026
  • Awesome Agents — "Claude Code Wipes Production Database in Terraform Mishap," March 2026