DEV Community

Olivier Bourgeois for Google Cloud

Surviving the eviction: How to build interrupt-resilient AI workloads on GKE

You did everything right. You containerized your massive model training job, deployed it to Google Kubernetes Engine (GKE), and cleverly routed it to a Spot VM node pool to save up to 90% on compute costs.

Everything is humming along perfectly for 38 hours. Then, a priority on-demand customer needs capacity, Google Cloud reclaims your underlying Spot VM, and your node vanishes.

Whether you are using preemptible Spot VMs to save money or leveraging the Dynamic Workload Scheduler (DWS) to queue for scarce GPUs, you are building on top of ephemeral compute. The hardware will eventually be taken away. To run critical AI workloads successfully on uncommitted capacity, your application architecture must assume that failure is a given.

Here is a practical guide to building interruptible workloads on GKE.

1. Trap the warning

When Google Cloud reclaims a Spot VM, it doesn't pull the power cord immediately. It sends an ACPI signal to the underlying node to begin a power-off cycle. Kubernetes intercepts this and translates it into a SIGTERM signal sent to your running containers.

You have a grace period (up to 15 seconds for non-system pods) between that SIGTERM and the fatal SIGKILL.

Your application must explicitly listen for this signal. When caught, your code should immediately stop accepting new batches, finish its current loop, flush any in-memory data to disk, and exit with a 0 (success) status.

Here is a simple example of how to catch this signal in Python:

import signal
import sys
import time

def handle_sigterm(signum, frame):
    print("Received SIGTERM. Initiating graceful shutdown...")
    # 1. Stop processing new data
    # 2. Flush memory to persistent storage
    # 3. Save final checkpoint
    print("State saved. Exiting cleanly.")
    sys.exit(0)

# Register the signal handler
signal.signal(signal.SIGTERM, handle_sigterm)

# Your main training loop
print("Starting training loop...")
while True:
    # Train model...
    time.sleep(1) 
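The example above exits from inside the signal handler, which can cut a training step off halfway. If you want the loop to finish its current step first, one option is to have the handler only set a flag that the loop checks between steps. The following is a minimal sketch of that variant, not a definitive implementation: `ShutdownFlag`, `train`, and `run_step` are hypothetical names introduced for illustration, and it assumes a Unix-like container where SIGTERM is delivered to the Python process on its main thread.

```python
import os
import signal


class ShutdownFlag:
    """Remembers that SIGTERM arrived so the loop can stop between steps."""

    def __init__(self):
        self.requested = False

    def install(self):
        # Must be called from the main thread.
        signal.signal(signal.SIGTERM, self._handle)
        return self

    def _handle(self, signum, frame):
        # Keep the handler tiny: record the request and return.
        self.requested = True


def train(flag, max_steps, run_step):
    """Run up to max_steps; return how many steps fully completed."""
    completed = 0
    for step in range(max_steps):
        if flag.requested:
            break  # do not start a new step after SIGTERM
        run_step(step)  # hypothetical: one unit of real training work
        completed += 1
    # Flush state and write the final (small) checkpoint here.
    return completed


if __name__ == "__main__":
    flag = ShutdownFlag().install()

    def fake_step(step):
        # Simulate an eviction notice arriving during step 2.
        if step == 2:
            os.kill(os.getpid(), signal.SIGTERM)

    print("completed steps:", train(flag, 10, fake_step))
```

Keeping the handler this small also leaves the whole grace period for the final flush, since a step interrupted halfway never has to be rolled back.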

2. Externalize your checkpoints

If your container dies, everything inside its local filesystem dies with it. To survive an interruption, you must periodically save your progress (model weights, optimizer states, epoch counters, etc.) to an external storage location.

Cloud Storage (GCS) is a common solution for this on Google Cloud.

  • Save frequently: Decide on a checkpointing interval that balances the cost of lost work against the overhead of writing to storage. Saving every epoch or every few thousand steps is common, but this can vary based on your needs.
  • Keep it local: Ensure your GCS buckets are in the same region as your GKE cluster (e.g., us-central1) to minimize latency and avoid outbound data transfer fees.
  • Resume, don't restart: The first thing your container's startup script should do is check that GCS bucket for an existing checkpoint. If one exists, load it and resume from that exact step.
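The save-and-resume cycle described above can be sketched as follows. This is an illustration under stated assumptions: a plain directory stands in for the GCS bucket so the sketch runs anywhere, the checkpoint is a small JSON file rather than real model weights, and `save_checkpoint`, `load_latest_checkpoint`, and `run` are hypothetical helper names. Wiring in an actual Cloud Storage client is left out here.

```python
import json
import os
import tempfile


def save_checkpoint(ckpt_dir, step, state):
    """Write one checkpoint per step; zero-padded names sort by step."""
    os.makedirs(ckpt_dir, exist_ok=True)
    final_path = os.path.join(ckpt_dir, f"step-{step:08d}.json")
    # Write to a temp file first, then rename, so a kill mid-write
    # never leaves a half-written file under the final name.
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, final_path)
    return final_path


def load_latest_checkpoint(ckpt_dir):
    """Return (step, state) from the newest checkpoint, or (0, None)."""
    if not os.path.isdir(ckpt_dir):
        return 0, None
    names = sorted(
        n for n in os.listdir(ckpt_dir)
        if n.startswith("step-") and n.endswith(".json")
    )
    if not names:
        return 0, None
    with open(os.path.join(ckpt_dir, names[-1])) as f:
        data = json.load(f)
    return data["step"], data["state"]


def run(ckpt_dir, total_steps, every):
    """Resume from the latest checkpoint, then train to total_steps."""
    step, state = load_latest_checkpoint(ckpt_dir)
    state = state or {"loss": None}
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step  # placeholder for a real update
        if step % every == 0:
            save_checkpoint(ckpt_dir, step, state)
    return step
```

Because `run` always starts by calling `load_latest_checkpoint`, a fresh start and a post-eviction restart follow the same code path, which makes the restore logic easier to test.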

3. Design for idempotency

"Idempotency" is a fancy way of saying that doing something twice yields the same result as doing it once.

Imagine a batch inference job that reads an image, processes it, and writes the result to a database. If your pod is preempted milliseconds after writing to the database but before it can mark the task as complete, the rescheduled pod will likely process that image again.

If your database blindly inserts new rows, you now have unintentional, duplicate data.

To build an idempotent pipeline:

  • Use UPSERT (update or insert) operations in your database based on a unique identifier (like an image ID).
  • Check if a record already exists before spending expensive GPU cycles processing it.
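Both points can be illustrated with a small sketch. It uses SQLite from the Python standard library purely as a stand-in for whatever database you run; the `results` table, its columns, and the helper names are assumptions made up for this example, and the `ON CONFLICT ... DO UPDATE` syntax shown is SQLite's (it needs SQLite 3.24 or newer).

```python
import sqlite3


def open_results_db(path=":memory:"):
    """Open the database and make sure the results table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results ("
        " image_id TEXT PRIMARY KEY, label TEXT NOT NULL)"
    )
    return conn


def already_processed(conn, image_id):
    """Cheap existence check to run before any GPU work."""
    row = conn.execute(
        "SELECT 1 FROM results WHERE image_id = ?", (image_id,)
    ).fetchone()
    return row is not None


def write_result(conn, image_id, label):
    """Upsert keyed on image_id: a repeated write overwrites the row."""
    conn.execute(
        "INSERT INTO results (image_id, label) VALUES (?, ?) "
        "ON CONFLICT(image_id) DO UPDATE SET label = excluded.label",
        (image_id, label),
    )
    conn.commit()


def process(conn, image_id, infer):
    """Run inference only if no result is stored yet.

    Returns True if inference ran, False if the image was skipped.
    """
    if already_processed(conn, image_id):
        return False
    write_result(conn, image_id, infer(image_id))
    return True
```

With the primary key on `image_id`, a rescheduled pod that repeats a write ends up with one row per image instead of a duplicate.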

4. Decouple work queues for batch processing

If you are running a massive batch processing or inference job across thousands of files, do not write a monolithic Python script that iterates through a static CSV list. If the node dies at row 5,000, managing the state of where to restart is a nightmare.

Instead, decouple the workload:

  1. Publish the work: Break your dataset down into discrete messages and push them into a message broker like Pub/Sub.
  2. Pull the work: Have your Spot VM worker pods pull messages off the queue one at a time or in small batches (e.g., 10 at a time).
  3. Acknowledge completion: Only send an "ACK" (acknowledgment) back to Pub/Sub once the result is safely stored.

If a Spot node is preempted mid-inference, the worker dies before sending the ACK. Once the message's acknowledgment deadline expires, Pub/Sub automatically makes that message available for redelivery, and a surviving worker pod picks it up. No data is lost, and no manual intervention is required. Because a message can be delivered more than once, this pattern depends on the idempotent writes described in section 3.
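To make the pull, process, acknowledge cycle concrete, here is a toy in-memory model of it. It is deliberately not the Pub/Sub client API: `LeaseQueue` and its methods are hypothetical names, and the injectable `clock` parameter exists only so the redelivery behaviour can be exercised without waiting in real time.

```python
import time


class LeaseQueue:
    """Toy queue with at-least-once delivery (not the Pub/Sub API)."""

    def __init__(self, ack_deadline_s, clock=time.monotonic):
        self.ack_deadline_s = ack_deadline_s
        self.clock = clock
        self.ready = []     # message ids waiting to be pulled
        self.leased = {}    # message id -> time its lease expires
        self.payloads = {}  # message id -> payload

    def publish(self, msg_id, payload):
        self.payloads[msg_id] = payload
        self.ready.append(msg_id)

    def pull(self):
        """Lease the next message, or return None if nothing is ready."""
        self._requeue_expired()
        if not self.ready:
            return None
        msg_id = self.ready.pop(0)
        self.leased[msg_id] = self.clock() + self.ack_deadline_s
        return msg_id, self.payloads[msg_id]

    def ack(self, msg_id):
        """Drop a message for good once its result is safely stored."""
        self.leased.pop(msg_id, None)
        self.payloads.pop(msg_id, None)
        if msg_id in self.ready:  # a late ack after the lease expired
            self.ready.remove(msg_id)

    def _requeue_expired(self):
        # Messages whose worker never acked become available again.
        now = self.clock()
        for msg_id, expiry in list(self.leased.items()):
            if expiry <= now:
                del self.leased[msg_id]
                self.ready.append(msg_id)
```

A worker that is preempted simply never calls `ack`, so the message reappears on a later `pull` after the deadline passes. A worker that finishes calls `ack` only after its write has succeeded.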

Key takeaways

Running on ephemeral compute like Spot VMs isn't just an infrastructure choice; it is a design choice. By handling termination signals, checkpointing aggressively to GCS, ensuring idempotent operations, and decoupling your queues, you can unlock massive cost savings and tap into scarce GPU pools without sacrificing reliability.

Top comments (6)

Mykola Kondratiuk

checkpoint granularity is the tricky part - too frequent and training throughput tanks, too sparse and you're replaying hours on eviction. we landed on 30min checkpoints for 6h jobs and it works, but it's never comfortable.

Max Quimby

The 15-second grace period is the detail that bites people, and it's worth underlining how tight it really is. We learned the hard way that you can't actually flush a multi-GB checkpoint inside SIGTERM — by the time the handler fires you have maybe enough time to write a tiny "resume-from" pointer, not the weights themselves. What worked for us was decoupling the two: a background thread checkpoints on a fixed step interval to GCS asynchronously, and the SIGTERM handler only has to record "last good step = N" and exit clean. The fatal SIGKILL then costs you at most one interval of work, regardless of model size.

The other thing I'd add: make the restore path idempotent and test it as a first-class code path, not an afterthought. Spot reclamation will exercise it far more often than you expect, and a checkpoint you can't reliably reload is just expensive disk I/O. Curious whether you've found DWS queueing to be more predictable than raw Spot for longer training runs — the eviction-frequency tradeoff there is something I keep going back and forth on.

Muskan

The SIGTERM-to-checkpoint handoff is the part most people underestimate, GKE gives you about a 30 second grace period on Spot preemption and a large model checkpoint rarely flushes to Cloud Storage in that window. We ended up checkpointing on a step interval rather than only on the termination signal, so the final 30 seconds is just a small delta flush instead of the whole state. The other thing worth measuring is how often you actually get preempted per node pool, because past a certain churn rate the recompute cost from lost progress quietly eats the Spot discount. Did you find a node pool size where eviction frequency made on-demand cheaper overall?

Alex Shev

Interrupt resilience is going to matter more as AI workloads get longer and more expensive. Checkpointing, resumability, and clear job state are boring until the first expensive run disappears halfway through.

Rob

Good piece, and the right framing: you treated this as an architecture problem, not a flag you flip on the node pool.

Echo

GKE preemption is the kind of thing you only fix once it bites you in production. The right answer is usually a queue with idempotent jobs, not more nodes.