Sesank Munukutla (Naga)

Event-Driven EC2 Isolation in AWS: Building a Minimal Cloud SOAR Without Buying One

Detection without response is operational noise.

GuardDuty alerts are valuable, but if a human has to read, decide, and manually isolate an instance, your blast radius window is still open.

I wanted high-confidence findings to trigger automatic containment.

So I built a minimal AWS-native SOAR pipeline.

No third-party tooling.

No overengineering.

Just deterministic, event-driven response.


🎯 Objective

Build an automated containment workflow that:

  • Responds only to high-severity GuardDuty findings
  • Automatically isolates compromised EC2 instances
  • Preserves forensic access
  • Avoids recursive execution
  • Is observable and debuggable

All event-driven. No polling. No manual trigger.


🏗 Architecture Overview

GuardDuty Finding
↓
EventBridge Rule (severity >= 7)
↓
Lambda Function (Isolation Logic)
↓
Modify EC2 Security Group → Quarantine SG
↓
SNS Notification (Visibility Layer)

Minimal. Deterministic. Cheap.
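The flow above can be sketched as a small handler. This is an illustrative outline, not the exact function from this project: `handle_finding`, `isolate`, and `notify` are hypothetical names, and the two callables are injected so the flow can run without AWS credentials. The event fields (`detail.resource.resourceType`, `detail.resource.instanceDetails.instanceId`) follow the GuardDuty finding format for EC2 findings.

```python
import json


def extract_instance_id(event):
    """Return the EC2 instance ID from a GuardDuty finding event, or None."""
    resource = event.get("detail", {}).get("resource", {})
    if resource.get("resourceType") != "Instance":
        return None
    return resource.get("instanceDetails", {}).get("instanceId")


def handle_finding(event, isolate, notify):
    """Run one pass of the pipeline: parse, isolate, notify.

    `isolate` and `notify` are injected callables (hypothetical names).
    """
    instance_id = extract_instance_id(event)
    if instance_id is None:
        return {"action": "skipped", "reason": "finding has no EC2 instance"}

    isolate(instance_id)
    detail = event["detail"]
    notify(json.dumps({
        "instance_id": instance_id,
        "finding_type": detail.get("type"),
        "severity": detail.get("severity"),
    }))
    return {"action": "isolated", "instance_id": instance_id}
```

In a Lambda handler, `isolate` would wrap the security group change and `notify` would wrap an SNS `publish` call with the topic ARN of your choice.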


Filtering at the Event Layer (Not Inside Lambda)

Instead of checking severity inside the Lambda function, I filtered directly in EventBridge.

Why this matters:

  • Reduces unnecessary Lambda invocations
  • Makes response criteria explicit
  • Improves audit clarity
  • Lowers operational cost

Example event pattern:

{
  "detail-type": ["GuardDuty Finding"],
  "detail": {
    "severity": [ { "numeric": [">=", 7] } ]
  }
}

Only high-severity findings (7 and above) trigger automation.

Everything else remains visible but is not auto-remediated.
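For reference, here is one way the same pattern could be registered from Python. It is a sketch under assumptions: the rule name is whatever you pass in, `events_client` is expected to be a boto3 `events` client, and the added `source` filter (`aws.guardduty`) is my addition to narrow the match to events that GuardDuty itself emits.

```python
import json


def build_event_pattern(min_severity=7):
    """Build an EventBridge pattern for GuardDuty findings at or above a severity."""
    return {
        "source": ["aws.guardduty"],  # assumption: not in the pattern shown above
        "detail-type": ["GuardDuty Finding"],
        "detail": {"severity": [{"numeric": [">=", min_severity]}]},
    }


def create_rule(events_client, rule_name, min_severity=7):
    """Create or update the rule; `events_client` is a boto3 'events' client."""
    return events_client.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(build_event_pattern(min_severity)),
        State="ENABLED",
    )
```

The rule still needs a target (`put_targets`) pointing at the Lambda function, plus a resource-based permission that lets EventBridge invoke it.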

Quarantine Security Group Design

Containment is not termination.

Terminating an instance destroys forensic evidence.

My quarantine security group:

  • ❌ No outbound internet

  • ❌ No inbound from public IP ranges

  • βœ… Allow only SOC bastion IP

  • βœ… Allow forensic collection host

  • ✅ Optional: allow a monitoring endpoint (VPC Flow Logs are captured outside the instance and need no rule)

The goal is isolation with controlled investigation access.
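A group like this could be created as follows. Treat it as a sketch: the group name, the choice of SSH on port 22, and the two CIDR parameters are assumptions for illustration, and `ec2_client` is expected to be a boto3 `ec2` client. The egress revoke matters because a new VPC security group starts with an allow-all outbound rule.

```python
def create_quarantine_sg(ec2_client, vpc_id, bastion_cidr, forensic_cidr):
    """Create a security group with no egress and SSH ingress from two hosts only."""
    group_id = ec2_client.create_security_group(
        GroupName="quarantine",  # hypothetical name
        Description="Isolation group for compromised instances",
        VpcId=vpc_id,
    )["GroupId"]

    # New VPC security groups start with an allow-all egress rule; remove it.
    ec2_client.revoke_security_group_egress(
        GroupId=group_id,
        IpPermissions=[{"IpProtocol": "-1", "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}],
    )

    # Allow SSH only from the SOC bastion and the forensic collection host.
    # Port 22 is an assumption; use whatever your tooling needs.
    ec2_client.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[{
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            "IpRanges": [
                {"CidrIp": bastion_cidr, "Description": "SOC bastion"},
                {"CidrIp": forensic_cidr, "Description": "Forensic collection host"},
            ],
        }],
    )
    return group_id
```

Because security groups are stateful, replies to the allowed inbound sessions still flow even with every egress rule removed.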

Isolation Logic (Lambda Example)

Core logic:

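import boto3

# Create the client once so warm Lambda invocations reuse it.
ec2 = boto3.client('ec2')

def isolate_instance(instance_id, quarantine_sg_id):
    """Replace all of the instance's security groups with the quarantine group."""
    # Groups replaces the full list, so the original groups are detached here.
    # Assumption: a single network interface; multi-ENI instances may need
    # per-interface changes instead.
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        Groups=[quarantine_sg_id]
    )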

Additional safeguards added:

  • Check instance state before modification

  • Tag instance Quarantined=true

  • Exit if already isolated

  • Log original security groups for rollback

Containment must be idempotent.
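The four safeguards could be combined roughly like this. It is a sketch, not the production function: `isolate_with_safeguards` and the `OriginalSecurityGroups` tag are hypothetical names, the client is passed in rather than created globally, and keeping rollback data in a tag is one option among several.

```python
import json
import logging

logger = logging.getLogger(__name__)


def isolate_with_safeguards(ec2_client, instance_id, quarantine_sg_id):
    """Isolate an instance once; return a short status string."""
    reservations = ec2_client.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]

    # Safeguard 1: skip instances that are going away anyway.
    state = instance["State"]["Name"]
    if state in ("shutting-down", "terminated"):
        return "skipped:" + state

    # Safeguard 2: exit early if a previous run already isolated it.
    tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
    current = [g["GroupId"] for g in instance.get("SecurityGroups", [])]
    if tags.get("Quarantined") == "true" or current == [quarantine_sg_id]:
        return "already-isolated"

    # Safeguard 3: record the original groups before changing anything.
    logger.info(json.dumps({"instance_id": instance_id, "original_sgs": current}))

    ec2_client.modify_instance_attribute(InstanceId=instance_id, Groups=[quarantine_sg_id])
    ec2_client.create_tags(
        Resources=[instance_id],
        Tags=[
            {"Key": "Quarantined", "Value": "true"},
            # Hypothetical tag that keeps rollback data on the instance itself.
            {"Key": "OriginalSecurityGroups", "Value": ",".join(current)},
        ],
    )
    return "isolated"
```

Checking both the tag and the current group list means a run that changed the groups but failed before tagging is still recognized as isolated on the next invocation.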

Idempotency: Preventing Recursive Triggers

When Lambda modifies security groups, CloudTrail records the API call, and any EventBridge rule that matches those events can invoke the function again.

Without safeguards, you risk infinite loops.

Mitigation:

  • Tag check before modification

  • Structured event filtering

  • Explicit function logging

  • DLQ configured for failure cases

Automation that can repeat blindly is dangerous.

Failure Modes I Modeled

Automation amplifies mistakes.

I explicitly accounted for:

  • IAM permission drift

  • Partial security group modification

  • Concurrent findings on same instance

  • Cross-region GuardDuty setup

  • High-volume alert bursts

Mitigations:

  • Dead Letter Queue

  • Lambda concurrency limits

  • CloudWatch error metrics + alarms

  • Explicit structured logs (JSON format)

  • Permission boundary controls

Automation without observability becomes silent failure.
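Two of these mitigations are small enough to sketch. The helper names and log fields below are illustrative assumptions; `lambda_client` is expected to be a boto3 `lambda` client, and the default limit of 1 is only an example value.

```python
import json
import time


def log_event(action, instance_id, **fields):
    """Emit one JSON object per line so log queries can filter on fields."""
    record = {"ts": int(time.time()), "action": action, "instance_id": instance_id}
    record.update(fields)
    line = json.dumps(record, sort_keys=True)
    print(line)  # Lambda forwards stdout to CloudWatch Logs
    return line


def limit_concurrency(lambda_client, function_name, limit=1):
    """Cap parallel executions; `lambda_client` is a boto3 'lambda' client."""
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=limit,
    )
```

A low concurrency limit trades throughput during alert bursts for fewer races between findings on the same instance, so the right value depends on how many findings you expect at once.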

Impact

This reduced:

  • MTTR from minutes to seconds

  • Human triage fatigue

  • Decision bottlenecks

  • Inconsistent containment actions

But the real improvement was consistency.

Humans improvise during incidents.
Code executes predictably.

Trade-Offs & Risks

Auto-isolating compute is not trivial.

You must consider:

  • False positives at high severity

  • Production-critical workloads

  • Stateful applications

  • Already-compromised lateral movement

  • Multi-account architecture

Severity threshold tuning took longer than writing the Lambda function.

That surprised me.

Lessons Learned

  1. Detection maturity does not equal response maturity.

  2. Event-driven architecture scales better than polling remediation.

  3. Idempotency is mandatory.

  4. Multi-account containment becomes architecture work.

  5. Automation exposes operational blind spots you didn't know existed.

Next Iterations

If I evolve this into a more mature Cloud SOAR pattern:

  • Step Functions for multi-stage workflows

  • Automated EBS snapshot before isolation

  • Memory capture integration

  • Slack/Jira enrichment with context

  • Cross-account orchestration via AWS Organizations

  • GuardDuty central delegated admin integration

At that point, it becomes a response framework, not a script.
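As an example of one item on that list, a pre-isolation EBS snapshot step might look like the sketch below. This is not part of the current pipeline: `snapshot_volumes` is a hypothetical helper, `ec2_client` is expected to be a boto3 `ec2` client, and the description text is arbitrary.

```python
def snapshot_volumes(ec2_client, instance_id):
    """Start a snapshot of every EBS volume attached to the instance."""
    reservations = ec2_client.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]

    snapshot_ids = []
    for mapping in instance.get("BlockDeviceMappings", []):
        volume_id = mapping.get("Ebs", {}).get("VolumeId")
        if volume_id is None:
            continue  # skip mappings without an EBS volume
        response = ec2_client.create_snapshot(
            VolumeId=volume_id,
            Description=f"Pre-isolation evidence for {instance_id}",
        )
        snapshot_ids.append(response["SnapshotId"])
    return snapshot_ids
```

`create_snapshot` returns as soon as the snapshot is started, so isolation does not have to wait for it to complete.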

Final Thought

You don't need a commercial SOAR platform to start automating response.

Start with:

  • Deterministic triggers

  • Guardrails

  • Observability

  • Explicit blast radius control

If detection isn't wired to action, it's just telemetry.
