DEV Community

Saurav Bhattacharya

Building agent-eval, an open-source framework for testing & evaluating AI agent outputs. Also building WinSentinel for Windows security. I write about agent reliability, eval methodology, & AI safety.

Stop Asserting Equality: How to Test Agents When Every Run Is Different

1
Comments
5 min read

The Reason Your Agent Demo Isn't in Production Has Nothing to Do With the Model

1
Comments 2
4 min read
I Tested Claude Opus 4, GPT-4.1, GPT-4o, Sonnet 4, and Gemini 2.5 Pro on 10 Adversarial Scenarios. They All Broke on the Same One.

2
Comments
11 min read
I Built an Adversarial Eval Framework and Attacked 5 LLMs — Every Single One Failed

6
Comments 4
8 min read
Hallucination Detection Is Not a Model Problem—It's an Infrastructure Problem

1
Comments 1
4 min read
The Alignment Problem Is an HR Problem - And We Should Treat It Like One

1
Comments 8
4 min read
Evals Are Alignment Enforcement: Why Your Safety Strategy Needs Runtime Checks

1
Comments
5 min read
Your AI Agent Drifted Last Night and You Didn't Notice

1
Comments 1
5 min read
Deterministic Checks vs Model-as-Judge: A Tiered Approach to Agent Evaluation

Comments
4 min read