Saurav Bhattacharya

Building agent-eval, an open-source framework for testing & evaluating AI agent outputs. Also building WinSentinel for Windows security. I write about agent reliability, eval methodology, & AI safety.

Seattle, WA

Joined on Apr 5, 2026

https://github.com/sauravbhattacharya001

Jun 12

Stop Asserting Equality: How to Test Agents When Every Run Is Different

#testing #ai #agents #typescript

5 min read

Jun 11

The Reason Your Agent Demo Isn't in Production Has Nothing to Do With the Model

#agents #ai #observability #testing

4 min read

Jun 9

I Tested Claude Opus 4, GPT-4.1, GPT-4o, Sonnet 4, and Gemini 2.5 Pro on 10 Adversarial Scenarios. They All Broke on the Same One.

#ai #agents #safety #evaluation

11 min read

Jun 8

I Built an Adversarial Eval Framework and Attacked 5 LLMs — Every Single One Failed

#ai #llm #security #testing

8 min read

Jun 8

Hallucination Detection Is Not a Model Problem—It's an Infrastructure Problem

#ai #observability #testing #typescript

4 min read

Jun 7

The Alignment Problem Is an HR Problem - And We Should Treat It Like One

#ai #agents #safety #evaluation

4 min read

Jun 7

Evals Are Alignment Enforcement: Why Your Safety Strategy Needs Runtime Checks

#ai #security #evaluation #agents

5 min read

Jun 6

Your AI Agent Drifted Last Night and You Didn't Notice

#ai #agents #testing #devops

5 min read

Jun 5

Deterministic Checks vs Model-as-Judge: A Tiered Approach to Agent Evaluation

#ai #testing #agents #evaluation

4 min read

DEV Community

Badges

Writing Debut
