DEV Community

# evaluation

Posts

An LLM benchmark is only useful for as long as it's hard

2
Comments
10 min read
I Tested Claude Opus 4, GPT-4.1, GPT-4o, Sonnet 4, and Gemini 2.5 Pro on 10 Adversarial Scenarios. They All Broke on the Same One.

2
Comments
11 min read
Deterministic Checks vs Model-as-Judge: A Tiered Approach to Agent Evaluation

Comments
4 min read
Monitoring vs Evaluation — What's the Difference (and Why It Matters)

5
Comments
6 min read
Evals Are Alignment Enforcement: Why Your Safety Strategy Needs Runtime Checks

1
Comments
5 min read
The First Psychiatric Evaluation of AI Agents (Chinese-language post)

1
Comments
1 min read
Prompt Engineering for Automated Evaluation: Making LLMs the Judge in AI Builder Solutions

5
Comments
4 min read
The First Psychiatric Evaluation of AI Agents

Comments
3 min read
Why I used three different critic roles instead of one (and what the eval taught me)

Comments 2
6 min read
Building a domain-specific LLM evaluation set from scratch

1
Comments
8 min read
What is an LLM evaluation harness? A deep dive into lm-eval-harness

1
Comments
7 min read
Evaluating LLM code reviewers: an offline harness for precision, recall, and routing

2
Comments
5 min read
How do you eval LLM output that isn't code?

Comments 1
3 min read
The Alignment Problem Is an HR Problem - And We Should Treat It Like One

1
Comments 8
4 min read
Why Cohen's kappa drifts week to week (and what to do about it)

7
Comments 1
1 min read