Evaluation

Arthur

Jun 11

An LLM benchmark is only useful for as long as it's hard

#llm #evaluation #benchmarks #humaneval

10 min read

Saurav Bhattacharya

Jun 9

I Tested Claude Opus 4, GPT-4.1, GPT-4o, Sonnet 4, and Gemini 2.5 Pro on 10 Adversarial Scenarios. They All Broke on the Same One.

#ai #agents #safety #evaluation

11 min read

Saurav Bhattacharya

Jun 5

Deterministic Checks vs Model-as-Judge: A Tiered Approach to Agent Evaluation

#ai #testing #agents #evaluation

4 min read

Phylis Korir

Jun 3

Monitoring vs Evaluation — What's the Difference (and Why It Matters)

#monitoring #evaluation #projectmanagement #beginners

6 min read

Saurav Bhattacharya

Jun 7

Evals Are Alignment Enforcement: Why Your Safety Strategy Needs Runtime Checks

#ai #security #evaluation #agents

5 min read

guangda

Jun 6

The First Psychiatric Evaluation of AI Agents (original title: 第一次对AI Agent的精神病学评估)

#ai #agents #psychology #evaluation

1 min read

Bala Madhusoodhanan

May 25

Prompt Engineering for Automated Evaluation: Making LLMs the Judge in AI Builder Solutions

#aibuilder #powerplatform #evaluation #powerfuldevs

4 min read

guangda

Jun 5

The First Psychiatric Evaluation of AI Agents

#ai #agents #psychology #evaluation

3 min read

Bohyeon Jang

May 31

Why I used three different critic roles instead of one (and what the eval taught me)

#llm #python #ai #evaluation

6 min read

Tech_Nuggets

Jun 4

Building a domain-specific LLM evaluation set from scratch

#llm #ai #evaluation #opensource

8 min read

Tech_Nuggets

Jun 3

What is an LLM evaluation harness? A deep dive into lm-eval-harness

#llm #ai #evaluation #opensource

7 min read

Prakhar Singh

May 13

Evaluating LLM code reviewers: an offline harness for precision, recall, and routing

#llm #codereview #evaluation #ai

5 min read

ur-grue

May 29

How do you eval LLM output that isn't code?

#ai #llm #evaluation #writing

3 min read

Saurav Bhattacharya

Jun 7

The Alignment Problem Is an HR Problem - And We Should Treat It Like One

#ai #agents #safety #evaluation

4 min read

Maya Andersson

Jun 2

why Cohen's kappa drifts week to week (and what to do about it)

#ai #evaluation #machinelearning #statistics

1 min read

DEV Community
