#evaluation
An LLM benchmark is only useful for as long as it's hard
Arthur
Jun 11
#llm #evaluation #benchmarks #humaneval
2 reactions
10 min read

I Tested Claude Opus 4, GPT-4.1, GPT-4o, Sonnet 4, and Gemini 2.5 Pro on 10 Adversarial Scenarios. They All Broke on the Same One.
Saurav Bhattacharya
Jun 9
#ai #agents #safety #evaluation
2 reactions
11 min read

Deterministic Checks vs Model-as-Judge: A Tiered Approach to Agent Evaluation
Saurav Bhattacharya
Jun 5
#ai #testing #agents #evaluation
4 min read

Monitoring vs Evaluation: What's the Difference (and Why It Matters)
Phylis Korir
Jun 3
#monitoring #evaluation #projectmanagement #beginners
5 reactions
6 min read

Evals Are Alignment Enforcement: Why Your Safety Strategy Needs Runtime Checks
Saurav Bhattacharya
Jun 7
#ai #security #evaluation #agents
1 reaction
5 min read

The First Psychiatric Evaluation of AI Agents (Chinese-language post)
guangda
Jun 6
#ai #agents #psychology #evaluation
1 reaction
1 min read

Prompt Engineering for Automated Evaluation: Making LLMs the Judge in AI Builder Solutions
Bala Madhusoodhanan
May 25
#aibuilder #powerplatform #evaluation #powerfuldevs
5 reactions
4 min read

The First Psychiatric Evaluation of AI Agents
guangda
Jun 5
#ai #agents #psychology #evaluation
3 min read

Why I used three different critic roles instead of one (and what the eval taught me)
Bohyeon Jang
May 31
#llm #python #ai #evaluation
2 comments
6 min read

Building a domain-specific LLM evaluation set from scratch
Tech_Nuggets
Jun 4
#llm #ai #evaluation #opensource
1 reaction
8 min read

What is an LLM evaluation harness? A deep dive into lm-eval-harness
Tech_Nuggets
Jun 3
#llm #ai #evaluation #opensource
1 reaction
7 min read

Evaluating LLM code reviewers: an offline harness for precision, recall, and routing
Prakhar Singh
May 13
#llm #codereview #evaluation #ai
2 reactions
5 min read

How do you eval LLM output that isn't code?
ur-grue
May 29
#ai #llm #evaluation #writing
1 comment
3 min read

The Alignment Problem Is an HR Problem - And We Should Treat It Like One
Saurav Bhattacharya
Jun 7
#ai #agents #safety #evaluation
1 reaction
8 comments
4 min read

why Cohen's kappa drifts week to week (and what to do about it)
Maya Andersson
Jun 2
#ai #evaluation #machinelearning #statistics
7 reactions
1 comment
1 min read