DEV Community

ashg2099


I Built an Open-Source Multi-Agent Fact-Checker — Here's How It Works

Problem Statement

We have a misinformation problem. But more specifically, we have a speed problem.
A journalist spots a suspicious claim. They search for sources. Cross-reference databases. Call experts. Write a verdict. Get it edited. Publish, maybe 6 hours later. Maybe 3 days later.
Meanwhile, the original claim has been screenshotted, reposted, quoted in newsletters, and cited in arguments across five platforms.
I wanted to build something that closed that gap. Not a chatbot that guesses. A proper pipeline, one that retrieves real evidence, reasons from it, and tells you why it reached a verdict.
That's what Sift is.

What is Sift?

Sift (Source Inspection & Fact-checking Tool) is an open-source multi-agent AI pipeline that takes any text, extracts every factual claim, retrieves grounded evidence, and returns auditable verdicts — TRUE, FALSE, or UNCERTAIN, with cited sources and full reasoning chains.
Paste a news article. A politician's speech. A viral statistic. A WhatsApp forward. Sift breaks it into individual claims and fact-checks each one independently.

Why Multi-Agent?

The naive approach is to ask an LLM: "Is this claim true?"
The problem: LLMs hallucinate. They have knowledge cutoffs. They're confidently wrong in ways that are hard to detect. And critically, they don't show their work.
A single LLM call can't reliably handle the full pipeline of:

  • Extracting structured claims from noisy text
  • Retrieving dated, traceable evidence from live sources
  • Reasoning across conflicting evidence without confabulating
  • Adversarially reviewing its own conclusions for overconfidence
  • Finding corrections when something is wrong

Each of these is a distinct task that benefits from its own prompt, its own tools, and its own failure modes. That's why I built five separate agents, orchestrated with LangGraph.

The 5-Agent Pipeline

Agent 1 — Claim Extractor

A single paragraph can contain 4-5 distinct factual claims. Generic LLMs miss them or conflate them.
This agent uses LLaMA 3.3 70B via Groq with Pydantic structured output to extract every distinct verifiable claim from the input text. The output is a typed list of claims — exact text, no paraphrasing, no hallucination.
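The "exact text, no paraphrasing" guarantee can be enforced after the model call rather than trusted. Here is a minimal sketch of that check, not Sift's actual code: it uses a stdlib dataclass as a stand-in for the Pydantic model, and the names `Claim` and `validate_claims` plus the sample article are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    text: str   # the claim exactly as it appears in the input
    start: int  # character offset of the claim in the input


def validate_claims(source: str, candidates: list[str]) -> list[Claim]:
    """Keep only candidates that appear verbatim in the source text."""
    claims: list[Claim] = []
    seen: set[str] = set()
    for candidate in candidates:
        text = candidate.strip()
        offset = source.find(text)
        # Reject empty strings, paraphrases (not found), and duplicates.
        if not text or offset == -1 or text in seen:
            continue
        seen.add(text)
        claims.append(Claim(text=text, start=offset))
    return claims


# Illustrative input only; the second candidate is a paraphrase.
article = "The Fed raised rates in March 2024. Inflation fell to 3.2% that month."
model_output = [
    "The Fed raised rates in March 2024",
    "Inflation dropped sharply",
    "Inflation fell to 3.2% that month",
]
claims = validate_claims(article, model_output)
print([c.text for c in claims])
```

Anything the model rewords fails the substring check and is dropped, so downstream agents only ever see text the author actually wrote.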

Agent 2 — Evidence Hunter

LLMs hallucinate citations. You need real, retrievable, dated evidence.
This agent runs HyDE retrieval across 4,270 indexed Guardian + Wikipedia chunks stored in pgvector, then hits Tavily live web search for recent data.
Why HyDE instead of standard RAG?
Standard RAG embeds the raw claim and searches for similar text. A short factual claim like "The Fed raised rates in March 2024" has a weak semantic signal on its own.
HyDE (Hypothetical Document Embeddings) generates a hypothetical document that would contain the answer — something like a news article excerpt — then embeds that. The result is a richer semantic signal and significantly better retrieval recall on short factual claims.
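To make the difference concrete, here is a toy sketch of both retrieval styles. Everything in it is an assumption for illustration: `embed` is a bag-of-words counter standing in for all-MiniLM-L6-v2, `write_hypothetical_doc` is a hard-coded stub for the LLM call, and the two-chunk corpus is contrived so the effect is visible.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a stand-in for a real sentence encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def write_hypothetical_doc(claim: str) -> str:
    """Stub for the LLM call that drafts a passage which would contain the answer."""
    return (
        f"{claim} The Federal Reserve announced its decision on the federal funds "
        "rate after the policy meeting, citing inflation and interest rate targets."
    )


def rank(query: str, chunks: list[str], k: int = 1) -> list[str]:
    query_vec = embed(query)
    return sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)[:k]


def raw_search(claim: str, chunks: list[str], k: int = 1) -> list[str]:
    return rank(claim, chunks, k)  # standard RAG: embed the claim itself


def hyde_search(claim: str, chunks: list[str], k: int = 1) -> list[str]:
    return rank(write_hypothetical_doc(claim), chunks, k)  # embed the draft instead


CLAIM = "The Fed raised rates in March 2024"
CHUNKS = [
    "The Federal Reserve held a policy meeting on the federal funds rate and interest rate targets.",
    "The Guardian reviewed a new film about a bank heist in March.",
]
print(raw_search(CLAIM, CHUNKS))
print(hyde_search(CLAIM, CHUNKS))
```

In this contrived corpus the raw claim matches the film review on incidental words like "March", while the hypothetical passage shares vocabulary with the relevant chunk. A real encoder is far less literal than word counts, so treat this as an illustration of the mechanism, not a benchmark.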

Agent 3 — Synthesis Agent

This agent reasons strictly from retrieved evidence. It returns TRUE / FALSE / UNCERTAIN with a calibrated confidence score.
Critically — if evidence is thin or conflicting, it returns UNCERTAIN instead of confabulating certainty. This was one of the hardest things to get right. LLMs naturally trend toward false confidence. I had to explicitly prompt for epistemic humility and add Pydantic validators to catch zero-confidence outputs.
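Prompting for humility helps, but a deterministic guard after the model call is harder to talk out of abstaining. The sketch below shows one way to do that, assuming each evidence chunk has already been labeled with a stance. The `Evidence` type, the `gate_verdict` function, and the 0.5 confidence cap are hypothetical, not taken from Sift.

```python
from dataclasses import dataclass
from typing import Literal

Stance = Literal["supports", "refutes", "neutral"]


@dataclass(frozen=True)
class Evidence:
    source: str
    stance: Stance


def gate_verdict(
    proposed: str, confidence: float, evidence: list[Evidence]
) -> tuple[str, float]:
    """Downgrade to UNCERTAIN when the evidence cannot carry the proposed verdict."""
    supports = sum(e.stance == "supports" for e in evidence)
    refutes = sum(e.stance == "refutes" for e in evidence)
    if supports == 0 and refutes == 0:
        return "UNCERTAIN", 0.0                    # nothing speaks to the claim
    if supports > 0 and refutes > 0:
        return "UNCERTAIN", min(confidence, 0.5)   # sources disagree
    expected = "TRUE" if supports else "FALSE"
    if proposed != expected:
        return "UNCERTAIN", min(confidence, 0.5)   # verdict contradicts the evidence
    return proposed, confidence


mixed = [Evidence("guardian", "supports"), Evidence("wikipedia", "refutes")]
print(gate_verdict("TRUE", 0.9, mixed))
```

The guard never raises confidence and never invents a verdict; it can only pass the model's answer through or replace it with UNCERTAIN.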

Agent 4 — Critic Agent

Synthesis agents tend toward overconfidence when evidence partially supports a claim. You need an adversarial check.
This agent independently reviews every verdict. It flags unsupported reasoning, recognizes when a gap such as 1.1°C vs 1.19°C reflects imprecision rather than a false claim, and adjusts confidence downward when warranted.
This is the step most fact-checking systems skip — and it's the one that matters most for borderline claims.
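The 1.1°C vs 1.19°C case is the kind of check that can also be made mechanical. Below is a small sketch of a numeric tolerance test a critic step could apply before calling a figure wrong. The helper names and the 10% relative tolerance are assumptions chosen for illustration; Sift's critic is an LLM agent and may handle this differently.

```python
import re


def extract_numbers(text: str) -> list[float]:
    """Pull decimal numbers out of a sentence."""
    return [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", text)]


def is_approximation(claimed: float, reported: float, rel_tol: float = 0.10) -> bool:
    """True when the claimed figure is within rel_tol of the reported one."""
    if reported == 0:
        return claimed == 0
    return abs(claimed - reported) / abs(reported) <= rel_tol


claimed = extract_numbers("The increase was 1.1°C")[0]
reported = extract_numbers("The increase was 1.19°C")[0]
print(is_approximation(claimed, reported))
```

A relative tolerance is used rather than rounding because 1.19 rounds to 1.2, not 1.1, so a strict rounding comparison would still call the claim false.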

Agent 5 — Correction Agent

Knowing something is false isn't enough. Users need to know what IS true.
This agent fires only on FALSE or UNCERTAIN verdicts. It runs a targeted live search to find the correct information and surfaces it with a cited source. Because it is conditional, it spends no tokens on TRUE verdicts.

Why LangGraph?

The pipeline isn't linear for every claim. Some claims have no evidence, so they skip synthesis and go straight to the Critic Agent. Some need multiple retrieval attempts and loop back.
LangGraph's state machine handles conditional branching, loops, and shared state across agents cleanly. The state is typed with TypedDict — every agent reads from and writes to the same state object.
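Here is a rough sketch of what a typed state and its routing decisions might look like. The field names, node names, and retry cap are hypothetical, and the code is plain Python so it runs without LangGraph installed. In LangGraph, routing functions like these are typically registered on a `StateGraph` via `add_conditional_edges`.

```python
from typing import Literal, Optional, TypedDict


class ClaimState(TypedDict):
    claim: str
    evidence: list[str]
    retrieval_attempts: int
    verdict: Optional[str]  # "TRUE", "FALSE", "UNCERTAIN", or None before synthesis


MAX_RETRIEVAL_ATTEMPTS = 2  # hypothetical cap on the retrieval loop


def route_after_retrieval(
    state: ClaimState,
) -> Literal["retrieve", "synthesize", "critic"]:
    if state["evidence"]:
        return "synthesize"
    if state["retrieval_attempts"] < MAX_RETRIEVAL_ATTEMPTS:
        return "retrieve"   # loop back for another attempt
    return "critic"         # no evidence found: skip synthesis


def route_after_critic(state: ClaimState) -> Literal["correct", "end"]:
    # The correction step runs only for FALSE or UNCERTAIN verdicts.
    return "correct" if state["verdict"] in ("FALSE", "UNCERTAIN") else "end"


state: ClaimState = {
    "claim": "The Fed raised rates in March 2024",
    "evidence": [],
    "retrieval_attempts": 0,
    "verdict": None,
}
print(route_after_retrieval(state))
```

Keeping the routers as pure functions of the state makes each branch easy to unit test without running any agent.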

Infrastructure

FastAPI returns a task ID immediately. Celery + Redis runs the pipeline in the background. The client polls for results.
Redis cache stores results for 7 days — the same viral claim doesn't cost tokens twice. Cache hits at the API layer return in under 1 second, before Celery even runs.
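The caching idea can be sketched in a few lines. This is an in-memory stand-in for Redis, not Sift's code: the `sift:result:` key prefix, the normalization rule (lowercase, collapsed whitespace), and the `TTLCache` class are assumptions made for illustration.

```python
import hashlib
import time
from typing import Optional

CACHE_TTL_SECONDS = 7 * 24 * 60 * 60  # 7 days


def cache_key(text: str) -> str:
    """Hash a whitespace- and case-normalized copy of the text."""
    normalized = " ".join(text.lower().split())
    return "sift:result:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class TTLCache:
    """In-memory stand-in for a Redis key written with an expiry."""

    def __init__(self) -> None:
        self._store: dict[str, tuple[float, dict]] = {}

    def set(self, key: str, value: dict, ttl: int = CACHE_TTL_SECONDS,
            now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        self._store[key] = (now + ttl, value)

    def get(self, key: str, now: Optional[float] = None) -> Optional[dict]:
        now = time.time() if now is None else now
        entry = self._store.get(key)
        if entry is None or entry[0] <= now:
            self._store.pop(key, None)  # drop expired entries lazily
            return None
        return entry[1]


cache = TTLCache()
key = cache_key("The  Fed raised rates in March 2024")
cache.set(key, {"verdict": "UNCERTAIN"}, now=0.0)
print(cache.get(cache_key("the fed raised rates in march 2024 "), now=60.0))
```

Normalizing before hashing means trivially different copies of the same viral claim share one cache entry. How aggressive that normalization should be is a judgment call, since over-normalizing risks merging claims that differ in meaning.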
LangFuse traces every LLM call — prompt, output, latency, token count — so I can debug agent failures without guessing.

Tech Stack

LLM: LLaMA 3.3 70B via Groq API
Embeddings: all-MiniLM-L6-v2 via HuggingFace Inference API
Orchestration: LangGraph state machine
RAG: HyDE + pgvector hybrid search
Vector DB: PostgreSQL + pgvector
API: FastAPI + Pydantic
Task Queue: Celery + Redis
Evidence Sources: Tavily (live) + Guardian API + Wikipedia
Observability: LangFuse + Prometheus + Grafana

Try It

The project is fully open source and Dockerized. One command runs the entire stack:

git clone https://github.com/ashg2099/Sift.git
cd Sift
cp .env.example .env
# Add your API keys (Groq, Tavily, HuggingFace — all free tiers)
docker compose up

Open http://localhost:8000 and start verifying claims.
I'm actively looking for feedback — especially where it breaks. If you try it, I'd love to know what it gets wrong.

GitHub: https://github.com/ashg2099/Sift
LinkedIn: https://www.linkedin.com/in/ashwin-gururaj-93943816a/

Top comments (2)

Harjot Singh

A multi-agent fact-checker is a great use case precisely because fact-checking is itself a verify-against-sources pipeline, which is the one thing LLMs must not do from memory. The irony you have to design around: the tool catching misinformation is built on a model that hallucinates, so the architecture has to make the agents incapable of fabricating a verdict. The patterns that make it trustworthy: every claim decomposed into checkable sub-claims, each grounded in a retrieved source the verdict cites (so a human can audit it), and a hard abstain when the evidence is thin rather than a confident maybe-true.

The speed argument is real and important, but speed is only valuable if the verdict is right. A fast wrong fact-check is worse than a slow one because it launders misinformation with an authoritative-looking stamp. So the bar is fast AND grounded AND willing to say insufficient evidence. Multi-agent helps if the agents play distinct roles (retriever, skeptic, synthesizer) rather than all doing the same thing; real division of labor beats redundancy. That ground-every-verdict-and-let-it-abstain discipline is core to how I think about verification in Moonshift.

How do you stop the synthesis agent from overstating confidence when the retrieved evidence is actually weak?

ashg2099

Really appreciate this, this is exactly the kind of comment that pushes the thinking further. Thank you for being the first to engage with it this deeply.

This is exactly the tension I spent the most time on, and honestly I still haven't fully solved it.
The synthesis agent's natural failure mode is false confidence. When evidence is thin, a vanilla LLM will still pick a side. It prefers coherence over honesty.

Three things that helped:

1. Explicit epistemic instruction in the prompt: Not just "return TRUE/FALSE/UNCERTAIN" but "if the retrieved evidence does not directly support or contradict the claim, you MUST return UNCERTAIN. A confident wrong answer is worse than no answer." Framing the abstention as the responsible choice — not a failure — changed the output distribution noticeably.
2. Pydantic validators on the output: If the confidence score is above 0.85 but the evidence count is below 2, the validator flags it for the critic. Forces a second look before the verdict is finalized.
3. The critic agent as a structural check: It sees the verdict AND the evidence. Its sole job is to ask: Does this evidence actually warrant this confidence? It adjusts downward when it doesn't. This is the step that catches the "1 supporting chunk → 0.9 confidence" failure mode.
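For anyone who wants to see point 2 in code, here is a rough sketch of the rule. It uses a stdlib dataclass with `__post_init__` as a stand-in for a Pydantic validator. The class and field names are hypothetical; only the 0.85 and 2 thresholds come from the description above.

```python
from dataclasses import dataclass, field

HIGH_CONFIDENCE = 0.85  # confidence above this needs enough evidence behind it
MIN_EVIDENCE = 2        # fewer chunks than this counts as thin evidence


@dataclass
class Verdict:
    label: str
    confidence: float
    evidence: list[str] = field(default_factory=list)
    needs_critic_review: bool = False

    def __post_init__(self) -> None:
        if self.label not in {"TRUE", "FALSE", "UNCERTAIN"}:
            raise ValueError(f"unknown label: {self.label}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        # High confidence on thin evidence is flagged for review, not rejected.
        if self.confidence > HIGH_CONFIDENCE and len(self.evidence) < MIN_EVIDENCE:
            self.needs_critic_review = True


thin = Verdict("TRUE", 0.9, ["one supporting chunk"])
print(thin.needs_critic_review)
```

Flagging rather than raising keeps the pipeline moving: the verdict still reaches the critic, which decides whether to lower the confidence.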

On your point about distinct roles — this was a deliberate architectural decision, not an accident. Each agent is intentionally blind to the others' internals:

  1. The Retriever has no opinion — it fetches, it doesn't reason
  2. The Synthesizer reasons only from retrieved chunks — no live search, no memory
  3. The Critic never sees the Synthesizer's prompt — only the verdict and the evidence it was based on
  4. The Corrector only fires on FALSE/UNCERTAIN — it never touches TRUE verdicts

The goal was to make cross-agent reinforcement of the same mistake structurally impossible. If they all had the full context of each other, they'd just agree. Division of ignorance matters as much as division of labor.
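One way to picture that separation is to hand each agent a projection of the shared state instead of the whole thing. The sketch below is hypothetical (the field names and the `critic_view` helper are not Sift's actual schema) and only illustrates the idea for the critic.

```python
from typing import TypedDict


class PipelineState(TypedDict):
    claim: str
    evidence: list[str]
    synthesis_prompt: str
    verdict: str
    confidence: float


class CriticView(TypedDict):
    claim: str
    evidence: list[str]
    verdict: str
    confidence: float


def critic_view(state: PipelineState) -> CriticView:
    """Copy only the fields the critic is allowed to see."""
    return CriticView(
        claim=state["claim"],
        evidence=list(state["evidence"]),  # copy so the critic cannot mutate state
        verdict=state["verdict"],
        confidence=state["confidence"],
    )


full_state: PipelineState = {
    "claim": "The Fed raised rates in March 2024",
    "evidence": ["chunk A"],
    "synthesis_prompt": "You are a careful fact-checker...",
    "verdict": "UNCERTAIN",
    "confidence": 0.4,
}
view = critic_view(full_state)
print(sorted(view))
```

The critic gets the verdict and the evidence it rests on, and nothing about how the synthesizer was instructed.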

On speed vs accuracy — you're right that a fast wrong fact-check is worse than a slow correct one. It doesn't just miss — it launders misinformation with an authoritative stamp. That's exactly why Sift will return UNCERTAIN rather than force a verdict when evidence is thin. A stamped UNCERTAIN is still useful to a journalist. A stamped FALSE that's wrong is dangerous.
What I haven't solved: when ALL retrieved evidence points one direction, but the retrieval itself was bad — garbage in, confident garbage out. That's a retrieval quality problem, not a synthesis problem. HyDE helps, but doesn't eliminate it.

Curious how Moonshift handles the retrieval side: do you gate on source quality before it even reaches synthesis?