RAG-Based Testing Series — Part 6: Automating RAG Quality Checks in CI/CD
"A test that only runs when you remember to run it isn't really a test. It's a hope."
We've built something real over this series.
In Part 2, we gave retrieval quality a number — Precision@K, Recall@K, MRR.
In Part 3, we gave hallucination detection a number — faithfulness scoring with RAGAS.
In Part 4, we tested the edge cases that break RAG systems in production.
In Part 5, we assembled all of that into a structured, reusable framework with one command to run everything.
But there's still a problem. 🔴
The framework only runs when someone decides to run it.
And in a real team, "someone will run the tests before deploying" is not a guarantee. It's an assumption. And assumptions fail at the worst possible moments.
- Someone updates the knowledge base at 5pm on a Friday.
- Someone tweaks the system prompt and doesn't realise it changed retrieval behaviour.
- Someone upgrades the embedding model and the similarity scores shift quietly.
None of these trigger a test run. None of these get caught. And your users discover the regression before your team does.
Part 6 fixes this.
We're wiring the framework from Part 5 into a GitHub Actions CI/CD pipeline so that RAG quality checks run automatically — on every relevant change, without anyone having to remember. 🤖
🗺️ What We're Building
By the end of this article, you'll have:
.github/
└── workflows/
    └── rag_quality_checks.yml ← GitHub Actions workflow
rag_test_framework/
├── config/
│ └── settings.py
├── core/
│ ├── retriever.py
│ ├── evaluator.py
│ └── rag_pipeline.py
├── tests/
│ ├── conftest.py
│ ├── test_retrieval.py
│ ├── test_faithfulness.py
│ └── test_edge_cases.py
├── data/
│ └── test_cases.json
├── reports/
│ └── (auto-generated, uploaded as CI artifacts)
├── run_tests.py
└── requirements.txt
The workflow will:
- Trigger automatically on pushes that touch relevant files
- Install dependencies
- Run the full test suite
- Upload the test report as a downloadable artifact
- Post a summary to the GitHub Actions summary page
- Block the pipeline if any test fails — no silent regressions
Let's build it step by step. 🛠️
⚙️ Step 1 — Store Secrets Safely
Your framework needs an OpenAI API key. You never hardcode secrets in a repository.
In GitHub:
- Go to your repository → Settings → Secrets and variables → Actions
- Click New repository secret
- Name: OPENAI_API_KEY
- Value: your actual OpenAI API key
That's it. GitHub encrypts it. Your workflow accesses it as ${{ secrets.OPENAI_API_KEY }} — never exposed in logs or code.
Now update config/settings.py to read from the environment variable (this already works locally too if you set it with export OPENAI_API_KEY=...):
# config/settings.py
import os

OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

if not OPENAI_API_KEY:
    raise EnvironmentError(
        "OPENAI_API_KEY environment variable is not set.\n"
        "Set it locally with: export OPENAI_API_KEY=your-key\n"
        "In CI, add it as a GitHub Actions secret."
    )
Failing loudly with a clear message is better than failing cryptically with an authentication error three steps later. ✅
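If more settings end up coming from the environment, the same fail-loudly check can be factored into a small helper. This is only a sketch: `require_env` is a hypothetical name and is not part of the framework from Part 5.

```python
import os
from typing import Mapping, Optional


def require_env(
    name: str,
    hint: str = "",
    environ: Optional[Mapping[str, str]] = None,
) -> str:
    """Return an environment variable's value, or raise with a clear message.

    `environ` defaults to os.environ; passing a plain dict keeps the helper
    easy to unit test. Empty and whitespace-only values count as missing.
    """
    env = os.environ if environ is None else environ
    value = env.get(name, "").strip()
    if not value:
        message = f"{name} environment variable is not set."
        if hint:
            message += f"\n{hint}"
        raise EnvironmentError(message)
    return value


# Possible usage in config/settings.py:
#   OPENAI_API_KEY = require_env(
#       "OPENAI_API_KEY",
#       hint="In CI, add it as a GitHub Actions secret.",
#   )
```

Treating a blank value as missing matters in CI: a secret that is referenced but not defined typically arrives as an empty string rather than as an absent variable.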
📄 Step 2 — The GitHub Actions Workflow
Create this file at .github/workflows/rag_quality_checks.yml:
name: RAG Quality Checks

on:
  push:
    paths:
      # Run when test cases or knowledge base changes
      - 'rag_test_framework/data/**'
      # Run when any core framework code changes
      - 'rag_test_framework/core/**'
      # Run when configuration (thresholds, models) changes
      - 'rag_test_framework/config/**'
      # Run when tests themselves change
      - 'rag_test_framework/tests/**'
      # Run when dependencies change
      - 'rag_test_framework/requirements.txt'
  pull_request:
    paths:
      - 'rag_test_framework/data/**'
      - 'rag_test_framework/core/**'
      - 'rag_test_framework/config/**'
      - 'rag_test_framework/tests/**'
      - 'rag_test_framework/requirements.txt'
  # Allow manual trigger from the GitHub Actions UI
  workflow_dispatch:

jobs:
  rag-quality-checks:
    name: RAG Quality Checks
    runs-on: ubuntu-latest
    steps:
      # ── 1. Check out the repository ──────────────────────────
      - name: Checkout repository
        uses: actions/checkout@v4

      # ── 2. Set up Python ─────────────────────────────────────
      - name: Set up Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip' # cache pip installs between runs to speed up the workflow

      # ── 3. Install dependencies ───────────────────────────────
      - name: Install dependencies
        working-directory: rag_test_framework
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      # ── 4. Run the RAG test suite ─────────────────────────────
      # Note: --json-report-summary is deliberately not used here. It drops
      # the per-test entries that ci/post_summary.py reads from "tests".
      - name: Run RAG quality checks
        working-directory: rag_test_framework
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          mkdir -p reports
          pytest tests/ \
            -v \
            --tb=short \
            --json-report \
            --json-report-file=reports/rag_test_report.json

      # ── 5. Upload report as a downloadable artifact ───────────
      - name: Upload test report
        if: always() # upload even if tests failed; you want the report either way
        uses: actions/upload-artifact@v4
        with:
          name: rag-test-report-${{ github.run_number }}
          path: rag_test_framework/reports/rag_test_report.json
          retention-days: 30

      # ── 6. Post summary to GitHub Actions summary page ────────
      - name: Post test summary
        if: always()
        working-directory: rag_test_framework
        run: python ci/post_summary.py reports/rag_test_report.json
📋 Step 3 — The Summary Script
The workflow calls ci/post_summary.py to write a clean summary to GitHub's built-in job summary page. Create that file now:
# ci/post_summary.py
import json
import os
import sys


def post_summary(report_path: str):
    """
    Read the pytest JSON report and write a markdown summary
    to the GitHub Actions step summary page (GITHUB_STEP_SUMMARY).
    """
    if not os.path.exists(report_path):
        print(f"Report not found at {report_path}")
        sys.exit(1)

    with open(report_path) as f:
        report = json.load(f)

    summary = report.get("summary", {})
    passed = summary.get("passed", 0)
    failed = summary.get("failed", 0)
    error = summary.get("error", 0)
    total = summary.get("total", 0)
    duration = round(report.get("duration", 0), 2)

    # Determine overall status
    if failed > 0 or error > 0:
        status_icon = "❌"
        status_label = "FAILED"
    else:
        status_icon = "✅"
        status_label = "PASSED"

    # Build the markdown summary
    lines = [
        f"## {status_icon} RAG Quality Checks - {status_label}",
        "",
        "| Metric | Value |",
        "|--------|-------|",
        f"| Total tests | {total} |",
        f"| Passed | {passed} |",
        f"| Failed | {failed} |",
        f"| Duration | {duration}s |",
        "",
    ]

    if failed > 0 or error > 0:
        lines.append("### ❌ Failed Tests")
        lines.append("")
        for test in report.get("tests", []):
            if test["outcome"] in ("failed", "error"):
                lines.append(f"- `{test['nodeid']}`")
                # Include the failure message if available
                if "call" in test and "longrepr" in test["call"]:
                    # Truncate long failure output for readability
                    longrepr = test["call"]["longrepr"]
                    preview = longrepr[:500] + "..." if len(longrepr) > 500 else longrepr
                    lines.append(f"  ```\n  {preview}\n  ```")
        lines.append("")

    lines += [
        "### Test Breakdown",
        "",
        "| Test File | Tests | Status |",
        "|-----------|-------|--------|",
    ]

    # Group tests by file for the breakdown table
    file_results: dict = {}
    for test in report.get("tests", []):
        file_name = test["nodeid"].split("::")[0]
        if file_name not in file_results:
            file_results[file_name] = {"total": 0, "failed": 0}
        file_results[file_name]["total"] += 1
        if test["outcome"] in ("failed", "error"):
            file_results[file_name]["failed"] += 1

    for file_name, counts in file_results.items():
        icon = "✅" if counts["failed"] == 0 else "❌"
        lines.append(f"| `{file_name}` | {counts['total']} | {icon} |")

    summary_text = "\n".join(lines)

    # Write to GitHub step summary if running in CI
    github_summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
    if github_summary_path:
        with open(github_summary_path, "a") as f:
            f.write(summary_text)
        print("✅ Summary written to GitHub Actions step summary.")
    else:
        # Running locally: just print it
        print(summary_text)


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python ci/post_summary.py <path-to-report.json>")
        sys.exit(1)
    post_summary(sys.argv[1])
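One gap worth knowing about: the script reads failure text only from the call stage. When a fixture fails (say, the collection fixture raising during setup), the test is reported with outcome error and the traceback sits under the setup stage instead, so the summary would list the test without any message. Below is a minimal sketch of a stage-aware lookup. It assumes the setup / call / teardown layout that pytest-json-report uses for per-test entries, and `first_failure_text` is a hypothetical helper name, not something from the original script.

```python
from typing import Any, Dict, Optional

# Stages are checked in execution order, so the earliest failure wins.
STAGES = ("setup", "call", "teardown")


def first_failure_text(test: Dict[str, Any], limit: int = 500) -> Optional[str]:
    """Return (possibly truncated) failure output from the first stage that has any."""
    for stage in STAGES:
        longrepr = test.get(stage, {}).get("longrepr")
        if longrepr:
            text = str(longrepr)
            return text[:limit] + "..." if len(text) > limit else text
    return None
```

In the failed-tests loop, this could stand in for the `"call" in test` check: call `first_failure_text(test)` and append the preview only when it returns something other than `None`.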
🔍 Step 4 — Understanding the Trigger Strategy
The paths filter in the workflow is one of the most important design decisions here. Let me explain why it's set up this way.
on:
  push:
    paths:
      - 'rag_test_framework/data/**'    # knowledge base changed
      - 'rag_test_framework/core/**'    # retrieval or pipeline logic changed
      - 'rag_test_framework/config/**'  # thresholds or models changed
      - 'rag_test_framework/tests/**'   # tests themselves changed
      - 'rag_test_framework/requirements.txt'
Why not trigger on every push?
RAG quality tests are expensive. Each test run calls the OpenAI API for embeddings and RAGAS evaluation. Running on every push to every file — including README changes, frontend code, unrelated scripts — wastes time and money.
What actually warrants a RAG quality check?
| Change | Should trigger? | Why |
|---|---|---|
| data/test_cases.json updated | ✅ Yes | Ground truth changed; verify scores still hold |
| New document added to knowledge base | ✅ Yes | Retrieval behaviour may shift |
| config/settings.py thresholds changed | ✅ Yes | You're redefining what "passing" means |
| Embedding model changed | ✅ Yes | Similarity scores will shift |
| System prompt changed | ✅ Yes | LLM behaviour may change |
| README.md updated | ❌ No | Documentation only |
| Frontend code changed | ❌ No | No impact on RAG pipeline |
The paths filter implements exactly this logic. Only relevant changes trigger the quality gate. 🎯
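To sanity-check the filter before pushing, you can approximate it locally. The sketch below is an approximation, not GitHub's actual matcher: it relies on the fact that `fnmatch`'s `*` also matches `/`, which happens to behave like `**` for these simple prefix patterns. `would_trigger` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase
from typing import Iterable

# Same patterns as the workflow's `paths` filter.
TRIGGER_PATTERNS = (
    "rag_test_framework/data/**",
    "rag_test_framework/core/**",
    "rag_test_framework/config/**",
    "rag_test_framework/tests/**",
    "rag_test_framework/requirements.txt",
)


def would_trigger(changed_files: Iterable[str]) -> bool:
    """True if at least one changed file matches a trigger pattern.

    Mirrors the rule that a workflow runs when any changed path matches
    the filter. Negated patterns (`!`) are not handled in this sketch.
    """
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in TRIGGER_PATTERNS
    )


if __name__ == "__main__":
    for files in (
        ["README.md"],
        ["README.md", "rag_test_framework/data/test_cases.json"],
        ["rag_test_framework/run_tests.py"],
    ):
        print(files, "->", would_trigger(files))
```

Running a few paths through a check like this also surfaces gaps in the filter. For example, a change to rag_test_framework/run_tests.py or to the workflow file itself matches none of the patterns, so it would not start a run.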
🚦 Step 5 — What Happens When Tests Fail
This is important to understand clearly.
When pytest exits with a non-zero return code (i.e., any test fails), GitHub Actions automatically marks the job as failed. You don't need to add any special logic for this.
What that means in practice:
On a push to main:
The commit is recorded but the workflow run is marked ❌ Failed. Your team sees it immediately in the repository's commit history.
On a pull request:
The PR's status checks show ❌ RAG Quality Checks — Failed. You can configure branch protection rules to block merging until this passes.
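The gate rests entirely on that exit code. If CI ever invokes pytest through a wrapper (for example a runner script like run_tests.py from Part 5, whose contents are not shown in this article), the wrapper has to pass the code through, otherwise failing tests show up green. A minimal sketch, assuming a wrapper that shells out to a command; `run_and_propagate` is a hypothetical name.

```python
import subprocess
import sys
from typing import Sequence


def run_and_propagate(command: Sequence[str]) -> int:
    """Run a command and return its exit code unchanged.

    check=False means a non-zero exit does not raise; the caller decides
    what to do with the code (normally: hand it straight to sys.exit).
    """
    completed = subprocess.run(list(command), check=False)
    return completed.returncode


# Possible usage at the bottom of a wrapper script (assumes pytest is installed):
#   sys.exit(run_and_propagate([sys.executable, "-m", "pytest", "tests/", "-v"]))
```

The failure mode to avoid is a wrapper that catches the error, prints a friendly message, and then exits with 0.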
Setting up branch protection (strongly recommended):
- Go to repository → Settings → Branches
- Add a branch protection rule for main
- Enable Require status checks to pass before merging
- Add RAG Quality Checks as a required check
Now no one can merge a change that breaks RAG quality — not accidentally, not under deadline pressure. The gate is automated. 🔒
📊 Step 6 — Viewing Results
After a workflow run you have three places to check results:
1. GitHub Actions job logs
Full pytest output, line by line. Best for debugging a specific failure.
2. GitHub Actions step summary
The clean markdown table from post_summary.py. Best for a quick pass/fail overview. Visible directly on the workflow run page without opening logs.
3. Downloaded artifact
The full rag_test_report.json. Best for tracking scores over time or doing deeper analysis. Download it from the workflow run's Artifacts section.
💰 Step 7 — Managing API Costs in CI
Running RAGAS evaluations in CI means calling the OpenAI API on every trigger. Here's how to keep costs under control.
Use a Smaller Evaluation Dataset in CI
Your full ground truth dataset might have 50+ test cases. In CI, you don't need to run all of them on every push.
Create a separate, smaller CI dataset:
// data/test_cases_ci.json
{
  "knowledge_base": [ ... ],
  "retrieval_test_cases": [
    // Keep your 5 highest-signal retrieval cases
    // These should represent the most common and most critical query types
  ],
  "faithfulness_test_cases": [
    // 3-4 cases that cover your main faithfulness scenarios
  ],
  "edge_case_queries": {
    "out_of_scope": ["What is the capital of France?"],
    "empty_retrieval": ["What is the pricing for the enterprise plan?"],
    "leading_questions": [ ... ]
  }
}
Then in conftest.py, read from an environment variable to decide which dataset to use:
# tests/conftest.py
import json
import os

import pytest

from core.retriever import build_collection
from core.evaluator import build_evaluator


@pytest.fixture(scope="session")
def test_data():
    # In CI, use the smaller dataset. Locally, use the full one.
    ci_mode = os.environ.get("CI", "false").lower() == "true"
    dataset_path = "data/test_cases_ci.json" if ci_mode else "data/test_cases.json"
    with open(dataset_path) as f:
        return json.load(f)


@pytest.fixture(scope="session")
def collection(test_data):
    kb = test_data["knowledge_base"]
    return build_collection(
        collection_name="rag_test_kb",
        documents=[doc["text"] for doc in kb],
        doc_ids=[doc["id"] for doc in kb]
    )


@pytest.fixture(scope="session")
def evaluator():
    llm, embeddings = build_evaluator()
    return llm, embeddings
GitHub Actions automatically sets CI=true in every workflow run — no extra configuration needed.
Result: CI runs a fast, cost-efficient subset. Full runs happen locally or on scheduled nightly jobs (see below). ✅
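Because the switch hangs on a single string comparison, it helps to see exactly which values count as CI mode. The sketch below pulls the same logic into a pure function so it can be exercised without touching the real environment. `select_dataset_path` is a hypothetical name; the fixture in conftest.py inlines this logic.

```python
import os
from typing import Mapping, Optional

FULL_DATASET = "data/test_cases.json"
CI_DATASET = "data/test_cases_ci.json"


def select_dataset_path(environ: Optional[Mapping[str, str]] = None) -> str:
    """Pick the dataset path the same way the test_data fixture does."""
    env = os.environ if environ is None else environ
    ci_mode = env.get("CI", "false").lower() == "true"
    return CI_DATASET if ci_mode else FULL_DATASET


if __name__ == "__main__":
    for env in ({"CI": "true"}, {"CI": "TRUE"}, {"CI": "1"}, {}):
        print(env, "->", select_dataset_path(env))
```

Only the string true (in any casing) selects the CI subset. A value such as 1 falls through to the full dataset, which is worth remembering if you ever run the suite on a system that sets CI differently.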
🌙 Step 8 — Scheduled Full Runs
For a complete quality audit, run the full dataset on a schedule, not just on push:
# Add this to the `on:` section of your workflow
  schedule:
    # Run every day at 2 AM UTC
    # This uses the full dataset, not the CI subset
    - cron: '0 2 * * *'
And in your workflow, pass an environment variable to tell conftest to use the full dataset:
- name: Run RAG quality checks
  working-directory: rag_test_framework
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
    # Override CI mode for scheduled runs: use full dataset
    CI: ${{ github.event_name != 'schedule' && 'true' || 'false' }}
  run: |
    mkdir -p reports
    pytest tests/ -v --tb=short --json-report --json-report-file=reports/rag_test_report.json
The result:
| Trigger | Dataset | Purpose |
|---|---|---|
| Push / PR | test_cases_ci.json (small) | Fast gate: catch obvious regressions |
| Scheduled (nightly) | test_cases.json (full) | Full quality audit: track score trends |
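The `&& ... ||` expression in the CI line is the usual ternary idiom in GitHub Actions expressions, and it reads much like Python's `and` / `or`, which also return one of their operands. The sketch below mirrors it to show which dataset each trigger ends up with; `ci_flag` and `dataset_for` are hypothetical names used only for illustration.

```python
def ci_flag(event_name: str) -> str:
    """Python mirror of: github.event_name != 'schedule' && 'true' || 'false'."""
    return (event_name != "schedule" and "true") or "false"


def dataset_for(event_name: str) -> str:
    """Dataset the test_data fixture would load for a given trigger."""
    return "test_cases_ci.json" if ci_flag(event_name) == "true" else "test_cases.json"


if __name__ == "__main__":
    for event in ("push", "pull_request", "workflow_dispatch", "schedule"):
        print(f"{event:18} CI={ci_flag(event):5} -> {dataset_for(event)}")
```

One consequence worth noticing: a manual run via workflow_dispatch is not a schedule event, so it also uses the small CI dataset. The idiom itself is safe here because the middle operand, 'true', is a non-empty string; it would misbehave if that operand could ever be empty or false.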
🧩 The Complete Final Architecture
Here's the full picture — everything we've built across all six parts:
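.github/workflows/
└── rag_quality_checks.yml ← CI/CD trigger + orchestration (GitHub only reads workflows from the repository root)

rag_test_framework/
│
├── (framework code below; the workflow's paths filter points here)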
│
├── ci/
│ └── post_summary.py ← GitHub Actions summary writer
│
├── config/
│ └── settings.py ← all thresholds, model names, API keys
│
├── core/
│ ├── retriever.py ← retrieval + Precision@K, Recall@K, MRR
│ ├── evaluator.py ← RAGAS faithfulness + answer_relevancy
│ └── rag_pipeline.py ← end-to-end RAG call
│
├── tests/
│ ├── conftest.py ← shared session-scoped fixtures
│ ├── test_retrieval.py ← Part 2 tests
│ ├── test_faithfulness.py ← Part 3 tests
│ └── test_edge_cases.py ← Part 4 tests
│
├── data/
│ ├── test_cases.json ← full ground truth dataset
│ └── test_cases_ci.json ← smaller CI subset
│
├── reports/
│ └── (timestamped JSON reports)
│
├── run_tests.py ← local single-command runner
└── requirements.txt
One push. One workflow. Automated quality gate on every relevant change. 🎯
✅ End-to-End Flow — What Happens on Every Relevant Push
Let's walk through exactly what happens when a developer updates data/test_cases.json:
Developer pushes a commit that updates data/test_cases.json
│
▼
GitHub detects the push matches a path filter
│
▼
GitHub Actions spins up ubuntu-latest runner
│
▼
Python 3.11 installed, pip cache restored
│
▼
pip install -r requirements.txt
│
▼
pytest tests/ runs with CI=true (uses test_cases_ci.json)
│
├── test_retrieval.py — Precision@K, Recall@K, MRR asserted
├── test_faithfulness.py — Faithfulness, no critical hallucinations
└── test_edge_cases.py — Empty retrieval, out-of-scope, leading questions
│
▼
rag_test_report.json written to reports/
│
▼
Report uploaded as downloadable artifact (kept 30 days)
│
▼
post_summary.py writes markdown table to GitHub step summary
│
├── All tests pass → ✅ Pipeline green, PR can merge
└── Any test fails → ❌ Pipeline blocked, team notified
🔖 Key Takeaways From Part 6
- Automation removes the "someone will remember" assumption: the gate runs regardless of deadline pressure or human error
- paths filtering keeps costs under control: only trigger on changes that can actually affect RAG quality
- Separate CI and full datasets: fast feedback on push, deep audit on schedule
- scope="session" fixtures + CI dataset = fast CI runs, with no repeated expensive setup and no unnecessary API calls
- Branch protection rules complete the gate: automated tests mean nothing if merging is still allowed when they fail
- Reports as artifacts: every run is recorded, so you can track quality score trends over time
- if: always() on artifact upload, because you always want the report, especially when tests fail
🏁 Series Complete — What You've Built
Let's take a moment to look at how far we've come.
Six parts ago, most engineers testing RAG systems had no framework, no metrics, and no automated gate. They were hoping the final answer "looked right."
You now have something completely different. 👇
Part 1 ✅ — Understood what RAG is and why traditional testing breaks down
Part 2 ✅ — Gave retrieval quality a number: Precision@K, Recall@K, MRR
Part 3 ✅ — Gave hallucination detection a number: faithfulness scoring with RAGAS
Part 4 ✅ — Tested the edge cases that break RAG systems in production
Part 5 ✅ — Assembled everything into a structured, reusable framework
Part 6 ✅ — Automated the framework in CI/CD with GitHub Actions
You can plug this into any RAG system. Swap the vector database. Swap the LLM. The tests stay the same. The gate stays active. The quality stays measurable. 🎯
This is what production-grade RAG testing looks like.
🚀 What's Next — Beyond This Series
The framework you've built is a foundation, not a ceiling. Here's where to take it from here:
NDCG implementation — We covered NDCG conceptually in Part 2. Adding a proper implementation using sklearn.metrics.ndcg_score is a natural next step for more sophisticated retrieval ranking tests.
Alternative vector databases — The framework currently uses ChromaDB. If your production system uses Pinecone, Weaviate, or pgvector, the only change is in core/retriever.py. The tests are untouched.
Score trend tracking — Each run produces a JSON report. Building a simple script to parse historical reports and plot score trends over time will tell you if your RAG quality is improving or degrading with each knowledge base update.
Latency testing — We tested quality but not speed. Retrieval latency and end-to-end response time are worth measuring, especially as your knowledge base grows.
More RAGAS metrics — RAGAS offers metrics beyond faithfulness and answer relevancy. Context precision and context recall are worth exploring as your test suite matures, and custom metrics are an option once the built-in ones stop being enough.
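For the score trend tracking idea above, a starting point could look like the sketch below. It assumes the downloaded reports sit together in one directory and that each follows pytest-json-report's layout (a top-level `created` timestamp and `summary` counts). Note that it tracks pass rates only: the report contains metric scores such as faithfulness only if your tests record them, which this sketch does not assume. `pass_rate_trend` is a hypothetical name.

```python
import json
from pathlib import Path
from typing import Dict, List


def pass_rate_trend(report_dir: str) -> List[Dict[str, object]]:
    """Summarise every JSON report in a directory, oldest first."""
    rows = []
    for path in Path(report_dir).glob("*.json"):
        report = json.loads(path.read_text(encoding="utf-8"))
        summary = report.get("summary", {})
        total = summary.get("total", 0)
        passed = summary.get("passed", 0)
        rows.append({
            "file": path.name,
            "created": report.get("created", 0),
            "total": total,
            "passed": passed,
            # Guard against empty runs to avoid dividing by zero.
            "pass_rate": round(passed / total, 3) if total else 0.0,
        })
    return sorted(rows, key=lambda row: row["created"])


if __name__ == "__main__":
    # "reports" is an assumed local folder of downloaded artifacts.
    for row in pass_rate_trend("reports"):
        print(f"{row['file']}: {row['passed']}/{row['total']} passed ({row['pass_rate']:.1%})")
```

Sorting by the report's own timestamp rather than by file name keeps the order right even when artifacts are renamed after download.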
Thank you for following this series all the way to the end. 🙏
Every part was built with real QA engineering principles — not just AI hype. The goal was always to make RAG testing feel like engineering, not magic.
I hope it does. 🎯
Drop a comment below 👇
- Have you wired this into your own CI/CD pipeline? How did it go?
- Which part of the series was most useful for your specific situation?
- What would you like me to cover next — NDCG implementation, alternative vector DBs, score trend tracking?
All questions and feedback welcome. Let's keep building. 🙌
Faizal Shaikh | Senior Automation Engineer | AI & RAG-Based Testing
Connect with me on LinkedIn