Sahil Singh

Posted on Feb 18 • Originally published at getglueapp.com

I Built a CLI Tool to Score Codebase Health -- Here's What I Learned

#devops #opensource #codequality #python

Three months ago, I built a CLI tool to score codebase health. I called it codebase-health-score (original, I know). The goal was simple: give a team a single number that represents the overall health of their system.

The project started as a weekend hack. It turned into something people actually use. And in the process, I learned that "codebase health" is a much more slippery concept than it initially appeared.

Why I Built This

Our team was going through the classic scenario: we had inherited a codebase from an acquisition. It was "messy." But how messy? Nobody could quantify it.

When we tried to plan the cleanup work, we'd get debates:

"How bad is it really?"
"Should we refactor X before Y?"
"Is this codebase maintainable, or do we need a rewrite?"

Everyone had opinions. Nobody had data.

I thought: what if I built a tool that analyzed the codebase and produced a score? Not a vanity metric, but something that actually correlates to real engineering friction.

The Challenge: What Actually Determines Health?

I quickly discovered that codebase health isn't a single property. It's a combination of properties, all of which matter, and none of which tells the whole story:

Complexity

The first instinct is to measure code complexity: cyclomatic complexity, lines of code per function, nesting depth.

# Bad complexity example
def process_order(order_id, customer_id, apply_discount=False,
                  skip_inventory=False, force_shipping=None,
                  override_price=None, is_internal=False):
    customer = get_customer(customer_id)
    if not customer:
        return {"error": "Customer not found"}

    order = get_order(order_id)
    if not order:
        return {"error": "Order not found"}

    # ... 200 more lines mixing pricing logic, shipping logic,
    # inventory logic, and error handling

High complexity correlates with bugs. But it's not perfectly predictive--sometimes simple code is wrong, and sometimes complex code is necessary.

Weight in my scoring: 15%

The reason it's not higher: complexity is easy to hide. You can refactor a function from 100 lines to 50 lines by moving the complexity to three other functions. The system-level complexity hasn't improved; it's just been redistributed.

Churn

Churn is the frequency with which a file or function changes. High-churn code is code that:

Is still being figured out
Has bugs being fixed repeatedly
Has changing requirements
Is in the critical path of the system

High churn is a strong predictor of future bugs. If a file changes 50 times in a quarter, it probably contains unresolved design decisions.

# Check churn on a file
git log --oneline --follow path/to/file.py | wc -l

I discovered something interesting: files with high churn are often high-complexity files, and they correlate strongly with production incidents. When we looked at our incident logs and traced them to file-level, the top 5 files by incident involvement were in the top 10 files by churn.

Weight in my scoring: 25%

This is why churn matters more than static complexity. Churn tells you where the system is still in flux. Where there's flux, there are bugs.

Contributor Concentration

This one surprised me. But it's actually one of the strongest health signals.

If one person understands 80% of the codebase, your system is fragile. If that person leaves, goes on vacation, or gets promoted, you're in trouble.

Conversely, if knowledge is distributed across 5+ people, the system is more resilient.

I measure this by calculating what percentage of commits come from the top N contributors:

def calculate_contributor_concentration(repo_path):
    """
    Calculate what % of commits come from top 5 contributors.
    Healthy = spread across many people.
    Unhealthy = concentrated in 1-2 people.
    """
    commits_by_contributor = count_commits_by_author(repo_path)
    total_commits = sum(commits_by_contributor.values())

    top_5_commits = sum(sorted(commits_by_contributor.values(),
                                reverse=True)[:5])
    concentration = (top_5_commits / total_commits) * 100

    # Less than 60% in top 5 = healthy
    # 60-80% = concerning
    # 80%+ = critical

    return concentration

The correlation here is almost spooky. I measured this against several teams and found:

Teams with >80% concentration in top 5 had 3.2x more onboarding issues
Teams with >80% concentration had 2.7x higher risk of critical incidents when knowledge-holder was unavailable
Teams with <60% concentration had 40% better retention of junior engineers

Weight in my scoring: 20%

Test Coverage

I was initially going to make this 30% of the score. But then I realized: test coverage is a vanity metric. You can have 95% coverage with bad tests. You can have 40% coverage on the critical paths (which is what matters).

What actually matters: are the critical paths tested? Are failure cases tested? Or is the coverage just hitting lines without testing actual behavior?

Still, some correlation exists: codebases with <30% coverage tend to have higher incident rates than codebases with >70% coverage.

Weight in my scoring: 10%

Documentation Coverage

This is tricky to measure automatically. I settled on a heuristic: what percentage of exported functions/classes have associated comments or docstrings?

def measure_documentation_ratio(codebase_path):
    """
    Count public functions with docstrings vs total public functions.
    """
    public_items = find_public_apis(codebase_path)
    documented_items = [item for item in public_items
                        if has_docstring(item)]

    return len(documented_items) / len(public_items)

In practice, I found this correlates weakly with actual system health. Some of the most stable, well-understood codebases have minimal documentation. Some heavily documented codebases are disasters because the code changed and the docs didn't.

But absence of documentation does correlate with onboarding difficulty. New engineers in undocumented codebases take 50% longer to ramp.

Weight in my scoring: 10%

Technical Debt Markers

Some patterns are markers of accumulated technical debt:

TODO/FIXME comments that haven't been touched in 6+ months
Dead code (imports that aren't used, functions that aren't called)
Dependency version obsolescence (packages that haven't been updated in 2+ years)

These are weak individual signals, but together they form a pattern. A codebase with many of these markers is showing signs of neglect.

Weight in my scoring: 10%

Coupling & Modularity

This one requires actual analysis of dependencies: how many modules import from how many other modules?

A healthy architecture has:

Clear boundaries (modules don't reach deep into other modules' internals)
Limited cross-cutting dependencies
A recognizable structure (layered, services-based, whatever pattern you chose)

An unhealthy architecture has:

Circular dependencies
Every file importing from every other file
No clear separation of concerns

def calculate_coupling_metrics(repo_path):
    """
    Measure how tightly coupled modules are.
    """
    import_graph = build_import_graph(repo_path)
    cycles = find_cycles(import_graph)
    overly_connected = [
        module for module in import_graph
        if len(import_graph[module]) > 10
    ]
    avg_imports_per_module = sum(
        len(deps) for deps in import_graph.values()
    ) / len(import_graph)

    return {
        'cycles': len(cycles),
        'overly_connected': len(overly_connected),
        'avg_connectivity': avg_imports_per_module
    }

This is computationally expensive for large codebases, but it's one of the strongest predictors of system health.

Weight in my scoring: 10%

The Scoring Formula

I settled on this weighting:

Health Score = (0.25 x Churn) + (0.20 x Contributor Concentration) + (0.15 x Complexity) + (0.10 x Coupling) + (0.10 x Test Coverage) + (0.10 x Documentation) + (0.10 x Technical Debt Markers)

Scale: 0-100, where:

0-30: Critical. High risk of incidents. Hard to change. Fragile.
30-50: Unhealthy. Velocity is degraded. Rework is common.
50-70: Acceptable. Maintainable. Some areas need attention.
70-85: Healthy. Good balance of velocity and stability.
85-100: Excellent. Well-maintained, stable, easy to change.

Real-World Findings

I've now run this tool on about 30 different codebases. Some patterns surprised me:

1. Age Doesn't Predict Score

Some 10-year-old codebases score in the 80s. Some 2-year-old codebases score in the 30s. What matters: has anyone been actively maintaining it?

2. Size Doesn't Predict Score

A 500-file monolith can score higher than a microservices architecture with 50 services. Modularity matters more than scale.

3. Contributor Concentration is Underrated

I moved it from 15% to 20% because the correlation with real problems was stronger than I expected.

4. Coupling is More Predictive Than Complexity

Systems with moderate complexity but high coupling were worse than systems with high complexity but low coupling. Complexity can be refactored locally. Coupling forces you to refactor everything.

5. Open Source Codebases Score Surprisingly High

Most scored 65+. Contributor concentration is naturally distributed, dead code gets cleaned up, tests are emphasized, and documentation is required.

What I'd Do Differently

Weight churn more heavily. It's the strongest single predictor of problems.
Focus on critical paths. Not all code is equal. Future versions should allow path-specific scoring.
Include incident correlation. Correlate scores against actual incident history.
Make it continuous. Run on every commit and trend the score over time.

Open Source & Beyond

The tool is available at github.com/glue-tools-ai/codebase-health-score. It's MIT licensed. PRs welcome.

Teams want to understand their codebase health. They want signals for where to invest cleanup effort. This tool sits in the middle: give me a repo, and I'll tell you in 2 minutes whether your codebase is healthy, where the problems are, and what actions would help most.

The Bigger Picture

Codebases are alive. They change. They decay. But unlike biological systems, they can be resurrected and healed.

The teams that do well treat their codebase health like they treat physical fitness: they check their metrics regularly, understand what drives health, and invest in the fundamentals rather than emergency measures.

The tool won't save you. But it'll tell you the truth about your codebase. And that truth is the foundation for fixing anything.

Resources & References

github.com/glue-tools-ai/codebase-health-score - The open source tool
DORA Metrics: 4 Key Metrics for Software Delivery - Gold standard for measuring engineering health
Refactoring Guru: Code Smell Detection - On identifying problematic code patterns

DEV Community

I Built a CLI Tool to Score Codebase Health -- Here's What I Learned

Why I Built This

The Challenge: What Actually Determines Health?

Complexity

Churn

Contributor Concentration

Test Coverage

Documentation Coverage

Technical Debt Markers

Coupling & Modularity

The Scoring Formula

Real-World Findings

1. Age Doesn't Predict Score

2. Size Doesn't Predict Score

3. Contributor Concentration is Underrated

4. Coupling is More Predictive Than Complexity

5. Open Source Codebases Score Surprisingly High

What I'd Do Differently

Open Source & Beyond

The Bigger Picture

Resources & References

Top comments (0)