Offisong Emmanuel for Hackmamba

What if ML pipelines had a lock file?

I spent two hours last month staring at identical Git commits trying to figure out why my model retrain had different results.

The code was the same. The hyperparameters were the same. I was even running on the same machine. But the validation metrics had shifted by 12%, and I couldn't explain why. I checked everything twice: my random seeds were fixed, my dependencies were pinned, my Docker image hadn't changed. Then I looked at the data.

Someone had added a column to an upstream table and backfilled it. Nothing broke. The pipeline kept running. Training succeeded. But the feature distribution had shifted, and the model had learned from data that no one realized was different.

That experience changed how I think about ML pipelines. We can lock dependencies. We can lock infrastructure. But the computation itself has no identity. Pipelines are still scripts that read mutable data, assume schemas that drift, and depend on execution details that change quietly.

In this article, we’ll walk through why that makes ML pipelines hard to reproduce, what a pipeline lock file actually needs to capture, and how treating computation as an artifact changes how we debug, audit, and build models.

Why ML pipelines are hard to reproduce

When an ML pipeline fails to reproduce, the code is rarely the problem. Most teams already version their training scripts, feature logic, and model code using Git. The issue is that the meaning of that code depends on far more than what lives in the repository.

Consider a fraud detection pipeline as a use case. The code reads transaction data, joins it with user profiles, applies feature transformations, and trains a model. The Python script and SQL queries are tracked in Git. The model architecture is documented. Everything looks reproducible.

After a while, fraud detection accuracy drops in production, and you are tasked with recreating the training run for an audit, but you can't. The code runs, but the model comes out different. Something changed, but what?

The problem is that ML pipelines don't just depend on code. They depend on data, schemas, and execution details that live outside the repository and change without anyone noticing.

Data
Pipelines usually read from tables that change over time. Most of these tables are stored in a data warehouse like Amazon Redshift or Google BigQuery. Rows are added or removed. Backfills happen. A column gets renamed or its meaning changes. Even when teams snapshot data, those snapshots are often implicit, not recorded as part of the pipeline run itself.

In this fraud pipeline, training data comes from a warehouse table like transactions. Between the original training run and the reproduction attempt, the data team backfilled several months of historical records to fix a reporting bug. The pipeline query didn’t change:

SELECT * FROM transactions WHERE date >= '2025-01-01'

But the rows returned did.

The original model was trained on one set of data (transaction amounts, merchant categories, and user behavior), while the reproduced run was trained on a different set. Even though both runs used the same code, neither recorded which specific data version was used.

From the outside, it looks like “the same pipeline.” In reality, two different datasets flowed through it.

The problem is even worse with derived tables. If the fraud model depends on a shared feature table maintained by another team, and that team fixes a bug in their aggregation logic and recomputes the table, our pipeline can keep running and silently consume the updated features. There is no error or warning, just different inputs flowing into the same code.

Schemas
Schemas add another layer of fragility. Many pipelines assume schemas rather than enforce them.
During the fraud detection data backfill, the schema changed, too. A new column, merchant_risk_score, was added to the transactions table. It was nullable at first because historical data didn’t have values for it yet.

The feature pipeline didn’t break. It simply treated missing values as zero during normalization. That meant older transactions effectively had no merchant risk, while newer ones suddenly did. The feature still existed. The code still ran. But the meaning of the feature changed.
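
To see how easily this slips in, here is a minimal sketch of the failure mode, using pandas and a made-up slice of the transactions table rather than the actual fraud pipeline:

import pandas as pd

# Hypothetical slice of the transactions table: older rows predate the new
# merchant_risk_score column, so the backfill left them as NULL/NaN.
transactions = pd.DataFrame({
    "amount": [120.0, 80.0, 300.0, 45.0],
    "merchant_risk_score": [None, None, 0.83, 0.41],
})

# The feature step "handles" missing values by treating them as zero...
features = transactions.fillna({"merchant_risk_score": 0.0})

# ...so older transactions now look like zero-risk merchants while newer ones
# carry real scores. Nothing fails, but the feature means two different things
# depending on when the row was written.
print(features)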

As a result, the model learned two different behaviors depending on when a transaction occurred. Recent data emphasized merchant risk. Older data didn’t. Overall metrics looked fine during training, but once deployed, the model began misclassifying edge cases in production.

When accuracy dropped, the team assumed normal data drift and retrained. The retrain succeeded, but the new model still didn’t match the original. The schema change had rewritten the semantics of the features, and nothing in the pipeline recorded that shift or made it visible.

Dependencies and execution details
Dependencies and execution details add another layer of instability. A query planner may choose a different plan. A caching layer may reuse an old result. A User Defined Function (UDF) can change behavior because one of its dependencies was updated. None of this shows up in Git, and very little of it is visible in logs.

Caching can quietly alter your model's results too. Caches speed things up, which is good, but they also introduce hidden state that can change results between runs. For example, your pipeline caches a feature table, someone updates the upstream logic, and your cache is now stale, but nothing tells you that. You're training on a mix of old features and new data.

Even the runtime version matters. The original model artifact had been serialized with Python 3.9, but the reproduction ran under Python 3.11. The model loaded successfully, but downstream behavior wasn’t identical.
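
Recording the runtime next to the artifact at least makes that kind of mismatch visible. Here is a minimal standard-library sketch, not tied to any particular tool, where the save_model and load_model helpers are hypothetical:

import json
import pickle
import platform

def save_model(model, path):
    # Serialize the model and record the interpreter version that produced it.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path + ".runtime.json", "w") as f:
        json.dump({"python": platform.python_version()}, f)

def load_model(path):
    # Surface a runtime mismatch instead of letting it pass silently.
    with open(path + ".runtime.json") as f:
        recorded = json.load(f)["python"]
    current = platform.python_version()
    if recorded != current:
        print(f"warning: artifact built on Python {recorded}, now running {current}")
    with open(path, "rb") as f:
        return pickle.load(f)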

The result
The pipeline was reproducible in theory, but not in practice. The same code ran. A different computation happened.

There was no single artifact to inspect. No receipt that captured the data that was read, the schemas that were assumed, the UDF logic that executed, or the cache state that influenced the result. The team spent weeks reconstructing the run from logs, guesses, and tribal knowledge.

This is the gap lock files solved for software dependencies. And it’s the same gap ML pipelines still have today.

Why existing tools don’t fix this

At this point, most teams reach for familiar fixes.

They add more logging. They version datasets manually. They pin library versions. They introduce orchestrators, lineage tools, and experiment trackers. Each tool helps in isolation, but none of them answer the one question that matters during an incident or an audit:

What actually ran?
Logs tell you that a job executed, not which data it read. Git tells you what the code looked like, not how it resolved at runtime. Lineage graphs show connections, but not the concrete inputs, schemas, or cached state used in a specific run. Experiment tracking stores metrics and artifacts, but not the computation that produced them. So when something goes wrong, teams are left reconstructing history from fragments and guesswork.

The deeper issue is that ML pipelines don’t produce a durable artifact of the computation itself. The code is versioned, but the resolved execution is not. Data is mutable. Schemas drift. Execution details change. And none of that has a stable identity you can point to later.

Software engineering solved this problem years ago. We didn’t fix reproducibility by writing better README files or adding more logs. We fixed it by introducing lock files. Lock files are machine-readable artifacts that capture the fully resolved state of a system at execution time, representing the actual thing that ran rather than configuration.

The missing piece in ML is the same idea, applied to computation.

What an ML pipeline lock file actually is

An ML pipeline lock file is not a configuration file. It is not another place to declare what you want to run. It is a record of what actually ran.

In software, a lock file answers a simple question: What was installed? Not which dependencies were requested, but which ones were resolved, down to exact versions and hashes. An ML pipeline lock file needs to answer the same kind of question, but for computation. What computation is this?

That requires three things:

  • An explicit computation graph
  • Content identities
  • Roundtrippability

An explicit computation graph
The lock file must capture the computation as a concrete object. Not a Python script that does things, but the actual reads, transformations, joins, aggregations, UDFs, and caches that make up the pipeline.

For example, when you look at package-lock.json, you don't see installation scripts. You see the resolved dependency tree. Each package, each version. The lock file for an ML pipeline needs the same clarity.

Content identities
Every piece of the computation needs an identity based on its content. The inputs you read. The UDFs you execute. The dependencies you use. The cached artifacts you produce. Same inputs should mean the same identity and different inputs should mean different identities.

If two runs have the same content identities for their inputs, UDFs, and dependencies, they're running the same computation. If any of those identities differ, something changed. You don't have to guess. You can check the hashes.
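
As a rough illustration of the idea (this is not Xorq's actual hashing scheme, just a sketch with a hypothetical node_identity helper), an identity can be derived by hashing a canonical serialization of everything that defines a node:

import hashlib
import json

def node_identity(op, parents, params, schema):
    # Hash the operation, parent identities, parameters, and schema together.
    payload = json.dumps(
        {"op": op, "parents": parents, "params": params, "schema": schema},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

schema = {"amount": "float64", "is_fraud": "bool"}
read = node_identity("Read", [], {"path": "transactions.parquet"}, schema)
filt = node_identity("Filter", [read], {"predicate": "amount <= 10000"}, schema)

# Change anything meaningful (a path, a predicate, a schema) and the hash
# changes, and that change propagates to every downstream node.
print(read, filt)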

Roundtrippability
One of the core features of an ML lock file is roundtrippability. A real pipeline lock file must be runnable on its own. Given the lock file and its associated artifacts, you should be able to rerun the pipeline without relying on a particular machine, environment, or set of hidden caches.

If your lock files have these features, you can diff computations the way you diff lock files. You can verify that a rerun is actually running the same thing. You can cache based on content, not guesses. You can bisect regressions by comparing hashes instead of reading through logs.

Git vs. Manifests

A useful way to understand the value of manifests is to compare what traditional version control captures with what a build manifest records. Git excels at tracking how a pipeline is written, but it stops short of describing the fully resolved computation that actually executed. The manifest (expr.yaml) fills in that missing layer by freezing the execution-time reality of the pipeline.

  • Code (git): the pipeline definition.
  • Manifest (expr.yaml): the resolved inputs at execution time, schema contracts, UDF and UDXF content hashes, cached artifacts, and, ultimately, what actually ran.

Git is excellent at tracking the source code that defines a pipeline. The manifest goes further by recording the resolved state of that pipeline at execution time.

Create an ML lock file using Xorq

Once you understand what a pipeline lock file is and why it matters, the next step is seeing it in action. Xorq makes it straightforward to turn a declarative pipeline into a reproducible, versioned artifact with a lock file.

To get started, install Xorq using pip or uv:

pip install "xorq[examples]"

or

uv add "xorq[examples]"

Next, download the financial fraud dataset from Kaggle and place the CSV file in your working directory. This example uses a simplified fraud detection pipeline, but the structure mirrors what you would build in a real production system.

Create a file main.py with the following content:

import os

import xorq.api as xo
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from xorq.caching import ParquetCache
from xorq.config import options

# specifies the cache directory as <current directory>/cache
options.cache.default_relative_path = f"{os.getcwd()}/cache"

con = xo.connect()
cache = ParquetCache.from_kwargs()

# 1. Load the dataset
data = xo.read_csv('synthetic_fraud_dataset.csv')

# 2. Train / test split
train, test = xo.train_test_splits(data, test_sizes=0.2)

sk_pipeline = Pipeline([
    ("model", RandomForestClassifier(
        n_estimators=200,
        max_depth=10,
        random_state=42
    ))
])

# 3. Define the model
model = xo.Pipeline.from_instance(sk_pipeline)

# 4. Fit the model
fitted = model.fit(
    train,
    features=[
        'amount',
        'hour',
        'device_risk_score',
        'ip_risk_score'
    ],
    target='is_fraud'
)

# 5. Generate predictions (deferred execution)
predictions = fitted.predict(test).cache(cache=cache)

# 6. Execute the computation
print(predictions.execute())

A few important things are happening here. The entire pipeline is defined declaratively, with each step clearly described: data ingestion, train–test splitting, model configuration, and a cached prediction stage. Nothing runs until execution is requested. When it does run, Xorq has enough information to capture the full computation as an explicit graph.

At this point, you have a working ML pipeline. In the next step, instead of just running it, we will build it. That build step is what produces the lock file: a manifest that records the resolved computation, the data it read, the schemas it assumed, the cached artifacts it created, and the exact logic that ran.

If your project directory is not already a Git repository, you need to initialize one before building an expression. Xorq records the git state as part of the build metadata, so a repository with at least one commit is required.

Run the following commands in your project folder:

git init
git add .
git commit -m "initial commit"

Once the repository is initialized, you can build the expression and generate the lock file by running:

xorq build main.py -e predictions 

If you are using uv, the equivalent command is:

uv run xorq build main.py -e predictions 

This build step is what turns your pipeline from a runnable script into a versioned artifact, complete with a manifest that records the resolved computation.

The output of the run is shown in the image below:

Output of the build expression

After the build completes, you should see two new directories: builds and cache. The cache directory holds cached intermediate results created during execution. The builds directory contains the build artifacts themselves. Inside builds, you will find a directory named with a content-derived hash, for example 78ff43314468. This directory is the lock file in practice: the concrete, portable representation of the pipeline run.

Build folders

Within that directory, several files are generated automatically, including expr.yaml, metadata.json, and profiles.yaml. The most important of these is expr.yaml. This file is the receipt for what actually ran. It describes the computation graph, the resolved inputs, the schema contracts, the cached nodes, and the content hashes that give the pipeline its identity.

Taken together, the build directory is a versioned, cached, and portable artifact. Once it exists, workflows that were previously fragile or manual become straightforward: reproducible runs, diffable computation, bisectable regressions, portable artifacts, and, importantly, composition.

The expression file
At first glance, expr.yaml looks intimidating. It contains many components, but its purpose is simple. It describes the computation itself, explicitly and completely.

Below is an abridged example:

nodes:
    '@read_4d6c147c9486':
      op: Read
      method_name: read_parquet
      name: ibis_read_csv_nepinfk5dzbxja2bo4kycwisyq
      profile: 846181d9920579c7c1b10dd45b3ab9b2_0
      read_kwargs:
      - - path
        - builds/78ff43314468/database_tables/917eccee9a442913a8c1afca12cf69b0.parquet
      - - table_name
        - ibis_read_csv_nepinfk5dzbxja2bo4kycwisyq
      normalize_method: fvfvfvfvf
      schema_ref: schema_c4a0925bdfca
      snapshot_hash: 4d6c147c9486fe2f5140558ff6860b60

This first node answers a deceptively important question: What data was read? Not “which table name,” and not “which query,” but the exact data source. The Read node points to a concrete file, often materialized into the build directory itself. That means the pipeline is tied to the data that was actually used, not whatever that table happens to contain today.

The schema_ref is part of the plan. If the schema changes, this node no longer matches, and the computation’s identity changes with it.

Now look at how transformations are represented:

    '@filter_d5f72ffce15d':
      op: Filter
      parent:
        node_ref: '@read_4d6c147c9486'
      predicates:
      - op: LessEqual
        left:
          op: Multiply
          left:
            op: Cast
     predicted:
          op: ExprScalarUDF
          class_name: _predicted_18c1451165c

The code above describes the filter. The predicate itself is part of the graph, not hidden inside a function call or a SQL string. The filter is explicitly connected to its parent node, so there is no ambiguity about ordering or dependencies.

Every transformation builds on a previous node, forming a complete expression tree:
Read → Filter → Aggregate → Cache

Later in the file, you’ll see nodes like this:

'@cachednode_e7b5fd7cd0a9':
  op: CachedNode
  parent:
    node_ref: '@remotetable_9a92039564d4'
  cache:
    type: ParquetCache

Caching is also part of the computation. Because the cache appears in the graph, it is reproducible and portable. There are no hidden cache keys, no local assumptions, and no silent reuse of stale results. If the upstream logic changes, the cache node’s identity changes too.

Finally, notice the node names themselves:

@read_4d6c147c9486
@filter_d5f72ffce15d
@cachednode_e7b5fd7cd0a9

These identifiers are content-derived. They are hashes of the node’s inputs, logic, schema, and configuration. Change anything meaningful, and the identifier changes. That change propagates through the graph.

This is what makes expr.yaml a lock file. Instead of saying “run this Python script,” it records what computation resolved, what data it read, what schemas it assumed, and where caching occurred. The hash of the build becomes the identity of the computation itself.
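
Because the manifest is plain YAML, you can also inspect it programmatically. A small sketch, assuming the build directory shown above and the pyyaml package (the exact node layout may vary between Xorq versions):

import yaml

with open("builds/78ff43314468/expr.yaml") as f:
    manifest = yaml.safe_load(f)

# List every node in the resolved computation graph with its operation type.
for name, node in manifest["nodes"].items():
    print(f"{name}: {node.get('op')}")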

Treating pipelines as building blocks

So far, we’ve looked at how Xorq turns a pipeline into a versioned artifact. The real payoff is that these artifacts are composable. When you build a pipeline with Xorq, the output isn’t just a model or a metric. It’s a versioned computation artifact with a stable hash, e.g. xyz123. That hash represents the fully resolved training run: data, schemas, feature logic, and execution details.

Because that artifact has an identity, it can be reused. An inference pipeline can explicitly reference the training artifact it depends on. Instead of “load the latest model,” it loads the model produced by build xyz123, along with the exact feature definitions and schema contracts that training used. If training changes, inference doesn’t silently drift. The composition produces a new hash.
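
One very literal way to make that pin explicit is to record the build hash in the inference service and re-materialize exactly that computation before serving. This is a sketch using the hypothetical hash from above and the xorq run command covered later, not a prescribed pattern:

import subprocess
from pathlib import Path

# The inference service pins the exact training build it was validated against.
TRAINING_BUILD = Path("builds/xyz123")  # hypothetical pinned build hash

if not TRAINING_BUILD.exists():
    raise RuntimeError(f"pinned training build {TRAINING_BUILD} is missing")

# Re-run exactly that computation, rather than "the latest model".
subprocess.run(["xorq", "run", str(TRAINING_BUILD)], check=True)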

Composition also makes deployment seamless: you can roll back to a previous hash without guesswork.

Why is this different from experiment tracking?
Tools like MLflow track artifacts. DVC versions data. Both are useful, but neither gives you composable, versioned computation graphs.

  • MLflow can tell you which model file was produced, but not the resolved computation that created it.
  • DVC can version datasets, but not how those datasets were transformed, joined, cached, and consumed end-to-end.

Xorq’s unit of composition is the computation itself. Training pipelines produce artifacts that inference pipelines can depend on directly, without re-encoding assumptions in glue code.

What do we gain from this?

The most immediate gain is reproducibility. With a pipeline lock file, rerunning a pipeline means rerunning the same computation, not just the same code. The inputs are fixed, the schemas are known, the logic is explicit, and cached artifacts are part of the record. “Works on my machine” stops being a concern because the computation has a concrete identity.

You can rerun any build with:

xorq run builds/<build-hash>

Another advantage is portability: you can take a build produced on a developer’s laptop and execute it in CI, inside a container, or on a different execution engine with confidence that it will behave the same way.

Also, when a model regresses, you can diff runs. Two builds produce two manifests. Instead of guessing what changed, you get a semantic diff: data sources, schema changes, UDF content, planner decisions, cached nodes. This turns multi-week investigations into focused comparisons.
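
Even without dedicated tooling, a plain line-level diff of two manifests already narrows the search. A quick standard-library sketch, with placeholder build hashes standing in for two real runs:

import difflib
from pathlib import Path

# Placeholder build hashes for two runs of the same pipeline.
old = Path("builds/78ff43314468/expr.yaml").read_text().splitlines(keepends=True)
new = Path("builds/abc123def456/expr.yaml").read_text().splitlines(keepends=True)

# Changes to data sources, schemas, UDF content, or cache nodes all surface
# here, because those details are content-hashed into the manifest.
print("".join(difflib.unified_diff(old, new,
                                   fromfile="old/expr.yaml",
                                   tofile="new/expr.yaml")))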

Schema drift becomes visible early. Because schemas are part of the contract, drift shows up at boundaries rather than leaking silently into downstream logic. Pipelines fail fast, in the right place, instead of producing subtly wrong models.
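
For contrast, here is roughly what “failing at the boundary” looks like when done by hand, with a hypothetical expected schema for the transactions input; the point of putting schemas in the manifest is that this kind of check stops being manual:

import pandas as pd

# Hypothetical schema contract for the transactions input.
EXPECTED_SCHEMA = {"amount": "float64", "hour": "int64", "is_fraud": "bool"}

def check_schema(df: pd.DataFrame, expected: dict) -> None:
    # Fail fast at the boundary instead of letting drift leak downstream.
    actual = {col: str(dtype) for col, dtype in df.dtypes.items()}
    if actual != expected:
        raise ValueError(f"schema drift detected: expected {expected}, got {actual}")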

Finally, there is an organizational gain. When computation is explicit and versioned, teams move faster with less risk. Audits become tractable because training runs are reproducible.

Closing insights

Lock files changed how we think about software. They gave us a stable unit we could diff, ship, and trust. ML pipelines have needed the same thing for a long time, but until now, there has been nothing concrete to lock.

By giving computation an identity, pipeline manifests turn runs into artifacts. They capture what actually ran, not just what the code described. Once that exists, reproducibility, debugging, audits, and collaboration stop being fragile processes and start becoming mechanical.

Xorq provides a practical and robust foundation for building reproducible, auditable, and production-grade ML workflows. This makes it easy to generate an ML lock file that captures not just what was written, but what actually ran, including resolved inputs, content hashes, and cached artifacts.

For more information about Xorq, head over to their GitHub or their official documentation.
