Bhagirath

Posted on Feb 18

Exactly Once Is a Lie: Managing Financial Invariants Under Concurrency

#fintech #backend #web3 #infrastructure

Most engineers believe in “exactly once” execution — until they build a money movement system.

Then reality teaches them otherwise.

In distributed financial systems, requests time out. Providers retry. Webhooks arrive late. Networks drop responses. And somewhere in between, money is in motion.

The real danger isn’t failure.
The real danger is misclassifying the unknown as failure.

This article breaks down:
Why exactly-once is a lie
Why UNKNOWN is more dangerous than FAILURE
How to model financial invariants correctly
How to diagnose and scale when settlement delays increase

Part 1: The Problem — Fast UX, Inconsistent Ledger

Imagine this scenario:

Aman has ₹1000. He transfers ₹1000 to Naman.
Your system sends the transfer request to Naman’s bank provider.
The provider doesn’t respond within timeout.
Now you have three possible realities:
The provider processed the transfer successfully.
The provider dropped the request.
The provider is still processing it.

You don’t know which one.
That’s the UNKNOWN state.

Most systems simplify this into:
->SUCCESS
->FAILURE
->UNKNOWN

And then they treat UNKNOWN as FAILURE.
So the system refunds ₹1000 to Aman.
Aman sees ₹1000 available again and sends another transfer.

Later, the original provider request succeeds.
Total transferred: ₹2000. Actual balance: ₹1000.

Now someone must absorb the ₹1000 loss.

This isn’t a UX glitch.
This is a violation of a financial invariant.
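
To make the arithmetic concrete, here is a minimal sketch that replays the scenario above with and without the immediate refund. The function name and the boolean flag are illustrative assumptions, not part of any real system, and amounts are whole rupees.

```python
def replay_timeout_scenario(treat_unknown_as_failure: bool) -> dict:
    """Replay Aman's two transfer attempts; all amounts are whole rupees."""
    funded = 1000      # what Aman actually had
    available = funded
    locked = 0
    paid_out = 0       # what ends up delivered to Naman

    # Attempt 1: funds leave Available, then the provider call times out.
    available -= 1000
    locked += 1000
    if treat_unknown_as_failure:
        # Naive handling: refund immediately, as if the transfer had failed.
        locked -= 1000
        available += 1000

    # Attempt 2: accepted only if the money looks spendable.
    if available >= 1000:
        available -= 1000
        paid_out += 1000

    # Later, the original request turns out to have succeeded at the provider.
    paid_out += 1000
    if not treat_unknown_as_failure:
        locked -= 1000  # the held funds settle the original transfer

    return {"paid_out": paid_out, "shortfall": paid_out - funded}
```

With the refund, ₹2000 is paid out against ₹1000 of real funds. With the funds held, the second attempt is rejected and nothing leaks.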

The Invariant That Must Never Break

In financial systems:

Total debits must never exceed available balance.
Or more formally:
Available Balance + Locked Balance = Ledger Balance must hold at all times, and neither Available nor Locked may go negative due to concurrency.

If your state model allows temporary misclassification of UNKNOWN as FAILURE, you’re silently enabling double-spend.
That’s how real financial losses happen.

Why “Exactly Once” Is a Lie

Exactly-once semantics don’t exist across network boundaries. What you actually get is at-least-once delivery with delayed or duplicated signals.

Safety doesn’t come from transport guarantees. It comes from system design.

You compensate using:
Idempotency keys
Deduplication logic
Reconciliation jobs
Strict ledger invariants
Exactly-once is a transport fantasy. Financial safety is an accounting discipline.

Part 2: Modeling the UNKNOWN Correctly

UNKNOWN isn’t failure.
UNKNOWN is unsettled liability.
That means funds must not be considered available until final confirmation.
Instead of collapsing states into three buckets, model them as:

INITIATED
PENDING_EXTERNAL_CONFIRMATION
SETTLED_SUCCESS
SETTLED_FAILURE
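
One way to keep these states honest is an explicit transition table. The sketch below is an assumption about how the four states connect: the article names the states, but the rule that every transfer passes through PENDING_EXTERNAL_CONFIRMATION before settling is a simplification chosen here for illustration.

```python
from enum import Enum


class TransferState(Enum):
    INITIATED = "INITIATED"
    PENDING_EXTERNAL_CONFIRMATION = "PENDING_EXTERNAL_CONFIRMATION"
    SETTLED_SUCCESS = "SETTLED_SUCCESS"
    SETTLED_FAILURE = "SETTLED_FAILURE"


# Assumed transition table: settled states are terminal, and nothing
# settles without first waiting on external confirmation.
ALLOWED_TRANSITIONS = {
    TransferState.INITIATED: {TransferState.PENDING_EXTERNAL_CONFIRMATION},
    TransferState.PENDING_EXTERNAL_CONFIRMATION: {
        TransferState.SETTLED_SUCCESS,
        TransferState.SETTLED_FAILURE,
    },
    TransferState.SETTLED_SUCCESS: set(),
    TransferState.SETTLED_FAILURE: set(),
}


def transition(current: TransferState, target: TransferState) -> TransferState:
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Because there is no path from a timeout to SETTLED_FAILURE, a timeout cannot be recorded as a failure by accident.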

The key rule:

Until success or confirmed failure, funds must remain locked.
So instead of refunding immediately, the system:
Moves ₹1000 from Available → Locked
Displays:
Available: ₹0
Locked: ₹1000 (Pending)
Total: ₹1000

Now Aman understands reality: his money is in motion.

If the provider later confirms failure → release locked funds.
If the provider confirms success → settle permanently.

No invariant is broken.
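
A minimal in-memory sketch of the lock, release, and settle moves described above. The class and method names are hypothetical, amounts are whole rupees, and a real ledger would run each move inside a database transaction rather than on plain attributes.

```python
class Ledger:
    """Toy single-account ledger: total is always available + locked."""

    def __init__(self, balance: int):
        self.available = balance
        self.locked = 0

    @property
    def total(self) -> int:
        return self.available + self.locked

    def lock(self, amount: int) -> None:
        # Available -> Locked when a transfer starts.
        if amount <= 0 or amount > self.available:
            raise ValueError("insufficient available funds")
        self.available -= amount
        self.locked += amount

    def release(self, amount: int) -> None:
        # Locked -> Available after a confirmed failure.
        if amount <= 0 or amount > self.locked:
            raise ValueError("nothing locked for this amount")
        self.locked -= amount
        self.available += amount

    def settle(self, amount: int) -> None:
        # Locked funds leave the account after a confirmed success.
        if amount <= 0 or amount > self.locked:
            raise ValueError("nothing locked for this amount")
        self.locked -= amount
```

While Aman's ₹1000 is locked, a second lock of ₹1000 raises instead of succeeding, which is exactly the double-spend the refund would have allowed.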

Why This Is Also a Product Decision

Many teams chase “fast UX.”

They show:
Immediate success
Or immediate rollback

Because users “don’t like waiting.”
But showing fake certainty creates real risk.

Financial UX must reflect system reality.
It’s better to show “Waiting for external confirmation” than “Success” → “Oops, failed” — or worse, silent financial exposure.
This is where engineering and product must align.
Accuracy over illusion.

What a Minimal Safe Architecture Looks Like

Ledger Service (Source of Truth)

Maintains Available and Locked balances
Enforces balance invariants
Owns all state transitions that affect money
Ensures: Available + Locked = Ledger Balance

Transfer Service

Generates a unique transfer ID before any external call
Persists the transaction in INITIATED state
Moves funds from Available → Locked
Calls the provider only after the state is safely stored
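
The ordering matters more than the details, so here is a rough sketch of it. Everything named here (start_transfer, ProviderTimeout, the dict-based store and balances, the call_provider callable) is a hypothetical stand-in. It also assumes the provider confirms asynchronously, so both a normal return and a timeout leave the transfer pending.

```python
import uuid


class ProviderTimeout(Exception):
    """Hypothetical error raised when the provider does not answer in time."""


def start_transfer(store: dict, balances: dict, amount: int, call_provider) -> str:
    # 1. Generate the transfer ID before any external call.
    transfer_id = str(uuid.uuid4())
    if amount <= 0 or amount > balances["available"]:
        raise ValueError("insufficient available funds")

    # 2. Persist INITIATED and lock funds (one DB transaction in a real system).
    store[transfer_id] = {"amount": amount, "state": "INITIATED"}
    balances["available"] -= amount
    balances["locked"] += amount

    # 3. Only now talk to the provider.
    try:
        call_provider(transfer_id, amount)
    except ProviderTimeout:
        pass  # A timeout is not a failure: the funds stay locked.

    # 4. Either way, the outcome is unknown until confirmation arrives.
    store[transfer_id]["state"] = "PENDING_EXTERNAL_CONFIRMATION"
    return transfer_id
```

If the process crashes between steps 2 and 3, the record and the lock already exist, so a reconciliation pass can find the transfer instead of losing it.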

Provider Adapter Layer

Attaches idempotency key to every outbound request
Handles retries safely
Never assumes timeout equals failure

Webhook Handler

Processes provider callbacks idempotently
Validates transfer ID before any state change
Transitions PENDING_EXTERNAL_CONFIRMATION → SETTLED_SUCCESS or SETTLED_FAILURE
Never updates balances without going through ledger rules
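
A sketch of what idempotent callback handling could look like. The event shape (transfer_id, status) and the two ledger helper functions are assumptions for illustration. Real providers have their own payload formats and signature checks, which are left out here.

```python
def ledger_settle(balances: dict, amount: int) -> None:
    # Confirmed success: locked funds leave the account.
    balances["locked"] -= amount


def ledger_release(balances: dict, amount: int) -> None:
    # Confirmed failure: locked funds become available again.
    balances["locked"] -= amount
    balances["available"] += amount


def handle_webhook(transfers: dict, balances: dict, event: dict) -> str:
    transfer = transfers.get(event.get("transfer_id"))
    if transfer is None:
        return "ignored_unknown_transfer"
    if transfer["state"] != "PENDING_EXTERNAL_CONFIRMATION":
        # Already settled: a duplicate or late callback changes nothing.
        return "ignored_duplicate"

    if event.get("status") == "SUCCESS":
        ledger_settle(balances, transfer["amount"])
        transfer["state"] = "SETTLED_SUCCESS"
    elif event.get("status") == "FAILURE":
        ledger_release(balances, transfer["amount"])
        transfer["state"] = "SETTLED_FAILURE"
    else:
        return "ignored_unrecognized_status"
    return "applied"
```

The state check is what makes a replayed callback harmless: the second delivery finds the transfer already settled and touches no balance.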

Reconciliation Worker

Periodically scans stale PENDING transactions
Queries provider for final status
Resolves drift between internal state and external settlement

The critical rule is simple:
Only the ledger is allowed to change balances.
External systems influence state — they do not define truth.
When this boundary is clear, invariants remain enforceable even under retries, delays, or duplicate callbacks.

Part 3: When UNKNOWN Ratio Starts Increasing

Now let’s move from correctness to operations.
Suppose your metrics show:

Error rate: flat
CPU: normal
DB load: normal
But UNKNOWN transactions increasing

This is where most engineers start guessing.
Instead, reason through signals.

Scenario 1: Provider Delayed Finality

If webhooks arrive in delayed bursts while system load stays normal, the likely causes are:

Provider queue backlog
Network jitter
Provider-side throttling

This is delayed finality, not failure. The risk here is exposure accumulation.

If you process 10,000 transfers/hour and even 0.1% remain UNKNOWN, that’s 10 pending transfers per hour.
At ₹40,000 average ticket size → ₹4 lakh exposure accumulating per hour.

Delayed finality becomes financial exposure.
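
The same arithmetic as a small helper, so you can plug in your own numbers. Expressing the UNKNOWN ratio in basis points (1 bps = 0.01%) is a choice made here to keep the math in integers. The function name is illustrative.

```python
def hourly_exposure(transfers_per_hour: int, unknown_bps: int, avg_ticket: int):
    """Return (pending transfers per hour, rupees of new exposure per hour)."""
    pending = transfers_per_hour * unknown_bps // 10_000
    return pending, pending * avg_ticket


# 10,000 transfers/hour, 0.1% UNKNOWN (10 bps), ₹40,000 average ticket
pending, exposure = hourly_exposure(10_000, 10, 40_000)
# pending == 10, exposure == 400_000 (₹4 lakh)
```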

A Real Pattern Seen in Production

In one production environment, the UNKNOWN ratio increased from 0.2% to nearly 3% within 30 minutes.

Error rate remained flat.
CPU usage was stable.
Database load looked normal.

At first glance, nothing appeared broken.

The root cause was provider-side queue congestion during peak traffic. Webhooks were delayed by 8–12 minutes due to backlog.
System throughput was approximately 12,000 transfers per hour.

Average ticket size was around ₹35,000.
Within 45 minutes, locked exposure crossed ₹1 crore.

No invariant had broken. No double-spend occurred.
But financial exposure was accumulating silently.

Traffic was throttled before exposure crossed internal safety thresholds.
The lesson wasn’t about error handling.

It was about understanding that UNKNOWN is time-sensitive risk.
When settlement latency stretches, exposure grows — even if error metrics stay green.

Scenario 2: Internal Bottleneck
If:

Webhook arrival steady
UNKNOWN increasing
DB locks increasing
Queue depth rising

Then the problem is internal.

Possible causes:

Lock contention
Ledger write bottleneck
Idempotency table hotspot
Serialization conflict

This isn’t provider delay. This is internal conflict.
And if you misdiagnose it as provider delay, you scale the wrong layer.

What Mature Systems Do

They don’t just monitor “error rate.”

They monitor:

UNKNOWN ratio
Settlement P95 time
Exposure amount (₹ locked)
Reconciliation lag
Webhook processing latency
DB lock wait time

Because financial correctness isn’t binary. It’s time-sensitive.
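
As a rough illustration, three of these signals can be derived from transfer records directly. The record fields (state, amount, settle_seconds) are assumed names, and P95 is computed with the simple nearest-rank method. A production system would more likely pull these from its metrics store.

```python
import math


def nearest_rank_p95(values):
    """95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def settlement_snapshot(transfers: list) -> dict:
    pending = [t for t in transfers
               if t["state"] == "PENDING_EXTERNAL_CONFIRMATION"]
    settle_times = [t["settle_seconds"] for t in transfers
                    if t["state"].startswith("SETTLED")]
    return {
        # Share of transfers still waiting on the provider.
        "unknown_ratio": len(pending) / len(transfers) if transfers else 0.0,
        # Rupees currently locked behind unconfirmed transfers.
        "exposure_locked": sum(t["amount"] for t in pending),
        # Time to finality for transfers that did settle.
        "settlement_p95_seconds": nearest_rank_p95(settle_times) if settle_times else None,
    }
```

Tracking exposure in rupees next to the ratio matters because a small ratio can still hide a large amount when ticket sizes are high.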

Part 4: Scaling Without Breaking Invariants

Now comes the hard part: How do you scale without relaxing safety?
You can’t remove UNKNOWN state. You must contain it.

Exposure Caps

Limit total locked funds per provider.
If exposure crosses threshold:
Slow down new transfers
Or route via secondary provider

This is risk-based throttling.
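
A minimal sketch of such an admission check, assuming per-provider caps and a running total of locked funds per provider. The function name, the dict shapes, and the three outcome labels are all hypothetical.

```python
def admit(locked: dict, caps: dict, provider: str, amount: int, secondary=None):
    """Decide whether a new transfer may start, based on locked exposure."""

    def fits(name: str) -> bool:
        # Would this transfer keep the provider at or under its cap?
        return locked.get(name, 0) + amount <= caps[name]

    if fits(provider):
        return ("accept", provider)
    if secondary is not None and fits(secondary):
        return ("reroute", secondary)
    return ("throttle", None)
```

The check runs before funds are locked, so the cap limits how much UNKNOWN exposure any single provider can accumulate.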

Circuit Breakers

If settlement latency crosses threshold:

Stop initiating new transfers
Notify operations

Better to be temporarily unavailable than financially insolvent.
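
Here is one way the breaker could be sketched. The class, the P95-based trigger, and the manual reset are assumptions made for illustration. The notify argument stands in for whatever paging or alerting hook your operations team uses.

```python
class SettlementBreaker:
    """Halts new transfers when settlement latency crosses a limit."""

    def __init__(self, p95_limit_seconds: float, notify):
        self.p95_limit_seconds = p95_limit_seconds
        self.notify = notify
        self.is_open = False

    def observe(self, settlement_p95_seconds: float) -> None:
        # Trip once, and alert once, when latency crosses the limit.
        if settlement_p95_seconds > self.p95_limit_seconds and not self.is_open:
            self.is_open = True
            self.notify(
                f"settlement P95 {settlement_p95_seconds}s exceeds "
                f"{self.p95_limit_seconds}s; new transfers halted"
            )

    def allow_new_transfer(self) -> bool:
        return not self.is_open

    def reset(self) -> None:
        # Assumed to be a deliberate human action, not automatic recovery.
        self.is_open = False
```

Keeping the reset manual is a conservative choice: transfers resume only after someone has confirmed that the provider backlog has cleared.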

Automated Reconciliation

Scheduled job:

Re-check all PENDING > X minutes
Query provider status
Auto-settle where possible

Never rely solely on webhook arrival.
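
A sketch of one reconciliation pass. The query_provider callable, the status strings, and the created_at field are hypothetical. Balance changes are left out for brevity and would go through the ledger in a real system.

```python
def reconcile(transfers: dict, now: float, max_age_seconds: float, query_provider) -> list:
    """Settle stale pending transfers the provider can give a final answer for."""
    resolved = []
    for transfer_id, transfer in transfers.items():
        if transfer["state"] != "PENDING_EXTERNAL_CONFIRMATION":
            continue
        if now - transfer["created_at"] <= max_age_seconds:
            continue  # Still fresh: give the webhook a chance first.

        status = query_provider(transfer_id)
        if status == "SUCCESS":
            transfer["state"] = "SETTLED_SUCCESS"
        elif status == "FAILURE":
            transfer["state"] = "SETTLED_FAILURE"
        else:
            continue  # Provider has no final answer yet: stay pending.
        resolved.append(transfer_id)
    return resolved
```

Anything other than a final answer leaves the transfer pending, so the job never guesses an outcome.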

Idempotency Everywhere

Every outbound request:

Unique transfer ID
Stored before calling provider
Used to reconcile duplicates

But remember: idempotency prevents duplicates. It doesn’t guarantee exactly-once.

Your ledger invariant is what keeps money safe when duplicates and delays still occur.
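
To illustrate the difference, here is a toy provider that deduplicates on the idempotency key but sometimes loses its response. FakeProvider, LostResponse, and send_with_retries are invented for this sketch and do not model any real provider API.

```python
class LostResponse(Exception):
    """The request may have been processed, but no answer came back."""


class FakeProvider:
    def __init__(self, drop_responses: int):
        self.seen = {}           # idempotency key -> amount
        self.money_moves = 0     # how many times money actually moved
        self.drop_responses = drop_responses

    def transfer(self, idempotency_key: str, amount: int) -> str:
        if idempotency_key not in self.seen:
            # First time this key is seen: move the money once.
            self.seen[idempotency_key] = amount
            self.money_moves += 1
        if self.drop_responses > 0:
            self.drop_responses -= 1
            raise LostResponse()
        return "ACCEPTED"


def send_with_retries(provider, transfer_id: str, amount: int, attempts: int = 3) -> str:
    for _ in range(attempts):
        try:
            # The same transfer ID is reused as the key on every retry.
            return provider.transfer(transfer_id, amount)
        except LostResponse:
            continue
    return "UNKNOWN"  # Retries exhausted: not a failure, just unconfirmed.
```

Retries never move money twice, yet the caller can still end up with UNKNOWN even though the money moved. That gap is what locked funds and reconciliation have to cover.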

UX That Reflects Truth

Instead of:
Balance: ₹1000
After transfer: ₹0
Show:
Available: ₹0
Locked: ₹1000 (Pending confirmation)
Total: ₹1000
User sees reality. No fake certainty. No silent exposure.

The Real Lesson

UNKNOWN isn’t a temporary inconvenience.

It’s a state that tests whether your system respects financial invariants.

If you treat UNKNOWN as FAILURE, you risk double-spend.
If you treat UNKNOWN as SUCCESS, you risk false confirmation.
If you ignore UNKNOWN growth, you risk accumulating exposure silently.

Exactly-once execution is a comforting myth.
What actually protects financial systems:

Strict ledger invariants
Locked funds modeling
Exposure monitoring
Delayed finality handling
Honest UX

Most systems don’t collapse because engineers misunderstand distributed systems.

They collapse because they optimize for perceived speed before protecting financial invariants.

Exactly-once execution isn’t real.
Delayed signals are.
Retries are.
Unknown states are.

In financial systems, invariants don’t care about your timeout values or UX shortcuts.

They either hold — or money leaks.

And when money leaks, theory stops mattering.
