Most engineers believe in “exactly once” execution — until they build a money movement system.
Then reality teaches them otherwise.
In distributed financial systems, requests time out. Providers retry. Webhooks arrive late. Networks drop responses. And somewhere in between, money is in motion.
The real danger isn’t failure.
The real danger is misclassifying the unknown as failure.
This article breaks down:
- Why exactly-once is a lie
- Why UNKNOWN is more dangerous than FAILURE
- How to model financial invariants correctly
- How to diagnose and scale when settlement delays increase
Part 1: The Problem — Fast UX, Inconsistent Ledger
Imagine this scenario:
Aman has ₹1000. He transfers ₹1000 to Naman.
Your system sends the transfer request to the bank provider, asking it to debit Aman and credit Naman.
The provider doesn’t respond within timeout.
Now you have three possible realities:
- The provider processed the transfer successfully.
- The provider dropped the request.
- The provider is still processing it.
You don’t know which one.
That’s the UNKNOWN state.
Most systems simplify this into:
- SUCCESS
- FAILURE
- UNKNOWN
And then they treat UNKNOWN as FAILURE.
So the system refunds ₹1000 to Aman.
Aman sees ₹1000 available again and sends another transfer.
Later, the original provider request succeeds.
Total transferred: ₹2000. Actual balance: ₹1000.
Now someone must absorb the ₹1000 loss.
This isn’t a UX glitch.
This is a violation of a financial invariant.
The Invariant That Must Never Break
In financial systems:
Total debits must never exceed available balance.
Or more formally:
Available Balance + Locked Balance = Ledger Balance must always hold, and neither Available nor Locked may ever go negative, even under concurrency.
If your state model allows temporary misclassification of UNKNOWN as FAILURE, you’re silently enabling double-spend.
That’s how real financial losses happen.
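The scenario above can be replayed in a few lines. This is only an illustrative sketch: `NaiveWallet`, its fields, and the outcome strings are hypothetical names, and the provider is simulated by passing the outcome in directly.

```python
class NaiveWallet:
    """Hypothetical wallet that collapses a provider timeout into FAILURE."""

    def __init__(self, balance: int) -> None:
        self.available = balance     # what the user sees as spendable
        self.provider_debits = 0     # what the provider has actually moved

    def transfer(self, amount: int, provider_outcome: str) -> None:
        if amount > self.available:
            raise ValueError("insufficient funds")
        self.available -= amount
        if provider_outcome == "SUCCESS":
            self.provider_debits += amount
        elif provider_outcome == "TIMEOUT":
            # The bug: UNKNOWN is treated as FAILURE and refunded at once.
            self.available += amount

    def late_success(self, amount: int) -> None:
        # The provider completes the timed-out transfer after the refund.
        self.provider_debits += amount


opening_balance = 1000
wallet = NaiveWallet(opening_balance)
wallet.transfer(1000, "TIMEOUT")   # refunded immediately
wallet.transfer(1000, "SUCCESS")   # Aman spends the refunded money
wallet.late_success(1000)          # the first transfer lands after all

loss = wallet.provider_debits - opening_balance
print(wallet.provider_debits, loss)
```

The provider has moved ₹2000 against an opening balance of ₹1000, so the difference is the amount someone has to absorb.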
Why “Exactly Once” Is a Lie
Exactly-once semantics don’t exist across network boundaries. What you actually get is at-least-once delivery with delayed or duplicated signals.
Safety doesn’t come from transport guarantees. It comes from system design.
You compensate using:
- Idempotency keys
- Deduplication logic
- Reconciliation jobs
- Strict ledger invariants
Exactly-once is a transport fantasy. Financial safety is an accounting discipline.
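A minimal sketch of how an idempotency key makes at-least-once delivery safe to retry. `FakeProvider` and `send_with_retries` are hypothetical names, and the dropped response on the first call is simulated; real providers differ in how they expose idempotency.

```python
import uuid


class FakeProvider:
    """Hypothetical provider that deduplicates on an idempotency key."""

    def __init__(self) -> None:
        self.processed: dict[str, int] = {}   # idempotency key -> amount
        self.total_moved = 0
        self.calls = 0

    def transfer(self, key: str, amount: int) -> str:
        self.calls += 1
        if key not in self.processed:
            self.processed[key] = amount
            self.total_moved += amount
        if self.calls == 1:
            # Simulated lost response: the work is done, but the caller
            # never hears about it.
            raise TimeoutError("response dropped")
        return "SUCCESS"


def send_with_retries(provider: FakeProvider, key: str, amount: int,
                      attempts: int = 3) -> str:
    for _ in range(attempts):
        try:
            return provider.transfer(key, amount)
        except TimeoutError:
            continue   # retry with the SAME key, never a fresh one
    return "UNKNOWN"   # still no answer: do not assume failure


provider = FakeProvider()
key = str(uuid.uuid4())   # generated once, before the first attempt
status = send_with_retries(provider, key, 1000)
print(status, provider.calls, provider.total_moved)
```

The request was delivered twice, but the money moved once. The delivery was not exactly-once; the effect was made safe by deduplication.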
Part 2: Modeling the UNKNOWN Correctly
UNKNOWN isn’t failure.
UNKNOWN is unsettled liability.
That means funds must not be considered available until final confirmation.
Instead of collapsing states into three buckets, model them as:
- INITIATED
- PENDING_EXTERNAL_CONFIRMATION
- SETTLED_SUCCESS
- SETTLED_FAILURE
The key rule:
Until success or confirmed failure, funds must remain locked.
So instead of refunding immediately, the system:
Moves ₹1000 from Available → Locked
Displays:
Available: ₹0
Locked: ₹1000 (Pending)
Total: ₹1000
Now Aman understands reality: his money is in motion.
If the provider later confirms failure → release locked funds.
If the provider confirms success → settle permanently.
No invariant is broken.
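The lock, settle, and release steps can be sketched as a tiny in-memory ledger. The class and method names are assumptions for illustration; a real ledger would run these steps inside database transactions.

```python
class Ledger:
    """Hypothetical single-account ledger with Available and Locked funds."""

    def __init__(self, balance: int) -> None:
        self.available = balance
        self.locked = 0
        self.ledger_balance = balance

    def _check(self) -> None:
        if self.available < 0 or self.locked < 0:
            raise RuntimeError("negative balance")
        if self.available + self.locked != self.ledger_balance:
            raise RuntimeError("Available + Locked != Ledger Balance")

    def lock(self, amount: int) -> None:
        if amount > self.available:
            raise ValueError("insufficient available funds")
        self.available -= amount
        self.locked += amount
        self._check()

    def settle_success(self, amount: int) -> None:
        # Confirmed success: the money leaves Locked and the ledger total.
        if amount > self.locked:
            raise ValueError("nothing locked for this amount")
        self.locked -= amount
        self.ledger_balance -= amount
        self._check()

    def settle_failure(self, amount: int) -> None:
        # Confirmed failure: release the lock back to Available.
        if amount > self.locked:
            raise ValueError("nothing locked for this amount")
        self.locked -= amount
        self.available += amount
        self._check()


ledger = Ledger(1000)
ledger.lock(1000)   # provider timed out: funds stay locked, not refunded
pending_view = (ledger.available, ledger.locked, ledger.ledger_balance)

try:
    ledger.lock(1000)   # Aman tries to send the same money again
    second_attempt_blocked = False
except ValueError:
    second_attempt_blocked = True

ledger.settle_success(1000)   # provider later confirms success
final_view = (ledger.available, ledger.locked, ledger.ledger_balance)
```

While the transfer is pending, the second attempt is rejected because nothing is available to lock, which is exactly the double-spend the naive refund allowed.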
Why This Is Also a Product Decision
Many teams chase “fast UX.”
They show:
Immediate success
Or immediate rollback
Because users “don’t like waiting.”
But showing fake certainty creates real risk.
Financial UX must reflect system reality.
It’s better to show “Waiting for external confirmation” than “Success” → “Oops, failed” — or worse, silent financial exposure.
This is where engineering and product must align.
Accuracy over illusion.
What a Minimal Safe Architecture Looks Like
- Ledger Service (Source of Truth)
  - Maintains Available and Locked balances
  - Enforces balance invariants
  - Owns all state transitions that affect money
  - Ensures: Available + Locked = Ledger Balance
- Transfer Service
  - Generates a unique transfer ID before any external call
  - Persists the transaction in INITIATED state
  - Moves funds from Available → Locked
  - Calls the provider only after the state is safely stored
- Provider Adapter Layer
  - Attaches idempotency key to every outbound request
  - Handles retries safely
  - Never assumes timeout equals failure
- Webhook Handler
  - Processes provider callbacks idempotently
  - Validates transfer ID before any state change
  - Transitions PENDING_EXTERNAL_CONFIRMATION → SETTLED_SUCCESS or SETTLED_FAILURE
  - Never updates balances without going through ledger rules
- Reconciliation Worker
  - Periodically scans stale PENDING transactions
  - Queries provider for final status
  - Resolves drift between internal state and external settlement
The critical rule is simple:
Only the ledger is allowed to change balances.
External systems influence state — they do not define truth.
When this boundary is clear, invariants remain enforceable even under retries, delays, or duplicate callbacks.
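A rough sketch of the transfer state machine and an idempotent webhook handler, under some assumptions: state lives in an in-memory dict standing in for a database table, `start_transfer` and `handle_webhook` are hypothetical names, finality is assumed to arrive only via callback, and the ledger calls that would accompany each transition are left out to keep the focus on state handling.

```python
from enum import Enum


class State(Enum):
    INITIATED = "INITIATED"
    PENDING = "PENDING_EXTERNAL_CONFIRMATION"
    SETTLED_SUCCESS = "SETTLED_SUCCESS"
    SETTLED_FAILURE = "SETTLED_FAILURE"


# Terminal states allow no further transitions.
ALLOWED = {
    State.INITIATED: {State.PENDING},
    State.PENDING: {State.SETTLED_SUCCESS, State.SETTLED_FAILURE},
    State.SETTLED_SUCCESS: set(),
    State.SETTLED_FAILURE: set(),
}

transfers: dict[str, State] = {}   # transfer ID -> state


def start_transfer(transfer_id: str, call_provider) -> State:
    transfers[transfer_id] = State.INITIATED   # stored before any external call
    try:
        call_provider(transfer_id)
    except TimeoutError:
        pass   # a timeout is not a failure; the outcome is simply unknown
    transfers[transfer_id] = State.PENDING     # finality arrives later
    return transfers[transfer_id]


def handle_webhook(transfer_id: str, outcome: State) -> bool:
    """Apply a provider callback; return True only if the state changed."""
    current = transfers.get(transfer_id)
    if current is None:
        return False   # unknown transfer ID: reject
    if outcome not in ALLOWED[current]:
        return False   # duplicate or conflicting callback: ignore
    transfers[transfer_id] = outcome
    return True


def timing_out_provider(transfer_id: str) -> None:
    raise TimeoutError("no response")


state_after_call = start_transfer("tx-1", timing_out_provider)
first = handle_webhook("tx-1", State.SETTLED_SUCCESS)
duplicate = handle_webhook("tx-1", State.SETTLED_SUCCESS)
conflicting = handle_webhook("tx-1", State.SETTLED_FAILURE)
unknown_id = handle_webhook("tx-404", State.SETTLED_SUCCESS)
```

Only the first callback changes state. The duplicate, the conflicting callback, and the callback for an unknown ID are all no-ops, so balances can be updated at most once per transfer.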
Part 3: When UNKNOWN Ratio Starts Increasing
Now let’s move from correctness to operations.
Suppose your metrics show:
- Error rate: flat
- CPU: normal
- DB load: normal
- But UNKNOWN transactions increasing
This is where most engineers start guessing.
Instead, reason through signals.
Scenario 1: Provider Delayed Finality
If webhooks arrive in bursts after a delay while system load stays normal, the likely causes are:
- Provider queue backlog
- Network jitter
- Provider-side throttling
This is delayed finality, not failure. The risk here is exposure accumulation.
If you process 10,000 transfers/hour and even 0.1% remain UNKNOWN, that’s 10 pending transfers per hour.
At ₹40,000 average ticket size → ₹4 lakh exposure accumulating per hour.
Delayed finality becomes financial exposure.
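The arithmetic above, written out as a small helper. The function name and parameters are illustrative; `Fraction` is used so that 0.1% is represented exactly rather than as a float.

```python
from fractions import Fraction


def pending_exposure(transfers_per_hour: int, unknown_ratio: Fraction,
                     avg_ticket_inr: int) -> tuple[Fraction, Fraction]:
    """Return (pending transfers per hour, rupees of exposure per hour)."""
    pending_per_hour = transfers_per_hour * unknown_ratio
    exposure_per_hour = pending_per_hour * avg_ticket_inr
    return pending_per_hour, exposure_per_hour


# 10,000 transfers/hour, 0.1% UNKNOWN, ₹40,000 average ticket size
pending, exposure = pending_exposure(10_000, Fraction(1, 1000), 40_000)
print(pending, exposure)   # 10 pending transfers, ₹400,000 (₹4 lakh) per hour
```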
A Real Pattern Seen in Production
In one production environment, the UNKNOWN ratio increased from 0.2% to nearly 3% within 30 minutes.
Error rate remained flat.
CPU usage was stable.
Database load looked normal.
At first glance, nothing appeared broken.
The root cause was provider-side queue congestion during peak traffic. Webhooks were delayed by 8–12 minutes due to backlog.
System throughput was approximately 12,000 transfers per hour.
Average ticket size was around ₹35,000.
Within 45 minutes, locked exposure crossed ₹1 crore.
No invariant had broken. No double-spend occurred.
But financial exposure was accumulating silently.
Traffic was throttled before exposure crossed internal safety thresholds.
The lesson wasn’t about error handling.
It was about understanding that UNKNOWN is time-sensitive risk.
When settlement latency stretches, exposure grows — even if error metrics stay green.
Scenario 2: Internal Bottleneck
If:
- Webhook arrival steady
- UNKNOWN increasing
- DB locks increasing
- Queue depth rising
Then the problem is internal.
Possible causes:
- Lock contention
- Ledger write bottleneck
- Idempotency table hotspot
- Serialization conflict
This isn’t provider delay. This is internal conflict.
And if you misdiagnose it as provider delay, you scale the wrong layer.
What Mature Systems Do
They don’t just monitor “error rate.”
They monitor:
- UNKNOWN ratio
- Settlement P95 time
- Exposure amount (₹ locked)
- Reconciliation lag
- Webhook processing latency
- DB lock wait time
They monitor these because financial correctness isn’t binary. It’s time-sensitive.
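As a rough illustration, three of these signals can be computed from a list of transfer records. The record shape and function names are assumptions, the percentile uses the simple nearest-rank method, and the input list is assumed to be non-empty.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TransferRecord:
    amount_inr: int
    settled: bool
    settle_seconds: Optional[int] = None   # None while still pending


def nearest_rank_percentile(values: list[int], pct: int) -> int:
    """Nearest-rank percentile (pct in 1..100) using integer math only."""
    if not values:
        raise ValueError("no settled transfers to measure")
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)   # ceil(pct/100 * n)
    return ordered[rank - 1]


def snapshot(records: list[TransferRecord]) -> dict[str, float]:
    pending = [r for r in records if not r.settled]
    settle_times = [r.settle_seconds for r in records
                    if r.settled and r.settle_seconds is not None]
    return {
        "unknown_ratio": len(pending) / len(records),
        "locked_exposure_inr": sum(r.amount_inr for r in pending),
        "settlement_p95_s": nearest_rank_percentile(settle_times, 95),
    }


# 20 settled transfers taking 1..20 seconds, plus 5 still pending
records = [TransferRecord(100, True, s) for s in range(1, 21)]
records += [TransferRecord(40_000, False) for _ in range(5)]
metrics = snapshot(records)
```

Note that the error rate does not appear anywhere in this snapshot. All three numbers can degrade while every request still returns without an error.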
Part 4: Scaling Without Breaking Invariants
Now comes the hard part: How do you scale without relaxing safety?
You can’t remove UNKNOWN state. You must contain it.
- Exposure Caps
Limit total locked funds per provider.
If exposure crosses the threshold:
- Slow down new transfers
- Or route via a secondary provider
This is risk-based throttling.
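One possible shape for a per-provider exposure cap, sketched with in-memory counters. `ExposureGate`, the provider names, and the cap values are all hypothetical.

```python
from typing import Optional


class ExposureGate:
    """Hypothetical gate that caps locked funds per provider."""

    def __init__(self, cap_per_provider: dict[str, int]) -> None:
        self.cap = cap_per_provider
        self.locked = {name: 0 for name in cap_per_provider}

    def route(self, amount: int, preferred: list[str]) -> Optional[str]:
        # Try providers in order of preference; skip any that would
        # exceed its cap if this transfer were added.
        for provider in preferred:
            if self.locked[provider] + amount <= self.cap[provider]:
                self.locked[provider] += amount
                return provider
        return None   # every provider is at its cap: throttle

    def release(self, provider: str, amount: int) -> None:
        # Called once a transfer settles, whether success or failure.
        self.locked[provider] -= amount


gate = ExposureGate({"primary": 100_000, "secondary": 50_000})
order = ["primary", "secondary"]

r1 = gate.route(60_000, order)   # fits under the primary cap
r2 = gate.route(50_000, order)   # primary would exceed its cap
r3 = gate.route(50_000, order)   # both at cap: throttled
gate.release("primary", 60_000)  # first transfer settles
r4 = gate.route(50_000, order)   # primary has room again
```

Here `r1` goes to the primary, `r2` spills to the secondary, `r3` is throttled, and `r4` is accepted again once settlement frees capacity.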
- Circuit Breakers
If settlement latency crosses threshold:
- Stop initiating new transfers
- Notify operations
Better to be temporarily unavailable than financially insolvent.
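A deliberately simple sketch of a breaker driven by settlement latency rather than error rate. `SettlementBreaker`, the threshold, and the `notify` callback are assumptions; a production breaker would likely add hysteresis or a cooldown before closing again.

```python
from typing import Callable


class SettlementBreaker:
    """Hypothetical breaker that opens when settlement P95 is too high."""

    def __init__(self, threshold_s: float,
                 notify: Callable[[str], None]) -> None:
        self.threshold_s = threshold_s
        self.notify = notify
        self.is_open = False

    def observe(self, p95_seconds: float) -> None:
        if p95_seconds > self.threshold_s and not self.is_open:
            self.is_open = True
            # Notify operations once, at the moment the breaker trips.
            self.notify(f"settlement P95 {p95_seconds}s exceeds "
                        f"{self.threshold_s}s; new transfers paused")
        elif p95_seconds <= self.threshold_s and self.is_open:
            self.is_open = False

    def allow_new_transfer(self) -> bool:
        return not self.is_open


alerts: list[str] = []
breaker = SettlementBreaker(threshold_s=120, notify=alerts.append)

breaker.observe(45)
before = breaker.allow_new_transfer()
breaker.observe(600)   # latency spikes: breaker opens
breaker.observe(700)   # still high: stays open, no second alert
during = breaker.allow_new_transfer()
breaker.observe(60)    # latency recovers: breaker closes
after = breaker.allow_new_transfer()
```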
- Automated Reconciliation
Scheduled job:
- Re-check all PENDING > X minutes
- Query provider status
- Auto-settle where possible
Never rely solely on webhook arrival.
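A minimal sketch of one reconciliation pass. `PendingTransfer`, `reconcile`, and the status strings returned by `query_provider` are hypothetical; the provider lookup is replaced by a plain dictionary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PendingTransfer:
    transfer_id: str
    state: str          # "PENDING", "SETTLED_SUCCESS" or "SETTLED_FAILURE"
    created_at: float   # epoch seconds


def reconcile(transfers: list[PendingTransfer],
              query_provider: Callable[[str], str],
              now: float, max_age_s: float) -> list[str]:
    """Resolve stale PENDING transfers; return the IDs that were settled."""
    resolved = []
    for t in transfers:
        if t.state != "PENDING" or now - t.created_at < max_age_s:
            continue   # already final, or too fresh to chase yet
        status = query_provider(t.transfer_id)
        if status in ("SETTLED_SUCCESS", "SETTLED_FAILURE"):
            t.state = status
            resolved.append(t.transfer_id)
        # Any other answer leaves the transfer PENDING for the next pass.
    return resolved


provider_view = {"tx-1": "SETTLED_SUCCESS", "tx-2": "PROCESSING",
                 "tx-3": "SETTLED_FAILURE"}
book = [
    PendingTransfer("tx-1", "PENDING", created_at=0),
    PendingTransfer("tx-2", "PENDING", created_at=0),
    PendingTransfer("tx-3", "PENDING", created_at=950),   # too fresh
]
resolved = reconcile(book, provider_view.__getitem__, now=1000, max_age_s=600)
states = [t.state for t in book]
```

Only `tx-1` is settled in this pass. `tx-2` stays pending because the provider has no final answer yet, and `tx-3` is skipped because it is younger than the staleness threshold.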
- Idempotency Everywhere
Every outbound request:
- Unique transfer ID
- Stored before calling provider
- Used to reconcile duplicates
But remember: idempotency suppresses duplicate effects. It doesn’t guarantee exactly-once delivery.
Your ledger invariant is what keeps balances correct when signals arrive late or more than once.
- UX That Reflects Truth
Instead of:
Balance: ₹1000
After transfer: ₹0
Show:
Available: ₹0
Locked: ₹1000 (Pending confirmation)
Total: ₹1000
User sees reality. No fake certainty. No silent exposure.
The Real Lesson
UNKNOWN isn’t a temporary inconvenience.
It’s a state that tests whether your system respects financial invariants.
If you treat UNKNOWN as FAILURE, you risk double-spend.
If you treat UNKNOWN as SUCCESS, you risk false confirmation.
If you ignore UNKNOWN growth, you risk accumulating exposure silently.
Exactly-once execution is a comforting myth.
What actually protects financial systems:
- Strict ledger invariants
- Locked funds modeling
- Exposure monitoring
- Delayed finality handling
- Honest UX
Most systems don’t collapse because engineers misunderstand distributed systems.
They collapse because they optimize for perceived speed before protecting financial invariants.
Exactly-once execution isn’t real.
Delayed signals are.
Retries are.
Unknown states are.
In financial systems, invariants don’t care about your timeout values or UX shortcuts.
They either hold — or money leaks.
And when money leaks, theory stops mattering.
