Payment System — Idempotency, double-entry ledger, sagas, reconciliation

Architecture

Capacity Estimation

Metric	Value	Notes
Charges/s peak (Black Friday)	~5 K	10× daily mean
Ledger entries/charge	4–8	auth, capture, fee, payout
Ledger writes/s peak	~30 K	append-only
Ledger size growth	~1 TB/year	at 100 M txns/yr
p99 charge latency	< 1 s	end-to-end, excluding 3DS step-up
Acquirer batch settlement	T+1 to T+3	nightly NACHA/SEPA file
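
The table's figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers in the table; the midpoint of 6 entries per charge is an assumption made for the estimate.

```python
# Back-of-envelope check of the capacity table (inputs copied from the table).
PEAK_CHARGES_PER_S = 5_000
ENTRIES_PER_CHARGE = (4, 8)
TXNS_PER_YEAR = 100_000_000
LEDGER_GROWTH_BYTES = 10**12  # ~1 TB/year

# Peak ledger writes/s at the low and high end of entries per charge.
low, high = (PEAK_CHARGES_PER_S * n for n in ENTRIES_PER_CHARGE)
midpoint = (low + high) // 2

# Storage budget per entry implied by ~1 TB/year (assumes 6 entries per charge).
entries_per_year = TXNS_PER_YEAR * 6
bytes_per_entry = LEDGER_GROWTH_BYTES // entries_per_year

print(low, high, midpoint)  # 20000 40000 30000
print(bytes_per_entry)      # 1666
```

The ~30 K writes/s figure is the midpoint of the 20 K to 40 K range, and ~1 TB/year leaves roughly 1.6 KB per entry including indexes and overhead.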

Idempotency Keys

The merchant's SDK retries on network error. Without idempotency, you charge twice. The Idempotency-Key header (Stripe convention) is a client-supplied UUID; the server stores (key, request_hash, response_payload, status) for 24 hours. On retry: same key + same hash → replay the saved response. Same key + different hash → 409 Conflict (the merchant has a bug).

Storage: Redis with one entry per idempotency key and a TTL works for low volume; at scale, Postgres with a unique constraint on (merchant_id, key) is more durable. The server's logic must be: read-or-create, then execute, then finalize. If the process crashes between create and finalize, the next retry sees an in-flight record and must wait or take over; it must never start a parallel charge.
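
The read-or-create, execute, finalize sequence can be sketched as follows. This is a minimal illustration, not a production handler: SQLite stands in for Postgres, the table and function names (idempotency, handle_charge, request_hash) are hypothetical, the 503 returned for an in-flight record is an arbitrary choice, and TTL expiry and crash takeover are left out.

```python
import hashlib
import json
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for Postgres in this sketch
db.execute("""
    CREATE TABLE idempotency (
        merchant_id      TEXT NOT NULL,
        idem_key         TEXT NOT NULL,
        request_hash     TEXT NOT NULL,
        status           TEXT NOT NULL,   -- 'in_flight' or 'done'
        response_payload TEXT,
        UNIQUE (merchant_id, idem_key)
    )
""")

def request_hash(body):
    # Canonical JSON so that key order does not change the hash.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

def handle_charge(merchant_id, idem_key, body, execute):
    """Return (http_status, payload); `execute` performs the real charge."""
    h = request_hash(body)
    try:
        # Create: the unique constraint lets exactly one request win.
        with db:
            db.execute(
                "INSERT INTO idempotency VALUES (?, ?, ?, 'in_flight', NULL)",
                (merchant_id, idem_key, h),
            )
    except sqlite3.IntegrityError:
        # Read: a record for this key already exists.
        stored_hash, status, payload = db.execute(
            "SELECT request_hash, status, response_payload FROM idempotency "
            "WHERE merchant_id = ? AND idem_key = ?",
            (merchant_id, idem_key),
        ).fetchone()
        if stored_hash != h:
            return 409, {"error": "key reused with a different request"}
        if status == "in_flight":
            return 503, {"error": "original request still in flight"}
        return 200, json.loads(payload)  # replay the saved response
    # Execute, then finalize.
    payload = execute(body)
    with db:
        db.execute(
            "UPDATE idempotency SET status = 'done', response_payload = ? "
            "WHERE merchant_id = ? AND idem_key = ?",
            (json.dumps(payload), merchant_id, idem_key),
        )
    return 200, payload

calls = []

def fake_execute(body):
    calls.append(body)
    return {"charge_id": "ch_demo", "amount": body["amount"]}

first = handle_charge("m_demo", "key-1", {"amount": 10000}, fake_execute)
retry = handle_charge("m_demo", "key-1", {"amount": 10000}, fake_execute)
clash = handle_charge("m_demo", "key-1", {"amount": 20000}, fake_execute)
print(first == retry, len(calls), clash[0])  # True 1 409
```

The retry replays the stored response without calling the charge function again, and the same key with a different body is rejected.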

Double-Entry Ledger

Every business event is recorded as at least two entries: a debit on one account and a credit on another, which together sum to zero. A $100 charge produces:

On authorization: debit customer funds-receivable $100, credit merchant pending-balance $100.
On capture: debit merchant pending-balance $100, credit merchant available-balance $100.
Fee: debit merchant available-balance $2.90, credit platform revenue $2.90.

The ledger is append-only: never UPDATE or DELETE. Corrections are reversing entries that net to the desired state. This makes audits straightforward: the signed sum of an account's entries (debits positive, credits negative, filtered by account_id = ?) always gives the current balance, and adding a ts <= ? bound gives the balance at any historical timestamp.

Schema: (entry_id, txn_id, account_id, direction, amount, currency, ts). Index by (account_id, ts) for balance queries; insert all rows that share a txn_id in one database transaction so they commit atomically, and cluster by txn_id to keep them physically together. Postgres handles billions of rows here; for higher volume, use a custom append-only store (Stripe wrote one; TigerBeetle is open source).
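
A minimal sketch of this schema and the balance queries, with SQLite standing in for Postgres. The table name entries, the helper names post and balance, the account names, and the convention that debits count as positive are all assumptions of the sketch.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE entries (
        entry_id   INTEGER PRIMARY KEY,
        txn_id     TEXT NOT NULL,
        account_id TEXT NOT NULL,
        direction  TEXT NOT NULL CHECK (direction IN ('debit', 'credit')),
        amount     INTEGER NOT NULL CHECK (amount > 0),  -- minor units
        currency   TEXT NOT NULL,
        ts         INTEGER NOT NULL
    )
""")
db.execute("CREATE INDEX entries_account_ts ON entries (account_id, ts)")

def post(txn_id, ts, currency, legs):
    """Append one balanced transaction; legs are (account_id, direction, amount)."""
    debits = sum(amt for _, d, amt in legs if d == "debit")
    credits = sum(amt for _, d, amt in legs if d == "credit")
    if debits != credits:
        raise ValueError("unbalanced transaction")
    with db:  # all legs commit together or not at all
        db.executemany(
            "INSERT INTO entries (txn_id, account_id, direction, amount, currency, ts) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            [(txn_id, acct, d, amt, currency, ts) for acct, d, amt in legs],
        )

def balance(account_id, as_of=None):
    """Debits minus credits for one account, optionally as of a timestamp."""
    sql = (
        "SELECT COALESCE(SUM(CASE direction WHEN 'debit' THEN amount "
        "ELSE -amount END), 0) FROM entries WHERE account_id = ?"
    )
    params = [account_id]
    if as_of is not None:
        sql += " AND ts <= ?"
        params.append(as_of)
    return db.execute(sql, params).fetchone()[0]

# The $100 charge: authorization, capture, then the $2.90 fee.
post("txn_auth", 1, "USD", [("customer_receivable", "debit", 10000),
                            ("merchant_pending", "credit", 10000)])
post("txn_capture", 2, "USD", [("merchant_pending", "debit", 10000),
                               ("merchant_available", "credit", 10000)])
post("txn_fee", 3, "USD", [("merchant_available", "debit", 290),
                           ("platform_revenue", "credit", 290)])

print(balance("merchant_available"))         # -9710 (credit balance: $97.10 owed to merchant)
print(balance("merchant_pending"))           # 0
print(balance("merchant_pending", as_of=1))  # -10000
```

A negative result here is a credit balance, so the merchant is owed $97.10 after the fee. Nothing is updated in place; the historical query reads the same rows with a timestamp bound.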

Stripe-style Architecture: Charges → Balances

The user-facing API exposes resources (Charge, Refund, Payout, Dispute). Each resource has a state machine: a Charge moves requires_action → processing → succeeded or failed. Behind each transition, the orchestrator writes ledger entries.
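
The Charge state machine described above can be sketched as a transition table. The enum, the ALLOWED map, and the transition helper are hypothetical names; treating succeeded and failed as terminal is an assumption, and the ledger write is represented by a callback.

```python
from enum import Enum

class ChargeState(Enum):
    REQUIRES_ACTION = "requires_action"
    PROCESSING = "processing"
    SUCCEEDED = "succeeded"
    FAILED = "failed"

# Legal transitions; terminal states allow none (an assumption of this sketch).
ALLOWED = {
    ChargeState.REQUIRES_ACTION: {ChargeState.PROCESSING},
    ChargeState.PROCESSING: {ChargeState.SUCCEEDED, ChargeState.FAILED},
    ChargeState.SUCCEEDED: set(),
    ChargeState.FAILED: set(),
}

def transition(current, target, write_ledger_entries):
    """Validate the move, let the orchestrator write ledger entries, return the new state."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    write_ledger_entries(current, target)
    return target

written = []
state = ChargeState.REQUIRES_ACTION
state = transition(state, ChargeState.PROCESSING, lambda a, b: written.append((a, b)))
state = transition(state, ChargeState.SUCCEEDED, lambda a, b: written.append((a, b)))
print(state.value, len(written))  # succeeded 2
```

Rejecting illegal moves before the ledger callback runs means a bad request cannot produce ledger entries.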

Balances are a materialized view over the ledger: do not trust the cached balance, trust the ledger. The cache is for read latency only; the truth is reconstructable from the journal. This distinction is what makes a claim like "we have never lost a cent" defensible.

Saga Pattern for Refunds and Multi-step Flows

A refund touches: (a) the card network (reverse the auth or issue a refund), (b) the ledger (reversing entries), (c) the merchant's payout (clawback if already paid out), (d) the customer notification. These steps cannot be wrapped in a single DB transaction because they span external services.

The saga: model each step as a transaction with a compensating step. Drive the saga from a durable workflow engine (Temporal, AWS Step Functions, or a custom workflow scheduler). On any step failure, run compensations in reverse order. The workflow engine's job is to never lose a saga in flight, even if the host crashes.

Choreography (events trigger next step) vs orchestration (central coordinator): pick orchestration for payments. The visibility into a single workflow execution (which step failed, what compensations ran) is worth the central coupling.
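
A minimal in-memory sketch of an orchestrated saga with reverse-order compensation. It is not durable, which is exactly the part a workflow engine provides; run_saga and the step names are hypothetical, and the step bodies are stubs.

```python
def run_saga(steps):
    """steps: list of (name, action, compensation). Returns (ok, log)."""
    completed, log = [], []
    for name, action, compensate in steps:
        try:
            action()
        except Exception:
            log.append(f"failed:{name}")
            # Undo the steps that already succeeded, most recent first.
            for done_name, undo in reversed(completed):
                undo()
                log.append(f"compensated:{done_name}")
            return False, log
        log.append(f"done:{name}")
        completed.append((name, compensate))
    return True, log

def noop():
    pass

def clawback_fails():
    raise RuntimeError("payout clawback rejected")

refund_steps = [
    ("card_network_refund", noop, noop),
    ("ledger_reversal", noop, noop),
    ("payout_clawback", clawback_fails, noop),
    ("notify_customer", noop, noop),
]

ok, log = run_saga(refund_steps)
print(ok)   # False
print(log)
# ['done:card_network_refund', 'done:ledger_reversal', 'failed:payout_clawback',
#  'compensated:ledger_reversal', 'compensated:card_network_refund']
```

The log is the visibility argument in miniature: one place records which step failed and which compensations ran.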

Dispute Handling

A chargeback flips the money flow: the customer's bank claws back the charge, and the merchant must respond with evidence within a deadline set by the card network and processor, often as short as ~7 days. The system tracks disputed as a state on the original charge, holds the funds in a frozen sub-balance, and exposes evidence-upload APIs (receipts, shipping proofs). If the dispute is won, release the frozen funds and reverse the network fee. If it is lost, the chargeback is final and the merchant balance is debited.
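
The fund movements for a dispute can be sketched with two sub-balances. For brevity this mutates in-memory balances directly, whereas each movement would really be a set of ledger entries; the names, the fee amount, and charging the network fee when the dispute opens are assumptions of the sketch.

```python
from dataclasses import dataclass

@dataclass
class MerchantBalances:
    available: int   # minor units
    frozen: int = 0  # funds held while a dispute is open

def open_dispute(b, amount, network_fee):
    """Move the disputed amount into the frozen sub-balance and charge the fee."""
    b.available -= amount + network_fee
    b.frozen += amount

def resolve_dispute(b, amount, network_fee, won):
    """Won: release funds and reverse the fee. Lost: frozen funds leave for good."""
    b.frozen -= amount
    if won:
        b.available += amount + network_fee

won_case = MerchantBalances(available=50000)
open_dispute(won_case, amount=10000, network_fee=1500)
resolve_dispute(won_case, amount=10000, network_fee=1500, won=True)

lost_case = MerchantBalances(available=50000)
open_dispute(lost_case, amount=10000, network_fee=1500)
resolve_dispute(lost_case, amount=10000, network_fee=1500, won=False)

print(won_case)   # MerchantBalances(available=50000, frozen=0)
print(lost_case)  # MerchantBalances(available=38500, frozen=0)
```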

Disputes are a fraud signal; a high dispute rate lands the merchant, and eventually its acquirer, in a card-network monitoring program (Visa VDMP/VAMP). The system must monitor dispute rates and throttle high-risk merchants automatically.

PCI DSS Scope Minimization

Storing PAN (full card number) puts your entire infrastructure in PCI scope: quarterly external scans, annual SAQ-D, network segmentation, the works. Mitigation: tokenize at the edge. The card-input form posts directly to a vault (Stripe Elements, Braintree Hosted Fields), which returns a token (tok_xxx) that flows through your systems. The vault carries the full PCI burden; your own systems never see the PAN and drop to a much smaller self-assessment (typically SAQ A).

Tokenization vs encryption: encryption keeps the PAN recoverable by anyone who holds the key; tokenization replaces it with a random value that maps back to the PAN only inside the vault. Always tokenize; encryption is the second line of defense if you must persist the PAN yourself.
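
A toy vault makes the difference concrete: the token is random, so nothing outside the vault can derive the PAN from it. The Vault class is hypothetical, keeps its mapping in memory, and omits access control and encryption at rest; the card number is the well-known 4242 test number.

```python
import secrets

class Vault:
    """Toy token vault: the only component that ever holds the PAN."""

    def __init__(self):
        self._pan_by_token = {}

    def tokenize(self, pan):
        # The token is random, not derived from the PAN.
        token = "tok_" + secrets.token_urlsafe(16)
        self._pan_by_token[token] = pan
        return token

    def detokenize(self, token):
        return self._pan_by_token[token]

vault = Vault()
pan = "4242424242424242"
token_a = vault.tokenize(pan)
token_b = vault.tokenize(pan)

print(token_a.startswith("tok_"))        # True
print(token_a != token_b)                # True: the same PAN gets unrelated tokens
print(vault.detokenize(token_a) == pan)  # True
```

Everything downstream of the vault stores and passes only the tok_ value.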

The Reconciliation Pipeline

Every morning the acquirer drops a settlement file: list of authorizations, captures, refunds, fees, with their amounts. Reconciliation matches each line against the ledger:

Match: amounts agree, mark txn reconciled.
Acquirer-only: a charge in the file with no ledger entry — you accepted money you did not record. Investigation required.
Ledger-only: an entry with no acquirer record — you booked revenue that did not arrive. Either lag or a real loss.
Amount mismatch: usually fee calculation or FX; reconcile to a documented adjustment account.

The output is a break report reviewed by ops daily. Unreconciled balances above a threshold halt new captures. This is the single most important control in the system; it catches integration bugs, fraud, and accounting errors before they compound.
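
The four outcomes and the halt rule can be sketched as below. Representing both the settlement file and the ledger as maps from transaction id to amount in minor units is a simplification, and the function names, sample data, and threshold are assumptions of the sketch.

```python
def reconcile(acquirer, ledger):
    """Classify every txn id seen on either side into one of four buckets."""
    report = {"match": [], "acquirer_only": [], "ledger_only": [], "amount_mismatch": []}
    for txn_id in sorted(acquirer.keys() | ledger.keys()):
        if txn_id not in ledger:
            report["acquirer_only"].append(txn_id)
        elif txn_id not in acquirer:
            report["ledger_only"].append(txn_id)
        elif acquirer[txn_id] == ledger[txn_id]:
            report["match"].append(txn_id)
        else:
            report["amount_mismatch"].append(txn_id)
    return report

def unreconciled_total(report, acquirer, ledger):
    """Sum of the money involved in breaks, in minor units."""
    total = sum(abs(acquirer[t]) for t in report["acquirer_only"])
    total += sum(abs(ledger[t]) for t in report["ledger_only"])
    total += sum(abs(acquirer[t] - ledger[t]) for t in report["amount_mismatch"])
    return total

acquirer_file = {"t1": 10000, "t2": 5000, "t3": 2500}
ledger_rows = {"t1": 10000, "t3": 2490, "t4": 700}

report = reconcile(acquirer_file, ledger_rows)
breaks = unreconciled_total(report, acquirer_file, ledger_rows)
HALT_THRESHOLD = 5000  # hypothetical limit in minor units
halt_captures = breaks > HALT_THRESHOLD

print(report)
# {'match': ['t1'], 'acquirer_only': ['t2'], 'ledger_only': ['t4'], 'amount_mismatch': ['t3']}
print(breaks, halt_captures)  # 5710 True
```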

Failure Modes

Network split during capture: the customer was charged but the ledger write never happened. Idempotency plus retry resolves it; design the saga so the network call is the last step or has its own dedup table.
Currency rounding: never use float; store integer minor units (cents). One float bug eats your reconciliation report.
Replay attack on idempotency keys: namespace keys per merchant API key; reject keys older than the TTL; bind keys to the merchant identity, not just to the request.
Time travel: clock skew across servers causes misordered settlement. Use a single trusted clock for ledger sequencing (or a Lamport-style monotonic counter).
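
Two of these items are easy to show in code: integer minor units instead of floats, and a Lamport-style counter for sequencing. The helper names are hypothetical, and expressing the fee rate in basis points with round-half-up is an assumption of the sketch.

```python
from decimal import Decimal

# Binary floats cannot represent most decimal fractions exactly.
print(0.1 + 0.2 == 0.3)  # False

def to_minor_units(amount, exponent=2):
    """Parse a decimal string such as '19.99' into integer minor units."""
    scaled = Decimal(amount).scaleb(exponent)
    if scaled != scaled.to_integral_value():
        raise ValueError("more precision than the currency allows")
    return int(scaled)

def fee_minor(amount_minor, rate_bps):
    """Fee in minor units, rounded half up, using integer arithmetic only."""
    return (amount_minor * rate_bps + 5000) // 10000

print(to_minor_units("19.99"))  # 1999
print(fee_minor(10000, 290))    # 290, i.e. 2.9% of $100.00

class LamportCounter:
    """Monotonic logical clock: orders events without trusting wall clocks."""

    def __init__(self):
        self.value = 0

    def tick(self):
        # Local event: advance by one.
        self.value += 1
        return self.value

    def observe(self, remote_value):
        # Message from another node: jump past the larger of the two clocks.
        self.value = max(self.value, remote_value) + 1
        return self.value

clock = LamportCounter()
print(clock.tick(), clock.observe(10), clock.tick())  # 1 11 12
```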

FAQ

Why not use 2PC across services?

Two-phase commit needs all participants online and willing to block. External card networks do not participate; bank settlements are batch. Sagas with compensations match the real-world failure model.

Postgres or a specialized ledger DB?

Postgres handles single-region payment volume up to billions of entries with partitioning. Move to TigerBeetle / a purpose-built ledger only when you exceed ~50 K writes/s sustained or need cross-region active-active.

Should the ledger be the source of truth or a derived view?

The ledger is the source of truth. Balances, payouts, and reports are derived. This is the inversion that makes correctness debuggable.

How do you handle currencies?

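Each ledger entry has an explicit currency; never mix currencies in the same account. FX is a separate transaction whose entries cross currency walls: each currency's legs balance on their own (for example against an FX clearing account per currency), and the transaction carries a rate snapshot.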