Real-Time Fraud Detection Architecture at 12K TPS: The Full Reference Stack for 2026
A reference architecture for real-time fraud detection at 12K TPS: event-driven design with Go, PostgreSQL partitioning, ML model serving, and the decisions that hold sub-50ms latency at scale.
This is a reference architecture for the kind of system a fintech team needs when every transaction must be scored before it clears. At 12,000 transactions per second during peak hours, the target is sub-50ms end-to-end latency with zero dropped events. The design below is the pattern we use when scoping this kind of build.
The architecture has three layers. The ingestion layer receives transaction events via a Go service fronted by an AWS ALB. We chose Go over Node.js because goroutines gave us 10x better concurrency for I/O-bound work at this scale. Each event is immediately written to a Kafka topic partitioned by merchant ID, giving us ordering guarantees per merchant while allowing parallel processing across merchants.
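The partition routing can be sketched as hashing the merchant ID and taking it modulo the partition count, which is the effect a Kafka client's hash balancer produces when the message key is the merchant ID. The function name and merchant IDs below are illustrative, not the production code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps a merchant ID to a Kafka partition. Keying every event
// for a merchant to the same partition preserves per-merchant ordering,
// while different merchants fan out across partitions and process in parallel.
func partitionFor(merchantID string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(merchantID))
	return int(h.Sum32()) % numPartitions
}

func main() {
	// All events for one merchant land on the same partition.
	fmt.Println(partitionFor("merchant-7421", 32) == partitionFor("merchant-7421", 32))
	// Different merchants are free to land on different partitions.
	fmt.Println(partitionFor("merchant-7421", 32), partitionFor("merchant-9001", 32))
}
```

In a real producer you would set the merchant ID as the message key and let the client's hash balancer do this mapping for you.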
The scoring layer runs a two-stage pipeline. Stage one applies deterministic rules: velocity checks (same card used 5 times in 60 seconds), geographic impossibility (transaction in Istanbul, then London 10 minutes later), and amount anomalies. These rules alone catch 40% of fraud with near-zero latency because they only require in-memory lookups against a Redis cluster.
Stage two runs the ML model. We serve a LightGBM model via a custom Go service, not a Python Flask wrapper. The model was trained on 18 months of labeled transaction data (2.3M samples, 0.8% fraud rate). Feature engineering was the hard part: we compute 47 features per transaction including rolling averages, time-of-day patterns, merchant category risk scores, and device fingerprint similarity. Model inference takes 8-12ms per transaction.
PostgreSQL handles the persistence layer, but not in the way most people use it. We partition the transactions table by date (monthly) with sub-partitions by risk score range. This keeps hot queries fast: investigating recent high-risk transactions hits a tiny partition instead of scanning terabytes. We also use BRIN indexes on timestamp columns, which are 100x smaller than B-tree indexes for time-series data.
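A sketch of that layout in PostgreSQL DDL, with illustrative table and column names and an assumed 0.8 cutoff for the high-risk range:

```sql
-- Monthly range partitions, each sub-partitioned by risk score,
-- with a BRIN index on the timestamp column.
CREATE TABLE transactions (
    id          bigint        NOT NULL,
    merchant_id bigint        NOT NULL,
    amount      numeric(12,2) NOT NULL,
    risk_score  real          NOT NULL,
    created_at  timestamptz   NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE transactions_2026_01 PARTITION OF transactions
    FOR VALUES FROM ('2026-01-01') TO ('2026-02-01')
    PARTITION BY RANGE (risk_score);

-- Hot partition: recent high-risk rows that investigators actually query.
CREATE TABLE transactions_2026_01_high PARTITION OF transactions_2026_01
    FOR VALUES FROM (0.8) TO (1.01);
CREATE TABLE transactions_2026_01_low PARTITION OF transactions_2026_01
    FOR VALUES FROM (0) TO (0.8);

-- BRIN suits append-mostly, time-ordered data: tiny index, fast range scans.
CREATE INDEX ON transactions_2026_01_high USING brin (created_at);
```

Partition creation for new months would be automated (e.g. a scheduled job), since declarative partitioning does not create future partitions by itself.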
The key engineering decision was making the system eventually consistent rather than strongly consistent. The transaction is approved or flagged within 50ms based on the real-time score. But the full audit trail, merchant risk profile update, and compliance report generation happen asynchronously via Kafka consumers. This separation is what makes the latency target achievable.
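The hot-path/cold-path split can be sketched with a buffered channel standing in for the Kafka topic: the decision returns synchronously inside the latency budget while audit work drains asynchronously. All names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// Decision is returned synchronously on the hot path.
type Decision struct {
	TxID     string
	Approved bool
}

// Pipeline separates the hot path (score, decide) from the cold path
// (audit trail, profile updates) via a buffered channel.
type Pipeline struct {
	audit chan Decision
	wg    sync.WaitGroup
	mu    sync.Mutex
	log   []Decision
}

func NewPipeline() *Pipeline {
	p := &Pipeline{audit: make(chan Decision, 1024)}
	p.wg.Add(1)
	go func() { // cold-path consumer, the stand-in for a Kafka consumer group
		defer p.wg.Done()
		for d := range p.audit {
			p.mu.Lock()
			p.log = append(p.log, d) // stand-in for audit/compliance work
			p.mu.Unlock()
		}
	}()
	return p
}

// Score is the hot path: decide now, hand everything else off.
func (p *Pipeline) Score(txID string, riskScore float64) Decision {
	d := Decision{TxID: txID, Approved: riskScore < 0.8}
	p.audit <- d // effectively non-blocking while the buffer has room
	return d
}

// Close drains the cold path and reports how many events it processed.
func (p *Pipeline) Close() int {
	close(p.audit)
	p.wg.Wait()
	return len(p.log)
}

func main() {
	p := NewPipeline()
	fmt.Println(p.Score("tx-1", 0.12).Approved) // true: approved immediately
	fmt.Println(p.Score("tx-2", 0.95).Approved) // false: flagged immediately
	fmt.Println(p.Close())                      // 2: both audit events processed later
}
```

With Kafka in place of the channel, the cold path also survives process restarts, which a channel alone does not give you.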
What this architecture makes possible once tuned: p95 latency in the high 30s of milliseconds, uptime above 99.9% as a realistic target with the ingestion + Kafka + Go serving design, and an ML stage that catches fraud patterns static rules miss. A canonical example is slow-drip card testing, where small transactions probe card validity over days before a large charge. Static rules alone cover a meaningful share of fraud; the ML stage is what unlocks the harder cases.
The lesson: real-time systems at scale are not about choosing the fastest language or database. They are about designing the right data flow, separating hot path from cold path, and making deliberate trade-offs between consistency and latency. Every millisecond in the hot path was earned through profiling and measurement, not guessing.
Key Takeaways
1. Ingestion: a Go service behind a load balancer, with events immediately written to a Kafka topic partitioned by merchant ID for per-merchant ordering plus parallelism.
2. Scoring stage 1: deterministic rules (velocity, geographic impossibility, amount anomalies) running off Redis at near-zero latency; they catch a meaningful share of fraud before ML runs.
3. Scoring stage 2: a LightGBM model served via a custom Go service (not a Python Flask wrapper), with carefully engineered features per transaction and inference in single-digit to low double-digit milliseconds.
4. Persistence: PostgreSQL partitioned by date with sub-partitions by risk score range, plus BRIN indexes on timestamps, keeps investigation queries fast even at terabyte scale.
5. Latency comes from eventual consistency: the score and the approve-or-flag decision are synchronous and fast, while the audit trail, merchant risk profile updates, and compliance reports run asynchronously via Kafka consumers.
Frequently Asked Questions
Why Go instead of Node.js for the ingestion layer?
For I/O-bound work at this scale, Go's goroutines give significantly better concurrency per core than Node.js. On a 12K TPS target with strict tail-latency budgets, that headroom matters. Node.js is fine at lower TPS, especially if the rest of your stack is JavaScript.
Can I use Python and Flask to serve the ML model?
Technically yes, but at this scale a Python web framework in front of the model adds non-trivial latency and concurrency overhead. Serving the model from a compiled language like Go with direct memory access keeps p95 inference under the latency budget.
Why partition PostgreSQL by date and sub-partition by risk score?
Fraud investigation queries are almost always 'recent high-risk transactions'. Partitioning by date makes the query hit a tiny partition instead of scanning terabytes. Sub-partitioning by risk score means the query can skip low-risk rows entirely. BRIN indexes on timestamps keep index size small.
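As an illustration, an investigation query against a table with assumed `created_at` and `risk_score` columns would look like this; the planner can prune to the recent date partitions and the high-risk sub-partitions before touching any rows:

```sql
-- Recent high-risk transactions: partition pruning on both keys,
-- then a BRIN range scan on the surviving partition.
SELECT id, merchant_id, amount, risk_score, created_at
FROM transactions
WHERE created_at >= now() - interval '24 hours'
  AND risk_score >= 0.8
ORDER BY created_at DESC;
```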
Do I need all of this if my TPS is much lower?
No. Below roughly 500 TPS, a simpler stack (Node.js or Python with Redis, a single well-indexed PostgreSQL instance, synchronous scoring) is easier to operate and plenty fast. This architecture is for when you have proven volume, strict latency budgets, and fraud loss numbers that justify the operational complexity.