Real-Time Fraud Detection Architecture: A Production Guide
Fraud losses are not an abstraction
The Nilson Report puts global card fraud losses at $33.8 billion in 2023, on track to exceed $40 billion by 2027. ACI Worldwide’s 2025 Global eCommerce Fraud Report estimates that card-not-present fraud accounts for 60-70% of those losses — the channel where a real-time scoring decision is the only line of defence before money leaves the ecosystem. Juniper Research projects total payment fraud losses at $91 billion annually by 2028 when account takeover, synthetic identity fraud, and authorised push payment scams are included.
These numbers are not the point. The point is that fraud is a real-time adversarial problem. Fraudsters adapt to yesterday’s rules within hours. Batch fraud analysis — reviewing transactions after settlement — is useful for pattern discovery and chargeback recovery, but it does not stop fraud. It documents it.
This guide covers what it actually takes to build a fraud detection system that makes decisions in milliseconds, generalises to novel attack patterns, and does not destroy customer experience with false positives. The scope is deliberately broad: payments, e-commerce, telco, insurance, and fintech. A deeper treatment of banking-specific AI fraud signals is in the AI fraud detection in banking guide.
What is real-time fraud detection? (and what is NOT)
Real-time fraud detection is the practice of making an accept, decline, or step-up decision on an in-flight transaction before that transaction completes — typically within 100-500 ms from the moment a payment is initiated. The decision happens while the transaction is still pending in the authorisation flow; reversing it after settlement is far more expensive.
What it is:
- Decisioning in roughly 100-500 ms depending on channel (see latency budget below)
- In-flight transaction — the card authorisation has been initiated, the API call is in progress, the checkout session is open
- Three possible outputs: approve (no friction), step-up (require additional authentication), decline (reject)
What it is NOT:
- Batch fraud analytics — running daily SQL jobs to flag suspicious accounts is useful for investigation and rule calibration, but it does not stop the fraud event
- Post-settlement review — monitoring chargebacks and disputes gives you labelled training data, but you are measuring historical fraud rate, not intercepting it
- Offline risk scoring — computing a risk score in a nightly batch job and attaching it to a customer profile for the next day’s transactions is near-real-time at best; a fraudster with stolen credentials can empty an account in the 14 hours before the score is updated
The operational distinction matters: batch analytics teams and real-time decisioning systems have fundamentally different infrastructure, latency budgets, failure modes, and organisational ownership. Mixing them up is the source of most architectural mistakes in fraud platform design.
The decisioning budget: latency, cost, accuracy
Every real-time fraud system operates under a latency budget that is determined by the channel, not the engineering team. Exceed it and you cause checkout abandonment or violate card network SLAs; let the check time out and fail open, and the transaction authorises without any fraud decision at all.
| Use case | p99 latency budget | Notes |
|---|---|---|
| Card-present (POS) | 100 ms | Visa/Mastercard authorisation SLA; this is the hard ceiling |
| Card-not-present (e-commerce) | 250 ms | Empirical: checkout conversion drops ~1.5% per 100 ms beyond 300 ms |
| Account opening | 2 s | User expects a loading state; allows richer KYC signals |
| Login / account takeover prevention | 500 ms | Session stays open; user tolerates a brief check |
| Authorised push payment (A2A transfer) | 1 s | Banks typically allow 1-2 s for outbound payment screening |
| Telco SIM swap | 3 s | User is in a store or IVR; higher tolerance |
The latency budget determines your entire stack. At 100 ms, you cannot afford:
- Model inference on GPU instances behind a REST API with cold starts (50-150 ms GPU warm-up alone)
- More than 2-3 sequential network calls to external enrichment services
- Synchronous database reads without an in-memory cache layer
The accuracy vs latency trade-off is real: a gradient-boosted model served from a pre-loaded, in-memory runtime adds 10-30 ms. A transformer-based sequence model for session behaviour adds 50-150 ms. A GNN traversal over a real-time graph adds 100-300 ms. At 100 ms total budget, you must choose. At 2 s, you can compose all three.
Cost is the third constraint. Running ML inference on every transaction is expensive at scale. A payment processor handling 1,000 transactions per second through a $0.0005/call inference endpoint spends roughly $15.8M per year on scoring alone. Tiered scoring — rules first, lightweight model second, expensive deep model only for ambiguous cases — reduces inference cost by 60-80% with minimal AUC impact.
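The cost arithmetic behind tiered scoring can be sketched directly. The per-call prices and the ~15% ambiguous-case fraction below are illustrative assumptions, not measured figures:

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

def annual_cost(tps: float, cost_per_call: float) -> float:
    """Annual inference spend at a flat per-call price."""
    return tps * cost_per_call * SECONDS_PER_YEAR

# Flat scoring: every transaction hits the expensive model endpoint.
flat = annual_cost(1_000, 0.0005)

# Tiered: rules are effectively free, a lightweight model scores every
# transaction at an assumed $0.00005/call, and only the ~15% of cases it
# finds ambiguous (150 TPS) reach the $0.0005/call deep model.
tiered = annual_cost(1_000, 0.00005) + annual_cost(150, 0.0005)

print(f"flat:   ${flat:,.0f}/yr")    # ~$15.8M
print(f"tiered: ${tiered:,.0f}/yr")  # ~$3.9M — roughly a 75% reduction
```

Under these assumptions the tiered design lands comfortably inside the 60-80% saving quoted above; the real saving depends on how aggressively the lightweight tier can resolve unambiguous cases.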
Reference architecture
A production fraud detection system is not a single model. It is a pipeline of specialised components, each handling a distinct function, connected to minimise latency and maximise observability.
Transaction ingress. Every transaction enters through an API gateway (Kong, AWS API Gateway, or a custom reverse proxy). The gateway validates the request, enforces rate limits, and publishes the raw event to a Kafka topic. Using Kafka here is important: it decouples the ingress from the scoring pipeline, provides durability if the scoring system has a brief outage, and enables replay for debugging and model validation.
Feature store. The most complex component. Features needed for fraud scoring come from two sources: real-time features computed at inference time (transaction velocity in the last 1 hour, device fingerprint match, IP reputation) and batch features pre-computed from historical data (customer lifetime value, historical fraud rate by MCC, typical transaction time-of-day). A proper feature store (Feast, Tecton, or a custom Redis-backed service) serves both: it reads pre-computed batch features from a low-latency store (Redis, DynamoDB) and computes real-time features against a sliding window maintained in Redis Sorted Sets or Apache Flink. At 100 ms budget, every feature lookup needs to return in under 5-10 ms; this means the feature store must be co-located with the scoring service, not accessed over a WAN.
Rules engine. Before ML scoring, a rules engine handles deterministic fraud signals: card on blocklist, BIN country mismatch with shipping address, transaction amount above configured threshold, more than 5 declined transactions in the last 10 minutes on the same card. Rules fire in 1-5 ms. Common implementations are Drools (JVM, feature-rich, high operational overhead), Open Policy Agent (lightweight, policy-as-code, easier to test and deploy), or a custom in-memory rule evaluator. Rules are not a legacy alternative to ML — they are a complement. They handle known patterns with zero variance and give compliance and operations teams a tool they can reason about without data science involvement.
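A custom in-memory rule evaluator of the kind mentioned above can be very small: rules as named predicates over the transaction dict, evaluated in order. This is a sketch with an illustrative rule set, not a production engine:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Rule:
    name: str
    predicate: Callable[[dict], bool]  # True means the rule fires

@dataclass
class RulesEngine:
    rules: list[Rule] = field(default_factory=list)

    def evaluate(self, txn: dict) -> list[str]:
        """Return the names of every rule that fires for this transaction."""
        return [r.name for r in self.rules if r.predicate(txn)]

engine = RulesEngine([
    Rule("card_on_blocklist",    lambda t: t["card_hash"] in t["blocklist"]),
    Rule("bin_country_mismatch", lambda t: t["bin_country"] != t["ship_country"]),
    Rule("velocity_declines",    lambda t: t["decline_count_10m"] > 5),
])

txn = {"card_hash": "c1", "blocklist": set(), "bin_country": "ES",
       "ship_country": "FR", "decline_count_10m": 2}
print(engine.evaluate(txn))  # ['bin_country_mismatch']
```

Because rules are plain data, operations teams can add or retire them through configuration without a code deployment — the property that makes rules the fast-response tool of the stack.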
ML scoring service. The rules engine passes non-blocked transactions to an ML scoring service. The scorer loads a pre-trained model (XGBoost, LightGBM, or a neural model depending on use case) into memory at startup and performs synchronous inference. Serving infrastructure options: KServe on Kubernetes (most flexible, good for multi-model serving), BentoML (Python-native, lower ops overhead), SageMaker Endpoints (managed, AWS-native, higher cost). The scorer outputs a risk score in [0, 1] and, optionally, a model explanation (SHAP values for the top-3 contributing features). The explanation is not decorative — fraud analysts need it to investigate cases and regulators increasingly require it.
Orchestration layer. A lightweight orchestrator (Temporal, AWS Step Functions, or a custom state machine) sequences the rules engine, feature enrichment, and ML scoring, handles timeouts and fallbacks (if ML scoring is unavailable, fall back to rules-only), and emits a final decision: approve / step-up / decline. The orchestrator also handles the step-up flow: for transactions that score in the ambiguous range (0.3-0.7), it can trigger an OTP, biometric challenge, or 3DS redirect before returning a final decision.
Case management. Transactions that score above the decline threshold but below a hard-block ceiling go to a case management queue for human review. This is the human-in-the-loop component. Analysts review flagged transactions, mark them as fraud or legitimate, and those labels feed back into the training pipeline. Tools like Hummingbird, Unit21, or a custom internal tool handle this queue. The feedback loop closes here: labelled cases become new training data.
Feedback loop. The training pipeline pulls labelled outcomes from case management, confirmed chargebacks from the payment network, and analyst corrections. It retrains the champion model on a schedule, runs automated evaluation against a held-out validation set, and publishes the candidate model to a staging environment. Champion/challenger deployment (5-10% of live traffic routed to the challenger model) validates production performance before full rollout.
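The 5-10% champion/challenger split is usually implemented as deterministic hashing rather than random sampling, so the same entity always hits the same model and every decision is replayable. A minimal sketch:

```python
import hashlib

def route_model(txn_id: str, challenger_pct: float = 0.05) -> str:
    """Deterministically route a fixed fraction of traffic to the challenger.

    Hashing the transaction (or customer) ID gives a stable, reproducible
    assignment: re-running the same ID always yields the same model, which
    keeps the evaluation unbiased and makes incident replay possible.
    """
    digest = hashlib.sha256(txn_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "challenger" if bucket < challenger_pct else "champion"
```

Hashing on customer ID instead of transaction ID is the safer choice when the two models would otherwise flip-flop on consecutive transactions from the same account.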
The flow, described textually: a transaction enters the API gateway → published to Kafka → consumed by the fraud orchestrator → feature store called (parallel reads for real-time and batch features) → rules engine evaluates → if not blocked, ML scoring service scores → orchestrator applies decision thresholds → result returned to calling system → decision and features logged to event store → human review queue populated for ambiguous cases → analyst labels flow back to feature store and training pipeline.
Models that actually work in production
Gradient-boosted trees (XGBoost, LightGBM). The workhorse of fraud detection. They handle tabular data (transaction amount, MCC, device fingerprint hash, velocity counts) natively, train in minutes on millions of examples, serve inference in 5-15 ms, and produce SHAP explanations without additional infrastructure. In published benchmarks and our own deployments, LightGBM with proper feature engineering achieves AUC 0.94-0.97 on card fraud datasets. Start here. The failure mode is feature distribution shift — these models degrade silently when fraud patterns change, requiring robust drift monitoring.
Isolation forest and autoencoders (unsupervised). Valuable for anomaly detection when labelled fraud data is sparse — new merchant onboarding, new geography, novel fraud pattern that has not yet been labelled. Isolation forest builds a distribution of normal transaction patterns and flags outliers; it adds no labelled data requirement but produces a coarser signal. Autoencoders learn a compressed representation of normal behaviour and flag reconstruction errors as anomalies. Use these as a first-pass signal for cold-start scenarios or as a supplementary score alongside the supervised model, not as a primary classifier.
Graph neural networks (GNNs). Fraud rings and mule networks are graph problems: a set of accounts, devices, phone numbers, and addresses connected in patterns that look innocuous individually but reveal organised fraud collectively. GNNs learn embeddings that capture these structural patterns. A node (account or transaction) is flagged not because its own features are suspicious but because it sits in a subgraph with known bad actors. Production GNN serving is expensive (100-300 ms for a graph traversal on a moderately-sized subgraph) and complex — reserved for account opening fraud, synthetic identity detection, and authorised push payment scams where the latency budget allows it. GraphSAGE and PyTorch Geometric are the standard tooling.
Transformer-based sequence models. Session behaviour — the sequence of clicks, page views, form interactions, and API calls leading up to a transaction — encodes strong fraud signals. A human who owns an account navigates it differently from a bot or account takeover actor. Transformers model this sequential dependency better than tabular models. The input is a sequence of events (each event encoded as a vector of features) and the output is a session-level risk score. Serving latency: 50-150 ms depending on sequence length and model size. Use when you have a web or mobile session context and the latency budget allows it; card-not-present e-commerce is the primary use case.
Feature engineering for fraud (with examples)
The quality of your feature set matters more than the choice of model. A well-engineered feature set with LightGBM will consistently outperform a poorly-engineered feature set with a deep learning model.
Velocity features — count and sum of transactions over sliding time windows. The most predictive feature category:
- card_txn_count_1h: number of transactions on this card in the last hour
- card_merchant_distinct_1h: number of distinct merchants charged in the last hour
- card_decline_count_24h: number of declines in the last 24 hours
- ip_txn_count_5m: number of transactions from this IP address in the last 5 minutes
- device_id_card_distinct_7d: number of distinct cards used from this device fingerprint in the last 7 days
Device fingerprint features — signals derived from the device initiating the transaction:
- device_first_seen_days: days since this device fingerprint was first observed on the platform
- device_country_mismatch: Boolean — device IP geolocation country differs from card billing country
- browser_automation_score: probability that the browser session is automated (headless Chrome, Selenium indicators)
Behavioural biometrics features — how the user interacts with the session:
- keystroke_anomaly_score: deviation of typing rhythm from the account holder’s historical baseline
- session_field_autofill_ratio: fraction of form fields filled by autofill vs manual typing (bots autofill at higher rates)
- time_to_checkout_seconds: seconds from page load to checkout submission (unusually fast = likely automated)
Graph features — structural signals from the transaction graph:
- shared_device_fraud_count_30d: number of confirmed fraud transactions from accounts sharing this device fingerprint in the last 30 days
- ip_subnet_fraud_rate_7d: fraud rate among all transactions from the same /24 IP subnet in the last 7 days
The critical implementation requirement: velocity and graph features must be computed from a real-time window, not a daily batch snapshot. A fraudster card-testing 50 cards in the last 20 minutes does not show up in yesterday’s card_txn_count_1d batch feature. Use Redis Sorted Sets with TTL-based expiry for real-time velocity counters updated at ingestion time.
Cold-start and concept drift
Cold-start refers to the absence of historical data for a new entity: a new merchant processing their first transactions, a customer making their first purchase in a new geography, a new fraud pattern that has no labelled examples yet.
For new merchants: use peer-group models segmented by MCC code and average ticket size. A new restaurant with MCC 5812 in Madrid behaves similarly enough to other Madrid restaurants that a peer-group model trained on those merchants generalises adequately for the first 2-4 weeks. Set conservative velocity thresholds during the cold-start period and escalate ambiguous transactions to human review rather than declining them — new merchants are sensitive to false positives.
For new geographies: global models trained on diverse geographies generalise better than single-geography models. If expanding to a new market, fine-tune the global model on the closest available market data (similar payment culture, similar fraud typology) rather than training from scratch.
For new fraud patterns: unsupervised anomaly detection (isolation forest, autoencoder) is your first detector. When a new attack vector emerges, it produces anomaly scores before labelled examples exist. Combine anomaly scores with fast rule deployment — your fraud operations team needs a mechanism to deploy a new rule within minutes of identifying a pattern, without a model retraining cycle.
Concept drift — the statistical properties of fraud patterns changing over time — is the persistent maintenance challenge. Drift manifests as: AUC declining on recent data relative to historical validation; PSI (Population Stability Index) exceeding 0.2 on key input features; fraud rate rising faster than the model’s recall improvement. Mitigation: instrument your feature distributions in production (a Prometheus histogram per feature per day), alert on PSI thresholds, and trigger ad-hoc retraining when alerts fire. Do not wait for the weekly schedule if the deployment environment is shifting.
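The PSI check described above is a small computation once feature values are binned into shares. A minimal sketch, using the conventional >0.2 alert threshold:

```python
import math

def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two binned share distributions.

    Inputs are per-bin fractions (each list sums to ~1); expected is the
    training-time baseline, actual the production window. A small epsilon
    guards against empty bins. Rule of thumb: <0.1 stable, 0.1-0.2
    moderate shift, >0.2 significant shift (alert and retrain).
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

baseline = [0.25, 0.25, 0.25, 0.25]  # training-time bin shares
drifted  = [0.10, 0.20, 0.30, 0.40]  # production bin shares
print(round(psi(baseline, drifted), 3))  # 0.228 — above the 0.2 alert line
```

Running this per feature per day against the Prometheus histograms mentioned above is enough to drive the ad-hoc retraining trigger.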
PSD2/SCA and fraud detection in EU/Spain
The Revised Payment Services Directive (PSD2), implemented across the EU and the UK, requires Strong Customer Authentication (SCA) for most electronic payments. SCA demands at least two factors from: something you know (PIN), something you have (phone/token), something you are (biometric). For e-commerce, this typically means 3D Secure 2.x.
Article 18 of the EBA Regulatory Technical Standards on SCA grants a Transaction Risk Analysis (TRA) exemption: issuers and PSPs may skip SCA for low-value, low-risk transactions if their fraud rate on such transactions remains below defined ceilings:
| Exemption value ceiling (ETV) | Maximum reference fraud rate (remote card payments) |
|---|---|
| Up to €100 | 0.13% |
| Up to €250 | 0.06% |
| Up to €500 | 0.01% |
Operationalising this exemption requires:
- A certified real-time TRA scoring system that outputs a risk decision before the payment is routed
- Continuous monitoring of fraud rates per exemption tier, broken down by the PSP or issuer claiming the exemption
- Automatic throttling: if the fraud rate for the €250 tier rises above 0.06%, the system must stop claiming that exemption tier and route affected transactions to SCA until the rate stabilises
- Audit logs for every exemption claimed and the TRA score that justified it
In Spain, the Banco de España is the competent authority supervising payment services compliance. Fintechs operating under EMI or PI licences must document their TRA methodology and be prepared to provide it on request. The practical implication for engineering: your fraud scoring system is not just a business tool — it is regulated infrastructure. Latency, accuracy, and explainability are compliance requirements, not preferences.
Evaluation metrics beyond AUC
AUC (Area Under the ROC Curve) measures overall discriminative power but tells you nothing about where the model performs on the distribution that matters to your business. Replace it as the primary metric with:
$-weighted precision. Weight each true positive and false positive by transaction amount. Blocking a $5,000 fraudulent wire is not equivalent to blocking a $5 streaming subscription charge, and your metrics should reflect this. Compute fraud-dollar recall and false-positive-dollar precision as primary KPIs.
Precision@k. Your fraud operations team can review k cases per day. Precision at that capacity ceiling is the operationally relevant metric — a model that generates 10,000 alerts for a 200-analyst team is unusable regardless of its AUC.
False-positive rate on VIP segment. High-value customers who get incorrectly declined churn at 3-5x the rate of average customers. Track false-positive rate specifically for customers in the top 10% by lifetime value or top 5% by transaction frequency.
Chargeback rate. A lagging indicator (chargebacks arrive 30-120 days after fraud), but the cleanest external label you have. Rising chargeback rate when model score distribution is stable means the fraud typology has shifted to patterns your model is not covering.
Time-to-decision p99 in production. Not the benchmark latency, not the staging latency — the production p99, tracked continuously via Prometheus. Models that look fast in testing routinely degrade under production load when feature store connections queue up.
False-negative cost model. Enumerate the cost of a missed fraud event (chargeback fee + transaction amount + operations cost) and the cost of a false positive (refund processing + customer service + churn probability × LTV). Set your decision threshold to minimise total expected cost, not to maximise precision/recall symmetrically.
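The cost model translates directly into a threshold search over a scored validation set. The cost figures in the usage below are illustrative assumptions:

```python
def expected_cost(threshold: float, scored: list[tuple[float, bool]],
                  fn_cost: float, fp_cost: float) -> float:
    """Total cost of a decision threshold over scored validation outcomes.

    scored: (model_score, is_fraud) pairs. Missed fraud (score below
    threshold) costs fn_cost; a blocked legitimate transaction costs
    fp_cost.
    """
    cost = 0.0
    for score, is_fraud in scored:
        if is_fraud and score < threshold:
            cost += fn_cost   # missed fraud: chargeback fee + amount + ops
        elif not is_fraud and score >= threshold:
            cost += fp_cost   # false positive: refund + service + churn
    return cost

def best_threshold(scored: list[tuple[float, bool]],
                   fn_cost: float, fp_cost: float, steps: int = 100) -> float:
    """Grid-search the threshold that minimises total expected cost."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda t: expected_cost(t, scored, fn_cost, fp_cost))
```

With fn_cost far above fp_cost, as is typical for high-value fraud, the optimum shifts toward a lower, more aggressive threshold than symmetric precision/recall tuning would pick. In practice fn_cost and fp_cost vary per transaction (amount, customer LTV), so the production version weights each term rather than using flat constants.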
Build vs buy
| Dimension | In-house ML platform | SaaS vendor (Stripe Radar, Sift, Riskified, Ravelin, Adyen Protect) |
|---|---|---|
| Time to production | 6-18 months | 2-8 weeks |
| Network effect signals | None at start | Strong — trained on millions of merchants |
| Proprietary signal access | Full | Partial — vendor cannot see your CRM data |
| Explainability | Full control | Varies; often limited |
| Cost | Engineering cost only | 0.1-0.5% of GMV, or per-transaction flat fee |
| Cold-start performance | Poor | Good — pre-trained network model |
| Customisation depth | Unlimited | Limited — typically rules + weight tuning |
| Regulatory control | Full | Dependency on vendor’s certification |
| Operational overhead | High | Low |
Buy when: GMV is under $500M/year; team has no ML engineering capacity; you need production fraud controls in weeks; your fraud typology is mainstream (card-not-present e-commerce, standard card present).
Build when: vendor fees exceed in-house engineering cost (typically crosses over around $500M GMV); you have proprietary data signals vendors cannot access; your fraud typology is unusual (B2B invoice fraud, telco prepaid top-up abuse, insurance claims fraud); regulatory requirements mandate full model ownership and auditability.
Hybrid (most common in practice): use a SaaS vendor as a first-pass filter while building your own model in parallel. Once your model outperforms the vendor on your specific data distribution, route traffic to it. The vendor model remains as a cold-start and fallback layer.
Common pitfalls
Training on chargeback data alone. You miss 40-60% of fraud that never gets disputed — accounts drained by automated transfers, SIM swap attacks where the victim never realises, friendly fraud that is not reported. Train on all available labels: confirmed chargebacks, analyst-labelled cases from your review queue, and account closures for policy violations. Chargeback data is the tip of the iceberg.
Hardcoding decision thresholds. A threshold calibrated at model training time degrades as fraud patterns shift. Build a threshold management system that allows real-time threshold adjustment by your fraud operations team without a model redeployment cycle. Keep thresholds as configuration, not code.
Feature leakage from the future. Computing training features on the entire dataset before the train/validation split allows features that incorporate future information to leak into training. In fraud, this produces AUC scores of 0.99 in offline evaluation and 0.78 in production. Split by time first, compute features only from data available at prediction time.
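Point-in-time feature discipline means each row's features are computed only from events strictly earlier than that row. A minimal sketch of the idea for a velocity-style feature (names are illustrative):

```python
from collections import defaultdict

def point_in_time_velocity(rows: list[dict]) -> list[int]:
    """Compute a card's transaction-count-so-far feature using only events
    strictly earlier than each row.

    rows must be sorted by timestamp ascending. Reading the counter BEFORE
    incrementing it is exactly what prevents the current (and any future)
    transaction from leaking into its own feature.
    """
    seen: dict[str, int] = defaultdict(int)
    feature = []
    for r in rows:
        feature.append(seen[r["card"]])  # state as of just before this txn
        seen[r["card"]] += 1
    return feature

rows = [{"card": "a"}, {"card": "b"}, {"card": "a"}, {"card": "a"}]
print(point_in_time_velocity(rows))  # [0, 0, 1, 2]
```

Combined with a time-first train/validation split, this is the pattern that keeps offline AUC honest.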
Ignoring the feedback loop latency. Chargebacks arrive 30-120 days after the transaction. If you retrain weekly using only recent chargebacks, you are training on severely incomplete labels. Use a combination of early signals (fraud analyst labels, velocity-rule triggers confirmed as fraud within 24-48 hours) alongside delayed chargeback labels to maintain label freshness.
Optimising for fraud rate in isolation. A fraud system that declines 5% of legitimate transactions to achieve a 0.1% fraud rate is a business failure even if it looks good on a security dashboard. Track false-positive rate, revenue impact of declined legitimate transactions, and customer complaints alongside fraud rate.
One model for all channels. Card-present fraud, card-not-present e-commerce fraud, account opening fraud, and authorised push payment fraud are different problems with different feature distributions, different latency budgets, and different regulatory regimes. Train separate models per channel; a combined model optimises for the dominant channel and underperforms on the others.
Neglecting model explainability until a regulator asks. Retrofitting explainability onto a black-box model in production under regulatory pressure is expensive and disruptive. Build SHAP value generation into the scoring pipeline from day one; store explanations alongside decisions for every transaction above the review threshold.
Siloed fraud ops and data science. Fraud analysts who cannot submit feedback directly into the training pipeline represent a broken feedback loop. Every investigation result that does not flow back to the model degrades your system’s ability to learn. Build tooling that makes labelling fast and ensures those labels reach the feature store within 24 hours.
Sectors: payments, e-commerce, telco, insurance, fintech
Payments and card networks. The highest-volume, lowest-latency environment. Primary fraud types: card testing (automated small transactions to validate stolen card details), card-not-present fraud, BIN attacks. The 100 ms card network SLA is non-negotiable. Velocity features and device fingerprint are the highest-signal inputs. For banking-specific AI fraud detection, including behavioural biometrics and open banking signals, see the AI fraud detection in Spanish banking guide.
E-commerce. Account takeover (ATO), promo abuse, and refund fraud are primary concerns alongside card fraud. Session behaviour features (time-on-page, click sequences, autofill ratio) add significant signal beyond pure payment features. The 250 ms latency budget allows richer feature sets. Graph features for coupon abuse rings add value here — promo abuse is fundamentally a graph problem.
Telco. SIM swap fraud, premium-rate number abuse, and international revenue share fraud. SIM swap is the enabler of downstream financial fraud (the victim’s 2FA codes route to the attacker’s SIM). Detection: anomalous port-in requests, device IMEI changes without corresponding handset upgrade, account change patterns that precede financial account activity. Telcos have unusually rich network-level data (call graphs, SMS patterns) that feeds GNN models effectively.
Insurance. Claims fraud: staged accidents, phantom injuries, inflated repair costs. The latency budget is generous (hours to days), enabling richer analytical models. The challenge is sparse labels — confirmed fraud investigations complete months after the claim. Network analysis of repair shops, claimants, and legal representatives reveals organised fraud rings. Anomaly detection on claim characteristics (claim amount distribution, injury code combinations, time-to-claim after policy inception) provides strong baseline signals.
Fintech / open banking. Authorised push payment (APP) fraud — where the victim is socially engineered into authorising a transfer — is the dominant threat. Unlike card fraud, the transaction is technically authorised by the legitimate account holder. Detection requires behavioural anomaly signals: transaction amount outside historical range, new beneficiary, unusual time-of-day, preceded by unusual login patterns. For real-time data pipelines supporting these systems, stream processing infrastructure must handle sub-second enrichment. Also see real-time payment reconciliation architecture for the downstream settlement layer.
Sources
- Nilson Report — Global Card Fraud Losses 2023
- ACI Worldwide — Global eCommerce Fraud Report 2025
- Juniper Research — Payment Fraud Losses Forecast 2024–2028
- EUR-Lex — PSD2 Directive (EU) 2015/2366 — Strong Customer Authentication and TRA exemptions
- European Banking Authority — EBA RTS on Strong Customer Authentication under PSD2
- Banco de España — Estadísticas de fraude en medios de pago
- European Payments Council — SEPA Instant Credit Transfer Scheme
How abemon can help
Building a production fraud detection system requires expertise across data engineering, ML platform, real-time infrastructure, and regulatory compliance — rarely concentrated in a single team. abemon’s AI/ML engineering practice designs and implements fraud detection architectures from reference architecture through production deployment: feature store design, model training pipelines, real-time serving infrastructure, and integration with your existing payment stack.
Our data engineering team specialises in the real-time streaming infrastructure (Kafka, Flink, Redis) that fraud detection depends on. We have delivered fraud platforms across payments, fintech, and e-commerce verticals in Spain and LATAM.
If you are evaluating whether to build or buy, starting a new fraud programme, or scaling an existing system that is not keeping pace with attack evolution, contact us to discuss your specific requirements.
Frequently asked questions
- What is a realistic p99 latency target for real-time fraud scoring?
- For card-present authorisation, 100 ms end-to-end from API ingress to accept/decline is the hard ceiling — card networks impose it. Card-not-present (e-commerce) allows 250-300 ms before checkout abandonment climbs. Account opening can go to 2 s. In practice, a rules engine adds 2-5 ms, a gradient-boosted model served via REST adds 15-40 ms, and network + serialisation accounts for the rest. Budget 5-10 ms for each feature store lookup on the hot path, which in practice requires a co-located Redis.
- Should I use rules, ML, or both for fraud detection?
- Both, always. Rules fire in under 5 ms and handle known fraud patterns with zero false-negative risk for those patterns — velocity limits, BIN blocklists, impossible geographies. ML models generalise to novel patterns and combinations that rules miss. A common production split: rules block ~15% of fraud volume immediately; the ML model handles the remaining scoring and catches 60-70% of what rules miss. Running ML alone means your fraud team cannot explain decisions to regulators; running rules alone means you are perpetually losing to new attack vectors.
- How do I handle the cold-start problem for a new merchant or geography?
- Three-layer defence: first, fall back to peer-group models trained on similar merchant categories (MCC code, average ticket size, geography) — these generalise reasonably well. Second, apply conservative rule-based thresholds while collecting labelled data. Third, use transfer learning from your global model fine-tuned on the merchant's early transaction history after 2-4 weeks of data accumulation. Do not wait for 6 months of data; a model trained on 500 labelled fraud cases is already better than pure rules.
- Build vs buy: when does a SaaS fraud vendor make sense?
- Buy when your fraud volume is under ~$500M GMV/year, your team has no ML engineering capacity, or you need to go live in under 3 months. Stripe Radar, Sift, and Riskified have pre-trained network models that you cannot replicate from a cold start — they see signals across millions of merchants. Build when you have proprietary data signals that vendors cannot access (internal behavioural data, CRM history), when your vertical has unusual patterns (B2B invoicing, telco top-ups), or when vendor fees (typically 0.1-0.5% of GMV) exceed in-house engineering cost.
- What is the single most predictive feature for card fraud?
- Velocity features consistently rank highest across published benchmarks and our own deployments: specifically, the count of distinct merchants charged by the same card in the past 1 hour and 24 hours. This single feature, with appropriate thresholds, flags a disproportionate share of card testing and account takeover patterns. Device fingerprint combined with IP reputation is the second-most reliable signal for card-not-present. Neither is sufficient alone — the combination of velocity + device + behavioural deviation from historical baseline is what drives AUC above 0.95.
- How does the PSD2 SCA exemption work in practice?
- PSD2 Article 18 allows issuers and PSPs to skip Strong Customer Authentication for transactions below certain value thresholds IF their real-time Transaction Risk Analysis (TRA) keeps fraud rates below specific ceilings: 0.13% for exemptions up to €100, 0.06% up to €250, 0.01% up to €500. The TRA must be certified by your competent authority. Operationally, your fraud scoring system must output a risk score in real time and the issuer's authorisation system must consume it before routing to the SCA step-up flow. Maintaining the fraud rate below the ceiling is a continuous monitoring obligation.
- How often should I retrain my fraud model?
- Weekly retraining typically outperforms monthly by 8-12% AUC on drifting fraud populations, based on published research and our observed deployments. Daily retraining adds marginal lift (1-3%) at significantly higher compute and pipeline complexity. The practical recommendation: weekly scheduled retraining with automated champion/challenger evaluation; trigger ad-hoc retraining when a population shift detector (PSI > 0.2 on key features) fires. Never deploy a new model directly to 100% traffic — shadow or champion/challenger with 5-10% allocation first.
- What metrics matter beyond AUC and precision/recall?
- AUC is necessary but insufficient. Track: (1) $-weighted precision — blocking $1,000 in fraud that costs $0.50 in false positives is not the same as blocking $10 in fraud with the same false positive; (2) false-positive rate on your highest-value customers, who churn fastest when incorrectly declined; (3) chargeback rate as a lagging confirmation of model quality; (4) time-to-decision p99 in production, not just benchmark environments; (5) precision@k, where k is the number of cases your fraud operations team can actually review in a day — a model that generates 10,000 alerts when you can work 200 is operationally broken regardless of its AUC.