Batch-based anomaly detection introduces unacceptable delays for use cases like fraud prevention, infrastructure monitoring, and cybersecurity. We built a real-time streaming ML pipeline to detect anomalies within seconds of occurrence and tested it under production-scale loads.
Architecture Overview
The pipeline consists of four layers: ingestion (Apache Kafka), feature engineering (Apache Flink), model serving (a custom ensemble deployed on KServe), and alerting (PagerDuty integration). Each layer is horizontally scalable and independently deployable.
Data Ingestion
Kafka topics receive events from application logs, transaction systems, and infrastructure metrics. We configured 32 partitions per topic to parallelize processing. Schema Registry enforces event contracts and enables schema evolution without pipeline breaks.
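Key-based partitioning is what makes those 32 partitions safe to process in parallel: Kafka's default partitioner hashes the record key (murmur2 in the Java client) so that all events for the same key land on the same partition, preserving per-key ordering. A minimal stdlib-only sketch of that idea, using crc32 as a stand-in for murmur2 (the account-ID key and partition count are illustrative):

```python
import zlib

NUM_PARTITIONS = 32  # matches the per-topic partition count above

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map an event key (e.g. an account ID) to a partition.

    Kafka's default partitioner uses murmur2; crc32 stands in here
    so the sketch stays stdlib-only. The property that matters is
    determinism: same key -> same partition -> per-key ordering.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Events for the same account always land on the same partition:
p1 = partition_for("account-1042")
p2 = partition_for("account-1042")
```

Because the mapping is deterministic, downstream Flink operators keyed on the same field see each entity's events in order without cross-partition coordination.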
Streaming Feature Engineering
Flink jobs compute rolling statistical features over multiple time windows: 1-minute, 5-minute, and 1-hour aggregations. Features include transaction velocity, deviation from historical baselines, geographic anomaly scores, and cross-entity correlation metrics. The feature store serves reads at a p99 latency of 3 ms.
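To make the windowed features concrete, here is a small stand-alone sketch of one keyed sliding window, analogous to what a single Flink keyed-window operator computes. The class name, event shape, and baseline value are illustrative, not our production code:

```python
from collections import deque
from statistics import mean

class RollingFeatures:
    """Keep a time-bounded window of (timestamp, amount) events and
    derive per-window features, analogous to one Flink keyed window."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, amount), oldest first

    def add(self, timestamp: float, amount: float) -> None:
        self.events.append((timestamp, amount))
        # Evict events that have aged out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window_seconds:
            self.events.popleft()

    def features(self, baseline_mean: float) -> dict:
        amounts = [a for _, a in self.events]
        avg = mean(amounts) if amounts else 0.0
        return {
            "velocity": len(amounts),               # events in window
            "mean_amount": avg,
            "baseline_deviation": avg - baseline_mean,
        }

win = RollingFeatures(window_seconds=60.0)  # the 1-minute window
for t, amt in [(0, 10.0), (20, 30.0), (70, 50.0)]:
    win.add(t, amt)
feats = win.features(baseline_mean=15.0)
```

By the time the event at t=70 arrives, the t=0 event has aged out, so the window holds two events with a mean amount of 40.0 and a +25.0 deviation from the baseline. The 5-minute and 1-hour features are the same computation with a different `window_seconds`.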
Model Architecture
We deployed an ensemble combining an Isolation Forest for statistical anomalies, a gradient-boosted model for pattern-based detection, and a small transformer model for sequential pattern analysis. The ensemble votes with configurable thresholds per anomaly type to balance precision and recall.
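The per-type threshold voting can be sketched in a few lines. The threshold values and model names below are illustrative placeholders, not our production configuration; the point is the mechanism, where raising `min_votes` or `score_cutoff` trades recall for precision:

```python
# Per-anomaly-type voting config (illustrative values only).
THRESHOLDS = {
    "fraud": {"min_votes": 2, "score_cutoff": 0.7},  # favor precision
    "infra": {"min_votes": 1, "score_cutoff": 0.5},  # favor recall
}

def ensemble_decision(scores: dict, anomaly_type: str) -> bool:
    """Each model emits a score in [0, 1]; a model 'votes' when its
    score clears the cutoff, and the ensemble fires an alert when
    enough models vote for this anomaly type."""
    cfg = THRESHOLDS[anomaly_type]
    votes = sum(1 for s in scores.values() if s >= cfg["score_cutoff"])
    return votes >= cfg["min_votes"]

scores = {"isolation_forest": 0.82, "gbm": 0.75, "transformer": 0.41}
fraud_alert = ensemble_decision(scores, "fraud")  # 2 votes >= 2
infra_alert = ensemble_decision(scores, "infra")  # 2 votes >= 1
```

Keeping the thresholds in configuration rather than in model code is what lets operators retune the precision/recall balance per anomaly type without redeploying models.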
Load Testing Results
At 50,000 events per second, end-to-end latency from event ingestion to alert delivery averaged 1.2 seconds (p50) and 3.4 seconds (p99). False positive rate held at 0.3% with a true positive rate of 94.7%. The system consumed 48 CPU cores and 96 GB of RAM across the Flink cluster.
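For readers unfamiliar with percentile latency reporting: p50 is the median and p99 is the value below which 99% of samples fall. A nearest-rank computation over a toy sample set (the latency values here are illustrative, not our measured data):

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at
    least pct% of the samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]

latencies = [0.9, 1.1, 1.2, 1.3, 3.1, 3.4]  # seconds, toy data
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
```

Reporting p99 alongside p50 matters because tail latency, not the median, determines whether the slowest alerts still arrive within the seconds-scale budget.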
Lessons Learned
Feature engineering is the bottleneck, not model inference. Invest heavily in efficient streaming aggregations. Deploy models behind a feature flag so you can roll back without pipeline changes. Monitor feature drift in real time alongside model predictions. And always have a manual override mechanism for high-impact alerts.
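One way to make the drift monitoring concrete: compare the mean of recently observed feature values against the training-time baseline, standardized by the baseline's spread. This is a deliberately simple stand-in for heavier tests (PSI, Kolmogorov–Smirnov); the data and threshold below are illustrative:

```python
from statistics import mean, pstdev

def drift_score(baseline: list, recent: list) -> float:
    """Standardized shift of the recent feature mean relative to the
    baseline distribution. Alert when the score exceeds a threshold."""
    sd = pstdev(baseline) or 1.0  # guard against zero-variance baselines
    return abs(mean(recent) - mean(baseline)) / sd

baseline = [10.0, 12.0, 11.0, 9.0, 10.5]  # values seen at training time
stable   = [10.2, 11.1, 9.8]              # recent window, no drift
shifted  = [19.0, 21.0, 20.5]             # recent window, clear drift
```

`drift_score(baseline, stable)` stays well under 1, while `drift_score(baseline, shifted)` is large, which is the signal to investigate before prediction quality quietly degrades.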
What We Would Do Differently
Start with simpler statistical methods (Z-score, IQR) as a baseline before introducing ML models; the added complexity of ML is only justified once statistical methods plateau. In our case, the ML ensemble improved detection by 12% over the statistical baselines, enough to justify the complexity for this use case.
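Both baselines fit in a few lines each, which is exactly why they make a good first pass. A stdlib-only sketch (the sample data is illustrative):

```python
from statistics import mean, pstdev, quantiles

def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mu, sd = mean(values), pstdev(values) or 1.0
    return [v for v in values if abs(v - mu) / sd > threshold]

def iqr_outliers(values, k=1.5):
    """Flag points outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's fences)."""
    q1, _, q3 = quantiles(values, n=4)  # exclusive method by default
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

data = [10, 11, 9, 10, 12, 11, 10, 9, 11, 10, 12, 10, 11, 9, 10, 95]
zscore_hits = zscore_outliers(data)  # the 95 spike
iqr_hits = iqr_outliers(data)        # the 95 spike
```

One caveat worth knowing: with very small samples, a single extreme point inflates the standard deviation enough to cap the achievable Z-score, so the IQR fences are often the more robust of the two cheap baselines.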