[DEVCONF.CZ 2022] Building data pipelines for Anomaly Detection

Jan 28, 2022·

Tuhin Sharma

· 2 min read

Abstract

Cloud-native applications. Multiple cloud providers. Hybrid cloud. Thousands of VMs and containers. Complex network policies. Millions of connections and requests in any given time window. This is the typical situation faced by a Security Operations Center (SOC) analyst every single day. This talk covers how to build highly available and highly scalable data pipelines for anomaly detection in that setting.

Event

DevConf.CZ 2022

Location

Virtual

Description

Denial of Service: A device in the network stops working.
Data Loss: An example is a rogue agent in the network transmitting IP data outside the network.
Data Corruption: A device starts sending erroneous data.

The above can be solved through anomaly detection models. The main challenge here is the data engineering pipeline. With almost 7 billion events occurring every day, processing and storing them for further analysis is a significant challenge. The machine learning models (for anomaly detection) have to be updated every few hours, which requires the pipeline to build the feature store within a short time window. The core components of the data engineering pipeline are:

Apache Flink
Apache Kafka
Apache Pinot
Apache Spark
MLflow
Apache Superset

The event logs are ingested into Pinot through a Kafka topic. Pinot supports an Apache Kafka-based indexing service for real-time data ingestion. Pinot has basic capabilities for computing sliding time window statistics; more complex real-time statistics are computed using Flink. Apache Flink is a stream-processing engine that provides high throughput and low latency. Spark jobs are used for batch processing. MLflow is used for machine learning model management. Superset is used for visualization.
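To make "sliding time window statistics" concrete, here is a minimal sketch in plain Python. It is not Pinot or Flink code and does not reflect the speaker's implementation; the `Event` fields, the `SlidingWindow` class, and the 60-second window are all hypothetical names and values chosen for illustration. It assumes events arrive in timestamp order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A hypothetical network event; the field names are illustrative."""
    timestamp: float  # seconds
    device_id: str
    bytes_sent: int


class SlidingWindow:
    """Keeps the events of the last `window_seconds` and their running total."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events = deque()
        self._total_bytes = 0

    def add(self, event: Event) -> None:
        # Assumption: events arrive in non-decreasing timestamp order.
        self._events.append(event)
        self._total_bytes += event.bytes_sent
        self._evict(event.timestamp)

    def _evict(self, now: float) -> None:
        # Drop events that fall at or before the start of the window.
        window_start = now - self.window_seconds
        while self._events and self._events[0].timestamp <= window_start:
            old = self._events.popleft()
            self._total_bytes -= old.bytes_sent

    def count(self) -> int:
        return len(self._events)

    def mean_bytes(self) -> float:
        return self._total_bytes / len(self._events) if self._events else 0.0


window = SlidingWindow(window_seconds=60)
for ts, size in [(0, 100), (30, 200), (70, 300)]:
    window.add(Event(timestamp=ts, device_id="device-1", bytes_sent=size))

# The event at t=0 has left the 60-second window ending at t=70.
print(window.count(), window.mean_bytes())
```

A stream processor maintains this kind of state per key (for example, per device) across many machines and handles out-of-order events, which this single-process sketch deliberately ignores.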

The speaker talks through the architectural decisions and shows how to build a modern real-time stream processing data engineering pipeline using the above tools.
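The description does not say which anomaly detection models the talk uses. As a purely illustrative stand-in, the sketch below flags time windows whose event count deviates strongly from the average, using a simple z-score. The function name `zscore_anomalies`, the threshold of 3.0, and the sample counts are assumptions, not details from the talk.

```python
from statistics import mean, pstdev


def zscore_anomalies(counts, threshold=3.0):
    """Return indices of windows whose count is more than `threshold`
    population standard deviations away from the mean count."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        # All windows are identical, so nothing stands out.
        return []
    return [i for i, c in enumerate(counts) if abs(c - mu) / sigma > threshold]


# Hypothetical per-window event counts for one device: steady traffic,
# then a window with no events at all (e.g. the device stopped working).
counts = [100] * 20 + [0]
print(zscore_anomalies(counts))
```

In a pipeline like the one described, the per-window counts would come from the feature store rather than a hard-coded list, and the model would be retrained as new data arrives.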

Outline

The problem: overview
Different Architecture Choices
The final architecture - a brief explanation
Real-Time Processing
Apache Kafka
Message broker vs Message Queue
RabbitMQ vs Kafka
Why Kafka?
Apache Flink
Micro-batching vs Streaming
Flink vs Spark Streaming
Why Flink?
Apache Pinot
OLAP vs OLTP
Why Pinot?
Batch Processing
Apache Spark
Anomaly detection
Models
Data Engineering + Machine Learning
ML and MLlib
MLflow - Model management
Visualization - Superset
A short demo

Presentation Video

Last updated on Sep 9, 2024