SAP BTP AI Best Practices #11: Anomaly Detection

For More Information: https://sap.to/6054fIFR4

Description
Anomaly detection is the process of identifying data points, events, or patterns that deviate significantly from the expected or normal behavior within a dataset. In the SAP ecosystem, this involves leveraging SAP HANA machine learning tooling, namely the Predictive Analysis Library (PAL) and the hana-ml Python client, to find data points that “do not follow the collective common pattern of the majority of data points”. This practice covers implementing these techniques effectively.

Expected Outcome
Successfully identify and flag unusual behavior or outliers in various types of data (e.g., transactional data, sensor readings, time series, API traffic) residing within or connected to the SAP landscape, enabling proactive responses to potential risks or opportunities.

Benefits
Mitigate Risks: Detect fraud, system failures, security breaches, or compliance violations early.
Optimize Processes: Identify operational inefficiencies, improve data quality, understand unexpected process variations, and enable predictive maintenance.
Enhance Decision Making: Gain insights from unexpected deviations, understand emerging trends, and react faster to critical events.
Key Algorithms
General Anomaly Detection Functions
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It groups points that are closely packed (having many neighbors within a certain distance) and marks points that lie alone in low-density regions as outliers. It can discover clusters of arbitrary shape and is robust to noise.
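
As a rough illustration, the sketch below runs DBSCAN through the hana-ml Python client. The connection details, table name (SENSOR_DATA), and feature columns are placeholders, and the CLUSTER_ID output column is assumed to follow the usual PAL convention of assigning -1 to noise points.

from hana_ml import dataframe
from hana_ml.algorithms.pal.clustering import DBSCAN

# Connect to SAP HANA (placeholder credentials).
conn = dataframe.ConnectionContext("<host>", 443, "<user>", "<password>")
df = conn.table("SENSOR_DATA")  # assumed columns: ID, TEMP, PRESSURE

# minpts and eps define the density threshold; tune them to your data scale.
dbscan = DBSCAN(minpts=5, eps=0.5, thread_ratio=0.5)
labels = dbscan.fit_predict(data=df, key="ID")

# Points in low-density regions are assigned cluster ID -1, i.e. outliers.
outliers = labels.filter("CLUSTER_ID = -1")
print(outliers.collect())
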
Isolation Forest is an unsupervised anomaly detection algorithm. It works by randomly partitioning the data space and isolating observations. The core idea is that anomalies are “few and different”, making them easier to isolate compared to normal points. Points that require fewer random partitions to be isolated are considered more likely to be anomalies and receive a higher anomaly score.
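
A minimal hana-ml sketch of this idea follows. The table and column names are illustrative, and the contamination argument on predict is assumed to be available in recent hana-ml releases.

from hana_ml import dataframe
from hana_ml.algorithms.pal.preprocessing import IsolationForest

conn = dataframe.ConnectionContext("<host>", 443, "<user>", "<password>")
df = conn.table("TRANSACTIONS")  # assumed columns: ID, AMOUNT, DURATION

# Build an ensemble of random isolation trees.
iso = IsolationForest(n_estimators=100, random_state=2024)
iso.fit(data=df, key="ID")

# Score every row; points isolated by fewer random splits get higher scores.
# The contamination fraction controls how many rows are flagged as anomalies.
result = iso.predict(data=df, key="ID", contamination=0.01)
print(result.collect().head())
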
One-Class Support Vector Machine (SVM) is an unsupervised algorithm primarily used for novelty or outlier detection. It learns a decision boundary that encompasses the majority of the training data (the “normal” points). New data points falling outside this boundary are classified as anomalies or outliers.
The algorithm aims to find a hyperplane in a high-dimensional feature space (potentially transformed by a kernel function) that separates the data points from the origin with maximum margin. Points lying on the “wrong” side of the hyperplane or too far from it are considered outliers.
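
A hedged sketch with hana-ml's OneClassSVM is shown below; nu bounds the fraction of training points allowed outside the boundary, and the table and column names are illustrative.

from hana_ml import dataframe
from hana_ml.algorithms.pal.svm import OneClassSVM

conn = dataframe.ConnectionContext("<host>", 443, "<user>", "<password>")
df = conn.table("API_TRAFFIC")  # assumed columns: ID, CALLS_PER_MIN, PAYLOAD_KB

# Train on data assumed to be mostly "normal"; nu caps the outlier fraction.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, scale_info="standardization")
ocsvm.fit(data=df, key="ID")

# Rows falling outside the learned boundary are scored as outliers.
scores = ocsvm.predict(data=df, key="ID")
print(scores.collect().head())
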
K-Means clustering can be used as a basis for outlier detection. The core idea is that outliers will typically be far away from the centroids (centers) of the clusters formed by the majority of the data. The function first performs K-Means clustering and then calculates a distance-based score for each point. Points with the highest scores (i.e., farthest from cluster centers, considering the specified distance metric and aggregation method) are flagged as outliers based on a specified contamination fraction or distance threshold.
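
The sketch below calls hana-ml's outlier_detection_kmeans function. The exact tuple of returned DataFrames can vary by release, so only the first result (the flagged outliers) is used here, and the table and column names are illustrative.

from hana_ml import dataframe
from hana_ml.algorithms.pal.clustering import outlier_detection_kmeans

conn = dataframe.ConnectionContext("<host>", 443, "<user>", "<password>")
df = conn.table("ORDERS")  # assumed columns: ID, QUANTITY, UNIT_PRICE

# Cluster first, then flag the points farthest from their cluster centers;
# contamination sets the fraction of rows to report as outliers.
res = outlier_detection_kmeans(data=df, key="ID",
                               n_clusters=4,
                               distance_level="euclidean",
                               contamination=0.02)

outliers = res[0]  # first returned DataFrame: the detected outliers
print(outliers.collect())
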
Time Series Anomaly Detection Functions
Time series outlier detection identifies data points that deviate significantly from the general pattern of the series. The OutlierDetectionTS algorithm works in two steps:
Residual Extraction: A model (e.g., smoothing, seasonal decomposition) is fitted to the time series, and the residuals (the differences between the actual values and the fitted values) are calculated.
Outlier Detection on Residuals: An outlier detection method (like z-score, IQR, MAD, Isolation Forest, DBSCAN) is applied to the residuals. Points whose residuals have high outlier scores are flagged as anomalies in the original time series.
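
A minimal hana-ml sketch of this two-step flow is shown below; the time series table and column names are placeholders, and the parameter names follow recent hana-ml releases.

from hana_ml import dataframe
from hana_ml.algorithms.pal.tsa.outlier_detection import OutlierDetectionTS

conn = dataframe.ConnectionContext("<host>", 443, "<user>", "<password>")
ts = conn.table("DAILY_SALES")  # assumed columns: TS (key), VALUE

# Step 1 happens internally: the series is decomposed and residuals extracted.
# Step 2: score the residuals, here with a z-score and a threshold of 3.
od = OutlierDetectionTS(detect_seasonality=True,
                        outlier_method="z1",
                        threshold=3.0)

res = od.fit_predict(data=ts, key="TS", endog="VALUE")
print(res.collect().head())
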
Regression Anomaly Detection Functions
Outlier Detection Regression identifies data points that deviate significantly from the expected pattern in regression models. The algorithm works in two steps:
Residual Extraction: A regression model (either linear or tree-based) is fitted to the data, and residuals (differences between actual and predicted values) are calculated.
Outlier Scoring: Each data point receives an outlier score based on its residual. For linear models, the score is the deleted studentized residual. For tree models, the score is the z-score of the residual. Points whose scores exceed a specified threshold are flagged as outliers.
This approach recognizes that outliers in regression are points that don’t follow the general behavior pattern established by the model, making them distinguishable through their unusually large residuals.
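
Because the exact hana-ml entry point for regression outlier detection depends on the release, the plain-Python sketch below only illustrates the underlying two-step logic (fit a model, then score the residuals); it is not the PAL implementation, which uses deleted studentized residuals for linear models as described above.

import numpy as np

# Synthetic data with a linear trend and a few injected outliers.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.5 * x + rng.normal(0, 1, 200)
y[[20, 75, 150]] += 15

# Step 1: residual extraction from a least-squares linear fit.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Step 2: outlier scoring via z-scores of the residuals.
z = (residuals - residuals.mean()) / residuals.std()
print("flagged points:", np.where(np.abs(z) > 3.0)[0])
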
