Demystifying Pipeline Explanation for Time Series Data

Time series data presents a distinct challenge for explaining AI models. Time series forecasts are influenced not only by the immediate data point but also by the longer sequence and patterns of previous data points.

Hence blindly applying explainers designed for classification or regression to time series models ignores the significance of past events and features throughout the sequence, attributing importance only to features of the current input. To effectively explain time series models, new methods are needed.

Furthermore, designing an explainer for AutoML pipelines introduces an additional constraint: the explainer must not be specific to any particular model. It needs to accommodate a variety of pipeline scenarios, including multiple time series models and combinations of processing operators. Additionally, it is preferable that the method does not require re-training the model or access to training data, and can be applied to already trained architectures, potentially in a production environment. This restricts the approach to being model-agnostic and relying only on post-hoc access to a model's predictions, for example through perturbations of its input.

While the final method focuses on feature-based explanations, it could be beneficial to incorporate capabilities for time series decomposition as well. This can be easily achieved by applying a standard decomposition technique, such as STL decomposition, to the prediction results. In summary, the explanation approach for AutoML time series pipelines is a model-agnostic, post-hoc, and feature-based procedure.
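To make the decomposition idea concrete, here is a minimal sketch of an additive trend/seasonal/remainder split that could be applied to a series of prediction results. This is an illustration only, not the PAL implementation: STL proper uses loess smoothing, whereas this simplified variant uses a centered moving average, and the function name `decompose` is ours.

```python
def decompose(y, period):
    """Simplified additive decomposition of a forecast series into
    trend, seasonal, and remainder components (illustrative only;
    STL replaces the moving average below with loess smoothing)."""
    n = len(y)
    half = period // 2
    # Trend: centered moving average (windows shrink at the edges)
    trend = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        trend.append(sum(y[lo:hi]) / (hi - lo))
    detrended = [y[i] - trend[i] for i in range(n)]
    # Seasonal: average the detrended values at each position in the cycle
    seasonal_means = [
        sum(detrended[i::period]) / len(detrended[i::period])
        for i in range(period)
    ]
    seasonal = [seasonal_means[i % period] for i in range(n)]
    # Remainder: whatever trend and seasonality do not account for
    remainder = [y[i] - trend[i] - seasonal[i] for i in range(n)]
    return trend, seasonal, remainder
```

Applied to the forecast output, the three components sum back to the original predictions, which is what makes the decomposition a faithful complement to the feature-based attributions.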

 

Method Step-by-Step

The Predictive Analysis Library (PAL) in SAP HANA Cloud has recently introduced a new method for time series explanations, which builds upon the strong theoretical foundations and empirical results of KernelSHAP, while extending it to time series data. This approach maintains the three desirable properties of importance attribution stemming from the Shapley values: local accuracy (ensuring the explanation model matches the complex model locally), missingness (ensuring missing features have no impact on predictions), and consistency (ensuring that a feature’s attributed importance does not decrease when its contribution increases or stays the same).

Our new method consists of four key steps:

1. A binary vector (also known as a simplified feature) is drawn from {0, 1}^M, where an entry of 1 means that the corresponding feature is present in the coalition and 0 that it is absent. The weight for each vector is then computed using the SHAP kernel.

2. For each simplified feature, the prediction is obtained by first converting it to the original feature space and applying the model. The prediction on the background data is then subtracted from this result.

3. The first two steps are repeated a certain number of times, resulting in a binary matrix (each row equals one of the simplified features) and a vector of shifted predictions.

4. The shifted predictions are regressed onto the simplified feature matrix, with the constraint that the sum of the coefficients equals the difference between the prediction of the observation being explained and the average prediction on the background data (the expected value of the model output). The resulting coefficients of the linear model are the estimated Shapley values.
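The four steps above can be sketched in a few lines of Python. This is an illustrative toy version with NumPy, not the PAL implementation: the function name `kernel_shap`, the single background row, and the rejection-sampling loop are simplifications for exposition. The sum constraint is enforced by eliminating the last coefficient before the weighted regression.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)

def shap_kernel_weight(M, s):
    # SHAP kernel weight for a coalition of size s, 0 < s < M
    return (M - 1) / (comb(M, s) * s * (M - s))

def kernel_shap(f, x, background, n_samples=2048):
    """Toy KernelSHAP sketch against a single background row."""
    M = len(x)
    Z, w = [], []
    while len(Z) < n_samples:              # steps 1 and 3: draw coalitions
        z = rng.integers(0, 2, M)
        s = int(z.sum())
        if 0 < s < M:                      # skip degenerate coalitions
            Z.append(z)
            w.append(shap_kernel_weight(M, s))
    Z, w = np.array(Z, float), np.array(w)
    base = f(background)                   # expected value of the model output
    X_pert = Z * x + (1 - Z) * background  # step 2: map back to input space
    y = np.array([f(row) for row in X_pert]) - base
    total = f(x) - base                    # sum-of-coefficients constraint
    # Step 4: eliminate phi_M = total - sum(phi_1..M-1), then solve the
    # weighted least-squares problem on the remaining coefficients
    A = Z[:, :-1] - Z[:, -1:]
    b = y - Z[:, -1] * total
    sw = np.sqrt(w)
    phi_rest, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
    return np.append(phi_rest, total - phi_rest.sum())
```

For a linear model this recovers the exact Shapley values c_i (x_i - b_i), which is a convenient sanity check for any KernelSHAP-style estimator.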

 

Feature Perturbation

Because KernelSHAP perturbs only the current instance, the resulting attributions explain only the contribution of feature values of the current instance, disregarding the rest of the sequence. This causes a mismatch between the data KernelSHAP attributes importance to and the data the time series model actually relies on. To address this, we adapt KernelSHAP to the time series setting by redefining feature-wide perturbations, allowing attributions to be calculated for features throughout a full sequence of instances.

To perturb the input vector, all values of a simplified feature must be converted back to their original values in the input space such that a value of 1 indicates the feature retains its original value, while a value of 0 indicates the feature is replaced with an uninformative background value to represent its removal. The input perturbation function is given by,

h_x(z)=x⊙z+b⊙(1-z)

where x is the input vector, z is the simplified feature, and ⊙ is the Hadamard product. The vector b represents an uninformative input, composed of the average for numerical features and the mode for categorical features in the input dataset. In our setting, this can be further formalized as

h_X(z) = X D_z + B(I - D_z), where D_z = diag(z)

A perturbation along the feature axis of the input matrix X is the result of mapping a simplified vector z to the original space such that z_i = 1 means that column i takes its original value X_{:,i}, and z_i = 0 means that column i takes the background value B_{:,i}.
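As a quick illustration of both forms of the perturbation map (assuming NumPy; the function names `perturb` and `perturb_matrix` are ours, not PAL's):

```python
import numpy as np

def perturb(x, z, b):
    """h_x(z) = x ⊙ z + b ⊙ (1 - z): keep feature i where z_i = 1,
    replace it with the background value where z_i = 0."""
    return x * z + b * (1 - z)

def perturb_matrix(X, z, B):
    """h_X(z) = X D_z + B (I - D_z): column i of the sequence X is kept
    when z_i = 1 and swapped for column i of the background B when
    z_i = 0, so a single coalition perturbs the whole sequence."""
    D = np.diag(z)
    return X @ D + B @ (np.eye(len(z)) - D)
```

The matrix form is what makes the adaptation time-series-aware: masking a feature removes it at every time step of the sequence at once, rather than only in the current instance.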

 

Sampling Strategy

Calculating the exact Shapley values would require generating all potential coalitions of discriminative features in the input. A discriminative feature is one whose value changes when masked. By focusing on discriminative features instead of all features, the explainer avoids re-evaluating the model when the features to be masked are invariant. A brute-force computation would need to enumerate the entire 2^M sample space, which becomes increasingly impractical as the number of discriminative features M grows. To manage this, it is more feasible to approximate the exact values by randomly sampling feature coalitions.

We use the parameter SAMPLESIZE to control the number of times the model is re-evaluated when explaining each prediction. To obtain more accurate estimates within this sampling budget, we have developed a coalition sampling strategy based on the weight each coalition would receive in the Shapley value estimation. The strategy begins with all possible coalitions containing 1 and M-1 features, a total of 2M coalitions, which covers at least 75% of the mass of the kernel weight distribution. These vectors are enumerated and used in subsequent calculations according to their weights. If there is sufficient remaining budget (the current budget is SAMPLESIZE - 2M), we then consider including coalitions with 2 features and with M-2 features, which raises the coverage to at least 92% of the mass of the weight distribution. For the remaining coalition sizes, we fill the remaining weight by sampling according to readjusted weights for the sizes not yet covered by the previous steps. It is important to note that we always use a paired sampling strategy in which each sample z_i is paired with its complement 1 - z_i.

Our strategy prioritizes exact calculations that cover the majority of the weight mass. It should be emphasized that the method uses a sampling approximation only for large values of M; for smaller values it is exact. If M is sufficiently small, all 2^M - 2 possible simplified vectors can be evaluated, eliminating the need for sampling and ensuring that the algorithm returns exact Shapley values with respect to the given background data. By default, SAMPLESIZE is set to 2M + 2048 samples, which allows exact Shapley values to be computed for up to 11 features. For larger M, increasing the parameter to use more samples yields lower-variance estimates of the Shapley values, but also increases computation time.
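The greedy size-by-size allocation can be sketched as follows. This is a simplified illustration of the budgeting logic only (the function name `plan_coalitions` and its structure are ours; it omits the kernel weights and the actual sampling step): it walks inward from the coalition-size extremes, pairing each size s with its complement M-s, and stops enumerating exactly when the next size pair no longer fits in the remaining budget.

```python
from math import comb

def plan_coalitions(M, budget):
    """Decide which coalition sizes are enumerated exhaustively and
    which fall back to random sampling, given an evaluation budget."""
    exact_sizes, remaining = [], budget
    s = 1
    while s <= M - 1 - s:
        # Cost of enumerating every coalition of size s and of its
        # complement size M - s (a single size when s == M - s)
        cost = comb(M, s) * (1 if s == M - s else 2)
        if cost > remaining:
            break
        exact_sizes += [s] if s == M - s else [s, M - s]
        remaining -= cost
        s += 1
    sampled_sizes = [t for t in range(1, M) if t not in exact_sizes]
    return exact_sizes, sampled_sizes, remaining
```

With the default budget of 2M + 2048 and M = 11, all 2^11 - 2 = 2046 non-degenerate coalitions fit, so every size is enumerated and no sampling is needed, matching the "exact up to 11 features" behavior described above.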

 

Example

Here is an example demonstrating the basic use of the algorithm.

--############### PAL_PIPELINE_EXPLAIN FOR TIME SERIES SQL ###############

--########## PRE-CLEANUP ##########
DROP TABLE PAL_PIPELINE_TS_DATA_TBL__0;
DROP TABLE PAL_PARAMETER_TBL__0;
DROP TABLE PAL_PIPELINE_PREDICT_TS_DATA_TBL__0;
DROP TABLE PAL_PREDICT_PARAMETER_TBL__0;
DROP TABLE PAL_PREDICT_RESULT_TBL__0;

--########## COLUMN TABLE CREATION ##########
CREATE COLUMN TABLE PAL_PIPELINE_TS_DATA_TBL__0 ("ID" INTEGER, "Y" DOUBLE, "X" DOUBLE, "CA" NVARCHAR(100));
CREATE COLUMN TABLE PAL_PARAMETER_TBL__0 ("PARAM_NAME" NVARCHAR(256), "INT_VALUE" INTEGER, "DOUBLE_VALUE" DOUBLE, "STRING_VALUE" NVARCHAR(1000));
CREATE COLUMN TABLE PAL_PIPELINE_PREDICT_TS_DATA_TBL__0 ("ID" INTEGER, "X" DOUBLE, "CA" NVARCHAR(100));
CREATE COLUMN TABLE PAL_PREDICT_PARAMETER_TBL__0 ("PARAM_NAME" NVARCHAR(256), "INT_VALUE" INTEGER, "DOUBLE_VALUE" DOUBLE, "STRING_VALUE" NVARCHAR(1000));
CREATE COLUMN TABLE PAL_PREDICT_RESULT_TBL__0 ("ID" NVARCHAR(10), "SCORE" NVARCHAR(100), "CONFIDENCE" DOUBLE, "REASON_CODE" NCLOB, "PH_1" NVARCHAR(10), "PH_2" NVARCHAR(10));

--########## TABLE INSERTS ##########

--########## PAL_PIPELINE_TS_DATA_TBL__0 DATA INSERTION ##########
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (1, 2.0, 30, 'A');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (2, 2.5, 28.3, 'B');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (3, 3.2, 27.2, 'C');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (4, 2.8, 24.1, 'A');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (5, 2.4, 28, 'B');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (6, 2.9, 21, 'C');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (7, 3.1, 24, 'A');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (8, 3, 25, 'B');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (9, 3.8, 18, 'C');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (10, 4.2, 33.1, 'A');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (11, 4.0, 34, 'B');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (12, 4.3, 32.1, 'C');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (13, 10.7, 50.1, 'A');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (14, 5.1, 40.2, 'B');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (15, 5.3, 43.3, 'C');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (16, 5.0, 39, 'A');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (17, 4.6, 42, 'B');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (18, 4.4, 41.1, 'C');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (19, 4.8, 44.8, 'A');
INSERT INTO PAL_PIPELINE_TS_DATA_TBL__0 VALUES (20, 5.1, 43.3, 'B');

--########## PAL_PARAMETER_TBL__0 DATA INSERTION ##########
INSERT INTO PAL_PARAMETER_TBL__0 VALUES ('PIPELINE', null, null, '{"HGBT_TimeSeries":{"args":{"ITER_NUM":100,"OBJ_FUNC":0,"ETA":0.4,"LAG":5}, "inputs":{"data":"ROWDATA"}}}');
INSERT INTO PAL_PARAMETER_TBL__0 VALUES ('USE_EXPLAIN', 1, null, null);
INSERT INTO PAL_PARAMETER_TBL__0 VALUES ('BACKGROUND_SIZE', 10, null, null);
INSERT INTO PAL_PARAMETER_TBL__0 VALUES ('BACKGROUND_SAMPLING_SEED', 1, null, null);

--########## PAL_PIPELINE_PREDICT_TS_DATA_TBL__0 DATA INSERTION ##########
INSERT INTO PAL_PIPELINE_PREDICT_TS_DATA_TBL__0 VALUES (21, 55, 'C');
INSERT INTO PAL_PIPELINE_PREDICT_TS_DATA_TBL__0 VALUES (22, 44.6, 'A');
INSERT INTO PAL_PIPELINE_PREDICT_TS_DATA_TBL__0 VALUES (23, 37, 'B');
INSERT INTO PAL_PIPELINE_PREDICT_TS_DATA_TBL__0 VALUES (24, 24.8, 'C');
INSERT INTO PAL_PIPELINE_PREDICT_TS_DATA_TBL__0 VALUES (25, 35.1, 'A');

--########## PAL_PREDICT_PARAMETER_TBL__0 DATA INSERTION ##########
INSERT INTO PAL_PREDICT_PARAMETER_TBL__0 VALUES ('TOP_K_ATTRIBUTIONS', 10, null, null);
INSERT INTO PAL_PREDICT_PARAMETER_TBL__0 VALUES ('SAMPLESIZE', 0, null, null);
INSERT INTO PAL_PREDICT_PARAMETER_TBL__0 VALUES ('SEED', 1, null, null);

--########## PAL_PIPELINE_EXPLAIN FOR TIME SERIES CALL ##########
DO BEGIN
lt_data = SELECT * FROM PAL_PIPELINE_TS_DATA_TBL__0;
lt_para = SELECT * FROM PAL_PARAMETER_TBL__0;
CALL _SYS_AFL.PAL_PIPELINE_FIT (:lt_data, :lt_para, lt_model, lt_info);
lt_pdata = SELECT * FROM PAL_PIPELINE_PREDICT_TS_DATA_TBL__0;
lt_ppara = SELECT * FROM PAL_PREDICT_PARAMETER_TBL__0;
CALL _SYS_AFL.PAL_PIPELINE_EXPLAIN (:lt_pdata, :lt_model, :lt_ppara, lt_res, lt_stat);
INSERT INTO PAL_PREDICT_RESULT_TBL__0 SELECT * FROM :lt_res;
END;

--########## SELECT * TABLES ##########
SELECT * FROM PAL_PREDICT_RESULT_TBL__0;

--########## TABLES CLEANUP ##########
DROP TABLE PAL_PIPELINE_TS_DATA_TBL__0;
DROP TABLE PAL_PARAMETER_TBL__0;
DROP TABLE PAL_PIPELINE_PREDICT_TS_DATA_TBL__0;
DROP TABLE PAL_PREDICT_PARAMETER_TBL__0;
DROP TABLE PAL_PREDICT_RESULT_TBL__0;

 

Recent topics on HANA machine learning:

Inference Acceleration – Random Decision Tree Models for Text Classification
Advancing to Multi-task Multilayer Perceptron: a new Neural Network design in SAP HANA Cloud
Global Explanation Capabilities in SAP HANA Machine Learning
Exploring ML Explainability in SAP HANA PAL – Time Series
Exploring ML Explainability in SAP HANA PAL – Classification and Regression
Fairness in Machine Learning – A New Feature in SAP HANA Cloud PAL

 
