1. Introduction
In this blog post, we continue our journey through machine learning explainability with the SAP HANA Predictive Analysis Library (PAL) and delve into the realm of Automated Machine Learning (AutoML), following our two earlier explorations of ML explainability in classification and regression, and in time series analysis. We hope to show how HANA PAL AutoML simplifies and democratizes the process of building predictive models while maintaining the high standards of explainability and ethical AI that SAP is renowned for.
In our previous articles, we’ve established the foundation of ML Explainability, showcasing how HANA PAL integrates this crucial feature into its algorithms, ensuring that the models we create are not only accurate but also understandable and trustworthy. By the end of this exploration, you will gain:
- A comprehensive understanding of what AutoML is and its significance in the field of ML
- Insight into the AutoML capability of HANA PAL, particularly in the context of ML explainability
- A practical example of AutoML with the Python Machine Learning Client for SAP HANA (hana-ml)
2. AutoML Explainability
AutoML stands for Automated Machine Learning, a technology that streamlines the development of ML models by automating tasks such as data preprocessing, feature engineering, model selection, and hyperparameter tuning. Traditional workflows, from raw data to deployable ML models, require extensive expertise and time from data scientists; AutoML raises efficiency and productivity, enabling a much wider audience to build and deploy machine learning models. A promising area of research explores applying AutoML to Large Language Models (LLMs), aiming to enhance their training, optimization, and inference processes.
As discussed in our previous blogs on classification, regression, and time series, model explainability is both an important and a challenging aspect of AutoML. A lack of explainability can undermine trust in model decisions, while providing it enhances model transparency and trust and promotes fairness, safety, and compliance.
A significant challenge that AutoML faces is that the models it generates are often considered black boxes, especially after a series of processing steps that complicate the connection between the original raw data and the resulting model. Such complex models might perform exceptionally well, but understanding how they make decisions can be difficult. Another challenge is that AutoML heavily relies on the quality of the input data, which may perpetuate or amplify biases present in the training data. These issues can lead to regulatory compliance and ethical concerns.
There is a variety of AutoML platforms on the market, each with differences in usability, scalability, integration capabilities, and cost factors. Some platforms focus on optimizing deep neural networks, but not all provide explainability features. The implementation of explainability in these products varies, reflecting the ongoing challenge of making AutoML both powerful and transparent.
3. PAL AutoML
Before we delve into the explainability of AutoML, we first describe how AutoML works within PAL in HANA Cloud. PAL combines its rich set of ML algorithms with built-in operators to provide a powerful AutoML capability that supports multi-objective optimization on tabular data for classification, regression, and time series problems.
The fundamental elements of our AutoML framework are operators, which fall into several categories: preprocessors (e.g., imputer), transformers (e.g., Principal Component Analysis), and estimators (algorithms for specific problem types, such as tree ensembles, Naive Bayes, and ARIMA). These operators can be composed into a pipeline, which serves as the smallest unit of optimization; a pipeline is a composite estimator chaining transformers and estimators. Each operator carries a group of hyperparameters, and for each hyperparameter either a fixed value or a search space of values is considered during tuning.
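To make this concrete, a pipeline is represented as nested JSON in which each operator's "inputs" consume either the raw data ("ROWDATA") or the output of a preceding operator. The structure below mirrors the best-pipeline output shown in Section 5; the operator names and argument values here are purely illustrative:
'{"ARIMA":{"args":{"MAX_D":4},"inputs":{"data":{"Imputer":{"args":{},"inputs":{"data":"ROWDATA"}}}}}}'
Here an imputer consumes the raw input and feeds its output to the ARIMA estimator; tuning explores the search space attached to each operator's "args".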
In the search for the optimal pipeline, i.e., the best composition of operators and hyperparameters, we utilize Genetic Programming (GP) and the multi-objective Non-Dominated Sorting Genetic Algorithm II (NSGA-II) to reduce complexity and tailor the search space to the problem at hand. The search space is defined by the config dict (short for configuration dictionary), for which we provide a default and a light version; users can also flexibly customize their own config dict, as sketched below. Furthermore, PAL AutoML provides mechanisms such as early stopping of iterative training, successive halving, and connection constraints to improve training speed and model performance.
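As a hedged sketch, the Python API exposes the config dict for inspection and customization. The method names and the parameter format below follow recent hana-ml versions and are assumptions to verify against your installation:
>>> from hana_ml.algorithms.pal.auto_ml import AutomaticTimeSeries
>>> auto_ts = AutomaticTimeSeries(generations=5,
                                  population_size=20,
                                  config_dict='light')            # or 'default', or a custom dict/JSON
>>> auto_ts.display_config_dict()                                 # inspect the active search space
>>> auto_ts.delete_config_dict(operator_name='BSTS')              # drop an operator from the search
>>> auto_ts.update_config_dict('ARIMA', 'MAX_D', [1, 2])          # restrict one hyperparameter (format assumed)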
4. Explainability in PAL AutoML
PAL's focus on explainability in AutoML is geared towards post-modeling analysis of the factors that influence individual predictions, i.e., local explainability. A SQL procedure named PIPELINE_EXPLAIN offers different methods of model explainability for the best pipeline or any other pipeline. Additionally, within the Python API, we provide visualization of the AutoML training process, enabling better monitoring of the iterative search and a clear view of its specifics.
When designing an explainer for the AutoML context, one constraint is that it must not be specific to any particular model, as it needs to accommodate a variety of pipeline scenarios, including multiple time series models and combinations of processing operators. Additionally, it is preferable that the method neither requires re-training the model nor access to the training data, so that it can be applied to already trained models, potentially in a production environment. This restricts our approach to model-agnostic methods that rely only on post-hoc access to a model's predictions, for example through perturbations of its input.
If you wish to enable interpretability for PAL AutoML, set use_explain=True during training. As model explanation requires a significant amount of computation, this parameter controls whether to initiate the process; the default is False.
In addition, several other important parameters control model explanation; a usage sketch follows the list below.
- explain_method: specifies which explanation method will be used.
- background_size: the row size of the background data.
- background_sampling_seed: the seed for the random number generator in background sampling.
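A minimal sketch of passing these parameters at training time, here for a classification task. The option names 'kernelshap' and 'globalsurrogate' reflect the two methods discussed in Section 4.1 but should be verified against your hana-ml version; the data and column names are illustrative:
>>> from hana_ml.algorithms.pal.auto_ml import AutomaticClassification
>>> auto_c = AutomaticClassification(generations=5, population_size=20)
>>> auto_c.fit(data=df_train, key='ID', label='CLASS',
               use_explain=True,              # default is False
               explain_method='kernelshap',   # or 'globalsurrogate' (assumed option names)
               background_size=200,           # rows sampled as background data
               background_sampling_seed=1)    # reproducible background sampling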
4.1 Classification and Regression
For Classification and Regression, we employ Kernel SHAP and global surrogate models using Random Decision Trees (RDT) for explainability. These methods provide insights into the contribution of each feature to the model’s output, helping to understand which features are most influential in the decision-making process. Kernel SHAP, for instance, is a powerful technique that estimates the impact of each feature on the prediction by computing the expected change in the model’s output when the feature is hidden from the model. This approach allows for a detailed understanding of the model’s behavior and can be particularly useful in identifying potential biases or anomalies in the model’s predictions.
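To build intuition for what Kernel SHAP computes, below is a toy, self-contained sketch (not PAL's implementation): it "hides" features by replacing them with background means, weights every coalition with the Shapley kernel, and fits a weighted linear model whose coefficients approximate the Shapley values. For brevity it enumerates all coalitions, which is feasible only for a handful of features, and omits the exact efficiency constraint:
import itertools
from math import comb
import numpy as np

def kernel_shap_toy(predict, x, background):
    """Approximate Shapley values for one instance x via Kernel SHAP."""
    m = len(x)
    bg_mean = background.mean(axis=0)
    base = predict(bg_mean[None, :])[0]          # expected output with all features hidden
    rows, weights, targets = [], [], []
    for size in range(1, m):                     # every non-trivial coalition
        for on in itertools.combinations(range(m), size):
            z = np.zeros(m)
            z[list(on)] = 1.0
            masked = np.where(z == 1, x, bg_mean)              # hide absent features
            w = (m - 1) / (comb(m, size) * size * (m - size))  # Shapley kernel weight
            rows.append(z)
            weights.append(w)
            targets.append(predict(masked[None, :])[0] - base)
    Z, sw = np.asarray(rows), np.sqrt(np.asarray(weights))
    phi, *_ = np.linalg.lstsq(Z * sw[:, None], np.asarray(targets) * sw, rcond=None)
    return phi                                   # one contribution per feature
For example, phi = kernel_shap_toy(model.predict, x_row, X_background) returns one contribution per feature; PAL's Kernel SHAP follows the same principle, with sampling rather than full enumeration.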
Global surrogate models like RDT, on the other hand, offer a more holistic view of the model’s decision-making process. They approximate the complex relationships within the model by constructing a simpler, interpretable model, such as a decision tree, that mimics the behavior of the original model. This surrogate model can then be analyzed to understand the decision paths and the importance of various features in the predictions.
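In the same spirit, here is a toy surrogate sketch, using scikit-learn purely for illustration (PAL's RDT surrogate operates on the trained pipeline itself): a shallow decision tree is fit to the black-box model's predictions, and its printed rules expose the decision paths.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor, export_text

X, y = make_regression(n_samples=500, n_features=5, random_state=0)
blackbox = GradientBoostingRegressor(random_state=0).fit(X, y)   # stand-in for a complex model
surrogate = DecisionTreeRegressor(max_depth=3).fit(X, blackbox.predict(X))  # mimic its outputs
print(export_text(surrogate, feature_names=[f'x{i}' for i in range(5)]))    # readable decision rules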
4.2 Time Series
Time series data presents unique challenges for explainable AI models, as forecasts depend on both immediate data points and the sequences of previous data. Hence, PAL has introduced a new time series explanation method that extends Kernel SHAP, maintaining key properties of Shapley values. This method involves drawing binary vectors for simplified features, computing weights, obtaining predictions, and regressing shifted predictions to estimate Shapley values. Additionally, a sampling strategy focuses on discriminative features to approximate Shapley values efficiently. For more detailed information on TS explain and its implementation, you can refer to the blog post Demystifying Pipeline Explanation for Time Series.
5. Use Case
In this section, we walk through a time series example that demonstrates how to obtain explainability in AutoML. Note that the code provided is purely for illustrative purposes and is not intended for production use.
The dataset is the Beijing PM2.5 data from the UCI Machine Learning Repository. It comprises hourly recordings of PM2.5 levels (airborne particles with aerodynamic diameters less than 2.5 μm) collected by the US Embassy in Beijing between January 1, 2010, and December 31, 2014. Additionally, meteorological data from Beijing Capital International Airport is included. The objective is to predict PM2.5 concentrations using various input features.
This dataset contains 43,824 rows and 11 columns. During preprocessing, the year, month, day, and hour columns were merged into a single 'date' column, and rows with missing values were addressed; a sketch of these steps follows the column list. The restructured dataset includes the following 9 columns:
- date: Timestamp of the record
- pollution: PM2.5 concentration (ug/m^3)
- dew: Dew point
- temp: Temperature
- press: Pressure (hPa)
- wnd_dir: Combined wind direction
- wnd_spd: Cumulated wind speed (m/s)
- snow: Cumulated hours of snow
- rain: Cumulated hours of rain
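A minimal sketch of this preprocessing with pandas, assuming the raw UCI CSV (whose original column names differ from the renamed ones above) and taking dropping incomplete rows as one way to address missing values:
import pandas as pd

raw = pd.read_csv('PRSA_data_2010.1.1-2014.12.31.csv')
raw['date'] = pd.to_datetime(raw[['year', 'month', 'day', 'hour']])   # merge into one timestamp
raw = raw.rename(columns={'pm2.5': 'pollution', 'DEWP': 'dew', 'TEMP': 'temp',
                          'PRES': 'press', 'cbwd': 'wnd_dir', 'Iws': 'wnd_spd',
                          'Is': 'snow', 'Ir': 'rain'})
df = raw[['date', 'pollution', 'dew', 'temp', 'press',
          'wnd_dir', 'wnd_spd', 'snow', 'rain']].dropna()             # drop rows with missing values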
For demonstration purposes, we’ve simplified the process by selecting the first 1,000 instances. Out of these, 990 instances were assigned to the training set, leaving the remaining 10 for the testing set. Recognizing that factors such as holidays can impact air quality levels due to variations in vehicle travel and industrial production, we enriched the original dataset by adding a new column named ‘holiday’. This column specifically identifies the Chinese national holidays.
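One way to produce the df_train and df_test HANA DataFrames used below is to upload the prepared pandas frame with create_dataframe_from_pandas; the table names here are illustrative, and the construction of the 'holiday' indicator is omitted:
>>> from hana_ml import dataframe
>>> from hana_ml.dataframe import create_dataframe_from_pandas
>>> conn = dataframe.ConnectionContext(url, port, user, pwd)
>>> sample = df.head(1000)   # first 1,000 instances; 'holiday' column assumed already added
>>> df_train = create_dataframe_from_pandas(conn, sample.iloc[:990], 'PM25_TRAIN', force=True)
>>> df_test = create_dataframe_from_pandas(conn, sample.iloc[990:], 'PM25_TEST', force=True)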
>>> print(df_train.head(5).collect())
Figure 1. The first 5 lines of HANA DataFrame df_train
>>> print(df_test.head(3).collect())
Figure 2. The first 3 lines of HANA DataFrame df_test
Below we show how to invoke the class AutomaticTimeSeries to obtain the optimal time series model. When calling the fit() function, remember to set use_explain=True. Currently, AutomaticTimeSeries only supports Kernel SHAP, so there is no need to set explain_method explicitly.
Moreover, hana-ml provides a visualization tool, PipelineProgressStatusMonitor, to display the computation progress. The Python code for this is shown below, and a display of the AutomaticTimeSeries progress is illustrated in Figure 3, highlighting the Pipeline Progress Status.
>>> from hana_ml import dataframe
>>> from hana_ml.algorithms.pal.auto_ml import AutomaticTimeSeries
>>> from hana_ml.visualizers.automl_progress import PipelineProgressStatusMonitor
>>> import uuid
>>> progress_id = "automl_ts_{}".format(uuid.uuid1())
>>> auto_ts = AutomaticTimeSeries(generations=5,
population_size=20,
progress_indicator_id=progress_id)
>>> progress_status_monitor = PipelineProgressStatusMonitor(connection_context=dataframe.ConnectionContext(url, port, user, pwd), automatic_obj=auto_ts)
>>> progress_status_monitor.start()
>>> auto_ts.fit(data=df_train, key='date', endog='pollution', exog=['dew', 'temp', 'press', 'wnd_dir', 'wnd_spd', 'snow', 'rain', 'holiday'],
                categorical_variable=['wnd_dir'], use_explain=True, background_size=500)
Figure 3. Pipeline Progress Status
Next, we use the get_best_pipeline() function to obtain the optimal pipeline, which results in an ARIMA model.
>>> auto_ts.get_best_pipeline()
'{"ARIMA":{"args":{"SEASONALITY_CRITERION":0.6,"SEASONAL_PERIOD":-1,"INFORMATION_CRITERION":2,"KPSS_SIGNIFICANCE_LEVEL":0.01,"D":-1,"MAX_SEASONAL_D":1,"MAX_D":4,"CH_SIGNIFICANCE_LEVEL":0.025,"SEASONAL_D":-1},"inputs":{"data":"ROWDATA"}}}'
Afterward, we employ this best pipeline for our predictions. To activate explainability within the predict() function, simply set the show_explainer parameter to True. This yields local explainability for all the exogenous variables.
>>> explain_res = auto_ts.predict(data=df_test.deselect('pollution'), key='date', show_explainer=True)
>>> print(explain_res.head(2).collect())
The first two rows of the prediction result explain_res are displayed in Figure 4. From the REASON_CODE column of the HANA DataFrame explain_res, we can read off the contributions of the exogenous variables to each prediction. For instance, in the first row, where the score is 8.29, the top three contributors are 'press', 'dew', and 'holiday'. The specific contribution values are stored under the key 'val', and the corresponding percentages under 'pct'; a small parsing sketch follows Figure 4.
Figure 4. The first 2 lines of HANA DataFrame explain_res
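A short sketch of pulling the top contributors out of the REASON_CODE column. The 'val' and 'pct' keys are shown above; the attribute-name key, here assumed to be 'attr', may differ in your output:
>>> import json
>>> for _, row in explain_res.head(2).collect().iterrows():
...     reasons = json.loads(row['REASON_CODE'])                       # list of per-feature entries
...     top3 = sorted(reasons, key=lambda r: abs(r['val']), reverse=True)[:3]
...     print([(r.get('attr'), round(r['val'], 2), r['pct']) for r in top3])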
6. Summary
In this article, we explored the concept of machine learning explainability in the context of AutoML, particularly within the SAP HANA Predictive Analysis Library (PAL). We began by describing how AutoML works within PAL, which supports multi-objective optimization for classification, regression, and time series. To demystify the processes involved, we then investigated the different methodologies used for model explainability, ranging from Kernel SHAP to global surrogate models like Random Decision Trees (RDT), as well as an extended version of Kernel SHAP tailored to time series data. To illustrate the practicalities of this technology, we provided a use case that applies AutoML to a publicly available multivariate time series dataset.
Other Useful Links:
Install the Python Machine Learning client from the PyPI public repository: hana-ml
We also provide an R API for SAP HANA PAL called hana.ml.r; for more information, please refer to the documentation.
For other blog posts on hana-ml:
- A Multivariate Time Series Modeling and Forecasting Guide with Python Machine Learning Client for SAP HANA
- Outlier Detection using Statistical Tests in Python Machine Learning Client for SAP HANA
- Outlier Detection by Clustering using Python Machine Learning Client for SAP HANA
- Anomaly Detection in Time-Series using Seasonal Decomposition in Python Machine Learning Client for SAP HANA
- Outlier Detection with One-class Classification using Python Machine Learning Client for SAP HANA
- Learning from Labeled Anomalies for Efficient Anomaly Detection using Python Machine Learning Client for SAP HANA
- Python Machine Learning Client for SAP HANA
- Import multiple excel files into a single SAP HANA table
- COPD study, explanation and interpretability with Python machine learning client for SAP HANA
- Model Storage with Python Machine Learning Client for SAP HANA
- Identification of Seasonality in Time Series with Python Machine Learning Client for SAP HANA