Regression Outlier Detection within SAP HANA Cloud AutoML


In regression, an outlier is a data point that deviates from the general behavior of the remaining data points. There are multiple regression models, and regression outlier detection depends on the chosen model. In the Predictive Analysis Library (PAL) of SAP HANA, we have a procedure called _SYS_AFL.PAL_OUTLIER_DETECTION_FOR_REGRESSION to detect outliers in a regression dataset. In hana-ml, we have the corresponding function hana_ml.algorithms.pal.preprocessing.OutlierDetectionRegression. You can find more details about the procedure in Outlier Detection for Regression | SAP Help Portal and OutlierDetectionRegression — hana-ml 2.23.2401217 documentation. We also embed regression outlier detection into the AutoML of PAL. For more details about regression outlier detection in AutoML, see AutoML | SAP Help Portal and Optimization | SAP Help Portal. We will also embed regression outlier detection into AutomaticRegression of hana-ml soon.

In this blog post, you will learn:

- how to detect regression outliers in hana-ml
- how to use regression outlier detection in AutoML of PAL

To make the results easy to read and display, we call the hana-ml functions and PAL procedures in a Jupyter notebook. Calling the procedures and plotting the results requires some helper functions, which we list in the Appendix.

Introduction

In regression, outliers can have many causes, such as measurement errors, experimental anomalies and natural variation. In regression analysis, outliers can have a huge impact on the regression model and parameters, and affect the prediction results.

Regression outlier detection is model dependent. In the regression outlier detection procedure, two commonly used regression models are included. They are the linear model based on MLR and the tree model based on HGBT. In AutoML, all the current regressors support outlier detection. They are EXP_Regressor, GEO_Regressor, GLM_Regressor, HGBT_Regressor, LOG_Regressor, MLP_Regressor, MLP_M_TASK_Regressor, MLR_Regressor, POL_Regressor, RDT_Regressor, DT_Regressor, and SVM_Regressor. For more information about the regressors, see Pipeline Operator | SAP Help Portal.

PAL Procedure of Outlier Detection for Regression

Introduction

In the hana-ml function hana_ml.algorithms.pal.preprocessing.OutlierDetectionRegression, we include two commonly used regression models: the linear model based on Multiple Linear Regression (MLR) and the tree model based on Hybrid Gradient Boosting Tree (HGBT). The outlier detection procedure is divided into two steps. In step 1, we obtain the residuals from the original data and the selected model. In step 2, we detect the outliers from the residuals: we calculate the outlier score of each data point from its residual and compare the score with a threshold. For the linear model, the outlier score is the deleted studentized residual. For the tree model, the outlier score is the z-score of the residual.

In the output of this procedure, we have a result table and a statistic table. The result table contains the residual, the outlier score and the outlier label. The statistic table contains some information about the outlier detection method.
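
To make the two-step idea concrete, here is a small self-contained numpy sketch (an illustration only, not the PAL implementation): it fits a simple linear model, takes the residuals, and flags the points whose residual z-score exceeds a threshold. The threshold value 3.0 is just an assumption for this sketch; in the PAL procedure the threshold is a parameter.

import numpy as np

rng = np.random.default_rng(0)
X = np.arange(20, dtype=float)
y = 2.0 * X + rng.normal(scale=0.1, size=20)
y[10] += 15.0  # inject one outlier

def zscore_outliers(X, y, threshold=3.0):
    # Step 1: fit a simple linear model y ~ a*X + b and take the residuals.
    a, b = np.polyfit(X, y, deg=1)
    residual = y - (a * X + b)
    # Step 2: score each residual and compare with the threshold.
    # The plain z-score below corresponds to the tree-model scoring;
    # the linear model in PAL uses the deleted studentized residual instead.
    score = np.abs(residual - residual.mean()) / residual.std()
    return residual, score, (score > threshold).astype(int)

residual, score, is_outlier = zscore_outliers(X, y)
print(np.nonzero(is_outlier)[0])  # the injected point at index 10 is flagged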

Test Cases

In order to demonstrate the characteristics of the regression models and better display the plots, we choose 2-dimensional data in the test cases. To call the PAL procedures from Python, we need to import some Python packages.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from hana_ml.dataframe import create_dataframe_from_pandas
from hana_ml.algorithms.pal.preprocessing import OutlierDetectionRegression

case 1: linear model

data

The data is about the average weekly expenditure on tobacco (x, in British pounds) and the average weekly expenditure on alcohol (y, in British pounds) for households in n = 11 different regions in the United Kingdom. The data comes from surveys on household spending that the British government regularly conducted, here from 1981. You can find the data at https://online.stat.psu.edu/stat501/lesson/4. The data is as follows.

df = pd.read_csv('your path and file', sep = "\t")
df

Let us add an ID column and keep the feature and target for regression.

X = np.array(df["Tobacco"])
y = np.array(df["Alcohol"])
df_with_id = pd.DataFrame({"ID": list(range(len(df))), "Tobacco": X, "Alcohol": y})
plt.scatter(X, y)
plt.grid()

outlier detection

We call the hana-ml function hana_ml.algorithms.pal.preprocessing.OutlierDetectionRegression to detect outliers. We use the linear model in this case. The Python code and results are as follows.

df_hana = create_dataframe_from_pandas(connection_context=cc, pandas_df=df_with_id, table_name='DATA', force=True, replace=True)
tsreg = OutlierDetectionRegression(regression_model='linear')
res = tsreg.fit_predict(data=df_hana, key='ID', label='Alcohol')
res.collect()

stats = tsreg.get_model_metrics()
stats.collect()

outlier_result_plot(res.collect(), X)

outlier_plot(res.collect(), X)

From the above results, we can see that the outlier is detected correctly by the linear model.

case 2: tree model

data

The data is about the relationship between the marks of students and their study time. You can find the data at https://www.kaggle.com/datasets/yasserh/student-marks-dataset. There is also a feature called number of courses in the dataset. However, to plot the target and feature on a 2-dimensional plane, we drop this feature in the regression outlier analysis. The data is as below.

file = 'your path and file'
df = pd.read_csv(file)
df

We drop the column “number_courses” and plot the data.

y = np.array(df["Marks"])
X = np.array(df["time_study"])
num = len(y)
ID = np.arange(num)
df_with_id = pd.DataFrame({"ID": ID, "time_study": X, "Marks": y})
plt.scatter(X, y)
plt.grid()

outlier detection with linear model

We call the hana-ml function hana_ml.algorithms.pal.preprocessing.OutlierDetectionRegression to detect outliers. Let us also use the linear model in this case. The Python code and results are as follows.

df_hana = create_dataframe_from_pandas(connection_context=cc, pandas_df=df_with_id, table_name='DATA', force=True, replace=True)
tsreg = OutlierDetectionRegression(regression_model='linear')
res = tsreg.fit_predict(data=df_hana, key='ID', label='Marks')
res.collect()

stats = tsreg.get_model_metrics()
stats.collect()

outlier_result_plot(res.collect(), X)

We can see that no outlier is detected by the linear model. This result is reasonable given the data plot. However, we can also see that the residuals are quite large and do not follow a Gaussian distribution. The residual appears to be roughly a quadratic function of the feature. This indicates that the linear model is not suitable for this dataset.

outlier detection with tree model

Now let us turn to the tree model.

tsreg = OutlierDetectionRegression(regression_model='tree')
res = tsreg.fit_predict(data=df_hana, key='ID', label='Marks')
res.collect()

stats = tsreg.get_model_metrics()
stats.collect()

outlier_result_plot(res.collect(), X)

There is also no outlier detected by the tree model, which is reasonable. The residuals are much smaller than those of the linear model, and they are more consistent with a Gaussian distribution. This dataset is suitable for analysis with the tree model.

data with an outlier added

Now we put an outlier into the dataset.

y = np.array(df["Marks"])
X = np.array(df["time_study"])
X = np.append(X, 6)
y = np.append(y, 48)
num = len(y)
ID = np.arange(num)
df_with_id = pd.DataFrame({"ID": ID, "time_study": X, "Marks": y})
plt.scatter(X, y)
plt.grid()

outlier detection with tree model

Let us use the tree model to detect outliers in this case.

df_hana = create_dataframe_from_pandas(connection_context=cc, pandas_df=df_with_id, table_name='DATA', force=True, replace=True)
tsreg = OutlierDetectionRegression(regression_model='tree')
res = tsreg.fit_predict(data=df_hana, key='ID', label='Marks')
res.collect()

stats = tsreg.get_model_metrics()
stats.collect()

outlier_result_plot(res.collect(), X)

outlier_plot(res.collect(), X)

From the above results, we can see that the outlier is detected correctly by the tree model.

For datasets that are suitable for linear regression, we can use the linear model to detect outliers. However, if the dataset does not fit a linear regression, we need to use the tree model, or apply some transformation to the data so that it fits a linear regression and then use the linear model, as sketched below.
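
As a sketch of the transformation idea (assuming, as the residual plot in case 2 suggests, a roughly quadratic relationship, so that the square root of the target is roughly linear in the feature), one could run the linear model on a transformed target. The variables X, y and ID are the ones from case 2, and the table name DATA_SQRT is arbitrary:

df_sqrt = pd.DataFrame({"ID": ID, "time_study": X, "SQRT_Marks": np.sqrt(y)})
df_hana_sqrt = create_dataframe_from_pandas(connection_context=cc, pandas_df=df_sqrt, table_name='DATA_SQRT', force=True, replace=True)
odr = OutlierDetectionRegression(regression_model='linear')
res_sqrt = odr.fit_predict(data=df_hana_sqrt, key='ID', label='SQRT_Marks')
res_sqrt.collect()

A point flagged in the transformed data corresponds to the same ID in the original data.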

Regression Outlier Detection in AutoML

Introduction

We also embed regression outlier detection in AutoML, and it is very easy to use. When PIPELINE_TYPE is set to regressor in AutoML, we just need to set the parameter REGRESSION_OUTLIER to 1 and call the PAL procedure _SYS_AFL.PAL_AUTOML_FIT. In the remainder of this section, I will introduce how regression outlier detection works in AutoML.

In AutoML, we obtain the best pipeline from the cross-validation results. When PIPELINE_TYPE is set to regressor, we can perform regression outlier tuning on the best pipeline. The process of how regression outlier detection works is shown below.

The newly added regression outlier tuning process is in the red box. In the regression outlier tuning process, we start with the best pipeline (referred to as BP) generated in the previous step. From BP, we create a set of pipelines, denoted as Pset = {Pi | i = 1, 2, …, N}. The only difference between the pipelines in Pset and BP is the regressor's parameters. Each pipeline in Pset applies regression outlier detection and dropping before performing the regression. The specific method for outlier detection depends on the type of regressor used.

For the generation of Pset, let us take an example. Suppose we set the parameter REGRESSION_OUTLIER_PIPELINE_NUM to 4 and REGRESSION_OUTLIER_THRESHOLDS to [3,4,5] (the default value), and BP is as follows:

{"HGBT_Regressor":{"args":{"ITER_NUM":100,"OBJ_FUNC":2,"ETA":0.5,"MAX_DEPTH":9,"MIN_CHILD_HESSIAN":6.0},"inputs":{"data":{"SCALE":{"args":{"NEW_MAX":3.0,"NEW_MIN":1.0,"SCALING_METHOD":0},"inputs":{"data":"ROWDATA"}}}}}}

Pset may consist of the following four pipelines.

{"HGBT_Regressor":{"args":{"ITER_NUM":100,"OBJ_FUNC":2,"ETA":0.5,"MAX_DEPTH":9,"MIN_CHILD_HESSIAN":6.0,"REGRESSION_OUTLIER":1,"REGRESSION_OUTLIER_THRESHOLD":3.0},"inputs":{"data":{"SCALE":{"args":{"NEW_MAX":3.0,"NEW_MIN":1.0,"SCALING_METHOD":0},"inputs":{"data":"ROWDATA"}}}}}}
{"HGBT_Regressor":{"args":{"ITER_NUM":100,"OBJ_FUNC":2,"ETA":0.5,"MAX_DEPTH":9,"MIN_CHILD_HESSIAN":6.0,"REGRESSION_OUTLIER":1,"REGRESSION_OUTLIER_THRESHOLD":4.0},"inputs":{"data":{"SCALE":{"args":{"NEW_MAX":3.0,"NEW_MIN":1.0,"SCALING_METHOD":0},"inputs":{"data":"ROWDATA"}}}}}}
{"HGBT_Regressor":{"args":{"ITER_NUM":100,"OBJ_FUNC":2,"ETA":0.5,"MAX_DEPTH":9,"MIN_CHILD_HESSIAN":6.0,"REGRESSION_OUTLIER":1,"REGRESSION_OUTLIER_THRESHOLD":5.0},"inputs":{"data":{"SCALE":{"args":{"NEW_MAX":3.0,"NEW_MIN":1.0,"SCALING_METHOD":0},"inputs":{"data":"ROWDATA"}}}}}}
{"HGBT_Regressor":{"args":{"ITER_NUM":100,"OBJ_FUNC":2,"ETA":0.5,"MAX_DEPTH":11,"MIN_CHILD_HESSIAN":4.0,"REGRESSION_OUTLIER":1,"REGRESSION_OUTLIER_THRESHOLD":4.0},"inputs":{"data":{"SCALE":{"args":{"NEW_MAX":3.0,"NEW_MIN":1.0,"SCALING_METHOD":0},"inputs":{"data":"ROWDATA"}}}}}}

The first three pipelines only add the parameters REGRESSION_OUTLIER and REGRESSION_OUTLIER_THRESHOLD to BP, and these pipelines are always generated. In the last pipeline of Pset, in addition to these two parameters, other parameters of the HGBT regressor have also been modified; these additional parameters are adjusted randomly. In this example, the HGBT method is used for outlier detection because the regressor being used is HGBT.
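
The sketch below mimics this generation logic (my own illustration of the rules described above, not the PAL code; perturbing MAX_DEPTH and MIN_CHILD_HESSIAN, and the ranges used, are arbitrary choices for the sketch):

import copy
import json
import random

def generate_pset(bp_json, pipeline_num=4, thresholds=(3.0, 4.0, 5.0)):
    bp = json.loads(bp_json)
    regressor = next(iter(bp))  # e.g. "HGBT_Regressor"
    pset = []
    for i in range(pipeline_num):
        cand = copy.deepcopy(bp)
        args = cand[regressor]["args"]
        args["REGRESSION_OUTLIER"] = 1
        if i < len(thresholds):
            # always generated: only the two new parameters are added
            args["REGRESSION_OUTLIER_THRESHOLD"] = thresholds[i]
        else:
            # extra candidates: a threshold plus randomly adjusted regressor parameters
            args["REGRESSION_OUTLIER_THRESHOLD"] = random.choice(thresholds)
            args["MAX_DEPTH"] = random.randint(6, 12)
            args["MIN_CHILD_HESSIAN"] = float(random.randint(1, 8))
        pset.append(json.dumps(cand))
    return pset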

The cross validation on pipelines with outlier detection is shown below.

After cross validation (CV) splitting, we detect and drop outliers in the training data, then do the training and test procedure, and finally calculate the cross-validation scorings. To handle the outliers in the CV training data, we simply drop the outlier points. Outliers can also be present in the CV test data. Considering that outliers may significantly impact certain metrics, we have adjusted the cross-validation score measures EVAR, MAE, MSE, RMSE, R2, MAPE, and WMAPE in this step to mitigate the influence of outliers. The corresponding adjusted metrics are termed EVAROUTLIER, MAEOUTLIER, MSEOUTLIER, RMSEOUTLIER, R2OUTLIER, MAPEOUTLIER, and WMAPEOUTLIER, respectively. The adjusted scores are also calculated for BP in the regression outlier tuning result. Let us take MAE as an example. We drop some of the smallest and largest absolute residuals. The procedure is as follows.

The bar represents the sorted absolute test residuals in CV. In the second bar, we drop the grey part and keep the blue part. In more detail, we drop M test data points with the largest absolute residuals and M test data points with the smallest absolute residuals, where M is no more than 5% of the test data. Assuming the residuals are x = {x_i | i = 1,2,…,N}, we calculate {s_i | i = 1,2,…,N}, where s_i = 0.67449 × |x_i| / median(|x|). This is the MAD score with zero median. We drop the points with s_i > 5 and an equivalent number of points with the smallest s_i. After dropping these points, we calculate the adjusted measure MAEDROP, and MAEOUTLIER is defined as 0.5 × (MAE + MAEDROP). Other measures are adjusted similarly.
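
Following this description, the adjusted measure can be sketched in numpy as below (my reading of the rule above; the PAL internals may differ in details):

import numpy as np

def mae_outlier(residuals, s_threshold=5.0, max_drop_ratio=0.05):
    abs_res = np.abs(np.asarray(residuals, dtype=float))
    mae = abs_res.mean()
    # MAD score with zero median: s_i = 0.67449 * |x_i| / median(|x|)
    s = 0.67449 * abs_res / np.median(abs_res)
    # M = number of points with s_i > 5, capped at 5% of the data
    m = min(int(np.sum(s > s_threshold)), int(max_drop_ratio * len(abs_res)))
    if m == 0:
        return mae
    kept = np.sort(abs_res)[m:-m]  # drop the M smallest and M largest
    mae_drop = kept.mean()
    return 0.5 * (mae + mae_drop)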

For the comparison in the regression outlier tuning process, if a pipeline with regression outlier detection outperforms BP in all the measures specified by SCORINGS, and is equal to or better than the other pipelines, it is considered the best pipeline for this step. If no pipeline in Pset outperforms BP in all the measures according to SCORINGS, BP remains the best pipeline for this step. The remaining pipelines are then sorted using a multi-objective sorting method.
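
In code, the comparison against BP could be sketched as follows (a simplification that ignores the tie-breaking among candidates, and which assumes every measure is oriented so that larger is better; error measures such as MAE would be negated first):

def outperforms(scores_a, scores_b, scorings):
    # True only if pipeline A beats pipeline B on every measure in SCORINGS
    return all(scores_a[m] > scores_b[m] for m in scorings)

def pick_best(bp_scores, candidate_scores_list, scorings):
    # BP stays the best unless some candidate dominates it on all measures
    winners = [s for s in candidate_scores_list if outperforms(s, bp_scores, scorings)]
    return winners[0] if winners else bp_scores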

In the Pipeline Table, the output pipelines from the previous step remain and are controlled by the parameter ELITE_NUMBER. The output pipelines from the regression outlier tuning process are also included in the Pipeline Table and are controlled by the parameter REGRESSION_OUTLIER_TUNE_ELITE_NUMBER. If the best pipeline from this step includes outlier detection, it is placed at the top of the Pipeline Table. Otherwise, the best pipeline from the previous step is placed at the top. Additionally, the IDs of the pipelines from this step are displayed in the Statistic Table.

Test Cases

In this part, I will introduce two test cases. In one case, the final best pipeline is without outlier detection; in the other, the final best pipeline is with outlier detection. In both test cases, I use the dataset Concrete Compressive Strength[1], which relates the concrete compressive strength to some variables, such as water and age. You can find it at https://archive.ics.uci.edu/dataset/165/concrete+compressive+strength.

Case 1: The final best pipeline is without outlier detection.

data

The data is from https://archive.ics.uci.edu/dataset/165/concrete+compressive+strength. I also add an ID column to the dataset.

file = 'your path and file name'
df_with_id = pd.read_csv(file)
df_with_id

The target column is “Concrete compressive strength”.

To do the AutoML fitting and prediction, we split the data into a training part and a test part with the hana-ml function hana_ml.algorithms.pal.partition.train_test_val_split. The training part is for AutoML fitting and producing the model; the test part is for testing the prediction from the model. We also select the target column and produce the target values from the test part.

df_train, df_test = train_test_split_fun(df_with_id, test_ratio = 0.2)
target = "Concrete compressive strength"
df_test_X = df_test.copy()
df_test_X.drop(columns=target, inplace=True)
y_test = produce_y_test(df_test, key = "ID", label = target)

AutoML Fitting

In this part, we fit the training data with AutoML and get a model.
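
The fit call below passes the AutoML search space through CONFIG_DICT; we assume config_dict has been defined beforehand (its exact schema is described in AutoML | SAP Help Portal). Purely as a hypothetical illustration, a minimal search space restricted to a scaler and the HGBT regressor might look like this:

import json

# Placeholder illustration only: candidate parameter values for one scaler
# and the HGBT regressor. Check the AutoML documentation for the exact
# CONFIG_DICT schema before using this.
config_dict = {
    "SCALE": {"SCALING_METHOD": [0, 1, 2]},
    "HGBT_Regressor": {
        "ITER_NUM": [100],
        "ETA": [0.1, 0.5, 1.0],
        "MAX_DEPTH": [6, 8, 10],
    },
}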

fit_parameters = {
    "REGRESSION_OUTLIER": 1,
    "THREAD_RATIO": 1.0,
    "PIPELINE_TYPE": 'regressor',
    "DEPENDENT_VARIABLE": target,
    "RANDOM_SEED": 123,
    "GENERATIONS": 1000,
    "POPULATION_SIZE": 20,
    "CONFIG_DICT": json.dumps(config_dict),
    "ELITE_NUMBER": 5,
    "REGRESSION_OUTLIER_PIPELINE_NUM": 4,
    "REGRESSION_OUTLIER_TUNE_ELITE_NUMBER": 5,
}
pipeline_table_pd, model_table_pd, stats_table_pd = AutoML_fit(df_train, fit_parameters, cc)

stats_table_pd

From the statistic table, we can see that the pipeline IDs of regression outlier tuning are 5, 6, 7, 8, 9. This shows that the best pipeline is without outlier detection. Let us print the pipeline table.

print_pipeline_table(pipeline_table_pd)

The pipelines with ID = 0,1,2,3,4 are from the original process. In other words, if we set the parameter REGRESSION_OUTLIER to 0 (the default value), we get exactly the above pipelines with ID = 0,1,2,3,4. The pipelines with ID = 5,6,7,8,9 are from the outlier tuning process. The pipeline with ID = 5 is the same as the pipeline with ID = 0: because it is the old best pipeline, we include it to compare with the pipelines with outlier detection. Therefore, the scores of ID = 5,6,7,8,9 contain “EVAROUTLIER” and “MAEOUTLIER”. These two measures are adjusted from EVAR and MAE, and they are the scores for the outlier tuning process. Because no pipeline with outlier detection outperforms the old best pipeline (ID = 5) in both “EVAROUTLIER” and “MAEOUTLIER”, the final best pipeline is still the old best pipeline, and we therefore put the output pipelines from the original process in front of those from the outlier tuning process. Although the pipelines with ID = 0 and ID = 5 are the same, their scores may differ, because their cross-validation training and test data may be different. The pipelines in the outlier tuning process share the same cross-validation training and test data.

prediction from AutoML model

Let us see the prediction errors from the model of the best pipeline.

test_parameters = {}
test_result_table_pd, test_stats_table_pd = AutoML_predict(df_test_X, test_parameters, model_table_pd, cc)
pred = produce_predict(test_result_table_pd, key = "ID")
mae = mae_fun(y_test, pred)
evar = evar_fun(y_test, pred)
print("mae = {}, evar = {}".format(mae, evar))

mae = 3.131938162929128, evar = 0.9244784555351252

prediction from pipeline with outlier detection

Let us see the prediction from the pipeline with ID = 6, which is the best pipeline with outlier detection. We get the model by fitting the pipeline and then predict with the model.

# pipe is the pipeline JSON with ID = 6, copied from the printed pipeline table
pipe_parameters = {
    "DEPENDENT_VARIABLE": target,
    "PIPELINE": json.dumps(pipe),
}
model_table_pipe, stats_table_pipe = Pipeline_fit(df_train, pipe_parameters, cc)
test_result_table_pipe, test_stats_table_pipe = AutoML_predict(df_test_X, test_parameters, model_table_pipe, cc)
pred_pipe = produce_predict(test_result_table_pipe, key = "ID")
mae_pipe = mae_fun(y_test, pred_pipe)
ever_pipe = evar_fun(y_test, pred_pipe)
print("mae_pipe = {}, ever_pipe = {}".format(mae_pipe, ever_pipe))

mae_pipe = 3.0678810742472957, ever_pipe = 0.9212280312761948

We can see that the test errors are not the same as those of the AutoML model. This is because some outliers are detected by the pipeline with outlier detection. The results of the pipeline with outlier detection are comparable with the AutoML model: both MAE and EVAR are a little smaller than the AutoML model's results. For MAE, smaller is better, while for EVAR the opposite is true.

Case 2: The final best pipeline is with outlier detection.

data

In this test case, we inject some outliers into the training data df_train of test case 1.

y = np.array(df_with_id["Concrete compressive strength"])
y_std = np.std(y)
np.random.seed(234)
outlier_idxes = np.random.randint(0, len(df_train), size=8)
outlier_amplitude = 3
for idx in outlier_idxes:
    df_train.loc[idx, "Concrete compressive strength"] += (-1)**idx * outlier_amplitude * y_std
df_train

AutoML Fitting

In this part, we fit the new training data with AutoML and get a model.

fit_parameters = {
    "REGRESSION_OUTLIER": 1,
    "THREAD_RATIO": 1.0,
    "PIPELINE_TYPE": 'regressor',
    "DEPENDENT_VARIABLE": target,
    "RANDOM_SEED": 123,
    "GENERATIONS": 1000,
    "POPULATION_SIZE": 20,
    "CONFIG_DICT": json.dumps(config_dict),
    "ELITE_NUMBER": 5,
    "REGRESSION_OUTLIER_PIPELINE_NUM": 4,
    "REGRESSION_OUTLIER_TUNE_ELITE_NUMBER": 5,
}
pipeline_table_pd, model_table_pd, stats_table_pd = AutoML_fit(df_train, fit_parameters, cc)
stats_table_pd

From the statistic table, we can see that the pipeline IDs of regression outlier tuning are 0,1,2,3,4. This shows that the best pipeline is with outlier detection. Let us print the pipeline table.

print_pipeline_table(pipeline_table_pd)

Contrary to test case 1, the pipelines with ID = 0,1,2,3,4 are from the outlier tuning process. The pipeline with ID = 0 is with outlier detection, and the pipeline with ID = 2 is the old best pipeline. The pipeline with ID = 0 outperforms the old best pipeline in both “EVAROUTLIER” and “MAEOUTLIER”, and is equal to or better than the other pipelines in the outlier tuning process.

prediction from AutoML model

Let us see the prediction errors from the model of the best pipeline.

test_parameters = {}
test_result_table_pd, test_stats_table_pd = AutoML_predict(df_test_X, test_parameters, model_table_pd, cc)
pred = produce_predict(test_result_table_pd, key = "ID")
mae = mae_fun(y_test, pred)
evar = evar_fun(y_test, pred)
print("mae = {}, evar = {}".format(mae, evar))

mae = 2.9165821766760427, evar = 0.9275281312643902

prediction from the old best pipeline

Let us see the prediction from the pipeline with ID = 5, which is the best pipeline of the original process. We get the model by fitting the pipeline and then predict with the model.

pipe = {"HGBT_Regressor":{"args":{"SEED":2024,"ITER_NUM":100,"OBJ_FUNC":2,"ETA":0.5,"MAX_DEPTH":8,"MIN_CHILD_HESSIAN":4.0},"inputs":{"data":{"SCALE":{"args":{"SCALING_METHOD":2},"inputs":{"data":"ROWDATA"}}}}}}
pipe_parameters = {
    "DEPENDENT_VARIABLE": target,
    "PIPELINE": json.dumps(pipe),
}
model_table_pipe, stats_table_pipe = Pipeline_fit(df_train, pipe_parameters, cc)
test_result_table_pipe, test_stats_table_pipe = AutoML_predict(df_test_X, test_parameters, model_table_pipe, cc)
pred_pipe = produce_predict(test_result_table_pipe, key = "ID")
mae_pipe = mae_fun(y_test, pred_pipe)
ever_pipe = evar_fun(y_test, pred_pipe)
print("mae_pipe = {}, ever_pipe = {}".format(mae_pipe, ever_pipe))

mae_pipe = 2.968404778787287, ever_pipe = 0.9250418298900762

We can see that mae_pipe is larger than mae and ever_pipe is smaller than evar in test case 2. Outlier detection and dropping improve the prediction in this case.

Conclusions

In this blog post, we explained what an outlier in regression is and introduced a PAL procedure and its corresponding hana-ml function to detect outliers in regression data. The procedure is based on two models: the linear model based on MLR and the tree model based on HGBT. We also embed regression outlier detection into AutoML, which can help to improve the regression prediction when there are outliers in the AutoML input data. Hope you enjoy reading this blog!

Regression outlier detection is already included in hana-ml. The embedding of regression outlier detection in AutoML will also be included in hana-ml. If you want to learn more about this topic in the SAP HANA Predictive Analysis Library (PAL) and hana-ml, please refer to the following links:

OutlierDetectionRegression — hana-ml 2.23.2401217 documentation

Outlier Detection for Regression | SAP Help Portal

AutoML | SAP Help Portal

Optimization | SAP Help Portal

 

[1]: “Concrete Compressive Strength” by I-Cheng Yeh is licensed under CC BY 4.0.

 

Appendix

SAP HANA Connection

import hana_ml
from hana_ml import dataframe
# the ConnectionContext is referred to as cc throughout this blog
cc = dataframe.ConnectionContext('host', 'port', 'username', 'password')

Functions of Plotting Regression Outlier Detection Results

def outlier_result_plot(dResults, X):
    dResults.sort_values(by = list(dResults)[0], inplace = True, ascending = True)
    raw_data = np.array(dResults['TARGET'])
    residual = np.array(dResults['RESIDUAL'])
    outlier_score = np.array(dResults['OUTLIER_SCORE'])
    is_outlier = np.array(dResults['IS_OUTLIER'])
    plt.figure(figsize = (24, 4.5))
    plt.subplot(1, 4, 1)
    plt.scatter(X, raw_data)
    plt.grid()
    plt.xlabel("X")
    plt.title('TARGET')
    plt.subplot(1, 4, 2)
    plt.scatter(X, residual)
    plt.grid()
    plt.xlabel("X")
    plt.title('RESIDUAL')
    plt.subplot(1, 4, 3)
    plt.scatter(X, outlier_score)
    plt.grid()
    plt.xlabel("X")
    plt.title('OUTLIER_SCORE')
    plt.subplot(1, 4, 4)
    plt.scatter(X, is_outlier)
    plt.grid()
    plt.xlabel("X")
    plt.title('IS_OUTLIER')

def outlier_plot(dResults, X):
    dResults.sort_values(by = list(dResults)[0], inplace = True, ascending = True)
    raw_data = np.array(dResults['TARGET'])
    is_outlier = np.array(dResults['IS_OUTLIER'])
    outlier_idx = np.array([], dtype = int)
    for i in range(len(is_outlier)):
        if is_outlier[i] == 1:
            outlier_idx = np.append(outlier_idx, i)
    plt.figure()
    plt.scatter(X, raw_data)
    plt.scatter(X[outlier_idx], raw_data[outlier_idx], color = 'red')
    plt.grid()
    plt.title("regression data and outlier")

Functions for Regression Outlier Detection in AutoML

Functions for Table

def createEmptyTable(table_name, proto, cc):
    with cc.connection.cursor() as cur:
        try:
            joint = []
            for key in proto:
                joint.append(" ".join(['"{:s}"'.format(key), proto[key]]))
            cur.execute('CREATE COLUMN TABLE %s (%s);' %
                        (table_name, ",".join(joint)))
        except:
            print(
                f'"CREATE TABLE {table_name}" was unsuccessful. Maybe the table already exists.')

def dropTable(table_name, cc):
    with cc.connection.cursor() as cur:
        try:
            cur.execute(f"DROP TABLE {table_name}")
        except:
            print(f'"DROP TABLE {table_name}" was unsuccessful. Maybe the table does not exist yet.')

def createTableFromDataFrame(df, table_name, cc):
    dropTable(table_name, cc)
    dt_ml = dataframe.create_dataframe_from_pandas(cc, df, table_name=table_name)
    return dt_ml

AutoML and Pipeline Functions

def AutoML_fit(df, parameters, cc,
               data_table='ZPD_PAL_DATA_TBL',
               parameter_table='ZPD_PAL_PARAMETERS_TBL',
               pipeline_table='ZPD_PAL_PIPELINE_TBL',
               model_table='ZPD_PAL_MODEL_TBL',
               stats_table='ZPD_PAL_STATS_TBL'):

    # Input table
    createTableFromDataFrame(df, data_table, cc)

    # Pipeline table
    dropTable(pipeline_table, cc)
    createEmptyTable(pipeline_table, {
        "ID": "INTEGER", "PIPELINE": "NCLOB", "SCORES": "NCLOB"}, cc)

    # Model table
    dropTable(model_table, cc)
    createEmptyTable(model_table, {
        "ROW_INDEX": "INTEGER", "MODEL_CONTENT": "NVARCHAR(5000)"}, cc)

    # Stats table
    dropTable(stats_table, cc)
    createEmptyTable(stats_table, {
        "STAT_NAME": "NVARCHAR(5000)", "STAT_VALUE": "NVARCHAR(5000)"}, cc)

    # Parameter table
    dropTable(parameter_table, cc)
    createEmptyTable(parameter_table, {"PARAM_NAME": "nvarchar(256)", "INT_VALUE": "integer",
                                       "DOUBLE_VALUE": "double", "STRING_VALUE": "nvarchar(5000)"}, cc)

    if parameters:
        with cc.connection.cursor() as cur:
            for parName, parValue in parameters.items():

                if isinstance(parValue, str):
                    parValue = f"'{parValue}'"
                    parametersSQL = f"{parValue if isinstance(parValue,int) else 'NULL'}, {parValue if isinstance(parValue,float) else 'NULL'}, {parValue if isinstance(parValue,str) else 'NULL'}"
                    cur.execute(
                        f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")

                elif isinstance(parValue, list):
                    for x in parValue:
                        if isinstance(x, str):
                            x = f"'{x}'"
                        parametersSQL = f"{x if isinstance(x,int) else 'NULL'}, {x if isinstance(x,float) else 'NULL'}, {x if isinstance(x,str) else 'NULL'}"
                        cur.execute(
                            f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")

                else:
                    parametersSQL = f"{parValue if isinstance(parValue,int) else 'NULL'}, {parValue if isinstance(parValue,float) else 'NULL'}, {parValue if isinstance(parValue,str) else 'NULL'}"
                    cur.execute(
                        f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")

    else:
        print("No parameters given, using default values.")

    sql_str = f"""
    do begin
    lt_data = select * from {data_table};
    lt_control = select * from {parameter_table};
    CALL _SYS_AFL.PAL_AUTOML_FIT(:lt_data, :lt_control, lt_pipeline, lt_model, lt_stats);
    INSERT INTO {pipeline_table} SELECT * FROM :lt_pipeline;
    INSERT INTO {stats_table} SELECT * FROM :lt_stats;
    INSERT INTO {model_table} SELECT * FROM :lt_model;
    end;"""

    with cc.connection.cursor() as cur:
        cur.execute(sql_str)

    return cc.table(pipeline_table).collect(), cc.table(model_table).collect(), cc.table(stats_table).collect()

def AutoML_predict(df, parameters, model_table_pd, cc,
                   data_table='ZPD_PAL_DATA_TBL',
                   parameter_table='ZPD_PAL_PARAMETERS_TBL',
                   result_table='ZPD_PAL_RESULT_TBL',
                   model_table='ZPD_PAL_MODEL_TBL',
                   stats_table='ZPD_PAL_STATS_TBL'):

    # Input table
    createTableFromDataFrame(df, data_table, cc)

    # Result table
    dropTable(result_table, cc)
    createEmptyTable(result_table, {
        "ID": "NVARCHAR(5000)", "SCORES": "NVARCHAR(5000)"}, cc)

    # Stats table
    dropTable(stats_table, cc)
    createEmptyTable(stats_table, {
        "STAT_NAME": "NVARCHAR(5000)", "STAT_VALUE": "NVARCHAR(5000)"}, cc)

    # Model table
    createTableFromDataFrame(model_table_pd, model_table, cc)

    # Parameter table
    dropTable(parameter_table, cc)
    createEmptyTable(parameter_table, {"PARAM_NAME": "nvarchar(256)", "INT_VALUE": "integer",
                                       "DOUBLE_VALUE": "double", "STRING_VALUE": "nvarchar(5000)"}, cc)

    if parameters:
        with cc.connection.cursor() as cur:
            for parName, parValue in parameters.items():

                if isinstance(parValue, str):
                    parValue = f"'{parValue}'"
                    parametersSQL = f"{parValue if isinstance(parValue,int) else 'NULL'}, {parValue if isinstance(parValue,float) else 'NULL'}, {parValue if isinstance(parValue,str) else 'NULL'}"
                    cur.execute(
                        f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")

                elif isinstance(parValue, list):
                    for x in parValue:
                        if isinstance(x, str):
                            x = f"'{x}'"
                        parametersSQL = f"{x if isinstance(x,int) else 'NULL'}, {x if isinstance(x,float) else 'NULL'}, {x if isinstance(x,str) else 'NULL'}"
                        cur.execute(
                            f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")
                else:
                    parametersSQL = f"{parValue if isinstance(parValue,int) else 'NULL'}, {parValue if isinstance(parValue,float) else 'NULL'}, {parValue if isinstance(parValue,str) else 'NULL'}"
                    cur.execute(
                        f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")

    else:
        print("No parameters given, using default values.")

    sql_str = f"""
    do begin
    lt_data = select * from {data_table};
    lt_param = select * from {parameter_table};
    lt_model = select * from {model_table};
    CALL _SYS_AFL.PAL_PIPELINE_PREDICT(:lt_data, :lt_model, :lt_param, lt_result, lt_stats);
    INSERT INTO {result_table} SELECT * FROM :lt_result;
    INSERT INTO {stats_table} SELECT * FROM :lt_stats;
    end;"""

    with cc.connection.cursor() as cur:
        cur.execute(sql_str)

    return cc.table(result_table).collect(), cc.table(stats_table).collect()

def Pipeline_fit(df, parameters, cc,
                 data_table='ZPD_PAL_DATA_TBL',
                 parameter_table='ZPD_PAL_PARAMETERS_TBL',
                 model_table='ZPD_PAL_MODEL_TBL',
                 stats_table='ZPD_PAL_STATS_TBL'):

    # Input table
    createTableFromDataFrame(df, data_table, cc)

    # Model table
    dropTable(model_table, cc)
    createEmptyTable(model_table, {
        "ROW_INDEX": "INTEGER", "MODEL_CONTENT": "NVARCHAR(5000)"}, cc)

    # Stats table
    dropTable(stats_table, cc)
    createEmptyTable(stats_table, {
        "STAT_NAME": "NVARCHAR(5000)", "STAT_VALUE": "NVARCHAR(5000)"}, cc)

    # Parameter table
    dropTable(parameter_table, cc)
    createEmptyTable(parameter_table, {"PARAM_NAME": "nvarchar(256)", "INT_VALUE": "integer",
                                       "DOUBLE_VALUE": "double", "STRING_VALUE": "nvarchar(5000)"}, cc)

    if parameters:
        with cc.connection.cursor() as cur:
            for parName, parValue in parameters.items():

                if isinstance(parValue, str):
                    parValue = f"'{parValue}'"
                    parametersSQL = f"{parValue if isinstance(parValue,int) else 'NULL'}, {parValue if isinstance(parValue,float) else 'NULL'}, {parValue if isinstance(parValue,str) else 'NULL'}"
                    cur.execute(
                        f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")

                elif isinstance(parValue, list):
                    for x in parValue:
                        if isinstance(x, str):
                            x = f"'{x}'"
                        parametersSQL = f"{x if isinstance(x,int) else 'NULL'}, {x if isinstance(x,float) else 'NULL'}, {x if isinstance(x,str) else 'NULL'}"
                        cur.execute(
                            f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")

                else:
                    parametersSQL = f"{parValue if isinstance(parValue,int) else 'NULL'}, {parValue if isinstance(parValue,float) else 'NULL'}, {parValue if isinstance(parValue,str) else 'NULL'}"
                    cur.execute(
                        f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")

    else:
        print("No parameters given, using default values.")

    sql_str = f"""
    do begin
    lt_data = select * from {data_table};
    lt_control = select * from {parameter_table};
    CALL _SYS_AFL.PAL_PIPELINE_FIT(:lt_data, :lt_control, lt_model, lt_stats);
    INSERT INTO {stats_table} SELECT * FROM :lt_stats;
    INSERT INTO {model_table} SELECT * FROM :lt_model;
    end;"""

    with cc.connection.cursor() as cur:
        cur.execute(sql_str)

    return cc.table(model_table).collect(), cc.table(stats_table).collect()

Function of Printing Pipeline Table

def print_pipeline_table(pipeline_table_pd):
    num = len(pipeline_table_pd)
    for i in range(num):
        print("ID = " + str(pipeline_table_pd["ID"][i]))
        print("PIPELINE: " + pipeline_table_pd["PIPELINE"][i])
        print("SCORES: " + pipeline_table_pd["SCORES"][i])

Function of Splitting Training and Test Data

from hana_ml.algorithms.pal.partition import train_test_val_split

def train_test_split_fun(df, test_ratio = 0.2, random_seed = 123):
    df_hana = create_dataframe_from_pandas(connection_context=cc, pandas_df=df, table_name='DATA', force=True, replace=True)
    train_df, test_df, valid_df = train_test_val_split(data=df_hana, random_seed = random_seed, training_percentage = 1 - test_ratio, testing_percentage = test_ratio, validation_percentage = 0.0)
    df_train = train_df.collect()
    df_test = test_df.collect()

    return df_train, df_test

Functions of Getting Target in Test Data

def produce_y_test(df_test, key, label):
    df_test.sort_values(by = key, inplace = True, ascending = True)
    y_test = np.array(df_test[label])

    return y_test

def produce_predict(res_pd0, key):
    res_pd0[key] = res_pd0[key].astype(int)
    res_pd0['SCORES'] = res_pd0['SCORES'].astype(float)
    res_pd0.sort_values(by = key, inplace = True, ascending = True)
    y_predict0 = np.array(res_pd0['SCORES'])

    return y_predict0

Functions of Errors

def mae_fun(real_data, forecast_data):
    error = real_data - forecast_data

    return np.mean(np.abs(error))

def evar_fun(real_data, forecast_data):
    error = real_data - forecast_data

    error_square = np.var(error)
    real_square = np.var(real_data)

    if real_square < 1.0e-16:
        return 0.0

    evar = 1.0 - error_square / real_square

    return evar
​ In regression, an outlier is a data point which is different from the general behavior of remaining data points. There are multiple regression models, and the regression outlier detection also depends on the regression model. In Predictive Analysis Library (PAL) of SAP HANA, we have a procedure called _SYS_AFL.PAL_OUTLIER_DETECTION_FOR_REGRESSION to detect outliers in regression dataset. In hana-ml, we have the corresponding function hana_ml.algorithms.pal.preprocessing.OutlierDetectionRegression. You can find more details about the procedure in Outlier Detection for Regression | SAP Help Portal and OutlierDetectionRegression — hana-ml 2.23.2401217 documentation. We also embed regression outlier detection into AutoML of PAL. For more details about regression outlier detection in AutoML, you can find them in AutoML | SAP Help Portal and Optimization | SAP Help Portal. We will also embed regression outlier detection into AutomaticRegression of hana-ml soon.In this blog post, you will learn:how to detect regression outliers in hana-mlhow to use regression outlier detection in AutoML of PALTo make it easy to read and show the results, we call the hana-ml and PAL procedures in Jupyter Notebook. For calling the procedures and plotting the results, we need some functions. We put the functions in the Appendix.IntroductionIn regression, outliers can have many causes, such as measurement errors, experimental anomalies and natural variation. In regression analysis, outliers can have a huge impact on the regression model and parameters, and affect the prediction results.Regression outlier detection is model dependent. In the regression outlier detection procedure, two commonly used regression models are included. They are the linear model based on MLR and the tree model based on HGBT. In AutoML, all the current regressors support outlier detection. They are EXP_Regressor, GEO_Regressor, GLM_Regressor, HGBT_Regressor, LOG_Regressor, MLP_Regressor, MLP_M_TASK_Regressor, MLR_Regressor, POL_Regressor, RDT_Regressor, DT_Regressor, and SVM_Regressor. For more information about the regressors, see Pipeline Operator | SAP Help Portal.PAL Procedure of Outlier Detection for Regression IntroductionIn the hana-ml function hana_ml.algorithms.pal.preprocessing.OutlierDetectionRegression, we include two commonly used regression models, the linear model based on Multiple Linear Regression (MLR) and the tree model based on Hybrid Gradient Boosting Tree (HGBT). The outlier detection procedure is divided into two steps. In step 1, we get the residual from the original data and the selected model. In step 2, we detect the outliers from the residual. Specifically, we calculate the outlier score of each data point from residual and compare the score with the threshold in step 2. For the linear model, the outlier score is the deleted studentized residual. For the tree model, the outlier score is the z-score of the residual.In the output of this procedure, we have a result table and a statistic table. In the result table, the residual, outlier score and outlier label are included. In the statistic table, some information of the outlier detection method is included. Test CasesIn order to demonstrate the characteristics of the regression models and better display the plots, we choose 2 dimensional data in test cases. To call the PAL procedure with python, we need to import some python packages.       import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from hana_ml.dataframe import create_dataframe_from_pandas
from hana_ml.algorithms.pal.preprocessing import OutlierDetectionRegression       case 1: linear modeldataThe data is about the average weekly expenditure on tobacco (x, in British pounds) and the average weekly expenditure on alcohol (y, in British pounds) for households in n = 11 different regions in the United Kingdom. The data is from the British government regularly conducts surveys on household spending in 1981. You can find the data in https://online.stat.psu.edu/stat501/lesson/4. The data is as follows.       df = pd.read_csv(‘your path and file’, sep = “t”)
df       Let us add ID column and keep the feature and target for regression.       X = np.array(df[“Tobacco”])
y = np.array(df[“Alcohol”])
df_with_id = pd.DataFrame({“ID”:list(range(len(df))), “Tobacco”:X, “Alcohol”:y})
plt.scatter(X, y)
plt.grid()       outlier detectionWe call the hana-ml function hana_ml.algorithms.pal.preprocessing.OutlierDetectionRegression to detect outliers. We use linear model to do outlier detection in this case. The python code and results are as follows.       df_hana = create_dataframe_from_pandas(connection_context=cc, pandas_df=df_with_id, table_name=’DATA’, force=True, replace=True)
tsreg = OutlierDetectionRegression(regression_model=’linear’)
res = tsreg.fit_predict(data=df_hana, key=’ID’, label=’Alcohol’)
res.collect()              stats = tsreg.get_model_metrics()
stats.collect()             outlier_result_plot(res.collect(), X)              outlier_plot(res.collect(), X)         From the above results, we can see that the outlier is detected correctly by linear model.case 2: tree modeldataThe data is about the relationship between marks of students and their study time. You can find the data from https://www.kaggle.com/datasets/yasserh/student-marks-dataset. There is also a feature called number of courses in the dataset. However, to plot target and feature on 2-dimensional plane, we drop this feature in the regression outlier analysis. The data is as below.       file = ‘your path and file’
df = pd.read_csv(file)
df        We drop the column “number_courses” and plot the data.       y = np.array(df[“Marks”])
X = np.array(df[“time_study”])
num = len(y)
ID = np.arange(num)
df_with_id = pd.DataFrame({“ID”:ID, “time_study”: X, “Marks”: y})
plt.scatter(X, y)
plt.grid()       outlier detection with linear modelWe call the hana-ml function hana_ml.algorithms.pal.preprocessing.OutlierDetectionRegression to detect outliers. Let us also use linear model to do outlier detection in this case. The python code and results are as follows.      df_hana = create_dataframe_from_pandas(connection_context=cc, pandas_df=df_with_id, table_name=’DATA’, force=True, replace=True)
tsreg = OutlierDetectionRegression(regression_model=’linear’)
res = tsreg.fit_predict(data=df_hana, key=’ID’, label=’Marks’)
res.collect()              stats = tsreg.get_model_metrics()
stats.collect()              outlier_result_plot(res.collect(), X)         We can see that there is no outlier detected by linear model. This result is reasonable from the data plot. However, we can find that the residuals are quite large, and do not obey Gaussian distribution. It looks that the residual and feature are approximate square relationships. It shows that linear model is not suitable for this dataset.outlier detection with tree modelNow let us turn to the tree model.      tsreg = OutlierDetectionRegression(regression_model=’tree’)
res = tsreg.fit_predict(data=df_hana, key=’ID’, label=’Marks’)
res.collect()              stats = tsreg.get_model_metrics()
stats.collect()              outlier_result_plot(res.collect(), X)         There is also no outlier detected by tree model and it is reasonable. The residuals are much smaller than linear model, and they are more consistent with Gaussian distribution than linear model. This dataset is suitable for analysis using the tree model.data with an outlier addedNow we put an outlier into the dataset.       y = np.array(df[“Marks”])
X = np.array(df[“time_study”])
X = np.append(X, 6)
y = np.append(y, 48)
num = len(y)
ID = np.arange(num)
df_with_id = pd.DataFrame({“ID”:ID, “time_study”: X, “Marks”: y})
plt.scatter(X, y)
plt.grid()       outlier detection with tree modelLet us use tree model to detect outliers in this case.      df_hana = create_dataframe_from_pandas(connection_context=cc, pandas_df=df_with_id, table_name=’DATA’, force=True, replace=True)
tsreg = OutlierDetectionRegression(regression_model=’tree’)
res = tsreg.fit_predict(data=df_hana, key=’ID’, label=’Marks’)
res.collect()              stats = tsreg.get_model_metrics()
stats.collect()            outlier_result_plot(res.collect(), X)              outlier_plot(res.collect(), X)      From the above results, we can see that the outlier is detected correctly by the tree model.For the datasets which are suitable for linear regression, we can use the linear model to detect outliers. However, if the dataset does not obey linear regression, we need to use the tree model, or make some transformations to the data to make it obey linear regression, and than use the linear model.Regression Outlier Detection in AutoMLIntroductionWe also embed the regression outlier detection in AutoML. It is very easy to use regression outlier detection in AutoML. When PIPELINE_TYPE is set to regressor in AutoML, we just need to set the parameter REGRESSION_OUTLIER to 1, and call the the PAL procedure _SYS_AFL.PAL_AUTOML_FIT. In the following part of this section, I will introduce how regression outlier detection works in AutoML.In AutoML, we will get best pipeline from cross validation result. When PIPELINE_TYPE is set to regressor, we can perform regression outlier tuning on the best pipeline. The process of how regression outlier detection works is shown as below. The newly added regression outlier tuning process is in the red box. In the regression outlier tuning process, we will start with the best pipeline (referred to as BP) generated in the previous step. From BP, we will create a set of pipelines, denoted as Pset = {Pi | i = 1, 2, …, N}. The only difference between the pipelines in Pset and BP is the regressor’s parameters. Each pipeline in Pset will apply regression outlier detection and dropping before performing the regression. The specific method for outlier detection will depend on the type of regressor used.For the generation of Pset, let us take an example. Suppose we set the parameter REGRESSION_OUTLIER_PIPELINE_NUM to 4 and REGRESSION_OUTLIER_THRESHOLDS to [3,4,5](default value), and BP is as follows:       {“HGBT_Regressor”:{“args”:{“ITER_NUM”:100,”OBJ_FUNC”:2,”ETA”:0.5,”MAX_DEPTH”:9,”MIN_CHILD_HESSIAN”:6.0},”inputs”:{“data:{“SCALE”:{“args”:{“NEW_MAX”:3.0,”NEW_MIN”:1.0,”SCALING_METHOD”:0} ,”inputs”:{“data”:”ROWDATA”}}}}}}       Pset may consist of the following four pipelines.       {“HGBT_Regressor”:{“args”:{“ITER_NUM”:100,”OBJ_FUNC”:2,”ETA”:0.5,”MAX_DEPTH”:9,”MIN_CHILD_HESSIAN”:6.0,”REGRESSION_OUTLIER”:1,”REGRESSION_OUTLIER_THRESHOLD”:3.0},”inputs”:{“data:{“SCALE”:{“args”:{“NEW_MAX”:3.0,”NEW_MIN”:1.0,”SCALING_METHOD”:0} ,”inputs”:{“data”:”ROWDATA”}}}}}}{“HGBT_Regressor”:{“args”:{“ITER_NUM”:100,”OBJ_FUNC”:2,”ETA”:0.5,”MAX_DEPTH”:9,”MIN_CHILD_HESSIAN”:6.0,”REGRESSION_OUTLIER”:1,”REGRESSION_OUTLIER_THRESHOLD”:4.0},”inputs”:{“data:{“SCALE”:{“args”:{“NEW_MAX”:3.0,”NEW_MIN”:1.0,”SCALING_METHOD”:0} ,”inputs”:{“data”:”ROWDATA”}}}}}}{“HGBT_Regressor”:{“args”:{“ITER_NUM”:100,”OBJ_FUNC”:2,”ETA”:0.5,”MAX_DEPTH”:9,”MIN_CHILD_HESSIAN”:6.0,”REGRESSION_OUTLIER”:1,”REGRESSION_OUTLIER_THRESHOLD”:5.0},”inputs”:{“data:{“SCALE”:{“args”:{“NEW_MAX”:3.0,”NEW_MIN”:1.0,”SCALING_METHOD”:0} ,”inputs”:{“data”:”ROWDATA”}}}}}}{“HGBT_Regressor”:{“args”:{“ITER_NUM”:100,”OBJ_FUNC”:2,”ETA”:0.5,”MAX_DEPTH”:11,”MIN_CHILD_HESSIAN”:4.0,”REGRESSION_OUTLIER”:1,”REGRESSION_OUTLIER_THRESHOLD”:4.0},”inputs”:{“data:{“SCALE”:{“args”:{“NEW_MAX”:3.0,”NEW_MIN”:1.0,”SCALING_METHOD”:0} ,”inputs”:{“data”:”ROWDATA”}}}}}}       The first three pipelines only add the parameters REGRESSION_OUTLIER and REGRESSION_OUTLIER_THRESHOLD to the BP, and these pipelines will always be generated. 
In the last pipeline of Pset, in addition to adding the two parameters REGRESSION_OUTLIER and REGRESSION_OUTLIER_THRESHOLD, other parameters of the HGBT regressor have also been modified. These additional parameters are adjusted randomly. In this example, we will use the HGBT method for outlier detection because the regressor being used is HGBT.The cross validation on pipelines with outlier detection is shown as below. After cross validation (CV) splitting, we will detect and drop outliers in the training data, and than do the training and test procedure, and finally calculate the cross validation scorings. To handle the outliers in the CV training data, we just drop the outlier points. Outliers can also be in the CV test data. Considering that outliers may significantly impact certain metrics, we have adjusted the cross-validation score measures EVAR, MAE, MSE, RMSE, R2, MAPE, and WMAPE in this step to mitigate the influence of outliers. The corresponding adjusted metrics are termed EVAROUTLIER, MAEOUTLIER, MSEOUTLIER, RMSEOUTLIER, R2OUTLIER, MAPEOUTLIER, and WMAPEOUTLIER, respectively. The adjusted scores are also calculate for BP in the regression outlier tuning result. Let us take MAE as an example. We will drop some smallest and largest absolute residuals. The procedure is as follows.The bar is the sorted absolute test residuals in CV. In the second bar, we drop the grey part and keep the blue part. In more details, we drop M test data points with the largest absolute residuals and M test data points with the smallest absolute residuals, where M is no more than 5% of the test data. Assuming the residuals are x = {xi|i = 1,2,…,N}, we calculate {si|i = 1,2,…,N}, where si = 0.67449×|xi|/median(|x|). This is the MAD score with zero median. We drop the points with si > 5 and an equivalent number of points with the smallest si. After dropping these points, we calculate the adjusted measure MAEDROP, and MAEOUTLIER is defined as 0.5×(MAE + MAEDROP). Other measures are adjusted similarly.For the comparison in the regression outlier tuning process, if a pipeline with regression outlier detection outperforms BP in all the measures specified by SCORINGS, and is equal to or better than the other pipelines, it will be considered the best pipeline for this step. If no pipeline in Pset outperforms BP in all the measures according to SCORINGS, BP will remain the best pipeline for this step. The remaining pipelines are then sorted using a multi-objective sorting method.In the Pipeline Table, the output pipelines from the previous step remain and are controlled by the parameter ELITE_NUMBER. The output pipelines from the regression outlier tuning process are also included in the Pipeline Table and are controlled by the parameter REGRESSION_OUTLIER_TUNE_ELITE_NUMBER. If the best pipeline from this step includes outlier detection, it is placed at the top of the Pipeline Table. Otherwise, the best pipeline from the previous step is placed at the top. Additionally, the IDs of the pipelines from this step are displayed in the Statistic Table.Test CasesIn this part, I will introduce two test cases. In one case, the final best pipeline is without outlier detection. In the other case, the final best pipeline is with outlier detection. In the test cases, I will use the dataset Concrete Compressive Strength[1]. It is about the relationship between the concrete compressive strength and some variables, such as water and age. 
You can find it in https://archive.ics.uci.edu/dataset/165/concrete+compressive+strength.Case 1: The final best pipeline is without outlier detection.dataThe data is from https://archive.ics.uci.edu/dataset/165/concrete+compressive+strength. I also add an ID column in the dataset.        file = ‘your path and file name’
df_with_id = pd.read_csv(file)
df_with_id       The target column is “Concrete compressive strength”.To do the AutoML fitting and prediction, we will split the data into training part and test part with the hana-ml function hana_ml.algorithms.pal.partition.train_test_val_split. The training part is for AutoML fitting and producing the model. The test part is for testing the prediction from the model. We also select the target column and produce the target values from the test part.      df_train, df_test = train_test_split_fun(df_with_id, test_ratio = 0.2)
target = “Concrete compressive strength”
df_test_X = df_test.copy()
df_test_X.drop(columns=target, inplace=True)
y_test = produce_y_test(df_test, key = “ID”, label = target)        AutoML FittingIn this part, we will fit the training data with AutoML and get a model.       fit_parameters = {
“REGRESSION_OUTLIER”: 1,
“THREAD_RATIO”: 1.0,
“PIPELINE_TYPE”:’regressor’,
“DEPENDENT_VARIABLE”:target,
“RANDOM_SEED”: 123,
“GENERATIONS”: 1000,
“POPULATION_SIZE”: 20,
“CONFIG_DICT”:json.dumps(config_dict),
“ELITE_NUMBER”: 5,
“REGRESSION_OUTLIER_PIPELINE_NUM”: 4,
“REGRESSION_OUTLIER_TUNE_ELITE_NUMBER”: 5,
}
pipeline_table_pd, model_table_pd, stats_table_pd = AutoML_fit(df_train, fit_parameters, cc)              stats_table_pd        From the statistic table, we can see that the pipeline IDs of regression outlier tuning are 5, 6, 7, 8, 9. This shows that the best pipeline is without outlier detection. Let us print the pipeline table.       print_pipeline_table(pipeline_table_pd)       The pipelines with ID = 0,1,2,3,4 are from the original process. In other words, if we take parameter REGRESSION_OUTLIER to 0 (default value), we will get and only get the above pipelines with ID = 0,1,2,3,4. For pipelines with ID = 5,6,7,8,9, they are from the outlier tuning process. For pipeline with ID = 5, it is the same as pipeline with ID = 0. Because it is the old best pipeline, we take it to compare with pipelines with outlier detection. Therefore, there are “EVAROUTLIER” and “MAEOUTLIER” in scores of ID = 5,6,7,8,9. These two measures are adjusted measures from EVAR and MAE, and they are the scores for outlier tuning process. Because no pipeline with outlier detection outperforms the old best pipeline (ID = 5) in both “EVAROUTLIER” and “MAEOUTLIER”, the final best pipeline is still the old best pipeline. Therefore, we put the output pipelines from original process in front of outlier tuning process. Although the pipelines with ID = 0 and ID = 5 are the same, the scores may be different, because their cross validation training and test data may be different. The pipelines in outlier tuning process share the same cross validation training and test data.prediction from AutoML modelLet us see the prediction errors from the model of best pipeline.       test_parameters = {

}
test_result_table_pd, test_stats_table_pd = AutoML_predict(df_test_X, test_parameters, model_table_pd, cc)
pred = produce_predict(test_result_table_pd, key = “ID”)
mae = mae_fun(y_test, pred)
ever = evar_fun(y_test, pred)
print(“mae = {}, evar = {}”.format(mae, ever))             mae = 3.131938162929128, evar = 0.9244784555351252      prediction from pipeline with outlier detectionLet us see the prediction from pipeline with ID = 6, which is the best pipeline with outlier detection. We will get the model by fitting the pipeline and then predict with the model.      pipe_parameters = {
“DEPENDENT_VARIABLE”:target,
“PIPELINE”:json.dumps(pipe),
}
model_table_pipe, stats_table_pipe = Pipeline_fit(df_train, pipe_parameters, cc)
test_result_table_pipe, test_stats_table_pipe = AutoML_predict(df_test_X, test_parameters, model_table_pipe, cc)
pred_pipe = produce_predict(test_result_table_pipe, key = “ID”)
mae_pipe = mae_fun(y_test, pred_pipe)
ever_pipe = evar_fun(y_test, pred_pipe)
print(“mae_pipe = {}, ever_pipe = {}”.format(mae_pipe, ever_pipe))mae_pipe = 3.0678810742472957, ever_pipe = 0.9212280312761948        We can see that the test errors are not the same with AutoML model. This is because some outliers are detected by the pipeline with outlier detection. The results of pipeline with outlier detection are comparable with AutoML model. Both MAE and EVER are a little smaller than AutoML model’s results. For MAE, smaller is better, while for EVAR the opposite is true.Case 2: The final best pipeline is with outlier detection.dataIn this test case, we include some outliers into the training data df_train of test case1.      y = np.array(df_with_id[“Concrete compressive strength”])
y_std = np.std(y)
np.random.seed(234)
outlier_idxes = np.random.randint(0, len(df_train), size=8)
outlier_amplitude = 3
for idx in outlier_idxes:
df_train[“Concrete compressive strength”][idx] = df_train[“Concrete compressive strength”][idx] + (-1)**idx * outlier_amplitude * y_std
df_train

AutoML Fitting

In this part, we fit the new training data with AutoML and get a model.

fit_parameters = {
"REGRESSION_OUTLIER": 1,
"THREAD_RATIO": 1.0,
"PIPELINE_TYPE": "regressor",
"DEPENDENT_VARIABLE": target,
"RANDOM_SEED": 123,
"GENERATIONS": 1000,
"POPULATION_SIZE": 20,
"CONFIG_DICT": json.dumps(config_dict),
"ELITE_NUMBER": 5,
"REGRESSION_OUTLIER_PIPELINE_NUM": 4,
"REGRESSION_OUTLIER_TUNE_ELITE_NUMBER": 5,
}
pipeline_table_pd, model_table_pd, stats_table_pd = AutoML_fit(df_train, fit_parameters, cc)
stats_table_pd

From the statistic table, we can see that the pipeline IDs of the regression outlier tuning process are 0,1,2,3,4. This shows that the best pipeline is with outlier detection. Let us print the pipeline table.

print_pipeline_table(pipeline_table_pd)

Contrary to test case 1, the pipelines with ID = 0,1,2,3,4 come from the outlier tuning process. The pipeline with ID = 0 uses outlier detection, and the pipeline with ID = 2 is the old best pipeline. The pipeline with ID = 0 outperforms the pipeline with ID = 1 in both "EVAROUTLIER" and "MAEOUTLIER", and is equal to or better than the other pipelines in the outlier tuning process.

prediction from AutoML model

Let us see the prediction errors from the model of the best pipeline.

test_parameters = {}
test_result_table_pd, test_stats_table_pd = AutoML_predict(df_test_X, test_parameters, model_table_pd, cc)
pred = produce_predict(test_result_table_pd, key="ID")
mae = mae_fun(y_test, pred)
evar = evar_fun(y_test, pred)
print("mae = {}, evar = {}".format(mae, evar))

mae = 2.9165821766760427, evar = 0.9275281312643902

prediction from the old best pipeline

Let us see the prediction from the pipeline with ID = 5, which is the best pipeline of the original process. We will get the model by fitting the pipeline and then predict with the model.

pipe = {"HGBT_Regressor": {"args": {"SEED": 2024, "ITER_NUM": 100, "OBJ_FUNC": 2, "ETA": 0.5, "MAX_DEPTH": 8, "MIN_CHILD_HESSIAN": 4.0}, "inputs": {"data": {"SCALE": {"args": {"SCALING_METHOD": 2}, "inputs": {"data": "ROWDATA"}}}}}}
pipe_parameters = {
"DEPENDENT_VARIABLE": target,
"PIPELINE": json.dumps(pipe),
}
model_table_pipe, stats_table_pipe = Pipeline_fit(df_train, pipe_parameters, cc)
test_result_table_pipe, test_stats_table_pipe = AutoML_predict(df_test_X, test_parameters, model_table_pipe, cc)
pred_pipe = produce_predict(test_result_table_pipe, key="ID")
mae_pipe = mae_fun(y_test, pred_pipe)
evar_pipe = evar_fun(y_test, pred_pipe)
print("mae_pipe = {}, evar_pipe = {}".format(mae_pipe, evar_pipe))

mae_pipe = 2.968404778787287, evar_pipe = 0.9250418298900762

We can see that mae_pipe is larger than mae and evar_pipe is smaller than evar in test case 2. Outlier detection and dropping therefore improves the prediction in this case.

Conclusions

In this blog post, we explained what an outlier in regression is and presented a PAL procedure, together with its corresponding hana-ml function, to detect outliers in regression data. The procedure supports two models: the linear model based on MLR and the tree model based on HGBT. We also embedded regression outlier detection into AutoML, which can help to improve the regression prediction when there are outliers in the AutoML input data. Hope you enjoy reading this blog!

Regression outlier detection is already included in hana-ml, and the embedding of regression outlier detection in AutoML will be included in hana-ml as well. If you want to learn more about this topic in SAP HANA Predictive Analysis Library (PAL) and hana-ml, please refer to the following links:

OutlierDetectionRegression — hana-ml 2.23.2401217 documentation
Outlier Detection for Regression | SAP Help Portal
AutoML | SAP Help Portal
Optimization | SAP Help Portal

[1]: "Concrete Compressive Strength" by I-Cheng Yeh is licensed under CC BY 4.0.

Appendix

SAP HANA Connection

import hana_ml
from hana_ml import dataframe
# all helper functions below take this connection context as the cc argument
cc = dataframe.ConnectionContext('host', 'port', 'username', 'password')
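Once the connection exists, the standalone detector can also be called directly through hana-ml. The following is a minimal sketch: the fit_predict call, its key/label parameters, and the 'linear' model name are assumptions based on the hana-ml documentation linked above, so verify them against your hana-ml version.

# Minimal sketch of the standalone detector (assumed signature; verify in your hana-ml version):
from hana_ml.algorithms.pal.preprocessing import OutlierDetectionRegression

odr = OutlierDetectionRegression(regression_model='linear')    # assumption: 'linear' (MLR) or 'tree' (HGBT)
# df_hana: a hana-ml DataFrame with an ID column and a target column named 'TARGET'
res = odr.fit_predict(data=df_hana, key='ID', label='TARGET')  # assumption: fit_predict(data, key, label)
res.collect()  # result table with RESIDUAL, OUTLIER_SCORE and IS_OUTLIER columns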
Functions of Plotting Regression Outlier Detection Results

import numpy as np
import matplotlib.pyplot as plt

def outlier_result_plot(dResults, X):
    # sort by the first column (the key column) so the points line up with X
    dResults.sort_values(by=list(dResults)[0], inplace=True, ascending=True)
    raw_data = np.array(dResults['TARGET'])
    residual = np.array(dResults['RESIDUAL'])
    outlier_score = np.array(dResults['OUTLIER_SCORE'])
    is_outlier = np.array(dResults['IS_OUTLIER'])
    plt.figure(figsize=(24, 4.5))
    plt.subplot(1, 4, 1)
    plt.scatter(X, raw_data)
    plt.grid()
    plt.xlabel("X")
    plt.title('TARGET')
    plt.subplot(1, 4, 2)
    plt.scatter(X, residual)
    plt.grid()
    plt.xlabel("X")
    plt.title('RESIDUAL')
    plt.subplot(1, 4, 3)
    plt.scatter(X, outlier_score)
    plt.grid()
    plt.xlabel("X")
    plt.title('OUTLIER_SCORE')
    plt.subplot(1, 4, 4)
    plt.scatter(X, is_outlier)
    plt.grid()
    plt.xlabel("X")
    plt.title('IS_OUTLIER')

def outlier_plot(dResults, X):
    dResults.sort_values(by=list(dResults)[0], inplace=True, ascending=True)
    raw_data = np.array(dResults['TARGET'])
    is_outlier = np.array(dResults['IS_OUTLIER'])
    # collect the indexes of the points flagged as outliers
    outlier_idx = np.array([], dtype=int)
    for i in range(len(is_outlier)):
        if is_outlier[i] == 1:
            outlier_idx = np.append(outlier_idx, i)
    plt.figure()
    plt.scatter(X, raw_data)
    plt.scatter(X[outlier_idx], raw_data[outlier_idx], color='red')  # outliers in red
    plt.grid()
    plt.title("regression data and outlier")
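Because the two helpers only need a pandas frame in the result-table layout, they can be tried without a database round trip. The demo below uses synthetic data; the column names follow the result table described in the blog, and the threshold of 3 is only an illustrative choice.

# Synthetic demo of the plotting helpers (illustrative data, not PAL output):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 50)
target = 2.0 * X + rng.normal(0, 0.5, 50)
target[25] += 10.0  # inject one obvious outlier
residual = target - 2.0 * X
score = (residual - residual.mean()) / residual.std()
demo = pd.DataFrame({"ID": np.arange(50), "TARGET": target, "RESIDUAL": residual,
                     "OUTLIER_SCORE": score, "IS_OUTLIER": (np.abs(score) > 3).astype(int)})
outlier_result_plot(demo, X)
outlier_plot(demo, X)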
Functions for Regression Outlier Detection in AutoML

Functions for Table

def createEmptyTable(table_name, proto, cc):
    with cc.connection.cursor() as cur:
        try:
            joint = []
            for key in proto:
                joint.append(" ".join(['"{:s}"'.format(key), proto[key]]))
            cur.execute('CREATE COLUMN TABLE %s (%s);' %
                        (table_name, ",".join(joint)))
        except Exception:
            print(
                f'"CREATE TABLE {table_name}" was unsuccessful. Maybe the table already exists.')

def dropTable(table_name, cc):
    with cc.connection.cursor() as cur:
        try:
            cur.execute(f"DROP TABLE {table_name}")
        except Exception:
            print(f'"DROP TABLE {table_name}" was unsuccessful. Maybe the table does not exist yet.')

def createTableFromDataFrame(df, table_name, cc):
    # drop and recreate the table from a pandas DataFrame
    dropTable(table_name, cc)
    dt_ml = dataframe.create_dataframe_from_pandas(cc, df, table_name=table_name)
    return dt_ml
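As a quick sanity check, the three helpers can be exercised on their own; the table name below is just an example.

# Hypothetical quick check of the table helpers (demo table name):
createEmptyTable('ZPD_PAL_DEMO_TBL', {"ID": "INTEGER", "VAL": "DOUBLE"}, cc)
dropTable('ZPD_PAL_DEMO_TBL', cc)
dt = createTableFromDataFrame(df_train, 'ZPD_PAL_DEMO_TBL', cc)  # returns a hana-ml DataFrame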
AutoML and Pipeline Functions

def AutoML_fit(df, parameters, cc,
               data_table='ZPD_PAL_DATA_TBL',
               parameter_table='ZPD_PAL_PARAMETERS_TBL',
               pipeline_table='ZPD_PAL_PIPELINE_TBL',
               model_table='ZPD_PAL_MODEL_TBL',
               stats_table='ZPD_PAL_STATS_TBL'):

    # Input table
    createTableFromDataFrame(df, data_table, cc)

    # Pipeline table
    dropTable(pipeline_table, cc)
    createEmptyTable(pipeline_table, {
        "ID": "INTEGER", "PIPELINE": "NCLOB", "SCORES": "NCLOB"}, cc)

    # Model table
    dropTable(model_table, cc)
    createEmptyTable(model_table, {
        "ROW_INDEX": "INTEGER", "MODEL_CONTENT": "NVARCHAR(5000)"}, cc)

    # Stats table
    dropTable(stats_table, cc)
    createEmptyTable(stats_table, {
        "STAT_NAME": "NVARCHAR(5000)", "STAT_VALUE": "NVARCHAR(5000)"}, cc)

    # Parameter table
    dropTable(parameter_table, cc)
    createEmptyTable(parameter_table, {"PARAM_NAME": "nvarchar(256)", "INT_VALUE": "integer",
                                       "DOUBLE_VALUE": "double", "STRING_VALUE": "nvarchar(5000)"}, cc)

    if parameters:
        with cc.connection.cursor() as cur:
            for parName, parValue in parameters.items():
                # each value goes into exactly one of the INT/DOUBLE/STRING columns
                if isinstance(parValue, str):
                    parValue = f"'{parValue}'"
                    parametersSQL = f"{parValue if isinstance(parValue, int) else 'NULL'}, {parValue if isinstance(parValue, float) else 'NULL'}, {parValue if isinstance(parValue, str) else 'NULL'}"
                    cur.execute(
                        f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")
                elif isinstance(parValue, list):
                    # list-valued parameters produce one row per element
                    for x in parValue:
                        if isinstance(x, str):
                            x = f"'{x}'"
                        parametersSQL = f"{x if isinstance(x, int) else 'NULL'}, {x if isinstance(x, float) else 'NULL'}, {x if isinstance(x, str) else 'NULL'}"
                        cur.execute(
                            f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")
                else:
                    parametersSQL = f"{parValue if isinstance(parValue, int) else 'NULL'}, {parValue if isinstance(parValue, float) else 'NULL'}, {parValue if isinstance(parValue, str) else 'NULL'}"
                    cur.execute(
                        f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")
    else:
        print("No parameters given, using default values.")

    sql_str = f"""
    do begin
        lt_data = select * from {data_table};
        lt_control = select * from {parameter_table};
        CALL _SYS_AFL.PAL_AUTOML_FIT(:lt_data, :lt_control, lt_pipeline, lt_model, lt_stats);
        INSERT INTO {pipeline_table} SELECT * FROM :lt_pipeline;
        INSERT INTO {stats_table} SELECT * FROM :lt_stats;
        INSERT INTO {model_table} SELECT * FROM :lt_model;
    end;"""

    with cc.connection.cursor() as cur:
        cur.execute(sql_str)

    return cc.table(pipeline_table).collect(), cc.table(model_table).collect(), cc.table(stats_table).collect()
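The insertion loop above maps each Python value onto exactly one typed column of the PAL parameter table. The helper below (param_row is our name, not part of the blog's code) mirrors that typing logic and shows where a few example values land:

def param_row(name, value):
    # mirrors the typing logic in AutoML_fit: one typed slot, the others NULL
    return (name,
            value if isinstance(value, int) else None,
            value if isinstance(value, float) else None,
            value if isinstance(value, str) else None)

for name, value in {"RANDOM_SEED": 123, "THREAD_RATIO": 1.0,
                    "PIPELINE_TYPE": "regressor"}.items():
    print(param_row(name, value))
# ('RANDOM_SEED', 123, None, None)
# ('THREAD_RATIO', None, 1.0, None)
# ('PIPELINE_TYPE', None, None, 'regressor')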

def AutoML_predict(df, parameters, model_table_pd, cc,
                   data_table='ZPD_PAL_DATA_TBL',
                   parameter_table='ZPD_PAL_PARAMETERS_TBL',
                   result_table='ZPD_PAL_RESULT_TBL',
                   model_table='ZPD_PAL_MODEL_TBL',
                   stats_table='ZPD_PAL_STATS_TBL'):

    # Input table
    createTableFromDataFrame(df, data_table, cc)

    # Result table
    dropTable(result_table, cc)
    createEmptyTable(result_table, {
        "ID": "NVARCHAR(5000)", "SCORES": "NVARCHAR(5000)"}, cc)

    # Stats table
    dropTable(stats_table, cc)
    createEmptyTable(stats_table, {
        "STAT_NAME": "NVARCHAR(5000)", "STAT_VALUE": "NVARCHAR(5000)"}, cc)

    # Model table: upload the fitted model returned by AutoML_fit
    createTableFromDataFrame(model_table_pd, model_table, cc)

    # Parameter table
    dropTable(parameter_table, cc)
    createEmptyTable(parameter_table, {"PARAM_NAME": "nvarchar(256)", "INT_VALUE": "integer",
                                       "DOUBLE_VALUE": "double", "STRING_VALUE": "nvarchar(5000)"}, cc)

    if parameters:
        with cc.connection.cursor() as cur:
            for parName, parValue in parameters.items():
                if isinstance(parValue, str):
                    parValue = f"'{parValue}'"
                    parametersSQL = f"{parValue if isinstance(parValue, int) else 'NULL'}, {parValue if isinstance(parValue, float) else 'NULL'}, {parValue if isinstance(parValue, str) else 'NULL'}"
                    cur.execute(
                        f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")
                elif isinstance(parValue, list):
                    for x in parValue:
                        if isinstance(x, str):
                            x = f"'{x}'"
                        parametersSQL = f"{x if isinstance(x, int) else 'NULL'}, {x if isinstance(x, float) else 'NULL'}, {x if isinstance(x, str) else 'NULL'}"
                        cur.execute(
                            f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")
                else:
                    parametersSQL = f"{parValue if isinstance(parValue, int) else 'NULL'}, {parValue if isinstance(parValue, float) else 'NULL'}, {parValue if isinstance(parValue, str) else 'NULL'}"
                    cur.execute(
                        f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")
    else:
        print("No parameters given, using default values.")

    sql_str = f"""
    do begin
        lt_data = select * from {data_table};
        lt_param = select * from {parameter_table};
        lt_model = select * from {model_table};
        CALL _SYS_AFL.PAL_PIPELINE_PREDICT(:lt_data, :lt_model, :lt_param, lt_result, lt_stats);
        INSERT INTO {result_table} SELECT * FROM :lt_result;
        INSERT INTO {stats_table} SELECT * FROM :lt_stats;
    end;"""

    with cc.connection.cursor() as cur:
        cur.execute(sql_str)

    return cc.table(result_table).collect(), cc.table(stats_table).collect()

def Pipeline_fit(df, parameters, cc,
                 data_table='ZPD_PAL_DATA_TBL',
                 parameter_table='ZPD_PAL_PARAMETERS_TBL',
                 model_table='ZPD_PAL_MODEL_TBL',
                 stats_table='ZPD_PAL_STATS_TBL'):

    # Input table
    createTableFromDataFrame(df, data_table, cc)

    # Model table
    dropTable(model_table, cc)
    createEmptyTable(model_table, {
        "ROW_INDEX": "INTEGER", "MODEL_CONTENT": "NVARCHAR(5000)"}, cc)

    # Stats table
    dropTable(stats_table, cc)
    createEmptyTable(stats_table, {
        "STAT_NAME": "NVARCHAR(5000)", "STAT_VALUE": "NVARCHAR(5000)"}, cc)

    # Parameter table
    dropTable(parameter_table, cc)
    createEmptyTable(parameter_table, {"PARAM_NAME": "nvarchar(256)", "INT_VALUE": "integer",
                                       "DOUBLE_VALUE": "double", "STRING_VALUE": "nvarchar(5000)"}, cc)

    if parameters:
        with cc.connection.cursor() as cur:
            for parName, parValue in parameters.items():
                if isinstance(parValue, str):
                    parValue = f"'{parValue}'"
                    parametersSQL = f"{parValue if isinstance(parValue, int) else 'NULL'}, {parValue if isinstance(parValue, float) else 'NULL'}, {parValue if isinstance(parValue, str) else 'NULL'}"
                    cur.execute(
                        f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")
                elif isinstance(parValue, list):
                    for x in parValue:
                        if isinstance(x, str):
                            x = f"'{x}'"
                        parametersSQL = f"{x if isinstance(x, int) else 'NULL'}, {x if isinstance(x, float) else 'NULL'}, {x if isinstance(x, str) else 'NULL'}"
                        cur.execute(
                            f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")
                else:
                    parametersSQL = f"{parValue if isinstance(parValue, int) else 'NULL'}, {parValue if isinstance(parValue, float) else 'NULL'}, {parValue if isinstance(parValue, str) else 'NULL'}"
                    cur.execute(
                        f"INSERT INTO {parameter_table} VALUES ('{parName}', {parametersSQL});")
    else:
        print("No parameters given, using default values.")

    sql_str = f"""
    do begin
        lt_data = select * from {data_table};
        lt_control = select * from {parameter_table};
        CALL _SYS_AFL.PAL_PIPELINE_FIT(:lt_data, :lt_control, lt_model, lt_stats);
        INSERT INTO {stats_table} SELECT * FROM :lt_stats;
        INSERT INTO {model_table} SELECT * FROM :lt_model;
    end;"""

    with cc.connection.cursor() as cur:
        cur.execute(sql_str)

    return cc.table(model_table).collect(), cc.table(stats_table).collect()
Function of Printing Pipeline Table

def print_pipeline_table(pipeline_table_pd):
    num = len(pipeline_table_pd)
    for i in range(num):
        print("ID = " + str(pipeline_table_pd["ID"][i]))
        print("PIPELINE: " + pipeline_table_pd["PIPELINE"][i])
        print("SCORES: " + pipeline_table_pd["SCORES"][i])

Function of Splitting Training and Test Data

from hana_ml.algorithms.pal.partition import train_test_val_split
from hana_ml.dataframe import create_dataframe_from_pandas
def train_test_split_fun(df, test_ratio = 0.2, random_seed = 123):
    df_hana = create_dataframe_from_pandas(connection_context=cc, pandas_df=df, table_name='DATA', force=True, replace=True)
    train_df, test_df, valid_df = train_test_val_split(data=df_hana, random_seed=random_seed, training_percentage=1 - test_ratio, testing_percentage=test_ratio, validation_percentage=0.0)
    df_train = train_df.collect()
    df_test = test_df.collect()

    return df_train, df_test

Functions of Getting Target in Test Data

def produce_y_test(df_test, key, label):
    df_test.sort_values(by=key, inplace=True, ascending=True)
    y_test = np.array(df_test[label])

    return y_test
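For context, here is a hedged sketch of how the helpers above fit together on the concrete dataset; df_with_id is the pandas frame with an "ID" column used in the test cases, and the column name is taken from the data section above.

# Hypothetical end-to-end use of the split and target helpers:
df_train, df_test = train_test_split_fun(df_with_id, test_ratio=0.2, random_seed=123)
y_test = produce_y_test(df_test, key="ID", label="Concrete compressive strength")
df_test_X = df_test.drop(columns=["Concrete compressive strength"])  # features only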
def produce_predict(res_pd0, key):
    # cast the key and SCORES columns, sort by key, and return the predictions as a numpy array
    res_pd0[key] = res_pd0[key].astype(int)
    res_pd0['SCORES'] = res_pd0['SCORES'].astype(float)
    res_pd0.sort_values(by=key, inplace=True, ascending=True)
    y_predict0 = np.array(res_pd0['SCORES'])

    return y_predict0

Functions of Errors

def mae_fun(real_data, forecast_data):
    error = real_data - forecast_data

    return np.mean(np.abs(error))

def evar_fun(real_data, forecast_data):
    error = real_data - forecast_data

    error_square = np.var(error)
    real_square = np.var(real_data)

    if real_square < 1.0e-16:
        return 0.0

    evar = 1.0 - error_square / real_square

    return evar
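As a quick sanity check on the two measures, here is a tiny worked example with made-up numbers; it confirms that better predictions drive MAE down and EVAR (explained variance, 1 - Var(error)/Var(y)) up toward 1.

# Made-up numbers to illustrate the direction of both measures:
real = np.array([10.0, 20.0, 30.0, 40.0])
good = np.array([11.0, 19.0, 31.0, 39.0])  # close predictions
bad = np.array([15.0, 15.0, 35.0, 35.0])   # coarse predictions
print(mae_fun(real, good), evar_fun(real, good))  # 1.0 0.992 -> low MAE, EVAR near 1
print(mae_fun(real, bad), evar_fun(real, bad))    # 5.0 0.8   -> higher MAE, lower EVAR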
