HANA ML Data Drift Detector

Estimated read time 14 min read

The release of HANA ML 2.23 includes a new function called: APL data drift detector. This drift detector helps you spot changes or deviations between a given dataset and a reference. Reference data could be a version in the past, or a particular segment of customers or employees, or an expected distribution (e.g. Benford). Use cases of data comparison are:

This year employee survey results versus last year results by countryMachine learning inference dataset versus training datasetMale staff versus female staffPayment amounts by legal entity versus the Benford’s law to find potential fraud

This feature from APL (Automated Predictive Library) is available for both Python and SQL.

The rest of this blog will walk you through a scenario with code snippets in Python as well as SQL that you can reuse and adapt to your own case.

We chose to work with Olympics athletes and results data. The dataset contains summer and winter games since 1896. We want to make a comparison over time.

First, we connect to the SAP HANA database using a Python notebook.

DB_Host=’host’
DB_Port=’port_number’
DB_User=’user_name’
DB_Password=’user_password’

from hana_ml import dataframe as hd
conn = hd.ConnectionContext(address = DB_Host, port = DB_Port,
user = DB_User, password = DB_Password,
encrypt = ‘true’, sslValidateCertificate = ‘false’ )
conn.connection.isconnected()

The reference dataframe comprises the Olympic games before 1970 …

sql_cmd = “””
select “Height”, “Weight”, “Age”, “Medal”, “Gender”, “Season”, “Country”, “Sport”
from “APL_SAMPLES”.”OLYMPICS”
where “Year” <= 1970
order by “Id”
“””
hdf_ref= hd.DataFrame(conn, sql_cmd)
hdf_ref.head(5).collect()

 

 

 

 

… and the comparison dataframe, after 1970:

sql_cmd = “””
select “Height”, “Weight”, “Age”, “Medal”, “Gender”, “Season”, “Country”, “Sport”
from “APL_SAMPLES”.”OLYMPICS”
where “Year” > 1970
order by “Id”
“””
hdf_new= hd.DataFrame(conn, sql_cmd)
hdf_new.head(5).collect()

 

 

 

 

The code to run the comparison is the following:

from hana_ml.algorithms.apl.drift_detector import DriftDetector
apl_model = DriftDetector()
results = apl_model.fit_detect(hdf_ref, hdf_new, build_report=True)
print(results.collect())

 

 

 

All the variables appearing here have a deviation indicator over 0.95. This threshold can be changed with the syntax below:

results = apl_model.fit_detect(hdf_ref, hdf_new, threshold=0.80)

Here is a report putting the two datasets side by side with their counts (aka weight) by category; the most changing categories (top 10 here) appear first:

df = apl_model.get_debrief_report(‘Deviation_CategoryFrequencies’).head(10).deselect([‘Oid’,’Category Order’]).collect()
format_dict = {
‘Ref Weight’: ‘{:,.0f}’, ‘New Weight’: ‘{:,.0f}’, ‘Change’: ‘{:,.0f}’,
‘Ref % Weight’: ‘{:.1f}’, ‘New % Weight’: ‘{:.1f}’, ‘% Change’: ‘{:.1f}’, ‘Abs % Change’: ‘{:.1f}’
}
df.style.format(format_dict).hide(axis=’index’)

KxMissing is a special category created by APL when finding empty or null values. With far less missing values on athletes’ height and weight, we see that data quality is getting better overtime.

More women are participating to the Olympic games after 1970. The male % change value mirrors the female % change value. The same remark applies to summer and winter seasons.

The following code brings the data drift results as charts:

apl_model.generate_notebook_iframe_report()
# apl_model.generate_html_report(‘drift_olympics’)

After comparing athletes across all types of sports, let’s compare them by sport. For that we must provide the drift detector with two dataframes sorted by Sport:

## Reference Dataset
sql_cmd = “””
select “Height”, “Weight”, “Age”, “Medal”, “Gender”, “Season”, “Country”, “Sport”
from “APL_SAMPLES”.”OLYMPICS”
where “Year” <= 1970
order by “Sport”, “Id”
“””
hdf_ref= hd.DataFrame(conn, sql_cmd)
hdf_ref.head(5).collect()

 

 

 

 

## Dataset for comparison
sql_cmd = “””
select “Height”, “Weight”, “Age”, “Medal”, “Gender”, “Season”, “Country”, “Sport”
from “APL_SAMPLES”.”OLYMPICS”
where “Year” > 1970
order by “Sport”, “Id”
“””
hdf_new= hd.DataFrame(conn, sql_cmd)
hdf_new.head(5).collect()

 

 

 

 

You specify the segment for drift detection the same way as for segmented forecasting, regression or classification scenarios:

col_segment= ‘Sport’
from hana_ml.algorithms.apl.drift_detector import DriftDetector
apl_model = DriftDetector(segment_column_name= col_segment, max_tasks= 4)
results = apl_model.fit_detect(hdf_ref, hdf_new, threshold=0.95)
print(results.sort([“Segment”,”Variable”]).collect())

The deviating variables appear for each sport:

 

 

 

 

 

We check if there are segments with status ‘Failed’:

my_filter = “KEY in (‘AplTaskStatus’) and VALUE =’Failed'”
df = apl_model.get_summary().filter(my_filter).select(‘OID’).collect()
failed_segments = df[‘OID’].tolist()
print(‘Preview of failed segments’)
print(failed_segments[:5])

 

The log contains the cause of failure:

df = apl_model.get_fit_operation_log().filter(“LEVEL = 0”).select(‘OID’,’ORIGIN’,’MESSAGE’).head(5).collect()
df.columns = [col_segment, ‘Cause’, ‘Error’]
df.style.hide(axis=’index’)

 

 

 

These sports were introduced after 1970. Thus, there is no reference data to compare against.

To obtain the Data Drift report, you must provide a segment value as in the example below:

seg_value = “Swimming”
apl_model.build_report(segment_name=seg_value)
apl_model.generate_notebook_iframe_report()
#apl_model.generate_html_report(‘Sport_Report’)

 

We are nearing the end of this post, and you have not yet seen how to run a comparison with the SQL interface. Here is the syntax for calling the new APL function COMPARE_DATA from a SQL script:

drop view DATASET_1;
create view DATASET_1 as (
select “Height”, “Weight”, “Age”, “Medal”, “Gender”, “Season”, “Country”, “Sport”
from “APL_SAMPLES”.”OLYMPICS”
where “Year” <= 1970
order by “Sport”, “Id”
);

drop view DATASET_2;
create view DATASET_2 as (
select “Height”, “Weight”, “Age”, “Medal”, “Gender”, “Season”, “Country”, “Sport”
from “APL_SAMPLES”.”OLYMPICS”
where “Year” > 1970
order by “Sport”, “Id”
);

DO BEGIN
declare header “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.FUNCTION_HEADER”;
declare config “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.OPERATION_CONFIG_EXTENDED”;
declare var_desc “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.VARIABLE_DESC_OID”;
declare var_role “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.VARIABLE_ROLES_WITH_COMPOSITES_OID”;
declare out_log “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.OPERATION_LOG”;
declare out_summary “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.SUMMARY”;
declare out_metric “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.DEBRIEF_METRIC_OID”;
declare out_property “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.DEBRIEF_PROPERTY_OID”;

:header.insert((‘Oid’, ‘Before-After 1970’));

:config.insert((‘APL/SegmentColumnName’, ‘Sport’,null));

:var_desc.insert((0,’Height’,’number’,’continuous’,0,0,null,null,null,null));
:var_desc.insert((1,’Weight’,’number’,’continuous’,0,0,null,null,null,null));
:var_desc.insert((2,’Age’,’integer’,’continuous’,0,0,null,null,null,null));
:var_desc.insert((3,’Medal’,’string’,’nominal’,0,0,null,null,null,null));
:var_desc.insert((4,’Gender’,’string’,’nominal’,0,0,null,null,null,null));
:var_desc.insert((5,’Season’,’string’,’nominal’,0,0,null,null,null,null));
:var_desc.insert((6,’Country’,’string’,’nominal’,0,0,null,null,null,null));

“SAP_PA_APL”.”sap.pa.apl.base::COMPARE_DATA” (
:header, :config, :var_desc, :var_role,’USER_APL’,’DATASET_1′, ‘USER_APL’,’DATASET_2′,
out_log, out_summary, out_metric, out_property );

select value as “Status”, count(*) as “Nb of Segments” from :out_summary where key = ‘AplTaskStatus’ group by value;
select OID as “Sport”, message as “Error” from :out_log where level=0;

select * from SAP_PA_APL.”sap.pa.apl.debrief.report::Deviation_ByVariable”(:out_property,:out_metric, Deviation_Threshold => 0.9);
select * from SAP_PA_APL.”sap.pa.apl.debrief.report::Deviation_ByCategory”(:out_property,:out_metric, Deviation_Threshold => 0.9);
select * from SAP_PA_APL.”sap.pa.apl.debrief.report::Deviation_CategoryFrequencies”(:out_property,:out_metric) order by “Abs % Change” desc;
END;

— Cases where a comparison is not possible
select “Sport” from “DATASET_1” minus select “Sport” from “DATASET_2”; — Sports that have been removed
select “Sport” from “DATASET_2” minus select “Sport” from “DATASET_1”; — New Sports

 

The APL drift detector supports also a 2-step approach to address the cases where you have a changing dataset that needs to be compared on a regular basis (e.g. every week or every month) to a fixed reference. The samples below illustrate how it works with Python:

Data Drift Step 1 notebook   Data Drift Step 2 notebook 

and with SQL:

Data Drift Step 1 script   Data Drift Step 2 script 

 

Happy data drift detection with APL.

To know more about APL 

 

 

​ The release of HANA ML 2.23 includes a new function called: APL data drift detector. This drift detector helps you spot changes or deviations between a given dataset and a reference. Reference data could be a version in the past, or a particular segment of customers or employees, or an expected distribution (e.g. Benford). Use cases of data comparison are:This year employee survey results versus last year results by countryMachine learning inference dataset versus training datasetMale staff versus female staffPayment amounts by legal entity versus the Benford’s law to find potential fraudThis feature from APL (Automated Predictive Library) is available for both Python and SQL.The rest of this blog will walk you through a scenario with code snippets in Python as well as SQL that you can reuse and adapt to your own case.We chose to work with Olympics athletes and results data. The dataset contains summer and winter games since 1896. We want to make a comparison over time.First, we connect to the SAP HANA database using a Python notebook.DB_Host=’host’
DB_Port=’port_number’
DB_User=’user_name’
DB_Password=’user_password’

from hana_ml import dataframe as hd
conn = hd.ConnectionContext(address = DB_Host, port = DB_Port,
user = DB_User, password = DB_Password,
encrypt = ‘true’, sslValidateCertificate = ‘false’ )
conn.connection.isconnected()The reference dataframe comprises the Olympic games before 1970 …sql_cmd = “””
select “Height”, “Weight”, “Age”, “Medal”, “Gender”, “Season”, “Country”, “Sport”
from “APL_SAMPLES”.”OLYMPICS”
where “Year” <= 1970
order by “Id”
“””
hdf_ref= hd.DataFrame(conn, sql_cmd)
hdf_ref.head(5).collect()    … and the comparison dataframe, after 1970:sql_cmd = “””
select “Height”, “Weight”, “Age”, “Medal”, “Gender”, “Season”, “Country”, “Sport”
from “APL_SAMPLES”.”OLYMPICS”
where “Year” > 1970
order by “Id”
“””
hdf_new= hd.DataFrame(conn, sql_cmd)
hdf_new.head(5).collect()    The code to run the comparison is the following:from hana_ml.algorithms.apl.drift_detector import DriftDetector
apl_model = DriftDetector()
results = apl_model.fit_detect(hdf_ref, hdf_new, build_report=True)
print(results.collect())   All the variables appearing here have a deviation indicator over 0.95. This threshold can be changed with the syntax below:results = apl_model.fit_detect(hdf_ref, hdf_new, threshold=0.80)Here is a report putting the two datasets side by side with their counts (aka weight) by category; the most changing categories (top 10 here) appear first:df = apl_model.get_debrief_report(‘Deviation_CategoryFrequencies’).head(10).deselect([‘Oid’,’Category Order’]).collect()
format_dict = {
‘Ref Weight’: ‘{:,.0f}’, ‘New Weight’: ‘{:,.0f}’, ‘Change’: ‘{:,.0f}’,
‘Ref % Weight’: ‘{:.1f}’, ‘New % Weight’: ‘{:.1f}’, ‘% Change’: ‘{:.1f}’, ‘Abs % Change’: ‘{:.1f}’
}
df.style.format(format_dict).hide(axis=’index’)KxMissing is a special category created by APL when finding empty or null values. With far less missing values on athletes’ height and weight, we see that data quality is getting better overtime.More women are participating to the Olympic games after 1970. The male % change value mirrors the female % change value. The same remark applies to summer and winter seasons.The following code brings the data drift results as charts:apl_model.generate_notebook_iframe_report()
# apl_model.generate_html_report(‘drift_olympics’)After comparing athletes across all types of sports, let’s compare them by sport. For that we must provide the drift detector with two dataframes sorted by Sport:## Reference Dataset
sql_cmd = “””
select “Height”, “Weight”, “Age”, “Medal”, “Gender”, “Season”, “Country”, “Sport”
from “APL_SAMPLES”.”OLYMPICS”
where “Year” <= 1970
order by “Sport”, “Id”
“””
hdf_ref= hd.DataFrame(conn, sql_cmd)
hdf_ref.head(5).collect()    ## Dataset for comparison
sql_cmd = “””
select “Height”, “Weight”, “Age”, “Medal”, “Gender”, “Season”, “Country”, “Sport”
from “APL_SAMPLES”.”OLYMPICS”
where “Year” > 1970
order by “Sport”, “Id”
“””
hdf_new= hd.DataFrame(conn, sql_cmd)
hdf_new.head(5).collect()    You specify the segment for drift detection the same way as for segmented forecasting, regression or classification scenarios:col_segment= ‘Sport’
from hana_ml.algorithms.apl.drift_detector import DriftDetector
apl_model = DriftDetector(segment_column_name= col_segment, max_tasks= 4)
results = apl_model.fit_detect(hdf_ref, hdf_new, threshold=0.95)
print(results.sort([“Segment”,”Variable”]).collect())The deviating variables appear for each sport:     We check if there are segments with status ‘Failed’:my_filter = “KEY in (‘AplTaskStatus’) and VALUE =’Failed'”
df = apl_model.get_summary().filter(my_filter).select(‘OID’).collect()
failed_segments = df[‘OID’].tolist()
print(‘Preview of failed segments’)
print(failed_segments[:5]) The log contains the cause of failure:df = apl_model.get_fit_operation_log().filter(“LEVEL = 0”).select(‘OID’,’ORIGIN’,’MESSAGE’).head(5).collect()
df.columns = [col_segment, ‘Cause’, ‘Error’]
df.style.hide(axis=’index’)   These sports were introduced after 1970. Thus, there is no reference data to compare against.To obtain the Data Drift report, you must provide a segment value as in the example below:seg_value = “Swimming”
apl_model.build_report(segment_name=seg_value)
apl_model.generate_notebook_iframe_report()
#apl_model.generate_html_report(‘Sport_Report’) We are nearing the end of this post, and you have not yet seen how to run a comparison with the SQL interface. Here is the syntax for calling the new APL function COMPARE_DATA from a SQL script:drop view DATASET_1;
create view DATASET_1 as (
select “Height”, “Weight”, “Age”, “Medal”, “Gender”, “Season”, “Country”, “Sport”
from “APL_SAMPLES”.”OLYMPICS”
where “Year” <= 1970
order by “Sport”, “Id”
);

drop view DATASET_2;
create view DATASET_2 as (
select “Height”, “Weight”, “Age”, “Medal”, “Gender”, “Season”, “Country”, “Sport”
from “APL_SAMPLES”.”OLYMPICS”
where “Year” > 1970
order by “Sport”, “Id”
);

DO BEGIN
declare header “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.FUNCTION_HEADER”;
declare config “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.OPERATION_CONFIG_EXTENDED”;
declare var_desc “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.VARIABLE_DESC_OID”;
declare var_role “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.VARIABLE_ROLES_WITH_COMPOSITES_OID”;
declare out_log “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.OPERATION_LOG”;
declare out_summary “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.SUMMARY”;
declare out_metric “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.DEBRIEF_METRIC_OID”;
declare out_property “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.DEBRIEF_PROPERTY_OID”;

:header.insert((‘Oid’, ‘Before-After 1970’));

:config.insert((‘APL/SegmentColumnName’, ‘Sport’,null));

:var_desc.insert((0,’Height’,’number’,’continuous’,0,0,null,null,null,null));
:var_desc.insert((1,’Weight’,’number’,’continuous’,0,0,null,null,null,null));
:var_desc.insert((2,’Age’,’integer’,’continuous’,0,0,null,null,null,null));
:var_desc.insert((3,’Medal’,’string’,’nominal’,0,0,null,null,null,null));
:var_desc.insert((4,’Gender’,’string’,’nominal’,0,0,null,null,null,null));
:var_desc.insert((5,’Season’,’string’,’nominal’,0,0,null,null,null,null));
:var_desc.insert((6,’Country’,’string’,’nominal’,0,0,null,null,null,null));

“SAP_PA_APL”.”sap.pa.apl.base::COMPARE_DATA” (
:header, :config, :var_desc, :var_role,’USER_APL’,’DATASET_1′, ‘USER_APL’,’DATASET_2′,
out_log, out_summary, out_metric, out_property );

select value as “Status”, count(*) as “Nb of Segments” from :out_summary where key = ‘AplTaskStatus’ group by value;
select OID as “Sport”, message as “Error” from :out_log where level=0;

select * from SAP_PA_APL.”sap.pa.apl.debrief.report::Deviation_ByVariable”(:out_property,:out_metric, Deviation_Threshold => 0.9);
select * from SAP_PA_APL.”sap.pa.apl.debrief.report::Deviation_ByCategory”(:out_property,:out_metric, Deviation_Threshold => 0.9);
select * from SAP_PA_APL.”sap.pa.apl.debrief.report::Deviation_CategoryFrequencies”(:out_property,:out_metric) order by “Abs % Change” desc;
END;

— Cases where a comparison is not possible
select “Sport” from “DATASET_1” minus select “Sport” from “DATASET_2”; — Sports that have been removed
select “Sport” from “DATASET_2” minus select “Sport” from “DATASET_1”; — New Sports The APL drift detector supports also a 2-step approach to address the cases where you have a changing dataset that needs to be compared on a regular basis (e.g. every week or every month) to a fixed reference. The samples below illustrate how it works with Python:Data Drift Step 1 notebook   Data Drift Step 2 notebook and with SQL:Data Drift Step 1 script   Data Drift Step 2 script  Happy data drift detection with APL.To know more about APL     Read More Technology Blog Posts by SAP articles 

#SAP

#SAPTechnologyblog

You May Also Like

More From Author