SAP Business Data Cloud (BDC) comes with predefined content as you may know. SAP Data Products constitute a large part of that content. This technical article describes how to make an SAP managed Data Product the input dataset of an APL function. The Automated Predictive Library (APL) is a component inside HANA Cloud; it does in-database machine learning for common scenarios: classification, regression, clustering, forecasting, or data drift.
The easiest way to explain our topic is to draw a diagram:
We are in SAP Datasphere. On the left you have the SAP Data Products Ingestion space that is created automatically when installing the BDC Insight Application. The box on the right side of the diagram is what we will focus on in that blog. It represents a Datasphere space manually created. It must be given a read access to the SAP Data Products required for the ML scenario.
To call APL you have the choice between different tools:
SAP HANA Database ExplorerDbeaver, an external tool for SQL codeJupyter notebook for python code.
Our ML/AI dedicated space
See below the General Ledger raw data that we want as an input of APL.
It is a local table shared by the space BDF700 (SAP-Managed) owning the S/4 Data Products. In your Datasphere tenant this space is more likely to be called: SAP S/4HANA Ingestion (BDC).
Now, see below the Data Builder view we created on top of this General Ledger raw data:
Its SQL definition is as follows:
select
“CompanyCode”,
left(to_char(“AmountInTransactionCurrency”),1) as “FirstDigit”
from
“BDF700.GeneralLedgerAccount:v1.GeneralLedgerAccountLineItem”
where
“AmountInTransactionCurrency” > 0
Our simple goal here is to check whether the GL transaction amount respect the Benford distribution. For that purpose, we imported the Benford data from a csv file and created a view that will serve as a reference dataset. APL has a drift detector to help spot deviations between a given dataset and a reference. It can be applied to various types of comparison:
Over time: e.g. this year employee survey results versus last year resultsBetween two populations: e.g. male students versus female studentsAgainst an expected distribution: e.g. payment amounts versus the Benford’s law
But APL data drift deserves a blog of its own. Let’s go back to BDC.
Our APL ready database user
When you are done with the Datasphere views you must (we assume you are in the Space Management menu) you must edit the ML/AI space to create an APL enabled database user. Here is ours:
In the “Create Database User” dialog box, see the checkbox to enable APL:
If you do NOT see the APL check box, ask the system owner of the Datasphere tenant to activate the HANA Cloud script server; it is a prerequisite. This one-time setup can be done from the Datasphere menu: System / Configuration. From there, go to the tab: Tenant configuration. And then, select the check box: Enable SAP HANA Cloud Script Server.
In the “Create Database User” dialog box, we selected in total 3 check boxes:
Enable Automated Predictive Library (APL)Enable Read AccessEnable Write Access.
Once the database user is created, do not forget to deploy the ML/AI space.
At the end of the row showing your database user, there is a tiny info button. Click on it to obtain the connection parameters: host, port, and password. For the password we had to click on Request New Password, and then on the eye button to see the password string.
One can call APL directly from the Database Explorer:
A SQL console is what we need:
In the SQL script below APL evaluates the deviation of the GL amounts distribution from the Benford’s values:
create table APL_DEBRIEF_METRIC like “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.DEBRIEF_METRIC_OID”;
create table APL_DEBRIEF_PROPERTY like “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.DEBRIEF_PROPERTY_OID”;
create view DATASET_1 as select * from MY_SPACE.”Benford_By_Company” order by “CompanyCode”;
create view DATASET_2 as select * from MY_SPACE.”GL_Transaction_1st_Digit” order by “CompanyCode”;
DO BEGIN
declare header “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.FUNCTION_HEADER”;
declare config “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.OPERATION_CONFIG_EXTENDED”;
declare var_desc “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.VARIABLE_DESC_OID”;
declare var_role “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.VARIABLE_ROLES_WITH_COMPOSITES_OID”;
declare out_log “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.OPERATION_LOG”;
declare out_summary “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.SUMMARY”;
declare out_metric “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.DEBRIEF_METRIC_OID”;
declare out_property “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.DEBRIEF_PROPERTY_OID”;
:header.insert((‘Oid’, ‘BDC’));
:config.insert((‘APL/SegmentColumnName’, ‘CompanyCode’,null));
“SAP_PA_APL”.”sap.pa.apl.base::COMPARE_DATA” (
:header, :config, :var_desc, :var_role,’MY_SPACE#USER_4AI’,’DATASET_1′, ‘MY_SPACE#USER_4AI’,’DATASET_2′,
out_log, out_summary, out_metric, out_property );
insert into APL_DEBRIEF_METRIC select * from :out_metric;
insert into APL_DEBRIEF_PROPERTY select * from :out_property;
END;
select
“Oid” as “Company”, “Deviation Indicator” as “1st Digit Deviation”
from
SAP_PA_APL.”sap.pa.apl.debrief.report::Deviation_ByVariable”
(APL_DEBRIEF_PROPERTY,APL_DEBRIEF_METRIC, Deviation_Threshold =>0.95);
Each company deviates, this is not surprising since we are on a development tenant.
We run a last report that shows the compared datasets side-by-side:
select “Oid” as “Company”, “Category” as “1st Digit”,
“Ref % Weight” as “Benford”, “New % Weight” as “BDC GL”,
“% Change” as “Difference”
from SAP_PA_APL.”sap.pa.apl.debrief.report::Deviation_CategoryFrequencies”
(APL_DEBRIEF_PROPERTY,APL_DEBRIEF_METRIC)
order by 1,2;
If you prefer using a Jupyter notebook, the following last section is for you.
How to call APL from a notebook
Connect to HANA Cloud with the parameters of the database user created earlier.
from hana_ml import dataframe as hd
conn = hd.ConnectionContext(
address = ‘Host_String’, port = 443,
user = ‘MY_SPACE#USER_4AI’, password = ‘Password_String’,
encrypt = ‘true’, sslValidateCertificate = ‘false’ )
conn.connection.isconnected()
Define the HANA dataframe for the Benford’s data:
sql_cmd = “””
select * from MY_SPACE.”Benford_By_Company” order by “CompanyCode”
“””
hdf_ref= hd.DataFrame(conn, sql_cmd)
hdf_ref.head(7).collect()
Same thing for the General Ledger data:
sql_cmd = “””
select * from MY_SPACE.”GL_Transaction_1st_Digit” order by “CompanyCode”
“””
hdf_new= hd.DataFrame(conn, sql_cmd)
hdf_new.head(7).collect()
Detect the potential data drift by company with APL:
col_segment= ‘CompanyCode’
from hana_ml.algorithms.apl.drift_detector import DriftDetector
apl_model = DriftDetector(segment_column_name= col_segment, max_tasks= 4)
results = apl_model.fit_detect(hdf_ref, hdf_new, threshold=0.95)
print(results.collect())
The calculations for the APL data drift are done by the HANA database, not by the client.
Finally, view the 2 distributions side-by-side:
df = apl_model.get_debrief_report(‘Deviation_CategoryFrequencies’).sort([‘Oid’,’Variable’,’Category Order’]).select([‘Oid’,’Category’,’Ref % Weight’,’New % Weight’,’% Change’]).collect()
col_dict = {‘Oid’: col_segment, ‘Category’: ‘FirstDigit’, ‘Ref % Weight’: ‘Benford’, ‘New % Weight’: ‘BDC GL’, ‘% Change’: ‘Difference’}
df.rename(columns=col_dict, inplace=True)
format_dict = { ‘Benford’: ‘{:.3f}’, ‘BDC GL’: ‘{:.3f}’, ‘Difference’: ‘{:.3f}’ }
df.style.format(format_dict).hide(axis=’index’)
Happy machine learning on BDC with APL.
SAP Business Data Cloud (BDC) comes with predefined content as you may know. SAP Data Products constitute a large part of that content. This technical article describes how to make an SAP managed Data Product the input dataset of an APL function. The Automated Predictive Library (APL) is a component inside HANA Cloud; it does in-database machine learning for common scenarios: classification, regression, clustering, forecasting, or data drift.The easiest way to explain our topic is to draw a diagram: We are in SAP Datasphere. On the left you have the SAP Data Products Ingestion space that is created automatically when installing the BDC Insight Application. The box on the right side of the diagram is what we will focus on in that blog. It represents a Datasphere space manually created. It must be given a read access to the SAP Data Products required for the ML scenario.To call APL you have the choice between different tools:SAP HANA Database ExplorerDbeaver, an external tool for SQL codeJupyter notebook for python code. Our ML/AI dedicated spaceSee below the General Ledger raw data that we want as an input of APL. It is a local table shared by the space BDF700 (SAP-Managed) owning the S/4 Data Products. In your Datasphere tenant this space is more likely to be called: SAP S/4HANA Ingestion (BDC).Now, see below the Data Builder view we created on top of this General Ledger raw data: Its SQL definition is as follows:select
“CompanyCode”,
left(to_char(“AmountInTransactionCurrency”),1) as “FirstDigit”
from
“BDF700.GeneralLedgerAccount:v1.GeneralLedgerAccountLineItem”
where
“AmountInTransactionCurrency” > 0Our simple goal here is to check whether the GL transaction amount respect the Benford distribution. For that purpose, we imported the Benford data from a csv file and created a view that will serve as a reference dataset. APL has a drift detector to help spot deviations between a given dataset and a reference. It can be applied to various types of comparison:Over time: e.g. this year employee survey results versus last year resultsBetween two populations: e.g. male students versus female studentsAgainst an expected distribution: e.g. payment amounts versus the Benford’s lawBut APL data drift deserves a blog of its own. Let’s go back to BDC. Our APL ready database userWhen you are done with the Datasphere views you must (we assume you are in the Space Management menu) you must edit the ML/AI space to create an APL enabled database user. Here is ours: In the “Create Database User” dialog box, see the checkbox to enable APL: If you do NOT see the APL check box, ask the system owner of the Datasphere tenant to activate the HANA Cloud script server; it is a prerequisite. This one-time setup can be done from the Datasphere menu: System / Configuration. From there, go to the tab: Tenant configuration. And then, select the check box: Enable SAP HANA Cloud Script Server.In the “Create Database User” dialog box, we selected in total 3 check boxes:Enable Automated Predictive Library (APL)Enable Read AccessEnable Write Access.Once the database user is created, do not forget to deploy the ML/AI space.At the end of the row showing your database user, there is a tiny info button. Click on it to obtain the connection parameters: host, port, and password. For the password we had to click on Request New Password, and then on the eye button to see the password string.One can call APL directly from the Database Explorer: A SQL console is what we need: In the SQL script below APL evaluates the deviation of the GL amounts distribution from the Benford’s values:create table APL_DEBRIEF_METRIC like “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.DEBRIEF_METRIC_OID”;
create table APL_DEBRIEF_PROPERTY like “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.DEBRIEF_PROPERTY_OID”;
create view DATASET_1 as select * from MY_SPACE.”Benford_By_Company” order by “CompanyCode”;
create view DATASET_2 as select * from MY_SPACE.”GL_Transaction_1st_Digit” order by “CompanyCode”;
DO BEGIN
declare header “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.FUNCTION_HEADER”;
declare config “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.OPERATION_CONFIG_EXTENDED”;
declare var_desc “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.VARIABLE_DESC_OID”;
declare var_role “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.VARIABLE_ROLES_WITH_COMPOSITES_OID”;
declare out_log “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.OPERATION_LOG”;
declare out_summary “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.SUMMARY”;
declare out_metric “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.DEBRIEF_METRIC_OID”;
declare out_property “SAP_PA_APL”.”sap.pa.apl.base::BASE.T.DEBRIEF_PROPERTY_OID”;
:header.insert((‘Oid’, ‘BDC’));
:config.insert((‘APL/SegmentColumnName’, ‘CompanyCode’,null));
“SAP_PA_APL”.”sap.pa.apl.base::COMPARE_DATA” (
:header, :config, :var_desc, :var_role,’MY_SPACE#USER_4AI’,’DATASET_1′, ‘MY_SPACE#USER_4AI’,’DATASET_2′,
out_log, out_summary, out_metric, out_property );
insert into APL_DEBRIEF_METRIC select * from :out_metric;
insert into APL_DEBRIEF_PROPERTY select * from :out_property;
END;
select
“Oid” as “Company”, “Deviation Indicator” as “1st Digit Deviation”
from
SAP_PA_APL.”sap.pa.apl.debrief.report::Deviation_ByVariable”
(APL_DEBRIEF_PROPERTY,APL_DEBRIEF_METRIC, Deviation_Threshold =>0.95); Each company deviates, this is not surprising since we are on a development tenant.We run a last report that shows the compared datasets side-by-side:select “Oid” as “Company”, “Category” as “1st Digit”,
“Ref % Weight” as “Benford”, “New % Weight” as “BDC GL”,
“% Change” as “Difference”
from SAP_PA_APL.”sap.pa.apl.debrief.report::Deviation_CategoryFrequencies”
(APL_DEBRIEF_PROPERTY,APL_DEBRIEF_METRIC)
order by 1,2; If you prefer using a Jupyter notebook, the following last section is for you. How to call APL from a notebookConnect to HANA Cloud with the parameters of the database user created earlier.from hana_ml import dataframe as hd
conn = hd.ConnectionContext(
address = ‘Host_String’, port = 443,
user = ‘MY_SPACE#USER_4AI’, password = ‘Password_String’,
encrypt = ‘true’, sslValidateCertificate = ‘false’ )
conn.connection.isconnected() Define the HANA dataframe for the Benford’s data:sql_cmd = “””
select * from MY_SPACE.”Benford_By_Company” order by “CompanyCode”
“””
hdf_ref= hd.DataFrame(conn, sql_cmd)
hdf_ref.head(7).collect() Same thing for the General Ledger data:sql_cmd = “””
select * from MY_SPACE.”GL_Transaction_1st_Digit” order by “CompanyCode”
“””
hdf_new= hd.DataFrame(conn, sql_cmd)
hdf_new.head(7).collect()Detect the potential data drift by company with APL:col_segment= ‘CompanyCode’
from hana_ml.algorithms.apl.drift_detector import DriftDetector
apl_model = DriftDetector(segment_column_name= col_segment, max_tasks= 4)
results = apl_model.fit_detect(hdf_ref, hdf_new, threshold=0.95)
print(results.collect())The calculations for the APL data drift are done by the HANA database, not by the client.Finally, view the 2 distributions side-by-side:df = apl_model.get_debrief_report(‘Deviation_CategoryFrequencies’).sort([‘Oid’,’Variable’,’Category Order’]).select([‘Oid’,’Category’,’Ref % Weight’,’New % Weight’,’% Change’]).collect()
col_dict = {‘Oid’: col_segment, ‘Category’: ‘FirstDigit’, ‘Ref % Weight’: ‘Benford’, ‘New % Weight’: ‘BDC GL’, ‘% Change’: ‘Difference’}
df.rename(columns=col_dict, inplace=True)
format_dict = { ‘Benford’: ‘{:.3f}’, ‘BDC GL’: ‘{:.3f}’, ‘Difference’: ‘{:.3f}’ }
df.style.format(format_dict).hide(axis=’index’)Happy machine learning on BDC with APL.To know more about APL Read More Technology Blogs by SAP articles
#SAP
#SAPTechnologyblog