SAP Databricks – Best Practices SQL Query Performance Tuning Tips

Estimated read time: 6 min

In this blog post, I will share a few best-practice query performance tuning tips that Data Analysts and other SQL users can apply to improve DBSQL Serverless performance.

The tips below use an integrated landscape: SAP Datasphere integrated with SAP Business Data Cloud and SAP Databricks.

Choose the Right Warehouse Size & Enable Autoscaling

The Databricks SQL Serverless engine powers SAP Databricks compute. Selecting the right warehouse size (S, M, L, etc.) and autoscaling policies ensures your queries have adequate compute without over‑provisioning. Proper resource optimization is foundational to performance.
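As a sketch, a warehouse with autoscaling can be defined through the Databricks SQL Warehouses REST API (POST /api/2.0/sql/warehouses). The field names below follow that API, but the warehouse name and sizing values are illustrative assumptions; verify them against your workspace's documentation:

```json
{
  "name": "sap-analytics-wh",
  "cluster_size": "Medium",
  "min_num_clusters": 1,
  "max_num_clusters": 4,
  "auto_stop_mins": 10,
  "enable_serverless_compute": true
}
```

Autoscaling between min_num_clusters and max_num_clusters lets the warehouse absorb concurrency spikes without permanently paying for peak capacity, and auto-stop releases compute when the warehouse is idle.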

Optimize Storage Using Delta Lake

Under the hood, SAP Databricks uses Databricks’ optimized open storage formats. By using Delta Lake, your queries benefit from columnar storage, data skipping, ACID transactions, and optimized IO pathways—dramatically boosting speed.
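Tables created in Databricks SQL use Delta Lake by default. As a minimal sketch (the table and column names below are hypothetical, modeled on the sample table used later in this post), per-file min/max statistics then let the engine skip files that cannot match a filter:

```sql
-- Delta is the default format in Databricks SQL;
-- USING DELTA is spelled out here only for clarity.
CREATE TABLE IF NOT EXISTS demo_19538.cashflow_dp.zcashflow_demo (
  COMPANYCODE STRING,
  CASHFLOW    DECIMAL(18,2),
  POSTINGDATE DATE
) USING DELTA;

-- Data skipping: files whose POSTINGDATE min/max range cannot
-- contain matching rows are never read.
SELECT COMPANYCODE, CASHFLOW
FROM demo_19538.cashflow_dp.zcashflow_demo
WHERE POSTINGDATE >= DATE'2024-01-01';
```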

Avoid SELECT *

SELECT * FROM my_table retrieves every column and every row of the table, which is an I/O-intensive operation on a large table.

Scanning unnecessary columns slows queries and inflates compute costs. Always specify the exact columns you need—especially when querying wide SAP analytical datasets.
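For example, against the sample table used later in this post (the filter value is illustrative):

```sql
-- Avoid: reads every column of the table.
SELECT * FROM demo_19538.cashflow_dp.zcashflow;

-- Prefer: the columnar reader fetches only the referenced columns.
SELECT COMPANYCODE, CASHFLOW
FROM demo_19538.cashflow_dp.zcashflow
WHERE COMPANYCODE = '1000';
```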

Integer vs String Data Types

Prefer integer data types over strings for primary keys and join keys. Fixed-width integer comparisons are cheaper to evaluate than variable-length string comparisons, so joins and filters on integer keys run faster.
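A sketch of this pattern, using a hypothetical dimension table: introduce a surrogate integer key for joins and keep the SAP company code as a descriptive attribute.

```sql
-- Hypothetical dimension: BIGINT surrogate key for joins,
-- SAP company code retained as an attribute column.
CREATE TABLE IF NOT EXISTS dim_company (
  company_id  BIGINT,
  companycode STRING,
  companyname STRING
) USING DELTA;

-- Joins on company_id compare 8-byte integers
-- instead of variable-length strings.
```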

Optimize Data Layout (Statistics, OPTIMIZE, VACUUM)

Three essential housekeeping operations:

- ANALYZE TABLE collects table statistics for better planning.
- OPTIMIZE compacts small files and organizes data.
- VACUUM cleans unused files to improve scan speed.

These drastically improve performance, especially with frequently refreshed SAP operational data.

ANALYZE collects statistics for a specified table. These statistics are used by the DBSQL query optimizer to generate a better execution plan.

ANALYZE TABLE demo_19538.cashflow_dp.zcashflow COMPUTE STATISTICS;
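Column-level statistics (min, max, null counts, distinct counts) give the optimizer more to work with, for example when ordering joins. A sketch on the same table:

```sql
ANALYZE TABLE demo_19538.cashflow_dp.zcashflow COMPUTE STATISTICS FOR ALL COLUMNS;
```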

The OPTIMIZE statement improves the layout of the Delta Lake data files. Optionally, you can Z-Order the data to colocate related data in the same set of files.

OPTIMIZE demo_19538.cashflow_dp.zcashflow;
OPTIMIZE demo_19538.cashflow_dp.zcashflow ZORDER BY (COMPANYCODE);

VACUUM removes unused files from the table directory. It deletes data files that are no longer referenced in the latest state of the table's transaction log and are older than a retention threshold; the default threshold is 7 days. Running VACUUM on a Delta table sacrifices the ability to time travel back to versions older than the retention threshold, but it reduces storage cost.

VACUUM demo_19538.cashflow_dp.zcashflow DRY RUN;
VACUUM demo_19538.cashflow_dp.zcashflow;
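If you need a longer time-travel window, the retention threshold can be raised per statement; a sketch retaining 30 days of history:

```sql
-- 720 hours = 30 days; longer retention preserves more
-- time-travel history at the cost of extra storage.
VACUUM demo_19538.cashflow_dp.zcashflow RETAIN 720 HOURS;
```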

Use Query Profiles & EXPLAIN Plans

Databricks provides rich diagnostics:

- Query Profile highlights slow operators, joins, and shuffles.
- EXPLAIN FORMATTED shows the logical and physical execution plan.

Use these tools regularly to spot bottlenecks in SAP data models and adjust your SQL accordingly.

It is important to explore and analyze your query plan before executing a query: it shows how the code will actually run, which is useful for optimizing your SQL. The DBSQL optimizer automatically generates the most optimized physical plan, which is then executed.

EXPLAIN
SELECT
COMPANYCODE, count(CASHFLOW) AS TotalTrip
FROM
demo_19538.cashflow_dp.zcashflow
GROUP BY COMPANYCODE;
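The FORMATTED variant mentioned earlier prints the plan in a more readable layout, with per-operator details:

```sql
EXPLAIN FORMATTED
SELECT
COMPANYCODE, count(CASHFLOW) AS TotalTrip
FROM
demo_19538.cashflow_dp.zcashflow
GROUP BY COMPANYCODE;
```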

Reduce Shuffle Volume with Better Partitioning

Partition SAP datasets carefully:

- Avoid too many small partitions.
- Avoid over-partitioning tables under 1 TB.
- Use partitioning only when your access patterns require it.

This reduces shuffle operations and speeds aggregation-heavy SAP reporting queries.
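A sketch of the pattern (table and column names are hypothetical): partition on a low-cardinality column that reports actually filter on, such as a fiscal period, rather than on a high-cardinality key.

```sql
-- Hypothetical table: FISCALPERIOD yields a manageable number of
-- partitions, so period-scoped queries prune most of the table.
CREATE TABLE IF NOT EXISTS demo_19538.cashflow_dp.zcashflow_by_period (
  COMPANYCODE  STRING,
  CASHFLOW     DECIMAL(18,2),
  FISCALPERIOD STRING
) USING DELTA
PARTITIONED BY (FISCALPERIOD);
```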

Use Caching Strategically

Caching can cut repeated query runtimes by 50–70%, especially in BI scenarios. Databricks caching options include:

- Delta cache on SSD-backed nodes
- DataFrame caching (cache() / persist())

Monitor cache hit rates and pre‑warm caches for SAP dashboards.
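One way to pre-warm is the CACHE SELECT statement, which loads the selected data into the disk cache ahead of dashboard refreshes; a sketch using the sample table (the column choice is illustrative):

```sql
-- Pre-warm the disk cache with exactly the columns a dashboard reads.
CACHE SELECT COMPANYCODE, CASHFLOW
FROM demo_19538.cashflow_dp.zcashflow;
```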
