We have been developing a Parquet library in SAP ABAP for a while, and it is not an easy thing to do. The problem we are trying to solve, integrating SAP with multiple Data Lakes, felt like reinventing the wheel over and over again: once we finish with one customer, we move on to another with different table requirements, data and metadata requirements, frequencies, formats, and schemas. We need an output format with structure and metadata, and Parquet, ORC and Avro are our way out of the trap, until we reach higher open-standard goals such as Iceberg.
Currently, we use the CSV and JSON file formats, both offered as standard ABAP libraries provided by SAP, but customers ask for more. CSV and JSON are sometimes too basic for a large enterprise, and even Avro falls short: it is great for streaming, but when you have billions of files you have to scan them entirely in CSV, JSON or Avro, and that is a lot of compute. That is exactly the problem Parquet and ORC were built to solve. In our case, Parquet was the only viable option because, at the time, Amazon Athena supported Parquet but not ORC or Avro; this changed recently, but we have gone all in on Parquet.
Parquet is a columnar file format focused on efficient data storage and compression. So it seems like a great fit!
All rainbows and unicorns… Parquet covers almost all the features we need for our SAP data integration, but we miss one, and it is quite an important one: writing Parquet or ORC into well-managed tables is really hard. Only a lucky few tools, such as Databricks, Snowflake, Spark, Hive and Flink, have good support for writing Iceberg tables.
Writing an Iceberg table in any of these file formats requires much more care. Each write to an Iceberg table creates new data files in S3. Over time, the table's data gets split across many small files, and reading from the table becomes prohibitively slow. The solution is to periodically compact the table, which combines the data files into fewer, larger files and drops any data that has been overwritten or deleted. See this article from Dremio for a better explanation.
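When compaction is not managed for us, it is usually triggered through Iceberg's Spark maintenance procedures. Here is a minimal sketch in PySpark; the catalog name my_catalog, the table identifier and the target file size are placeholders for illustration, not our actual setup:

from pyspark.sql import SparkSession

# Assumes a Spark session already configured with an Iceberg catalog
# registered under the placeholder name "my_catalog".
spark = SparkSession.builder.appName("iceberg-compaction").getOrCreate()

# Iceberg's rewrite_data_files maintenance procedure merges small data
# files into larger ones (here targeting roughly 512 MB per file).
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'sap.acdoca_table',
        options => map('target-file-size-bytes', '536870912')
    )
""")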
Not to mention, neither the Python Iceberg library nor the Rust Iceberg library supports compaction today, so that also had to be built in ABAP. Only Java offers ready-built libraries for this; otherwise we build it ourselves. Mountain high, valley low.
So our Parquet-based table structure is like a ledger of changes: every change is added as a snapshot on top of the original state.
In the SAP world this was a problem for us: given the nature of changes in SAP data and the rate of updates, we quickly ended up with hundreds of thousands of objects under every table.
Each daily update creates new Parquet files, partitioned by GJAHR and BUKRS, as in this example:
s3://bucket/warehouse/acdoca_table/
├── GJAHR=2024/
│   ├── BUKRS=1000/
│   │   ├── part-00001.parquet
│   │   └── part-00002.parquet
│   ├── BUKRS=2000/
│   │   └── part-00003.parquet
│   …
…
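For illustration only, this is roughly what such a partitioned write looks like outside ABAP, for example in a small Python job using pyarrow; the bucket path and columns follow the example above, and the in-memory table stands in for one day's extracted delta:

import pyarrow as pa
import pyarrow.parquet as pq

# Placeholder rows standing in for one day's extracted ACDOCA delta.
delta = pa.table({
    "GJAHR": ["2024", "2024", "2024"],
    "BUKRS": ["1000", "1000", "2000"],
    "BELNR": ["0100000001", "0100000002", "0100000003"],
    "HSL":   [100.50, 200.00, 300.25],
})

# Writes part-*.parquet files under GJAHR=<year>/BUKRS=<company code>/
# subdirectories, producing the layout shown above.
pq.write_to_dataset(
    delta,
    root_path="s3://bucket/warehouse/acdoca_table",
    partition_cols=["GJAHR", "BUKRS"],
)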
However, as data usage patterns evolved, we needed more dynamic capabilities. While Parquet can be used for mutable tables (for example, by appending new files), this often involves manual management and can lead to inefficiencies, especially in scenarios with high write volumes or complex data evolution.
An S3 table bucket is a bucket built around the Iceberg table format that exposes a REST endpoint on a per-table basis. Inside that bucket we get an Iceberg catalog; we can create namespaces and tables, each table is a first-class resource, and AWS runs all of its maintenance and optimization tasks.
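Provisioning a table bucket and a namespace looks roughly like this with the AWS SDK. A minimal sketch using boto3's s3tables client; the Region, bucket and namespace names are placeholders:

import boto3

s3tables = boto3.client("s3tables", region_name="eu-central-1")

# Create the table bucket itself (name is a placeholder).
bucket = s3tables.create_table_bucket(name="sap-datalake-tables")
bucket_arn = bucket["arn"]

# Create a namespace inside the table bucket to group our SAP tables.
s3tables.create_namespace(tableBucketARN=bucket_arn, namespace=["sap"])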
So, Mario, this means that we are bringing the Catalog to the Storage Layer?
Or, to ask it more scientifically: are we shifting towards bringing database-like functionality and efficiency to the storage layer?
In fact, we have been doing this for a while with Glue, though not from SAP directly, and we have been asked for it multiple times. Until now we had to rely on Glue.
The AWS Glue cost structure is based on a pay-as-you-go model which, while offering tremendous flexibility, also has some complexity and requires careful understanding. You have to understand DPUs, the Glue Data Catalog, and the crawlers that find and organize data for the Data Catalog, all charged by the hour.
Resource overprovisioning, over-allocation of memory and CPU, not using on-demand ETL jobs effectively, inefficient partitioning of data, overuse of AWS Glue crawlers, skipping the job bookmark feature and reprocessing data, or inefficient data formats such as CSV or JSON instead of Parquet can all increase the cost significantly and kill the project.
Will pricing kill the idea?
Before we can activate S3 table buckets, we receive a warning: we must enable the integration with AWS analytics services.
This integration connects our table buckets to AWS analytics services. It adds our tables to the AWS Glue Data Catalog so we can work with them using services such as Amazon Athena, Amazon Redshift or Amazon QuickSight.
As you can see, the integration is enabled per Region. Once activated, S3 initiates a set of actions for every new table bucket in that Region: Lake Formation registers the table bucket in the current Region and adds the s3tablescatalog to the AWS Glue Data Catalog in the current Region.
All table buckets, namespaces, and tables are then populated in the Glue Data Catalog.
This is much more efficient than managing Glue ourselves. It removes a significant data discovery and management burden, reduces the need for manual metadata handling, and minimizes the risk of errors that could lead to increased costs.
By natively integrating S3 Tables with the AWS Glue Data Catalog and AWS Lake Formation, we mitigate several of the cost traps associated with AWS Glue, such as overuse of crawlers, inefficient data formats, and neglected data partitioning.
How are we doing all this?
Table operations in S3 Tables are quite similar to our current CRUD operations; we now have to make sure we can perform them through a set of APIs for interacting with our tables:
Create Table: create new tables within our table bucket.
List Tables: discover all tables residing within a specific table bucket.
Read Table Metadata: retrieve the latest Iceberg metadata file, essential for accessing the most up-to-date view of our data.
Direct Commits: commit CRUD (Create, Read, Update and Delete) changes to our tables directly using the provided APIs.
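A minimal sketch of these operations with boto3's s3tables client; the ARN, namespace and table names are placeholders, and the request fields reflect the SDK documentation at the time of writing:

import boto3

s3tables = boto3.client("s3tables")
bucket_arn = "arn:aws:s3tables:eu-central-1:111122223333:bucket/sap-datalake-tables"  # placeholder

# Create Table: a new Iceberg table inside the "sap" namespace.
s3tables.create_table(
    tableBucketARN=bucket_arn,
    namespace="sap",
    name="acdoca_table",
    format="ICEBERG",
)

# List Tables: discover all tables residing in the namespace.
response = s3tables.list_tables(tableBucketARN=bucket_arn, namespace="sap")
for table in response["tables"]:
    print(table["name"])

# Read Table Metadata: the location of the latest Iceberg metadata file,
# the entry point for reading the current state of the table.
meta = s3tables.get_table_metadata_location(
    tableBucketARN=bucket_arn, namespace="sap", name="acdoca_table"
)
print(meta["metadataLocation"])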
Table Management
Bucket-Level and Table-Level Control: manage resources at both the table bucket and individual table levels.
Resource Policies: implement fine-grained access control through resource policies.
Table Maintenance Policies: fine-tune table maintenance behavior:
Compaction: define target file sizes for efficient data compaction.
Snapshots: control snapshot retention periods and automate their aging-out process.
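As an example of a maintenance policy, this sketch enables compaction with a target file size. It uses boto3's s3tables client; the 512 MB value and the exact field layout are our assumptions based on the SDK documentation:

import boto3

s3tables = boto3.client("s3tables")
bucket_arn = "arn:aws:s3tables:eu-central-1:111122223333:bucket/sap-datalake-tables"  # placeholder

# Compaction policy: let S3 Tables rewrite small files towards ~512 MB targets.
s3tables.put_table_maintenance_configuration(
    tableBucketARN=bucket_arn,
    namespace="sap",
    name="acdoca_table",
    type="icebergCompaction",
    value={
        "status": "enabled",
        "settings": {"icebergCompaction": {"targetFileSizeMB": 512}},
    },
)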
A quick reminder of the Iceberg building blocks behind each table:
Metadata File: a JSON file that contains the latest version of a table. Any change to the table creates a new metadata file. Its contents are essentially lists of manifest list files plus some high-level metadata.
Manifest File: the final step for a query, since these files determine which data files actually need to be read, saving valuable query time.
Manifest List: the list of manifest files that make up a snapshot. It also includes metadata such as partition bounds, used to skip files that do not need to be read for a query.
Table directory: the name of the table combined with a unique UUID, in order to support table renames.
Namespace: groups the tables, whose contents are stored as Parquet (data) and JSON (metadata) files.
The Iceberg catalog is a key element: it keeps track of which Iceberg tables live at which path in the S3 bucket, and which files within that path belong to the live version of the table. The catalog ensures that two systems attempting to write to the table at the same time do not corrupt each other's writes. Reading from an Iceberg table does not require interacting with the catalog, but writing to an Iceberg table always does.
Glue Data Catalog automatically populates table buckets, namespaces, and tables in the current Region as corresponding objects in the Data Catalog. Table buckets are populated as sub-catalogs. Namespaces within a table bucket are populated as databases within their respective sub-catalogs. Tables are populated as tables in their respective databases.
One of the biggest advantages of S3 Tables is its built-in catalog, aptly named S3TablesCatalog. The source code is available on GitHub for transparency, and it acts as a wrapper around the S3 Tables API. This means most Iceberg catalog operations directly translate to S3 Tables API calls.
For us as an Iceberg writer, this has been the game changer: no more wrestling with choosing the right catalog implementation or running a separate service.
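For reference, this is roughly how a Spark job is pointed at that catalog. A PySpark configuration sketch, assuming the S3 Tables catalog client from the GitHub repository is on the classpath; the catalog name, table bucket ARN and table identifier are placeholders:

from pyspark.sql import SparkSession

TABLE_BUCKET_ARN = "arn:aws:s3tables:eu-central-1:111122223333:bucket/sap-datalake-tables"  # placeholder

spark = (
    SparkSession.builder.appName("s3-tables-demo")
    # Register an Iceberg catalog named "s3tablesbucket" backed by S3 Tables.
    .config("spark.sql.catalog.s3tablesbucket", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.s3tablesbucket.catalog-impl",
            "software.amazon.s3tables.iceberg.S3TablesCatalog")
    .config("spark.sql.catalog.s3tablesbucket.warehouse", TABLE_BUCKET_ARN)
    .getOrCreate()
)

# Namespaces and tables then appear as s3tablesbucket.<namespace>.<table>.
spark.sql("SELECT COUNT(*) FROM s3tablesbucket.sap.acdoca_table").show()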
Important Performance Indications
So, if we are going to use S3 for CRUD operations, we need better performance. S3 Tables offer significantly improved performance compared to storing tabular data in general-purpose buckets: up to 10 times higher transactions per second (TPS) out of the box, translating to 55,000 reads/second and 35,000 writes/second. This boost is further amplified by automated background compaction, resulting in up to 3 times faster query performance compared to the previous options for Iceberg tables. S3 also automatically scales request capacity as our traffic grows, ensuring our applications always have the resources they need.
One of the biggest concerns with Iceberg is how table performance can degrade over time as data accumulates and more files are added. This leads to increased query latency due to the need to read from numerous small files.
Compaction is a crucial feature that addresses this issue by automatically merging smaller data files into larger ones. This reduces the number of files to read, significantly improving query performance; previously, we had to manage compaction processes manually.
Amazon just released a blog post explaining these optimizations, if you are interested in reading further: https://aws.amazon.com/blogs/storage/how-amazon-ads-uses-iceberg-optimizations-to-accelerate-their-spark-workload-on-amazon-s3/
Cost Optimizations
Storage Cost Optimization with Automated Snapshot Management
Iceberg maintains a history of changes through snapshots, allowing for time travel and rollbacks. However, accumulating numerous snapshots significantly increases storage costs. Previously, managing snapshot retention and cleaning up associated files required manual effort: expiring snapshots, then identifying and deleting the unreferenced files associated with the expired snapshots.
S3 Tables automates these processes with automated snapshot expiration. Policies allow us to specify the number of snapshots to retain and their retention duration, and S3 Tables automatically expires snapshots according to these policies.
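A sketch of such a snapshot policy via boto3's s3tables client; the retention values are arbitrary examples and the field layout is our reading of the SDK documentation:

import boto3

s3tables = boto3.client("s3tables")
bucket_arn = "arn:aws:s3tables:eu-central-1:111122223333:bucket/sap-datalake-tables"  # placeholder

# Snapshot policy: keep at least 3 snapshots, expire anything older than 5 days.
s3tables.put_table_maintenance_configuration(
    tableBucketARN=bucket_arn,
    namespace="sap",
    name="acdoca_table",
    type="icebergSnapshotManagement",
    value={
        "status": "enabled",
        "settings": {
            "icebergSnapshotManagement": {
                "minSnapshotsToKeep": 3,
                "maxSnapshotAgeHours": 120,
            }
        },
    },
)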
Storage Costs: The primary cost component, at $27.14/month for 1 TB of data.
Request Costs: PUT requests incur minimal costs, even with high-frequency daily ingestion; GET requests remain affordable for moderate query workloads.
Monitoring Costs: A small, predictable cost based on object count.
Compaction Costs: Object-processing costs are negligible, while data processing during compaction adds a more significant cost ($7.50/month).
Frequent PUT/GET Requests: Real-time ingestion and querying could increase costs significantly.
Compaction Overheads: High-frequency small writes can drive up compaction costs, especially in real-time use cases.
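As a sanity check on the storage figure above, a quick back-of-the-envelope calculation, assuming a list price of roughly $0.0265 per GB-month at the time of writing (prices vary by Region and usage tier):

# Rough monthly storage estimate for 1 TB in an S3 table bucket.
price_per_gb_month = 0.0265   # USD per GB-month (assumed list price)
size_gb = 1024                # 1 TB

storage_cost = size_gb * price_per_gb_month
print(f"Storage: ${storage_cost:.2f}/month")   # ≈ $27.14/month, matching the figure above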
What comes next for us?
Although this is exciting, some aspects of Amazon S3 Tables and its integration with other AWS services are still in public preview. This means these features are available for testing and feedback but may not be fully production-ready, or may have some limitations.
Glue Data Catalog Integration: While S3 Tables automatically register with the Glue Data Catalog, creating tables directly through engines like Amazon Athena or Redshift is not yet supported. In our case, we are creating the tables directly from SAP using SQL queries, for example:
CREATE TABLE sap_materials (
material_id STRING,
material_type STRING,
description STRING,
base_unit STRING,
created_at TIMESTAMP,
modified_at TIMESTAMP,
storage_location STRING,
batch_number STRING,
quantity DECIMAL(15,3),
value DECIMAL(15,2)
)
USING iceberg
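Once the table exists, the daily commits from our extraction jobs are plain Iceberg writes. A minimal PySpark sketch, reusing the placeholder s3tablesbucket catalog from the configuration sketch earlier; the rows stand in for an extracted SAP material delta:

from datetime import datetime
from pyspark.sql import SparkSession, functions as F

# Assumes a session configured with the s3tablesbucket catalog as sketched above.
spark = SparkSession.builder.getOrCreate()

# Placeholder rows standing in for an extracted SAP material delta.
rows = [
    ("MAT-0001", "ROH", "Raw steel coil", "KG",
     datetime(2024, 5, 1, 8, 0), datetime(2024, 5, 2, 9, 30),
     "SL01", "BATCH-42", 1250.000, 18375.50),
]
columns = ["material_id", "material_type", "description", "base_unit",
           "created_at", "modified_at", "storage_location", "batch_number",
           "quantity", "value"]

df = (
    spark.createDataFrame(rows, columns)
    .withColumn("quantity", F.col("quantity").cast("decimal(15,3)"))
    .withColumn("value", F.col("value").cast("decimal(15,2)"))
)

# Append the delta as a new Iceberg snapshot; S3 Tables handles compaction
# and snapshot expiration in the background.
df.writeTo("s3tablesbucket.sap.sap_materials").append()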
S3 Metadata will provide a significant improvement for our SAP operations: the automatic generation of metadata for S3 objects and its storage in an S3 table bucket is also in preview, but it will certainly bring us additional benefits for unstructured data.
Databricks and Snowflake customers on AWS may see a lower entry barrier for Iceberg, a competitor to Delta Lake. Iceberg was developed at Netflix as a replacement for Hive and is managed by the Apache foundation; Delta Lake is managed by the Linux Foundation. They are very similar, though different, and until now the barrier was mostly the compute engine used. Let's see if in the future Amazon S3 will also support the Delta Lake table format natively.
S3 Tables will drive our future Data Lake innovation. It opens a world of possibilities for query engines and optimization strategies, and provides new cost models to account for real-time workloads and high-frequency operations.
If you want to know more, check out this document:
https://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/data-lakes.html
And, if you want to join this racket, this will help (easy to find online, by Dremio): https://www.oreilly.com/library/view/apache-iceberg-the/9781098148614/
#SAP
#SAPTechnologyblog