This new Data Architecture called “Shift Left” is a game changer and the industry is taking a turn

Before I set ground, let me talk about the two big blocks in Data Analytics of modern ages, Data Streaming and Data Lakes or LakeHouses

image by Author

The Lakehouse is a relatively new concept that merges the capabilities of a data lake and a data warehouse. Meanwhile, data streaming serves as the foundation for real-time processing and ensures data consistency between real-time and non-real-time systems.

A Lakehouse is typically used for analytical use cases, whereas data streaming supports both operational and analytical applications, bridging the gap between them.

Data streaming is built on event-driven architecture, a concept that has existed for over 20 years with message brokers. However, data streaming introduces fundamental differences. I usually describe it using four pillars:

Real-Time Messaging

Similar to traditional message brokers but built to scale massively—processing millions of messages per second.Supports transactional workloads with features like exactly-once processing, which is crucial for applications like payment processing.

Persistence Layer

Unlike traditional message brokers, data streaming persists events.This allows businesses to define retention policies—from a few hours (for logs or IoT data) to months or years (for critical business data).This prevents the need to repeatedly query the ERP systems.

Data Integration

Unlike message brokers, which require additional ETL tools, data streaming platforms offer built-in integration.They connect monolith systems (relational databases, ERPs, mainframes) and cloud-native services (object stores, cloud Lakehouses, SaaS applications).

Data Processing

Stream processing enables continuous transformation and enrichment of data.Instead of storing and analyzing raw data later, it can be aggregated, filtered, and enriched before reaching its destination.Used for fraud detection, predictive maintenance, customer recommendations, and even real-time AI inference.

Data Streaming Strengths:

Best for processing data in motion with low-latency event-driven architectures.Used in real-time analytics, fraud detection, monitoring, alerting, and transactional processing.Core banking systems, ERP solutions, and AI inference engines often use Kafka and Flink under the hood.Ideal for continuous AI model inference where immediate action is required.

Lakehouse Strengths:

Originates from data warehouse and data lake concepts (e.g., Snowflake, Delta Lake).Best for batch processing, historical data analysis, and business intelligence.Used for long-term data storage, reporting, and model training.AI workloads like model training and large-scale batch inference fit well in a Lakehouse.

Rather than choosing one over the other, the best strategy is to combine them:

Data streaming processes data in motion, enabling real-time decision-making.Lakehouses store processed data at rest, making it available for historical analysis and AI training.Streaming data can be persisted in a Lakehouse for further analytics.

Nevertheless, the key it results in using open table formats, with Apache Iceberg, we can store data once and query it with any tool and this is what drives this blog.

For this, a new Data Processing architecture called Shift Left has emerged as a response to inefficiencies in earlier data processing paradigms Medallion or ETL. The main problem of the Medallion and ETL architectures is its inefficient and has an elevated compute cost because data is copied over and over again, and every time.

In Multi-Hop Medallion Architecture, raw data (“bronze”) undergoes iterative refinement through silver (cleaned) and gold (enriched) layers via batch ETL processes. This required duplicative processing at each layer, increasing compute costs, latency between data generation and availability and silos where operational/analytical systems used different datasets.

The Multi-Hop Architecture
Data is typically extracted, transformed, and loaded (ETL) into an analytics environment following a multi-hop process:

Bronze Layer (Landing Zone) – Raw data from operational systems, including Kafka topics, SaaS endpoints, or text files, is ingested here.Silver Layer (Refined Data) – Data is cleaned, structured, and enriched, making it usable for business purposes.Gold Layer (Business-Ready Data) – Purpose-built datasets are created for specific applications, reports, or exports.

Medallion architecture by Databricks

This Medallion Architecture (Bronze → Silver → Gold) is widely used but has significant drawbacks.

Challenges with Multi-Hop Architectures

Slow Data Processing

Most multi-hop architectures rely on scheduled batch processing.Each transformation step introduces a delay (e.g., a 15-minute interval per hop), resulting in outdated insights.

High Costs

Every hop requires additional storage and processing power, increasing infrastructure expenses.

Brittle Pipelines

Different teams manage different layers (e.g., database engineers, data engineers, and analysts), leading to coordination challenges and frequent breakages.

Duplicate Pipelines

To avoid disruptions, teams often create their own redundant data pipelines, adding unnecessary complexity.

Similar Yet Different Datasets

Multiple versions of the same dataset emerge, leading to confusion about which one is the authoritative source.Inconsistent data can erode trust, especially when reports show conflicting numbers.

Shift Left addresses these inefficiencies by moving data preparation earlier in the process.

Shift Left Architecture. By Author

The key principles include:

A Stream-First Approach

Instead of periodic batch jobs, data is processed in real-time using event-driven architectures (EDAs).Technologies like Kafka and Change Data Capture (CDC) connectors ensure that updates are immediately available.

Stream-to-Table Conversion with Amazon S3

Amazon now automatically converts streams into Apache Iceberg tables, making structured data instantly accessible.Iceberg tables can be read by multiple engines (e.g., Trino, Flink, Presto, Spark, Databricks, and Snowflake).

Integration with Existing Analytics Workflows

Shift Left doesn’t replace existing architectures but enhances them.Cleaned and structured data from streams seamlessly integrates into the Silver layer, enabling faster and more reliable insights.

Benefits of Shift Left

Faster Data Availability – Real-time streaming eliminates batch processing delays.Lower Costs – Fewer data copies reduce storage and processing expenses.Improved Reliability – Stronger schema enforcement prevents pipeline breakages.Simpler Architecture – Eliminates redundant pipelines and conflicting datasets.Incremental Adoption – Companies can gradually migrate workloads to Shift Left without disrupting operations.

The Shift Left Architecture builds on the concept introduced by McKinsey consultants called data products, which are central to modern **data mesh** principles. The goal is to unify operational and analytical workloads by creating high-quality, consistent, and real-time data products.

source; McKinsey “a-better-way-to-put-your-data-to-work”

The Shift Left Approach addresses the medallion inefficiencies by moving data processing closer to its source—on the left side of the architecture. At its core, this approach relies on event-driven data streaming for real-time, scalable, and reliable processing.

The concept is to create the data product once, and that will be available for multiple systems to consume. This is cheaper, faster and supports both Operational and Analytical use cases with Iceberg format and ACID transactions guarantee.

Amazon S3 with its Iceberg capability, as a table format to unify operational and analytical workloads, allows businesses to store data once (e.g., in an object store Amazon S3) and consume it across various platforms (e.g., Snowflake, Athena) without requiring additional processing or connectors, but other cloud Table formats like Delta Lake could be served.

Before I set ground, let me talk about the two big blocks in Data Analytics of modern ages, Data Streaming and Data Lakes or LakeHousesimage by Author The Lakehouse is a relatively new concept that merges the capabilities of a data lake and a data warehouse. Meanwhile, data streaming serves as the foundation for real-time processing and ensures data consistency between real-time and non-real-time systems.A Lakehouse is typically used for analytical use cases, whereas data streaming supports both operational and analytical applications, bridging the gap between them. Data streaming is built on event-driven architecture, a concept that has existed for over 20 years with message brokers. However, data streaming introduces fundamental differences. I usually describe it using four pillars:Real-Time MessagingSimilar to traditional message brokers but built to scale massively—processing millions of messages per second.Supports transactional workloads with features like exactly-once processing, which is crucial for applications like payment processing.Persistence LayerUnlike traditional message brokers, data streaming persists events.This allows businesses to define retention policies—from a few hours (for logs or IoT data) to months or years (for critical business data).This prevents the need to repeatedly query the ERP systems.Data IntegrationUnlike message brokers, which require additional ETL tools, data streaming platforms offer built-in integration.They connect monolith systems (relational databases, ERPs, mainframes) and cloud-native services (object stores, cloud Lakehouses, SaaS applications).Data ProcessingStream processing enables continuous transformation and enrichment of data.Instead of storing and analyzing raw data later, it can be aggregated, filtered, and enriched before reaching its destination.Used for fraud detection, predictive maintenance, customer recommendations, and even real-time AI inference.Data Streaming Strengths:Best for processing data in motion with low-latency event-driven architectures.Used in real-time analytics, fraud detection, monitoring, alerting, and transactional processing.Core banking systems, ERP solutions, and AI inference engines often use Kafka and Flink under the hood.Ideal for continuous AI model inference where immediate action is required.Lakehouse Strengths:Originates from data warehouse and data lake concepts (e.g., Snowflake, Delta Lake).Best for batch processing, historical data analysis, and business intelligence.Used for long-term data storage, reporting, and model training.AI workloads like model training and large-scale batch inference fit well in a Lakehouse.Rather than choosing one over the other, the best strategy is to combine them:Data streaming processes data in motion, enabling real-time decision-making.Lakehouses store processed data at rest, making it available for historical analysis and AI training.Streaming data can be persisted in a Lakehouse for further analytics.Nevertheless, the key it results in using open table formats, with Apache Iceberg, we can store data once and query it with any tool and this is what drives this blog.For this, a new Data Processing architecture called Shift Left has emerged as a response to inefficiencies in earlier data processing paradigms Medallion or ETL. The main problem of the Medallion and ETL architectures is its inefficient and has an elevated compute cost because data is copied over and over again, and every time.In Multi-Hop Medallion Architecture, raw data (“bronze”) undergoes iterative refinement through silver (cleaned) and gold (enriched) layers via batch ETL processes. This required duplicative processing at each layer, increasing compute costs, latency between data generation and availability and silos where operational/analytical systems used different datasets.The Multi-Hop ArchitectureData is typically extracted, transformed, and loaded (ETL) into an analytics environment following a multi-hop process:Bronze Layer (Landing Zone) – Raw data from operational systems, including Kafka topics, SaaS endpoints, or text files, is ingested here.Silver Layer (Refined Data) – Data is cleaned, structured, and enriched, making it usable for business purposes.Gold Layer (Business-Ready Data) – Purpose-built datasets are created for specific applications, reports, or exports.Medallion architecture by Databricks This Medallion Architecture (Bronze → Silver → Gold) is widely used but has significant drawbacks.Challenges with Multi-Hop ArchitecturesSlow Data ProcessingMost multi-hop architectures rely on scheduled batch processing.Each transformation step introduces a delay (e.g., a 15-minute interval per hop), resulting in outdated insights.High CostsEvery hop requires additional storage and processing power, increasing infrastructure expenses.Brittle PipelinesDifferent teams manage different layers (e.g., database engineers, data engineers, and analysts), leading to coordination challenges and frequent breakages.Duplicate PipelinesTo avoid disruptions, teams often create their own redundant data pipelines, adding unnecessary complexity.Similar Yet Different DatasetsMultiple versions of the same dataset emerge, leading to confusion about which one is the authoritative source.Inconsistent data can erode trust, especially when reports show conflicting numbers. Shift Left addresses these inefficiencies by moving data preparation earlier in the process.Shift Left Architecture. By Author The key principles include:A Stream-First ApproachInstead of periodic batch jobs, data is processed in real-time using event-driven architectures (EDAs).Technologies like Kafka and Change Data Capture (CDC) connectors ensure that updates are immediately available.Stream-to-Table Conversion with Amazon S3Amazon now automatically converts streams into Apache Iceberg tables, making structured data instantly accessible.Iceberg tables can be read by multiple engines (e.g., Trino, Flink, Presto, Spark, Databricks, and Snowflake).Integration with Existing Analytics WorkflowsShift Left doesn’t replace existing architectures but enhances them.Cleaned and structured data from streams seamlessly integrates into the Silver layer, enabling faster and more reliable insights.Benefits of Shift LeftFaster Data Availability – Real-time streaming eliminates batch processing delays.Lower Costs – Fewer data copies reduce storage and processing expenses.Improved Reliability – Stronger schema enforcement prevents pipeline breakages.Simpler Architecture – Eliminates redundant pipelines and conflicting datasets.Incremental Adoption – Companies can gradually migrate workloads to Shift Left without disrupting operations. The Shift Left Architecture builds on the concept introduced by McKinsey consultants called data products, which are central to modern **data mesh** principles. The goal is to unify operational and analytical workloads by creating high-quality, consistent, and real-time data products. source; McKinsey “a-better-way-to-put-your-data-to-work” The Shift Left Approach addresses the medallion inefficiencies by moving data processing closer to its source—on the left side of the architecture. At its core, this approach relies on event-driven data streaming for real-time, scalable, and reliable processing. The concept is to create the data product once, and that will be available for multiple systems to consume. This is cheaper, faster and supports both Operational and Analytical use cases with Iceberg format and ACID transactions guarantee.Amazon S3 with its Iceberg capability, as a table format to unify operational and analytical workloads, allows businesses to store data once (e.g., in an object store Amazon S3) and consume it across various platforms (e.g., Snowflake, Athena) without requiring additional processing or connectors, but other cloud Table formats like Delta Lake could be served. Read More Technology Blogs by Members articles

#SAP

#SAPTechnologyblog