The "Pay-Per-Sip" Data Pipeline: How to build a 10-Minute Analytics Stack with Athena, Iceberg, and Airflow
Hey Data Geeks! Let’s talk about building a seriously fast and cheap analytics pipeline.
TL;DR
This post is your guide to building a fast, cheap, and reliable analytics pipeline on AWS. We'll ditch the old, slow batch methods that can't even handle deletes and jump into a modern, serverless micro-batch architecture that gets your data fresh every 10 minutes. We'll use AWS DMS for CDC, Iceberg on S3 for a transactional data lake, Athena's MERGE command for super-simple SQL-based ingestion, and Airflow to run the whole show. The result? A pipeline that's fast, cost-effective, and easy for anyone who knows SQL to manage.
Disclaimer
The views and opinions expressed in this article are my own and do not necessarily reflect the official policy or position of my employer.
1. Introduction
We've all been there. You pull up a critical dashboard to answer an urgent business question, only to see the "Last Updated" timestamp from six hours ago, or worse, yesterday. That sinking feeling you get is the realisation that you're flying blind, making today's decisions on yesterday's data. In a world that moves faster than ever, this lag isn't just an inconvenience; it's a competitive disadvantage.
What if you could shrink that data gap from hours to just 10 minutes?
This article shows you how to do exactly that. We'll walk through a modern, pragmatic approach to building a low-latency analytics pipeline without the overhead of complex streaming systems. By combining the power of a SQL-based micro-batch architecture using AWS Athena, Apache Iceberg, and Apache Airflow, you can deliver fresh, reliable data that keeps pace with your business.
2. The Old Way: Why Traditional Batch Pipelines Fail
Picture this: It’s 2 AM. Your daily batch job just crashed—again—while scanning 500 million rows with SELECT * FROM orders. Production databases groan under the load. Data teams scramble. Morning reports ship late. Analysts lose trust.
For years, data teams have relied on two common batch-processing patterns, each with significant drawbacks.
2.1 The Inefficiency of SELECT * (Full Batch)
The simplest approach is to copy the entire table from the source database and replace the old version in the data lake. While straightforward, this method is a brute-force solution that quickly breaks down.
As tables grow to millions or billions of rows, these jobs become incredibly slow and expensive, consuming massive amounts of compute resources. Worse, running a full table scan puts a heavy, recurring load on the production database, risking performance impacts on the very applications that run your business.
2.2 The Unreliability of Incremental Batch
A seemingly smarter method is to pull only the records that have changed since the last run, typically using a timestamp column like updated_at > last_timestamp.
The problem is that this approach relies on a fragile "contract." It assumes every application process that modifies a row will flawlessly update the timestamp. In any large organization, this contract inevitably breaks. A background job or a legacy piece of code will update a row without touching the timestamp, and that update becomes invisible to the data pipeline. This leads to subtle but maddening data discrepancies that erode trust between data teams and their stakeholders.
2.3 The Dealbreaker: No Deletes!
The most critical failure of both these methods is their complete inability to capture DELETE operations.
If a user account is deleted from your production database, it remains forever in your data lake. This creates "ghost" records that silently corrupt your analytics. Your user counts will be wrong. Your sales totals will be inflated. The data becomes fundamentally untrustworthy. Furthermore, this can violate data regulations such as GDPR (the right to be forgotten).
The only way to find these deleted records is to perform a full comparison between the source and the destination—an expensive and impractical task that defeats the purpose of an efficient pipeline.
3. The Modern Micro-Batch Architecture
Instead of fighting with the limitations of old batch systems, we can build a modern pipeline that is fast, reliable, and surprisingly simple. This architecture stands on three core, serverless-friendly pillars that work together seamlessly.
3.1 The Source: A Reliable Stream of Every Change
The foundation of our pipeline is capturing every single data modification as it happens. We achieve this with Change Data Capture (CDC).
A service like AWS Database Migration Service (DMS) connects to our source database's transaction log—a low-level record of every INSERT, UPDATE, and DELETE. DMS then streams these events in near real-time and lands them as a series of small files in an Amazon S3 bucket. This gives us a complete and ordered audit trail of every change, ensuring we never miss an update or a delete again. Alternatively, you can use powerful open-source tools tailored to your specific database.
For PostgreSQL, CDC is powered by its native logical decoding feature. This mechanism transforms changes from the internal Write-Ahead Log (WAL) into a coherent, easy-to-understand stream of events. A tool like the open-source Debezium connector for PostgreSQL can tap into this stream. It reads from a logical replication slot in your database, converting every INSERT, UPDATE, and DELETE into a structured JSON or Avro message. These messages are then published to a streaming platform like Apache Kafka, ready to be landed in S3.
For MySQL, the equivalent mechanism is its battle-tested binary log (binlog). The binlog is an ordered record of all changes made to the database, primarily used for replication. The Debezium connector for MySQL acts like a replica, connecting to your database (ideally a read replica to avoid production load) and "tailing" this log. It parses the low-level binary events and transforms them into clean, row-level change events, providing a reliable stream of modifications that can be easily funnelled into your S3 staging area.
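If you go the open-source route, the source database has to be prepared for log-based CDC first. The sketch below shows what those prerequisites typically look like; exact settings depend on your database versions and any managed-service restrictions, and the slot name here is purely illustrative.

-- PostgreSQL: logical decoding must be switched on (requires a restart).
-- Debezium then reads from a logical replication slot; it can create the
-- slot itself, but creating one manually looks like this.
ALTER SYSTEM SET wal_level = 'logical';
SELECT * FROM pg_create_logical_replication_slot('cdc_slot', 'pgoutput');

-- MySQL: the binlog must be enabled in ROW format with full row images,
-- so each change event carries complete before/after values.
SET GLOBAL binlog_format = 'ROW';
SET GLOBAL binlog_row_image = 'FULL';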
3.2 The Engine: Powerful, SQL-Based Ingestion with Athena
This is where the real power of the architecture comes into play. We use AWS Athena not just for querying, but as our primary ingestion engine. But before Athena can work its magic, it needs to understand the structure of the incoming CDC data from our S3 staging area. This is where schema management becomes critical: the staged files are registered as a table in the AWS Glue Data Catalog, either by a Glue crawler or a manually defined DDL, so Athena knows which columns and types to expect.
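As a rough illustration, a staging table over the DMS output might be registered like the sketch below. All names here are hypothetical, and it assumes the DMS task writes Parquet files that include an operation flag and an added timestamp column.

-- A minimal sketch of a staging table over the DMS CDC output in S3.
-- Database, table, column names, and the S3 path are illustrative.
CREATE EXTERNAL TABLE cdc_staging.orders_changes (
    op              string,     -- DMS operation flag: I, U, or D
    primary_key     bigint,
    col1            string,
    col2            string,
    event_timestamp timestamp   -- added when a timestamp column is configured on the DMS task
)
STORED AS PARQUET
LOCATION 's3://your-bucket/cdc/orders/';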
With the staging schema in place, and thanks to Apache Iceberg's support for ACID transactions on S3, Athena can execute MERGE statements directly on our data lake tables. This single, powerful SQL command reads the batch of new CDC events from our S3 staging area and atomically applies all the inserts, updates, and deletes to our final Iceberg table. It's a transactionally safe, SQL-native way to handle complex data changes without writing a single line of Spark code.
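The target itself is an ordinary Athena Iceberg table, created once up front. Here is a minimal sketch with illustrative names; note the date-based partitioning, which matters later for both the MERGE file limits and the compaction job.

-- A minimal sketch of the target Iceberg table (names are illustrative).
-- Partitioning by event_date keeps each 10-minute MERGE focused on a small,
-- predictable set of partitions.
CREATE TABLE your_database.final_iceberg_table (
    primary_key     bigint,
    col1            string,
    col2            string,
    operation       string,
    event_timestamp timestamp,
    event_date      date
)
PARTITIONED BY (event_date)
LOCATION 's3://your-bucket/warehouse/final_iceberg_table/'
TBLPROPERTIES ('table_type' = 'ICEBERG');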
3.3 The Orchestrator: Precise, Reliable Scheduling with Airflow
To run this process consistently every 10 minutes, we need a robust scheduler. Apache Airflow is the perfect tool for the job.
We create a simple Airflow DAG (Directed Acyclic Graph) with a schedule set to run every 10 minutes. The primary task in this DAG is to execute the Athena MERGE query. This approach gives us everything we need for a production-grade pipeline: automatic retries, alerting on failures, detailed logging, and a clear visual interface to monitor every run. It transforms a simple SQL script into a reliable, observable, and automated ingestion process.
4. The 10-Minute Ingestion DAG in Action
Now that we have the architecture, let's walk through how it all comes together in a single, automated workflow orchestrated by Airflow. This is the core engine that runs every 10 minutes to keep our data fresh.
Step 1: How DMS Writes Directly to S3
When you configure an AWS DMS replication task, you define a source and a target endpoint. Rather than routing changes through an intermediate streaming service such as Kinesis, you simply configure your S3 bucket as the target endpoint. DMS will then:
- Connect to your source database and capture the change events from the transaction log.
- Batch these events into files (typically in .csv or .parquet format).
- Write these files directly into your specified S3 bucket and folder path.
The resulting architecture is simple: Database → DMS → S3. Your Airflow DAG then picks up these files from S3 and runs the MERGE query.
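Before wiring up the MERGE, it is worth confirming that DMS is actually landing files where you expect. A quick sanity check against the hypothetical staging table sketched earlier might look like this:

-- Count staged change events by operation type (table and column names
-- are illustrative, matching the earlier staging-table sketch).
SELECT op,
       COUNT(*) AS events,
       MAX(event_timestamp) AS latest_event
FROM cdc_staging.orders_changes
GROUP BY op;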
Step 2: The Heart of the Pipeline - The Athena MERGE Query
This single SQL query is where all the logic happens. It reads the latest batch of staged CDC events and intelligently applies them to our final Iceberg table.
First, we typically create a view over the raw staging data that normalises the CDC payload (for example, mapping DMS's single-letter operation codes to readable INSERT, UPDATE, and DELETE labels). The MERGE command then uses this view as its source and de-duplicates within the batch, ensuring we only process the absolute latest change for each record.
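A minimal sketch of such a view, assuming the hypothetical staging table from earlier and DMS's single-letter operation flags:

-- Illustrative view that maps DMS operation codes to readable labels and
-- derives the event_date used for filtering and partitioning.
CREATE OR REPLACE VIEW your_database.staged_cdc_events_view AS
SELECT
    primary_key,
    col1,
    col2,
    CASE op
        WHEN 'I' THEN 'INSERT'
        WHEN 'U' THEN 'UPDATE'
        WHEN 'D' THEN 'DELETE'
    END AS operation,
    event_timestamp,
    CAST(event_timestamp AS date) AS event_date
FROM cdc_staging.orders_changes;

With the view registered, the 10-minute MERGE reads from it: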
MERGE INTO your_database.final_iceberg_table t
USING (
    SELECT * FROM (
        SELECT *,
            ROW_NUMBER() OVER (PARTITION BY primary_key ORDER BY event_timestamp DESC) AS rn
        FROM your_database.staged_cdc_events_view
        WHERE event_date = CURRENT_DATE
    ) ranked
    WHERE rn = 1
) s
ON t.primary_key = s.primary_key
WHEN MATCHED AND s.operation = 'UPDATE'
    THEN UPDATE SET col1 = s.col1, col2 = s.col2
WHEN MATCHED AND s.operation = 'DELETE'
    THEN DELETE
WHEN NOT MATCHED AND s.operation = 'INSERT'
    THEN INSERT (primary_key, col1, col2, operation, event_timestamp)
    VALUES (s.primary_key, s.col1, s.col2, s.operation, s.event_timestamp)
Breaking it down:
USING ( ... ) s: The source subquery ranks the staged events per primary key and keeps only the latest one (rn = 1), so each batch applies at most one change per record.
ON t.primary_key = s.primary_key: This is the join condition. It checks if a record from our CDC source already exists in the target Iceberg table.
WHEN MATCHED ...: This clause handles records that already exist. If the operation is an UPDATE, it updates the columns. If it's a DELETE, it removes the row entirely from the Iceberg table.
WHEN NOT MATCHED ...: This handles brand new records. If a primary key from the source doesn't exist in the target and the operation is INSERT, it inserts the new row.
Step 3: Orchestrating with Airflow
The final step is to automate the execution of this query. We create a simple Airflow DAG scheduled to run every 10 minutes.
The DAG itself is straightforward. Its main purpose is to connect to Athena and run the MERGE query. Using a tool like the Astro SDK for Python, the task can be as simple as a decorated Python function.
from __future__ import annotations

import pendulum
from airflow.models.dag import DAG
from astro.sql import run_raw_sql

# The full MERGE statement from Step 2.
MERGE_INTO_ICEBERG_TABLE = """
MERGE INTO ...
...
"""

@run_raw_sql(conn_id="athena_default")
def merge_cdc_events_into_iceberg(query: str):
    """Executes the Athena MERGE query."""
    return query

with DAG(
    dag_id="iceberg_10_min_ingestion",
    start_date=pendulum.datetime(2025, 8, 18, tz="Europe/Tallinn"),
    schedule="*/10 * * * *",
    catchup=False,
) as dag:
    merge_cdc_events_into_iceberg(MERGE_INTO_ICEBERG_TABLE)
This DAG ensures our SQL logic is executed reliably and on time. If a run fails, Airflow's retry mechanism will kick in. If it fails repeatedly, it will send an alert. This simple setup transforms our SQL script into a resilient, production-ready ingestion pipeline.
5. Smart Maintenance: The Midnight Compaction Job
A low-latency ingestion pipeline is fantastic, but to ensure our data lake remains fast and cost-effective for queries over the long term, we need to perform regular housekeeping. This is where a smart, automated maintenance job becomes a critical part of our architecture.
5.1 Why Compaction is Crucial: The "Small File Problem"
Our 10-minute ingestion process provides incredible data freshness, but it has a side effect: every 10 minutes, it creates one or more new, small files in S3. Over the course of a day, this adds up to hundreds of files.
This creates the classic "small file problem." Query engines like Athena are optimized to read large, columnar files. When a query has to open, read metadata from, and process thousands of tiny files instead of a few large ones, performance suffers dramatically. The overhead of listing and opening each file adds significant latency to every query, making your fast data lake feel slow.
The solution is compaction: periodically rewriting these small files into larger, more optimal ones.
5.2 The Power of Partition-Aware OPTIMIZE
Thankfully, Apache Iceberg provides a simple and powerful command to handle this directly within Athena: OPTIMIZE. However, running a compaction job on an entire multi-terabyte table every night would be incredibly expensive and wasteful.
A much smarter approach is to use a partition-aware strategy. Since our data is partitioned by date, only the most recent partitions are being modified by our ingestion pipeline. Older partitions are likely already well-compacted. Therefore, we can tell Athena to only compact the data from the last few days.
This targeted approach is highly efficient and is accomplished with a simple WHERE clause in the OPTIMIZE command:
OPTIMIZE your_database.final_iceberg_table
REWRITE DATA USING BIN_PACK
WHERE event_date >= CURRENT_DATE - INTERVAL '10' DAY;
This command instructs Athena to rewrite data only for the last 10 days. This is the sweet spot for efficiency because:
- It targets the "hot" data that suffers the most from the small file problem.
- It leaves older, stable partitions untouched, saving enormous amounts of processing time and cost.
5.3 The Daily Airflow DAG for Housekeeping
This maintenance task is perfect for a separate, dedicated Airflow DAG. We simply create a new DAG with a single task that executes the OPTIMIZE query. This DAG is scheduled to run once a day during off-peak hours, such as midnight (schedule="0 0 * * *").
This simple, automated housekeeping job is the key to balancing the needs of our pipeline: fast, low-latency ingestion with consistently fast query performance for our analysts.
6. Summary
By combining this modern toolset and a pragmatic micro-batch approach, we've built more than just a data pipeline; we've created a reliable and efficient analytics foundation. The benefits of this architecture directly address the shortcomings of traditional systems.
Serverless & Cost-Effective by Design
One of the most significant advantages is the serverless nature of the stack. By leveraging AWS Athena, S3, and DMS, you eliminate the need to manage and pay for idle server clusters. Your costs are directly tied to the data you process and the queries you run—a pay-per-query model. This is far more cost-effective than maintaining a 24/7 streaming cluster, especially for workloads that have variable traffic.
Simple & Accessible with SQL
The heart of the ingestion logic—the MERGE statement—is written in standard SQL. This is a game-changer for team productivity. You don't need a specialized team of Spark developers to manage the pipeline. Data analysts, analytics engineers, and anyone comfortable with SQL can easily understand, debug, and even extend the system. This accessibility lowers the barrier to entry and makes your data platform more maintainable and democratic.
A Note on Athena's MERGE Limitations
While the MERGE command is powerful, it has an important limitation in Athena you must be aware of. An Athena MERGE operation will fail if the source data requires it to update or delete rows in more than 100 unique files within a single partition of the target Iceberg table. For a high-frequency ingestion pipeline, it's possible to hit this limit if your data changes are widely distributed. The key to avoiding this is a wise partitioning strategy. By partitioning your target Iceberg table (e.g., by event_date), you ensure that each 10-minute MERGE job only targets a very small and predictable number of partitions (usually just one). This naturally constrains the number of files the operation needs to touch, keeping you safely under the limit and ensuring your ingestion pipeline remains robust.
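One practical way to stay ahead of that limit is to watch file counts per partition using Athena's Iceberg metadata tables. A rough sketch, assuming the illustrative table from earlier; if a hot partition starts accumulating files, the nightly compaction job from section 5 brings it back down.

-- Inspect per-partition statistics (including data file counts) via the
-- Iceberg "$partitions" metadata table.
SELECT *
FROM "your_database"."final_iceberg_table$partitions";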