DEV Community

Apache Iceberg in Production: Compaction, Catalogs, and the Pitfalls Nobody Warns You About

The Small Files Problem Is Not Optional

Iceberg is append-friendly by design. Every micro-batch write, every streaming insert, every incremental load creates new Parquet files. Each file also gets its own metadata entry. After a week of hourly loads, you might have 10,000 files in a single partition where you wanted 20.

The result: query planning has to walk manifests that track thousands of data files. Planning takes longer than execution. Your 10-second query becomes a 4-minute query, and your users start filing tickets.

Fix: automate compaction from day one. In Spark, compaction is called rewrite_data_files. The basic call looks like this:

-- Run this on a schedule, not on-demand
CALL iceberg_catalog.system.rewrite_data_files(
  table => 'analytics.events',
  strategy => 'binpack',
  options => map(
    'target-file-size-bytes', '134217728',  -- 128MB target per file
    'min-input-files', '5'                  -- only compact if 5+ small files exist
  )
)

Target file size of 128MB to 512MB is the practical sweet spot. Smaller than that, you still have too many files. Larger, and your query engines cannot parallelize reads efficiently.
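To put numbers on that trade-off, here is a minimal arithmetic sketch. The 256 KiB average file size is a made-up figure for illustration, and the function is only a rough lower bound: real compaction output also depends on partition boundaries and the engine's split planning.

```python
import math

MIB = 1024 * 1024


def files_after_binpack(total_bytes: int, target_file_bytes: int) -> int:
    """Rough lower bound on file count once data is packed to the target size."""
    if total_bytes <= 0:
        return 0
    return math.ceil(total_bytes / target_file_bytes)


# Hypothetical partition: 10,000 files averaging 256 KiB each (about 2.4 GiB total).
small_file_count = 10_000
avg_small_file_bytes = 256 * 1024
total_bytes = small_file_count * avg_small_file_bytes

for target_mib in (128, 256, 512):
    n = files_after_binpack(total_bytes, target_mib * MIB)
    print(f"target {target_mib} MiB -> about {n} files (down from {small_file_count})")
# target 128 MiB -> about 20 files (down from 10000)
# target 256 MiB -> about 10 files (down from 10000)
# target 512 MiB -> about 5 files (down from 10000)
```

Under these assumed numbers, the 128MB target gets you to roughly the 20 files you wanted in the first place.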

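If you are not using Spark, check what your engine offers natively before reaching for it. Trino can compact Iceberg tables with ALTER TABLE ... EXECUTE optimize, and Iceberg's Flink module ships a rewrite-data-files action. PyIceberg's table maintenance support is more limited and has changed between releases, so confirm that the version you run actually covers compaction before relying on it. If none of that fits your setup, schedule compaction as a separate Spark job. Yes, it is annoying, but it is the right call.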
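"On a schedule" usually means one job looping over many tables. The sketch below only builds the CALL statement shown earlier for each table; the table list and the helper name build_rewrite_call are hypothetical, and in a real Spark job you would hand each string to spark.sql(...) from whatever scheduler you already run.

```python
import re

# Conservative check for dotted identifiers such as "analytics.events".
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*(\.[A-Za-z_][A-Za-z0-9_]*)*$")


def build_rewrite_call(
    catalog: str,
    table: str,
    target_file_size_bytes: int = 128 * 1024 * 1024,
    min_input_files: int = 5,
) -> str:
    """Return the Spark SQL text for a binpack rewrite_data_files call."""
    for name in (catalog, table):
        if not _IDENT.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    return (
        f"CALL {catalog}.system.rewrite_data_files(\n"
        f"  table => '{table}',\n"
        f"  strategy => 'binpack',\n"
        f"  options => map(\n"
        f"    'target-file-size-bytes', '{target_file_size_bytes}',\n"
        f"    'min-input-files', '{min_input_files}'\n"
        f"  )\n"
        f")"
    )


# Hypothetical list of tables to compact on each scheduled run.
TABLES = ["analytics.events", "analytics.sessions"]

for table_name in TABLES:
    statement = build_rewrite_call("iceberg_catalog", table_name)
    print(statement)  # in a Spark job: spark.sql(statement)
```

Keeping the statement builder separate from the Spark session makes it easy to unit test the job without a cluster.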

Hidden Partitioning Is the Feature You Are Probably Ignoring

Old Hive partitioning was explicit. You wrote PARTITIONED BY (event_date STRING) and added that column to every query or Hive would scan the entire table.

Iceberg's hidden partitioning decouples the physical layout from what the query writer sees. You define a partition spec on the table, and the engine automatically applies it during writes and prunes during reads without the query needing to reference the partition column.

from pyiceberg.catalog import load_catalog
from pyiceberg.transforms import DayTransform

catalog = load_catalog(
    "rest",
    **{
        "uri": "http://your-rest-catalog:8181",
        "warehouse": "s3://your-bucket/warehouse"
    }
)

# Load an existing table and evolve its partition spec
table = catalog.load_table("analytics.events")

# Add a day-level partition on event_timestamp
# Iceberg derives the day value from the timestamp. No ts_date column needed in your schema.
with table.update_spec() as update:
    update.add_field(
        source_column_name="event_timestamp",
        transform=DayTransform(),
        partition_field_name="event_day"
    )

Now queries that filter on event_timestamp get partition pruning automatically for data written under the new spec. The column stays a timestamp in the schema. No WHERE event_date = '2026-06-24' hack required.

The bigger win: you can change the partition strategy without rewriting the table. An Iceberg table can carry several partition specs at once, and the metadata records which spec each file was written under. Old data stays on the old layout. New data uses the new one. The engine handles both transparently.
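If the pruning feels like magic, this stdlib-only sketch shows the idea. It assumes UTC timestamps and models the day transform as "whole days since 1970-01-01"; prune_partitions is a hypothetical stand-in for what the engine's planner does against manifest metadata, not an Iceberg API.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def day_transform(ts: datetime) -> int:
    """Whole days since 1970-01-01 UTC (floors for timestamps before the epoch)."""
    return (ts - EPOCH).days


def day_to_date(day: int) -> date:
    """Convert a day ordinal back to a calendar date for display."""
    return date(1970, 1, 1) + timedelta(days=day)


def prune_partitions(partition_days, ts_from: datetime, ts_to: datetime):
    """Keep only day partitions that can contain rows in [ts_from, ts_to]."""
    lo, hi = day_transform(ts_from), day_transform(ts_to)
    return [d for d in partition_days if lo <= d <= hi]


# Hypothetical table with one partition per day, 2026-06-20 through 2026-06-24.
partition_days = [
    day_transform(datetime(2026, 6, d, tzinfo=timezone.utc)) for d in range(20, 25)
]

# The query only mentions the timestamp column, never a date column.
kept = prune_partitions(
    partition_days,
    datetime(2026, 6, 24, 8, 0, tzinfo=timezone.utc),
    datetime(2026, 6, 24, 17, 0, tzinfo=timezone.utc),
)
print([day_to_date(d).isoformat() for d in kept])  # ['2026-06-24']
```

The filter on the raw timestamp is translated into a range of day values, and every partition outside that range is skipped without being opened.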

The Catalog Decision Matters More Than You Think

Every Iceberg table lives in a catalog. The catalog tracks which metadata file is current. Get this wrong and you either lock yourself into one vendor or end up with metadata conflicts that corrupt tables.

The main options in 2026:

  • AWS Glue Catalog works well if your entire stack is AWS. Zero operational overhead. But cross-cloud access is painful, and engine compatibility outside of Spark and Athena requires extra configuration.
  • REST Catalog (Nessie, Polaris) follows the open Iceberg REST catalog specification. Any engine that supports the spec can read and write. Nessie adds git-like branching for data, which is genuinely useful for staging ETL results before promoting to prod. Slightly more infra to manage.
  • Unity Catalog is the right choice if you are on Databricks. Tight governance integration, fine-grained access control at the column level. But it is proprietary, and getting data out to non-Databricks engines requires extra work.

My take: if you are building multi-engine (Spark + Trino + Flink), go REST-compatible from the start. Migrating catalogs later is painful. AWS Glue to REST is doable; Unity to anything else is not fun.

Here is a rough decision guide:

  • Single cloud (AWS only) → Glue Catalog
  • Databricks-primary stack → Unity Catalog
  • Multi-engine / multi-cloud → REST Catalog (Nessie or Polaris)

Snapshot Management: The Silent Storage Leak

Every write creates a snapshot. Snapshots reference manifest lists. Manifest lists reference manifest files. Manifest files reference data files. Without snapshot expiration, you are paying for every historical snapshot indefinitely. The metadata alone can grow into gigabytes. S3 LIST operations against large metadata trees get expensive fast.

-- Expire snapshots older than 7 days, keep at least 5 for safety
CALL iceberg_catalog.system.expire_snapshots(
  table => 'analytics.events',
  older_than => TIMESTAMP '2026-06-17 00:00:00',
  retain_last => 5
)

After expiring snapshots, orphan files may still exist (files written but never committed to a snapshot):

-- Remove orphan files older than 3 days
-- The 3-day buffer ensures in-progress writes are not deleted
CALL iceberg_catalog.system.remove_orphan_files(
  table => 'analytics.events',
  older_than => TIMESTAMP '2026-06-21 00:00:00'
)

Run these on a schedule. Weekly is fine for most tables. Daily for high-volume streaming tables.
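One thing to watch when you schedule them: the timestamps above are literals, so a cron job that reuses them verbatim expires less and less each week. A minimal sketch of computing the cutoffs at run time follows. The helper names are hypothetical, and it assumes your Spark session time zone is UTC so the rendered literal means what you expect.

```python
from datetime import datetime, timedelta, timezone


def cutoff_literal(now: datetime, days: int) -> str:
    """Format (now - days) as a Spark TIMESTAMP literal body."""
    return (now - timedelta(days=days)).strftime("%Y-%m-%d %H:%M:%S")


def build_expire_call(catalog, table, now, older_than_days=7, retain_last=5) -> str:
    """Spark SQL text for expire_snapshots with a rolling cutoff."""
    return (
        f"CALL {catalog}.system.expire_snapshots(\n"
        f"  table => '{table}',\n"
        f"  older_than => TIMESTAMP '{cutoff_literal(now, older_than_days)}',\n"
        f"  retain_last => {retain_last}\n"
        f")"
    )


def build_orphan_call(catalog, table, now, older_than_days=3) -> str:
    """Spark SQL text for remove_orphan_files with a rolling cutoff."""
    return (
        f"CALL {catalog}.system.remove_orphan_files(\n"
        f"  table => '{table}',\n"
        f"  older_than => TIMESTAMP '{cutoff_literal(now, older_than_days)}'\n"
        f")"
    )


run_time = datetime.now(timezone.utc)
# Expire first, then clean orphans, matching the order used above.
for statement in (
    build_expire_call("iceberg_catalog", "analytics.events", run_time),
    build_orphan_call("iceberg_catalog", "analytics.events", run_time),
):
    print(statement)  # in a Spark job: spark.sql(statement)
```

Run at midnight on 2026-06-24, this produces the same 2026-06-17 and 2026-06-21 cutoffs as the hand-written calls.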

Time Travel Done Right

One of Iceberg's actual killer features. You can query any historical snapshot. The syntax below is Spark SQL; other engines such as Trino spell the clause differently:

-- Query the table as it was yesterday at midnight
SELECT * FROM analytics.events
  FOR SYSTEM_TIME AS OF '2026-06-23 00:00:00'
  WHERE event_type = 'purchase';

-- Or by snapshot ID (useful when you need a specific pipeline run)
SELECT * FROM analytics.events
  VERSION AS OF 8027658604211071520;

The catch: time travel only works while the snapshot exists. Once you expire it, it is gone. Plan your retention window around your incident response SLA. If your team takes 72 hours to notice a bad pipeline run, keep at least 7 days of snapshots.
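To see how the expiration settings decide what you can still travel back to, here is a simplified model. It assumes a linear snapshot history (no branches or tags) and treats older_than and retain_last the way the procedure call above describes them; the function names are hypothetical, not Iceberg APIs.

```python
from datetime import datetime, timedelta


def surviving_snapshots(snapshot_times, older_than: datetime, retain_last: int):
    """Snapshots kept: newer than the cutoff, or among the most recent N."""
    ordered = sorted(snapshot_times)
    newest = set(ordered[-retain_last:]) if retain_last > 0 else set()
    return [t for t in ordered if t >= older_than or t in newest]


def snapshot_as_of(surviving, target: datetime):
    """Latest surviving snapshot at or before target, or None if it is gone."""
    candidates = [t for t in surviving if t <= target]
    return max(candidates) if candidates else None


# Hypothetical table with one commit per day at midnight, 2026-06-15 to 2026-06-24.
snapshots = [datetime(2026, 6, 15) + timedelta(days=i) for i in range(10)]
kept = surviving_snapshots(snapshots, older_than=datetime(2026, 6, 17), retain_last=5)

print(len(kept))                                     # 8
print(snapshot_as_of(kept, datetime(2026, 6, 23)))   # 2026-06-23 00:00:00
print(snapshot_as_of(kept, datetime(2026, 6, 16, 12)))  # None
```

In this model the bad run from 06-16 is already unrecoverable by the 24th, which is exactly the situation the retention window has to prevent.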

Common Mistakes

  • Not running compaction at all. The default state of most Iceberg tables I have seen is "never been compacted." Set up compaction as part of table creation, not as a fix-it-later task.
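  • Compacting too aggressively. Running rewrite_data_files too frequently on large tables wastes compute and can collide with concurrent writers. Readers keep using their own snapshot, but overlapping commits, especially row-level updates and deletes, can fail or force retries. Once per day for most tables, twice per day for high-volume ones.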
  • Using the wrong partition granularity. Partitioning by HOUR makes sense for 10 billion events per day. For 10 million, it creates too many small partitions and kills planning time. Match partition granularity to your data volume.
  • Picking Glue catalog for a multi-engine stack. You will not feel the pain on day one. You will feel it six months in when you try to add Trino and spend two weeks on catalog configuration.
  • Not setting write.target-file-size-bytes. Effective defaults vary by engine. Set it explicitly in your table properties so that engines honoring the property write consistently sized files:

ALTER TABLE analytics.events SET TBLPROPERTIES (
  'write.target-file-size-bytes' = '134217728',
  'write.delete.target-file-size-bytes' = '67108864'
);
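For the partition granularity point above, a back-of-the-envelope check helps. This sketch assumes 100 bytes per event after compression (a made-up figure; measure your own) and uses a simple rule of thumb: pick the finest granularity whose average partition still fills at least one target-sized file.

```python
MIB = 1024 * 1024

# (name, days of data covered by one partition); a month is approximated as 30 days.
GRANULARITIES = [("hour", 1 / 24), ("day", 1.0), ("month", 30.0)]


def bytes_per_partition(events_per_day: int, bytes_per_event: int, days: float) -> float:
    """Average amount of data landing in one partition."""
    return events_per_day * bytes_per_event * days


def finest_useful_granularity(
    events_per_day: int,
    bytes_per_event: int = 100,            # assumption, not a measured value
    target_file_bytes: int = 128 * MIB,
) -> str:
    """Finest granularity whose average partition holds at least one target-sized file."""
    for name, days in GRANULARITIES:
        if bytes_per_partition(events_per_day, bytes_per_event, days) >= target_file_bytes:
            return name
    return "unpartitioned"


for volume in (10_000_000_000, 10_000_000, 100_000):
    print(f"{volume:>14,} events/day -> {finest_useful_granularity(volume)}")
#  10,000,000,000 events/day -> hour
#      10,000,000 events/day -> day
#         100,000 events/day -> month
```

With these assumed sizes, hourly partitions at 10 million events per day would hold about 42 MB each, well under a single 128MB file.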

What Iceberg Actually Is

Iceberg is a table format specification, not a storage engine. It tells engines how to find data, what schema it has, and which files are current. The engines (Spark, Trino, Flink, Athena) do the actual reading and writing.

This means Iceberg is only as good as the operational practices around it. The format solves real problems: ACID on object storage, schema evolution without rewriting, partition pruning without partition columns in queries. But you still have to run compaction. You still have to expire snapshots. You still have to pick the right catalog.

The teams I have seen succeed with Iceberg treated these maintenance tasks as first-class engineering concerns, not afterthoughts. The ones who struggled treated Iceberg like a managed service and were surprised when it needed managing.

Start with compaction and snapshot expiration automated before you write your first production table. Everything else you can figure out as you go.

Best regards,
Gabriel Henrique Cardoso Antonio
🔗 gabrielh.dev
