The Apache Iceberg upsert bug you won't find in the docs

The Apache Iceberg sink connector documentation listed upsert as a supported mode. We were building a production pipeline that depended on it. Before writing any implementation code, we used Claude to compare what the documentation claimed against what the source code and tests actually did. The feature was broken. Here is how we found it, what the code showed, and what it meant for the pipeline.

Why upsert in a streaming pipeline is where things go wrong

We were building a CDC (change data capture) pipeline that needed to maintain a current-state view of records in Apache Iceberg as events arrived from Kafka. New records get inserted. Updated records get their latest state written. Deleted records get removed. This is the standard upsert pattern in streaming pipelines.

Apache Iceberg is a natural fit for this kind of work. It supports ACID transactions, time travel, and schema evolution, and it has grown significantly in production adoption. The Iceberg sink connector documentation described an upsert mode with exactly the behavior we needed: primary key-based merging, update handling, delete propagation.

What made us pause before building: upsert in a streaming connector is technically hard to get right. It is not just "write this record." The connector needs to identify records by primary key across potentially large existing datasets, resolve conflicts when events arrive out of order, handle tombstone events (records with null values that represent deletes in a CDC stream), and do all of this with correct semantics even when the stream contains the same key multiple times in a short window.
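To make those requirements concrete, here is a minimal in-memory sketch of the semantics an upsert writer has to provide. It is not the connector's code. The `Event` type, the `offset` field, and the `apply_upsert` function are hypothetical names chosen for illustration, and the sketch assumes each event carries a per-key monotonically increasing source position that can be used to reject stale arrivals.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    key: str
    value: Optional[dict]  # None marks a tombstone (a delete)
    offset: int            # assumed: increasing source position for ordering


def apply_upsert(state: dict, last_offset: dict, event: Event) -> None:
    """Apply one event to an in-memory table keyed by primary key."""
    # Out-of-order handling: skip anything older than what was already applied.
    if event.offset <= last_offset.get(event.key, -1):
        return
    last_offset[event.key] = event.offset

    if event.value is None:
        state.pop(event.key, None)       # tombstone: remove the row if present
    else:
        state[event.key] = event.value   # insert a new key or overwrite an old one


state, last_offset = {}, {}
events = [
    Event("a", {"n": 1}, 0),
    Event("a", {"n": 2}, 2),
    Event("a", {"n": 99}, 1),  # late arrival, older than offset 2, must be ignored
    Event("b", {"n": 5}, 3),
    Event("b", None, 4),       # tombstone for "b"
]
for e in events:
    apply_upsert(state, last_offset, e)

print(state)
```

Key "a" ends with the value from offset 2 because the late event at offset 1 is discarded, and key "b" is gone because its tombstone arrived last. A real connector has to provide the same guarantees against files on disk rather than a dictionary, which is where the difficulty lies.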

Each of those requirements is a potential failure point. Each one needs specific test coverage to verify it actually works. Documentation rarely tells you which ones were tested.

What the documentation claimed

The connector documentation described upsert mode clearly. It listed the configuration parameters, explained how to specify primary key columns, and described the expected behavior: records with matching keys would be updated, new keys would be inserted, and delete events would propagate as row deletions in the target table.

The documentation read as a complete feature description. Not experimental. Not partial support with caveats. Upsert mode, as a supported capability, available to configure and use.

In most cases, that would be enough to proceed. For a feature as critical as upsert in a production data pipeline, we decided to verify before writing any implementation code.

Upsert in a data pipeline is load-bearing. If it works, your data is correct. If it has a bug that produces silent failures - events that appear processed but actually wrote incorrect state - you may not discover the problem until you notice downstream data inconsistency. By that point, you have a pipeline in production, data you cannot easily reconstruct, and a debugging problem that starts with "how long has this been wrong?"

Experience supplies the skepticism. After working with enough streaming data connectors and CDC pipelines, you learn that upsert semantics are often the last thing a connector gets right and the first thing to break under real-world conditions. CDC streams in particular produce patterns that stress-test upsert implementations: a single record being updated multiple times in a short window, deletes arriving before the corresponding insert in a replay scenario, or the same key appearing in both a delete event and an insert event within the same micro-batch.

These are not unusual conditions. They are the normal behavior of a CDC stream under load. A connector that handles basic upsert correctly in tests may not handle them.

The analysis

We cloned the connector repository and identified the files relevant to the upsert implementation: the connector source, the writer class that handles the merge logic, and the test suite for upsert mode.

The questions we asked Claude:

"The documentation says this connector supports upsert mode for Apache Iceberg. Walk me through the code that implements this. How does it identify records by primary key, handle updates, and handle delete events from a CDC stream?"

"Walk me through the test suite for the upsert mode. What scenarios are covered? What edge cases are tested? What is not covered?"

The second question consistently produces the most useful information. What tests exist tells you what the developers verified. What tests don't exist tells you what they either didn't anticipate or didn't finish.

What the source code showed

The analysis found a gap between the documentation and the implementation. The upsert mode had incomplete handling for a specific but predictable pattern: a delete event and an insert event for the same primary key arriving within the same processing window.

In a CDC stream, this is not an edge case. When a record is updated at the source database, many CDC systems emit this as two events: a delete event for the old state, followed immediately by an insert event for the new state. This delete-then-insert pattern is fundamental to how CDC works. A connector in upsert mode needs to process these two events together and produce a single updated row in the target table. If it processes them independently, the result depends on the order in which they are applied. Applying the delete first (removing the row) and then the insert (writing the new state) happens to produce the correct result. Applying the insert first and then the delete leaves the row missing.
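The order dependence described above is easy to reproduce in a toy model. The sketch below is an illustration only, with a hypothetical `apply_independently` function that treats each event in isolation. It does not reflect how any particular connector is implemented.

```python
def apply_independently(events):
    """Apply each (op, key, row) event on its own, in the order given."""
    table = {}
    for op, key, row in events:
        if op == "delete":
            table.pop(key, None)
        elif op == "insert":
            table[key] = row
    return table


old = {"id": 1, "status": "pending"}
new = {"id": 1, "status": "shipped"}

# The source update arrives as delete(old) followed by insert(new).
in_order = [("insert", 1, old), ("delete", 1, None), ("insert", 1, new)]

# The same two events applied in the opposite order within one window.
swapped = [("insert", 1, old), ("insert", 1, new), ("delete", 1, None)]

result_in_order = apply_independently(in_order)
result_swapped = apply_independently(swapped)

print(result_in_order)  # the row survives with its new state
print(result_swapped)   # the row is missing
```

Both runs process exactly the same events without raising an error. Only the second one loses the row, which is why this class of bug tends to stay silent.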

The implementation did not handle this correctly. The code path for upsert mode processed events without the ordering and deduplication logic needed to handle the delete-then-insert pattern reliably. The test suite confirmed this: the tests covered inserts and standalone updates, but there were no tests for the delete-then-insert pattern under any configuration.
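For reference, this is roughly the shape of test we were looking for and did not find. It is a sketch written against a hypothetical in-memory `write_batch` function, not against the connector's test harness. It assumes each event carries a source sequence number (`seq`) so that a batch can be collapsed to the latest event per key before anything is written.

```python
def collapse_batch(events):
    """Keep only the highest-sequence event for each key in the batch."""
    latest = {}
    for seq, op, key, row in events:
        if key not in latest or seq > latest[key][0]:
            latest[key] = (seq, op, row)
    return latest


def write_batch(table, events):
    """Apply one batch to the table after collapsing it per key."""
    for key, (_seq, op, row) in collapse_batch(events).items():
        if op == "delete":
            table.pop(key, None)
        else:
            table[key] = row
    return table


def test_delete_then_insert_same_batch():
    table = {1: {"id": 1, "status": "pending"}}
    # The events arrive in the batch out of order; seq records the source order.
    batch = [
        (11, "insert", 1, {"id": 1, "status": "shipped"}),
        (10, "delete", 1, None),
    ]
    assert write_batch(table, batch) == {1: {"id": 1, "status": "shipped"}}


def test_insert_then_delete_same_batch():
    table = {}
    batch = [(20, "insert", 2, {"id": 2}), (21, "delete", 2, None)]
    assert write_batch(table, batch) == {}


test_delete_then_insert_same_batch()
test_insert_then_delete_same_batch()
```

Collapsing by sequence number is one possible design, chosen here because it makes the outcome independent of arrival order within the batch. Whatever approach a connector takes, a test of this shape is what shows that the delete-then-insert case was considered at all.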

The documentation said upsert worked. The code said it did not, under conditions that would occur in any real CDC pipeline.

[Screenshot: VS Code showing BaseDeltaTaskWriter.java from the Apache Iceberg source at lines 101-125, with the Claude Code terminal below displaying the analysis output identifying the row kind dispatch logic and the file path flink/v1.18/flink/src/main/java/org/apache/iceberg/flink/sink/BaseDeltaTaskWriter.java]

What finding it early actually saved

The analysis took about two hours. That is a real cost, but a small one compared to the alternative.

If we had built the pipeline without verifying, we would have shipped a production system that produced incorrect results under predictable conditions. A CDC stream from a busy source database will produce the delete-then-insert pattern constantly. The bug would not have been intermittent. It would have affected real data on a regular basis.

Discovering a data integrity bug in production is a different category of incident from discovering a performance bug. A slow query degrades user experience. A data integrity bug corrupts your dataset. Recovering from it means identifying the scope of the corruption, determining which records were affected, reconstructing the correct state, verifying the fix does not introduce new problems, and explaining the data quality gap to whoever depends on that data. That work is measured in days. Sometimes longer.

Two hours of analysis before the build, or multiple days of incident response after it. The analysis was not optional in retrospect. It should not have been optional in prospect either.

We found an alternative connector implementation that handled CDC semantics correctly, verified it with the same analysis, and built on that instead.

What this tells you about CDC connector evaluation

The Apache Iceberg connector is not a poorly maintained project. It is used in production by real teams. The bug we found was real but specific: teams not using CDC streams, or not depending on the delete-then-insert pattern, would not have encountered it. For a direct-insert use case, the connector worked exactly as documented.

A developer without experience in CDC pipelines might not have known to ask about the delete-then-insert pattern. That question comes from knowing how CDC streams behave in practice, not from reading the connector documentation. The tool does the analysis quickly. The experience defines which analysis to do.

What to do now

If you are evaluating any connector for CDC data, verify three things before you build: how it handles the delete-then-insert pattern, what its ordering guarantees are, and how it processes tombstone events. Load the source and test files into Claude and ask specifically about those scenarios. The gaps in test coverage are more revealing than the documentation.

For the full technique - applicable to any open source tool, not just CDC connectors - see "Analyze open source code with Claude, not documentation."