Home BigQueryFixing CDC Race Conditions in a Custom BigQuery Sync

Fixing CDC Race Conditions in a Custom BigQuery Sync

by Marc

How two duplicate webhook events, arriving 2ms apart, broke 12 dbt tests in production — and the surprisingly simple fix that made the sync idempotent.

The Setup

We replaced our managed CDC connector with a custom webhook-based sync from our CRM to BigQuery. The CRM fires a webhook on every record change, and our Cloud Run service receives it and writes the change into BigQuery using a CDC pattern: one row per version, with _active, _start, and _end columns tracking which version is current.

The sync worked perfectly in testing. Then we deployed to production, and within the first week, 12 dbt tests started failing. All of them traced back to just 2 records that had duplicate active rows — something our CDC pattern was supposed to make impossible.

The Race Condition

The CRM was sending duplicate webhook events for the same mutation, arriving 2-3 milliseconds apart. This is not unusual — many SaaS platforms have at-least-once delivery guarantees, and retries or internal fanout can produce duplicates. Our webhook handler used a BigQuery scripting block (a BEGIN...END multi-statement script) to process each event. The original logic looked like this:

The Broken Pattern (UPDATE then INSERT)

-- BROKEN: Original CDC merge pattern
BEGIN
  -- Step 1: Close the currently active row
  UPDATE `project.dataset.deal_history`
  SET
    _active = FALSE,
    _end = TIMESTAMP('2026-01-15 10:30:00.123')
  WHERE id = 5329
    AND _active = TRUE;

  -- Step 2: Insert the new version as active
  INSERT INTO `project.dataset.deal_history`
    (id, title, value, stage_id, _active, _start, _end, _synced)
  VALUES
    (5329, 'Acme Corp Deal', 15000.00, 11,
     TRUE,
     TIMESTAMP('2026-01-15 10:30:00.123'),
     TIMESTAMP('9999-12-31 23:59:59'),
     CURRENT_TIMESTAMP());
END;

The logic seems sound: close the old row, insert the new one. But here is the critical thing about BigQuery scripting: concurrent script executions do not have transaction isolation between them. BigQuery provides snapshot isolation within a single DML statement, but a scripting block (BEGIN...END) is not a single transaction — each statement inside it runs independently.

Here is what happened when two identical webhooks arrived 2ms apart:

  1. Script A runs UPDATE: finds the active row, sets _active = FALSE
  2. Script B runs UPDATE: also finds the same original active row (snapshot isolation means B’s UPDATE sees the state before A’s UPDATE committed)
  3. Script A runs INSERT: inserts new active row
  4. Script B runs INSERT: also inserts a new active row

Result: two active rows for the same record. Both UPDATEs targeted the same original row (one was a no-op since it was already closed), and both INSERTs succeeded because there is no unique constraint enforcement in BigQuery.

Why BigQuery Makes This Tricky

In PostgreSQL, you would wrap this in a serializable transaction. In BigQuery, you cannot. BigQuery DML has these constraints:

  • Each individual DML statement has snapshot isolation
  • A scripting block (BEGIN...END) is not a transaction — it is a sequence of independent statements
  • There are no row-level locks, no SELECT ... FOR UPDATE
  • No unique constraints on tables
  • Concurrent DML on the same table is allowed (since 2023), but each statement sees a snapshot from before any concurrent mutations

This means any multi-step pattern where step 2 depends on the outcome of step 1 is inherently racy when concurrent executions target the same rows.

The Fix: INSERT First, Then UPDATE

The fix was to reverse the order of operations. Instead of closing the old row and then inserting the new one, we insert the new row first, then close all other active rows for the same ID:

The Fixed Pattern (INSERT then UPDATE)

-- FIXED: Idempotent CDC merge pattern
BEGIN
  -- Step 1: Insert the new version as active
  INSERT INTO `project.dataset.deal_history`
    (id, title, value, stage_id, _active, _start, _end, _synced)
  VALUES
    (5329, 'Acme Corp Deal', 15000.00, 11,
     TRUE,
     TIMESTAMP('2026-01-15 10:30:00.123'),
     TIMESTAMP('9999-12-31 23:59:59'),
     CURRENT_TIMESTAMP());

  -- Step 2: Close all active rows EXCEPT the one we just inserted
  -- (the newest one by _start timestamp)
  UPDATE `project.dataset.deal_history`
  SET
    _active = FALSE,
    _end = TIMESTAMP('2026-01-15 10:30:00.123')
  WHERE id = 5329
    AND _active = TRUE
    AND _start != TIMESTAMP('2026-01-15 10:30:00.123');
END;

Why does this work even with concurrent execution? Let’s trace through the same scenario:

  1. Script A runs INSERT: inserts a new active row (row A)
  2. Script B runs INSERT: inserts another active row (row B) — identical data, same _start timestamp
  3. Script A runs UPDATE: closes all active rows for this ID where _start differs from the new timestamp. This closes the original old row.
  4. Script B runs UPDATE: same — closes all active rows for this ID where _start differs. Also closes the original old row (no-op since A already closed it).

After both scripts complete, we have two active rows with the same _start timestamp. To handle this, I added a deduplication step to the weekly full-sync reconciliation job:

-- Deduplication cleanup: keep only one active row per ID
-- Run as part of weekly full reconciliation
MERGE `project.dataset.deal_history` AS target
USING (
  SELECT id, MAX(_synced) AS keep_synced
  FROM `project.dataset.deal_history`
  WHERE _active = TRUE
  GROUP BY id
  HAVING COUNT(*) > 1
) AS dupes
ON target.id = dupes.id
  AND target._active = TRUE
  AND target._synced != dupes.keep_synced
WHEN MATCHED THEN
  UPDATE SET
    _active = FALSE,
    _end = target._start;

In practice, duplicates from the INSERT-first pattern are extremely rare — they only happen when two webhooks with the exact same timestamp arrive within the same BigQuery DML snapshot window. And even when they do occur, the dedup cleanup catches them before the next dbt run.

The Downstream Impact: 12 Tests From 2 Records

What surprised me was how much damage two duplicate records caused. The duplicate active rows in the CDC table propagated through the dbt DAG:

  • The staging model selected all active rows, so both duplicates made it into the normalized layer
  • A unique test on the primary key failed in the normalized layer
  • Downstream joins fanned out — one record joining to two versions of the same deal doubled the revenue
  • Aggregate metrics shifted, causing accepted_values and relationships tests to fail downstream
  • 12 tests across 4 dbt models failed, all traceable to just 2 duplicate deal records

This is a good reminder of why unique tests on primary keys are non-negotiable. Without them, the duplicates would have silently inflated our revenue metrics until someone noticed the dashboards looked off.

Key Takeaways

  • BigQuery scripting blocks are not transactions. Each statement runs independently. Do not assume atomicity across statements in a BEGIN...END block.
  • Design for idempotency, not mutual exclusion. You cannot prevent concurrent execution in BigQuery, so make sure concurrent runs produce a correct result.
  • INSERT-then-UPDATE is safer than UPDATE-then-INSERT. The worst case (a duplicate active row) is easily cleaned up. The worst case of the original pattern (missing active row or phantom active row) is harder to detect and fix.
  • Assume at-least-once delivery from webhooks. Every webhook-based system should be designed to handle duplicate events gracefully.
  • dbt tests are your safety net. 12 test failures from 2 bad records sounds annoying, but it is vastly better than silently wrong data in dashboards.

The fix was a one-line diff in terms of logic (swap the order of INSERT and UPDATE), but it took me half a day to reproduce the race condition and verify the fix. Sometimes the simplest bugs are the hardest to reason about.

Categories: Data Engineering, BigQuery, Lessons Learned

You may also like

Leave a Comment