Home Data EngineeringWhen Your Application Mutates created_at: Fixing Timestamps with CDC History

When Your Application Mutates created_at: Fixing Timestamps with CDC History

by Marc

The application was silently rewriting created_at after record creation, making child records appear older than their parents. Here is how I caught it and fixed it using raw CDC history in the dbt staging layer.

The Symptom

A dbt test started failing intermittently. The test checked a simple invariant: a bid (child record) should never have a created_at timestamp earlier than the case (parent record) it belongs to. In other words, you cannot submit a bid on a case that does not exist yet.

-- This test was failing: bids created before their parent case
SELECT
  b.id AS bid_id,
  b.created_at AS bid_created_at,
  c.id AS case_id,
  c.created_at AS case_created_at
FROM {{ ref('stg_bids') }} b
JOIN {{ ref('stg_cases') }} c ON b.case_id = c.id
WHERE b.created_at < c.created_at

The first time I saw the failures, I assumed bad data — maybe a migration artefact or a bulk import that set incorrect timestamps. But the failing records were recent. And the list of failures kept growing.

The Investigation

I pulled the raw CDC history for one of the failing case records. Our raw layer captures every version of a row via Change Data Capture, so I could see the full mutation history:

SELECT
  id,
  created_at,
  updated_at,
  _cdc_timestamp,
  _cdc_change_type
FROM `raw_project.raw_dataset.cases`
WHERE id = 12345
ORDER BY _cdc_timestamp ASC

The result told the whole story:

id    | created_at          | updated_at          | _cdc_timestamp      | _cdc_change_type
------|---------------------|---------------------|---------------------|------------------
12345 | 2026-01-10 09:15:00 | 2026-01-10 09:15:00 | 2026-01-10 09:15:02 | INSERT
12345 | 2026-01-10 09:15:00 | 2026-01-10 09:22:00 | 2026-01-10 09:22:01 | UPDATE
12345 | 2026-01-12 14:30:00 | 2026-01-12 14:30:00 | 2026-01-12 14:30:03 | UPDATE

Look at the third row. The created_at changed from 2026-01-10 09:15:00 to 2026-01-12 14:30:00two days after the original creation. The application was mutating created_at as part of a business workflow. In this case, when a case was "resubmitted" after being rejected, the application reset created_at to the resubmission time.

This is a legitimate application feature — the product team wants resubmitted cases to behave as if they are new. But from a data warehouse perspective, it breaks temporal integrity. Bids that were created on January 10th (against the original submission) now appear to predate their parent case, which "started" on January 12th according to the mutated created_at.

The Scope of the Problem

I queried to find how many records were affected:

-- Find all cases where created_at was mutated
WITH first_version AS (
  SELECT
    id,
    created_at AS original_created_at,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY _cdc_timestamp ASC) AS rn
  FROM `raw_project.raw_dataset.cases`
),
current_version AS (
  SELECT
    id,
    created_at AS current_created_at,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY _cdc_timestamp DESC) AS rn
  FROM `raw_project.raw_dataset.cases`
)
SELECT COUNT(*) AS mutated_count
FROM first_version f
JOIN current_version c ON f.id = c.id
WHERE f.rn = 1 AND c.rn = 1
  AND f.original_created_at != c.current_created_at

Over 800 records had their created_at mutated at least once. This was not an edge case — it was a routine part of the application's workflow.

The Fix: Use the Original created_at From CDC History

The fix was straightforward: instead of trusting the current created_at value from the application, look it up from the raw CDC history. The first created_at value ever recorded for a given ID is the true creation timestamp.

I implemented this in the dbt staging model using a CTE that queries the raw CDC table:

-- models/staging/stg_cases.sql

WITH source AS (
    SELECT *
    FROM {{ source('app', 'cases') }}
    WHERE _cdc_change_type != 'DELETE'
),

-- Look up the ORIGINAL created_at from CDC history.
-- The application mutates created_at on resubmission,
-- which breaks parent-child temporal integrity.
original_created_at AS (
    SELECT
        id,
        MIN(created_at) AS created_at_original
    FROM {{ source('app_cdc', 'cases') }}
    GROUP BY id
),

final AS (
    SELECT
        s.id,
        s.case_id,
        s.status,
        -- Use the original created_at, not the (possibly mutated) current value
        COALESCE(
            oca.created_at_original,
            s.created_at
        ) AS created_at,
        -- Keep the mutated value available for reference
        s.created_at AS created_at_application,
        s.updated_at,
        s.submitted_at,
        s.deleted_at
    FROM source s
    LEFT JOIN original_created_at oca
        ON s.id = oca.id
)

SELECT * FROM final

Key design decisions in this fix:

  • MIN(created_at) instead of "first CDC row": Using MIN is simpler and more resilient than ordering by CDC timestamp and taking the first row. If the CDC history is incomplete (e.g., we started capturing changes after the record was created), MIN still gives us the earliest known value.
  • COALESCE fallback: If the CDC history lookup returns NULL (rare, but possible for very old records predating our CDC setup), fall back to the current application value. Some data is better than no data.
  • Keep both values: The original created_at is used for all downstream logic, but I also expose created_at_application for analysts who might need to understand the resubmission timeline.
  • Fix it in staging, not downstream: This correction belongs in the staging layer because every downstream model that uses created_at should get the corrected value. Fixing it further downstream would mean playing whack-a-mole across dozens of models.

The CDC Source Configuration

One practical challenge: the raw CDC table lives in a different dataset than the current-state snapshot. I added it as a separate source in dbt:

# models/staging/sources.yml

sources:
  - name: app
    database: "{{ var('raw_project') }}"
    schema: app_data
    tables:
      - name: cases
        description: "Current state snapshot of cases"

  - name: app_cdc
    database: "{{ var('raw_project') }}"
    schema: app_data_cdc
    tables:
      - name: cases
        description: "Full CDC history of cases (all mutations)"

Performance Considerations

The CDC history table is significantly larger than the current-state table — roughly 10x the row count. The MIN(created_at) GROUP BY id aggregation scans the entire history table. For our data volumes (~2M CDC rows), this adds about 4 seconds to the staging model build time. At larger scale, you would want to:

  • Partition the CDC table by _cdc_timestamp and only scan relevant partitions
  • Materialise the original_created_at lookup as a separate incremental model that only processes new CDC rows
  • Or push this logic to the ingestion layer and store original_created_at as a separate column at write time

Lessons Learned

  • Never trust created_at from application databases. It sounds obvious, but created_at feels like an immutable field. It is not. Applications can and do mutate it for business reasons. Always verify with your application team.
  • CDC history is your source of truth for temporal data. If you have CDC set up, you have a tamper-proof record of what the data looked like at every point in time. Use it.
  • dbt tests that enforce temporal integrity catch real bugs. The "child cannot be older than parent" test sounded pedantic when I wrote it. It caught a real data quality issue that would have silently broken time-based metrics like "time to first bid".
  • Fix data quality issues at the lowest possible layer. Correcting created_at in the staging model means every downstream model gets the fix for free. Doing it in individual reporting models would be fragile and incomplete.
  • Keep the original value accessible. Do not discard the application's version of the field. Someone will eventually need it, and having both values makes debugging much easier.

After deploying the fix, all 12 dbt test failures resolved immediately, and our "time to first bid" metric shifted by about 2% — a small but meaningful correction that had been silently skewing our reporting for months.

Categories: Data Engineering, dbt, Lessons Learned

You may also like

Leave a Comment