The application was silently rewriting created_at after record creation, making child records appear older than their parents. Here is how I caught it and fixed it using raw CDC history in the dbt staging layer.
The Symptom
A dbt test started failing intermittently. The test checked a simple invariant: a bid (child record) should never have a created_at timestamp earlier than the case (parent record) it belongs to. In other words, you cannot submit a bid on a case that does not exist yet.
-- This test was failing: bids created before their parent case
SELECT
b.id AS bid_id,
b.created_at AS bid_created_at,
c.id AS case_id,
c.created_at AS case_created_at
FROM {{ ref('stg_bids') }} b
JOIN {{ ref('stg_cases') }} c ON b.case_id = c.id
WHERE b.created_at < c.created_at
The first time I saw the failures, I assumed bad data — maybe a migration artefact or a bulk import that set incorrect timestamps. But the failing records were recent. And the list of failures kept growing.
The Investigation
I pulled the raw CDC history for one of the failing case records. Our raw layer captures every version of a row via Change Data Capture, so I could see the full mutation history:
SELECT
id,
created_at,
updated_at,
_cdc_timestamp,
_cdc_change_type
FROM `raw_project.raw_dataset.cases`
WHERE id = 12345
ORDER BY _cdc_timestamp ASC
The result told the whole story:
id | created_at | updated_at | _cdc_timestamp | _cdc_change_type
------|---------------------|---------------------|---------------------|------------------
12345 | 2026-01-10 09:15:00 | 2026-01-10 09:15:00 | 2026-01-10 09:15:02 | INSERT
12345 | 2026-01-10 09:15:00 | 2026-01-10 09:22:00 | 2026-01-10 09:22:01 | UPDATE
12345 | 2026-01-12 14:30:00 | 2026-01-12 14:30:00 | 2026-01-12 14:30:03 | UPDATE
Look at the third row. The created_at changed from 2026-01-10 09:15:00 to 2026-01-12 14:30:00 — two days after the original creation. The application was mutating created_at as part of a business workflow. In this case, when a case was "resubmitted" after being rejected, the application reset created_at to the resubmission time.
This is a legitimate application feature — the product team wants resubmitted cases to behave as if they are new. But from a data warehouse perspective, it breaks temporal integrity. Bids that were created on January 10th (against the original submission) now appear to predate their parent case, which "started" on January 12th according to the mutated created_at.
The Scope of the Problem
I queried to find how many records were affected:
-- Find all cases where created_at was mutated
WITH first_version AS (
SELECT
id,
created_at AS original_created_at,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY _cdc_timestamp ASC) AS rn
FROM `raw_project.raw_dataset.cases`
),
current_version AS (
SELECT
id,
created_at AS current_created_at,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY _cdc_timestamp DESC) AS rn
FROM `raw_project.raw_dataset.cases`
)
SELECT COUNT(*) AS mutated_count
FROM first_version f
JOIN current_version c ON f.id = c.id
WHERE f.rn = 1 AND c.rn = 1
AND f.original_created_at != c.current_created_at
Over 800 records had their created_at mutated at least once. This was not an edge case — it was a routine part of the application's workflow.
The Fix: Use the Original created_at From CDC History
The fix was straightforward: instead of trusting the current created_at value from the application, look it up from the raw CDC history. The first created_at value ever recorded for a given ID is the true creation timestamp.
I implemented this in the dbt staging model using a CTE that queries the raw CDC table:
-- models/staging/stg_cases.sql
WITH source AS (
SELECT *
FROM {{ source('app', 'cases') }}
WHERE _cdc_change_type != 'DELETE'
),
-- Look up the ORIGINAL created_at from CDC history.
-- The application mutates created_at on resubmission,
-- which breaks parent-child temporal integrity.
original_created_at AS (
SELECT
id,
MIN(created_at) AS created_at_original
FROM {{ source('app_cdc', 'cases') }}
GROUP BY id
),
final AS (
SELECT
s.id,
s.case_id,
s.status,
-- Use the original created_at, not the (possibly mutated) current value
COALESCE(
oca.created_at_original,
s.created_at
) AS created_at,
-- Keep the mutated value available for reference
s.created_at AS created_at_application,
s.updated_at,
s.submitted_at,
s.deleted_at
FROM source s
LEFT JOIN original_created_at oca
ON s.id = oca.id
)
SELECT * FROM final
Key design decisions in this fix:
MIN(created_at)instead of "first CDC row": UsingMINis simpler and more resilient than ordering by CDC timestamp and taking the first row. If the CDC history is incomplete (e.g., we started capturing changes after the record was created),MINstill gives us the earliest known value.COALESCEfallback: If the CDC history lookup returns NULL (rare, but possible for very old records predating our CDC setup), fall back to the current application value. Some data is better than no data.- Keep both values: The original
created_atis used for all downstream logic, but I also exposecreated_at_applicationfor analysts who might need to understand the resubmission timeline. - Fix it in staging, not downstream: This correction belongs in the staging layer because every downstream model that uses
created_atshould get the corrected value. Fixing it further downstream would mean playing whack-a-mole across dozens of models.
The CDC Source Configuration
One practical challenge: the raw CDC table lives in a different dataset than the current-state snapshot. I added it as a separate source in dbt:
# models/staging/sources.yml
sources:
- name: app
database: "{{ var('raw_project') }}"
schema: app_data
tables:
- name: cases
description: "Current state snapshot of cases"
- name: app_cdc
database: "{{ var('raw_project') }}"
schema: app_data_cdc
tables:
- name: cases
description: "Full CDC history of cases (all mutations)"
Performance Considerations
The CDC history table is significantly larger than the current-state table — roughly 10x the row count. The MIN(created_at) GROUP BY id aggregation scans the entire history table. For our data volumes (~2M CDC rows), this adds about 4 seconds to the staging model build time. At larger scale, you would want to:
- Partition the CDC table by
_cdc_timestampand only scan relevant partitions - Materialise the
original_created_atlookup as a separate incremental model that only processes new CDC rows - Or push this logic to the ingestion layer and store
original_created_atas a separate column at write time
Lessons Learned
- Never trust
created_atfrom application databases. It sounds obvious, butcreated_atfeels like an immutable field. It is not. Applications can and do mutate it for business reasons. Always verify with your application team. - CDC history is your source of truth for temporal data. If you have CDC set up, you have a tamper-proof record of what the data looked like at every point in time. Use it.
- dbt tests that enforce temporal integrity catch real bugs. The "child cannot be older than parent" test sounded pedantic when I wrote it. It caught a real data quality issue that would have silently broken time-based metrics like "time to first bid".
- Fix data quality issues at the lowest possible layer. Correcting
created_atin the staging model means every downstream model gets the fix for free. Doing it in individual reporting models would be fragile and incomplete. - Keep the original value accessible. Do not discard the application's version of the field. Someone will eventually need it, and having both values makes debugging much easier.
After deploying the fix, all 12 dbt test failures resolved immediately, and our "time to first bid" metric shifted by about 2% — a small but meaningful correction that had been silently skewing our reporting for months.
Categories: Data Engineering, dbt, Lessons Learned
