Data & ArchitectureApril 6, 20269 min read

What Is Data Hygiene and Why Does It Matter for AI

Data hygiene means your data is accurate, consistent, and documented. Most data stacks are one of those three things.

That is the gap. The practical definition is simple. The distance between that definition and reality is where most AI projects quietly fail.

What data hygiene actually means

Not the consulting version. The practical one.

Accurate: the values in your tables reflect reality. If a campaign ran from January to March, the date range in your data matches that. If a customer is categorized as "Enterprise," they meet whatever criteria your company uses to define Enterprise.

Consistent: the same thing is named the same way everywhere. "Facebook Newsfeed" is not "fb_newsfeed" in one table and "Facebook Feed" in another. The logic that maps a raw value to a business label is applied the same way in every query that touches it.

Documented: when a new engineer looks at a transformation, they can tell what it does, why it does it that way, and when it was last updated. Not from asking someone. From reading the code.

Most data stacks are accurate in some places, inconsistent in others, and documented nowhere. That combination compounds.

Where it breaks down: ingestion

The first place data hygiene fails is at the point where data enters your stack.

Raw source data comes in with whatever naming conventions, formats, and assumptions the upstream system used. Sometimes that upstream system is a third-party API. Sometimes it's a form someone built in 2018 that nobody has touched since. Sometimes it's a CRM where sales reps enter freetext values in fields that were supposed to be dropdowns.

The ingestion layer is supposed to normalize this. Load the raw data, land it in a staging table, start the cleaning process.

In practice, the cleaning process often doesn't go far enough. You get data that's technically loaded but still carries the inconsistencies from the source. Those get passed downstream.

Common examples: date fields arriving as strings in inconsistent formats. Null values in columns where null means three different things depending on the system. Product codes with trailing spaces that don't match the lookup table. Customer IDs formatted differently between the CRM export and the billing export.

Each one is small. They add up.

Where it breaks down: transformation

The transformation layer is where most hygiene problems live permanently.

This is the dbt layer, the SQL views, the CASE statements that translate raw campaign codes into business labels. It's where your data team turns source data into something the business can use. It's also where logic accumulates, drifts, and goes undocumented.

A few patterns that show up repeatedly:

Stale mappings. A CASE statement was written eighteen months ago to classify placement types. Since then, three new placement codes have been added. Nobody updated the CASE statement. Every new placement code falls into the ELSE branch and gets labeled "Other." The data is wrong. It's not obviously wrong. It quietly degrades reporting until someone looks closely enough to notice.

Duplicate logic. The same classification logic exists in two different models, implemented slightly differently. One was written by the engineer who built the original pipeline. The other was written six months later by someone who didn't know the first one existed. Both are in production. They sometimes disagree.

Undocumented exceptions. Someone added a WHERE clause to exclude a specific account after an executive flagged it as a test account. The clause is still there three years later. No one knows if that account is still a test account. No one knows why the numbers in that report are consistently lower than expected.

None of this shows up as an error. The pipeline runs. The tables populate. The data looks fine until someone asks the right question.

Where it breaks down: documentation

Documentation is the most neglected piece.

Most data teams treat it as something you write after you've finished building. So it either doesn't exist, or it describes what the model did when it was first written, not what it does now.

The result is that institutional knowledge lives in Slack threads and in the heads of specific engineers. When one of those engineers leaves, or is unavailable during the meeting where someone is questioning a number, the documentation gap becomes a business problem.

The documentation problem also makes accuracy and consistency problems harder to fix. If you don't know what a transformation is supposed to do, you can't tell whether it's doing it correctly. If you can't read the intent behind a CASE statement, you can't update it without risking a regression.

Bad documentation doesn't just slow you down. It makes every other hygiene problem harder to solve.

Why it matters more when AI is in the loop

Traditional BI tools have a specific failure mode: wrong numbers. A dashboard shows the wrong value. A user notices. They file a ticket. Engineering investigates. Eventually the transformation logic gets fixed.

The feedback loop is slow, but it exists.

AI systems have a different failure mode. They don't show you wrong numbers. They answer questions confidently using whatever data they have access to. If the mapping table says "Facebook Newsfeed" in some places and "fb_newsfeed" in others, the AI picks one and keeps going. If your ELSE branch is catching 15% of your placements and labeling them "Other," the AI's answer about placement performance will be wrong in a way that doesn't look wrong. It will look like an answer.

The confidence doesn't change based on data quality. That's the part that matters.

With a BI dashboard, trust erodes slowly. Users start double-checking numbers. They add caveats when presenting. They stop relying on it for decisions.

With an AI system, the trust erosion happens faster and it's harder to trace. If someone asks which campaigns drove the most conversions last month and gets an answer that's off because three campaign codes fell into the ELSE branch, they might act on it. They might not notice until the numbers don't match something else. They might never notice.

Bad data hygiene was a slow tax on your reporting. In an AI context, it's a faster-acting one.

What fixing it actually looks like

The standard advice is to audit your data, document everything, and clean up the transformation layer. That advice is correct and almost nobody follows it because it competes with building new features.

The practical version is smaller.

Start with the CASE statements and mapping tables that feed the reports your AI is actually using. Not all of them. The specific ones. Map what exists, find where the ELSE branches are catching real values, and fix those first. That addresses the active accuracy problems without requiring a full hygiene project.

Then document what you just fixed while you're in the code. Not a separate documentation sprint. Right now, in the same PR. One sentence explaining what the CASE statement is doing and what the ELSE branch is supposed to catch.

Consistency problems usually require a conversation with whoever owns the upstream sources. That conversation is harder, but it's worth having before the AI is making decisions from conflicting values across tables.

The goal is not a perfect data stack. That does not exist. The goal is a stack where the problems are known, the active ones are fixed, and the logic is documented enough that someone can maintain it without relying on institutional memory.

That's what hygiene means in practice. Not pristine. Maintained.

Free Checklist

20-Point Data Hygiene Audit

for AI-Ready Teams

Work through this before you layer AI on top of your warehouse. Score one point per item. 17+ is production-ready. Under 12 means fix before you scale.

Email only. No other fields.

JESTR maintains your transformation logic in plain language so bad data doesn't compound. Learn more.