The Schema Drift Problem
Schema drift is what happens when the shape of your data changes without anyone noticing:
- A vendor adds a new column to their CSV export
- A partner team changes a field from integer to string
- A Parquet file loses a column because upstream changed their projection
- A JSON API response adds a nested object where there used to be a flat value
None of these break the pipeline at load time. CSV parsers are lenient — they'll read a text value into a column that used to contain numbers. Parquet readers skip missing columns silently. The pipeline runs green. The data loads. Then two days later, someone notices the analytics dashboard is wrong.
This is why schema validation belongs in CI, not in production.
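To make the "lenient parser" point concrete, here is a minimal sketch using only Python's standard `csv` module. The two sample exports and the helper names `load_rows` and `non_integer_values` are made up for illustration; they are not part of DataMorph.

```python
import csv
import io

# Yesterday's export: 'age' holds numbers.
old_export = "user_id,age\n1,34\n2,27\n"
# Today's export: the vendor started sending text in 'age'.
new_export = "user_id,age\n1,thirty-four\n2,27\n"

def load_rows(text):
    """Parse CSV text the way a lenient loader would: every value is a string."""
    return list(csv.DictReader(io.StringIO(text)))

def non_integer_values(rows, column):
    """Return the values in `column` that cannot be parsed as integers."""
    bad = []
    for row in rows:
        try:
            int(row[column])
        except ValueError:
            bad.append(row[column])
    return bad

old_rows = load_rows(old_export)
new_rows = load_rows(new_export)  # no exception: the parser does not check types

print(non_integer_values(old_rows, "age"))  # []
print(non_integer_values(new_rows, "age"))  # ['thirty-four']
```

Both files parse without complaint. The drift only shows up when something explicitly checks the values against an expected type.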
How DataMorph Validate Works
DataMorph's validate command checks data files against an expected schema and exits with code 1 when validation reports errors:
# Step 1: Generate a schema from your known-good data
datamorph schema users.csv --json-output > schemas/users.json
# Step 2: Validate new data against the expected schema
datamorph validate users.csv --schema schemas/users.json
# Output if valid:
# ✓ VALID — 10,000 rows checked
# Output if schema drifted:
# ✗ INVALID — 10,000 rows checked
#
# Errors:
# • Row 1: field 'age' expected integer but got string
# • Row 45: missing required field 'email'
The exit code makes it CI-friendly — if the data doesn't match the schema, the command fails and the pipeline stops.
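If the gate is driven from a Python script rather than a shell step, the same exit-code convention can be checked with `subprocess`. This is a rough sketch: `passes_gate` is a hypothetical helper, and the two stand-in commands exist only so the example runs without DataMorph installed.

```python
import subprocess
import sys

def passes_gate(command):
    """Run a command and report whether it exited with code 0."""
    completed = subprocess.run(command, check=False)
    return completed.returncode == 0

# Stand-in commands that exit with 0 and 1, mimicking a pass and a failure.
ok_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
fail_cmd = [sys.executable, "-c", "raise SystemExit(1)"]

print(passes_gate(ok_cmd))    # True
print(passes_gate(fail_cmd))  # False

# In a real job the command would be the one shown above, for example:
# passes_gate(["datamorph", "validate", "users.csv", "--schema", "schemas/users.json"])
```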
Two Validation Modes
Without a schema file (structural validation):
datamorph validate data.csv
# Checks: file is readable, format is detected, columns are consistent
# No schema file needed — uses inferred schema from the data itself
# Good for: "is this file even parseable?"
With a schema file (schema validation):
datamorph validate data.csv --schema expected-schema.json
# Checks: all expected fields present, types match, no unexpected fields (in strict mode)
# Requires a schema file generated by `datamorph schema`
# Good for: "does this file still have the shape we expect?"
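As a rough picture of what a structural check like "columns are consistent" might involve, the sketch below flags CSV rows whose field count differs from the header. This is an assumption about the kind of check meant, written with Python's standard library; it is not DataMorph's implementation, and `inconsistent_rows` is a hypothetical name.

```python
import csv
import io

def inconsistent_rows(text):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    problems = []
    for row in reader:
        if len(row) != len(header):
            # reader.line_num is the number of source lines read so far
            problems.append((reader.line_num, len(row)))
    return problems

good = "id,email\n1,a@example.com\n2,b@example.com\n"
ragged = "id,email\n1,a@example.com\n2\n3,c@example.com,extra\n"

print(inconsistent_rows(good))    # []
print(inconsistent_rows(ragged))  # [(3, 1), (4, 3)]
```

No schema file is involved here: the header row is the only reference, which is why this kind of check answers "is the file parseable?" and not "is it the shape we expect?".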
Strict Mode: Catch Type Mismatches
By default, validate only errors on hard failures (missing fields, unparseable files). Type mismatches produce warnings, not errors. In --strict mode, type mismatches become errors:
# Warning mode (default) — type mismatches are warnings
datamorph validate users.csv --schema users.json
# ✗ INVALID — 10,000 rows checked
# Warnings:
# • Row 1: field 'age' expected integer but got string
# Strict mode — type mismatches are errors, exit code 1
datamorph validate users.csv --schema users.json --strict
# ✗ INVALID — 10,000 rows checked
# Errors:
# • Row 1: field 'age' expected integer but got string
# • Row 45: missing required field 'email'
When to use strict mode: Use --strict in CI pipelines where any deviation from the expected schema should block the deploy. Use default mode in monitoring/reporting where you want to be notified of drift but not break the pipeline.
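The default and strict rules described above can be modeled in a few lines. This is a toy model of the rules as this article states them, not DataMorph's code. The finding kinds `missing_field` and `type_mismatch` are hypothetical labels, and the sketch assumes that warnings alone leave the exit code at 0, which is how the guidance above reads.

```python
def classify(findings, strict=False):
    """Split (kind, message) findings into errors and warnings."""
    errors, warnings = [], []
    for kind, message in findings:
        if kind == "missing_field":
            errors.append(message)  # hard failure in both modes
        elif kind == "type_mismatch":
            # warning by default, promoted to an error in strict mode
            (errors if strict else warnings).append(message)
        else:
            raise ValueError(f"unknown finding kind: {kind}")
    return errors, warnings

def exit_code(findings, strict=False):
    """Return 1 if any finding counts as an error in the chosen mode, else 0."""
    errors, _ = classify(findings, strict)
    return 1 if errors else 0

drift = [("type_mismatch", "Row 1: field 'age' expected integer but got string")]

print(exit_code(drift))               # 0
print(exit_code(drift, strict=True))  # 1
```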
CI/CD Integration: Four Pipeline Patterns
Pattern 1: Validate Data Files on PR
Block PRs that change data files if the schema doesn't match the committed schema:
# .github/workflows/validate-data.yml
name: Validate Data Schemas
on:
  pull_request:
    paths: ['data/**']
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install DataMorph
        run: pip install datamorph-cli
      - name: Validate all CSV files against schemas
        run: |
          for file in data/*.csv; do
            [ -e "$file" ] || continue  # glob matched nothing
            schema="schemas/$(basename "$file" .csv).json"
            if [ -f "$schema" ]; then
              echo "Validating $file against $schema"
              datamorph validate "$file" --schema "$schema" --strict
            else
              # The generated file lives only in this CI run; commit it to enforce it
              echo "WARNING: No schema for $file, generating one"
              datamorph schema "$file" --json-output > "$schema"
            fi
          done
      - name: Validate Parquet files
        run: |
          for file in data/*.parquet; do
            [ -e "$file" ] || continue  # glob matched nothing
            schema="schemas/$(basename "$file" .parquet).json"
            datamorph validate "$file" --schema "$schema" --strict
          done
Pattern 2: Schema Drift Detection on Schedule
Run a daily check that validates data files against their committed schemas. If a vendor changes their export format overnight, you find out at 9 AM — not when the quarterly report is wrong.
# .github/workflows/schema-drift-detect.yml
name: Schema Drift Detection
on:
  schedule:
    - cron: '0 9 * * *' # 09:00 UTC daily (GitHub Actions cron runs in UTC)
  workflow_dispatch:
jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install DataMorph
        run: pip install datamorph-cli
      - name: Download latest data files
        run: |
          # Replace with your data source
          aws s3 cp s3://data-lake/raw/ ./data/ --recursive
      - name: Validate schemas
        run: |
          FAILED=0
          for file in data/*.csv; do
            schema="schemas/$(basename "$file" .csv).json"
            if ! datamorph validate "$file" --schema "$schema" --strict; then
              FAILED=1
              echo "::warning::$file has schema drift"
            fi
          done
          exit $FAILED
      - name: Alert on drift
        if: failure()
        env:
          # Assumes a repository secret named SLACK_WEBHOOK
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
        run: |
          curl -X POST "$SLACK_WEBHOOK" \
            -H "Content-Type: application/json" \
            -d '{"text":"⚠️ Schema drift detected in data files. Check the latest CI run."}'
Pattern 3: Validate Before ETL Load
Gate your ETL pipeline on schema validation — don't load data that doesn't match expectations:
#!/bin/bash
# pre-etl-validate.sh: run before loading data into the warehouse
set -e
DATA_DIR="${1:?usage: pre-etl-validate.sh DATA_DIR}"
SCHEMA_DIR="schemas"
for file in "$DATA_DIR"/*; do
  filename=$(basename "$file")
  name="${filename%.*}"
  schema="$SCHEMA_DIR/${name}.json"
  if [ ! -f "$schema" ]; then
    echo "ERROR: No schema for $filename. Run: datamorph schema $file --json-output > $schema"
    exit 1
  fi
  echo "Validating $filename..."
  if ! datamorph validate "$file" --schema "$schema" --strict; then
    echo "ERROR: $filename failed schema validation. Not loading."
    exit 1
  fi
done
echo "✓ All files passed validation. Safe to load."
Pattern 4: JSON Output for Programmatic Processing
Use --json-output to pipe validation results into monitoring systems:
datamorph validate users.csv --schema users.json --json-output
# Output:
# {
# "valid": false,
# "rows_checked": 10000,
# "errors": [
# "Row 1: field 'age' expected integer but got string"
# ],
# "warnings": [
# "Row 45: unexpected field 'phone'"
# ]
# }
# Pipe to monitoring
datamorph validate users.csv --schema users.json --json-output | \
python3 -c "
import sys, json
result = json.load(sys.stdin)
if not result['valid']:
    print(f'ALERT: {len(result[\"errors\"])} schema errors in users.csv')
    for err in result['errors'][:5]:
        print(f'  - {err}')
"
Schema Management: Create, Commit, Update
The workflow is simple: generate schemas from known-good data, commit them to your repo, and validate new data against them.
Step 1: Generate Schemas from Known-Good Data
# Generate schema for each data file
datamorph schema users.csv --json-output > schemas/users.json
datamorph schema orders.parquet --json-output > schemas/orders.json
datamorph schema events.jsonl --json-output > schemas/events.json
The schema file is a JSON array of {"name": "field_name", "type": "inferred_type"} objects:
[
{"name": "user_id", "type": "integer"},
{"name": "email", "type": "string"},
{"name": "age", "type": "integer"},
{"name": "created_at", "type": "string"},
{"name": "is_active", "type": "boolean"}
]
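To show how a schema file of this shape can be consumed, here is a minimal sketch that loads the array into a name-to-type mapping and checks one record against it. It covers only the three type names that appear in the example above. The mapping to Python types, the helper names, and the message wording are assumptions for illustration, not DataMorph's behavior.

```python
import json

SCHEMA_JSON = """
[
  {"name": "user_id", "type": "integer"},
  {"name": "email", "type": "string"},
  {"name": "age", "type": "integer"},
  {"name": "created_at", "type": "string"},
  {"name": "is_active", "type": "boolean"}
]
"""

# Assumed mapping from the schema's type names to Python types.
PYTHON_TYPES = {"integer": int, "string": str, "boolean": bool}

def load_schema(text):
    """Turn the JSON array into a {field_name: type_name} dict."""
    return {field["name"]: field["type"] for field in json.loads(text)}

def check_record(record, schema):
    """Return a list of problems found in one record."""
    problems = []
    for name, type_name in schema.items():
        if name not in record:
            problems.append(f"missing required field '{name}'")
            continue
        value = record[name]
        expected = PYTHON_TYPES[type_name]
        # bool is a subclass of int in Python, so reject True/False for "integer"
        matches = isinstance(value, expected) and not (
            expected is int and isinstance(value, bool)
        )
        if not matches:
            problems.append(
                f"field '{name}' expected {type_name} but got {type(value).__name__}"
            )
    return problems

schema = load_schema(SCHEMA_JSON)
record = {"user_id": 1, "age": "34", "created_at": "2024-01-01", "is_active": True}

print(check_record(record, schema))
# ["missing required field 'email'", "field 'age' expected integer but got str"]
```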
Step 2: Commit Schemas to Your Repo
git add schemas/
git commit -m "Add data schemas for CI validation"
Step 3: Update Schemas When Data Changes Intentionally
When a schema change is intentional (e.g., a new column is added by design), update the schema file:
# Re-generate schema from the updated data
datamorph schema users.csv --json-output > schemas/users.json
# Or manually add the new field
# Edit schemas/users.json to add {"name": "phone", "type": "string"}
git add schemas/users.json
git commit -m "Update users schema: add phone field"
Schema files are version-controlled. When a PR changes data and the schema, the schema diff in the PR makes it obvious what changed. Reviewers see the intentional schema update alongside the data change, instead of discovering a surprise type mismatch in production.
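When a raw text diff of the schema file is hard to scan, a small script can summarize what changed between two versions. The sketch below assumes the array-of-objects schema format shown earlier; `diff_schemas` and the sample schemas are hypothetical.

```python
import json

def diff_schemas(old_text, new_text):
    """Compare two schema files and report added, removed, and retyped fields."""
    old = {f["name"]: f["type"] for f in json.loads(old_text)}
    new = {f["name"]: f["type"] for f in json.loads(new_text)}
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "retyped": sorted(
            (name, old[name], new[name])
            for name in old.keys() & new.keys()
            if old[name] != new[name]
        ),
    }

old_schema = '[{"name": "user_id", "type": "integer"}, {"name": "age", "type": "integer"}]'
new_schema = (
    '[{"name": "user_id", "type": "integer"}, '
    '{"name": "age", "type": "string"}, '
    '{"name": "phone", "type": "string"}]'
)

print(diff_schemas(old_schema, new_schema))
# {'added': ['phone'], 'removed': [], 'retyped': [('age', 'integer', 'string')]}
```

A summary like this could be posted as a PR comment so reviewers see the intended schema change in one line per field.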
What Validate Catches vs. What It Doesn't
| Schema Drift Type | Caught by validate? | Mode Required |
|---|---|---|
| Field changes type (integer → string) | ✓ Yes | --strict |
| Required field goes missing | ✓ Yes | Default |
| New unexpected field appears | ⚠ Warning | --strict |
| File is unparseable / corrupt | ✓ Yes | Default |
| Columns are inconsistent across rows | ✓ Yes | Default |
| Semantic errors (valid string but wrong value) | ✗ No | N/A |
| Business logic violations (age > 150) | ✗ No | N/A |
DataMorph validates structure, not semantics. It catches "the age field is now a string instead of an integer" but not "the age field contains 999." For semantic validation, use a tool like Great Expectations alongside DataMorph — structural checks in CI, semantic checks in your data quality layer.
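For contrast, a semantic check looks at values, not types. The following is a plain-Python sketch of the "age > 150" rule from the table. It does not use the Great Expectations API, and the 0 to 150 range and the `semantic_problems` helper are illustrative assumptions.

```python
def semantic_problems(rows, low=0, high=150):
    """Flag rows whose 'age' is a valid integer but outside a plausible range."""
    problems = []
    for index, row in enumerate(rows, start=1):
        age = row["age"]
        if not low <= age <= high:
            problems.append(f"Row {index}: age {age} outside {low}-{high}")
    return problems

# Both rows pass a structural check: 'age' is an integer in each.
rows = [{"age": 34}, {"age": 999}]

print(semantic_problems(rows))  # ['Row 2: age 999 outside 0-150']
```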
Multi-Format Validation: Same Schema Logic, Any File Type
DataMorph validates the same schema logic across all supported formats:
# CSV validation
datamorph validate users.csv --schema users.json --strict
# Parquet validation (same schema)
datamorph validate users.parquet --schema users.json --strict
# JSON Lines validation
datamorph validate users.jsonl --schema users.json --strict
# Avro validation
datamorph validate users.avro --schema users.json --strict
This matters when your pipeline converts between formats. If you convert users.csv → users.parquet for warehouse loading, validate both files against the same schema. If the Parquet file has a different shape than the CSV source, the conversion introduced a bug.
Batch Validation: Check Entire Directories
# Validate every CSV in a directory tree against its schema
shopt -s globstar  # bash 4+: without this, ** behaves like a single *
for file in data/**/*.csv; do
  name=$(basename "$file" .csv)
  datamorph validate "$file" --schema "schemas/${name}.json" --strict
done
# Or use the batch command for conversion + validation
datamorph batch data/ output/ --format parquet --validate
Three Real Schema Drift Scenarios
Scenario 1: Vendor CSV Format Change
Your payment processor sends a daily CSV of transactions. On Tuesday, they add a currency column and change amount from integer to float (e.g., 19.99 instead of 1999 cents). Your ETL loads the float values as integers — now every transaction is off by 100x.
With DataMorph: The CI validation catches the type change on amount (integer → float) and the new field. The pipeline fails with a clear error message. You update the schema and adjust the ETL before the data loads.
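The arithmetic behind the 100x error is easy to reproduce. This sketch assumes the old loader truncates to an integer and treats the result as cents; the real ETL's behavior may differ, and both function names are hypothetical.

```python
from decimal import Decimal

def naive_cents(raw):
    """Old loader: assumes the column already holds integer cents."""
    return int(float(raw))

def dollars_to_cents(raw):
    """Adjusted loader: the column now holds decimal dollars."""
    return int(Decimal(raw) * 100)

print(naive_cents("1999"))        # 1999 (Monday's file, correct)
print(naive_cents("19.99"))       # 19   (Tuesday's file, read as 19 cents)
print(dollars_to_cents("19.99"))  # 1999 (what the value means)
```

A 19.99 transaction is stored as 19 cents instead of 1999 cents, which is roughly the 100x error described above, and the load raises no exception.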
Scenario 2: Partner Team Drops a Column
The user service team removes the phone field from their user export because they're migrating to a separate phone verification service. They don't tell your team. Your analytics pipeline loads the data — the phone column is just missing, not errored. Your phone segmentation report silently shows zero users with phones.
With DataMorph in strict mode: datamorph validate users.csv --schema users.json --strict reports "missing required field 'phone'". The pipeline fails. You either restore the field from the new phone service, or remove phone from the committed schema if it is no longer expected.
Scenario 3: Parquet Schema Evolution Without Backward Compatibility
Your ML training pipeline reads Parquet files from S3. Someone changes the feature extraction code, which renames user_age_bucket to age_group and changes it from string to integer (0, 1, 2 instead of "18-25", "26-35", "36-45"). The new Parquet files are written alongside the old ones. Your model training reads both and gets a type error — but only on some batches, making it hard to debug.
With DataMorph: Validate both old and new Parquet files against the committed schema. The new files fail validation (field name change + type change). You catch the incompatibility before mixing old and new data in training.
Install DataMorph
# pip
pip install datamorph-cli
# Homebrew (macOS / Linux)
brew tap Coding-Dev-Tools/tap
brew install datamorph
# Scoop (Windows)
scoop bucket add Coding-Dev-Tools https://github.com/Coding-Dev-Tools/scoop-bucket
scoop install datamorph