Catch Data Schema Drift in CI: Validate CSV and Parquet Before It Breaks Production

The Schema Drift Problem

Schema drift is what happens when the shape of your data changes without anyone noticing:

A vendor adds a new column to their CSV export
A partner team changes a field from integer to string
A Parquet file loses a column because upstream changed their projection
A JSON API response adds a nested object where there used to be a flat value

None of these necessarily breaks the pipeline at load time. CSV parsers are lenient: they'll read a text value into a column that used to contain numbers. Readers that tolerate schema evolution can return nulls for a missing Parquet column instead of failing. The pipeline runs green. The data loads. Then two days later, someone notices the analytics dashboard is wrong.
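
To make the leniency concrete, here is a minimal sketch using Python's standard csv module. The two exports and the helper names (load, non_integer_values) are hypothetical, made up for illustration; the point is only that the parser returns strings either way and never raises when a numeric column turns into text.

```python
import csv
import io

# Hypothetical vendor exports: same columns, but 'age' drifts from number to text.
MONDAY = "user_id,age\n1,34\n2,41\n"
TUESDAY = "user_id,age\n1,thirty-four\n2,41\n"


def load(text):
    """Read CSV text into a list of dicts; every value comes back as a string."""
    return list(csv.DictReader(io.StringIO(text)))


def non_integer_values(rows, field):
    """Return (row_number, value) pairs where the field does not parse as an integer."""
    bad = []
    for row_number, row in enumerate(rows, start=1):
        try:
            int(row[field])
        except ValueError:
            bad.append((row_number, row[field]))
    return bad


monday_rows = load(MONDAY)
tuesday_rows = load(TUESDAY)  # no exception: the parser has no idea 'age' used to be numeric

print(non_integer_values(monday_rows, "age"))   # []
print(non_integer_values(tuesday_rows, "age"))  # [(1, 'thirty-four')]
```

Both files load without complaint; the drift only shows up once something checks the values explicitly.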

This is why schema validation belongs in CI, not in production.

How DataMorph Validate Works

DataMorph's validate command checks data files against an expected schema and exits with code 1 on mismatches:

# Step 1: Generate a schema from your known-good data
datamorph schema users.csv --json-output > schemas/users.json

# Step 2: Validate new data against the expected schema
datamorph validate users.csv --schema schemas/users.json

# Output if valid:
# ✓ VALID — 10,000 rows checked

# Output if schema drifted:
# ✗ INVALID — 10,000 rows checked
#
# Errors:
#   • Row 1: field 'age' expected integer but got string
#   • Row 45: missing required field 'email'

The exit code makes it CI-friendly — if the data doesn't match the schema, the command fails and the pipeline stops.
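
If your pipeline is driven from Python instead of a shell step, the same exit-code contract can be consumed with subprocess. This is a sketch, not part of DataMorph: the gate function is a hypothetical helper, and the demo uses stand-in commands so it runs without the CLI installed.

```python
import subprocess
import sys


def gate(command):
    """Run a command; return 0 if it exited cleanly, 1 otherwise (suitable for sys.exit)."""
    result = subprocess.run(command)
    return 0 if result.returncode == 0 else 1


# Stand-ins for a passing and a failing validator, so the sketch runs anywhere.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

print(gate(passing), gate(failing))  # 0 1
```

In a real job you would pass the command shown above, for example gate(["datamorph", "validate", "users.csv", "--schema", "schemas/users.json"]), and hand the result to sys.exit so the CI step fails when validation does.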

Two Validation Modes

Without a schema file (structural validation):

datamorph validate data.csv
# Checks: file is readable, format is detected, columns are consistent
# No schema file needed — uses inferred schema from the data itself
# Good for: "is this file even parseable?"

With a schema file (schema validation):

datamorph validate data.csv --schema expected-schema.json
# Checks: all expected fields present, types match, no unexpected fields (in strict mode)
# Requires a schema file generated by `datamorph schema`
# Good for: "does this file still have the shape we expect?"
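
As an illustration of what "columns are consistent" can mean, the sketch below flags rows whose field count differs from the header. It is an assumption about the kind of check involved, written with the standard csv module, and not DataMorph's actual implementation; inconsistent_rows is a hypothetical name.

```python
import csv
import io


def inconsistent_rows(text):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return [
        (reader.line_num, len(row))
        for row in reader
        if len(row) != len(header)
    ]


sample = "id,name,age\n1,Ada,36\n2,Grace\n3,Linus,54,extra\n"
print(inconsistent_rows(sample))  # [(3, 2), (4, 4)]
```

A check like this needs no schema file: the header is the only reference it has, which is why it answers "is this parseable?" and not "is this the shape we expect?".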

Strict Mode: Catch Type Mismatches

By default, validate only errors on hard failures (missing fields, unparseable files). Type mismatches produce warnings, not errors. In --strict mode, type mismatches become errors:

# Warning mode (default) — type mismatches are warnings
datamorph validate users.csv --schema users.json
# ✗ INVALID — 10,000 rows checked
# Warnings:
#   • Row 1: field 'age' expected integer but got string

# Strict mode — type mismatches are errors, exit code 1
datamorph validate users.csv --schema users.json --strict
# ✗ INVALID — 10,000 rows checked
# Errors:
#   • Row 1: field 'age' expected integer but got string
#   • Row 45: missing required field 'email'

When to use strict mode: Use --strict in CI pipelines where any deviation from the expected schema should block the deploy. Use default mode in monitoring/reporting where you want to be notified of drift but not break the pipeline.
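
The policy described above can be modelled in a few lines. This is a toy model of the behaviour as this article describes it, not DataMorph's code; the finding kinds ("missing_field", "unparseable_file", "type_mismatch") and the triage function are hypothetical labels chosen for the sketch.

```python
# Kinds that fail the run in every mode, per the description above.
HARD_FAILURES = {"missing_field", "unparseable_file"}


def triage(findings, strict=False):
    """Split (kind, message) findings into errors and warnings and derive an exit code."""
    errors, warnings = [], []
    for kind, message in findings:
        if kind in HARD_FAILURES or strict:
            errors.append(message)
        else:
            warnings.append(message)
    exit_code = 1 if errors else 0
    return exit_code, errors, warnings


findings = [("type_mismatch", "Row 1: field 'age' expected integer but got string")]

print(triage(findings)[0])               # 0: reported as a warning only
print(triage(findings, strict=True)[0])  # 1: promoted to an error
```

The same finding either passes with a warning or blocks the run, which is the whole difference between the monitoring and CI use cases.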

CI/CD Integration: Four Pipeline Patterns

Pattern 1: Validate Data Files on PR

Block PRs that change data files when the data doesn't match the committed schema:

# .github/workflows/validate-data.yml
name: Validate Data Schemas

on:
  pull_request:
    paths: ['data/**', 'schemas/**']

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install DataMorph
        run: pip install datamorph-cli

      - name: Validate all CSV files against schemas
        run: |
          shopt -s nullglob  # skip the loop when no CSV files exist
          for file in data/*.csv; do
            schema="schemas/$(basename "$file" .csv).json"
            if [ -f "$schema" ]; then
              echo "Validating $file against $schema"
              datamorph validate "$file" --schema "$schema" --strict
            else
              # A schema generated on the runner is discarded with it, so only warn here.
              echo "::warning::No schema for $file. Run: datamorph schema $file --json-output > $schema and commit it"
            fi
          done

      - name: Validate Parquet files
        run: |
          shopt -s nullglob  # skip the loop when no Parquet files exist
          for file in data/*.parquet; do
            schema="schemas/$(basename "$file" .parquet).json"
            if [ -f "$schema" ]; then
              datamorph validate "$file" --schema "$schema" --strict
            else
              echo "::warning::No schema for $file"
            fi
          done

Pattern 2: Schema Drift Detection on Schedule

Run a daily check that validates data files against their committed schemas. If a vendor changes their export format overnight, you find out at 9 AM UTC (GitHub Actions evaluates cron schedules in UTC), not when the quarterly report is wrong.

# .github/workflows/schema-drift-detect.yml
name: Schema Drift Detection

on:
  schedule:
    - cron: '0 9 * * *'  # 09:00 UTC daily
  workflow_dispatch:

jobs:
  drift-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install DataMorph
        run: pip install datamorph-cli

      - name: Download latest data files
        run: |
          # Replace with your data source
          aws s3 cp s3://data-lake/raw/ ./data/ --recursive

      - name: Validate schemas
        run: |
          shopt -s nullglob  # skip the loop when no CSV files were downloaded
          FAILED=0
          for file in data/*.csv; do
            schema="schemas/$(basename "$file" .csv).json"
            if ! datamorph validate "$file" --schema "$schema" --strict; then
              FAILED=1
              echo "::warning::$file has schema drift"
            fi
          done
          exit $FAILED

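      - name: Alert on drift
        if: failure()
        env:
          # Assumes the webhook URL is stored as a repository secret named SLACK_WEBHOOK.
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}
        run: |
          curl -X POST "$SLACK_WEBHOOK" \
            -H "Content-Type: application/json" \
            -d '{"text":"⚠️ Schema drift detected in data files. Check the latest CI run."}'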

Pattern 3: Validate Before ETL Load

Gate your ETL pipeline on schema validation — don't load data that doesn't match expectations:

#!/bin/bash
# pre-etl-validate.sh: run before loading data into the warehouse
# Usage: ./pre-etl-validate.sh <data-dir>

set -euo pipefail

DATA_DIR="${1:?Usage: $0 <data-dir>}"
SCHEMA_DIR="schemas"

for file in "$DATA_DIR"/*; do
  [ -f "$file" ] || continue  # skip subdirectories and an unmatched glob
  filename=$(basename "$file")
  name="${filename%.*}"
  schema="$SCHEMA_DIR/${name}.json"

  if [ ! -f "$schema" ]; then
    echo "ERROR: No schema for $filename. Run: datamorph schema $file --json-output > $schema" >&2
    exit 1
  fi

  echo "Validating $filename..."
  if ! datamorph validate "$file" --schema "$schema" --strict; then
    echo "ERROR: $filename failed schema validation. Not loading." >&2
    exit 1
  fi
done

echo "✓ All files passed validation. Safe to load."

Pattern 4: JSON Output for Programmatic Processing

Use --json-output to pipe validation results into monitoring systems:

datamorph validate users.csv --schema users.json --json-output

# Output:
# {
#   "valid": false,
#   "rows_checked": 10000,
#   "errors": [
#     "Row 1: field 'age' expected integer but got string"
#   ],
#   "warnings": [
#     "Row 45: unexpected field 'phone'"
#   ]
# }

# Pipe to monitoring
datamorph validate users.csv --schema users.json --json-output | \
  python3 -c "
import sys, json
result = json.load(sys.stdin)
if not result['valid']:
    print(f'ALERT: {len(result[\"errors\"])} schema errors in users.csv')
    for err in result['errors'][:5]:
        print(f'  - {err}')
"
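
For anything beyond a one-liner, a small standalone script is easier to maintain than inline Python. The sketch below assumes the JSON shape shown in the sample output above (valid, rows_checked, errors, warnings) and condenses it into a flat record a monitoring system could ingest; the summarize function and the record's field names are hypothetical.

```python
import json

# The sample result shown above, used here so the sketch runs without the CLI.
SAMPLE = """
{
  "valid": false,
  "rows_checked": 10000,
  "errors": ["Row 1: field 'age' expected integer but got string"],
  "warnings": ["Row 45: unexpected field 'phone'"]
}
"""


def summarize(result_text, source):
    """Condense a validation result into a flat record for a monitoring system."""
    result = json.loads(result_text)
    errors = result.get("errors", [])
    warnings = result.get("warnings", [])
    return {
        "source": source,
        "status": "ok" if result.get("valid") else "drift",
        "rows_checked": result.get("rows_checked"),
        "error_count": len(errors),
        "warning_count": len(warnings),
        "first_error": errors[0] if errors else None,
    }


print(json.dumps(summarize(SAMPLE, "users.csv"), indent=2))
```

In a pipeline you would read the result from standard input (sys.stdin.read()) instead of SAMPLE and forward the record to whatever alerting endpoint you use.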

Schema Management: Create, Commit, Update

The workflow is simple: generate schemas from known-good data, commit them to your repo, and validate new data against them.

Step 1: Generate Schemas from Known-Good Data

# Generate schema for each data file
datamorph schema users.csv --json-output > schemas/users.json
datamorph schema orders.parquet --json-output > schemas/orders.json
datamorph schema events.jsonl --json-output > schemas/events.json

The schema file is a JSON array of {"name": "field_name", "type": "inferred_type"} objects:

[
  {"name": "user_id", "type": "integer"},
  {"name": "email", "type": "string"},
  {"name": "age", "type": "integer"},
  {"name": "created_at", "type": "string"},
  {"name": "is_active", "type": "boolean"}
]
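
To show how a schema in this name/type format can drive a check, here is a minimal sketch that validates CSV text against it. It illustrates the idea only and is not how DataMorph is implemented: the type rules in looks_like, the check function, and the sample rows are assumptions made for the example.

```python
import csv
import io
import json

SCHEMA = json.loads("""
[
  {"name": "user_id", "type": "integer"},
  {"name": "email", "type": "string"},
  {"name": "age", "type": "integer"}
]
""")


def looks_like(value, type_name):
    """Very rough type test for a CSV cell; unknown type names accept any text."""
    if type_name == "integer":
        try:
            int(value)
            return True
        except ValueError:
            return False
    if type_name == "boolean":
        return value.lower() in {"true", "false"}
    return True


def check(text, schema):
    """Return human-readable problems for rows that do not fit the schema."""
    problems = []
    for row_number, row in enumerate(csv.DictReader(io.StringIO(text)), start=1):
        for field in schema:
            name, expected = field["name"], field["type"]
            if row.get(name) is None:
                problems.append(f"Row {row_number}: missing required field '{name}'")
            elif not looks_like(row[name], expected):
                problems.append(
                    f"Row {row_number}: field '{name}' expected {expected} but got string"
                )
    return problems


data = "user_id,email,age\n1,a@example.com,34\n2,b@example.com,n/a\n"
print(check(data, SCHEMA))
```

Because the schema is plain JSON, the same file can be read by any script in your pipeline, not only by the CLI.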

Step 2: Commit Schemas to Your Repo

git add schemas/
git commit -m "Add data schemas for CI validation"

Step 3: Update Schemas When Data Changes Intentionally

When a schema change is intentional (e.g., a new column is added by design), update the schema file:

# Re-generate schema from the updated data
datamorph schema users.csv --json-output > schemas/users.json

# Or manually add the new field
# Edit schemas/users.json to add {"name": "phone", "type": "string"}

git add schemas/users.json
git commit -m "Update users schema: add phone field"

Schema files are version-controlled. When a PR changes data and the schema, the schema diff in the PR makes it obvious what changed. Reviewers see the intentional schema update alongside the data change, instead of discovering a surprise type mismatch in production.
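
A line-based git diff works, but a field-level summary can be easier to read in a PR description. The sketch below compares two schema lists in the name/type format shown earlier; diff_schemas is a hypothetical helper and the OLD/NEW schemas are invented for the example.

```python
def diff_schemas(old, new):
    """Compare two schema lists and report added, removed, and retyped fields."""
    old_types = {field["name"]: field["type"] for field in old}
    new_types = {field["name"]: field["type"] for field in new}
    shared = old_types.keys() & new_types.keys()
    return {
        "added": sorted(new_types.keys() - old_types.keys()),
        "removed": sorted(old_types.keys() - new_types.keys()),
        "retyped": {
            name: (old_types[name], new_types[name])
            for name in sorted(shared)
            if old_types[name] != new_types[name]
        },
    }


OLD = [
    {"name": "user_id", "type": "integer"},
    {"name": "email", "type": "string"},
    {"name": "age", "type": "integer"},
]
NEW = [
    {"name": "user_id", "type": "integer"},
    {"name": "email", "type": "string"},
    {"name": "age", "type": "string"},
    {"name": "phone", "type": "string"},
]

print(diff_schemas(OLD, NEW))
# {'added': ['phone'], 'removed': [], 'retyped': {'age': ('integer', 'string')}}
```

An intentional change (phone added) and an accidental one (age retyped) look different at a glance, which is what a reviewer needs.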

What Validate Catches vs. What It Doesn't

Schema Drift Type	Caught by validate?	Mode Required
Field changes type (integer → string)	✓ Yes	`--strict`
Required field goes missing	✓ Yes	Default
New unexpected field appears	⚠ Warning	`--strict`
File is unparseable / corrupt	✓ Yes	Default
Columns are inconsistent across rows	✓ Yes	Default
Semantic errors (valid string but wrong value)	✗ No	N/A
Business logic violations (age > 150)	✗ No	N/A

DataMorph validates structure, not semantics. It catches "the age field is now a string instead of an integer" but not "the age field contains 999." For semantic validation, use a tool like Great Expectations alongside DataMorph — structural checks in CI, semantic checks in your data quality layer.
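
The gap is easy to see in code. In the sketch below every age value is a well-formed integer, so a type check passes, yet one value is implausible. The implausible_ages helper and its 0 to 150 bounds are assumptions for illustration; a dedicated data quality tool would express the same rule declaratively.

```python
def implausible_ages(rows, low=0, high=150):
    """Return (row_number, value) for ages that are valid integers but out of range."""
    return [
        (row_number, row["age"])
        for row_number, row in enumerate(rows, start=1)
        if not low <= int(row["age"]) <= high
    ]


rows = [{"age": "34"}, {"age": "999"}, {"age": "41"}]

# Structural view: every value parses as an integer, so the type check is satisfied.
print(all(row["age"].isdigit() for row in rows))  # True

# Semantic view: one of those integers cannot be a real age.
print(implausible_ages(rows))  # [(2, '999')]
```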

Multi-Format Validation: Same Schema Logic, Any File Type

DataMorph validates the same schema logic across all supported formats:

# CSV validation
datamorph validate users.csv --schema users.json --strict

# Parquet validation (same schema)
datamorph validate users.parquet --schema users.json --strict

# JSON Lines validation
datamorph validate users.jsonl --schema users.json --strict

# Avro validation
datamorph validate users.avro --schema users.json --strict

This matters when your pipeline converts between formats. If you convert users.csv → users.parquet for warehouse loading, validate both files against the same schema. If the Parquet file has a different shape than the CSV source, the conversion introduced a bug.

Batch Validation: Check Entire Directories

# Validate every CSV in a directory tree against its schema
shopt -s globstar nullglob  # bash 4+: without globstar, ** does not recurse
for file in data/**/*.csv; do
  name=$(basename "$file" .csv)
  datamorph validate "$file" --schema "schemas/${name}.json" --strict
done

# Or use the batch command for conversion + validation
datamorph batch data/ output/ --format parquet --validate
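
If the loop grows beyond a few lines, the file-to-schema pairing can be done in Python with pathlib, which recurses without any shell options. This is a sketch under the same naming convention as the loop above (schemas/NAME.json for NAME.csv); pair_with_schemas is a hypothetical helper, and the temporary directory only exists to make the example runnable.

```python
import tempfile
from pathlib import Path


def pair_with_schemas(data_dir, schema_dir, pattern="*.csv"):
    """Return (data_file, schema_file_or_None) for every match under data_dir, recursively."""
    pairs = []
    for data_file in sorted(Path(data_dir).rglob(pattern)):
        schema_file = Path(schema_dir) / f"{data_file.stem}.json"
        pairs.append((data_file, schema_file if schema_file.is_file() else None))
    return pairs


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data" / "vendor").mkdir(parents=True)
    (root / "schemas").mkdir()
    (root / "data" / "users.csv").write_text("user_id\n1\n")
    (root / "data" / "vendor" / "orders.csv").write_text("order_id\n7\n")
    (root / "schemas" / "users.json").write_text('[{"name": "user_id", "type": "integer"}]')

    for data_file, schema_file in pair_with_schemas(root / "data", root / "schemas"):
        status = schema_file.name if schema_file else "NO SCHEMA"
        print(f"{data_file.name}: {status}")
```

Each pair with a schema can then be handed to datamorph validate; files without one are worth surfacing as their own failure. Note that pairing by base name means two files called users.csv in different subdirectories share one schema.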

Three Real Schema Drift Scenarios

Scenario 1: Vendor CSV Format Change

Your payment processor sends a daily CSV of transactions. On Tuesday, they add a currency column and change amount from integer to float (e.g., 19.99 instead of 1999 cents). Your ETL still treats amount as integer cents, so 19.99 is truncated to 19 and stored as 19 cents: every transaction is now off by roughly 100x.

With DataMorph: The CI validation catches the type change on amount (integer → float) and the new field. The pipeline fails with a clear error message. You update the schema and adjust the ETL before the data loads.
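
The arithmetic behind this scenario, as a short worked example. The to_cents helper is hypothetical and shows one way the ETL adjustment could look, using Decimal so the conversion is not subject to binary floating-point rounding.

```python
from decimal import Decimal


def to_cents(amount_text):
    """Convert a decimal currency string such as '19.99' to integer cents."""
    return int(Decimal(amount_text) * 100)


old_value = int("1999")       # old export: amount already in cents
naive = int(float("19.99"))   # new export forced into the old integer column
adjusted = to_cents("19.99")  # new export converted back to cents

print(old_value, naive, adjusted)  # 1999 19 1999
```

The naive load stores 19 where 1999 was expected, which is the roughly 100x error; the adjusted load restores the original unit.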

Scenario 2: Partner Team Drops a Column

The user service team removes the phone field from their user export because they're migrating to a separate phone verification service. They don't tell your team. Your analytics pipeline loads the data — the phone column is just missing, not errored. Your phone segmentation report silently shows zero users with phones.

With DataMorph in strict mode: datamorph validate users.csv --schema users.json --strict reports "missing required field 'phone'". The pipeline fails. You either source the field from the new phone service or remove phone from the committed schema.

Scenario 3: Parquet Schema Evolution Without Backward Compatibility

Your ML training pipeline reads Parquet files from S3. Someone changes the feature extraction code, which renames user_age_bucket to age_group and changes it from string to integer (0, 1, 2 instead of "18-25", "26-35", "36-45"). The new Parquet files are written alongside the old ones. Your model training reads both and gets a type error — but only on some batches, making it hard to debug.

With DataMorph: Validate both old and new Parquet files against the committed schema. The new files fail validation because the expected user_age_bucket field is missing and age_group is not in the schema; the type change travels with the rename. You catch the incompatibility before mixing old and new data in training.

Install DataMorph

# pip
pip install datamorph-cli

# Homebrew (macOS / Linux)
brew tap Coding-Dev-Tools/tap
brew install datamorph

# Scoop (Windows)
scoop bucket add Coding-Dev-Tools https://github.com/Coding-Dev-Tools/scoop-bucket
scoop install datamorph

