Batch Data Conversion at Scale: Why Streaming Matters

May 18, 2026 · 7 min read · Data Engineering, ETL, CLI

Every data engineer has been here: a 6 GB CSV export from a production database, a `pd.read_csv()` that eats 24 GB of RAM and OOMs your laptop, and a one-off script you'll lose in `~/Downloads/convert_v3_final(2).py`. Three hours later you have the Parquet file you needed, your laptop fan is screaming, and you've written code that will never be reused.

Data format conversion is a universal problem in data engineering. Yet most teams solve it by writing custom scripts with pandas, csvkit, or ad-hoc Python. These approaches work fine for small files, but they break down at production scale.

This post explains why streaming row-by-row conversion matters, how it differs from load-everything-in-memory approaches, and how to use DataMorph, an open-source CLI tool purpose-built for this problem.

The Problem with Load-Everything Conversion

Most data conversion tools, including the pandas `read_csv()` → `to_parquet()` path, load the entire dataset into memory before writing output. This works for small files, but breaks down in these scenarios:

Scenario	File Size	pandas Memory	Result
Dev laptop	500 MB CSV	~2 GB (4× overhead)	Slow but works
CI runner	2 GB file	~8 GB peak	OOM on small runners
Production export	10 GB CSV	~40 GB peak	Guaranteed OOM
Batch directory	50 × 200 MB files	Varies per file	Manual loop needed

The memory overhead comes from two sources: per-value object overhead (pandas has long stored each string as a separate Python object by default, so short fields cost far more in RAM than they do on disk) and buffering (the whole DataFrame must exist in memory before any output is written). A 1 GB CSV can consume 4-8 GB of RAM in pandas before you even start writing output.
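The multiplier is easy to see even without pandas. As a rough sketch, assuming CPython and ASCII text, the snippet below compares the bytes a handful of short fields occupy as text with the bytes the interpreter needs to hold each one as its own `str` object. The helper name `text_vs_object_bytes` is made up for illustration, and this measures plain Python objects, not pandas internals.

```python
import sys


def text_vs_object_bytes(values):
    """Return (raw_bytes, object_bytes) for a list of ASCII strings.

    raw_bytes is what the text occupies on disk (one byte per character);
    object_bytes is what CPython reports for holding each value as a str object.
    """
    raw_bytes = sum(len(v) for v in values)
    object_bytes = sum(sys.getsizeof(v) for v in values)
    return raw_bytes, object_bytes


# Short fields, typical of IDs, codes, and flags in a CSV export.
fields = ["US", "42", "active", "2026-04-01", "a1b2c3"]
raw, held = text_vs_object_bytes(fields)
print(f"{raw} bytes of text -> {held} bytes as Python objects ({held / raw:.1f}x)")
```

The exact ratio depends on the Python version and on how long the fields are, which is why the shorter the values, the worse the blow-up tends to be.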

Streaming: Row-by-Row Conversion

Streaming converters process one row at a time: read a row, transform it, write it, discard it from memory. The memory footprint stays constant regardless of file size:

# Streaming: constant memory (~20 MB)
datamorph convert 10gb-file.csv output.parquet

# Load-everything: OOM on large files
python -c "
import pandas as pd
df = pd.read_csv('10gb-file.csv')   # 40+ GB RAM
df.to_parquet('output.parquet')
"

This is the key insight: memory usage should be proportional to row width, not row count. A streaming converter uses ~20 MB for a 10 GB file with 100-column rows, while pandas needs 40+ GB for the same job.
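To make the read, transform, write, discard loop concrete, here is a minimal sketch using only the Python standard library. It converts CSV to JSON Lines and is not DataMorph's implementation; the function name `stream_csv_to_jsonl` is an assumption made for this example.

```python
import csv
import io
import json


def stream_csv_to_jsonl(src, dst):
    """Convert CSV to JSON Lines one row at a time; return the row count.

    src and dst are open text file objects. Only the current row is held
    in memory, so peak usage depends on row width, not on file length.
    """
    count = 0
    for row in csv.DictReader(src):        # reads lazily, one record per iteration
        dst.write(json.dumps(row) + "\n")  # write immediately, then drop the row
        count += 1
    return count


# In-memory demo; with real files, pass open(path, newline="") handles instead.
source = io.StringIO("id,name\n1,Ada\n2,Grace\n")
sink = io.StringIO()
rows_written = stream_csv_to_jsonl(source, sink)
```

A row-oriented target such as JSON Lines can be written one record at a time. A columnar target such as Parquet would need rows collected into bounded batches before each write, which keeps memory capped but is not shown here.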

DataMorph: Streaming from Day One

DataMorph is built on a streaming architecture. Every format pair (CSV → Parquet, JSON → YAML, etc.) processes rows one at a time. This means you can convert files of any size on any machine, from a Raspberry Pi to a production CI runner.

# Install once
pip install datamorph

# Convert anything to anything, streamed
datamorph convert sales_2025.csv sales_2025.parquet
datamorph convert events.json events.yaml
datamorph convert users.parquet users.csv
datamorph convert config.avro config.json

No config files. No schema declarations. No DataFrame imports. One command, one output, constant memory.

Beyond Simple Conversion: Schema Validation

Raw conversion is useful, but the real value comes from catching data quality issues before they reach production. DataMorph includes a `validate` command that checks data files against expected schemas:

# Infer schema from reference data
datamorph schema reference.parquet --json-output > schema.json

# Validate every file in CI
datamorph validate data/events.csv --schema schema.json --strict
# Exit code 1 on mismatch: blocks deployment

CI Integration Example

Fail your pipeline on schema drift, data type changes, or missing columns before they reach production.

This turns DataMorph from a "convert utility" into a data quality gate. Combine it with GitHub Actions, GitLab CI, or any pipeline tool to enforce data contracts.
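The idea of a validation gate can be sketched in a few lines of Python. Everything here is illustrative: the schema format (column name mapped to a type name), the `validate_csv` and `exit_code` helpers, and the sample columns are assumptions for this example, not DataMorph's schema format or API.

```python
import csv
import io

# Hypothetical schema format for illustration: column name -> expected type.
CASTS = {"int": int, "float": float, "str": str}


def validate_csv(src, schema):
    """Return a list of human-readable problems; an empty list means the file passes."""
    reader = csv.DictReader(src)
    header = reader.fieldnames or []
    missing = [col for col in schema if col not in header]
    if missing:
        return [f"missing column: {col}" for col in missing]
    problems = []
    for row_no, row in enumerate(reader, start=1):  # counts data rows, header excluded
        for col, type_name in schema.items():
            try:
                CASTS[type_name](row[col])
            except (ValueError, TypeError):  # TypeError covers short rows (None values)
                problems.append(f"row {row_no}: {col}={row[col]!r} is not {type_name}")
    return problems


def exit_code(problems):
    """Map validation results to a process exit status: 0 passes, 1 fails the job."""
    return 1 if problems else 0


schema = {"user_id": "int", "amount": "float", "country": "str"}
good = io.StringIO("user_id,amount,country\n1,9.50,US\n2,12,DE\n")
drifted = io.StringIO("user_id,amount,country\n1,9.50,US\nabc,12,DE\n")
```

A CI job would pass the result of `exit_code(...)` to `sys.exit`, since most pipeline tools treat any non-zero status as a failed step.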

Batch Directory Conversion

When you have a directory full of files, you don't want to loop manually:

# Convert every CSV to Parquet in one command
datamorph batch ./raw_data/ ./processed/ --from csv --to parquet --recursive

# Validate before batch conversion
datamorph batch ./incoming/ ./clean/ --from csv --to parquet --validate

# Dry-run to see what would happen
datamorph batch ./data/ ./out/ --from json --to yaml --dry-run

The `--recursive` flag walks subdirectories. The `--validate` flag runs schema checks before conversion. And `--dry-run` shows the plan without touching files, which is useful for reviewing before executing.
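For readers curious what a recursive walk with a dry-run mode involves, here is a small standard-library sketch. It separates planning from execution so the plan can be printed without side effects. The names `plan_batch` and `run_batch` and the `convert` callback are hypothetical, and this is not how DataMorph is implemented.

```python
from pathlib import Path
import tempfile


def plan_batch(src_dir, dst_dir, from_ext, to_ext, recursive=False):
    """Return (source, target) path pairs, mirroring the source tree under dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    pattern = f"*.{from_ext}"
    found = src_dir.rglob(pattern) if recursive else src_dir.glob(pattern)
    return [
        (path, dst_dir / path.relative_to(src_dir).with_suffix(f".{to_ext}"))
        for path in sorted(found)
    ]


def run_batch(plan, convert, dry_run=False):
    """Apply convert(source, target) to each pair; in dry-run mode only report."""
    for source, target in plan:
        if dry_run:
            print(f"would convert {source} -> {target}")
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        convert(source, target)
    return len(plan)


# Demo on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    (raw / "2026-04").mkdir(parents=True)
    (raw / "a.csv").write_text("id\n1\n")
    (raw / "2026-04" / "b.csv").write_text("id\n2\n")
    flat = plan_batch(raw, Path(tmp) / "out", "csv", "parquet")
    deep = plan_batch(raw, Path(tmp) / "out", "csv", "parquet", recursive=True)
    planned = run_batch(deep, convert=None, dry_run=True)  # touches no files
    out_exists = (Path(tmp) / "out").exists()
```

Building the full plan first is what makes the dry run trustworthy: the real run iterates over exactly the same list of pairs.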

Supported Format Pairs

DataMorph supports input and output for 6+ common formats:

Format	Input	Output	Best For
CSV	?	?	Spreadsheets, database exports
JSON	?	?	API responses, config files
JSONL	?	?	Log ingestion, ML training data
YAML	?	?	Config files, Kubernetes manifests
Parquet	?	?	Columnar analytics (Athena, BigQuery)
Avro	?	?	Kafka streams, Hadoop ecosystem
Protobuf	?	?	gRPC services, binary protocols

Real-World Example: Monthly Export Pipeline

Here's a complete pipeline that processes monthly database exports for analytics:

# Step 1: Convert CSV exports to Parquet for Athena
datamorph batch ./exports/2026-04/ ./athena-tables/ \
  --from csv --to parquet --recursive

# Step 2: Validate schema against production
datamorph validate ./athena-tables/events.parquet \
  --schema ./schemas/events.json --strict

# Step 3: Generate schema docs for downstream teams
datamorph schema ./athena-tables/events.parquet \
  --json-output > ./docs/events-schema.json

# Step 4: Quick CSV summary for business analysts
datamorph convert ./athena-tables/monthly-summary.parquet \
  ./reports/monthly-summary.csv

This pipeline runs in under 10 minutes for a 15 GB export on a 4 GB CI runner. The same job with pandas would OOM on step 1.
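If the steps are driven from Python rather than a shell script, the important property is failing fast: a non-zero exit code from one step should stop everything after it. The sketch below shows that pattern with a hypothetical `run_steps` helper. It uses stand-in commands so it runs anywhere; in the pipeline above, each argument list would be one of the `datamorph` invocations.

```python
import subprocess
import sys


def run_steps(steps):
    """Run commands in order; stop at the first non-zero exit code and return it."""
    for name, argv in steps:
        result = subprocess.run(argv)
        if result.returncode != 0:
            print(f"step failed: {name} (exit {result.returncode})", file=sys.stderr)
            return result.returncode
    return 0


# Stand-in commands for the demo. A real step would look like:
# ["datamorph", "validate", "./athena-tables/events.parquet",
#  "--schema", "./schemas/events.json", "--strict"]
ok = [sys.executable, "-c", "raise SystemExit(0)"]
bad = [sys.executable, "-c", "raise SystemExit(1)"]
status = run_steps([("convert", ok), ("validate", bad), ("docs", ok)])
```

Here the third step never runs because the second one fails, which is the same behavior the validation gate relies on to block a deployment.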

Pro tip: Combine DataMorph with cron or a CI schedule for automated daily/weekly data processing. The constant memory footprint means you don't need to provision expensive runners.

Comparison: DataMorph vs. Alternatives

Feature	DataMorph	pandas	csvkit	frictionless
Streaming >10 GB	?	? OOM	?	?
Multi-format (6+)	?	? with libs	? CSV only	?
Batch directory conversion	?	?	?	?
Zero-config CLI	?	?	?	?
Schema validation with exit codes	?	?	?	?
Single pip install	?	?	?	?
CI-friendly exit codes	?	?	?	?

DataMorph vs. pandas: pandas is a full data analysis framework. DataMorph is a focused CLI tool: install it once and use it in any pipeline without importing libraries. The streaming architecture means you don't need 32 GB RAM machines for data conversion.

DataMorph vs. csvkit: csvkit is excellent for CSV but limited to one format. DataMorph handles Parquet, Avro, JSON, and YAML, the formats that real data pipelines use.

DataMorph vs. frictionless: frictionless is a Python framework with a rich API. DataMorph is a CLI tool you can run in CI without writing Python. If you need programmatic control, frictionless is the right choice. If you need a one-liner in your pipeline, reach for DataMorph.

Getting Started

Install DataMorph and start converting in under a minute:

# Install
pip install datamorph

# Verify
datamorph --help

# Convert your first file
datamorph convert sample.csv sample.parquet

# See all supported formats
datamorph formats

# Check a file's schema
datamorph schema sample.parquet --json-output

DataMorph is free and open source (MIT). No account, no telemetry, no rate limits. The CLI works fully offline.

It's one of 11 tools in the DevForge suite, a collection of CLI-first developer tools built for real engineering problems. Check out the other tools at github.com/Coding-Dev-Tools.

DataMorph is part of DevForge: 11 developer CLI tools built by autonomous AI agents. Also check out API Contract Guardian, json2sql, DeployDiff, Envault, and click-to-mcp.