Batch Data Conversion at Scale: Why Streaming Matters
Every data engineer has been here: a 6GB CSV export from a production database, a `pd.read_csv()` that eats 24GB of RAM and OOMs your laptop, and a one-off script you'll lose in `~/Downloads/convert_v3_final(2).py`. Three hours later you have the Parquet file you needed, your laptop fan is screaming, and you've written code that will never be reused.
Data format conversion is a universal problem in data engineering. Yet most teams solve it by writing custom scripts with pandas, csvkit, or ad-hoc Python. These approaches work fine for small files, but they break down at production scale.
This post explains why streaming row-by-row conversion matters, how it differs from load-everything-in-memory approaches, and how to use DataMorph, an open-source CLI tool purpose-built for this problem.
The Problem with Load-Everything Conversion
Most data conversion tools, including pandas' `read_csv()` followed by `to_parquet()`, load the entire dataset into memory before writing output. This works for small files, but breaks down in these scenarios:
| Scenario | File Size | pandas Memory | Result |
|---|---|---|---|
| Dev laptop | 500 MB CSV | ~2 GB (4× overhead) | Slow but works |
| CI runner | 2 GB file | ~8 GB peak | OOM on small runners |
| Production export | 10 GB CSV | ~40 GB peak | Guaranteed OOM |
| Batch directory | 50 × 200 MB files | Varies per file | Manual loop needed |
The memory overhead comes from two sources: object overhead (pandas has traditionally stored text columns as individual Python string objects, each of which carries tens of bytes of bookkeeping on top of its characters) and buffering (the whole DataFrame is held in memory before any output is written). A 1 GB CSV can consume 4-8 GB of RAM in pandas before you even start writing output.
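A rough way to see the per-object overhead is to compare the raw byte length of a few CSV-style values with what CPython reports for them as separate string objects. This is a minimal sketch using only the standard library; the sample values are made up, and the exact sizes vary by Python version and platform.

```python
import sys

# Hypothetical sample of short text values, as they might appear in a CSV column.
values = ["US", "DE", "pending", "shipped", "2025-01-31"]

# Bytes the values occupy on disk (UTF-8, ignoring delimiters and newlines).
raw_bytes = sum(len(v.encode("utf-8")) for v in values)

# Bytes CPython reports for each value as a separate str object.
object_bytes = sum(sys.getsizeof(v) for v in values)

print(f"on disk: {raw_bytes} B, as Python objects: {object_bytes} B")
print(f"overhead factor: {object_bytes / raw_bytes:.1f}x")
```

Short values are the worst case, because the fixed per-object cost dominates the few characters of payload.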
Streaming: Row-by-Row Conversion
Streaming converters process one row at a time: read a row, transform it, write it, discard it from memory. The memory footprint stays constant regardless of file size:
# Streaming: constant memory (~20 MB)
datamorph convert 10gb-file.csv output.parquet
# Load-everything: OOM on large files
python -c "
import pandas as pd
df = pd.read_csv('10gb-file.csv') # 40+ GB RAM
df.to_parquet('output.parquet')
"
This is the key insight: memory usage should be proportional to row width, not row count. A streaming converter uses ~20 MB for a 10 GB file with 100-column rows, while pandas needs 40+ GB for the same job.
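The read, transform, write, discard loop can be sketched in a few lines of standard-library Python. This is an illustration of the technique, not DataMorph's implementation: the function name `stream_csv_to_jsonl` is hypothetical, and JSON Lines is used as the target because it can be written one record at a time without third-party packages (a Parquet writer would typically buffer a group of rows before flushing).

```python
import csv
import io
import json


def stream_csv_to_jsonl(src, dst):
    """Convert CSV to JSON Lines one row at a time; return the row count."""
    reader = csv.DictReader(src)
    count = 0
    for row in reader:  # only the current row is held in memory
        dst.write(json.dumps(row) + "\n")
        count += 1
    return count


# Small in-memory demo; real use would pass open file handles instead.
demo_src = io.StringIO("id,name\n1,Ada\n2,Grace\n")
demo_dst = io.StringIO()
rows_written = stream_csv_to_jsonl(demo_src, demo_dst)
print(rows_written, "rows")
print(demo_dst.getvalue(), end="")
```

Because each row is dropped as soon as it is written, peak memory depends on the widest row rather than on the number of rows.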
DataMorph: Streaming from Day One
DataMorph is built on a streaming architecture. Every format pair (CSV to Parquet, JSON to YAML, etc.) processes rows one at a time. This means you can convert files of any size on any machine, from a Raspberry Pi to a production CI runner.
# Install once
pip install datamorph
# Convert anything to anything, streamed
datamorph convert sales_2025.csv sales_2025.parquet
datamorph convert events.json events.yaml
datamorph convert users.parquet users.csv
datamorph convert config.avro config.json
No config files. No schema declarations. No DataFrame imports. One command, one output, constant memory.
Beyond Simple Conversion: Schema Validation
Raw conversion is useful, but the real value comes from catching data quality issues before they reach production. DataMorph includes a `validate` command that checks data files against expected schemas:
# Infer schema from reference data
datamorph schema reference.parquet --json-output > schema.json
# Validate every file in CI
datamorph validate data/events.csv --schema schema.json --strict
# Exit code 1 on mismatch, which blocks deployment
CI Integration Example
Fail your pipeline on schema drift, data type changes, or missing columns, before they reach production.
This turns DataMorph from a "convert utility" into a data quality gate. Combine it with GitHub Actions, GitLab CI, or any pipeline tool to enforce data contracts.
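To make the idea of a schema gate concrete, here is a minimal sketch of a streaming check that reports missing columns and unparseable values, then derives an exit code. Everything in it is an assumption for illustration: the `SCHEMA` mapping, the `validate_csv` function, and the message format are hypothetical and do not reflect DataMorph's `schema.json` format or its internals.

```python
import csv
import io

# Hypothetical schema: column name -> callable that parses a valid value.
SCHEMA = {"id": int, "amount": float, "country": str}


def validate_csv(src, schema):
    """Return a list of problems found while streaming through the CSV rows."""
    reader = csv.DictReader(src)
    header = reader.fieldnames or []
    problems = [f"missing column: {name}" for name in schema if name not in header]
    if problems:
        return problems
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for name, parse in schema.items():
            try:
                parse(row[name])
            except (TypeError, ValueError):
                problems.append(
                    f"line {line_no}: {name}={row[name]!r} is not {parse.__name__}"
                )
    return problems


sample = io.StringIO("id,amount,country\n1,9.99,US\nx,12.50,DE\n")
problems = validate_csv(sample, SCHEMA)
for problem in problems:
    print(problem)
exit_code = 1 if problems else 0  # a CI wrapper would pass this to sys.exit()
```

A non-zero exit code is what lets a CI system stop the job without parsing any output.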
Batch Directory Conversion
When you have a directory full of files, you don't want to loop manually:
# Convert every CSV to Parquet in one command
datamorph batch ./raw_data/ ./processed/ --from csv --to parquet --recursive
# Validate before batch conversion
datamorph batch ./incoming/ ./clean/ --from csv --to parquet --validate
# Dry-run to see what would happen
datamorph batch ./data/ ./out/ --from json --to yaml --dry-run
The `--recursive` flag walks subdirectories. The `--validate` flag runs schema checks before conversion. And `--dry-run` shows the plan without touching files, which is useful for reviewing before executing.
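The manual loop that a batch command replaces looks roughly like the sketch below: build a list of source and target paths, optionally recursing, and either print the plan or run a converter on each pair. The names `plan_batch` and `run_batch` are hypothetical, and the `convert` callable stands in for whatever conversion routine you use; none of this is DataMorph's own code.

```python
import tempfile
from pathlib import Path


def plan_batch(src_dir, dst_dir, from_ext, to_ext, recursive=False):
    """Return (source, target) pairs, mirroring the source tree under dst_dir."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    pattern = f"*.{from_ext}"
    sources = src_dir.rglob(pattern) if recursive else src_dir.glob(pattern)
    return [
        (src, dst_dir / src.relative_to(src_dir).with_suffix(f".{to_ext}"))
        for src in sorted(sources)
    ]


def run_batch(plan, convert, dry_run=False):
    """Apply convert(src, dst) to each pair; in dry-run mode only print the plan."""
    for src, dst in plan:
        if dry_run:
            print(f"would convert {src} -> {dst}")
            continue
        dst.parent.mkdir(parents=True, exist_ok=True)
        convert(src, dst)


# Demo on a throwaway directory tree with one nested file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "raw" / "2025").mkdir(parents=True)
    (root / "raw" / "a.csv").write_text("id\n1\n")
    (root / "raw" / "2025" / "b.csv").write_text("id\n2\n")
    plan = plan_batch(root / "raw", root / "out", "csv", "parquet", recursive=True)
    run_batch(plan, convert=None, dry_run=True)  # prints the plan, converts nothing
    planned = [dst.relative_to(root).as_posix() for _, dst in plan]
```

Separating planning from execution is what makes a dry run cheap: the same plan is either printed or executed.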
Supported Format Pairs
DataMorph supports input and output for 6+ common formats:
| Format | Input | Output | Best For |
|---|---|---|---|
| CSV | ✓ | ✓ | Spreadsheets, database exports |
| JSON | ✓ | ✓ | API responses, config files |
| JSONL | ✓ | ✓ | Log ingestion, ML training data |
| YAML | ✓ | ✓ | Config files, Kubernetes manifests |
| Parquet | ✓ | ✓ | Columnar analytics (Athena, BigQuery) |
| Avro | ✓ | ✓ | Kafka streams, Hadoop ecosystem |
| Protobuf | ✓ | ✓ | gRPC services, binary protocols |
Real-World Example: Monthly Export Pipeline
Here's a complete pipeline that processes monthly database exports for analytics:
# Step 1: Convert CSV exports to Parquet for Athena
datamorph batch ./exports/2026-04/ ./athena-tables/ \
--from csv --to parquet --recursive
# Step 2: Validate schema against production
datamorph validate ./athena-tables/events.parquet \
--schema ./schemas/events.json --strict
# Step 3: Generate schema docs for downstream teams
datamorph schema ./athena-tables/events.parquet \
--json-output > ./docs/events-schema.json
# Step 4: Quick CSV summary for business analysts
datamorph convert ./athena-tables/monthly-summary.parquet \
./reports/monthly-summary.csv
This pipeline runs in under 10 minutes for a 15 GB export on a 4 GB CI runner. The same job with pandas would OOM on step 1.
Pro tip: Combine DataMorph with cron or a CI schedule for automated daily/weekly data processing. The constant memory footprint means you don't need to provision expensive runners.
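One way to wire such steps into a scheduled job is a small wrapper that runs each command in order and stops at the first failure. The sketch below is an assumption about how you might do this, not part of DataMorph: `run_pipeline` is a hypothetical helper, the commands in `STEPS` are copied from the pipeline in this post, and it assumes each command signals failure with a non-zero exit code.

```python
import subprocess
import sys


def run_pipeline(steps):
    """Run each command in order; stop at the first non-zero exit code."""
    for cmd in steps:
        print("running:", " ".join(cmd), file=sys.stderr)
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return result.returncode
    return 0


# Commands taken from the example pipeline; the paths are illustrative.
STEPS = [
    ["datamorph", "batch", "./exports/2026-04/", "./athena-tables/",
     "--from", "csv", "--to", "parquet", "--recursive"],
    ["datamorph", "validate", "./athena-tables/events.parquet",
     "--schema", "./schemas/events.json", "--strict"],
]

# In a cron job or CI step, the script would end with:
#     sys.exit(run_pipeline(STEPS))
```

Returning the failing command's exit code lets cron wrappers and CI systems report the failure without any extra logic.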
Comparison: DataMorph vs. Alternatives
| Feature | DataMorph | pandas | csvkit | frictionless |
|---|---|---|---|---|
| Streaming >10 GB | ? | ? OOM | ? | ? |
| Multi-format (6+) | ? | ? with libs | ? CSV only | ? |
| Batch directory conversion | ? | ? | ? | ? |
| Zero-config CLI | ? | ? | ? | ? |
| Schema validation with exit codes | ? | ? | ? | ? |
| Single pip install | ? | ? | ? | ? |
| CI-friendly exit codes | ? | ? | ? | ? |
DataMorph vs. pandas: pandas is a full data analysis framework. DataMorph is a focused CLI tool: install it once and use it in any pipeline without importing libraries. The streaming architecture means you don't need 32 GB RAM machines for data conversion.
DataMorph vs. csvkit: csvkit is excellent for CSV but limited to one format. DataMorph handles Parquet, Avro, JSON, and YAML, the formats that real data pipelines use.
DataMorph vs. frictionless: frictionless is a Python framework with a rich API. DataMorph is a CLI tool you can run in CI without writing Python. If you need programmatic control, frictionless is the right choice. If you need a one-liner in your pipeline, reach for DataMorph.
Getting Started
Install DataMorph and start converting in under a minute:
# Install
pip install datamorph
# Verify
datamorph --help
# Convert your first file
datamorph convert sample.csv sample.parquet
# See all supported formats
datamorph formats
# Check a file's schema
datamorph schema sample.parquet --json-output
DataMorph is free and open source (MIT). No account, no telemetry, no rate limits. The CLI works fully offline.
It's one of 11 tools in the DevForge suite, a collection of CLI-first developer tools built by autonomous AI agents for real engineering problems. Check out the other tools, including API Contract Guardian, json2sql, DeployDiff, Envault, and click-to-mcp, at github.com/Coding-Dev-Tools.