The Format Conversion Problem
Data lives in different formats for different reasons: CSV for human readability, JSON for APIs, YAML for configuration, Parquet for analytics, Avro for event streaming, and Protobuf for high-performance serialization. Moving data between these formats is one of the most common tasks in data engineering -- and one of the most tedious.
The challenge isn't just changing file extensions. Each format has its own type system, nesting model, schema handling, and encoding conventions. CSV has no schema and stores every value as text. JSON has flexible nesting but no type enforcement. Parquet and Avro both require a schema, and Protobuf requires a compiled one. Converting between them means resolving these structural differences -- and doing it without running out of memory on large files.
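To make the type-system and nesting mismatch concrete, here is a minimal sketch using only the Python standard library. It flattens nested JSON-style records into dotted column names, writes them as CSV, and reads them back. The `flatten` helper and the sample records are illustrative assumptions, not part of any tool discussed here.

```python
import csv
import io
import json


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names; lists are JSON-encoded."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        elif isinstance(value, list):
            flat[name] = json.dumps(value)
        else:
            flat[name] = value
    return flat


# Hypothetical sample data with typed and nested values
records = [
    {"id": 1, "active": True, "user": {"name": "Ada", "tags": ["a", "b"]}},
    {"id": 2, "active": False, "user": {"name": "Lin", "tags": []}},
]

flat_rows = [flatten(r) for r in records]

# Write the flattened rows as CSV into an in-memory buffer
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(flat_rows[0]))
writer.writeheader()
writer.writerows(flat_rows)

# Read the CSV back: every value is now a string
buffer.seek(0)
round_tripped = list(csv.DictReader(buffer))
print(round_tripped[0])
```

After the round trip the integer `id` and the boolean `active` come back as strings, and the nested `user` object survives only as flattened columns. A converter has to decide how to recover both, which is where schemas and inference come in.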
Four approaches at very different scales:
| Approach | DataMorph | Pandas | Apache NiFi | AWS Glue |
|---|---|---|---|---|
| What it does | CLI: convert between 6 formats in one command | Python library: read, transform, write any format | Visual: drag-and-drop data flow pipelines | Cloud: serverless ETL jobs on AWS |
| Format coverage | 6 formats built-in | 30+ via plugins | 20+ via processors | 5 native + custom |
| Setup time | 30 seconds | 5 minutes | 1-2 hours | 30-60 minutes |
Tool 1: DataMorph -- One-Command Format Conversion
DataMorph -- Convert Between Data Formats at Scale
- 6 formats: CSV, JSON, YAML, Parquet, Avro, Protobuf
- Streaming architecture: no file size limit
- Schema inference and field mapping
- CI/CD integration with validation mode
DataMorph is a CLI tool that converts between 6 common data formats in one command. It uses a streaming architecture that processes data row-by-row instead of loading entire files into memory, so it handles files of any size. It infers schemas automatically, maps fields between formats, and validates output -- all from the command line.
Core workflow
# Install
pip install datamorph-cli
# Convert CSV to Parquet
datamorph convert data.csv --to parquet
# data.parquet created (streaming, any size)
# Convert JSON to YAML
datamorph convert config.json --to yaml
# config.yaml created
# Convert Parquet to CSV
datamorph convert analytics.parquet --to csv
# analytics.csv created
# Batch convert: multiple files at once
datamorph convert *.csv --to parquet --output-dir ./parquet/
# Specify schema for Avro/Protobuf output
datamorph convert events.json --to avro --schema events.avsc
# Field selection: select and reorder columns
datamorph convert data.csv --to json --fields "id,name,email"
# Validate output format
datamorph validate output.parquet --format parquet
# CI/CD: validate data in pipeline
datamorph convert fixtures.csv --to json --validate
# Exit code 0 if conversion succeeds and output validates
# Pipe support
cat data.json | datamorph convert --from json --to yaml
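The exit-code contract described in the comments above is what makes a CLI converter usable as a CI gate. A minimal sketch of a Python wrapper follows. The `conversion_gate` helper is hypothetical, and a stand-in command is used so the example runs without DataMorph installed.

```python
import subprocess
import sys


def conversion_gate(command):
    """Run a command; return (succeeded, stderr) based on its exit code."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stderr.strip()


# In a real pipeline the command would be something like:
#   ["datamorph", "convert", "fixtures.csv", "--to", "json", "--validate"]
# Stand-in commands are used here (assumption) to simulate exit codes 0 and 1.
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('bad schema'); sys.exit(1)",
]

ok, _ = conversion_gate(passing)
bad, message = conversion_gate(failing)
print(ok, bad, message)
```

Most CI systems already fail a step on a non-zero exit code, so a wrapper like this is only needed when the result feeds further Python logic.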
What DataMorph gets right
- One command. `datamorph convert data.csv --to parquet` produces a valid Parquet file. No scripts, no boilerplate, no format-specific configuration. The most common data conversion task reduced to a single CLI call.
- Streaming architecture. Processes data row-by-row instead of loading entire files into memory. A 10GB CSV file converts to Parquet without OOM errors. This is the #1 advantage over Pandas for large files.
- 6 formats, one interface. CSV, JSON, YAML, Parquet, Avro, and Protobuf with consistent CLI syntax. No need to remember different Python libraries for each format pair.
- Schema inference. Automatically infers column types from data. No schema definition needed for simple conversions. Override with `--schema` when you need explicit control.
- Batch processing. `datamorph convert *.csv --to parquet` converts all CSV files in a directory. Handles bulk format migrations in one command.
- CI/CD friendly. `--validate` mode confirms conversion success and output format validity. Exit codes work with any CI system. No Python runtime needed in CI.
- Field selection. `--fields` flag selects and reorders columns during conversion. Combine format change with column pruning in one step.
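The row-by-row idea behind the streaming and field-selection points can be sketched in a few lines of standard-library Python. This is an illustration of the general technique, not DataMorph's implementation. The function name `stream_csv_to_jsonl` and the sample data are assumptions.

```python
import csv
import io
import json


def stream_csv_to_jsonl(source, sink, fields=None):
    """Convert CSV to JSON Lines one row at a time; return rows written.

    Only the current row is held in memory, so memory use does not grow
    with file size. `fields` optionally selects and reorders columns.
    """
    reader = csv.DictReader(source)
    count = 0
    for row in reader:
        if fields is not None:
            row = {name: row[name] for name in fields}
        sink.write(json.dumps(row) + "\n")
        count += 1
    return count


# In-memory stand-ins for real file handles
source = io.StringIO(
    "id,name,email,notes\n"
    "1,Ada,ada@example.com,x\n"
    "2,Lin,lin@example.com,y\n"
)
sink = io.StringIO()

written = stream_csv_to_jsonl(source, sink, fields=["email", "id"])
print(written)
print(sink.getvalue())
```

Row-oriented targets such as JSON Lines map directly onto this loop. Columnar targets such as Parquet are typically written in batches of rows (row groups), so a streaming writer buffers a bounded batch rather than a single row.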
Where DataMorph is limited
- Conversion only. No data transformation, aggregation, filtering, or joins. It converts formats, not shapes. If you need to pivot, group, or reshape data, use Pandas or a pipeline tool.
- 6 formats. Doesn't cover XML, TSV, ORC, HDF5, Excel, or custom formats. For exotic format pairs, you need Pandas with plugins.
- No UI. CLI only. Teams that prefer visual pipeline builders need NiFi or Airflow.
- Schema inference is best-effort. For complex nested schemas, you may need to provide an explicit schema file. The inference engine handles 90% of cases but struggles with deeply nested JSON or polymorphic types.
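Why inference is only best-effort is easier to see with a toy example. The sketch below infers one type per column by widening as it scans rows, and falls back to `string` when values conflict. The type names and widening rules are assumptions chosen for illustration. They are not DataMorph's actual inference engine.

```python
def infer_type(text):
    """Guess the narrowest type name for one CSV cell."""
    if text == "":
        return "null"
    if text.lower() in ("true", "false"):
        return "bool"
    for caster, name in ((int, "int"), (float, "float")):
        try:
            caster(text)
            return name
        except ValueError:
            pass
    return "string"


def widen(current, new):
    """Merge two observed types into one that can hold both."""
    if current is None or current == "null":
        return new
    if new == "null" or new == current:
        return current
    if {current, new} == {"int", "float"}:
        return "float"
    # Conflicting types (a polymorphic column) fall back to string
    return "string"


def infer_schema(rows):
    """Scan rows once, widening each column's type as values arrive."""
    schema = {}
    for row in rows:
        for column, value in row.items():
            schema[column] = widen(schema.get(column), infer_type(value))
    return schema


rows = [
    {"id": "1", "price": "9", "flag": "true", "code": "7"},
    {"id": "2", "price": "9.5", "flag": "", "code": "A7"},
]
print(infer_schema(rows))
```

The `code` column looks like an integer until the second row, then degrades to `string`. With deeply nested or polymorphic data these fallbacks multiply, which is the point at which an explicit schema file is the safer choice.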
Tool 2: Pandas -- The Programmatic Approach
Pandas -- Read, Transform, Write Any Data Format
- 30+ format plugins via read_* / to_* methods
- Full transformation: filter, group, pivot, join, aggregate
- Largest data science ecosystem
Pandas is the de facto standard for data manipulation in Python. It reads and writes 30+ formats, supports every transformation operation you can imagine, and has the largest ecosystem of plugins, tutorials, and Stack Overflow answers. If you're doing anything beyond format conversion -- filtering, aggregation, joins, pivoting -- Pandas is the right tool. But for simple format conversion, it's overkill.
Core workflow
# Install
pip install pandas pyarrow fastparquet pyyaml
# Convert CSV to Parquet
import pandas as pd
df = pd.read_csv('data.csv')
df.to_parquet('data.parquet')
# Convert JSON to YAML
import yaml
df = pd.read_json('config.json')
with open('config.yaml', 'w') as f:
    yaml.dump(df.to_dict(orient='records'), f)
# Convert Parquet to CSV
df = pd.read_parquet('analytics.parquet')
df.to_csv('analytics.csv', index=False)
# With transformation
df = pd.read_csv('data.csv')
df = df[df['status'] == 'active'] # filter
df = df.groupby('region').agg({'revenue': 'sum'}) # aggregate
df.to_parquet('active_revenue.parquet')
# Problem: loads entire file into memory
# 10GB CSV -> MemoryError on 8GB machine
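# Workaround: chunked processing
# append=True is only accepted by the fastparquet engine, and the first
# chunk must create the file before later chunks can append to it
for i, chunk in enumerate(pd.read_csv('large.csv', chunksize=10000)):
    chunk.to_parquet('output.parquet', engine='fastparquet', append=(i > 0))  # not straightforward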
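Chunked processing has a second pitfall beyond file modes: Pandas infers types per chunk, so the same column can change dtype partway through a file. The sketch below shows the drift and one way to pin the type. It writes CSV to an in-memory buffer so that it depends only on Pandas. The `convert_in_chunks` helper and sample data are assumptions for illustration.

```python
import io

import pandas as pd

# The third row has a missing score, which changes the inferred dtype
CSV_TEXT = "id,score\n1,10\n2,20\n3,\n4,40\n"


def convert_in_chunks(csv_text, chunksize, dtype=None):
    """Re-write CSV chunk by chunk; return per-chunk dtypes and the output."""
    out = io.StringIO()
    dtypes_seen = []
    reader = pd.read_csv(io.StringIO(csv_text), chunksize=chunksize, dtype=dtype)
    for i, chunk in enumerate(reader):
        dtypes_seen.append(str(chunk["score"].dtype))
        # Write the header only once, with the first chunk
        chunk.to_csv(out, header=(i == 0), index=False)
    return dtypes_seen, out.getvalue()


# Without a pinned dtype, the column drifts between chunks
drifting, _ = convert_in_chunks(CSV_TEXT, chunksize=2)

# Pinning a nullable integer dtype keeps every chunk consistent
pinned, output = convert_in_chunks(CSV_TEXT, chunksize=2, dtype={"score": "Int64"})

print(drifting)
print(pinned)
```

With a schema-carrying target such as Parquet, chunks with different dtypes no longer share one schema, so pinning dtypes up front is usually part of any chunked conversion script.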
What Pandas gets right
- Full transformation power. Filter, group, pivot, join, aggregate, reshape, merge, and melt. If you can describe a data transformation, Pandas can do it. No other tool comes close for expressiveness.
- 30+ format plugins. CSV, JSON, Parquet, Excel, SQL, HDF5, Feather, Stata, SAS, SPSS, ORC, Pickle, HTML, XML, and more, plus Avro through third-party packages. Whatever format your data is in, Pandas probably reads it.
- Largest ecosystem. Stack Overflow, documentation, tutorials, books, courses. Every data problem has a Pandas solution posted somewhere. The collective knowledge base is unmatched.
- Free and open source. BSD licensed. No tier limits, no row caps, no API keys.
- Programmable. Any custom logic -- no matter how unusual -- can be implemented in Python. Full Turing-complete control over every transformation step.
Where Pandas falls apart for format conversion
- Memory-bound. Pandas loads entire datasets into memory. A 10GB CSV file requires 10-30GB of RAM (depending on types). For large files, you need chunked processing -- which requires custom scripts and careful handling of headers, schemas, and file modes.
- Boilerplate. Every conversion requires a Python script with import, read, write. For a one-shot CSV-to-Parquet conversion, that's 3 lines of code plus 2 pip installs. Compare to DataMorph's single command.
- Inconsistent format APIs. `pd.read_csv()` works differently from `pd.read_json()`, which works differently from `pd.read_parquet()`. Different parameters, different behavior, different edge cases. You need to learn each format's quirks.
- No schema enforcement. Pandas infers types but doesn't enforce schemas. A Parquet file written by Pandas may have different column types than expected. Avro and Protobuf require explicit schemas that Pandas doesn't natively generate.
- Not CI/CD friendly. Requires Python runtime, virtual environment, and format-specific dependencies. Compare to DataMorph's single binary and exit codes.
Best for: Data transformations that go beyond format conversion -- filtering, aggregation, joins, reshaping. If you need to transform data while converting formats, Pandas is the right tool. If you just need to convert formats, DataMorph is faster and simpler.
Tool 3: Apache NiFi -- Visual Data Flow Pipelines
Apache NiFi -- Visual Data Flow Management
- 300+ built-in processors for data routing, transformation, format conversion
- Visual drag-and-drop UI for building data flows
- Built-in provenance tracking and data lineage
Apache NiFi is a data integration platform with a visual drag-and-drop interface for building data flows. You drag processors onto a canvas, connect them, configure properties, and NiFi routes, transforms, and converts data between formats. It's designed for teams that need recurring data pipelines with monitoring, alerting, and full data lineage tracking.
Core workflow
# Install (Docker recommended)
docker run -p 8443:8443 apache/nifi
# Or download and run manually
# 1. Open NiFi UI at https://localhost:8443/nifi
# 2. Drag processors onto canvas:
# - GetFile (read CSV from directory)
# - ConvertRecord (CSV to Parquet via AvroSchemaRegistry)
# - PutFile (write Parquet to output directory)
# 3. Connect processors with flow relationships
# 4. Configure schema registry, record readers/writers
# 5. Start the flow
# CLI alternative (NiFi Toolkit):
# Limited CLI support -- NiFi is primarily a UI tool
# Total setup time: 1-2 hours for first flow
# Subsequent flows: 15-30 minutes each
What Apache NiFi gets right
- Visual pipeline builder. Drag processors, connect them, configure properties. Non-programmers can build data pipelines. The UI makes complex flows understandable at a glance.
- 300+ processors. Format conversion, HTTP requests, database reads/writes, compression, encryption, routing, filtering, enrichment, and more. Almost any data operation has a built-in processor.
- Data provenance. Every data point is tracked from source to destination. Full lineage, debugging, and replay capabilities. You can trace any output back to its input.
- Built for production. Monitoring, alerting, back-pressure, flow versioning, role-based access control. NiFi is designed for 24/7 production data flows.
- Streaming native. Processes data as it arrives, not in batches. Supports real-time data flows with sub-second latency.
Where Apache NiFi is overkill for format conversion
- Platform, not a tool. NiFi is a Java application that runs its own web server, database, and flow repository. It's not a CLI command you run in a script. It's an always-on service that requires infrastructure.
- Heavy setup. Docker image is 1.5GB. JVM needs 2-4GB heap. First flow takes 1-2 hours to configure. For converting a CSV to Parquet, this is 100x more setup than the task requires.
- Visual-first design. The UI is the primary interface. CLI and API exist but are secondary. NiFi flows are hard to version control, diff, and review in PRs.
- Schema overhead. Record-based processors require Avro schemas and schema registries. A simple CSV-to-Parquet conversion needs: an AvroSchemaRegistry, a CSVReader, a ParquetWriter, and a ConvertRecord processor. That's 4 configuration screens for one conversion.
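- Weak CI/CD integration. NiFi flows don't fit naturally into Git PRs, CI pipelines, or infrastructure-as-code workflows. Flows are versioned through NiFi's own registry rather than reviewed as ordinary source files, so wiring them into CI takes extra tooling.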
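The schema overhead is easier to judge with an example of what such a schema looks like. The sketch below builds a small Avro record schema from a column-to-type mapping and prints it as JSON, the form stored in an `.avsc` file. The `avro_record_schema` helper, the type mapping, and the `Event` columns are assumptions for illustration. The output follows the Avro convention of a `["null", type]` union with a `null` default for optional fields.

```python
import json

# Assumed mapping from simple type names to Avro primitive types
AVRO_TYPES = {"int": "long", "float": "double", "bool": "boolean", "string": "string"}


def avro_record_schema(name, columns, nullable=()):
    """Build an Avro record schema dict from {column: simple_type}."""
    fields = []
    for column, kind in columns.items():
        avro_type = AVRO_TYPES[kind]
        if column in nullable:
            # Optional fields are a union with null, defaulting to null
            fields.append({"name": column, "type": ["null", avro_type], "default": None})
        else:
            fields.append({"name": column, "type": avro_type})
    return {"type": "record", "name": name, "fields": fields}


schema = avro_record_schema(
    "Event",
    {"id": "int", "email": "string", "score": "float"},
    nullable={"score"},
)
print(json.dumps(schema, indent=2))
```

Every record-based reader and writer in a flow needs to agree on a schema like this one, which is why even a simple conversion involves several configuration steps.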
Best for: Teams building recurring, monitored data pipelines with complex routing, multiple sources, and production reliability requirements. Not appropriate for one-shot format conversions -- use DataMorph for that.
Tool 4: AWS Glue -- Serverless Cloud ETL
AWS Glue -- Serverless Data Integration on AWS
- Serverless Spark-based ETL jobs
- Glue Studio: visual job builder
- Data Catalog for schema management
AWS Glue is a serverless ETL service that runs Apache Spark jobs on managed infrastructure. You define ETL jobs (visually in Glue Studio or in PySpark/Scala), point them at data sources, and Glue handles provisioning, execution, and monitoring. It's designed for AWS-native data pipelines that move data between S3, Redshift, RDS, and other AWS services.
Core workflow
# Via AWS Console:
# 1. Create a Glue Database and Table (or use a Crawler)
# 2. Create an ETL Job in Glue Studio (visual or code)
# 3. Configure source (S3 CSV), transform (format conversion), target (S3 Parquet)
# 4. Run the job (serverless Spark)
# 5. Monitor in Glue Console
# Via CLI:
aws glue create-job --name csv-to-parquet \
  --role arn:aws:iam::123456789012:role/GlueServiceRole \
  --command Name=glueetl,ScriptLocation=s3://my-bucket/scripts/convert.py \
  --default-arguments '{"--job-language":"python"}'
# Via PySpark (Glue job script):
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
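args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
# Initialize the job so run metadata and bookmarks are tracked
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read the CSV table registered in the Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
    database="my_db", table_name="csv_table"
)
# Write the same records to S3 as Parquet
datasink = glueContext.write_dynamic_frame.from_options(
    frame=datasource,
    connection_type="s3",
    connection_options={"path": "s3://output/parquet/"},
    format="parquet"
)
job.commit()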
# Cost: ~$0.44 per DPU-hour (2 DPU minimum = $0.88/hour; default 10 DPUs = $4.40/hour)
# Cold start: 1-3 minutes
# Setup time: 30-60 minutes
What AWS Glue gets right
- Serverless. No infrastructure to manage. Glue provisions Spark clusters, runs your job, and tears them down. Pay only for what you use.
- Spark-powered. Handles petabyte-scale data. Distributed processing across 10-100+ DPUs. If your data is huge, Glue scales to match.
- AWS-native integration. Reads from and writes to S3, Redshift, RDS, DynamoDB, Kinesis, and 30+ AWS services. Data Catalog provides centralized schema management.
- Visual job builder. Glue Studio lets you build ETL jobs visually. Less code for common transformations.
- Automatic schema inference. Glue Crawlers scan your data and infer schemas. No manual schema definition for S3 data.
Where AWS Glue is overkill for format conversion
- AWS-only. Tightly integrated with AWS services. Not designed for local file conversion or non-AWS data sources. Your data needs to be in S3 (or an AWS database) for Glue to process it.
- Metered even for small jobs. Spark jobs need at least 2 DPUs at $0.44/DPU-hour, so the meter starts at $0.88/hour. Recent Glue versions bill per second with a 1-minute minimum, so converting a 100MB CSV to Parquet costs cents rather than dollars per run, but every run is still billed and still pays the setup overhead -- compare to DataMorph's free tier.
- Cold starts. Serverless Spark takes 1-3 minutes to provision. For a 10-second conversion, you wait 2 minutes for infrastructure to start up.
- Complex setup. IAM roles, S3 buckets, Glue databases, crawlers, job configurations, and security policies. Not a tool you pick up for a quick conversion.
- Overkill for format conversion. Glue is an ETL platform, not a format converter. Using it for CSV-to-Parquet conversion is like using a cargo ship to cross a puddle.
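The per-job cost depends heavily on the minimum billing increment, so it is worth working through the arithmetic. The sketch below uses the $0.44 per DPU-hour rate quoted above. The `min_billed_seconds` default of one minute is an assumption based on how AWS describes recent Glue versions, and older versions or other regions may differ, so check current pricing before relying on these numbers.

```python
def glue_job_cost(dpus, runtime_seconds, rate_per_dpu_hour=0.44, min_billed_seconds=60):
    """Estimate the cost of one Glue job run in dollars.

    The billed duration is the runtime, rounded up to the minimum
    billing increment (assumed to be 60 seconds by default).
    """
    billed_seconds = max(runtime_seconds, min_billed_seconds)
    return round(dpus * rate_per_dpu_hour * billed_seconds / 3600, 4)


# A 10-second conversion on the 2-DPU minimum, 1-minute minimum billing
short_job = glue_job_cost(2, 10)

# The same job if a full hour were the minimum billing increment
hour_minimum = glue_job_cost(2, 10, min_billed_seconds=3600)

# A one-hour job on the default 10 DPUs
default_hour = glue_job_cost(10, 3600)

print(short_job, hour_minimum, default_hour)
```

Under a one-minute minimum the short job costs a little over a cent. It only reaches $0.88 if a full hour is billed, so the billing increment matters more than the DPU rate for small conversions.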
Best for: AWS-native data pipelines that move data between AWS services at petabyte scale. Not appropriate for local file format conversion -- use DataMorph for that.
Feature Comparison
| Capability | DataMorph | Pandas | Apache NiFi | AWS Glue |
|---|---|---|---|---|
| One-command format conversion | Yes | No (write script) | No (build flow) | No (create job) |
| Streaming (no memory limit) | Yes | No (in-memory) | Yes | Yes (Spark) |
| 6+ formats built-in | Yes (6 native) | Yes (30+ plugins) | Yes (20+ processors) | ~ (5 native + Spark) |
| Schema inference | Yes (auto) | ~ (type inference) | ~ (via registry) | Yes (Crawlers) |
| Batch file processing | Yes (*.csv --to) | No (custom script) | Yes (GetFile) | Yes (S3 scans) |
| Data transformation | No (conversion only) | Yes (full) | Yes (processors) | Yes (PySpark) |
| CI/CD friendly | Yes (exit codes) | ~ (Python scripts) | No (UI-based) | No (AWS Console/API) |
| Visual pipeline builder | No | No | Yes | ~ (Glue Studio) |
| Works offline | Yes | Yes | Yes (self-host) | No (AWS only) |
| Cost for small jobs | Free tier | Free | Free (self-host) | Billed per DPU (from $0.88/hour) |
| Open source | Yes (MIT) | Yes (BSD) | Yes (Apache 2.0) | No (AWS service) |
Use Case Comparison
| Use Case | DataMorph | Pandas | Apache NiFi | AWS Glue |
|---|---|---|---|---|
| Convert CSV to Parquet (one file) | Ideal | Works (3 lines + 2 installs) | No (overkill) | No (overkill + DPU billing) |
| Convert 100 CSV files to Parquet | Ideal (*.csv --to) | ~ (custom loop) | Works (GetFile flow) | No (S3 only) |
| Convert 10GB CSV (limited RAM) | Ideal (streaming) | No (OOM risk) | Works (streaming) | Works (Spark) |
| Filter + aggregate + convert format | No (convert only) | Ideal | Works | Works (PySpark) |
| CI/CD data validation step | Ideal | ~ (script + pytest) | No | No |
| Monitored production pipeline | No (CLI only) | No | Ideal | Ideal (AWS) |
| Petabyte-scale data processing | No (single node) | No (in-memory) | ~ (cluster mode) | Ideal (Spark) |
Cost Comparison
| Cost Factor | DataMorph | Pandas | Apache NiFi | AWS Glue |
|---|---|---|---|---|
| License/tool | MIT (free tier) | BSD (free) | Apache 2.0 (free) | Pay per DPU-hour |
| Dev time per conversion | 30 seconds | 5-15 minutes | 30-60 minutes | 30-60 minutes |
| Per-job cost | $0 (free tier) | $0 | $0 (self-host) | 2+ DPUs at $0.44/DPU-hour |
| Full suite (11 tools) | $49/mo | N/A | N/A | N/A |
When to Use Which
Use DataMorph when:
You need to convert between data formats (CSV, JSON, YAML, Parquet, Avro, Protobuf) without writing scripts. You want streaming for large files. You want batch processing for multiple files. You want CI/CD integration with validation. This covers 90% of format conversion needs for developers and data engineers.
Use Pandas when:
You need data transformation beyond format conversion -- filtering, aggregation, joins, pivoting, reshaping. You're already working in Python and need programmatic control. Your data fits in memory. The format conversion is a side effect of a larger data processing workflow.
Use Apache NiFi when:
You need recurring, monitored data pipelines with complex routing, multiple sources, and production reliability. You want visual pipeline building and data lineage tracking. NiFi is a platform, not a conversion tool -- use it when you need the full platform, not just format conversion.
Use AWS Glue when:
You need serverless Spark jobs for petabyte-scale data processing on AWS. Your data lives in S3, Redshift, or other AWS services. You need the Data Catalog for centralized schema management. Glue is AWS infrastructure for AWS data -- don't use it for local file conversions.
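The guidance above can be condensed into a small decision helper. This is one reading of the recommendations in this article, not a definitive rule. The function name `pick_tool` and its four flags are assumptions chosen for illustration.

```python
def pick_tool(*, needs_transformation, recurring_pipeline, data_in_aws, fits_in_memory):
    """Suggest a tool from the four discussed, following the guidance above."""
    # Recurring, monitored pipelines are platform problems
    if recurring_pipeline:
        return "AWS Glue" if data_in_aws else "Apache NiFi"
    # Transformations go to Pandas unless the data outgrows memory
    if needs_transformation:
        if fits_in_memory:
            return "Pandas"
        return "AWS Glue" if data_in_aws else "Apache NiFi"
    # Plain format conversion, any file size
    return "DataMorph"


one_off = pick_tool(
    needs_transformation=False,
    recurring_pipeline=False,
    data_in_aws=False,
    fits_in_memory=False,
)
analysis = pick_tool(
    needs_transformation=True,
    recurring_pipeline=False,
    data_in_aws=False,
    fits_in_memory=True,
)
print(one_off, analysis)
```

The order of the checks carries the main point: format conversion is the default case, and the heavier tools are justified only once a pipeline or transformation requirement appears.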
The Complementary Stack
These four tools solve different problems at different scales:
| Layer | Tool | Purpose |
|---|---|---|
| 1. Dev laptop | DataMorph | Convert formats instantly. Batch process directories. Validate data in CI. One command, no scripts, streaming for any file size. |
| 2. Data transformation | Pandas | Filter, aggregate, join, pivot, reshape data. Full Python expressiveness for any transformation logic. Use when format conversion is just one step in a larger pipeline. |
| 3. Production pipeline | Apache NiFi | Visual pipeline builder with monitoring, alerting, back-pressure, and data lineage. For recurring data flows that need operational visibility. |
| 4. Cloud scale | AWS Glue | Serverless Spark for petabyte-scale ETL on AWS. Data Catalog for schema management. S3/Redshift/RDS integration. Pay per DPU-hour. |
The key insight: most format conversions are Layer 1 problems. A developer needs to convert a CSV to Parquet, a JSON to YAML, or a directory of files from one format to another. DataMorph handles this in one command with streaming. Pandas, NiFi, and Glue solve different problems -- transformation, pipeline management, and cloud-scale processing -- that happen to include format conversion as a side effect.
Install DataMorph
# Install via pip
pip install datamorph-cli
# Or via Homebrew (macOS/Linux)
brew tap Coding-Dev-Tools/tap
brew install datamorph
# Or via Scoop (Windows)
scoop bucket add Coding-Dev-Tools https://github.com/Coding-Dev-Tools/scoop-bucket
scoop install datamorph
# Convert your data
datamorph convert data.csv --to parquet