Forget CSV or Excel files. Share your Data with Apache Parquet
What is Apache Parquet?
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval.
It provides:
- High-performance compression
- Efficient encoding schemes
- Support for complex data types
- Broad adoption across analytics tools and programming languages
Parquet files are commonly used as a file-based replacement for database tables in analytical workloads.
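The difference between row- and column-oriented layouts can be sketched in plain Python (toy data only; Parquet's actual on-disk encoding is more sophisticated):

```python
# Toy illustration of row- vs column-oriented layouts.
# This only shows why analytics favor the columnar form;
# it is not how Parquet actually stores bytes on disk.

rows = [  # row-oriented: one record per entry (like CSV)
    {"name": "vanilla dream", "price_cents": 350},
    {"name": "choco blast", "price_cents": 420},
    {"name": "lemon zing", "price_cents": 390},
]

columns = {  # column-oriented: one array per field (like Parquet)
    "name": ["vanilla dream", "choco blast", "lemon zing"],
    "price_cents": [350, 420, 390],
}

# An aggregate over one field scans a single contiguous array
# in the columnar layout, instead of touching every record.
avg_row = sum(r["price_cents"] for r in rows) / len(rows)
avg_col = sum(columns["price_cents"]) / len(columns["price_cents"])
assert avg_row == avg_col
```

This locality is what lets analytical engines skip the columns a query never reads.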
How to Create Parquet Files
There are several ways to create Parquet files:
- From CSV files
- Directly from PostgreSQL using the pg_parquet extension
- Using DuckDB
- Using web-based CSV to Parquet converters
In this post, DuckDB is used as a lightweight and practical tool to generate and query Parquet files.
Creating Parquet Files from CSV using DuckDB
Install DuckDB
brew install duckdb
Convert CSV files to Parquet
duckdb -c "COPY (SELECT * FROM 'data/*.csv') TO 'output.parquet' (FORMAT 'parquet');"
DuckDB automatically detects column types.
If the detected types are not correct, they can be defined explicitly.
Defining column types manually
duckdb -c "COPY (
SELECT *
FROM read_csv_auto(
'cupcakes.csv',
types = {'filling': 'VARCHAR'}
)
) TO 'cupcakes.parquet' (FORMAT 'parquet');"
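To try these commands end to end, a small cupcakes.csv can be generated with Python's stdlib. The column names here are illustrative, loosely following the schema used elsewhere in this post:

```python
# Generate a small sample cupcakes.csv for the conversion above.
# Column names and values are made up for illustration.
import csv

rows = [
    {"id": "c1", "name": "vanilla dream", "flavor": "VANILLA",
     "filling": "custard", "price_cents": 350},
    {"id": "c2", "name": "choco blast", "flavor": "CHOCOLATE",
     "filling": "ganache", "price_cents": 420},
    {"id": "c3", "name": "lemon zing", "flavor": "LEMON",
     "filling": "", "price_cents": 390},
]

with open("cupcakes.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["id", "name", "flavor", "filling", "price_cents"]
    )
    writer.writeheader()
    writer.writerows(rows)
```

The empty filling value for the third row is the kind of column where type auto-detection can guess wrong, which is why pinning it to VARCHAR is useful.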
Reading and Querying Parquet Files
DuckDB treats a Parquet file like a regular SQL table.
Inspect the schema
duckdb
DESCRIBE SELECT * FROM 'cupcakes.parquet';
Example output:
column_name column_type
--------------------- -----------
id VARCHAR
name VARCHAR
flavor VARCHAR
price_cents BIGINT
created_at TIMESTAMP
avg_rating DOUBLE
is_featured BOOLEAN
Query data
SELECT *
FROM 'cupcakes.parquet'
WHERE flavor = 'VANILLA';
Parquet Files as Tables
A helpful mental model:
One database table equals one Parquet file.
Because Parquet is column-oriented, it maps naturally to SQL tables and supports efficient analytical queries.
Example tables in this domain:
- cupcakes
- ingredients
- reviews
- sells
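Under this model, a hypothetical project directory simply holds one Parquet file per logical table:

```shell
# Hypothetical layout: one Parquet file per logical table.
# (touch stands in for the actual export step.)
mkdir -p bakery
touch bakery/cupcakes.parquet bakery/ingredients.parquet \
      bakery/reviews.parquet bakery/sells.parquet
ls bakery
```

No server process owns these files; any tool that speaks Parquet can open them directly.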
Joining Multiple Parquet Files
Joins work exactly like in a traditional database.
SELECT
c.id AS cupcake_id,
c.name AS cupcake_name,
s.quantity
FROM 'cupcakes.parquet' c
JOIN 'sells.parquet' s
ON c.id = s.cupcake_id;
Compression
Parquet files use efficient column-level compression by default.
This usually results in:
- Smaller file sizes compared to CSV
- Faster analytical queries
- Reduced I/O overhead
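One reason columnar files shrink so well is that Parquet can encode each column on its own terms before applying a general-purpose codec (such as Snappy or ZSTD). Dictionary encoding, one such column-level encoding, can be sketched with the stdlib:

```python
# Toy sketch of dictionary encoding, one of the column-level
# encodings Parquet applies to repetitive columns. Real Parquet
# writers combine this with run-length encoding and compression.
flavors = ["VANILLA", "CHOCOLATE", "VANILLA", "LEMON"] * 250  # 1000 values

dictionary = sorted(set(flavors))             # the few distinct values
index = {v: i for i, v in enumerate(dictionary)}
encoded = [index[v] for v in flavors]         # small ints instead of strings

raw_size = sum(len(v) for v in flavors)                     # bytes as text
enc_size = len(encoded) + sum(len(v) for v in dictionary)   # ~1 byte/index + dict
assert [dictionary[i] for i in encoded] == flavors          # lossless
print(raw_size, "->", enc_size)  # → 7000 -> 1021
```

A row-oriented file cannot do this per field, because values of different columns are interleaved.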
GeoParquet and Spatial Data
GeoParquet is an extension of Parquet that standardizes support for geospatial data.
More information: https://geoparquet.org/
With DuckDB's spatial extension (loaded via INSTALL spatial; LOAD spatial;), spatial operations can be executed directly on Parquet files.
Spatial query examples
SELECT *
FROM 'sells.parquet' sells, 'reviews.parquet' reviews
WHERE ST_Intersects(sells.geom, reviews.geom);
SELECT id, ST_Area(geom)
FROM 'sells.parquet';
SELECT id, ST_Buffer(geom, 100)
FROM 'sells.parquet';
DuckDB CLI Editing Commands
DuckDB’s CLI provides built-in SQL editing features.
Documentation: https://duckdb.org/docs/stable/clients/cli/editing
Conclusion
DuckDB combined with Parquet offers a powerful, serverless analytics stack:
- No database server required
- SQL-first workflow
- High performance
- Strong geospatial capabilities with GeoParquet
For many analytical use cases, Parquet files can fully replace traditional database tables.