DataFusion Benchmarks

The source code for these benchmarks is in the GitHub repository.


Benchmarks are hard, and usually biased (whether intentionally or not). Take these with a pinch of salt and obviously feel free to run your own benchmarks too. The benchmarks repo is open source and I’m always happy to accept pull requests.

Benchmark 1: Simple Aggregate Queries against NYC Taxi Data

This benchmark uses NYC Taxi Trip Record Data to test the performance of simple aggregate queries against single input files in CSV and Parquet format.

The benchmark uses this simple aggregate query:

SELECT passenger_count, COUNT(1), MIN(fare_amount), MAX(fare_amount)
FROM tripdata
GROUP BY passenger_count
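
To make the shape of the result concrete, here is a minimal sketch that runs the same query against a handful of made-up rows and times it. It is not the benchmark harness: it assumes Python's built-in sqlite3 module as a stand-in engine, the sample rows and the run_query helper are hypothetical, and an ORDER BY is added only so the output order is deterministic.

```python
import sqlite3
import time

# Hypothetical sample rows (passenger_count, fare_amount); not real trip data.
SAMPLE_ROWS = [
    (1, 5.0),
    (1, 12.5),
    (2, 7.0),
    (2, 30.0),
    (2, 9.5),
    (4, 15.0),
]

# Same aggregate as the benchmark query; ORDER BY is an addition for stable output.
QUERY = """
SELECT passenger_count, COUNT(1), MIN(fare_amount), MAX(fare_amount)
FROM tripdata
GROUP BY passenger_count
ORDER BY passenger_count
"""


def run_query(rows):
    """Load rows into an in-memory table, run the query, return (result, seconds)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE tripdata (passenger_count INTEGER, fare_amount REAL)"
        )
        conn.executemany("INSERT INTO tripdata VALUES (?, ?)", rows)
        start = time.perf_counter()
        result = conn.execute(QUERY).fetchall()
        elapsed = time.perf_counter() - start
        return result, elapsed
    finally:
        conn.close()


result, elapsed = run_query(SAMPLE_ROWS)
for row in result:
    print(row)  # one row per distinct passenger_count
print(f"query time: {elapsed:.6f} s")
```

Each output row holds a passenger count, the number of trips with that count, and the lowest and highest fare among them. Timings from a toy table like this say nothing about the engines compared here; the sketch only shows what is being measured.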

These are the current results for running this query against a single file (~800MB in CSV format). Times are in seconds.

File Format   DataFusion 0.2.11   Apache Spark 2.2.1
CSV           5.27                14.10
Parquet       1.60                2.07
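
For a quick read of the relative difference, the ratios can be worked out directly from the table. The sketch below only does that arithmetic; the dictionary layout and the speedup helper are illustrative names, and the ratios carry the same caveats as the timings themselves.

```python
# Times in seconds, copied from the results table above.
results = {
    "CSV": {"datafusion": 5.27, "spark": 14.10},
    "Parquet": {"datafusion": 1.60, "spark": 2.07},
}


def speedup(times):
    """Ratio of Spark time to DataFusion time (values above 1 favor DataFusion)."""
    return times["spark"] / times["datafusion"]


ratios = {fmt: round(speedup(t), 2) for fmt, t in results.items()}
for fmt, ratio in ratios.items():
    print(f"{fmt}: Spark time / DataFusion time = {ratio:.2f}")
```

On these numbers the gap is much wider for CSV (about 2.68x) than for Parquet (about 1.29x).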




Apache 2.0

