The source code for these benchmarks is in the https://github.com/datafusion-rs/benchmarks GitHub repository.
Benchmarks are hard, and usually biased (whether intentionally or not). Take these results with a pinch of salt, and feel free to run your own benchmarks too. The benchmarks repo is open source and I’m always happy to accept pull requests.
Benchmark 1: Simple Aggregate Queries against NYC Taxi Data
This benchmark uses NYC Taxi Trip Record Data to test the performance of simple aggregate queries against single input files in CSV and Parquet format.
The benchmark uses this simple aggregate query:
SELECT passenger_count, COUNT(1), MIN(fare_amount), MAX(fare_amount) FROM tripdata GROUP BY passenger_count
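To make it concrete what this query computes, here is a minimal sketch that runs the same SQL against a handful of rows in an in-memory SQLite table, using only the Python standard library. SQLite is just a stand-in to show the query's semantics; it is not DataFusion or Spark and says nothing about their performance. The sample rows, the two-column table schema, and the run_aggregate helper are made up for illustration and are not part of the benchmarks repo.

```python
import sqlite3

# Hypothetical sample rows (passenger_count, fare_amount); not real taxi data.
SAMPLE_ROWS = [
    (1, 7.5),
    (1, 12.0),
    (2, 30.25),
    (2, 5.0),
    (2, 18.5),
    (4, 9.0),
]

# The same aggregate query used by the benchmark.
QUERY = """
SELECT passenger_count, COUNT(1), MIN(fare_amount), MAX(fare_amount)
FROM tripdata
GROUP BY passenger_count
"""


def run_aggregate(rows):
    """Load rows into an in-memory SQLite table and run the benchmark query."""
    conn = sqlite3.connect(":memory:")
    try:
        # Assumed schema: only the two columns the query touches.
        conn.execute(
            "CREATE TABLE tripdata (passenger_count INTEGER, fare_amount REAL)"
        )
        conn.executemany("INSERT INTO tripdata VALUES (?, ?)", rows)
        # Sort in Python because GROUP BY alone does not guarantee row order.
        return sorted(conn.execute(QUERY).fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_aggregate(SAMPLE_ROWS):
        print(row)
```

Each output row is one passenger_count group with its trip count, minimum fare, and maximum fare. For the sample rows above, the group with passenger_count 2 comes out as (2, 3, 5.0, 30.25).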
These are the current results for running this query against a single file (~800MB in CSV format). Times are in seconds.
|File Format||DataFusion 0.2.11||Apache Spark 2.2.1|