DataFusion is an effort to build a modern distributed compute platform in the Rust programming language, with capabilities similar to Apache Spark, supporting ad-hoc queries and data processing using SQL and DataFrame APIs.
DataFusion is at a very early stage of development and cannot yet be used for distributed use cases. However, the following components are currently implemented:
- SQL Parser, Planner and Optimizer
- Support for local CSV and Parquet data sources
- Columnar processing using Apache Arrow
- Single-threaded execution of SQL queries, supporting:
  - Scalar Functions
  - Aggregates (Min, Max, Count)
- DataFrame API
- User-defined Scalar Functions (UDFs)
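To illustrate the columnar, single-threaded execution model behind the list above, here is a minimal sketch in plain Rust (this is illustrative only, not DataFusion's actual API): an aggregate such as Min/Max/Count makes a single pass over a contiguous column of values rather than iterating row by row.

```rust
/// Result of Min/Max/Count over one numeric column.
/// Min and Max are None when the column is empty.
#[derive(Debug, PartialEq)]
struct Aggregates {
    min: Option<i64>,
    max: Option<i64>,
    count: usize,
}

/// Single pass over a column of i64 values, columnar style:
/// the whole column is a contiguous slice, which is cache-friendly
/// compared to row-at-a-time processing.
fn aggregate(column: &[i64]) -> Aggregates {
    let mut agg = Aggregates { min: None, max: None, count: 0 };
    for &v in column {
        agg.min = Some(agg.min.map_or(v, |m| m.min(v)));
        agg.max = Some(agg.max.map_or(v, |m| m.max(v)));
        agg.count += 1;
    }
    agg
}

fn main() {
    let column = vec![42, 7, 99, -3, 16];
    let agg = aggregate(&column);
    println!("min={:?} max={:?} count={}", agg.min, agg.max, agg.count);
}
```

In a real engine, Apache Arrow provides this contiguous columnar memory layout, so the same one-pass pattern applies to Arrow arrays.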
A Docker image with a SQL console is available, making it easy to run queries against your own data sources and compare performance with other solutions.
DataFusion can also be used as a crate dependency if you need SQL capabilities against your own data sources. Here are some examples.
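To give a feel for what embedding a query engine over your own data source can look like, here is a hypothetical sketch in plain Rust. The names here (`DataSource`, `MemTable`, `count_where_gt`) are invented for illustration and are not DataFusion's actual API: the idea is that anything able to produce columnar batches can be scanned and queried.

```rust
/// A batch of values from one column, in columnar layout.
type Batch = Vec<i64>;

/// Anything that can produce batches of data can be queried.
/// (Illustrative trait, not part of DataFusion's API.)
trait DataSource {
    fn scan(&self) -> Vec<Batch>;
}

/// An in-memory table standing in for e.g. a CSV or Parquet file.
struct MemTable {
    batches: Vec<Batch>,
}

impl DataSource for MemTable {
    fn scan(&self) -> Vec<Batch> {
        self.batches.clone()
    }
}

/// Hand-evaluate "SELECT COUNT(*) WHERE value > threshold":
/// scan each batch and apply the predicate column-at-a-time.
fn count_where_gt(source: &dyn DataSource, threshold: i64) -> usize {
    source
        .scan()
        .iter()
        .map(|batch| batch.iter().filter(|&&v| v > threshold).count())
        .sum()
}

fn main() {
    let table = MemTable {
        batches: vec![vec![1, 5, 9], vec![12, 3]],
    };
    // Count rows with value > 4 across all batches.
    println!("{}", count_where_gt(&table, 4));
}
```

A SQL engine embedded as a crate automates exactly this kind of plan: parse the query, plan a scan over a registered data source, and evaluate predicates and aggregates batch by batch.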
Because DataFusion isn’t distributed yet, benchmarks only exist for single-node (in-process) queries. Current benchmarks are available here.
I am currently working towards a 0.3.0 release which will add more SQL capabilities and hopefully make DataFusion useful for some small subset of real-world use cases.
The next step will be to start work on the distributed capabilities, which you can read about in detail here.
The main reason for announcing this project at this early stage (other than creating awareness and getting valuable feedback) is to find more contributors to help build it out.
There is also a Gitter channel for general discussions about DataFusion.
Who is behind DataFusion?
My name is Andy Grove and I am currently the lead developer of this project. I have been building scalable distributed data platforms for many years, and I work with the Hadoop stack in my day job.
I work on DataFusion in my spare time, so progress is a little slow, but I hope to inspire others to get involved and help build this out.
You can follow me on Twitter (@andygrove73) to stay up to date with news about the project.