DataFusion is an effort to build a modern distributed compute platform in the Rust programming language, with capabilities similar to Apache Spark, supporting ad-hoc queries and data processing using SQL and DataFrame APIs.

Current Status

DataFusion is at a very early stage of development and cannot yet be used for distributed use cases. However, the following components are currently implemented:

A Docker image is available, with a SQL console, making it easy to try running queries against your own data sources to compare performance with other solutions.

DataFusion can also be used as a crate dependency if you need SQL capabilities against your own data sources. Here are some examples.
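To give a rough feel for what "SQL capabilities against your own data sources" means in practice, here is a minimal, self-contained sketch. It does not use the DataFusion crate or its API. The `DataSource` trait, the `MemorySource` and `City` types, and the `names_with_population_over` function are all hypothetical names invented for illustration, and the sample rows are made up. The sketch only shows the shape of the idea: a custom source exposes rows, and a query such as `SELECT name FROM cities WHERE population > 1000000` becomes a filter followed by a projection over those rows.

```rust
// Hypothetical sketch: none of these names come from the DataFusion crate.

/// One row of an in-memory "cities" table (illustrative data only).
#[derive(Debug, Clone, PartialEq)]
pub struct City {
    pub name: String,
    pub population: u64,
}

/// Hypothetical abstraction over "your own data source":
/// anything that can produce a stream of rows.
pub trait DataSource {
    fn scan(&self) -> Box<dyn Iterator<Item = City> + '_>;
}

/// A data source backed by a plain vector of rows.
pub struct MemorySource {
    pub rows: Vec<City>,
}

impl DataSource for MemorySource {
    fn scan(&self) -> Box<dyn Iterator<Item = City> + '_> {
        Box::new(self.rows.iter().cloned())
    }
}

/// Hand-written equivalent of:
///   SELECT name FROM cities WHERE population > min_population
pub fn names_with_population_over(source: &dyn DataSource, min_population: u64) -> Vec<String> {
    source
        .scan()
        .filter(|city| city.population > min_population) // WHERE clause
        .map(|city| city.name)                           // SELECT projection
        .collect()
}

/// Builds a small table of made-up rows for the example.
pub fn sample_source() -> MemorySource {
    let rows = vec![
        City { name: "Alpha".to_string(), population: 500_000 },
        City { name: "Beta".to_string(), population: 2_000_000 },
        City { name: "Gamma".to_string(), population: 1_000_000 },
        City { name: "Delta".to_string(), population: 3_500_000 },
    ];
    MemorySource { rows }
}

fn main() {
    let source = sample_source();
    for name in names_with_population_over(&source, 1_000_000) {
        println!("{}", name);
    }
}
```

A SQL engine used as a crate does the same work generically: it parses the query text, plans the filter and projection, and runs them against whatever source you plug in, so you do not write this logic by hand for every query.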

Performance

Because DataFusion isn’t distributed yet, benchmarks only exist for single node (in-process) queries. Current benchmarks are available here.

Roadmap

I am currently working towards a 0.3.0 release which will add more SQL capabilities and hopefully make DataFusion useful for some small subset of real-world use cases.

The next step will be to start work on the distributed capabilities, which you can read about in detail here.

Contributors Welcome!

The main reason for announcing this project at such an early stage, other than creating awareness and getting valuable feedback, is to find more contributors to help build it out.

Please do take a look at the source code and try the examples. There are some open issues and milestones defined that will give an idea of the work that is required.

There is also a Gitter channel for general discussions about DataFusion.

Who is behind DataFusion?

My name is Andy Grove and I am currently the lead developer of this project. I have been building scalable distributed data platforms for many years now and I work with the Hadoop stack in my day job.

I work on DataFusion in my spare time, so progress is a little slow, but I hope to inspire others to get involved and help build this out.

You can follow me on Twitter (@andygrove73) to stay up to date with news about the project.

Guides

Getting Started with Docker

License

Apache 2.0

Resources

Gitter, Release Notes, Roadmap, Source, Docker Repo