DataFusion Distributed Design (Planned)
The goal of this page is to explain how I see the distributed version of DataFusion working.
A DataFusion cluster will consist of one or more worker containers using etcd for discovery and leadership elections (if required).
The worker process will be built on top of the tokio framework and will support data interchange, most likely using the Apache Arrow IPC format. A REST API and a web user interface will also be included for monitoring the state of the worker.
Distributed Query Execution
Users can use a combination of DataFrame and SQL interactions to build a logical plan of a query to be executed on the cluster. DataFusion already supports execution of logical plans against local files. Adding support for distributed query execution requires some additional components:
- Distributed query planner
- Ability to serialize logical plans and delegate them to worker nodes
- Partitioning support
- Ability to stream data between nodes
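To illustrate the second component, here is a minimal sketch of what a serializable logical plan might look like. This is an assumption-laden simplification: the `LogicalPlan` variants and the line-based text encoding are purely illustrative, and a real implementation would more likely serialize to a binary format such as protobuf or the Arrow IPC format before sending the plan to a worker.

```rust
// Hypothetical sketch of a serializable logical plan, using a
// simplified subset of plan nodes. Not the actual DataFusion API.

#[derive(Debug, Clone, PartialEq)]
enum LogicalPlan {
    Scan { path: String },
    Filter { predicate: String, input: Box<LogicalPlan> },
    Projection { columns: Vec<String>, input: Box<LogicalPlan> },
}

impl LogicalPlan {
    /// Serialize the plan to a simple line-based text format so it
    /// could be sent to a worker node over the wire.
    fn serialize(&self) -> String {
        match self {
            LogicalPlan::Scan { path } => format!("scan {}", path),
            LogicalPlan::Filter { predicate, input } => {
                format!("filter {}\n{}", predicate, input.serialize())
            }
            LogicalPlan::Projection { columns, input } => {
                format!("project {}\n{}", columns.join(","), input.serialize())
            }
        }
    }
}

fn main() {
    // Build a plan equivalent to:
    //   SELECT id, amount FROM '/data/sales.csv' WHERE amount > 100
    let plan = LogicalPlan::Projection {
        columns: vec!["id".into(), "amount".into()],
        input: Box::new(LogicalPlan::Filter {
            predicate: "amount > 100".into(),
            input: Box::new(LogicalPlan::Scan {
                path: "/data/sales.csv".into(),
            }),
        }),
    };
    println!("{}", plan.serialize());
}
```

The distributed planner would walk a tree like this, split it at partition boundaries, and ship each serialized fragment to a worker for execution.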
Custom Code (UDFs)
Because Rust is not as flexible as the JVM when it comes to serializing closures, the initial deployment model will likely require the worker nodes to be compiled with custom code provided as a crate dependency. Once that is working, the next step will be to support dynamic loading of UDFs as shared objects.
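The "compiled-in custom code" model could work roughly as sketched below: the user's crate defines UDFs against a trait, and the worker binary registers them at startup. The `ScalarUdf` and `UdfRegistry` names are hypothetical, not existing DataFusion APIs, and real UDFs would operate on Arrow arrays rather than plain `f64` values.

```rust
// Hypothetical sketch of the crate-dependency UDF model: worker
// binaries depend on a user crate that registers UDFs at startup.

use std::collections::HashMap;

/// A scalar UDF: consumes argument values and produces one output per row.
trait ScalarUdf {
    fn name(&self) -> &str;
    fn invoke(&self, args: &[f64]) -> f64;
}

/// Registry the worker consults when a plan references a UDF by name.
struct UdfRegistry {
    udfs: HashMap<String, Box<dyn ScalarUdf>>,
}

impl UdfRegistry {
    fn new() -> Self {
        UdfRegistry { udfs: HashMap::new() }
    }

    fn register(&mut self, udf: Box<dyn ScalarUdf>) {
        self.udfs.insert(udf.name().to_string(), udf);
    }

    fn invoke(&self, name: &str, args: &[f64]) -> Option<f64> {
        self.udfs.get(name).map(|udf| udf.invoke(args))
    }
}

/// Example UDF that would be provided by the user's crate.
struct Pow2;

impl ScalarUdf for Pow2 {
    fn name(&self) -> &str {
        "pow2"
    }
    fn invoke(&self, args: &[f64]) -> f64 {
        args[0] * args[0]
    }
}

fn main() {
    let mut registry = UdfRegistry::new();
    registry.register(Box::new(Pow2));
    println!("{:?}", registry.invoke("pow2", &[3.0])); // Some(9.0)
}
```

Under the later shared-object model, the registration step would move from compile time to startup time: the worker would load a dynamic library and call an exported registration function instead of depending on the crate directly.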
More sophisticated techniques may be possible in the future, drawing inspiration from projects such as Weld.
I am currently leaning towards using Kubernetes, Docker, and etcd as the main deployment technologies.