Getting Started with the DataFusion Docker Image

Installation

First, pull the latest DataFusion Console Docker image:

docker pull datafusionrs/console

Now the console can be run in interactive mode:

docker run -it datafusionrs/console

Supported SQL Syntax

CREATE EXTERNAL TABLE (CSV)

A CSV file does not carry its own schema, so when importing one you must currently declare the columns explicitly, using the usual CREATE TABLE column syntax.

CREATE EXTERNAL TABLE table_name (column_name column_type, ..) 
STORED AS CSV [WITH | WITHOUT] HEADER ROW 
LOCATION path;

CREATE EXTERNAL TABLE (Parquet)

Importing a Parquet table does not require columns to be defined since the schema is contained in the Parquet file.

CREATE EXTERNAL TABLE table_name 
STORED AS PARQUET
LOCATION path;

SELECT

Only a small subset of ANSI SQL is currently supported: enough to run filtered and aggregate queries of the following form.

SELECT expr, ..
FROM table_name
[WHERE expr]
[GROUP BY expr, ..]

Supported scalar functions:

Supported aggregate functions:

Using the example data files

The Docker image contains some small sample files that can be used without mapping any volumes into the container. To run a container in interactive mode:

docker run -it datafusionrs/console

Run the following SQL to register one of the pre-installed CSV files as a table.

CREATE EXTERNAL TABLE uk_cities (city VARCHAR(100), lat DOUBLE, lng DOUBLE) 
STORED AS CSV WITHOUT HEADER ROW 
LOCATION '/opt/datafusion/data/uk_cities.csv';

Run queries against the table:

SELECT ST_AsText(ST_Point(lat, lng)) 
FROM uk_cities WHERE lat < 53.0;

Using your own files

To run SQL against local files, you need to map a volume into the Docker container so that the files are accessible inside it.

For example, if your files are stored in /mnt/ssd/data, pass the -v/mnt/ssd/data:/data option to docker run to make them available under the /data path inside the container.
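
The volume mapping can be sketched as a small shell helper. This is only an illustration: the function name `datafusion_run_cmd` is hypothetical and not part of DataFusion. It prints the `docker run` command for a given host directory rather than executing it, so the mapping can be checked first. Note that Docker treats a host path that is not absolute as a named volume, so the directory should be given as an absolute path.

```shell
# Hypothetical helper (not part of DataFusion): print the docker run
# command that maps a host directory to /data inside the container.
# Nothing is executed here; the command is only assembled and printed.
datafusion_run_cmd() {
    host_dir="$1"
    # -v HOST:CONTAINER makes HOST visible at CONTAINER inside the container.
    printf 'docker run -v %s:/data -it datafusionrs/console\n' "$host_dir"
}

cmd=$(datafusion_run_cmd /mnt/ssd/data)
echo "$cmd"
```

With this mapping, a file stored at /mnt/ssd/data/example.csv on the host (a made-up file name) would be referenced as /data/example.csv in a LOCATION clause.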

Interactive vs Script Mode

Interactive Mode

To run the console in interactive mode:

docker run -v/mnt/ssd/data:/data -it datafusionrs/console

Script Mode

The console can be run in non-interactive mode by passing a SQL script to execute with the --script command-line parameter. The script must be located in a volume that is mapped into the Docker container; otherwise the console cannot see it.

The output of script execution is written to stdout.

docker run -v/mnt/ssd/data:/data -it datafusionrs/console --script /data/path/to/script.sql
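
As a rough sketch of the script-mode workflow, the snippet below writes a SQL script into the mapped directory and prints the matching command. It makes several assumptions: `DATA_DIR` and the file name `cities.sql` are placeholders chosen for illustration, a temporary directory is used when `DATA_DIR` is unset, and the console is assumed to execute the statements of a script in order. The SQL itself is taken from the sample-data section above.

```shell
# Assumption: DATA_DIR stands in for the host directory mapped to /data.
# If it is unset, a temporary directory is used so the sketch is harmless.
DATA_DIR="${DATA_DIR:-$(mktemp -d)}"

# Write a script that reuses the statements from the sample-data section.
# The quoted 'EOF' keeps the shell from expanding anything inside the SQL.
cat > "$DATA_DIR/cities.sql" <<'EOF'
CREATE EXTERNAL TABLE uk_cities (city VARCHAR(100), lat DOUBLE, lng DOUBLE)
STORED AS CSV WITHOUT HEADER ROW
LOCATION '/opt/datafusion/data/uk_cities.csv';

SELECT ST_AsText(ST_Point(lat, lng))
FROM uk_cities WHERE lat < 53.0;
EOF

# Print the command instead of running it, so it can be reviewed first.
# Inside the container the script is visible as /data/cities.sql.
run_cmd="docker run -v $DATA_DIR:/data -it datafusionrs/console --script /data/cities.sql"
echo "$run_cmd"
```

The path given to --script is the path inside the container (/data/...), not the path on the host.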

License

Apache 2.0
