Datafusion
Fuse various data from different databases and data sources using Postgres, and generate a one-stop-shop for your BI activity with simple SQL.
Installation
$ gem install datafusion
Usage
This is the configurator part of Datafusion, which is used internally with the Docker image. You can use the docker image directly to get all of the functionality needed in one package.
However, if you are composing your own image, or just wanting an easy way to do foreign data wrapper, use the instructions below.
You should have an integrations.yaml
file (see below for more).
$ datafusion -f integrations.yaml
:
: SQL output...
:
.
The tool will spit out all of the necessary SQL setup code for your database to run.
You can pipe it to psql
or capture into a file to run with the psql -f
command:
Piping:
$ datafusion -f integrations.yaml | psql -U postgres
With a file:
$ datafusion -f integrations.yaml > /tmp/script.sql && psql -U postgres -f /tmp/script.sql
Not yet implemented:
You can use the -c flag to provide a connection in the form of a url to a postgres
database:
$ datafusion -f integrations.yaml -c posgres://postgres:pass@localhost:5432/mydb
Integrations.yaml
This tool uses a special specification for data sources, typically in a file called
integrations.yaml
. Here is an example:
postgres1:
kind: postgres
server:
address: localhost
port: 5432
username: u1
password: p1
dbname: users
tables:
- name: ware1
table_name: registrations
mapping:
id: TEXT
warehouse_id: TEXT
mysql1:
kind: mysql
server:
address: localhost
port: 3306
username: u1
password: p1
dbname: users
tables:
- name: ware1
table_name: registrations
mapping:
id: TEXT
warehouse_id: TEXT
The idea is to specify your databases or data source in a human-readable way once,
and have that parsed by datafusion and set up a postgres
instance to be able to
integrate with them and give you the ability to fuse and dissect your data across
sources.
Agent
This part documents the always-on agent, which schedules data refreshes across your defined integration sources.
Cached Tables
You can define a refresh schedule and a cached view on a table with the cached
key like so:
tables:
- name: ware1
database: db
collection: foobar
cached:
name: mt_ware1
query: select * from ware1
refresh: 1s
- A
refresh
is a natural language short for a time period like1m
,5hr
,15s
and so on. - The
query
should mostly be aselect-all
from your defined table name. - The
name
part is a name which is available for you during your regular SQL queries, so that you can mix-and-match real-time and soft-real-time (cached) data.
Refreshing
The agent comes with a built-in scheduler which is able to execute refresh queries on your postgres cluster.
Start it like so:
$ datafusion -f integrations.yaml -a postgres://postgres:@localhost
And it will immediately come to life, telling you how many schedules it maintains, and various output during the refresh process of the data.
It is advisable to keep this agent up directly under systemd
or upstart
, and look
for an ERROR
level logs for job failures.
Contributing
Fork, implement, add tests, pull request, get my everlasting thanks and a respectable place here :).
Thanks:
To all contributors
Copyright
Copyright (c) 2016 Dotan Nahum @jondot. See LICENSE for further details.