Datafusion

Fuse various data from different databases and data sources using Postgres, and generate a one-stop-shop for your BI activity with simple SQL.

Installation

$ gem install datafusion

Usage

This is the configurator part of Datafusion, which is used internally with the Docker image. You can use the docker image directly to get all of the functionality needed in one package.

However, if you are composing your own image, or just wanting an easy way to do foreign data wrapper, use the instructions below.

You should have an integrations.yaml file (see below for more).

$ datafusion -f integrations.yaml
:
:  SQL output...
:
.

The tool will spit out all of the necessary SQL setup code for your database to run. You can pipe it to psql or capture into a file to run with the psql -f command:

Piping:

$ datafusion -f integrations.yaml | psql -U postgres

With a file:

$ datafusion -f integrations.yaml > /tmp/script.sql && psql -U postgres -f /tmp/script.sql

Not yet implemented:

You can use the -c flag to provide a connection in the form of a url to a postgres database:

$ datafusion -f integrations.yaml -c posgres://postgres:pass@localhost:5432/mydb

Integrations.yaml

This tool uses a special specification for data sources, typically in a file called integrations.yaml. Here is an example:

postgres1:
  kind: postgres
  server:
    address: localhost
    port: 5432
    username: u1
    password: p1
    dbname: users
  tables:
    - name: ware1
      table_name: registrations
      mapping:
        id: TEXT
        warehouse_id: TEXT
mysql1:
  kind: mysql
  server:
    address: localhost
    port: 3306
    username: u1
    password: p1
    dbname: users
  tables:
    - name: ware1
      table_name: registrations
      mapping:
        id: TEXT
        warehouse_id: TEXT

The idea is to specify your databases or data source in a human-readable way once, and have that parsed by datafusion and set up a postgres instance to be able to integrate with them and give you the ability to fuse and dissect your data across sources.

Agent

This part documents the always-on agent, which schedules data refreshes across your defined integration sources.

Cached Tables

You can define a refresh schedule and a cached view on a table with the cached key like so:

tables:
  - name: ware1
    database: db
    collection: foobar
    cached:
      name: mt_ware1
      query: select * from ware1
      refresh: 1s

A refresh is a natural language short for a time period like 1m, 5hr, 15s and so on.
The query should mostly be a select-all from your defined table name.
The name part is a name which is available for you during your regular SQL queries, so that you can mix-and-match real-time and soft-real-time (cached) data.

Refreshing

The agent comes with a built-in scheduler which is able to execute refresh queries on your postgres cluster.

Start it like so:

$ datafusion -f integrations.yaml -a postgres://postgres:@localhost

And it will immediately come to life, telling you how many schedules it maintains, and various output during the refresh process of the data.

It is advisable to keep this agent up directly under systemd or upstart, and look for an ERROR level logs for job failures.

Contributing

Fork, implement, add tests, pull request, get my everlasting thanks and a respectable place here :).

Thanks:

To all contributors