`import blaze`

time for flexible data

Blaze presents a pleasant, familiar interface regardless of which computational solution or database sits behind it (e.g. Spark, Impala, SQL databases, NoSQL data stores, raw files). One Blaze query can work across data ranging from a CSV file to a distributed database.

introducing kinabalu

It's a bookstore! We've got:

books (Mongo)
users & their libraries (Postgres)
purchase history (CSV)
server logs (HTTP)

let's look at it

(open notebook, fiddle, watch things break…)

mixing it up

Things get really interesting when you can query across data sources, as more and more people have to do.

(open notebook again, see more things break…)

Now, some of you might be wondering why it wasn't all in Postgres to begin with (or AWS Redshift). Postgres can handle all kinds of data, whatever Uber says. But this is closer to a real-world scenario, where for lots of reasons things live in different places.

This is where we find out some of the real power of Blaze, which lets us combine our data in interesting ways.

(discuss notebook) Note: the queries are messy and complicated. This is what happens with data. Many people can make use of these conclusions – even devops can use the download time insights to figure out whether it's worth setting up a node in AWS Alaska.

A happy side effect of doing this in Blaze is that Blaze is optimizing our query for us, pushing down the query as far as it can into each backend, so that it has to do the least amount of work in-process.
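To make "pushing down" concrete, here is a minimal sketch that uses only the standard library's sqlite3 rather than Blaze itself. The table, columns, and sample rows are made up for illustration. Both approaches return the same rows; the difference is where the filtering happens and how many rows cross into the Python process.

```python
import sqlite3

# Hypothetical sample data standing in for a real backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT, balance REAL)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("alice", 120.0), ("bob", -35.5), ("carol", -0.5), ("dave", 0.0)],
)

# Without pushdown: fetch every row, then filter in the Python process.
all_rows = conn.execute("SELECT name, balance FROM accounts").fetchall()
in_process = [row for row in all_rows if row[1] < 0]

# With pushdown: the predicate travels to the database as a WHERE clause,
# so only matching rows are ever transferred.
pushed_down = conn.execute(
    "SELECT name, balance FROM accounts WHERE balance < ?", (0,)
).fetchall()

rows_fetched_without = len(all_rows)    # rows transferred without pushdown
rows_fetched_with = len(pushed_down)    # rows transferred with pushdown
```

On a four-row table the saving is trivial; on a large remote table it is the difference between shipping everything and shipping only the answer.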

It's also worth noting that any of these queries can be trivially translated to fully run on the appropriate backend, so your analysis on a Pandas dataframe here can be executed on your Redshift cluster by changing a single parameter.

backends

CSV
HDF5
SQL (using SQLAlchemy)
JSON
MongoDB
HTTP URIs (if the resource is in a supported format)
Spark (PySpark, SparkSQL)
bcolz

key APIs

resource

accounts = blaze.resource('postgres://cpa:cpa@server/db::accounts')

Translates a string pointing to data into a Python object pointing to that data.

new resource types are easy

from blaze import resource
from pandas import read_excel

# Raw string so that \. reaches the regex engine as an escaped dot.
@resource.register(r'.*\.(xls|xlsx)')
def resource_xls(uri, **kwargs):
    return read_excel(uri, **kwargs)
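For intuition about how pattern-based registration of this kind can work, here is a standard-library sketch. It is not Blaze's implementation: `register`, `open_resource`, `_registry`, and the `text` parameter are hypothetical names chosen for this example.

```python
import csv
import io
import re

# Hypothetical registry: (compiled pattern, handler) pairs.
_registry = []

def register(pattern):
    """Decorator that associates a URI regex with a loader function."""
    def decorator(func):
        _registry.append((re.compile(pattern), func))
        return func
    return decorator

def open_resource(uri, **kwargs):
    """Call the first registered handler whose pattern matches the whole URI."""
    for pattern, handler in _registry:
        if pattern.fullmatch(uri):
            return handler(uri, **kwargs)
    raise NotImplementedError(f"no handler registered for {uri!r}")

@register(r".*\.csv")
def resource_csv(uri, text=None, **kwargs):
    # For the sketch, CSV content may be passed in directly via `text`
    # instead of being read from disk.
    source = io.StringIO(text) if text is not None else open(uri, newline="")
    with source:
        return list(csv.DictReader(source))

rows = open_resource("purchases.csv", text="user,total\nann,12.50\n")
```

Supporting a new format then means writing one decorated function; nothing else in the dispatcher changes.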

compute

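# t is a symbolic placeholder; it only gets bound to real data inside compute().
t = blaze.symbol('t', 'var * {name: string, balance: float64}')
in_debt = blaze.compute(t[t.balance < 0], {t: accounts})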

Does all the work – evaluates an expression against a set of data sources.

data

accounts = blaze.data('postgres://cpa:cpa@server/db::accounts')
in_debt = blaze.compute(accounts[accounts.balance < 0])

Combines resource & compute – extremely handy for interactive exploration.

expressions

from blaze import (
    join, by, concat, transform, merge, abs, sqrt,
    sin, sinh, cos, cosh, tan, tanh, exp, expm1, log, log10, radians,
    degrees, ceil, floor, trunc, isnan, greatest, least, coerce, distinct,
    min, max, mean, std, count, map,
)

Supports a lot of expressions – documentation at http://blaze.readthedocs.io/en/latest/api.html#expressions doesn't cover all of them, but is a good start.

under the hood

expressions are trees

Expressions are internally described as trees of operations.

Lots of detail at http://blaze.readthedocs.io/en/latest/expr-design.html

>>> bz.to_tree(accounts['$'])
{
  'op': 'Field',
  'args': [{
      'op': 'Symbol',
      'args': ['_2', dshape("2 * {'$': int64, u: string}"), 0]
      },
    '$'],
}
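To get a feel for "expressions as trees", here is a small stand-alone sketch. The class names echo the output above (Symbol, Field) but are simplified stand-ins, not Blaze's classes, and `evaluate` is a hypothetical interpreter that works only on lists of dicts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Symbol:
    name: str            # placeholder leaf; data is supplied later

@dataclass(frozen=True)
class Field:
    child: object        # expression we take a column from
    column: str

@dataclass(frozen=True)
class Lt:
    left: object
    right: object        # a plain Python constant in this sketch

@dataclass(frozen=True)
class Selection:
    child: object
    predicate: object

def evaluate(expr, scope):
    """Recursively evaluate an expression tree against {Symbol: rows}."""
    if isinstance(expr, Symbol):
        return scope[expr]
    if isinstance(expr, Field):
        return [row[expr.column] for row in evaluate(expr.child, scope)]
    if isinstance(expr, Lt):
        return [value < expr.right for value in evaluate(expr.left, scope)]
    if isinstance(expr, Selection):
        rows = evaluate(expr.child, scope)
        keep = evaluate(expr.predicate, scope)
        return [row for row, flag in zip(rows, keep) if flag]
    raise TypeError(f"unknown node: {expr!r}")

t = Symbol("t")
in_debt = Selection(t, Lt(Field(t, "balance"), 0))   # like t[t.balance < 0]

rows = [{"name": "ann", "balance": 10}, {"name": "bob", "balance": -5}]
result = evaluate(in_debt, {t: rows})
```

The expression is built without touching any data; the rows only appear when the tree is evaluated, which is what lets the same tree be handed to different backends.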

pipeline

http://blaze.readthedocs.io/en/latest/computation.html

1. pre_compute all the leaves of the tree that represent data
2. optimize the expression
3. Try calling compute_down on the entire expression tree
4. Otherwise, traverse up the tree from the leaves, calling compute_up. Repeat this until the data significantly changes type (e.g. list to int after a sum operation)
5. Reevaluate optimize on the expression and pre_compute on all data elements.
6. Go to step 3
7. Call post_compute on the result

Step 1: This can, for instance, load data into memory
Step 2: Doesn't need explaining
Step 3: This lets us process whole chunks of the tree, if we can. For instance, a distributed processing backend could restructure the tree here and send off whole sub-expressions to be run on different workers.
Step 4: This is the most common function in Blaze, and encodes most of the logic, since it operates on the smallest and simplest units. For instance, addition would be a compute_up operation that expected two leaf nodes containing numbers.
Step 5: This happens when the shape of the data we're processing has changed enough that it's probably worth a re-optimize – there might be a shortcut we can take on the new structure we couldn't on the old.
Step 7: This handles the data once computation has been done. In the case of the SQL backend, for instance, the computation portion is really constructing an SQL query, and it's this that sends it off to the server and collects the results.
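The overall shape of that pipeline can be sketched in plain Python. This is a heavily simplified illustration, not Blaze's code: expressions are nested tuples, the hook bodies are invented for the example, and the re-optimize loop (steps 5 and 6) is left out entirely.

```python
def pre_compute(data):
    # Step 1: make leaf data usable, e.g. pull an iterator into memory.
    return list(data)

def optimize(expr):
    # Step 2: a real optimizer would rewrite the tree; this one is a no-op.
    return expr

def compute_down(expr, scope):
    # Step 3: a backend may handle the whole tree at once. This sketch
    # never can, which it signals by raising NotImplementedError.
    raise NotImplementedError

def compute_up(op, *args):
    # Step 4: handle one node whose children are already evaluated.
    if op == "add":
        values, constant = args
        return [v + constant for v in values]
    if op == "sum":
        (values,) = args
        return sum(values)
    raise NotImplementedError(op)

def post_compute(result):
    # Step 7: final clean-up; nothing to do for in-memory Python data.
    return result

def bottom_up(expr, scope):
    # Leaves are either names bound in scope or plain constants.
    if isinstance(expr, str):
        return scope[expr]
    if not isinstance(expr, tuple):
        return expr
    op, *children = expr
    return compute_up(op, *(bottom_up(child, scope) for child in children))

def compute(expr, scope):
    scope = {name: pre_compute(data) for name, data in scope.items()}
    expr = optimize(expr)
    try:
        result = compute_down(expr, scope)
    except NotImplementedError:
        result = bottom_up(expr, scope)
    return post_compute(result)

total = compute(("sum", ("add", "x", 1)), {"x": iter([1, 2, 3])})
```

A backend that can take the whole tree would implement `compute_down` and skip the node-by-node walk; everything else falls back to `compute_up`.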

multiple dispatch

Internally, a lot of Blaze is implemented as simple functions that handle just one combination of possible inputs to an expression – like here, we see the case where we're computing a selection on pure-Python data.

@dispatch(Selection, Sequence, Sequence)
def compute_up(expr, seq, predicate, **kwargs):
    preds = iter(predicate)
    return filter(lambda _: next(preds), seq)

(the decorator is from multipledispatch)

As you see, it's a pretty straightforward function. We're expecting two sequences, one of which is our data and one of which is a predicate (i.e. Boolean, include-or-exclude) column, and we just step through them together and filter out the items with a False-y predicate.

Of course, computing any arbitrary selection involves much more than just filtering based on sequences, but this approach allows us to implement one set of operations at a time. It also gives us a hint into how backends are implemented: there are handler functions for computations for different backend data types (from SQLAlchemy Selectable to NumPy ndarray). Thus adding backend support can be as simple as adding handlers for a few computations, or as complex as implementing all the supported expressions on a new set of data types.
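A hand-rolled sketch shows the idea of one handler per combination of input types. It is not how multipledispatch works internally (that library picks the most specific matching signature, while this toy takes the first match), and `Selection` here is an empty stand-in class. The handler names `select_sequence` and `select_columns` are hypothetical.

```python
from collections.abc import Sequence

_handlers = []   # (signature, function) pairs, checked in registration order

def dispatch(*types):
    """Register a function as the handler for one tuple of argument types."""
    def decorator(func):
        _handlers.append((types, func))
        return func
    return decorator

def compute_up(*args, **kwargs):
    """Find the first handler whose signature matches the positional args."""
    for types, func in _handlers:
        if len(types) == len(args) and all(
            isinstance(arg, typ) for arg, typ in zip(args, types)
        ):
            return func(*args, **kwargs)
    raise NotImplementedError([type(a).__name__ for a in args])

class Selection:
    """Stand-in for an expression node; carries no data in this sketch."""

@dispatch(Selection, Sequence, Sequence)
def select_sequence(expr, seq, predicate, **kwargs):
    preds = iter(predicate)
    return list(filter(lambda _: next(preds), seq))

@dispatch(Selection, dict, Sequence)
def select_columns(expr, columns, predicate, **kwargs):
    # Column-oriented data: filter every column with the same mask.
    return {
        name: [v for v, keep in zip(values, predicate) if keep]
        for name, values in columns.items()
    }

rows = compute_up(Selection(), ["ann", "bob", "cy"], [True, False, True])
cols = compute_up(Selection(), {"name": ["ann", "bob"]}, [False, True])
```

Adding support for another data type means registering another handler; the existing ones stay untouched.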

client-server for the win

blaze.server.Server can host your data
blaze.server.Client can be your API

accounts = data('blaze://accounts.bank.com:6363')
in_debt = accounts[accounts.balance < 0]

components & ecosystem

odo

accounts_table = odo.odo('accounts.csv', 'postgresql://cpa:cpa@server/db::accounts')

datashape

var * { name: string, balance: float64 }
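Read left to right, this datashape describes a variable-length collection (`var`) of records, each with a string `name` and a 64-bit float `balance`. A rough Python analogue, offered only as a sketch: the `Account` class and `matches` helper are hypothetical and not part of the datashape library.

```python
from typing import NamedTuple

class Account(NamedTuple):
    name: str        # datashape: string
    balance: float   # datashape: float64

def matches(rows):
    """Check that every row has the fields and types the datashape names."""
    return all(
        isinstance(row, Account)
        and isinstance(row.name, str)
        and isinstance(row.balance, float)
        for row in rows
    )

# "var" means the collection may have any length, including zero.
accounts = [Account("ann", 12.5), Account("bob", -3.0)]
```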