import blaze

time for flexible data


Blaze presents a pleasant and familiar interface regardless of which computational backend or database we use (e.g. Spark, Impala, SQL databases, NoSQL data stores, raw files). One Blaze query can work across data ranging from a CSV file to a distributed database.

introducing kinabalu

It's a bookstore! We've got:

  • books (Mongo)
  • users & their libraries (Postgres)
  • purchase history (CSV)
  • server logs (HTTP)

let's look at it

(open notebook, fiddle, watch things break…)

mixing it up

Things get really interesting when you can query across data sources, which more and more people have to do.

(open notebook again, see more things break…)

backends

  • CSV
  • HDF5
  • SQL (using SQLAlchemy)
  • JSON
  • MongoDB
  • HTTP URIs (if the resource is in a supported format)
  • Spark (PySpark, SparkSQL)
  • bcolz

key APIs

resource

accounts = blaze.resource('postgresql://cpa:cpa@server/db::accounts')

Translates a string pointing to data into a Python object pointing to that data.

new resource types are easy

from blaze import resource
from pandas import read_excel

# raw string, so that \. reaches the regex engine as a literal dot
@resource.register(r'.*\.(xls|xlsx)')
def resource_xls(uri, **kwargs):
    return read_excel(uri, **kwargs)
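
To show why registering a pattern is all it takes, here is a minimal sketch of a regex-dispatched registry using only the standard library. It is not Blaze's implementation: `register`, `toy_resource`, `load_excel` and `load_csv` are hypothetical names invented for this illustration, and the loaders only tag the URI instead of reading a file.

```python
import re

# Hypothetical stand-in for a regex-dispatched registry; not Blaze's own code.
_LOADERS = []  # (compiled pattern, loader function), in registration order


def register(pattern):
    """Decorator: associate a URI regex with a loader function."""
    def decorator(func):
        _LOADERS.append((re.compile(pattern), func))
        return func
    return decorator


def toy_resource(uri, **kwargs):
    """Call the first registered loader whose pattern matches the whole URI."""
    for regex, loader in _LOADERS:
        if regex.fullmatch(uri):
            return loader(uri, **kwargs)
    raise NotImplementedError("no loader registered for %r" % uri)


@register(r'.*\.(xls|xlsx)')
def load_excel(uri, **kwargs):
    # A real loader would call pandas.read_excel here; this one only tags the URI.
    return ('excel', uri)


@register(r'.*\.csv')
def load_csv(uri, **kwargs):
    return ('csv', uri)
```

Calling `toy_resource('books.xlsx')` is routed to `load_excel`, and an unrecognised extension raises `NotImplementedError`.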

compute

t = blaze.symbol('t', 'var * {name: string, balance: float64}')  # abstract table
in_debt = blaze.compute(t[t.balance < 0], {t: accounts})

Does all the work – evaluates an expression against a set of data sources.

data

accounts = blaze.data('postgresql://cpa:cpa@server/db::accounts')
in_debt = blaze.compute(accounts[accounts.balance < 0])

Combines resource & compute – extremely handy for interactive exploration.

expressions

from blaze import (
    join, by, concat, transform, merge, abs, sqrt,
    sin, sinh, cos, cosh, tan, tanh, exp, expm1, log, log10, radians,
    degrees, ceil, floor, trunc, isnan, greatest, least, coerce, distinct,
    min, max, mean, std, count, map,
)

Supports a lot of expressions. The documentation at http://blaze.readthedocs.io/en/latest/api.html#expressions doesn't cover all of them, but it is a good start.

under the hood

expressions are trees

Expressions are internally described as trees of operations.

Lots of detail at http://blaze.readthedocs.io/en/latest/expr-design.html

>>> bz.to_tree(accounts['$'])
{
  'op': 'Field',
  'args': [{
      'op': 'Symbol',
      'args': ['_2', dshape("2 * {'$': int64, u: string}"), 0]
      },
    '$'],
}
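
As a rough illustration of the same idea, the sketch below builds a tiny expression tree from dataclasses and serialises it into nested `op`/`args` dicts. All of the names here (`Symbol`, `Field`, `Lt`, `Selection`, `toy_to_tree`) are toy stand-ins defined for this example. They mimic the shape of the output above but are not Blaze's classes, and the real symbols carry more information (such as a datashape).

```python
from dataclasses import dataclass, fields, is_dataclass


# Toy node types, defined here for illustration only.
@dataclass(frozen=True)
class Symbol:
    name: str


@dataclass(frozen=True)
class Field:
    child: object
    name: str


@dataclass(frozen=True)
class Lt:
    lhs: object
    rhs: object


@dataclass(frozen=True)
class Selection:
    child: object
    predicate: object


def toy_to_tree(node):
    """Serialise a node into nested {'op': ..., 'args': [...]} dicts."""
    if not is_dataclass(node):
        return node  # plain values (strings, numbers) are leaves
    args = [toy_to_tree(getattr(node, f.name)) for f in fields(node)]
    return {'op': type(node).__name__, 'args': args}


t = Symbol('accounts')
# Roughly the toy equivalent of accounts[accounts.balance < 0]
in_debt = Selection(t, Lt(Field(t, 'balance'), 0))
tree = toy_to_tree(in_debt)
```

Nothing is computed when `in_debt` is built. The tree only records which operations were requested, which is what lets a backend decide later how to run them.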

pipeline

http://blaze.readthedocs.io/en/latest/computation.html

  1. Call pre_compute on all the leaves of the tree that represent data
  2. Call optimize on the expression
  3. Try calling compute_down on the entire expression tree
  4. Otherwise, traverse up the tree from the leaves, calling compute_up. Repeat this until the data significantly changes type (e.g. list to int after a sum operation)
  5. Reevaluate optimize on the expression and pre_compute on all data elements.
  6. Go to step 3
  7. Call post_compute on the result
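
The steps above can be sketched in plain Python. This is a heavily simplified toy, not Blaze's code: expressions are tuples such as `('sum', child)`, the hook functions only borrow the names from the list, `compute_down` never finds a shortcut, and the re-optimise loop of steps 5 and 6 is left out.

```python
def pre_compute(data):
    """Step 1 (toy): materialise leaf data so it can be traversed."""
    return list(data)


def optimize(expr):
    """Step 2 (toy): no rewrites in this sketch."""
    return expr


def compute_down(expr, leaves):
    """Step 3 (toy): whole-tree shortcut; None means no shortcut applies."""
    return None


def compute_up(expr, child):
    """Step 4 (toy): evaluate one node given its already-computed child."""
    op = expr[0]
    if op == 'field':
        return [row[expr[2]] for row in child]
    if op == 'sum':
        return sum(child)
    raise NotImplementedError(op)


def post_compute(result):
    """Step 7 (toy): final tidy-up; nothing to do for plain Python values."""
    return result


def _bottom_up(expr, leaves):
    # Recurse down to the leaf symbol, then apply compute_up on the way back.
    if expr[0] == 'symbol':
        return leaves[expr[1]]
    return compute_up(expr, _bottom_up(expr[1], leaves))


def toy_compute(expr, leaves):
    leaves = {name: pre_compute(data) for name, data in leaves.items()}
    expr = optimize(expr)
    result = compute_down(expr, leaves)
    if result is None:
        result = _bottom_up(expr, leaves)
    return post_compute(result)


accounts = [{'name': 'Alice', 'balance': 100}, {'name': 'Bob', 'balance': -250}]
expr = ('sum', ('field', ('symbol', 't'), 'balance'))
total = toy_compute(expr, {'t': iter(accounts)})
```

In a real backend, `compute_down` is where something like a SQL database would translate the whole tree into one query instead of walking it node by node.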

multiple dispatch

Internally, much of Blaze is implemented as simple functions that each handle just one combination of possible inputs to an expression. Here, for example, is the case where we're computing a selection on pure-Python data.

@dispatch(Selection, Sequence, Sequence)
def compute_up(expr, seq, predicate, **kwargs):
    preds = iter(predicate)
    return filter(lambda _: next(preds), seq)

(the decorator is from multipledispatch)
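
To make the mechanism concrete, here is a toy version of type-based dispatch written with the standard library only. `toy_dispatch` and `select` are names made up for this sketch. Unlike the multipledispatch library, which picks the most specific matching signature, this version simply takes the first registered signature that matches.

```python
_REGISTRY = {}  # function name -> list of (type signature, implementation)


def toy_dispatch(*types):
    """Register an implementation for one combination of argument types."""
    def decorator(func):
        name = func.__name__
        _REGISTRY.setdefault(name, []).append((types, func))

        def dispatcher(*args, **kwargs):
            # Pick the first implementation whose signature fits the arguments.
            for signature, impl in _REGISTRY[name]:
                if len(signature) == len(args) and all(
                        isinstance(a, t) for a, t in zip(args, signature)):
                    return impl(*args, **kwargs)
            raise NotImplementedError(
                'no implementation of %s for %s'
                % (name, [type(a).__name__ for a in args]))
        return dispatcher
    return decorator


@toy_dispatch(list, list)
def select(seq, predicate):
    # Pure-Python case: keep items whose matching predicate entry is true.
    return [item for item, keep in zip(seq, predicate) if keep]


@toy_dispatch(dict, list)
def select(mapping, keys):
    # A second combination of input types gets its own small function.
    return {k: mapping[k] for k in keys if k in mapping}
```

Both definitions share the name `select`, and the argument types decide which one runs. Supporting a new backend then means adding functions rather than editing existing ones.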

client-server for the win

  • blaze.server.Server can host your data
  • blaze.server.Client can be your API

accounts = blaze.data('blaze://accounts.bank.com:6363')
in_debt = accounts[accounts.balance < 0]

components & ecosystem

odo

accounts_table = odo.odo('accounts.csv', 'postgresql://cpa:cpa@server/db::accounts')

datashape

var * { name: string, balance: float64 }

downsides

what it doesn't do

  • Clean up your messy data
  • Most SciPy / SciKit operations
  • Make your existing data-handling code parallel
  • Make everything super fast

http://blaze.readthedocs.io/en/latest/what-blaze-isnt.html

poor mongodb support

debugging can be tough

thanks

contact: @necaris