import blaze
time for flexible data
Blaze presents a pleasant and familiar interface regardless of which computational solution or database we use (e.g. Spark, Impala, SQL databases, NoSQL data stores, raw files). One Blaze query can work across data ranging from a CSV file to a distributed database.
It's a bookstore! We've got:
(open notebook, fiddle, watch things break…)
Things get really interesting when you can query across data sources, as more and more people have to do.
(open notebook again, see more things break…)
accounts = blaze.resource('postgres://cpa:cpa@server/db::accounts')
Translates a string pointing to data into a Python object pointing to that data.
from blaze import resource
from pandas import read_excel

@resource.register('.*\.(xls|xlsx)')
def resource_xls(uri, **kwargs):
    return read_excel(uri, **kwargs)
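The registration pattern above can be tried out without Blaze or pandas. What follows is a minimal standard-library sketch, assuming a registry keyed by regular expressions; `register`, `open_resource`, `open_excel` and `open_csv` are hypothetical names made up for illustration, and this is not how Blaze itself is implemented.

```python
import re

# Hypothetical registry: a list of (compiled pattern, handler) pairs.
_HANDLERS = []


def register(pattern):
    """Decorator that associates a URI regex with a handler function."""
    def decorator(func):
        _HANDLERS.append((re.compile(pattern), func))
        return func
    return decorator


def open_resource(uri, **kwargs):
    """Call the first registered handler whose pattern matches the whole URI."""
    for regex, handler in _HANDLERS:
        if regex.fullmatch(uri):
            return handler(uri, **kwargs)
    raise NotImplementedError("no handler registered for %r" % uri)


@register(r'.*\.(xls|xlsx)')
def open_excel(uri, **kwargs):
    # Stand-in for pandas.read_excel: only reports what would be opened.
    return ('excel', uri)


@register(r'.*\.csv')
def open_csv(uri, **kwargs):
    # Stand-in for a CSV reader.
    return ('csv', uri)


excel_result = open_resource('books.xlsx')
csv_result = open_resource('accounts.csv')
```

The point of the design is that supporting a new kind of data means registering one more function, with no change to the code that calls `open_resource`.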
in_debt = blaze.compute(t[t.balance < 0], {t: accounts})
Does all the work – evaluates an expression against a set of data sources.
accounts = blaze.data('postgres://cpa:cpa@server/db::accounts')
in_debt = blaze.compute(accounts[accounts.balance < 0])
Combines resource & compute – extremely handy for interactive exploration.
from blaze import join, by, concat, transform, merge, abs, sqrt, sin, sinh, cos, cosh, tan, tanh, exp, expm1, log, log10, radians, \
    degrees, ceil, floor, trunc, isnan, greatest, least, coerce, distinct, min, max, mean, std, count, map
Supports a lot of expressions – documentation at http://blaze.readthedocs.io/en/latest/api.html#expressions doesn't cover all of them, but is a good start.
Expressions are internally described as trees of operations.
Lots of detail at http://blaze.readthedocs.io/en/latest/expr-design.html
>>> bz.to_tree(accounts['$'])
{
    'op': 'Field',
    'args': [{
        'op': 'Symbol',
        'args': ['_2', dshape("2 * {'$': int64, u: string}"), 0]
    }, '$'],
}
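To make the "tree of operations" idea concrete, here is a toy sketch in plain Python. The `Symbol` and `Field` classes and the `to_tree` function below are simplified stand-ins written for this example (they borrow the names seen in the output above but carry no datashape or other metadata), so treat it as an illustration of the structure and not as Blaze's actual classes.

```python
class Symbol:
    """Leaf node: a named placeholder for some data."""

    def __init__(self, name):
        self.name = name

    def __getitem__(self, field):
        # Indexing builds a new tree node instead of touching any data.
        return Field(self, field)


class Field:
    """Interior node: selects one named field from its child expression."""

    def __init__(self, child, field):
        self.child = child
        self.field = field


def to_tree(expr):
    """Serialize an expression into nested dicts with 'op' and 'args' keys."""
    if isinstance(expr, Symbol):
        return {'op': 'Symbol', 'args': [expr.name]}
    if isinstance(expr, Field):
        return {'op': 'Field', 'args': [to_tree(expr.child), expr.field]}
    raise TypeError("unknown expression node: %r" % (expr,))


accounts = Symbol('accounts')
tree = to_tree(accounts['balance'])
# tree == {'op': 'Field',
#          'args': [{'op': 'Symbol', 'args': ['accounts']}, 'balance']}
```

Building the expression does no work on any data; it only records what was asked for, which is what lets the same expression be evaluated later against different backends.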
http://blaze.readthedocs.io/en/latest/computation.html
pre_compute all the leaves of the tree that represent data
optimize the expression
compute_down on the entire expression tree
compute_up. Repeat this until the data significantly changes type (e.g. list to int after a sum operation)
optimize on the expression and pre_compute on all data elements.
post_compute on the result
Internally, a lot of Blaze is implemented as simple functions that handle just one combination of possible inputs to an expression – like here, we see the case where we're computing a selection on pure-Python data.
@dispatch(Selection, Sequence, Sequence)
def compute_up(expr, seq, predicate, **kwargs):
    preds = iter(predicate)
    return filter(lambda _: next(preds), seq)
(the decorator is from multipledispatch)
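The combination of a bottom-up tree walk and per-type `compute_up` functions can be sketched with the standard library alone. This is a deliberately reduced model under stated assumptions: the `dispatch` decorator here is a hand-rolled lookup table, not the one from multipledispatch; the `Symbol`, `Selection` and `Count` classes are toy stand-ins; data is looked up by symbol name; and the `pre_compute`, `optimize`, `compute_down` and `post_compute` stages are left out entirely.

```python
from collections.abc import Sequence

# Hypothetical lookup table: (expression type, data type) -> implementation.
_IMPLS = {}


def dispatch(expr_type, data_type):
    """Register a function for one (expression type, data type) pair."""
    def decorator(func):
        _IMPLS[(expr_type, data_type)] = func
        return func
    return decorator


class Symbol:
    def __init__(self, name):
        self.name = name


class Selection:
    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate


class Count:
    def __init__(self, child):
        self.child = child


def compute_up(expr, data):
    """Find an implementation matching both the node type and the data type."""
    for (expr_type, data_type), func in _IMPLS.items():
        if isinstance(expr, expr_type) and isinstance(data, data_type):
            return func(expr, data)
    raise NotImplementedError(
        "no compute_up for %s on %s" % (type(expr).__name__, type(data).__name__))


@dispatch(Selection, Sequence)
def select_sequence(expr, seq):
    # Keep only the rows for which the predicate holds.
    return [row for row in seq if expr.predicate(row)]


@dispatch(Count, Sequence)
def count_sequence(expr, seq):
    # The result is an int, so the data "changes type" at this node.
    return len(seq)


def compute(expr, scope):
    """Evaluate the leaves first, then apply compute_up at each parent node."""
    if isinstance(expr, Symbol):
        return scope[expr.name]
    return compute_up(expr, compute(expr.child, scope))


accounts = Symbol('accounts')
rows = [('Alice', 100), ('Bob', -200), ('Edith', -50)]
in_debt = Selection(accounts, lambda row: row[1] < 0)

debtors = compute(in_debt, {'accounts': rows})
how_many = compute(Count(in_debt), {'accounts': rows})
```

Here `debtors` holds the two rows with negative balances and `how_many` is their count. Supporting another backend would mean registering more `(expression, data)` pairs, leaving `compute` untouched.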
blaze.server.Server can host your data
blaze.server.Client can be your API
accounts = data('blaze://accounts.bank.com:6363')
in_debt = accounts[accounts.balance < 0]
df = odo.odo('accounts.csv', 'postgresql://accounts::db')
var * { name: string, balance: float64 }
contact: @necaris