
The most common SQLite interface I've seen in Python is sqlite3, but is there anything that works well with NumPy arrays or recarrays? By that I mean one that recognizes data types and does not require inserting row by row, and extracts into a NumPy (rec)array...? Kind of like R's SQL functions in the RDB or sqldf libraries, if anyone is familiar with those (they import/export/append whole tables or subsets of tables to or from R data tables).

2 Comments
  • I'd recommend having a look at PyTables. It uses HDF5 as a backend, not SQLite, but it also supports powerful querying. Commented Oct 26, 2011 at 11:21
  • Thanks, but I want it to be in SQLite so R can have a go at it (R does much better with SQL tables than HDF5 files). Commented Oct 26, 2011 at 14:03

4 Answers


Why not give redis a try?

Drivers for both platforms of interest are available: Python (the redis package, via the Python Package Index) and R (rredis, on CRAN).

The genius of redis is not that it will magically recognize NumPy data types and let you insert and extract multi-dimensional NumPy arrays as if they were native redis datatypes; its genius is the remarkable ease with which you can build such an interface yourself with just a few lines of code.

There are several tutorials on redis in Python; the one on the DeGizmo blog is particularly good.

# start the redis server first (e.g., from a shell prompt):
#   $> cd /usr/local/bin      # default install directory on 'nix
#   $> redis-server           # starts the redis server

import numpy as NP
from redis import Redis

# create some data
A = NP.random.randint(0, 10, 40).reshape(8, 5)

# a couple of utility functions to (i) flatten a NumPy row to a compact string
# prior to insertion into the redis db for more compact storage &
# (ii) restore the original integer values upon retrieval from the redis db
fnx = lambda v: ''.join(map(str, v))      # e.g., [6, 3, 2, 9, 5] -> '63295'
fnx2 = lambda v: list(map(int, v))        # e.g., '63295' -> [6, 3, 2, 9, 5]

# start the redis client; decode_responses=True returns str rather than bytes
r0 = Redis(db=0, port=6379, host='localhost', decode_responses=True)

# to insert an item using the redis 'string' datatype, call 'set' on the
# database, r0, and just pass in a key and the item to insert
r0.set('k1', fnx(A[0, :]))

# for row-wise insertion of the 2D array into redis, iterate over its rows:
for c in range(A.shape[0]):
    r0.set("k{0}".format(c), fnx(A[c, :]))

# or insert all rows at once:
# use 'mset' ('multi set') and pass in a key-value mapping
x = {"k{0}".format(i): fnx(row) for i, row in enumerate(A.tolist())}
r0.mset(x)

# to retrieve a row, pass its key to 'get'
>>> r0.get('k0')
  '63295'

# to retrieve the entire array from redis, fetch all keys; redis returns
# keys in arbitrary order, so sort them to preserve row order
kx = sorted(r0.keys('*'))

# to rebuild the array in its original form:
A = []
for key in kx:
    A.append(fnx2(r0.get(key)))

>>> A = NP.array(A)
>>> A
  array([[6, 2, 3, 3, 9],
         [4, 9, 6, 2, 3],
         [3, 7, 9, 5, 0],
         [5, 2, 6, 3, 4],
         [7, 1, 5, 0, 2],
         [8, 6, 1, 5, 8],
         [1, 7, 6, 4, 9],
         [6, 4, 1, 3, 6]])

1 Comment

This is a great idea in general for exchanging data/information between programs! -- thanks for the example also. Unfortunately redis is an in-memory solution? And the data I want to exchange is quite large so being able to use a hard-disk file is desirable...

Doug's suggestion with redis is quite good, but I think his code is a bit complicated and, as a result, rather slow. For my purposes, I had to serialize+write and then grab+deserialize a square matrix of about a million floats in less than a tenth of a second, so I did this:

For writing:

import numpy as np
import redis

rs = redis.Redis()                  # connect to a local redis server
snapshot = np.random.randn(1024, 1024)
serialized = snapshot.tobytes()
rs.set('snapshot_key', serialized)

Then for reads:

s = rs.get('snapshot_key')
deserialized = np.frombuffer(s).astype(np.float32)
rank = np.sqrt(deserialized.size).astype(int)
snap = deserialized.reshape(rank, rank)

You can do some basic performance testing in IPython with %time; neither tobytes nor frombuffer takes more than a few milliseconds.
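
As an aside (a sketch, not from the original answer): if you'd rather not reconstruct dtype and shape by hand, you can let NumPy serialize its own header with np.save/np.load; the key name 'snapshot_key' is reused purely for illustration.

import io
import numpy as np
import redis

rs = redis.Redis()

snapshot = np.random.randn(1024, 1024).astype(np.float32)

# write: np.save records dtype and shape in the buffer's header
buf = io.BytesIO()
np.save(buf, snapshot)
rs.set('snapshot_key', buf.getvalue())

# read: np.load restores dtype and shape, no manual bookkeeping
snap = np.load(io.BytesIO(rs.get('snapshot_key')))
assert snap.dtype == np.float32 and snap.shape == (1024, 1024)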

2 Comments

Great suggestion, but watch out with np.frombuffer(s).astype(np.float32): this will parse the buffer s as if it had dtype np.float64 and then cast it to np.float32. If the original matrix had dtype np.float32, this will return a matrix that is half the size. It is better to use the dtype argument of np.frombuffer (i.e. np.frombuffer(s, dtype=np.float32)).
Good catch, I was being sloppy here, especially with the Python int mixed in too.

This question is a bit older, but is there any reason you cannot just do a fetchall() instead of iterating, and then initialize the NumPy array on declaration? A sketch of that idea follows.
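
A minimal sketch of the idea, assuming a table named data with two REAL columns (the database filename, table, and column names are all illustrative):

import sqlite3
import numpy as np

conn = sqlite3.connect('example.db')          # illustrative filename
cur = conn.execute('SELECT x, y FROM data')   # assumes a table "data" with REAL columns x, y

# fetchall() returns a list of tuples, which np.array consumes in one shot;
# no row-by-row appends needed
rows = cur.fetchall()
arr = np.array(rows, dtype=np.float64)        # shape: (n_rows, 2)
conn.close()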



I found at least three Python packages to interface SQLite and NumPy, among them RecSQL and esutil's sqlite_util.

Each of these packages has to deal with the problem that SQLite (by default) only understands standard Python types and not the NumPy data types such as numpy.int64.
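
For example (a sketch using the standard library's adapter hook; the table and values are illustrative), you can teach sqlite3 to accept NumPy scalars:

import sqlite3
import numpy as np

# sqlite3 raises an error for unsupported types such as numpy.int64;
# registering adapters converts them to plain Python types on the way in
sqlite3.register_adapter(np.int64, int)
sqlite3.register_adapter(np.float64, float)

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (a INTEGER, b REAL)')
conn.execute('INSERT INTO t VALUES (?, ?)', (np.int64(1), np.float64(2.5)))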

RecSQL 0.7.8+ works for me (most of the time), but I consider it a pretty bad hack; glancing over the code, esutil.sqlite_util appears to be more mature.

