Using tf.py_func to generate input data

Question

Python version = 3.6.3 Tensorflow version = 1.3.0

I have worked in Keras but am now trying to work directly in TensorFlow. I am trying to implement the equivelent of Keras's fit_generator whereby I don't have to have all training data loaded into memory at the start but can feed it into the network as needed for training. The code below represents my attempt to start something like that, but if I am going about this all wrong I would love to know where in the docs I should look and what keywords I should use to search for this.

My system is currently based on a generator that reads through sqlite database files to extract np.arrays and then converts them into my desired data shape (a time series with one prediction for forward). I am trying to migrate that system now to work with Tensorflow Datasets and getting stuck at applying tf.py_func. Here is how I am trying to work right now

import tensorflow as tf
import os
from tensorflow.contrib.data import Dataset, Iterator

import sqlite3
import pandas as pd
import numpy as np

LOOKBACK_ROWS = 600 
DATA_DIR = '/mnt/derived_data/processedData'

files = os.listdir(DATA_DIR)

def data_from_files(f):
    with sqlite3.connect(DATA_DIR + f) as conn:
        results = conn.execute("SELECT col1, col2, FROM tbl")
        col_names = [d[0] for d in results.description]
        arr = np.array(results.fetchall())

    num_obs = arr.shape[0] - LOOKBACK_ROWS + 1

    X = np.zeros((num_obs, LOOKBACK_ROWS, 1), dtype = np.float32)
    Y = np.zeros((num_obs, 1), dtype = np.float32)

    for i in range(num_obs):
        idx = i + LOOKBACK_ROWS - 1
        X[i , :, 0] = arr[(idx - LOOKBACK_ROWS + 1):(idx + 1), 0]
        Y[i, 0] = arr[idx, 1]

    return tf.convert_to_tensor(X, name = 'X'), tf.convert_to_tensor(Y, name = 'Y')

filenames = tf.constant(files)

dataset = Dataset.from_tensor_slices((filenames))

dataset = dataset.map(lambda filename: tuple(tf.py_func(
    data_from_files,
    [filename],
    [tf.float32, tf.float32])))


iterator     = Iterator.from_structure(dataset.output_types, dataset.output_shapes)
next_element = iterator.get_next()
dataset_init_op = iterator.make_initializer(dataset)

with tf.Session() as sess:
    sess.run(dataset_init_op)

    while True:
        try:
            elem = sess.run(next_element)
            print('Success')
        except tf.errors.OutOfRangeError:
            print('End of dataset.')
            break

The initializations run fine, but then when I start the session and run I get the following errors:

  2017-10-16 16:58:45.227612: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
    2017-10-16 16:58:45.227615: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
    2017-10-16 16:58:45.227620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0)
    2017-10-16 16:58:45.276138: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: TypeError: must be str, not bytes
    2017-10-16 16:58:45.276306: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: TypeError: must be str, not bytes
         [[Node: PyFunc = PyFunc[Tin=[DT_STRING], Tout=[DT_FLOAT, DT_FLOAT], token="pyfunc_0"](arg0)]]
    Traceback (most recent call last):
      File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
        return fn(*args)
      File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
        status, run_metadata)
      File "/opt/python/3.6.3/lib/python3.6/contextlib.py", line 88, in __exit__
        next(self.gen)
      File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
        pywrap_tensorflow.TF_GetCode(status))
    tensorflow.python.framework.errors_impl.InvalidArgumentError: TypeError: must be str, not bytes
         [[Node: PyFunc = PyFunc[Tin=[DT_STRING], Tout=[DT_FLOAT, DT_FLOAT], token="pyfunc_0"](arg0)]]
         [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[<unknown>, <unknown>], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](Iterator)]]

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/usr/code/nn/data_folder/pipeline.py", line 51, in <module>
        elem = sess.run(next_element)
      File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 895, in run
        run_metadata_ptr)
      File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1124, in _run
        feed_dict_tensor, options, run_metadata)
      File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
        options, run_metadata)
      File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
        raise type(e)(node_def, op, message)
    tensorflow.python.framework.errors_impl.InvalidArgumentError: TypeError: must be str, not bytes
         [[Node: PyFunc = PyFunc[Tin=[DT_STRING], Tout=[DT_FLOAT, DT_FLOAT], token="pyfunc_0"](arg0)]]
         [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[<unknown>, <unknown>], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](Iterator)]]
    >>> python.el: native completion setup loaded
    >>>

Questions

(1) This seems like exactly a use case for py_func but am I wrong about that? If not, can anyone point me to some resources that show more in depth than the Tensorflow docs? (I did notice one potential related issue on git: https://github.com/tensorflow/tensorflow/issues/12396 but the fix of wrapping everything with a tuple did not help me).

(2) What is the general flow I should be following, particularly where I want to start with something like a bunch of filenames and output more than one training Example per file name?

Thank you.

Below I rewrote my script so that it can be a self-contained runnable example. I believe the issue is still the same as in the code above, but I am repasting the error as well to confirm.

Self contained runnable code example incorporating changes from @mrry's answer:

import tensorflow as tf
import os
import numpy as np

LOOKBACK_ROWS = 600 

arr = np.random.random_sample((2000, 2))
np.save("npfile.npy", arr)

def data_from_files(f):

    arr = np.load(f)
    num_obs = arr.shape[0] - LOOKBACK_ROWS + 1

    X = np.zeros((num_obs, LOOKBACK_ROWS, 1), dtype = np.float32)
    Y = np.zeros((num_obs, 1), dtype = np.float32)

    for i in range(num_obs):
        idx = i + LOOKBACK_ROWS - 1
        X[i , :, 0] = arr[(idx - LOOKBACK_ROWS + 1):(idx + 1), 0]
        Y[i, 0] = arr[idx, 1]

    return X, Y

files = ["npfile.npy"]
filenames = tf.constant(files)


# NOTE: In TensorFlow 1.4, `tf.contrib.data` is now `tf.data`.
dataset = tf.contrib.data.Dataset.from_tensor_slices(filenames)

# NOTE: In TensorFlow 1.4, the `tuple` is no longer needed.
dataset = dataset.map(lambda filename: tuple(tf.py_func(
    data_from_files,
    [filename],
    [tf.float32, tf.float32])))

# NOTE: If you only have one `Dataset`, you do not need to use
# `Iterator.from_structure()`.
iterator     = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)

    while True:
        try:
            elem = sess.run(next_element)
            print('Success')
        except tf.errors.OutOfRangeError:
            print('End of dataset.')
            break

Error:

2017-10-16 18:30:44.143668: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-10-16 18:30:44.143672: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-10-16 18:30:44.143679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0)
2017-10-16 18:30:44.190852: W tensorflow/core/framework/op_kernel.cc:1192] Unknown: AttributeError: 'bytes' object has no attribute 'read'
2017-10-16 18:30:44.190959: W tensorflow/core/framework/op_kernel.cc:1192] Unknown: AttributeError: 'bytes' object has no attribute 'read'
     [[Node: PyFunc = PyFunc[Tin=[DT_STRING], Tout=[DT_FLOAT, DT_FLOAT], token="pyfunc_0"](arg0)]]
Traceback (most recent call last):
  File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
    return fn(*args)
  File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
    status, run_metadata)
  File "/opt/python/3.6.3/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.UnknownError: AttributeError: 'bytes' object has no attribute 'read'
     [[Node: PyFunc = PyFunc[Tin=[DT_STRING], Tout=[DT_FLOAT, DT_FLOAT], token="pyfunc_0"](arg0)]]
     [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[<unknown>, <unknown>], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](Iterator)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "demo.py", line 48, in <module>
    elem = sess.run(next_element)
  File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 895, in run
    run_metadata_ptr)
  File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1124, in _run
    feed_dict_tensor, options, run_metadata)
  File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
    options, run_metadata)
  File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: AttributeError: 'bytes' object has no attribute 'read'
     [[Node: PyFunc = PyFunc[Tin=[DT_STRING], Tout=[DT_FLOAT, DT_FLOAT], token="pyfunc_0"](arg0)]]
     [[Node: IteratorGetNext = IteratorGetNext[output_shapes=[<unknown>, <unknown>], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](Iterator)]]

mrry · Accepted Answer · 2017-10-16 22:55:48Z

Taking your questions in reverse order:

What is the general flow I should be following, particularly where I want to start with something like a bunch of filenames and output more than one training Example per file name?

To transform one element into many, use the Dataset.flat_map(f) transformation. This transformation lets you define a function f(x) that maps a single element x to a nested Dataset object, and then it takes care of flattening the nested datasets.

This seems like exactly a use case for py_func but am I wrong about that?

This is a use case for tf.py_func() but your program has a slight error: the tf.py_func() op expects that your function (data_from_files()) returns NumPy arrays, and not tf.Tensor objects. Simply returning X and Y should work.

With those two points answered, let's take a look at how you can rewrite your code:

import tensorflow as tf
import os

import sqlite3
import pandas as pd
import numpy as np

LOOKBACK_ROWS = 600 
DATA_DIR = '/mnt/derived_data/processedData'

files = os.listdir(DATA_DIR)

def data_from_files(f):
    with sqlite3.connect(DATA_DIR + f) as conn:
        results = conn.execute("SELECT col1, col2, FROM tbl")
        col_names = [d[0] for d in results.description]
        arr = np.array(results.fetchall())

    num_obs = arr.shape[0] - LOOKBACK_ROWS + 1

    X = np.zeros((num_obs, LOOKBACK_ROWS, 1), dtype = np.float32)
    Y = np.zeros((num_obs, 1), dtype = np.float32)

    for i in range(num_obs):
        idx = i + LOOKBACK_ROWS - 1
        X[i , :, 0] = arr[(idx - LOOKBACK_ROWS + 1):(idx + 1), 0]
        Y[i, 0] = arr[idx, 1]

    return X, Y

filenames = tf.constant(files)

# NOTE: In TensorFlow 1.4, `tf.contrib.data` is now `tf.data`.
dataset = tf.contrib.data.Dataset.from_tensor_slices(filenames)

# NOTE: In TensorFlow 1.4, the `tuple` is no longer needed.
dataset = dataset.map(lambda filename: tuple(tf.py_func(
    data_from_files,
    [filename],
    [tf.float32, tf.float32])))

# NOTE: If you only have one `Dataset`, you do not need to use
# `Iterator.from_structure()`.
iterator     = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)

    while True:
        try:
            elem = sess.run(next_element)
            print('Success')
        except tf.errors.OutOfRangeError:
            print('End of dataset.')
            break

Thanks for this very detailed and helpful answer. I may be missing something, but I found that even incorporating your changes I still had the same error. I have appended standalone code in my example above as well as the error I get in that case.

Flynn · Accepted Answer · 2019-02-03 03:38:02Z

1

I have the same error (AttributeError: 'bytes' object has no attribute 'read') when i use tensorflow. I did these things: uninstall numpy, delete files about numpy in 'Lib\site-packages', reinstall numpy. And the error is gone. Maybe some mistakes Occurred where i updated numpy.

answered Feb 3, 2019 at 3:38

Flynn

514 bronze badges

Collectives™ on Stack Overflow

Using tf.py_func to generate input data

2 Answers 2

1 Comment

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Related