Python version = 3.6.3
TensorFlow version = 1.3.0
I have worked in Keras but am now trying to work directly in TensorFlow. I am trying to implement the equivalent of Keras's fit_generator, whereby I don't have to load all of the training data into memory up front but can feed it to the network as needed during training. The code below represents my attempt to start something like that, but if I am going about this all wrong I would love to know where in the docs I should look and what keywords I should search for.
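For context, the Keras pattern I am trying to replicate looks roughly like this (heavily simplified; db_generator and the toy model are stand-ins for my real code, not what I actually run):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

def db_generator(batch_size=32):
    # Stand-in generator: yields (X, Y) batches indefinitely, reading only
    # one chunk of data into memory at a time.
    while True:
        X = np.random.random_sample((batch_size, 600, 1)).astype(np.float32)
        Y = np.random.random_sample((batch_size, 1)).astype(np.float32)
        yield X, Y

model = Sequential([LSTM(32, input_shape=(600, 1)), Dense(1)])
model.compile(optimizer='adam', loss='mse')

# fit_generator pulls batches from the generator as needed, so the whole
# dataset never has to sit in memory at once.
model.fit_generator(db_generator(), steps_per_epoch=100, epochs=5)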
My current system is based on a generator that reads through sqlite database files, extracts np.arrays, and converts them into my desired data shape (a lookback window of time-series values with a single one-step-ahead prediction target). I am now trying to migrate that system to TensorFlow Datasets and am getting stuck applying tf.py_func. Here is how I am approaching it right now:
import tensorflow as tf
import os
from tensorflow.contrib.data import Dataset, Iterator
import sqlite3
import pandas as pd
import numpy as np
LOOKBACK_ROWS = 600
DATA_DIR = '/mnt/derived_data/processedData'
files = os.listdir(DATA_DIR)
def data_from_files(f):
    with sqlite3.connect(os.path.join(DATA_DIR, f)) as conn:
        results = conn.execute("SELECT col1, col2 FROM tbl")
        col_names = [d[0] for d in results.description]
        arr = np.array(results.fetchall())

    num_obs = arr.shape[0] - LOOKBACK_ROWS + 1

    X = np.zeros((num_obs, LOOKBACK_ROWS, 1), dtype = np.float32)
    Y = np.zeros((num_obs, 1), dtype = np.float32)

    for i in range(num_obs):
        idx = i + LOOKBACK_ROWS - 1
        X[i, :, 0] = arr[(idx - LOOKBACK_ROWS + 1):(idx + 1), 0]
        Y[i, 0] = arr[idx, 1]

    return tf.convert_to_tensor(X, name = 'X'), tf.convert_to_tensor(Y, name = 'Y')
filenames = tf.constant(files)
dataset = Dataset.from_tensor_slices((filenames))
dataset = dataset.map(lambda filename: tuple(tf.py_func(
    data_from_files,
    [filename],
    [tf.float32, tf.float32])))
iterator = Iterator.from_structure(dataset.output_types, dataset.output_shapes)
next_element = iterator.get_next()
dataset_init_op = iterator.make_initializer(dataset)
with tf.Session() as sess:
    sess.run(dataset_init_op)
    while True:
        try:
            elem = sess.run(next_element)
            print('Success')
        except tf.errors.OutOfRangeError:
            print('End of dataset.')
            break
The initialization runs fine, but when I start the session and run the loop I get the following errors:
2017-10-16 16:58:45.227612: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-10-16 16:58:45.227615: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2017-10-16 16:58:45.227620: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0)
2017-10-16 16:58:45.276138: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: TypeError: must be str, not bytes
2017-10-16 16:58:45.276306: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: TypeError: must be str, not bytes
[[Node: PyFunc = PyFunc[Tin=[DT_STRING], Tout=[DT_FLOAT, DT_FLOAT], token="pyfunc_0"](arg0)]]
Traceback (most recent call last):
File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
status, run_metadata)
File "/opt/python/3.6.3/lib/python3.6/contextlib.py", line 88, in __exit__
next(self.gen)
File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: TypeError: must be str, not bytes
[[Node: PyFunc = PyFunc[Tin=[DT_STRING], Tout=[DT_FLOAT, DT_FLOAT], token="pyfunc_0"](arg0)]]
[[Node: IteratorGetNext = IteratorGetNext[output_shapes=[<unknown>, <unknown>], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](Iterator)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/usr/code/nn/data_folder/pipeline.py", line 51, in <module>
elem = sess.run(next_element)
File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: TypeError: must be str, not bytes
[[Node: PyFunc = PyFunc[Tin=[DT_STRING], Tout=[DT_FLOAT, DT_FLOAT], token="pyfunc_0"](arg0)]]
[[Node: IteratorGetNext = IteratorGetNext[output_shapes=[<unknown>, <unknown>], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](Iterator)]]
Questions
(1) This seems like exactly a use case for py_func, but am I wrong about that? If not, can anyone point me to resources that go into more depth than the TensorFlow docs? (I did notice one potentially related issue on GitHub: https://github.com/tensorflow/tensorflow/issues/12396, but the fix of wrapping everything in a tuple did not help me.)
(2) What is the general flow I should be following, particularly in a case like this where I want to start with a list of filenames and produce more than one training example per filename? (A rough sketch of the flow I have in mind follows below, after the questions.)
Thank you.
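For question (2), here is a rough sketch of the flow I have in mind, assuming the py_func call itself can be made to work. Using flat_map with Dataset.from_tensor_slices to split each file's arrays into individual examples is my guess at the idiomatic pattern, not something I have verified (and I suspect I may also need set_shape on the py_func outputs, since their shapes come back as unknown):

# Rough sketch only: one filename -> many (X, Y) training examples.
dataset = Dataset.from_tensor_slices(filenames)

# Each file yields a whole array of windows/targets via the py_func...
dataset = dataset.map(lambda filename: tuple(tf.py_func(
    data_from_files, [filename], [tf.float32, tf.float32])))

# ...and flat_map splits those arrays back into individual (X, Y) pairs,
# so downstream ops see a stream of single training examples.
dataset = dataset.flat_map(
    lambda X, Y: Dataset.from_tensor_slices((X, Y)))

dataset = dataset.shuffle(buffer_size=10000).batch(32).repeat()

iterator = dataset.make_initializable_iterator()
X_batch, Y_batch = iterator.get_next()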
Below I have rewritten my script as a self-contained, runnable example. I believe the issue is the same as in the code above, but I am pasting the error again to confirm.
Self-contained, runnable code example incorporating changes from @mrry's answer:
import tensorflow as tf
import os
import numpy as np
LOOKBACK_ROWS = 600
arr = np.random.random_sample((2000, 2))
np.save("npfile.npy", arr)
def data_from_files(f):
    arr = np.load(f)
    num_obs = arr.shape[0] - LOOKBACK_ROWS + 1

    X = np.zeros((num_obs, LOOKBACK_ROWS, 1), dtype = np.float32)
    Y = np.zeros((num_obs, 1), dtype = np.float32)

    for i in range(num_obs):
        idx = i + LOOKBACK_ROWS - 1
        X[i, :, 0] = arr[(idx - LOOKBACK_ROWS + 1):(idx + 1), 0]
        Y[i, 0] = arr[idx, 1]

    return X, Y
files = ["npfile.npy"]
filenames = tf.constant(files)
# NOTE: In TensorFlow 1.4, `tf.contrib.data` is now `tf.data`.
dataset = tf.contrib.data.Dataset.from_tensor_slices(filenames)
# NOTE: In TensorFlow 1.4, the `tuple` is no longer needed.
dataset = dataset.map(lambda filename: tuple(tf.py_func(
    data_from_files,
    [filename],
    [tf.float32, tf.float32])))
# NOTE: If you only have one `Dataset`, you do not need to use
# `Iterator.from_structure()`.
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    sess.run(iterator.initializer)
    while True:
        try:
            elem = sess.run(next_element)
            print('Success')
        except tf.errors.OutOfRangeError:
            print('End of dataset.')
            break
Error:
2017-10-16 18:30:44.143668: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0
2017-10-16 18:30:44.143672: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0: Y
2017-10-16 18:30:44.143679: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0)
2017-10-16 18:30:44.190852: W tensorflow/core/framework/op_kernel.cc:1192] Unknown: AttributeError: 'bytes' object has no attribute 'read'
2017-10-16 18:30:44.190959: W tensorflow/core/framework/op_kernel.cc:1192] Unknown: AttributeError: 'bytes' object has no attribute 'read'
[[Node: PyFunc = PyFunc[Tin=[DT_STRING], Tout=[DT_FLOAT, DT_FLOAT], token="pyfunc_0"](arg0)]]
Traceback (most recent call last):
File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1306, in _run_fn
status, run_metadata)
File "/opt/python/3.6.3/lib/python3.6/contextlib.py", line 88, in __exit__
next(self.gen)
File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.UnknownError: AttributeError: 'bytes' object has no attribute 'read'
[[Node: PyFunc = PyFunc[Tin=[DT_STRING], Tout=[DT_FLOAT, DT_FLOAT], token="pyfunc_0"](arg0)]]
[[Node: IteratorGetNext = IteratorGetNext[output_shapes=[<unknown>, <unknown>], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](Iterator)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "demo.py", line 48, in <module>
elem = sess.run(next_element)
File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/opt/python/3.6.3/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.UnknownError: AttributeError: 'bytes' object has no attribute 'read'
[[Node: PyFunc = PyFunc[Tin=[DT_STRING], Tout=[DT_FLOAT, DT_FLOAT], token="pyfunc_0"](arg0)]]
[[Node: IteratorGetNext = IteratorGetNext[output_shapes=[<unknown>, <unknown>], output_types=[DT_FLOAT, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](Iterator)]]
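One suspicion I have not been able to verify yet: under Python 3, tf.py_func appears to hand string tensors to my function as bytes objects, which would be consistent with both the "must be str, not bytes" error above and the "'bytes' object has no attribute 'read'" error here (np.load trying to treat a bytes path as a file object). If that is the problem, a decode at the top of data_from_files may be all that is missing; a minimal sketch of what I plan to try next:

def data_from_files(f):
    # Under Python 3, tf.py_func passes string tensors in as bytes,
    # so convert back to str before using the value as a path.
    if isinstance(f, bytes):
        f = f.decode('utf-8')
    arr = np.load(f)
    num_obs = arr.shape[0] - LOOKBACK_ROWS + 1
    X = np.zeros((num_obs, LOOKBACK_ROWS, 1), dtype=np.float32)
    Y = np.zeros((num_obs, 1), dtype=np.float32)
    for i in range(num_obs):
        idx = i + LOOKBACK_ROWS - 1
        X[i, :, 0] = arr[(idx - LOOKBACK_ROWS + 1):(idx + 1), 0]
        Y[i, 0] = arr[idx, 1]
    return X, Y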