I have a very large dataset: a single .npy file containing around 1.5 million elements, each a 150x150x3 image. The output has 51 columns (51 outputs). Since the dataset can't fit into memory, how do I load it and use it to fit the model? An efficient way would be TFRecords and tf.data, but I couldn't understand how to do this. I would appreciate the help. Thank you.
-
What does "I couldn't understand how to do this" mean? Can you share your attempts? – AMC, Dec 5, 2019 at 20:47
-
@AlexanderCécile Yeah, sure. The idea is to convert the large dataset into a TensorFlow-compatible format, TFRecord, and then use the tf.data API to read this TFRecord file and feed it to the neural network. I tried various approaches but failed to do it. – Amin Marshal, Dec 5, 2019 at 21:11
1 Answer
One way is to load your NPY file fragment by fragment (to feed your neural network with) rather than loading it into memory all at once. You can use numpy.load as normal and specify the mmap_mode keyword so that the array is kept on disk and only the necessary bits are loaded into memory upon access (more details here):
numpy.load(file, mmap_mode=None, allow_pickle=False, fix_imports=True, encoding='ASCII')
Memory-mapped files are used for accessing small segments of large files on disk without reading the entire file into memory. NumPy's memmaps are array-like objects. This differs from Python's mmap module, which uses file-like objects.
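As a minimal sketch of this approach, the memmapped array can be sliced batch by batch inside a generator and wrapped with tf.data. The file names images.npy and labels.npy, the batch size, and the pre-built Keras model are assumptions, not part of the question; adapt them to your setup.

    import numpy as np
    import tensorflow as tf

    # Keep the large arrays on disk; slicing a memmap reads only the
    # accessed rows into memory.
    images = np.load("images.npy", mmap_mode="r")   # assumed shape: (N, 150, 150, 3)
    labels = np.load("labels.npy", mmap_mode="r")   # assumed shape: (N, 51)

    def batch_generator(batch_size=32):
        for start in range(0, len(images), batch_size):
            end = start + batch_size
            # Only this slice is read from disk.
            yield images[start:end].astype("float32") / 255.0, labels[start:end]

    dataset = tf.data.Dataset.from_generator(
        lambda: batch_generator(32),
        output_signature=(
            tf.TensorSpec(shape=(None, 150, 150, 3), dtype=tf.float32),
            tf.TensorSpec(shape=(None, 51), dtype=tf.float32),
        ),
    ).prefetch(tf.data.AUTOTUNE)

    # model.fit(dataset, epochs=10)  # assuming `model` is already compiled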
If you want to know how to create TFRecords from a NumPy array, and then read the TFRecords back using the Dataset API, this link provides a good answer. A rough sketch of that round trip is shown below.
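The sketch below assumes the same images and labels arrays as above, with uint8 image data, and writes everything to a single file named data.tfrecord; for 1.5 million examples you would normally shard across several files, but the serialization and parsing logic is the same.

    import numpy as np
    import tensorflow as tf

    def serialize_example(image, label):
        # Store the raw image bytes and the 51 float outputs per example.
        feature = {
            "image": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[image.tobytes()])),
            "label": tf.train.Feature(
                float_list=tf.train.FloatList(value=label.astype("float32"))),
        }
        return tf.train.Example(
            features=tf.train.Features(feature=feature)).SerializeToString()

    # Write the TFRecord file (shard into multiple files for a dataset this size).
    with tf.io.TFRecordWriter("data.tfrecord") as writer:
        for image, label in zip(images, labels):
            writer.write(serialize_example(np.asarray(image), np.asarray(label)))

    def parse_example(example_proto):
        feature_spec = {
            "image": tf.io.FixedLenFeature([], tf.string),
            "label": tf.io.FixedLenFeature([51], tf.float32),
        }
        parsed = tf.io.parse_single_example(example_proto, feature_spec)
        # Assumes the images were stored as uint8 before calling tobytes().
        image = tf.io.decode_raw(parsed["image"], tf.uint8)
        image = tf.reshape(image, (150, 150, 3))
        return tf.cast(image, tf.float32) / 255.0, parsed["label"]

    # Read the TFRecords back with the Dataset API and feed them to the model.
    dataset = (tf.data.TFRecordDataset("data.tfrecord")
               .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
               .shuffle(1024)
               .batch(32)
               .prefetch(tf.data.AUTOTUNE))

    # model.fit(dataset, epochs=10)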