Benchmark TensorFlow

Google's TensorFlow benchmarks are here!

I've run the benchmarks on the Imagenet Winners. 
When I saw issues with the numbers, memory usage, etc., I emailed @Yangqing to confirm that what I'm seeing is expected. 

With that disclaimer out of the way, here are some things that you should know about TensorFlow (as of the pip version that I installed today):
- in-place ReLU seems non-existent in practice.
  - Yangqing says: "right now there are little in-place operations in TensorFlow and we pretty much rely on the scheduler and the memory pool to allocate and deallocate memory"
- Supports CuDNN R2. No R3 support yet; Yangqing says the next version they are going to support is likely R4.

Coming to the benchmarks:
- GoogleNet with batch size 128 runs out of memory. The largest batch size I could fit is 16 (tried 16, 32, 64, 128).
- VGG with batch size 64 runs out of memory (**Edit: the VGG memory issue was solved by using the BFC allocator updated by GOOG**). ~~The largest batch-size I could fit is 32 (tried 32, 64).~~
- I've also computed Torch7+CuDNN-R2 baselines for these batch-sizes.
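
The "tried 16, 32, 64, 128" probing above can be sketched as a small loop. This is only an illustration, not the script used for these numbers: `largest_fitting_batch` and `fake_googlenet_step` are hypothetical names, and the stand-in step raises Python's built-in `MemoryError`, whereas a real framework would raise its own out-of-memory error.

```python
def largest_fitting_batch(run_once, candidates):
    """Return the largest batch size in `candidates` for which `run_once` succeeds."""
    for batch_size in sorted(candidates, reverse=True):
        try:
            run_once(batch_size)
        except MemoryError:
            continue  # did not fit, try the next smaller size
        return batch_size
    return None  # none of the candidates fit


# Hypothetical stand-in for one forward+backward pass: it pretends that
# anything above 16 images per batch exhausts GPU memory.
def fake_googlenet_step(batch_size):
    if batch_size > 16:
        raise MemoryError(f"batch size {batch_size} does not fit")


print(largest_fitting_batch(fake_googlenet_step, [16, 32, 64, 128]))
```

In a real run, `run_once` would build the network at the given batch size and execute one iteration, and the `except` clause would catch whatever error the framework raises when allocation fails.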

**[AlexNet (One Weird Trick paper)](https://code.google.com/p/cuda-convnet2/source/browse/layers/layers-imagenet-1gpu.cfg)** - Input 128x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| :-: | --: | --: | --: |
| CuDNN-R3 (Torch) | 96 | 32 | 64 |
| Nervana (Neon) | 101 | 32 | 69 |
| CuDNN-R2 (Torch) | 231 | 70 | 161 |
| **TensorFlow** | 326 | 96 | 230 |

**[Overfeat [fast]](http://arxiv.org/abs/1312.6229)** - Input 128x3x231x231

| Library | Time (ms) | forward (ms) | backward (ms) |
| :-: | --: | --: | --: |
| CuDNN-R3 (Torch) | 326 | 113 | 213 |
| fbfft (Torch) | 342 | 114 | 227 |
| CuDNN-R2 (Torch) | 810 | 234 | 576 |
| TensorFlow | 1084 | 316 | 768 |

**[OxfordNet [Model-A]](http://arxiv.org/abs/1409.1556/)** - Input 64x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| :-: | --: | --: | --: |
| Nervana | 590 | 180 | 410 |
| CuDNN-R3 (Torch) | 615 | 196 | 418 |
| CuDNN-R2 (Torch) | 1099 | 342 | 757 |
| TensorFlow | 1840 | 545 | 1295 |

**[GoogleNet V1](http://research.google.com/pubs/pub43022.html)** - Input **16**x3x224x224

| Library | Time (ms) | forward (ms) | backward (ms) |
| :-: | --: | --: | --: |
| CuDNN-R2 (Torch) | 564 | 174 | 390 |
| TensorFlow | 590 | 54 | 536 |

**Note that at a batch size of 16, GoogleNet with CuDNN-R2 + Torch likely runs into dispatching overhead, so this is an exotic comparison and not a practically interesting or encouraging one.**
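
To put the tables in relative terms, the total times can be turned into slowdown ratios against the Torch + CuDNN-R2 baseline, which is the CuDNN version TensorFlow currently supports. A minimal sketch: the numbers are copied from the tables above, while the dictionary layout and the `slowdown` helper are just for illustration.

```python
# Total time per iteration (ms), copied from the tables above.
totals_ms = {
    "AlexNet (batch 128)": {"CuDNN-R2 (Torch)": 231, "TensorFlow": 326},
    "Overfeat [fast] (batch 128)": {"CuDNN-R2 (Torch)": 810, "TensorFlow": 1084},
    "OxfordNet [Model-A] (batch 64)": {"CuDNN-R2 (Torch)": 1099, "TensorFlow": 1840},
    "GoogleNet V1 (batch 16)": {"CuDNN-R2 (Torch)": 564, "TensorFlow": 590},
}


def slowdown(times, baseline="CuDNN-R2 (Torch)", target="TensorFlow"):
    """Ratio of target time to baseline time; values above 1 mean the target is slower."""
    return times[target] / times[baseline]


ratios = {net: slowdown(times) for net, times in totals_ms.items()}
for net, ratio in ratios.items():
    print(f"{net}: TensorFlow takes {ratio:.2f}x the CuDNN-R2 (Torch) time")
```

On these totals TensorFlow comes out at roughly 1.3x to 1.7x the CuDNN-R2 time on the first three networks, and close to parity on GoogleNet at batch size 16, where the caveat above applies.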

There you go. 

I'm assuming that the first release of TensorFlow is still quite unpolished, and that they will improve it over time with various memory and time optimizations baked in.

