
I'm using OpenGL but this question should apply generally to rendering.

I understand that for efficient rendering in games, you want to minimize communication between the CPU and GPU. This means pre-loading as much vertex data as possible into graphics memory before a level starts (during the level loading screen), updating the view and projection matrices for your camera once per frame, and letting the vertex shaders scale, rotate, and translate models as required to render the scene. You can use instanced rendering this way to minimize draw calls.

This works great for static geometry that never changes. But I don't understand how you're supposed to minimize communication between the CPU and GPU when lots of objects are moving. This seems impossible?

If only the CPU knows the new location of all the objects moving around each frame, then it must somehow pass that data to the GPU. If it passes the updated model matrices for moving objects to the GPU as uniforms to the vertex shader, isn't this the CPU talking to the GPU? Isn't this very slow and what we're supposed to avoid?

I understand we can use uniform buffer objects instead of updating uniforms individually, but this still means sending lots of data from the CPU to the GPU.

How can you render lots of moving objects efficiently, whose model matrices are changing every frame?

2 Answers
I understand that for efficient rendering in games, you want to minimize communication between the CPU and GPU. … isn't this the CPU talking to the GPU? Isn't this very slow and what we're supposed to avoid?

No matter what, you always have to send something to the GPU every frame. The absolute minimum possible data to send would be a single command “draw the next frame based on the data you already have”, but that’s not how games work, since games are interactive. Even “GPU-driven rendering” still needs the latest frame’s scene data to be transferred to the GPU.

But certainly, for efficiency, you should avoid sending unnecessary data to the GPU. Updated transforms are necessary data (unless you have, say, a GPU particle simulation).

You might be thinking of the advice to avoid reading data from the GPU (e.g. with glReadPixels). This is “slow” because it has to wait for the GPU to finish writing the data you want to read, whereas as long as you don't read anything, the CPU and GPU can work in parallel, preparing the next frame while rendering the current one. (And you can still read efficiently by careful scheduling and choosing to read data from a previous frame’s computation.)

But consideration of the amount of data you write to the GPU should be approached just like any other program optimization problem: think about how many operations you're running, think about how much data you have to transfer over the memory bus and how it can be represented more compactly, etc. And always benchmark/profile a possible change or alternative design rather than assuming that it must be better.

I understand we can use uniform buffer objects instead of updating uniforms individually

You should be using an instance buffer to deliver model transforms to the GPU. That way you don’t have to issue a separate draw call for every object.
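As a sketch of what that looks like on the shader side (attribute locations and names here are illustrative, not taken from the question's code), a per-instance translation can be delivered as an ordinary vertex attribute whose divisor is set to 1, so it advances once per instance rather than once per vertex:

```glsl
#version 330 core

// Per-vertex data from the static VBO uploaded at load time.
layout(location = 0) in vec3 aPosition;

// Per-instance data: glVertexAttribDivisor(2, 1) on the CPU side makes
// this attribute step once per instance instead of once per vertex.
layout(location = 2) in vec3 aOffset;

uniform mat4 uViewProjection;

void main() {
    gl_Position = uViewProjection * vec4(aPosition + aOffset, 1.0);
}
```

On the CPU side, the matching buffer is refilled once per frame (e.g. with glBufferSubData, or orphaned and rewritten with glBufferData), and then a single glDrawArraysInstanced or glDrawElementsInstanced call draws every instance.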

  • Thanks. My objects only change position each frame but don't rotate or scale. When you said "an instance buffer to deliver model transforms to the GPU", could I use two VBOs and use gl_InstanceID to index them? I could use one VBO to store transform data that changes (position), and the other to store transform data that doesn't change. In the vertex shader, I can access both with gl_InstanceID to assemble the model matrix. That way I can send all the updated positions to the GPU each frame as one compact, minimum sized VBO, with one function call (and avoid needlessly sending other data)? Commented Apr 12 at 16:38
  • this post seems highly relevant here, where someone did a test to render 40 thousand triangles and update their transforms every frame, and the fastest method they found was to use "instanced mat4 attribute (glVertexAttribDivisor) and sending the mat4s into the VBO each frame (glBufferData)", which was faster compared to using individual uniforms or uniform buffer objects Commented Apr 12 at 16:48
  • @greenlagoon Yes, absolutely, if you have only translations changing then using a buffer containing only translations makes sense. (I have forgotten a lot of OpenGL, but note that you will not be "using gl_InstanceID to index them" — you would set up the vertex attribute to provide the per-instance data. According to this page you would do that with glVertexAttribDivisor.) Commented Apr 12 at 17:27

In addition to @KevinReid's answer, which is spot on, you may look into compressing that transform data as much as possible.

If you can get by with 10 bits for each of your x, y, z positional components, for example, then 3 × 10 = 30 bits fit into a single 32-bit value passed to the GPU. That uses a quarter of the bandwidth of passing a 32-bit float per component (= 96 bits, often rounded up to 128 bits / 16 B to align to word boundaries, depending on whether or not you additionally send a w-coordinate).
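As a sketch of the idea (the function names here are made up for illustration): positions confined to a known bounding box can be quantized to 10 bits per axis and packed into one 32-bit word. Conveniently, this is the same layout OpenGL can consume natively as a GL_UNSIGNED_INT_2_10_10_10_REV normalized vertex attribute, so the shader never has to unpack it by hand.

```c
#include <assert.h>
#include <stdint.h>

/* Pack three 10-bit unsigned components (each 0..1023) into one 32-bit
   word: x in bits 0-9, y in bits 10-19, z in bits 20-29. The top two
   bits are left unused (GL_UNSIGNED_INT_2_10_10_10_REV reserves them
   for a w component). */
static uint32_t pack_pos10(uint32_t x, uint32_t y, uint32_t z) {
    return (x & 0x3FFu) | ((y & 0x3FFu) << 10) | ((z & 0x3FFu) << 20);
}

/* Quantize a float in [lo, hi] onto the 0..1023 range, clamping values
   that fall outside the bounding box. */
static uint32_t quantize10(float v, float lo, float hi) {
    float t = (v - lo) / (hi - lo);          /* normalize to [0, 1]   */
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    return (uint32_t)(t * 1023.0f + 0.5f);   /* round to nearest step */
}
```

The trade-off is precision: 10 bits gives steps of about 1/1024 of the bounding box's extent per axis, so this works best when objects live in a reasonably small region, or when the box is defined per chunk. The shader (or the normalized-attribute fetch) then rescales the 0..1 value back into world units.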

  • Some good examples of using this data compression strategy to gain performance improvements in a voxel game here: youtube.com/watch?v=40JzyaOYJeY Commented Apr 13 at 18:38
