Client state tracking
If you take a step back from this problem, you'll see that it's fundamentally hard: it forces you to make an efficiency vs. correctness trade-off.
Why?
Because providing the non-repeating property you want, while returning a different random set of images to each user, requires keeping track of the seen / unseen images for each user somewhere, somehow.
For a lot of clients, that's a lot of state.
If you keep the state on the client side - a list of seen images that the client appends to and sends with every request - the state-tracking load moves to the client, but it makes your queries unwieldy: you'll probably want to do an anti-join against a VALUES list to exclude the seen images, because NOT IN gets inefficient at scale. Plus there's all the extra data the client has to send and the server has to process on every request.
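As a rough sketch of that anti-join, assuming an `images` table and that the client sent IDs 101, 102 and 103 as its seen-list (all names and values here are illustrative):

```sql
-- Exclude the client-supplied seen IDs with an anti-join on a VALUES list.
-- The VALUES list is built from the IDs the client sent with this request,
-- so it grows with every round trip.
SELECT i.*
FROM images i
LEFT JOIN (VALUES (101), (102), (103)) AS seen(id)
       ON seen.id = i.id
WHERE seen.id IS NULL
ORDER BY random()
LIMIT 10;
```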
Gordon's solution is a variant of this that simplifies the client state by forcing a stable random sort, so the client state is only "how many images have I seen" not "which images have I seen". The downside is that the order is stable - if a client requests it again, it'll start at the beginning of the same random set, not a different one.
If you don't push the state to the client side, the server has to know which images each client has seen. There are many ways to do that but they're all going to require keeping track of a bunch of client state and expiring that state effectively. Options include:
CREATE TABLE AS SELECT ... when you first get a request, then return results from that table. Easy, and very efficient for subsequent requests, but extremely slow for the first one. It doesn't require you to keep a transaction or session open, but it wastes lots of storage and requires you to expire the copies. Not a good way to do it.
Using WITH HOLD cursors, or using regular cursors with an open transaction. Can produce fairly fast first results and is fairly storage efficient - though it can sometimes consume lots of temporary storage. Requires you to keep a session open and associated with a particular client, though, so it won't scale for large numbers of clients. Requires you to expire old sessions too.
Send a random value, generated by the client on its first request, as the random seed for Gordon's approach. Since his approach requires a full table scan and sort I don't recommend it, but it'd at least stop the same random sequence repeating for each client / new request. You'd send "offset=50&seed=1231" from the client (see the sketch after this list).
Using client session tables. Keep track of HTTP sessions in your application using the usual methods (cookies, URL session IDs, etc.) and associate them with state in the DB or elsewhere. The client just provides the session ID, and the server looks up the client's session data in its local storage to figure out what the client has seen. With this you can use a NOT IN list or a left anti-join against a VALUES list of seen IDs without having to ship those IDs to/from the client on every request.
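I don't know exactly how Gordon implemented his stable sort, but one seedable, stateless way to get the same effect is to order by a hash of the row ID and the client's seed. The table name and the literal values are illustrative:

```sql
-- A stable pseudo-random order per seed: the same seed always produces the
-- same ordering, so OFFSET-based paging stays consistent across requests.
-- '1231' and 50 stand in for the client's seed and offset parameters.
-- Still a full table scan and sort per request, as noted above.
SELECT i.*
FROM images i
ORDER BY md5(i.id::text || '1231')
OFFSET 50
LIMIT 10;
```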
So. Lots of options. I'm sure I haven't listed them all either.
Personally, I would use the client's HTTP session - either directly, or to store a random request ID that I generated when the client first asked for a new random set of images. I'd store a list of seen images in a server-side session cache, which would be an UNLOGGED table containing (sessionid, imageid) pairs, or (requestid, imageid) if using the latter approach. I'd do a left anti-join against the session table to exclude seen images when generating a random set.
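A minimal sketch of that session cache, assuming the session-ID variant and illustrative names (expiry of old sessions is left out):

```sql
-- Server-side cache of seen images. UNLOGGED skips WAL for speed; the
-- contents don't survive a crash, which is acceptable for session state.
CREATE UNLOGGED TABLE session_seen_images (
    sessionid text    NOT NULL,
    imageid   integer NOT NULL,
    PRIMARY KEY (sessionid, imageid)
);

-- Pick a fresh random batch for session 'abc123', record it as seen,
-- and return the chosen IDs, all in one statement.
WITH batch AS (
    SELECT i.id
    FROM images i
    LEFT JOIN session_seen_images s
           ON s.imageid = i.id
          AND s.sessionid = 'abc123'
    WHERE s.imageid IS NULL
    ORDER BY random()
    LIMIT 10
)
INSERT INTO session_seen_images (sessionid, imageid)
SELECT 'abc123', id FROM batch
RETURNING imageid;
```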
Getting random rows
Hah, you're not done yet.
The naïve approach of ORDER BY random() does a full table scan and sort. That's going to be really painful for a large table.
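That naïve query, for reference (`images` is a placeholder):

```sql
-- Reads and sorts every row just to return ten of them.
SELECT * FROM images ORDER BY random() LIMIT 10;
```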
It'd be nice if PostgreSQL offered a way to read random rows from a table by just picking a table page and reading a row from it. It doesn't, unfortunately - not even a repeats-possible version. (PostgreSQL 9.5 later added TABLESAMPLE SYSTEM, which samples random pages but can repeat rows across queries.)
Since your IDs are sparse you can't easily generate a block of random IDs to pick either.
Picking random rows from a table turns out to be a hard problem. Again, you have options to simplify the problem in exchange for reduced randomness, like:
Pick a block of sequential rows, ordered by ID, at a random OFFSET into the data (sketched after this list)
CREATE UNLOGGED TABLE ... AS SELECT ... cached randomized copies of the data. Either create a bunch of them and pick one at random when the first request from a client comes in, or create one and just re-create it regularly. (There's a sketch of this near the end of the answer.)
... probably more
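Here's the random-OFFSET idea sketched out, using the planner's row estimate from pg_class to avoid a count(*) scan; `images` is illustrative, and the table should have been ANALYZEd so reltuples is populated:

```sql
-- Random starting point into an ID-ordered scan. The rows within the block
-- are consecutive by ID; only the block's position is random. Large offsets
-- still pay to skip that many index entries, but there's no sort.
SELECT i.*
FROM images i
ORDER BY i.id
OFFSET floor(random() * (SELECT reltuples FROM pg_class
                          WHERE oid = 'images'::regclass))::bigint
LIMIT 10;
```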
Simplify the problem
So, what made this so hard in the first place?
- Non-repeating results
- Random rows
- Unique for each client or request
What can we do to make this easier?
- Relax the requirement of randomness. e.g. pick sequential rows at random offsets.
- Remove the requirement of non-repeating images, or relax it so that (say) only the last two sets of images are guaranteed not to repeat.
- Relax the requirement of per-client or per-request uniqueness. Use the same randomization for all clients and cache it.
Another useful trick is to share state between clients. If you don't want to repeat images sent to one client, but you don't mind that a given client might never see some images, you can effectively track the seen images for a group of clients together. For example, you might keep a pool of WITH HOLD cursors and assign clients to them by storing a mapping in their HTTP sessions. Each group of clients gets its results from a particular cursor; when one client reads a block of results from the cursor, no other client in the same pool will ever see those rows in this session. So this approach only works if you have a "very large" image set, i.e. the clients won't realistically exhaust it in one browsing session.
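As a sketch, one cursor in such a pool might be managed like this (the cursor name is illustrative; each cursor lives in one database session, so the pool's clients must all route to that session):

```sql
-- WITH HOLD keeps the cursor usable after COMMIT, at the cost of
-- materializing its result set in temporary storage.
BEGIN;
DECLARE image_pool_7 CURSOR WITH HOLD FOR
    SELECT * FROM images ORDER BY random();
COMMIT;

-- Each request from a client assigned to pool 7 drains the next block;
-- no other client in the pool will ever get these rows.
FETCH 10 FROM image_pool_7;

-- Expire the pool eventually to reclaim the temp storage.
CLOSE image_pool_7;
```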
Similarly, you might have a pool of cached UNLOGGED tables of randomized data. When a client sends their first request you assign them to one of the tables using their HTTP session ID - either by hash bucketing or by storing a mapping. Then you can just return results from that table for subsequent requests.
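A sketch of one table in that pool, with a position column so paging is cheap; the names and the bucketing rule are illustrative:

```sql
-- One pre-shuffled copy out of the pool. Re-create it periodically to
-- get a fresh shuffle.
CREATE UNLOGGED TABLE images_shuffled_0 AS
SELECT row_number() OVER () AS pos, i.*
FROM (SELECT * FROM images ORDER BY random()) i;

CREATE INDEX ON images_shuffled_0 (pos);

-- The application buckets a client into the pool, e.g.
-- bucket = hash(session_id) % pool_size, then pages through "its" copy:
SELECT *
FROM images_shuffled_0
WHERE pos > 50        -- rows this client has already seen
ORDER BY pos
LIMIT 10;
```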
Phew. Wow. That became a bit long. I hope it made some sense and gave you some useful ideas.