
I have an interesting problem that I'm looking for some guidance on. I have a table "image" that contains many thousands of rows. I want to be able to return a selection of random rows with a limit of 50 at a time.

Client side, I have an initial GetImages() method which returns 50 "random" images (if there are that many). When the user scrolls through them and gets to a certain point (roughly 40+), another function fires - GetMoreImages().

The issue is that I'm not sure how to retrieve more images without the risk of returning the same results.

For example, if there are 60 images total, I would want the GetMoreImages() call to only return the remaining 10 images.

I should also mention that my id values are non-contiguous, as I am using the Instagram approach (http://instagram-engineering.tumblr.com/post/10853187575/sharding-ids-at-instagram), which leaves me with potentially big gaps between row ids.

One way I could attempt it would be to pass in all of the ids of the images I already have, but that will get unwieldy if the user is scrolling through thousands of images.

I guess another way could be to store a cached "random" set of values on the application server per user but I don't really like the idea of this either.

If there are any best practices you can guide me towards it would be appreciated.

3 Answers


You can get random images with a query such as:

select *
from images
order by random()
limit 50;

I'm not 100% sure that the following will work, but it might. What you want is a random number generator that reproduces the same values. For that, use setseed(). So, you can do:

with t as (
      select setseed(1)
     )
select *
from images cross join t
order by random()
limit 50;

Then you can get subsequent values as:

with t as (
      select setseed(1)
     ) 
select *
from images cross join t
order by random()
limit 50 offset 50;

The issue is whether random() is called in exactly the same order on subsequent calls. You might be able to enforce this with:

with t as (
      select setseed(1)
     ),
     i as (
      select i.*, random() as rand
      from images i cross join t
     )
select *
from i
order by i.rand
limit 50;

However, this still assumes that multiple calls to the same table will be in the same order. You can then run the same query with limit 10 offset 50 and so on.

You can change the value of the seed for each call, using a counter, a function related to the current date time, or just a random number generator.
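
For example, a seed derived from the current date might look like the sketch below. setseed() expects a value between -1 and 1, so whatever counter or timestamp you use has to be scaled into that range; the scaling here is purely illustrative:

with t as (
      select setseed(extract(doy from current_date)::float8 / 1000.0)
     ),
     i as (
      select i.*, random() as rand
      from images i cross join t
     )
select *
from i
order by i.rand
limit 50;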

EDIT:

My normal approach to this is to use a pseudo-random number generator. I just take relatively large prime numbers, do some arithmetic and use that value.

By changing values in the equation, you can adjust the parameters for what you want. For instance, I remember that 8,191 and 131,071 are prime numbers (because they are Mersenne primes). So, I might approach this as:

select i.*
from images i
order by mod(i.id * 8191 + 1, 131071)
limit 50 offset xxx;

You can adjust the "+1" to create different sequences. This isn't truly "random" and it depends on id being an integer type, but it avoids the instability of the random number generator approach. This is still doing an order by, so it might be inefficient, depending on the size of your data.


8 Comments

The final query looks OK. The first two won't produce stable results. Note that the final query forces a read and sort of the entire table every time, so it'll be horrifyingly inefficient for a large table.
... but then, so does any ORDER BY random(), so never mind.
It's actually a really hard problem to solve efficiently. Your thoughts on my answer would be appreciated.
@CraigRinger . . . It is an interesting discussion. However, the OP doesn't actually mention performance as an issue (and s/he is presumably already using a "random" order by for the first 50 rows). For scaling, it is an issue. Some databases have built-in support for random samples.
The setseed(1) example doesn't work (in PG 9.1) because the CTE is unreferenced so is optimised out; you have to reference t inside the definition of i, e.g. from images i cross join t.

Client state tracking

If you take a step back on this problem, you'll see that it's fundamentally a hard problem that forces you to make an efficiency vs correctness trade-off.

Why?

Because providing the non-repeating property you want, while returning a different random set of images to each user, requires you to keep track of the seen / unseen images for each user somewhere, somehow.

For a lot of clients, that's a lot of state.

If you push the state to the client side and keep a list of seen images that they send with every request and append to, that pushes the state tracking load to the client, but it makes your queries unwieldy - you'll probably want to do an anti-join on a VALUES list to exclude the seen images because NOT IN will get inefficient at scale. Plus there's all the extra data the client has to send to the server, the server has to process, etc.
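
As a rough illustration of that anti-join (using the images table and id column from Gordon's answer, with a hard-coded VALUES list standing in for the ids the client would actually send):

select i.*
from images i
    left join (values (101), (205), (317)) seen (image_id)
        on seen.image_id = i.id
where seen.image_id is null
order by random()
limit 50;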

Gordon's solution is a variant of this that simplifies the client state by forcing a stable random sort, so the client state is only "how many images have I seen" not "which images have I seen". The downside is that the order is stable - if a client requests it again, it'll start at the beginning of the same random set, not a different one.

If you don't push the state to the client side, the server has to know which images each client has seen. There are many ways to do that but they're all going to require keeping track of a bunch of client state and expiring that state effectively. Options include:

  • CREATE TABLE AS SELECT ... when you first get a request. Then return results from that table. Easy, and very efficient for subsequent requests, but extremely slow for the first request. Doesn't require you to keep a transaction or session open. Wastes lots of storage and requires you to expire the copies. Not a good way to do it.

  • Using WITH HOLD cursors, or using regular cursors with an open transaction. Can produce fairly fast first results and is fairly storage efficient - though it can sometimes consume lots of temporary storage. Requires you to keep a session open and associated with a particular client, though, so it won't scale for large numbers of clients. Requires you to expire old sessions too.

  • Send a random value that's generated by the client on first request as the random seed in Gordon's approach. Because his approach will require a full table scan and sort, I don't recommend it, but it'd at least solve the problem of the same random values repeating for each client / new request. You'd send "offset=50&seed=1231" from the client.

  • Using client session tables. Keep track of HTTP sessions in your application using the usual methods (cookies, URL session IDs, etc) and associate them with state in the DB or elsewhere. The client just provides the session ID to the server, and the server looks up the client's session data in its local storage to figure out what the client has seen. With this you can use a NOT IN list or left-anti-join against a VALUES list against a list of IDs without having to send IDs to/from the client.

So. Lots of options. I'm sure I haven't listed them all either.

Personally, I would use the client's HTTP session - either directly, or to store a random request ID that I generated when the client first asked for a new random set of images. I'd store a list of seen images in a server-side session cache, which would be an UNLOGGED table containing (sessionid, imageid) pairs, or (requestid, imageid) if using the latter approach. I'd do a left anti-join against the session table to exclude seen images when generating a random set.
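
A minimal sketch of that setup, assuming an images table with an id column; the seen_images table and the literal session id below are illustrative, not a fixed schema:

-- server-side record of what each session has been shown
create unlogged table seen_images (
    session_id text,
    image_id   bigint,
    primary key (session_id, image_id)
);

-- next page for session 'abc123': left anti-join away anything already recorded
select i.*
from images i
    left join seen_images s
        on s.image_id = i.id
       and s.session_id = 'abc123'
where s.image_id is null
order by random()
limit 50;

The application would then insert the ids it just served into seen_images; the last answer below does something similar with a data-modifying CTE.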

Getting random rows

Hah, you're not done yet.

The naïve approach of ORDER BY random() does a full table scan and sort. That's going to be really painful for a large table.

It'd be nice if PostgreSQL offered a way to read random rows from a table by just picking a table page and reading a row from it. It doesn't, unfortunately, not even a repeats-possible version.

Since your IDs are sparse you can't easily generate a block of random IDs to pick either.

Picking random rows from a table turns out to be a hard problem. Again, you have options to simplify the problem in exchange for reduced randomness, like:

  • Pick a block of sequential rows ordered by ID at a random OFFSET into the data (see the sketch after this list)

  • CREATE UNLOGGED TABLE AS SELECT ... cached randomized copies of the data. Either create a bunch of them and pick one of them at random when the first request from a client comes in, or create one and just re-create it regularly.

  • ... probably more
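
For instance, the random-OFFSET option above might look something like this sketch; the row count is hard-coded purely for illustration (in practice you'd use a cheap estimate such as pg_class.reltuples):

select i.*
from images i
order by i.id
limit 50
offset floor(random() * 3000000)::int;  -- 3000000 stands in for the (estimated) row count

A large OFFSET still has to walk past all the skipped rows, so this is cheaper than a full random sort but not free.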

Simplify the problem

So, what made this so hard in the first place?

  • Non-repeating results
  • Random rows
  • Unique for each client or request

What can we do to make this easier?

  • Relax the requirement of randomness. e.g. pick sequential rows at random offsets.
  • Remove the requirement of non-repeating images, or relax it so that (say) only the last two sets of images are guaranteed not to repeat.
  • Relax the requirement of per-client or per-request uniqueness. Use the same randomization for all clients and cache it.

Another useful trick might be to share state between clients. If you don't want to repeat images sent to one client, but you don't care if a given client might never see an image you can effectively track the seen images for a group of clients together. For example, you might have a pool of WITH HOLD cursors, which you assign clients to by storing a mapping in their HTTP session. Each group of clients gets a result from a particular cursor. When one client reads a block of results from the cursor, no other client in the same pool will ever see those results in this session. So this approach only works if you have a "very large" image set, i.e. the clients won't realistically run out in one browsing session.
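
By way of illustration, the cursor pool might look like the sketch below; the cursor name is made up, and declaring the cursor still pays for the initial ORDER BY random():

-- created once for the pool; WITH HOLD keeps it alive after the transaction commits
declare pool_1_images cursor with hold for
    select * from images order by random();

-- each request from a client assigned to pool 1 consumes the next block,
-- so no other client in that pool will be served these rows again
fetch 50 from pool_1_images;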

Similarly, you might have a pool of cached UNLOGGED tables of randomized data. When a client sends their first request you assign them to one of the tables using their HTTP session ID - either by hash bucketing or by storing a mapping. Then you can just return results from that table for subsequent requests.
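
A sketch of that variant, again with illustrative names; you would create several such copies and re-create them on whatever schedule suits you:

-- one pre-shuffled copy; pos gives a stable paging order within this copy
create unlogged table random_images_0 as
    select row_number() over (order by random()) as pos, i.*
    from images i;
create index on random_images_0 (pos);

-- a client assigned to copy 0 who has already been served 50 rows asks for the next 50
select *
from random_images_0
where pos > 50
order by pos
limit 50;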

Phew. Wow. That became a bit long. I hope it made some sense, and has some useful ideas.



Table definitions at the bottom. For an extremely fast random set of unique ids:

with r as (
    -- generate 200 candidate ids, dedupe them, and keep 100;
    -- more than 50 are needed because some ids fall into gaps
    select distinct
        ceil(
            random() *
            (select max(image_id) from image)
        )::int as image_id
    from generate_series(1, 200)
    limit 100
)
select image_id
from image inner join r using (image_id)
limit 50;

Since the primary key has gaps, it is necessary to join the table against more than 50 randomly generated ids to be sure at least 50 will match. How many more will depend on the "gappyness" of the PK. In the sample table, 2 out of every 5 ids are missing.

To have enough distinct randomly generated ids, it is also necessary to generate more than the 100 that will be joined (200 in the sample above), since duplicates are discarded. How many more will depend on how big the table is. For a really big table just a few more are enough.

Even if the numbers above are exaggerated, the impact on performance is negligible. To avoid returning already seen images, I would create a seem_image table with session_id (or user_id for a longer expiration time) and image_id, and insert into it on each GetImages call. No need for an additional GetMoreImages function.

with r as (
    -- 100 distinct random candidate ids, as in the query above
    select distinct
        ceil(
            random() *
            (select max(image_id) from image)
        )::int as image_id
    from generate_series(1, 200)
    limit 100
), t as (
    -- keep only candidates that exist and have not been seen in this session
    select image_id, image
    from image inner join r using (image_id)
    where not exists (
        select 1
        from seem_image
        where image_id = image.image_id and session_id = 1
    )
    limit 50
), i as (
    -- record the returned images as seen for this session
    insert into seem_image (image_id, session_id)
    select image_id, 1
    from t
)
select * from t;

The query above will only return images that have not been seen. With the sample 3-million-row image table it is very fast. For long image-browsing sessions it will be necessary to change the 100 and 200 above to appropriate numbers. Expired sessions should be deleted from the seem_image table at a regular interval, depending on the session expiration time, to keep it from growing too big.
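
The cleanup itself could be a simple scheduled delete; the sessions table and last_seen column below are assumptions about how session expiry is tracked, not part of the schema above:

delete from seem_image
where session_id in (
    select session_id
    from sessions
    where last_seen < now() - interval '7 days'
);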

Sample image table (with an integer primary key containing gaps) and seem_image table:

create table image (
    image_id integer primary key,
    image bytea
);
insert into image (image_id, image)
select image_id, image
from
    generate_series(1, 5000000) g (image_id)
    cross join
    (values (decode(rpad('', 1024 * 100, 'F'), 'hex'))) i (image)
where mod (image_id, 5) not in (0, 1) -- leave gaps: drop 2 of every 5 ids
;
analyze image;

create table seem_image (
    session_id integer,
    image_id integer,
    primary key (image_id, session_id)
);

