
I'm trying to generate several thousand rows of random test data for some synthetic load testing, but I've run into a weird bug I don't understand.

Here's the minimal reproducible example I managed to narrow it down to. First, let's create a table with some unique values:

CREATE TABLE vals (
    value INT PRIMARY KEY
);
INSERT INTO vals SELECT generate_series(1, 10);

Let's check that the values are unique:

SELECT array(SELECT * FROM vals);

>> {1,2,3,4,5,6,7,8,9,10}

Yep, that's good. Now let's create a table that holds lots of user data and references the vals table:

CREATE TABLE tmp (
    a INT REFERENCES vals,
    b INT[]
);

And fill it with lots of random data:

  WITH test_count AS (SELECT generate_series(1, 10000))
    -- (my real query has more CTEs, so I can't drop the CTE here)
INSERT
  INTO tmp
SELECT
       (SELECT value FROM vals ORDER BY random() LIMIT 1),
       array(SELECT value FROM vals WHERE random() > 0.85)
  FROM test_count;

But when I check it, all 10000 rows contain the same values:

SELECT DISTINCT a, b FROM tmp;

>> a | b
   ---------
   2 | {8,5}

I've found out that Postgres sometimes optimizes multiple random() calls in the same row to the same value, e.g. SELECT random(), random() can return 0.345, 0.345: link.

But in my case the random values within a single row are different; it's the values across rows that are all the same.
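The behavior can be reproduced without my tables at all; an uncorrelated scalar subquery seems to be evaluated only once for the whole statement (a minimal sketch of what I'm seeing; the exact values will differ on each run):

```sql
-- The inner SELECT doesn't reference the outer row, so Postgres plans it
-- as an InitPlan and runs it a single time: all three rows show the
-- same value.
SELECT (SELECT random()) FROM generate_series(1, 3);

-- Two direct calls in the same row, by contrast, still differ.
SELECT random(), random();
```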

What is the right way to fix this?

1 Answer


The problem is premature optimization: the scalar subqueries don't reference the outer query, so Postgres evaluates them once and reuses the result for every row. Although there are other ways to phrase the query, adding a (nonsensical) correlation clause forces the subqueries to be re-run for each row:

WITH test_count AS (
      SELECT generate_series(1, 10000) as id
     )
INSERT INTO tmp
SELECT (SELECT value FROM vals WHERE tc.id IS NOT NULL ORDER BY random() LIMIT 1),
       array(SELECT value FROM vals WHERE tc.id IS NOT NULL AND random() > 0.85)
FROM test_count tc;

Here is a db<>fiddle.
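An equivalent way to write this is with a LATERAL join; this is only a sketch, and it relies on the same dummy tc.id correlation to keep the subquery from being evaluated just once:

```sql
WITH test_count AS (
    SELECT generate_series(1, 10000) AS id
)
INSERT INTO tmp
SELECT picked.value,
       array(SELECT value FROM vals WHERE tc.id IS NOT NULL AND random() > 0.85)
FROM test_count tc
CROSS JOIN LATERAL (
    SELECT value
    FROM vals
    WHERE tc.id IS NOT NULL   -- dummy correlation: forces re-execution per row
    ORDER BY random()
    LIMIT 1
) AS picked;
```

This keeps the row-picking logic in one named subquery instead of repeating it per column, which can be easier to maintain if more random columns are added later.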


2 Comments

That is really unexpected. I thought there was some kind of resettable seed in random(). I tested your advice and it works flawlessly, so I'll mark your answer as the solution in 7 more minutes. Thanks for the quickest answer I've ever got! :D
@Xobotun . . . I consider this a bug, because random() should be treated as a volatile function and not optimized away, as happens without the correlation clause. But SQL Server (and perhaps other databases) behave the same way. Alas.
