
I'm trying to generate several thousand rows of random test data for some synthetic load testing, but I've run into a weird bug I don't understand.

Here's the minimal reproducible example I managed to narrow it down to. First, let's create a table with some unique values:

CREATE TABLE vals (
    value INT PRIMARY KEY
);
INSERT INTO vals SELECT generate_series(1, 10);

Let's check that the values are unique:

SELECT array(SELECT * FROM vals);

>> {1,2,3,4,5,6,7,8,9,10}

Yep, that's good. Now let's create a table that holds lots of user data and references the vals table:

CREATE TABLE tmp (
    a INT REFERENCES vals,
    b INT[]
);

And fill it with lots of random data:

  WITH test_count AS (SELECT generate_series(1, 10000))
    -- (my real query has more CTEs, so I can't drop the CTE here)
INSERT
  INTO tmp
SELECT
       (SELECT value FROM vals ORDER BY random() LIMIT 1),
       array(SELECT value FROM vals WHERE random() > 0.85)
  FROM test_count;

But when I check it, all 10000 rows contain the same values:

SELECT DISTINCT a, b FROM tmp;

>> a | b
   ---------
   2 | {8,5}

I've found out that Postgres sometimes optimizes multiple random() calls in the same row to the same value, e.g. SELECT random(), random() can return 0.345, 0.345: link.

But in my case the random values within a single row are different; it's the values across rows that are all the same.
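The behavior can be reproduced without my tables at all; an uncorrelated scalar subquery seems to be evaluated only once for the whole statement (a minimal sketch of what I'm seeing; the exact values will differ on each run):

```sql
-- The inner SELECT doesn't reference the outer row, so Postgres plans it
-- as an InitPlan and runs it a single time: all three rows show the
-- same value.
SELECT (SELECT random()) FROM generate_series(1, 3);

-- Two direct calls in the same row, by contrast, still differ.
SELECT random(), random();
```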

What is the right way to fix this?

1 Answer


The problem is premature optimization: the scalar subqueries don't reference the outer query, so Postgres evaluates them once and reuses the result for every row. Although there are other ways to phrase the query, adding a (nonsensical) correlation clause forces the subqueries to be re-run for each row:

WITH test_count AS (
      SELECT generate_series(1, 10000) as id
     )
INSERT INTO tmp
SELECT (SELECT value FROM vals WHERE tc.id IS NOT NULL ORDER BY random() LIMIT 1),
       array(SELECT value FROM vals WHERE tc.id IS NOT NULL AND random() > 0.85)
FROM test_count tc;

Here is a db<>fiddle.
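An equivalent way to write this is with a LATERAL join; this is only a sketch, and it relies on the same dummy tc.id correlation to keep the subquery from being evaluated just once:

```sql
WITH test_count AS (
    SELECT generate_series(1, 10000) AS id
)
INSERT INTO tmp
SELECT picked.value,
       array(SELECT value FROM vals WHERE tc.id IS NOT NULL AND random() > 0.85)
FROM test_count tc
CROSS JOIN LATERAL (
    SELECT value
    FROM vals
    WHERE tc.id IS NOT NULL   -- dummy correlation: forces re-execution per row
    ORDER BY random()
    LIMIT 1
) AS picked;
```

This keeps the row-picking logic in one named subquery instead of repeating it per column, which can be easier to maintain if more random columns are added later.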


2 Comments

That is really unexpected. I thought there was some kind of resettable seed in random(). I tested your advice and it works flawlessly, so I'll mark your answer as the solution in 7 more minutes. Thanks for the quickest answer I've ever got! :D
@Xobotun . . . I consider this a bug, because random() should be treated as a volatile function and not optimized away, as happens without the correlation clause. But SQL Server (and perhaps other databases) behave the same way. Alas.
