I'm trying to generate several thousand rows of random test data for some synthetic load testing, but I've run into a weird bug I don't understand.
Here's the minimal reproducible example I managed to narrow it down to. Let's create a table with some unique values:
CREATE TABLE vals (
    value INT PRIMARY KEY
);
INSERT INTO vals SELECT generate_series(1, 10);
Let's check that the values are there and unique:
SELECT array(SELECT * FROM vals);
>> {1,2,3,4,5,6,7,8,9,10}
Yep, that's good. Now let's create a table that holds lots of user data and references the vals table:
CREATE TABLE tmp (
    a INT REFERENCES vals,
    b INT[]
);
And fill it with lots of random data:
WITH test_count AS (SELECT generate_series(1, 10000))
-- in the real query there are more CTEs here, so I can't get rid of them
INSERT INTO tmp
SELECT
    (SELECT value FROM vals ORDER BY random() LIMIT 1),
    array(SELECT value FROM vals WHERE random() > 0.85)
FROM test_count;
But when we check it, all 10,000 rows turn out to be filled with the same values:
SELECT DISTINCT a, b FROM tmp;
>> a | b
---------
2 | {8,5}
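To rule out that this is just an artifact of how I display the results, here's an extra sanity check along the same lines (the column aliases are mine); if I'm reading the output above correctly, it should report a single distinct value for both columns:
SELECT count(DISTINCT a) AS distinct_a,
       count(DISTINCT b) AS distinct_b,
       count(*)          AS total_rows
FROM tmp;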
I've found out that Postgres sometimes optimizes multiple random() calls in the same row into a single value, e.g. SELECT random(), random() can return 0.345, 0.345: link.
But in my case the random values within a row are different, while the values across all the rows are the same.
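My current guess is that both subqueries are completely independent of the outer query, so Postgres plans them as something that runs once per statement rather than once per row. This is how I tried to look at the plan, though I'm not sure I'm reading it correctly:
EXPLAIN
WITH test_count AS (SELECT generate_series(1, 10000))
SELECT
    (SELECT value FROM vals ORDER BY random() LIMIT 1),
    array(SELECT value FROM vals WHERE random() > 0.85)
FROM test_count;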
What is the right way to fix this?
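For what it's worth, here's the workaround I'm experimenting with: referencing the outer column inside each subquery so that (hopefully) it has to be re-evaluated for every row. The n alias and the no-op n * 0 / OR n IS NULL terms are mine, added purely to force the correlation, so I don't know whether this is the idiomatic approach or just a fragile hack:
WITH test_count AS (SELECT generate_series(1, 10000) AS n)
INSERT INTO tmp
SELECT
    -- reference n so the subquery depends on the current row (n * 0 doesn't change the ordering)
    (SELECT value FROM vals ORDER BY random() + n * 0 LIMIT 1),
    -- same trick: n IS NULL is always false here, it only ties the subquery to the outer row
    array(SELECT value FROM vals WHERE random() > 0.85 OR n IS NULL)
FROM test_count;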