
I would like to generate, from a vector that for simplicity we can call "serie1", another vector of dimension 1000x1, where each element of the new vector is the sum of j random elements of "serie1".

I was thinking about creating a random matrix of dimension 1000xj from the vector and then summing horizontally.

How would you suggest doing this in Python?

In order to get a random vector I could do

Vector=np.random.choice(serie1, 1000, replace=True)

but I am not sure how to proceed from there, or whether there is an efficient solution.

3 Answers


The basic problem is getting j unique elements for each of the 1000 rows. We can't use np.random.choice(..., replace=True) directly there, as then we wouldn't get j unique elements per row. One vectorized approach is to use a random matrix of shape (1000, len(input_array)), argsort along the second axis, take the first j indices per row (which are unique by construction), index into the input array with them and finally sum along the second axis.

To implement it, here are two approaches -

def app1(serie1, j, N=1000):
    # Random scores for every (row, element) pair; argsort turns each row into
    # a random permutation of indices, so the first j columns are j unique indices.
    idx = np.random.rand(N, serie1.size).argsort(1)[:,:j]
    return serie1[idx].sum(1)

Using the more efficient np.argpartition to pick j random elements and np.take for faster indexing -

def app2(serie1, j, N=1000):
    # argpartition only partially sorts each row, which is cheaper than a full
    # argsort; the first j columns are still j unique random indices per row.
    idx = np.random.rand(N, serie1.size).argpartition(j, axis=1)[:,:j]
    return np.take(serie1, idx).sum(1)
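
For completeness, here is a minimal usage sketch of the two helpers; the input vector and j below are made up purely for illustration -

import numpy as np

serie1 = np.random.rand(800)      # illustrative input vector
j = 5

out1 = app1(serie1, j, N=1000)    # 1000 sums of j unique random elements each
out2 = app2(serie1, j, N=1000)
print(out1.shape, out2.shape)     # (1000,) (1000,)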

Sample run to demo creating the indices idx -

In [35]: serie1 = np.random.randint(0,9,(20))

In [36]: idx = np.random.rand(1000,serie1.size).argsort(1)[:,:5]

In [37]: idx
Out[37]: 
array([[16, 13, 19,  0, 15],
       [ 7,  4, 13, 15, 14],
       [ 8,  3, 15,  1,  9],
       ..., 
       [11, 15, 17,  4, 19],
       [19,  0,  3,  7,  9],
       [10,  1, 19, 12,  6]])

Verifying uniform random sampling -

In [81]: serie1 = np.arange(20)

In [82]: j = 5

In [83]: idx = np.random.rand(1000000,serie1.size).argsort(1)[:,:j]

In [84]: np.bincount(idx.ravel())
Out[84]: 
array([250317, 250298, 250645, 249544, 250396, 249972, 249492, 250512,
       249968, 250133, 249622, 250170, 250291, 250060, 250102, 249446,
       249398, 249003, 250249, 250382])

With fairly equal counts across all 20 elements of the input array, the sampling looks pretty uniformly distributed.
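
To put a number on "fairly equal": each index should appear about 1000000 * 5 / 20 = 250000 times, and the counts above all lie within roughly 0.4% of that. A quick sketch of this check, reusing idx, j and serie1 from the run above -

counts = np.bincount(idx.ravel())
expected = idx.shape[0] * j / serie1.size          # 1000000 * 5 / 20 = 250000.0
print(np.abs(counts - expected).max() / expected)  # ~0.004 for the counts shown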

Runtime test -

In [140]: serie1 = np.random.randint(0,9,(20))

In [141]: j = 5

# @elcombato's soln
In [142]: %timeit [sum(sample(serie1, j)) for _ in range(1000)]
100 loops, best of 3: 10.7 ms per loop

# Posted solutions in this post
In [143]: %timeit app1(serie1, j, N=1000)
     ...: %timeit app2(serie1, j, N=1000)
     ...: 
1000 loops, best of 3: 943 µs per loop
1000 loops, best of 3: 870 µs per loop

3 Comments

Ok, nice answer, but I get ValueError: kth(=1300) out of bounds (800). For every one of the 1000 simulations I have to sum 1300 elements in my specific case (so j=1300), taken randomly from the array serie1, which has 800 elements.
@Klapaucius So, you are supposed to take 1300 elements out of an array that has 800 elements? So, you are okay with some of those elements being duplicated?
@Klapaucius Then, I guess Paul Panzer's solution would be the one to solve your case.

You are close:

vector = np.random.choice(serie1, (1000, j), replace=True).sum(axis=-1, keepdims=True)

Note that this is with replacement.

For not-too-large j, an acceptance-rejection scheme could be applied to eliminate repeats.

def accept_reject(serie1, j):
    efficiency_ratio = 2  # just a guess
    M = len(serie1)
    # Probability that j indices drawn with replacement are all distinct.
    accept_rate = np.prod(np.linspace(1 - (j-1)/M, 1, j))
    # Oversample so that enough all-distinct rows survive with high probability.
    n_draw = int(1000 / accept_rate + 4 * np.sqrt(1000 * (1 - accept_rate)))
    if n_draw * j * efficiency_ratio > 1000 * M:
        return use_other_solution(serie1, j)
    raw = np.random.randint(0, M, (n_draw, j))
    raw.sort(axis=-1)
    # Keep only rows whose sorted indices strictly increase, i.e. are all distinct.
    raw = raw[np.all(np.diff(raw, axis=-1) > 0, axis=-1), :]
    if len(raw) > 1000:
        raw = raw[:1000, :]
    elif len(raw) < 1000:
        return use_other_solution(serie1, j)
    return serie1[raw].sum(axis=-1, keepdims=True)
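
Here use_other_solution is left undefined; it is just a placeholder for whatever exact method you fall back to when rejection sampling would be too wasteful. One possibility (my assumption, not part of the original answer) is the argsort approach from the other answer, wrapped to match the output shape -

def use_other_solution(serie1, j, N=1000):
    # Fallback: a full argsort per row yields j unique indices without rejection.
    idx = np.random.rand(N, len(serie1)).argsort(1)[:, :j]
    return serie1[idx].sum(axis=-1, keepdims=True)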

3 Comments

This would create duplicate indices, right? I thought OP wanted j random elements, and by that I assumed j unique elements of serie1.
@Divakar Well, they have a replace=True in their code snippet. Btw., ingenious as it is, are you sure your argpartition solution is correct in the sense of sampling correctly?
Well, OP mentioned: "In order to get a random vector ... but I would not know how to proceed". Don't think that leads us to what was mentioned as the goal of the question. Added a section in my post on the sampling.

Base Python

from random import sample
vector = [sum(sample(serie1, j)) for _ in range(1000)]

With Numpy to enable replacement

import numpy as np
vector = [sum(np.random.choice(serie1, j, replace=True)) for _ in range(1000)]

3 Comments

I guess that in this case replace=True is not allowed.
That's right. So with replacement: vector = [sum(np.random.choice(serie1, j, replace=True)) for _ in range(1000)]
So, if both your solution and Paul's are correct, these two: Vector = [sum(np.random.choice(serie1, N, replace=True)) for _ in range(1000000)] and Vector = np.random.choice(serie1, (1000000, N), replace=True).sum(axis=-1, keepdims=True) should yield the same result
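
Strictly speaking the two snippets draw different random numbers, so they can only agree in distribution, not element by element. A quick sketch of such a check (here j plays the role of N in the comment above, with a smaller sample size for speed) -

a = np.array([sum(np.random.choice(serie1, j, replace=True)) for _ in range(100000)])
b = np.random.choice(serie1, (100000, j), replace=True).sum(axis=-1)
print(a.mean(), b.mean())  # means should be close
print(a.std(), b.std())    # spreads should be close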
