0

I have a block of text, shown below. How do I read it into a NumPy array?

   5.780326E+03   7.261185E+03   7.749190E+03   8.488770E+03   5.406134E+03   2.828410E+03   9.620957E+02  1.0000000E+00
   3.097372E+03   3.885160E+03   5.432678E+03   8.060628E+03   2.768457E+03   6.574258E+03   7.268591E+02  2.0000000E+00
   2.061429E+03   4.665282E+03   8.214119E+03   3.579380E+03   8.542057E+03   2.089062E+03   8.829263E+02  3.0000000E+00
   3.572444E+03   9.920473E+03   3.573251E+03   6.423813E+03   2.469338E+03   4.652253E+03   8.211962E+02  4.0000000E+00
   7.460966E+03   7.691966E+03   7.501826E+03   3.414511E+03   8.590221E+03   6.737868E+03   8.586273E+02  5.0000000E+00
   3.250046E+03   9.611985E+03   9.195165E+03   1.064800E+03   7.944535E+03   2.685740E+03   8.212849E+02  6.0000000E+00
   8.069926E+03   9.208576E+03   4.267749E+03   2.491888E+03   9.036555E+03   5.001732E+03   7.202407E+02  7.0000000E+00
   5.691460E+03   3.868344E+03   3.103342E+03   6.567618E+03   7.274860E+03   8.393253E+03   5.628069E+02  8.0000000E+00
   2.887292E+03   9.081563E+02   6.955551E+03   6.763133E+03   2.146178E+03   2.033861E+03   9.725472E+02  9.0000000E+00
   6.127778E+03   8.065057E+02   7.474341E+03   4.185868E+03   4.516230E+03   8.714840E+03   8.254562E+02  1.0000000E+01
   1.594643E+03   6.060956E+03   2.137153E+03   3.505950E+03   7.714227E+03   6.249693E+03   5.724376E+02  1.1000000E+01
   5.039059E+03   3.138161E+03   5.570104E+03   4.594189E+03   7.889644E+03   1.891062E+03   7.085753E+02  1.2000000E+01
   3.263593E+03   6.085087E+03   7.136061E+03   9.895028E+03   6.139666E+03   6.670919E+03   5.018248E+02  1.3000000E+01
   9.954830E+03   6.777074E+03   3.013747E+03   3.638458E+03   4.357685E+03   1.876539E+03   5.969378E+02  1.4000000E+01
   9.920853E+03   3.414156E+03   5.534430E+03   2.011815E+03   7.791122E+03   3.893439E+03   5.229754E+02  1.5000000E+01
   5.447470E+03   7.184321E+03   1.382575E+03   9.134295E+03   7.883753E+02   9.160537E+03   7.521197E+02  1.6000000E+01
   3.344917E+03   8.151884E+03   3.596052E+03   3.953284E+03   7.456115E+03   7.749632E+03   9.773521E+02  1.7000000E+01
   6.310496E+03   1.472792E+03   1.812452E+03   9.535100E+03   1.581263E+03   3.649150E+03   6.562440E+02  1.8000000E+01

I am trying to use native NumPy methods to speed up reading the data. I need to read a couple of GB of data from a custom file format, and I am able to seek to the position where a block of text like the one above appears. Doing regular Python string operations on it is always possible, but I wanted to know whether NumPy has any native method for reading fixed-width format.

I tried np.frombuffer with dtype=float, which did not work. It does read the data if I use dtype='S15', but the values then show up as bytes, not numbers.

5
  • 1
    'block of strings' - that's not clear. Is this one multiline string? A csv file? Can you provide a sample we can copy-n-paste? Keep in mind that numpy's fast stuff is numeric. String manipulation depends more on native Python. Commented Feb 11, 2020 at 16:22
  • I'm out of votes, but @hpaulj is asking some important questions. Commented Feb 11, 2020 at 19:10
  • @hpaulj, sorry that my question lacked more context. I wanted the question to be simple enough so that I get some answers and detailed enough to make those answers useful for me. I have added some more details to the question now. I hope that answers your question. Your comment that string manipulation has to be done with native python more or less answers my question. I will have to split the fixed width string using native python list comprehension then! Commented Feb 12, 2020 at 8:44
  • 1
    The delimiter parameter of genfromtxt and loadtxt lets you specify column widths Commented Feb 12, 2020 at 9:35
  • Thanks. That did the trick for me. So here's what I ended up with now. np.genfromtxt(f, delimiter=[15]*8, max_rows=18) Commented Feb 12, 2020 at 10:13

5 Answers

2
In [294]: txt = """5.780326E+03   7.261185E+03   7.749190E+03   8.488770E+03   5.406134E+03   2.828410E+03   9.620957E+02  1.0000000E+00
     ...:    3.097372E+03   3.885160E+03   5.432678E+03   8.060628E+03   2.768457E+03   6.574258E+03   7.268591E+02  2.0000000E+00
     ...:    2.061429E+03   4.665282E+03   8.214119E+03   3.579380E+03   8.542057E+03   2.089062E+03   8.829263E+02  3.0000000E+00
     ...:    """

With this copy-n-paste I'm assuming your block is a multiline string.

Treating it like a CSV file:

In [296]: np.loadtxt(txt.splitlines())                                                         
Out[296]: 
array([[5.780326e+03, 7.261185e+03, 7.749190e+03, 8.488770e+03,
        5.406134e+03, 2.828410e+03, 9.620957e+02, 1.000000e+00],
       [3.097372e+03, 3.885160e+03, 5.432678e+03, 8.060628e+03,
        2.768457e+03, 6.574258e+03, 7.268591e+02, 2.000000e+00],
       [2.061429e+03, 4.665282e+03, 8.214119e+03, 3.579380e+03,
        8.542057e+03, 2.089062e+03, 8.829263e+02, 3.000000e+00]])

There's a lot going on under the covers, so this isn't particularly fast. pandas has a faster csv reader.

fromstring works, but returns a 1-D array. You can reshape the result:

In [299]: np.fromstring(txt, sep='  ')                                                         
Out[299]: 
array([5.780326e+03, 7.261185e+03, 7.749190e+03, 8.488770e+03,
       5.406134e+03, 2.828410e+03, 9.620957e+02, 1.000000e+00,
       3.097372e+03, 3.885160e+03, 5.432678e+03, 8.060628e+03,
       2.768457e+03, 6.574258e+03, 7.268591e+02, 2.000000e+00,
       2.061429e+03, 4.665282e+03, 8.214119e+03, 3.579380e+03,
       8.542057e+03, 2.089062e+03, 8.829263e+02, 3.000000e+00])
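
A rough sketch of the reshape step, assuming every row has the same 8 columns as in the sample. The names `rows`, `flat` and `data` are made up for this illustration:

```python
import numpy as np

# Three rows copied from the sample in the question.
rows = [
    "   5.780326E+03   7.261185E+03   7.749190E+03   8.488770E+03   5.406134E+03   2.828410E+03   9.620957E+02  1.0000000E+00",
    "   3.097372E+03   3.885160E+03   5.432678E+03   8.060628E+03   2.768457E+03   6.574258E+03   7.268591E+02  2.0000000E+00",
    "   2.061429E+03   4.665282E+03   8.214119E+03   3.579380E+03   8.542057E+03   2.089062E+03   8.829263E+02  3.0000000E+00",
]
txt = "\n".join(rows)

# Text-mode fromstring: a whitespace separator also matches the newlines,
# so the result is one flat array holding all the values.
flat = np.fromstring(txt, sep=" ")

# Restore the row structure; -1 lets NumPy infer the number of rows.
data = flat.reshape(-1, 8)
print(data.shape)
```

The reshape only works when the total number of values is a multiple of the column count, so a short or malformed row would raise an error rather than silently shift the data.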

This is a string, not a buffer, so frombuffer is wrong.
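
A small illustration of that distinction, as a sketch only. It assumes the text has already been read as bytes, and the variable names are hypothetical:

```python
import numpy as np

# Binary data: frombuffer reinterprets the raw bytes of two float64 values.
raw = np.array([1.5, 2.5]).tobytes()
from_binary = np.frombuffer(raw, dtype=np.float64)

# Text data: the same call can only cut the characters into
# fixed-size byte strings; it does not parse them as numbers.
text = b"   5.780326E+03   7.261185E+03"
as_bytes = np.frombuffer(text, dtype="S15")

# Turning those byte strings into numbers is a separate conversion step.
as_floats = as_bytes.astype(np.float64)

print(from_binary)
print(as_bytes)
print(as_floats)
```

This is why dtype='S15' appears to "read" the block but yields bytes: the buffer is being sliced, not parsed.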

This list comprehension works:

np.array([row.strip().split('  ') for row in txt.strip().splitlines()], float) 

I had to add strip to clear out excess blanks that produced empty lists or strings.

At least with this small sample, the list comprehension isn't that much slower than the fromstring, and still a lot better than the more general loadtxt.
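
One way to check this on your own data is a small timing harness like the sketch below. The function names and the repeat count are arbitrary choices for illustration, and the numbers will vary with machine, NumPy version and block size:

```python
import timeit
import numpy as np

# Three rows copied from the sample in the question.
rows = [
    "   5.780326E+03   7.261185E+03   7.749190E+03   8.488770E+03   5.406134E+03   2.828410E+03   9.620957E+02  1.0000000E+00",
    "   3.097372E+03   3.885160E+03   5.432678E+03   8.060628E+03   2.768457E+03   6.574258E+03   7.268591E+02  2.0000000E+00",
    "   2.061429E+03   4.665282E+03   8.214119E+03   3.579380E+03   8.542057E+03   2.089062E+03   8.829263E+02  3.0000000E+00",
]
txt = "\n".join(rows)


def with_loadtxt():
    return np.loadtxt(txt.splitlines())


def with_fromstring():
    return np.fromstring(txt, sep=" ").reshape(-1, 8)


def with_comprehension():
    return np.array([row.split() for row in txt.splitlines()], dtype=float)


candidates = [with_loadtxt, with_fromstring, with_comprehension]

# Time each approach over the same number of repetitions.
timings = {func.__name__: timeit.timeit(func, number=200) for func in candidates}
for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 200 runs")
```

All three functions should return the same 3x8 array, so only the timings differ.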


0

You could use several string operations to convert the data into strings that can be converted to float, such as:

import numpy as np

with open('data.txt', 'r') as f:
    data = f.readlines()

result = []
for line in data:
    splitted_data = line.split(' ')
    splitted_data = [item for item in splitted_data if item]
    splitted_data = [item.replace('E+', 'e') for item in splitted_data]

    result.append(splitted_data)

result = np.array(result, dtype = 'float64')

Where data.txt is the data you pasted in your question.

1 Comment

Most of the operations in this answer are done in native python. I wanted to know if numpy has any native methods like frombuffer or fromstring.

0

I just did a regular python split and assigned the dtype to np.float32

>>> y=np.array(x.split(), dtype=np.float32)
>>> y
array([  5.78032617e+03,   7.26118506e+03,   7.74918994e+03,
         8.48876953e+03,   5.40613379e+03,   2.82840991e+03,
         9.62095703e+02,   1.00000000e+00,   3.09737207e+03,
         3.88515991e+03,   5.43267822e+03,   8.06062793e+03,
         2.76845703e+03,   6.57425781e+03,   7.26859070e+02,
         2.00000000e+00,   2.06142896e+03,   4.66528223e+03,
         8.21411914e+03,   3.57937988e+03,   8.54205664e+03,
         2.08906201e+03,   8.82926270e+02,   3.00000000e+00], dtype=float32)

P.S. I copied a chunk of your sample data and assigned it to variable “x”

OK, this version doesn't rely on blank spaces or use split() (except splitlines() for the lines), and it keeps the shape of the array, but it still uses non-NumPy Python.

>>> n=15
>>> x='   5.780326E+03   7.261185E+03   7.749190E+03   8.488770E+03   5.406134E+03   2.828410E+03   9.620957E+02  1.0000000E+00\n   3.097372E+03   3.885160E+03   5.432678E+03   8.060628E+03   2.768457E+03   6.574258E+03   7.268591E+02  2.0000000E+00\n   2.061429E+03   4.665282E+03   8.214119E+03   3.579380E+03   8.542057E+03   2.089062E+03   8.829263E+02  3.0000000E+00\n   3.572444E+03   9.920473E+03   3.573251E+03   6.423813E+03   2.469338E+03   4.652253E+03   8.211962E+02  4.0000000E+00\n   7.460966E+03   7.691966E+03   7.501826E+03   3.414511E+03   8.590221E+03   6.737868E+03   8.586273E+02  5.0000000E+00\n   3.250046E+03   9.611985E+03   9.195165E+03   1.064800E+03   7.944535E+03   2.685740E+03   8.212849E+02  6.0000000E+00\n   8.069926E+03   9.208576E+03   4.267749E+03   2.491888E+03   9.036555E+03   5.001732E+03   7.202407E+02  7.0000000E+00\n   5.691460E+03   3.868344E+03   3.103342E+03   6.567618E+03   7.274860E+03   8.393253E+03   5.628069E+02  8.0000000E+00\n   2.887292E+03   9.081563E+02   6.955551E+03   6.763133E+03   2.146178E+03   2.033861E+03   9.725472E+02  9.0000000E+00\n   6.127778E+03   8.065057E+02   7.474341E+03   4.185868E+03   4.516230E+03   8.714840E+03   8.254562E+02  1.0000000E+01\n   1.594643E+03   6.060956E+03   2.137153E+03   3.505950E+03   7.714227E+03   6.249693E+03   5.724376E+02  1.1000000E+01\n   5.039059E+03   3.138161E+03   5.570104E+03   4.594189E+03   7.889644E+03   1.891062E+03   7.085753E+02  1.2000000E+01\n   3.263593E+03   6.085087E+03   7.136061E+03   9.895028E+03   6.139666E+03   6.670919E+03   5.018248E+02  1.3000000E+01\n   9.954830E+03   6.777074E+03   3.013747E+03   3.638458E+03   4.357685E+03   1.876539E+03   5.969378E+02  1.4000000E+01\n   9.920853E+03   3.414156E+03   5.534430E+03   2.011815E+03   7.791122E+03   3.893439E+03   5.229754E+02  1.5000000E+01\n   5.447470E+03   7.184321E+03   1.382575E+03   9.134295E+03   7.883753E+02   9.160537E+03   7.521197E+02  1.6000000E+01\n   3.344917E+03   8.151884E+03   3.596052E+03   3.953284E+03   7.456115E+03   7.749632E+03   9.773521E+02  1.7000000E+01\n   6.310496E+03   1.472792E+03   1.812452E+03   9.535100E+03   1.581263E+03   3.649150E+03   6.562440E+02  1.8000000E+01'
>>> s=np.array([[y[i:i+n] for i in range(0, len(y) - n + 1, n)] for y in x.splitlines()], dtype=np.float32)
>>> s
array([[  5.78032617e+03,   7.26118506e+03,   7.74918994e+03,
          8.48876953e+03,   5.40613379e+03,   2.82840991e+03,
          9.62095703e+02,   1.00000000e+00],
       [  3.09737207e+03,   3.88515991e+03,   5.43267822e+03,
          8.06062793e+03,   2.76845703e+03,   6.57425781e+03,
          7.26859070e+02,   2.00000000e+00],
       [  2.06142896e+03,   4.66528223e+03,   8.21411914e+03,
          3.57937988e+03,   8.54205664e+03,   2.08906201e+03,
          8.82926270e+02,   3.00000000e+00],
       [  3.57244409e+03,   9.92047266e+03,   3.57325098e+03,
          6.42381299e+03,   2.46933789e+03,   4.65225293e+03,
          8.21196228e+02,   4.00000000e+00],
       [  7.46096582e+03,   7.69196582e+03,   7.50182617e+03,
          3.41451099e+03,   8.59022070e+03,   6.73786816e+03,
          8.58627319e+02,   5.00000000e+00],
       [  3.25004590e+03,   9.61198535e+03,   9.19516504e+03,
          1.06480005e+03,   7.94453516e+03,   2.68573999e+03,
          8.21284912e+02,   6.00000000e+00],
       [  8.06992578e+03,   9.20857617e+03,   4.26774902e+03,
          2.49188794e+03,   9.03655469e+03,   5.00173193e+03,
          7.20240723e+02,   7.00000000e+00],
       [  5.69145996e+03,   3.86834399e+03,   3.10334204e+03,
          6.56761816e+03,   7.27485986e+03,   8.39325293e+03,
          5.62806885e+02,   8.00000000e+00],
       [  2.88729199e+03,   9.08156311e+02,   6.95555078e+03,
          6.76313281e+03,   2.14617798e+03,   2.03386096e+03,
          9.72547180e+02,   9.00000000e+00],
       [  6.12777783e+03,   8.06505676e+02,   7.47434082e+03,
          4.18586816e+03,   4.51622998e+03,   8.71483984e+03,
          8.25456177e+02,   1.00000000e+01],
       [  1.59464294e+03,   6.06095605e+03,   2.13715308e+03,
          3.50594995e+03,   7.71422705e+03,   6.24969287e+03,
          5.72437622e+02,   1.10000000e+01],
       [  5.03905908e+03,   3.13816089e+03,   5.57010400e+03,
          4.59418896e+03,   7.88964404e+03,   1.89106201e+03,
          7.08575317e+02,   1.20000000e+01],
       [  3.26359302e+03,   6.08508691e+03,   7.13606104e+03,
          9.89502832e+03,   6.13966602e+03,   6.67091895e+03,
          5.01824799e+02,   1.30000000e+01],
       [  9.95483008e+03,   6.77707422e+03,   3.01374707e+03,
          3.63845801e+03,   4.35768506e+03,   1.87653894e+03,
          5.96937805e+02,   1.40000000e+01],
       [  9.92085254e+03,   3.41415601e+03,   5.53443018e+03,
          2.01181494e+03,   7.79112207e+03,   3.89343896e+03,
          5.22975403e+02,   1.50000000e+01],
       [  5.44747021e+03,   7.18432080e+03,   1.38257495e+03,
          9.13429492e+03,   7.88375305e+02,   9.16053711e+03,
          7.52119690e+02,   1.60000000e+01],
       [  3.34491699e+03,   8.15188379e+03,   3.59605200e+03,
          3.95328394e+03,   7.45611523e+03,   7.74963184e+03,
          9.77352112e+02,   1.70000000e+01],
       [  6.31049609e+03,   1.47279199e+03,   1.81245203e+03,
          9.53509961e+03,   1.58126294e+03,   3.64914990e+03,
          6.56244019e+02,   1.80000000e+01]], dtype=float32)

1 Comment

I wanted to avoid the split() operation. Also, this is a fixed-width format, so a space between fields is not always guaranteed.

0

Thanks to @hpaulj's comments. Here's the answer I ended up with.

data = np.genfromtxt(f, delimiter=[15]*8, max_rows=18)
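
To show why the column widths matter, here is a minimal sketch with invented data: three fields of width 6, where a negative value fills its field completely and so touches its neighbour. The widths and values are assumptions for illustration, not the real file layout:

```python
import io
import numpy as np

# Two fixed-width rows, 3 fields of 6 characters each.
# "-12.25" and "-99.99" leave no space before the preceding field.
text = "  1.50-12.25  3.00\n 10.00  2.00-99.99\n"

# Whitespace splitting merges the touching fields into one token.
naive_tokens = text.splitlines()[0].split()

# A list of integers as delimiter tells genfromtxt the width of each field.
data = np.genfromtxt(io.StringIO(text), delimiter=[6] * 3)

print(naive_tokens)
print(data)
```

With the real data the same call uses `delimiter=[15]*8`, and `max_rows` stops the reader at the end of the block.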

More explanation

Since I am reading this from a custom file format, here is how I do the whole thing. An initial pass over the file identifies the positions where the blocks of text start, which gives me a list of 'locations' to seek to. I then use the method above to read the block at each location.

data = np.array([])
r = 18 # rows per block
c = 8 # columns per block
w = 15 # width of a column
with open('mycustomfile.xyz') as f:
    for location in locations:
        f.seek(location)
        data = np.append(data, np.genfromtxt(f, delimiter=[w]*c, max_rows=r))
data = data.reshape((r*len(locations),c))
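
A possible variant of the loop above, sketched with an in-memory file so it can be run as is. The block size, widths, header lines and the way `locations` is computed are all invented for the example. It collects the blocks in a list and stacks them once at the end, which avoids copying the growing array on every np.append call:

```python
import io
import numpy as np

ROWS, COLS, WIDTH = 2, 3, 6  # hypothetical block geometry

# Fake file content: two fixed-width blocks, each preceded by a header line.
header_a = "HEADER A\n"
block_a = "  1.00  2.00  3.00\n  4.00  5.00  6.00\n"
header_b = "HEADER B\n"
block_b = "  7.00  8.00  9.00\n 10.00 11.00 12.00\n"
content = header_a + block_a + header_b + block_b

# Offsets where each block starts (found by a separate pass in the real case).
locations = [len(header_a), len(header_a) + len(block_a) + len(header_b)]

blocks = []
with io.StringIO(content) as f:
    for location in locations:
        f.seek(location)
        # Read exactly one block of fixed-width rows from the current position.
        blocks.append(np.genfromtxt(f, delimiter=[WIDTH] * COLS, max_rows=ROWS))

# Stack all blocks into one (ROWS * number of blocks, COLS) array.
data = np.vstack(blocks)
print(data.shape)
```

Each block is already a 2-D array, so no final reshape is needed after stacking.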


-1

If you want an array with dtype=float you have to convert your string to float beforehand.

import numpy as np

string_list = ["1", "0.1", "1.345e003"]
array = np.array([float(string) for string in string_list])
array.dtype
