4

Given a NumPy array my_arr filled with strings, how do I set the datatype of one of the columns to float? I need it as a NumPy array so that I can use it with my existing code afterwards. See the failed attempt below:

import numpy as np

dat = [['User1', 'Male', '2.2'], ['User2', 'Female', '3.777'], ['User3', 'Unknown', '0.0']]
my_arr = np.array(dat)
print(my_arr)
# [['User1' 'Male' '2.2'], ['User2' 'Female' '3.777'], ['User3' 'Unknown' '0.0']]

my_arr[:,2] = my_arr[:,2].astype(float)
print(my_arr)
# [['User1' 'Male' '2.2'], ['User2' 'Female' '3.777'], ['User3' 'Unknown' '0.0']]
1
  • What kind of strings do you have? Something like "2.3", "7.89" or "myString", "myString2"? And what do you mean by "without success"? What went wrong? Commented Jul 9, 2015 at 21:19

2 Answers

3

You could convert your 2d array into a structured array with a mixed dtype:

In [137]: my_arr
Out[137]: 
array([['User1', 'Male', '2.2'],
       ['User2', 'Female', '3.777'],
       ['User3', 'Unknown', '0.0']], 
      dtype='<U7')

In [138]: dt=np.dtype('U7,U7,f')  # structured (compound) dtype: two 7-character strings and a float32

In [139]: np.array([tuple(row) for row in my_arr], dtype=dt)
Out[139]: 
array([('User1', 'Male', 2.200000047683716),
       ('User2', 'Female', 3.7769999504089355), ('User3', 'Unknown', 0.0)], 
      dtype=[('f0', '<U7'), ('f1', '<U7'), ('f2', '<f4')])

In [140]: _.shape
Out[140]: (3,)

Now it is a 1d array with 3 fields. Instead of accessing columns by number you access fields by name, arr['f0'] etc.

I used [tuple(row) for row in my_arr] because the input to structured arrays has to be a list of tuples. I could have used your dat list, [tuple(row) for row in dat].
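
As a minimal sketch of why the in-place assignment in the question has no effect, and of how the structured array behaves differently (assuming a recent NumPy on Python 3; the names `plain`, `structured` and `numbers` are illustrative and not from the question):

```python
import numpy as np

dat = [['User1', 'Male', '2.2'], ['User2', 'Female', '3.777'], ['User3', 'Unknown', '0.0']]

# A plain 2d array has exactly one dtype, here a fixed-width Unicode string.
plain = np.array(dat)
plain[:, 2] = plain[:, 2].astype(float)  # the floats are cast back to strings on assignment
print(plain.dtype.kind)                  # 'U': still a string array

# A structured array stores one dtype per field instead.
dt = np.dtype('U7,U7,f')
structured = np.array([tuple(row) for row in dat], dtype=dt)

numbers = structured['f2']               # the third field, stored as float32
print(numbers.dtype)
print(structured['f0'])                  # the user names
print(structured[numbers > 1.0]['f0'])   # rows selected by a numeric condition
```

If the existing code only needs the numbers, extracting that one column as its own float array with my_arr[:, 2].astype(float) may be enough.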

2

There may be smarter ways to do this, but I think the following gives the correct output. You can use structured arrays:

import numpy as np
dat = [['User1', 'Male', '2.2'], ['User2', 'Female', '3.777'], ['User3', 'Unknown', '0.0']]

# create data types: two byte strings of length 10 and a float
dt = np.dtype('S10, S10, float')

# convert the inner lists to tuples so that a structured array can be built
dat = [tuple(row) for row in dat]

# convert dat to a structured array
my_arr = np.array(dat, dt)

Output:

array([('User1', 'Male', 2.2), ('User2', 'Female', 3.777),
       ('User3', 'Unknown', 0.0)], 
      dtype=[('f0', 'S10'), ('f1', 'S10'), ('f2', '<f8')])

You can also give names to the columns by doing:

dt = {'names': ['user', 'gender', 'number'], 'formats':['S10', 'S10', 'float']}
my_arr = np.array(dat, dt)  # dat is the list with tuples, see above

The output now is:

array([('User1', 'Male', 2.2), ('User2', 'Female', 3.777),
       ('User3', 'Unknown', 0.0)], 
      dtype=[('user', 'S10'), ('gender', 'S10'), ('number', '<f8')])

You can then access a single column by name, e.g.:

my_arr['number']
array([ 2.2  ,  3.777,  0.   ])

my_arr['user']
array(['User1', 'User2', 'User3'], dtype='|S10')
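
The outputs above come from Python 2, where 'S10' fields hold ordinary strings. On Python 3 the same fields hold bytes, so a Unicode field type may be more convenient. The following is only a sketch under that assumption; the name `named` and the choice of 'U10' are mine, not part of the original answer:

```python
import numpy as np

dat = [('User1', 'Male', '2.2'), ('User2', 'Female', '3.777'), ('User3', 'Unknown', '0.0')]

# 'U10' stores text as str; 'S10' would store bytes on Python 3.
dt = np.dtype([('user', 'U10'), ('gender', 'U10'), ('number', 'f8')])
named = np.array(dat, dtype=dt)

print(named['user'][0])        # a str value, not bytes
print(named['number'].sum())   # numeric operations work on the float field
```

The list-of-pairs form of the dtype is equivalent to the 'names'/'formats' dictionary; it just keeps each name next to its type.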

I would recommend using a pandas DataFrame, which makes it easy to deal with different data types and complex data structures.

For your example:

import pandas as pd
pd.DataFrame(dat, columns=['user', 'gender', 'some number'])

would then simply give you:

    user   gender some number
0  User1     Male         2.2
1  User2   Female       3.777
2  User3  Unknown         0.0
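
Note that the DataFrame constructor keeps the values as given, so the last column still contains strings at this point. A possible way to convert it and get a NumPy array back for the existing code, sketched here assuming pandas 0.24 or later (for `to_numpy`) and with `df` and `values` as illustrative names:

```python
import pandas as pd

dat = [['User1', 'Male', '2.2'], ['User2', 'Female', '3.777'], ['User3', 'Unknown', '0.0']]
df = pd.DataFrame(dat, columns=['user', 'gender', 'some number'])

# The column holds strings until it is converted explicitly.
df['some number'] = df['some number'].astype(float)

# Hand the numeric column back to NumPy-based code as a float64 array.
values = df['some number'].to_numpy()
print(df.dtypes)
print(values)
```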
