2D Numpy Array of Strings WITH column access

Question

New to Python and Numpy and MatPlotLib.

I am trying to create a 2D Numpy array from a CSV of various data types, but I will treat them all as strings. The killer is that I need to be able to access them with tuple indices, like: [:,5] to get the 5th column, or [5] to get the 5th row.

Is there any way to do this?

It seems that this is a limitation of Numpy due to the memory-access calculations:

dataSet = np.loadtxt(open("adult.data.csv", "rb"), delimiter=" ,")
print(dataSet[:, 4])  # <--- results in IndexError: Invalid Index

I have also tried loadfromgen, dtype = str and dtype = "a16", as well as dtype = object. Nothing works. I can either load the data and it does not have column access, or I can't load the data at all.

Your delimiter is " ,". Is that actually what separates elements in each row of your input file? A space, then a comma?
– user2357112, Commented Jan 22, 2016 at 20:47
Yes there is also a space. I could just use a comma, but it doesn't matter. Edit: Or at least I don't think it matters....
– unwrittenrainbow, Commented Jan 22, 2016 at 20:48
You need to examine dataSet. Print some rows, print the shape and dtype. Don't jump into indexing before you know what you have got. A sample of the csv file might also help us diagnose your problem.
– hpaulj, Commented Jan 22, 2016 at 21:06
The rows are very clean, all look like this. About 32,000 lines. 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K
– unwrittenrainbow, Commented Jan 22, 2016 at 21:12

hpaulj · Accepted Answer · 2016-01-23 02:40:46Z

Simulate your file using the sample line from the comment, replicated several times (i.e. one string per row of the file):

In [8]: txt = b" 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K"
In [9]: txt = [txt for _ in range(5)]

In [10]: txt
Out[10]: 
[b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K']

Load with genfromtxt, with delimiter. Let it choose the best dtype per column:

In [12]: A=np.genfromtxt(txt, delimiter=',',dtype=None)
In [13]: A
Out[13]: 
array([ (39, b' State-gov', 77516, b' Bachelors', 13, b' Never-married', b' Adm-clerical', b' Not-in-family', b' White', b' Male', 2174, 0, 40, b' United-States', b' <=50K'),
       (39, b' State-gov', 77516, b' Bachelors', 13, b' Never-married', b' Adm-clerical', b' Not-in-family', b' White', b' Male', 2174, 0, 40, b' United-States', b' <=50K'),...], 
      dtype=[('f0', '<i4'), ('f1', 'S10'), ('f2', '<i4'), ('f3', 'S10'), ('f4', '<i4'), ('f5', 'S14'), ('f6', 'S13'), ('f7', 'S14'), ('f8', 'S6'), ('f9', 'S5'), ('f10', '<i4'), ('f11', '<i4'), ('f12', '<i4'), ('f13', 'S14'), ('f14', 'S6')])

The result is a 5-element 1D array with a compound dtype:

In [14]: A.shape
Out[14]: (5,)
In [15]: A.dtype
Out[15]: dtype([('f0', '<i4'), ('f1', 'S10'), ('f2', '<i4'),
    ('f3', 'S10'), ('f4', '<i4'), ....])

Access a 'column' by field name (not by column number):

In [16]: A['f4']
Out[16]: array([13, 13, 13, 13, 13])
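
If you still want to think in column numbers with a structured array, one option is to look the field name up by position. The following is only a sketch: `column` is a hypothetical helper, not a NumPy function, and `encoding="utf-8"` is an assumption added here so that text fields come back as `str` rather than bytes on recent NumPy versions.

```python
import numpy as np

# One sample row taken from the question's comment, repeated to simulate a file.
row = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
       "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
lines = [row] * 3

# dtype=None lets genfromtxt pick a type per column, giving a structured array.
A = np.genfromtxt(lines, delimiter=",", dtype=None, encoding="utf-8")


def column(arr, i):
    """Return the i-th field of a structured array (hypothetical helper)."""
    return arr[arr.dtype.names[i]]


education_num = column(A, 4)   # equivalent to A['f4']
first_row = A[0]               # rows are still indexed by number

print(A.shape)                 # 1D: one record per row of the file
print(education_num)
print(first_row[1])            # a single record can be indexed by position
```

Note that `A[:, 4]` still fails here, because the array is 1D; only the field lookup changes.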

Or load as dtype=str:

In [17]: A=np.genfromtxt(txt, delimiter=',',dtype=str)
In [18]: A
Out[18]: 
array([['39', ' State-gov', ' 77516', ' Bachelors', ' 13',
        ' Never-married', ' Adm-clerical', ' Not-in-family', ' White',
        ' Male', ' 2174', ' 0', ' 40', ' United-States', ' <=50K'],
        ...
        ' Male', ' 2174', ' 0', ' 40', ' United-States', ' <=50K']], 
      dtype='<U14')
In [19]: A.dtype
Out[19]: dtype('<U14')
In [20]: A.shape
Out[20]: (5, 15)
In [21]: A[:,4]
Out[21]: 
array([' 13', ' 13', ' 13', ' 13', ' 13'], 
      dtype='<U14')

Now it is a 15-column 2D array that can be indexed by column number.
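
The values above keep the space that follows each comma. As a sketch of one way to get rid of it, `genfromtxt` accepts `autostrip=True`, and `np.char.strip` can clean an array that is already loaded. The variable names and the explicit `encoding="utf-8"` are assumptions made for this example.

```python
import numpy as np

# One sample row taken from the question's comment, repeated to simulate a file.
row = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
       "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
lines = [row] * 3

# Option 1: strip whitespace from every field while loading.
A = np.genfromtxt(lines, delimiter=",", dtype=str,
                  autostrip=True, encoding="utf-8")

# Option 2: load with the padding, then strip the whole array afterwards.
padded = np.genfromtxt(lines, delimiter=",", dtype=str, encoding="utf-8")
stripped = np.char.strip(padded)

print(A.shape)      # 2D, so tuple indexing works
print(A[:, 4])      # the column at index 4
print(A[1])         # the row at index 1
```

Either way the result is a plain 2D string array, so both `A[:, 4]` and `A[1]` work as the question asked.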

With the wrong delimiter, it loads one column per row:

In [24]: A=np.genfromtxt(txt, delimiter=' ,',dtype=str)
In [25]: A
Out[25]: 
array([ '39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
      ...], 
      dtype='<U127')
In [26]: A.shape
Out[26]: (5,)

This is a 1D array with a long string dtype.

A CSV file might be loaded in various ways, some intentional, some not. You have to look at the results and try to understand them before blindly trying to index columns.
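
That inspection step can be turned into a small check that runs before any column indexing. This is a minimal sketch: `check_2d` is a hypothetical helper written for this example, and the explicit `encoding="utf-8"` is an assumption.

```python
import numpy as np


def check_2d(arr, name="array"):
    """Print shape, ndim and dtype; return True if arr[:, i] would be valid."""
    print(f"{name}: shape={arr.shape}, ndim={arr.ndim}, dtype={arr.dtype}")
    return arr.ndim == 2


# One sample row taken from the question's comment, repeated to simulate a file.
row = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
       "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
lines = [row] * 3

# Correct delimiter: one field per comma-separated value.
good = np.genfromtxt(lines, delimiter=",", dtype=str, encoding="utf-8")
# Delimiter from the question: " ," never occurs, so each line is one field.
bad = np.genfromtxt(lines, delimiter=" ,", dtype=str, encoding="utf-8")

if check_2d(good, "good"):
    print(good[:, 4])
if not check_2d(bad, "bad"):
    print("not 2D: check the delimiter before indexing columns")
```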

WOW! Thanks so much for all that work. Through it I was able to solve it and everything is happy now. :)
