New to Python and Numpy and MatPlotLib.

I am trying to create a 2D Numpy array from a CSV of various data types, but I will treat them all as strings. The killer is that I need to be able to access them with tuple indices, like: [:,5] to get the 5th column, or [5] to get the 5th row.

Is there any way to do this?

It seems that this is a limitation of Numpy due to the memory-access calculations:

import numpy as np
dataSet = np.loadtxt(open("adult.data.csv", "rb"), delimiter=" ,")
print(dataSet[:, 4])  # <--- results in IndexError: Invalid Index

I have also tried genfromtxt, dtype = str and dtype = "a16", as well as dtype = object. Nothing works. Either I can load the data but it does not have column access, or I can't load the data at all.

  • Your delimiter is " ,". Is that actually what separates elements in each row of your input file? A space, then a comma? Commented Jan 22, 2016 at 20:47
  • Yes there is also a space. I could just use a comma, but it doesn't matter. Edit: Or at least I don't think it matters.... Commented Jan 22, 2016 at 20:48
  • What are dataSet.shape and dataSet.dtype? Commented Jan 22, 2016 at 21:02
  • You need to examine dataSet. Print some rows, print the shape and dtype. Don't jump into indexing before you know what you have got. A sample of the csv file might also help us diagnose your problem. Commented Jan 22, 2016 at 21:06
  • The rows are very clean and all look like this (about 32,000 lines): 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K Commented Jan 22, 2016 at 21:12

1 Answer

Simulate your file from the sample line in the comments, replicated several times (i.e. one string per row of the file):

In [8]: txt = b" 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K"
In [9]: txt = [txt for _ in range(5)]

In [10]: txt
Out[10]: 
[b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
 b' 39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K']

Load with genfromtxt, with delimiter. Let it choose the best dtype per column:

In [12]: A=np.genfromtxt(txt, delimiter=',',dtype=None)
In [13]: A
Out[13]: 
array([ (39, b' State-gov', 77516, b' Bachelors', 13, b' Never-married', b' Adm-clerical', b' Not-in-family', b' White', b' Male', 2174, 0, 40, b' United-States', b' <=50K'),
       (39, b' State-gov', 77516, b' Bachelors', 13, b' Never-married', b' Adm-clerical', b' Not-in-family', b' White', b' Male', 2174, 0, 40, b' United-States', b' <=50K'),...], 
      dtype=[('f0', '<i4'), ('f1', 'S10'), ('f2', '<i4'), ('f3', 'S10'), ('f4', '<i4'), ('f5', 'S14'), ('f6', 'S13'), ('f7', 'S14'), ('f8', 'S6'), ('f9', 'S5'), ('f10', '<i4'), ('f11', '<i4'), ('f12', '<i4'), ('f13', 'S14'), ('f14', 'S6')])

The result is a 5-element 1d array with a compound dtype:

In [14]: A.shape
Out[14]: (5,)
In [15]: A.dtype
Out[15]: dtype([('f0', '<i4'), ('f1', 'S10'), ('f2', '<i4'),
    ('f3', 'S10'), ('f4', '<i4'), ....])

Access a 'column' with a field name (not a column number):

In [16]: A['f4']
Out[16]: array([13, 13, 13, 13, 13])
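As a rough sketch of what else field access allows, the snippet below rebuilds the same five-row sample and pulls out a field, a record, and a subset of fields. It assumes a NumPy version recent enough for genfromtxt to accept the encoding argument; the variable names are only illustrative, and the default field names f0 to f14 come from genfromtxt because no names were supplied.

```python
import io
import numpy as np

# One sample row from the question, repeated to stand in for the real file.
row = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
       "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
sample = io.StringIO("\n".join([row] * 5))

# encoding is passed explicitly so text fields come back as str rather than bytes.
A = np.genfromtxt(sample, delimiter=",", dtype=None, encoding="utf-8")

field_names = A.dtype.names       # default names: 'f0', 'f1', ..., 'f14'
education_num = A["f4"]           # one field across all records
first_record = A[0]               # one record holding mixed types
first_workclass = A[0]["f1"]      # pick the record first, then the field
subset = A[["f0", "f4"]]          # several fields at once

print(A.shape, len(field_names))
print(education_num)
```

Row indexing works as usual on the 1d array; only the second axis is replaced by field names.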

Or load as dtype=str:

In [17]: A=np.genfromtxt(txt, delimiter=',',dtype=str)
In [18]: A
Out[18]: 
array([['39', ' State-gov', ' 77516', ' Bachelors', ' 13',
        ' Never-married', ' Adm-clerical', ' Not-in-family', ' White',
        ' Male', ' 2174', ' 0', ' 40', ' United-States', ' <=50K'],
        ...
        ' Male', ' 2174', ' 0', ' 40', ' United-States', ' <=50K']], 
      dtype='<U14')
In [19]: A.dtype
Out[19]: dtype('<U14')
In [20]: A.shape
Out[20]: (5, 15)
In [21]: A[:,4]
Out[21]: 
array([' 13', ' 13', ' 13', ' 13', ' 13'], 
      dtype='<U14')

Now it is a 15-column 2d array that can be indexed by column number.
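Since every value is text in this form, the leading spaces left over from the ", " separators stay in the strings. One possible cleanup, sketched here under the assumption that the whole array should be treated as strings, is to strip the array with np.char.strip and convert individual columns only when numbers are needed. The variable names are illustrative.

```python
import io
import numpy as np

# One sample row from the question, repeated to stand in for the real file.
row = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
       "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
sample = io.StringIO("\n".join([row] * 5))

A = np.genfromtxt(sample, delimiter=",", dtype=str, encoding="utf-8")

clean = np.char.strip(A)                  # drop the spaces that followed each comma
education_num = clean[:, 4].astype(int)   # convert one text column to integers
third_row = clean[2]                      # a single row, still all strings

print(clean.shape)
print(education_num)
```

genfromtxt also has an autostrip=True option that strips each value while loading, which avoids the separate np.char.strip step.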

With the wrong delimiter, it loads each row as a single column:

In [24]: A=np.genfromtxt(txt, delimiter=' ,',dtype=str)
In [25]: A
Out[25]: 
array([ '39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K',
      ...], 
      dtype='<U127')
In [26]: A.shape
Out[26]: (5,)

The result is a 1d array with a long string dtype, so there is no second axis for A[:,4] to index.

A CSV file might be loaded in various ways, some intentional, some not. You have to look at the results and try to understand them before blindly trying to index columns.
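To make that check concrete, here is a small sketch that loads the same sample with both delimiters and looks at shape and dtype before indexing. The helpers load_as_strings and column are hypothetical names made up for this illustration, not part of NumPy.

```python
import io
import numpy as np

# One sample row from the question, repeated to stand in for the real file.
row = ("39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, "
       "Not-in-family, White, Male, 2174, 0, 40, United-States, <=50K")
text = "\n".join([row] * 5)


def load_as_strings(text, delimiter):
    """Hypothetical helper: parse CSV text into an array of strings."""
    return np.genfromtxt(io.StringIO(text), delimiter=delimiter,
                         dtype=str, encoding="utf-8")


def column(arr, index):
    """Hypothetical helper: return one column, or say why that is impossible."""
    if arr.ndim != 2:
        raise ValueError(
            f"expected a 2d array, got shape {arr.shape} with dtype {arr.dtype}")
    return arr[:, index]


wrong = load_as_strings(text, " ,")   # the delimiter from the question
right = load_as_strings(text, ",")    # the delimiter the file actually uses

# Inspect before indexing: the shapes alone show which load is usable.
for name, arr in (("wrong", wrong), ("right", right)):
    print(name, arr.shape, arr.dtype)

print(column(right, 4))
```

Calling column(wrong, 4) raises a ValueError that reports the 1d shape, which is a more direct hint than the IndexError from the question.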

1 Comment

WOW! Thanks so much for all that work. Through it I was able to solve it and everything is happy now. :)
