I generate a data frame that looks like this (summaryDF):

   accuracy        f1  precision    recall
0     0.494  0.722433   0.722433  0.722433
0     0.290  0.826087   0.826087  0.826087
0     0.274  0.629630   0.629630  0.629630
0     0.278  0.628571   0.628571  0.628571
0     0.288  0.718750   0.718750  0.718750
0     0.740  0.740000   0.740000  0.740000
0     0.698  0.765133   0.765133  0.765133
0     0.582  0.778547   0.778547  0.778547
0     0.682  0.748235   0.748235  0.748235
0     0.574  0.767918   0.767918  0.767918
0     0.398  0.711656   0.711656  0.711656
0     0.530  0.780083   0.780083  0.780083

Because I know what each row represents, I am using this code to set the name of each row (these aren't the actual row names, just placeholders for the sake of argument):

summaryDF = summaryDF.set_index(['A','B','C', 'D','E','F','G','H','I','J','K','L'])

However, I am getting:

level = frame[col].values
  File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 1797, in __getitem__
    return self._getitem_column(key)
  File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/frame.py", line 1804, in _getitem_column
    return self._get_item_cache(key)
  File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/generic.py", line 1084, in _get_item_cache
    values = self._data.get(item)
  File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/internals.py", line 2851, in get
    loc = self.items.get_loc(item)
  File "/Users/me/anaconda/lib/python2.7/site-packages/pandas/core/index.py", line 1572, in get_loc
    return self._engine.get_loc(_values_from_object(key))
  File "pandas/index.pyx", line 134, in pandas.index.IndexEngine.get_loc (pandas/index.c:3824)
  File "pandas/index.pyx", line 154, in pandas.index.IndexEngine.get_loc (pandas/index.c:3704)
  File "pandas/hashtable.pyx", line 686, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12280)
  File "pandas/hashtable.pyx", line 694, in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:12231)
KeyError: 'A'

I have no idea what I am doing wrong and have researched far and wide. Any ideas?

2 Answers

I guess you and @jezrael misunderstood an example from the pandas docs:

df.set_index(['A', 'B'])

A and B are column names / labels in this example:

In [55]: df = pd.DataFrame(np.random.randint(0, 10, (5,4)), columns=list('ABCD'))

In [56]: df
Out[56]:
   A  B  C  D
0  6  9  7  4
1  5  1  3  4
2  4  4  0  5
3  9  0  9  8
4  6  4  5  7

In [57]: df.set_index(['A','B'])
Out[57]:
     C  D
A B
6 9  7  4
5 1  3  4
4 4  0  5
9 0  9  8
6 4  5  7
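The same point as a minimal sketch (the small frame and its values are made up for illustration, loosely modelled on summaryDF): strings passed to set_index are looked up as column labels, so labels that are not columns raise a KeyError.

```python
import pandas as pd

# Hypothetical two-column frame, loosely modelled on summaryDF
df = pd.DataFrame({
    "accuracy": [0.494, 0.290, 0.274],
    "f1": [0.722433, 0.826087, 0.629630],
})

# An existing column label works: that column becomes the index
by_accuracy = df.set_index("accuracy")

# 'A', 'B', 'C' are not columns, so the lookup fails with KeyError
try:
    df.set_index(["A", "B", "C"])
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(raised_key_error)
```

This is the situation in the question: the list of row names is being interpreted as a list of column names.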

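The documentation says the argument should be a list of column labels or arrays. A flat list of strings is therefore read as column labels, which is why you get KeyError: 'A' when no such column exists.

To pass the new index values as an array instead, wrap the list in another list, so you were looking for: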

In [58]: df.set_index([['A','B','C','D','E']])
Out[58]:
   A  B  C  D
A  6  9  7  4
B  5  1  3  4
C  4  4  0  5
D  9  0  9  8
E  6  4  5  7

However, as @jezrael has suggested, df.index = ['A','B',...] is a faster and more idiomatic method.

You need to assign the list to summaryDF.index; this works when the length of the list is the same as the length of the DataFrame:

summaryDF.index = ['A','B','C', 'D','E','F','G','H','I','J','K','L']
print (summaryDF)
   accuracy        f1  precision    recall
A     0.494  0.722433   0.722433  0.722433
B     0.290  0.826087   0.826087  0.826087
C     0.274  0.629630   0.629630  0.629630
D     0.278  0.628571   0.628571  0.628571
E     0.288  0.718750   0.718750  0.718750
F     0.740  0.740000   0.740000  0.740000
G     0.698  0.765133   0.765133  0.765133
H     0.582  0.778547   0.778547  0.778547
I     0.682  0.748235   0.748235  0.748235
J     0.574  0.767918   0.767918  0.767918
K     0.398  0.711656   0.711656  0.711656
L     0.530  0.780083   0.780083  0.780083

print (summaryDF.index)
Index(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L'], dtype='object')
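
The length requirement can be checked with a small sketch. The three-row frame below is a made-up stand-in for summaryDF; assigning a list of the wrong length raises a ValueError rather than silently truncating or padding.

```python
import pandas as pd

# Hypothetical three-row stand-in for summaryDF
summary = pd.DataFrame({
    "accuracy": [0.494, 0.290, 0.274],
    "f1": [0.722433, 0.826087, 0.629630],
})

# Same length as the frame: the assignment replaces the row labels
summary.index = ["A", "B", "C"]
labels_after_assignment = list(summary.index)

# Different length: pandas refuses the assignment
try:
    summary.index = ["A", "B"]
    length_mismatch = False
except ValueError:
    length_mismatch = True

print(labels_after_assignment, length_mismatch)
```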

Timings:

In [117]: %timeit summaryDF.index = ['A','B','C', 'D','E','F','G','H','I','J','K','L']
The slowest run took 6.86 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 76.2 µs per loop

In [118]: %timeit summaryDF.set_index(pd.Index(['A','B','C', 'D','E','F','G','H','I','J','K','L']))
The slowest run took 6.77 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 227 µs per loop

Another solution is to convert the list to a NumPy array:

summaryDF.set_index(np.array(['A','B','C', 'D','E','F','G','H','I','J','K','L']), inplace=True)
print (summaryDF)
   accuracy        f1  precision    recall
A     0.494  0.722433   0.722433  0.722433
B     0.290  0.826087   0.826087  0.826087
C     0.274  0.629630   0.629630  0.629630
D     0.278  0.628571   0.628571  0.628571
E     0.288  0.718750   0.718750  0.718750
F     0.740  0.740000   0.740000  0.740000
G     0.698  0.765133   0.765133  0.765133
H     0.582  0.778547   0.778547  0.778547
I     0.682  0.748235   0.748235  0.748235
J     0.574  0.767918   0.767918  0.767918
K     0.398  0.711656   0.711656  0.711656
L     0.530  0.780083   0.780083  0.780083
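
One detail worth noting, shown here as a sketch on a made-up three-row frame: without inplace=True, set_index returns a new DataFrame and leaves the original untouched, so the result has to be assigned back (as the question's code does).

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for summaryDF
summary = pd.DataFrame({
    "accuracy": [0.494, 0.290, 0.274],
    "f1": [0.722433, 0.826087, 0.629630],
})

# An array of the same length as the frame is used as the new index values
relabeled = summary.set_index(np.array(["A", "B", "C"]))

# The original frame still has its default integer index
print(list(relabeled.index))
print(list(summary.index))
```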

1 Comment

Thanks @jezrael - The pandas docs really did suggest that my method was possible from their example they give - pandas.pydata.org/pandas-docs/stable/generated/… - I hope someone changes this!
