
Background

The scikit-learn API is based on stateful objects that take 2D numpy arrays as input, compute a transformation (storing the fitted state internally), and later apply it to other 2D arrays. For example:

import numpy as np
import sklearn.preprocessing

arr = np.arange(4).reshape(2, 2)
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(arr)        # updates the scaler's internal state (returns the scaler itself)
scaler.transform(arr)  # returns a transformed copy of arr

My Question

I want to apply a transformation to data stored in a pandas DataFrame, and put the transformed data back into the same DataFrame.

The problem is that df.apply(scaler.transform) feeds the data to the scaler column by column (as 1D Series), while scaler.transform expects a 2D array.
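The column-by-column behavior is easy to verify with a plain function in place of the scaler (a pandas-only sketch; the DataFrame here is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4).reshape(2, 2), columns=['a', 'b'])

# df.apply passes each column to the function as a 1D Series,
# not the whole frame as a 2D array.
ndims = df.apply(lambda col: col.ndim)
print(ndims.tolist())    # [1, 1] -- each column arrives as a 1D Series
print(df.values.ndim)    # 2     -- but scaler.transform expects a 2D array
```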

Following the answers here and here, I'm currently doing:

transformed_array = scaler.transform(df.values)
transformed_df = pd.DataFrame(data=transformed_array, index=df.index, columns=df.columns)

But that seems rather clunky and inefficient. Also, I suspect there's a corner case where I'll lose the DataFrame's metadata.
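The boilerplate can at least be factored into a small helper (transform_df is a hypothetical name; this sketch preserves only the index and column labels, not other attributes):

```python
import pandas as pd

def transform_df(transformer, df):
    """Apply a fitted scikit-learn-style transformer to a DataFrame,
    rebuilding the index and column labels on the result."""
    return pd.DataFrame(transformer.transform(df.values),
                        index=df.index, columns=df.columns)

# Works with any object exposing .transform; a trivial stand-in here:
class Doubler:
    def transform(self, arr):
        return arr * 2

df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=['a', 'b'])
out = transform_df(Doubler(), df)
print(out)
```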

Is there a better way?

2 Answers


You can use iloc[:, :].

According to the documentation:

Pandas provides a suite of methods in order to get purely integer based indexing. The semantics follow closely python and numpy slicing. These are 0-based indexing. When slicing, the start bound is included, while the upper bound is excluded. Note that setting works as well.

Example:

df = pd.DataFrame([[1, 2.], [3, 4.]], columns=['a', 'b'])
df2 = pd.DataFrame([[3, 4.], [5, 6.]], columns=['c', 'd'])

df.iloc[:, :] = df2.values
print(df)
     a    b
0  3.0  4.0
1  5.0  6.0

So in your case, it will be:

df.iloc[:, :] = scaler.transform(df.values) # on an already fitted scaler
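A quick pandas-only check that the iloc assignment really does mutate the frame in place and keeps its labels (illustrative data, with a plain numpy array standing in for the scaler's output):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]],
                  index=['r1', 'r2'], columns=['a', 'b'])
before_id = id(df)

df.iloc[:, :] = np.array([[10.0, 20.0], [30.0, 40.0]])

assert id(df) == before_id             # same object, modified in place
assert list(df.columns) == ['a', 'b']  # labels untouched
assert list(df.index) == ['r1', 'r2']
print(df)
```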

4 Comments

Thanks, do you know if assigning like this is more/less efficient than using the constructor? also is iloc better than loc in this sense?
@OmerB No, I am sorry, I don't know about performance. But .loc cannot be used for this, because that's for label-based indexing. With .loc you cannot specify positional indices of entries.
But I can do .loc[:,:] or even just df[:]... They might all be equivalent, but I'll wait to see if someone has a definitive answer for that...
@OmerB they are not equivalent performance-wise: stackoverflow.com/a/45983830/4016674

Consider the following demo:

In [198]: df = (pd.DataFrame(np.random.randint(10**5, size=(5,3)), columns=list('abc'))
                  .assign(d=list('abcde')))

In [199]: df
Out[199]:
       a      b      c  d
0  17821  80092  11803  a
1  91198  19663  78665  b
2  77674  46347  72550  c
3  67390  63699  16347  d
4  50445  31346  95608  e

In [200]: cols = ['a','b','c']

In [201]: df[cols] = scaler.fit_transform(df[cols])

In [202]: df
Out[202]:
          a         b         c  d
0 -1.701325  1.466854 -1.259806  a
1  1.196186 -1.315108  0.690414  b
2  0.662151 -0.086660  0.512053  c
3  0.256056  0.712172 -1.127267  d
4 -0.413068 -0.777259  1.184605  e
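The same pattern runs without scikit-learn if you substitute a plain NumPy standardization for fit_transform (a sketch; the manual z-score here stands in for the scaler, and StandardScaler likewise uses the population standard deviation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                   'b': [10.0, 20.0, 30.0],
                   'd': list('xyz')})

cols = ['a', 'b']
vals = df[cols].values

# Column-wise z-score, standing in for scaler.fit_transform
df[cols] = (vals - vals.mean(axis=0)) / vals.std(axis=0)

print(df)  # 'a' and 'b' scaled; string column 'd' left untouched
```

Assigning to df[cols] replaces only the selected columns, so the non-numeric column survives unchanged.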

