
Background

The scikit-learn API is based on stateful objects that take 2D numpy arrays as input, compute a transformation (storing the fitted state internally), and later apply it to other 2D arrays. For example:

import numpy as np
import sklearn.preprocessing

arr = np.arange(4).reshape(2, 2)
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(arr)        # updates the scaler's internal state (returns the scaler itself)
scaler.transform(arr)  # returns a transformed copy of arr

My Question

I want to apply a transformation to data stored in a pandas DataFrame, and put the transformed data back into the same DataFrame.

The problem is that df.apply(scaler.transform) feeds the data to the scaler column by column (as 1D Series), while scaler.transform expects a 2D array.
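The column-by-column behavior is easy to verify with a plain function in place of the scaler (a pandas-only sketch; the DataFrame here is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(4).reshape(2, 2), columns=['a', 'b'])

# df.apply passes each column to the function as a 1D Series,
# not the whole frame as a 2D array.
ndims = df.apply(lambda col: col.ndim)
print(ndims.tolist())    # [1, 1] -- each column arrives as a 1D Series
print(df.values.ndim)    # 2     -- but scaler.transform expects a 2D array
```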

Following the answers here and here, I'm currently doing:

transformed_array = scaler.transform(df.values)
transformed_df = pd.DataFrame(data=transformed_array, index=df.index, columns=df.columns)

But that seems rather clunky and inefficient. Also, I suspect there's a corner case where I'll lose the DataFrame's metadata.
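The boilerplate can at least be factored into a small helper (transform_df is a hypothetical name; this sketch preserves only the index and column labels, not other attributes):

```python
import pandas as pd

def transform_df(transformer, df):
    """Apply a fitted scikit-learn-style transformer to a DataFrame,
    rebuilding the index and column labels on the result."""
    return pd.DataFrame(transformer.transform(df.values),
                        index=df.index, columns=df.columns)

# Works with any object exposing .transform; a trivial stand-in here:
class Doubler:
    def transform(self, arr):
        return arr * 2

df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=['a', 'b'])
out = transform_df(Doubler(), df)
print(out)
```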

Is there a better way?

2 Answers


You can use iloc[:, :].

According to the documentation:

Pandas provides a suite of methods in order to get purely integer based indexing. The semantics follow closely python and numpy slicing. These are 0-based indexing. When slicing, the start bound is included, while the upper bound is excluded. Note that setting works as well.

Example:

df = pd.DataFrame([[1, 2.], [3, 4.]], columns=['a', 'b'])
df2 = pd.DataFrame([[3, 4.], [5, 6.]], columns=['c', 'd'])

df.iloc[:, :] = df2.values
print(df)
     a    b
0  3.0  4.0
1  5.0  6.0

So in your case, it will be:

df.iloc[:, :] = scaler.transform(df.values) # on an already fitted scaler
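A quick pandas-only check that the iloc assignment really does mutate the frame in place and keeps its labels (illustrative data, with a plain numpy array standing in for the scaler's output):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]],
                  index=['r1', 'r2'], columns=['a', 'b'])
before_id = id(df)

df.iloc[:, :] = np.array([[10.0, 20.0], [30.0, 40.0]])

assert id(df) == before_id             # same object, modified in place
assert list(df.columns) == ['a', 'b']  # labels untouched
assert list(df.index) == ['r1', 'r2']
print(df)
```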

4 Comments

Thanks, do you know if assigning like this is more/less efficient than using the constructor? also is iloc better than loc in this sense?
@OmerB No, I am sorry, I don't know about performance. But .loc cannot be used for this, because that's for label-based indexing. With .loc you cannot specify positional indices of entries.
But I can do .loc[:,:] or even just df[:]... They might all be equivalent, but I'll wait to see if someone has a definitive answer for that...
@OmerB they are not equivalent performance-wise: stackoverflow.com/a/45983830/4016674

Consider the following demo:

In [198]: df = (pd.DataFrame(np.random.randint(10**5, size=(5,3)), columns=list('abc'))
                  .assign(d=list('abcde')))

In [199]: df
Out[199]:
       a      b      c  d
0  17821  80092  11803  a
1  91198  19663  78665  b
2  77674  46347  72550  c
3  67390  63699  16347  d
4  50445  31346  95608  e

In [200]: cols = ['a','b','c']

In [201]: df[cols] = scaler.fit_transform(df[cols])

In [202]: df
Out[202]:
          a         b         c  d
0 -1.701325  1.466854 -1.259806  a
1  1.196186 -1.315108  0.690414  b
2  0.662151 -0.086660  0.512053  c
3  0.256056  0.712172 -1.127267  d
4 -0.413068 -0.777259  1.184605  e
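The same pattern runs without scikit-learn if you substitute a plain NumPy standardization for fit_transform (a sketch; the manual z-score here stands in for the scaler, and StandardScaler likewise uses the population standard deviation):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                   'b': [10.0, 20.0, 30.0],
                   'd': list('xyz')})

cols = ['a', 'b']
vals = df[cols].values

# Column-wise z-score, standing in for scaler.fit_transform
df[cols] = (vals - vals.mean(axis=0)) / vals.std(axis=0)

print(df)  # 'a' and 'b' scaled; string column 'd' left untouched
```

Assigning to df[cols] replaces only the selected columns, so the non-numeric column survives unchanged.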

