I have a transformer that calculates the percentage of values per group. I initially wrote it with pandas because column names are easier to handle there. However, I now need to integrate it into an sklearn pipeline.

How can I convert my transformer to support NumPy arrays from an sklearn pipeline instead of pandas data frames? The problem is that self.colname can't be used with NumPy arrays, and I think the grouping needs to be performed differently.

How can I implement persistence for such a transformer? The fitted weights need to be loadable from disk in order to deploy the transformer in a pipeline.

import pandas as pd
from sklearn.base import TransformerMixin

class PercentageTransformer(TransformerMixin):
    def __init__(self, colname,typePercentage='totalTarget', _target='TARGET', _dropOriginal=True):
        self.colname = colname
        self._target = _target
        self._dropOriginal = _dropOriginal
        self.typePercentage = typePercentage

    def fit(self, X, y, *_):
        original = pd.concat([y,X], axis=1)
        grouped = original.groupby([self.colname, self._target]).size()
        if self.typePercentage == 'totalTarget':
            df = grouped / original[self._target].sum()
        else:
            df = (grouped / grouped.groupby(level=0).sum())

        if self.typePercentage == 'totalTarget':
            nameCol = "pre_" + self.colname
        else:
            nameCol = "pre2_" + self.colname
        self.nameCol = nameCol
        grouped = df.reset_index(name=nameCol)
        groupedOnly = grouped[grouped[self._target] == 1]
        groupedOnly = groupedOnly.drop(columns=self._target)

        self.result =  groupedOnly
        return self

    def transform(self, dataF):
        mergedThing = pd.merge(dataF, self.result, on=self.colname, how='left')
        # categories not seen during fit have no match and get 0
        mergedThing[self.nameCol] = mergedThing[self.nameCol].fillna(0)
        if self._dropOriginal:
            mergedThing = mergedThing.drop(columns=self.colname)
        return mergedThing

It would be used in a pipeline like this:

from sklearn.pipeline import FeatureUnion, Pipeline

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('continuous', Pipeline([
            ('extract', ColumnExtractor(CONTINUOUS_FIELDS)),
        ])),
        ('factors', Pipeline([
            ('extract', ColumnExtractor(FACTOR_FIELDS)),
            # using labelencoding and all bias
            ('bias',  PercentageAllTransformer(FACTOR_FIELDS, _dropOriginal=True, typePercentage='totalTarget')),
        ]))
    ], n_jobs=-1)),
    ('estimator', estimator)
])

The pipeline will be fitted with X and y, both of which are data frames. I am unsure whether X.as_matrix would help.

  • pandas objects are wrappers around NumPy objects. There is no pandas array; I believe you mean Series? Anyway, maybe your problem would be solved simply by returning self.values instead of self. Commented Oct 23, 2016 at 17:30
  • As for persistence, there are several ways to go about it. Generally, object serialization in Python will use the pickle module. Commented Oct 23, 2016 at 17:32
  • Indeed, I meant pandas data frames. The point, if I understand it correctly, is that the result of original.groupby([self.colname, self._target]) is no longer a data frame but a NumPy array, i.e. the column names no longer work, so self.values does not seem to be enough. Commented Oct 23, 2016 at 17:33
  • No, groupby returns a groupby object, which is usually used to generate a new DataFrame. You can't access self.colname and self._target as you normally would because, by default, they are used as the index of the new DataFrame. Pass as_index=False to groupby to retain your grouping columns as columns. Commented Oct 23, 2016 at 17:39
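
To illustrate the last comment, here is a minimal sketch of the difference `as_index=False` makes. The toy data frame and the column name `cat` are made up for the example, and it assumes pandas 1.1 or later, where `size()` on an `as_index=False` groupby returns a DataFrame with a `size` column.

```python
import pandas as pd

# Toy data, not from the question: one label-encoded category column and a binary target
df = pd.DataFrame({"cat": [0, 0, 1, 1, 1], "TARGET": [1, 0, 1, 1, 0]})

# Default: the grouping keys become a MultiIndex on the resulting Series
as_index = df.groupby(["cat", "TARGET"]).size()

# as_index=False: the grouping keys stay ordinary columns of a DataFrame
as_columns = df.groupby(["cat", "TARGET"], as_index=False).size()

print(as_index.index.names)
print(list(as_columns.columns))
```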

1 Answer

  • Converting Things Back and Forth

pandas has a .to_records() method and, as you mentioned, an .as_matrix() method (deprecated in pandas 0.23 and removed in 1.0 in favor of .values and .to_numpy()). The .to_records() method keeps your column names, because NumPy supports named fields through structured (record) arrays.
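
As a rough sketch of both routes: the snippet below first converts a toy data frame with `.to_records()` and `.to_numpy()`, then shows one way the 'totalTarget' percentage could be computed on a plain array by addressing the column by position and grouping with `np.unique`. The class name `NumpyPercentageTransformer`, the `col_index` parameter, and the sample data are assumptions made for illustration, not part of the question's code, and the sketch assumes a numeric (label-encoded) category column.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

# Toy data (hypothetical): a label-encoded category and one continuous feature
df = pd.DataFrame({"cat": [0, 1, 0, 2], "x": [1.5, 2.5, 3.5, 4.5]})
y = np.array([1, 0, 0, 1])

rec = df.to_records(index=False)  # structured array: rec["cat"] still works
X = df.to_numpy()                 # plain 2-D float array: column names are gone


class NumpyPercentageTransformer(BaseEstimator, TransformerMixin):
    """Share of all positive targets that falls into each category (sketch)."""

    def __init__(self, col_index=0, drop_original=True):
        self.col_index = col_index          # column position instead of a name
        self.drop_original = drop_original

    def fit(self, X, y):
        X = np.asarray(X)
        y = np.asarray(y).ravel()
        # np.unique returns sorted group labels and each row's group position
        labels, inverse = np.unique(X[:, self.col_index], return_inverse=True)
        positives = np.bincount(inverse, weights=y, minlength=len(labels))
        self.labels_ = labels
        self.shares_ = positives / y.sum()
        return self

    def transform(self, X):
        X = np.asarray(X)
        col = X[:, self.col_index]
        # Look up each value among the fitted labels; unseen values get 0
        pos = np.clip(np.searchsorted(self.labels_, col), 0, len(self.labels_) - 1)
        known = self.labels_[pos] == col
        shares = np.where(known, self.shares_[pos], 0.0)
        if self.drop_original:
            X = np.delete(X, self.col_index, axis=1)
        return np.column_stack([X, shares])


transformer = NumpyPercentageTransformer(col_index=0).fit(X, y)
out = transformer.transform(X)
unseen = transformer.transform(np.array([[5.0, 8.0]]))
```

The fitted state lives in two plain arrays (`labels_` and `shares_`), which keeps the transformer independent of pandas at transform time.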

  • Persistence

pandas has a pandas.to_pickle(obj, filename) function, which pickles a pandas object to a file. The corresponding pandas.read_pickle(filename) function loads it back.

NumPy has save and load functions as well.
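
A minimal sketch of both options, assuming the fitted state is a small lookup table like the one the transformer stores in self.result. The table contents and file names here are invented for the example, and a temporary directory stands in for the real storage location.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical fitted state: one percentage per category
result = pd.DataFrame({"cat": [0, 1, 2], "pre_cat": [0.5, 0.0, 0.5]})

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "result.pkl")
    npy_path = os.path.join(tmp, "result.npy")

    # pandas route: pickle the data frame and read it back
    result.to_pickle(pkl_path)
    restored_df = pd.read_pickle(pkl_path)

    # NumPy route: save the underlying array and load it back
    np.save(npy_path, result.to_numpy())
    restored_arr = np.load(npy_path)
```

A whole fitted transformer can also be serialized with the standard pickle module, as mentioned in the comments; in that case the class definition has to be importable in the process that loads it.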
