I'm getting started on a TensorFlow project and am in the middle of defining and creating my feature columns. However, I have hundreds and hundreds of features; it's a pretty extensive dataset. Even after preprocessing and scrubbing, I still have a lot of columns.
The traditional way of creating a feature_column is shown in the TensorFlow tutorial and even in this StackOverflow post. You essentially declare and initialize a separate TensorFlow object for each feature column:
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])
This is all well and good if your dataset has only a few columns, but in my case I certainly don't want to write hundreds of lines of code initializing different feature_column objects.
What's the best way to resolve this issue? I notice that in the tutorial, all the columns are collected into a list:
base_columns = [
    gender, native_country, education, occupation, workclass, relationship,
    age_buckets,
]
Which is ultimately passed into your estimator:
m = tf.estimator.LinearClassifier(
    model_dir=model_dir, feature_columns=base_columns)
So would the ideal way of handling feature_column creation for hundreds of columns be to build them in a loop and append them to a list? Something like this?
from pandas.api.types import is_string_dtype, is_numeric_dtype  # pandas dtype checks

my_columns = []
for col in df.columns:
    if is_string_dtype(df[col]):
        # Hash string columns, with one bucket per unique value
        my_columns.append(tf.feature_column.categorical_column_with_hash_bucket(
            col, hash_bucket_size=len(df[col].unique())))
    elif is_numeric_dtype(df[col]):
        my_columns.append(tf.feature_column.numeric_column(col))
Is this the best way of creating these feature columns? Or am I missing some functionality in TensorFlow that lets me work around this step?
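For reference, here is the rough end-to-end version of what I'm considering, wired into the estimator. I'm assuming the preprocessed pandas DataFrame df from above, a label column that I'm calling "label" (just a placeholder name), and TF 1.x's tf.estimator.inputs.pandas_input_fn for the input pipeline:

import tensorflow as tf
from pandas.api.types import is_string_dtype, is_numeric_dtype

# df is the preprocessed DataFrame described above; "label" is a placeholder name
features = df.drop("label", axis=1)
model_dir = "/tmp/linear_model"  # placeholder path

# Build one feature column per DataFrame column, based on its dtype
my_columns = []
for col in features.columns:
    if is_string_dtype(features[col]):
        my_columns.append(tf.feature_column.categorical_column_with_hash_bucket(
            col, hash_bucket_size=len(features[col].unique())))
    elif is_numeric_dtype(features[col]):
        my_columns.append(tf.feature_column.numeric_column(col))

# Feed the DataFrame straight into the estimator
train_input_fn = tf.estimator.inputs.pandas_input_fn(
    x=features, y=df["label"], num_epochs=None, shuffle=True)

m = tf.estimator.LinearClassifier(
    model_dir=model_dir, feature_columns=my_columns)
m.train(input_fn=train_input_fn, steps=1000)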