Python Sklearn Pipeline with array

Question

I am trying to create a classifier using Python and Sklearn. I have imported all my data successfully. I have been following a tutorial from here, changing it a bit as I go. Later in the project I realized that their training and testing data were much different than mine. If I understand it right, they had something like this:

X_train = ['Article or News article here', 'Another News Article or Article here', ...]
y_train = ['Article Type', 'Article Type', ...]
#Same for the X_test and y_test

While I had something like this:

X_train = [['Dylan went in the house. Robert left the house', 'Where is Dylan?'], ['Mary ate the apple. Tom ate the cake', 'Who ate the cake?'], ...]
y_train = ['In the house.', 'Tom ate the cake']
#Same for the X_test and y_test

When I tried to train the classifier with their pipeline:

text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, verbose=1)),
])

I get the error:

AttributeError: 'list' object has no attribute 'lower'

At this line:

text_clf.fit(X_train, y_train)

After doing some research, I now know that this is because each item in my X_train data is an array instead of a string. So my question is: how do I construct a pipeline that will accept arrays for my X_train data and strings for my y_train data? Is this possible to do with a pipeline?

Gambit1614 · Accepted Answer · 2018-07-06 22:51:47Z


You can use the tokenizer parameter to tell CountVectorizer to treat each list as a single, already tokenized document, and set the lowercase option to False so that it does not call .lower() on the list, like this:

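from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer(tokenizer=lambda single_doc: single_doc, stop_words='english', lowercase=False)),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, verbose=1)),
])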
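To see what this does in practice, here is a minimal sketch run on the two samples from the question. The toy data and the helper name `identity_tokenizer` are my own, for illustration only; I used a named function instead of the lambda (same behavior, and a pipeline holding a lambda cannot be saved with the standard pickle module) and left out `verbose` to keep the output quiet. I also dropped `stop_words='english'`: as far as I can tell it has no effect here, because each token is a whole string from the list and would only be filtered if it exactly matched a stop word.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline


def identity_tokenizer(doc):
    # Hypothetical helper: each sample is already a list of strings, so
    # return it unchanged. Every string in the list becomes one token.
    return doc


# Toy data in the shape described in the question (illustrative only).
X_train = [
    ["Dylan went in the house. Robert left the house", "Where is Dylan?"],
    ["Mary ate the apple. Tom ate the cake", "Who ate the cake?"],
]
y_train = ["In the house.", "Tom ate the cake"]

text_clf = Pipeline([
    # lowercase=False stops the vectorizer from calling .lower() on the list.
    ("vect", CountVectorizer(tokenizer=identity_tokenizer, lowercase=False)),
    ("tfidf", TfidfTransformer(use_idf=True)),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3, random_state=42)),
])
text_clf.fit(X_train, y_train)

# The vocabulary holds whole strings, not individual words.
vocabulary = sorted(text_clf.named_steps["vect"].vocabulary_)
print(vocabulary)

predictions = text_clf.predict(X_train)
print(predictions)
```

Note that the features are whole passages and whole questions, so two samples only share a feature when one of their strings is identical. If you want word-level features instead, you would need to split or join the strings yourself before vectorizing.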

edited Jul 6, 2018 at 22:51

answered Jul 6, 2018 at 21:40
