Python error: TypeError: Expected string or bytes-like object

Question

I'm working on a sentiment analysis project using NLTK in Python. I can't get my script to pass rows of text from my CSV file to the tokenizer. If I pass the text in one entry at a time, it works fine, but when I pass the whole column in I get one persistent error: 'TypeError: expected string or bytes-like object'. Here is the printed data frame and the Python code I'm using. Any help resolving this issue would be great.

                              abstract
0    Allergic diseases are often triggered by envir...
1    omal lymphopoietin (TSLP) has important roles ...
2    of atrial premature beats, and a TSLP was high...
3     deposition may play an important role in the ...
4    ted by TsPLP was higher than that mediated by ...
5    nal Stat5 transcription factor in that TSLP st...

import nltk
import pandas as pd

data = pd.read_csv('text.csv', sep=';', encoding='utf-8')
x = data.loc[:, 'abstract']
print(x.head())
tokens = nltk.word_tokenize(x)  # this line raises the TypeError
print(tokens)

Attached is the full stack trace error. EDIT: print statement

EDIT: Output

Please update your question with the full Traceback message.
– quamrana, Commented Mar 17, 2020 at 13:21
The line tokens = nltk.word_tokenize(x) is the cause of the error. Here x is a pandas Series (a DataFrame column), but nltk.word_tokenize() expects a string. One option is to iterate over x and pass each string to nltk.word_tokenize() in turn.
– Ta_Req, Commented Mar 17, 2020 at 13:24
@0buz Sorry, I should have clarified that it's this line: tokens = nltk.word_tokenize(x)
– Benedict Groves, Commented Mar 17, 2020 at 13:24
@quamrana I have updated the question with a link to the full stack trace.
– Benedict Groves, Commented Mar 17, 2020 at 13:26

Ta_Req · Accepted Answer · 2020-03-17 13:31:57Z

1

tokens = [nltk.word_tokenize(line) for line in x]
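A minimal sketch of why the list comprehension helps. It assumes a plain Python list in place of the pandas column and a hypothetical regex-based `simple_tokenize` in place of `nltk.word_tokenize`, so that it runs without NLTK or its data files. Regex-based tokenizers reject anything that is not a string, which is where the TypeError comes from; iterating hands over one string at a time.

```python
import re


def simple_tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize:
    # runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


# Stand-in for the 'abstract' column: one string per row.
rows = [
    "Allergic diseases are often triggered.",
    "TSLP has important roles.",
]

# Passing the whole container fails, much like passing the Series did.
try:
    simple_tokenize(rows)
    failed = False
except TypeError:
    failed = True

# Tokenizing row by row yields one token list per row.
tokens = [simple_tokenize(line) for line in rows]

# Print each row's tokens on its own line so the rows stay visibly separate.
for i, row_tokens in enumerate(tokens):
    print(i, row_tokens)
```

The result is a list of lists, so `tokens[0]` holds the tokens of the first row only.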

answered Mar 17, 2020 at 13:31


7 Comments

Benedict Groves Over a year ago

Thanks, this worked! The input text is in a CSV, with one piece of text per cell. How can I make sure the output is printed cell by cell, not as one block of text?

Ta_Req Over a year ago

I do not understand. If you write the expected output I can have a look.

Benedict Groves Over a year ago

There are multiple entries in the CSV file. The script you suggested tokenizes them, but the output is printed as one block of text with no delimiters. It would be very helpful if I could get each cell of text in the CSV returned like so: ['Allergic', 'diseases', 'are', 'often', 'triggered', 'by', 'environmental', 'allergens', 'that', 'induce', 'dominant', 'type', '2', 'immune', 'responses', ',', 'characterized', 'by', 'the', 'infiltrated', 'T-helper', 'type', '2', '(', 'TH2', ')', 'lymphocytes', ',', 'eosinophils'], followed by the next entry and so on, with clear separations.

Ta_Req Over a year ago

You can follow the answer by @0buz: concatenate the lines into one big text and then tokenize.

Ta_Req Over a year ago

I think the solution I suggested should return each cell's contents separately, e.g. [[first cell tokens], [second cell tokens], [third cell tokens], ...]


0buz · Accepted Answer · 2020-03-17 14:15:54Z

1

The nltk documentation gives an example of nltk.word_tokenize usage in which "sentence" is a string.

In your situation, x is a pandas Series of strings, which you need to combine into a single string before passing it to nltk.word_tokenize.

One way to deal with this is to create your nltk "sentence" from x:

x = data.loc[:, 'abstract']
sentence = ' '.join(x)
tokens = nltk.word_tokenize(sentence)
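To illustrate the trade-off of joining first, here is a small sketch. It assumes a hypothetical regex-based `simple_tokenize` as a stand-in for `nltk.word_tokenize` and a plain list as a stand-in for the column. Joining produces a single flat token list, so the boundaries between abstracts are lost.

```python
import re


def simple_tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize.
    return re.findall(r"\w+|[^\w\s]", text)


# Stand-in for the 'abstract' column.
rows = ["TSLP has important roles.", "Deposition may play a role."]

# Join-then-tokenize: one flat list covering every row.
sentence = " ".join(rows)
flat_tokens = simple_tokenize(sentence)

# Tokenize-per-row: one list per row, boundaries preserved.
per_row_tokens = [simple_tokenize(r) for r in rows]

print(len(flat_tokens))     # number of tokens across all rows
print(len(per_row_tokens))  # number of rows
```

If the per-abstract grouping matters later (for example, one sentiment score per abstract), the per-row form is likely the one to keep.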

EDIT: As per the comments, try this instead (the result is a Series holding one list of tokens per row, so index into it accordingly):

tokens = x.apply(nltk.word_tokenize)
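One thing to watch for with the per-row approach: if the column contains empty cells, pandas loads them as missing values rather than strings, and the tokenizer raises the same TypeError on those rows. The sketch below assumes that situation (whether it applies to the file in the question is unknown) and again uses a hypothetical regex-based `simple_tokenize` in place of `nltk.word_tokenize`.

```python
import re

import pandas as pd


def simple_tokenize(text):
    # Hypothetical stand-in for nltk.word_tokenize.
    return re.findall(r"\w+|[^\w\s]", text)


# Made-up column with one missing cell in the middle.
x = pd.Series(
    ["TSLP has important roles.", None, "Deposition may play a role."],
    name="abstract",
)

# Drop missing cells before tokenizing; the index keeps the original row labels.
tokens = x.dropna().apply(simple_tokenize)

# Print one row of tokens per line, labelled by its original row index.
for idx, row_tokens in tokens.items():
    print(idx, row_tokens)
```

Because the index is preserved, `tokens.loc[2]` still refers to the third row of the original column.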

edited Mar 17, 2020 at 14:15

answered Mar 17, 2020 at 13:36


7 Comments

0buz Over a year ago

So do you want one set of tokens per abstract cell?

Benedict Groves Over a year ago

Is there any way to de-construct it back into a series of strings after it has been tokenized? Thanks

Benedict Groves Over a year ago

Yes that would be perfect!

0buz Over a year ago

Please see my EDIT and let me know how it goes.

Benedict Groves Over a year ago

Just tried your edit and it works, but weirdly only halfway through the text. Phenotypic analysis of these cells revealed that they are at the pro-B cell stage of differentiation and express cell surface markers characteristic of pro-B cells cultured in IL-7. TSLP can replace the activity of IL-7 in supporting the progression of B lymphocytes from uncommitted bipotential precursors. In the absence of either TSLP or IL-7, the progeny of cells that give rise to mature B lymphocytes fail to develop from these bipotential precursors. [To, examine, the, role, of, gamma, c, in,] ...
