I'm working on a sentiment analysis project using NLTK in Python. My script fails when I pass rows of text from my CSV to the tokenizer, although it works fine if I pass the text in one entry at a time. When I pass the whole column in, I get one persistent error: 'TypeError: expected string or bytes-like object'. Here are the printed data frame and the Python code I'm using. Any help resolving this issue would be great.

                              abstract
0    Allergic diseases are often triggered by envir...
1    omal lymphopoietin (TSLP) has important roles ...
2    of atrial premature beats, and a TSLP was high...
3     deposition may play an important role in the ...
4    ted by TsPLP was higher than that mediated by ...
5    nal Stat5 transcription factor in that TSLP st...
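
import nltk
import pandas as pd

data = pd.read_csv('text.csv', sep=';', encoding='utf-8')
x = data.loc[:, 'abstract']  # x is a pandas Series, not a single string
print(x.head())
tokens = nltk.word_tokenize(x)  # raises TypeError: expected string or bytes-like object
print(tokens)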

Attached is the full stack trace error. EDIT: print statement

  • Which line gives you that error? Commented Mar 17, 2020 at 13:20
  • Please update your question with the full Traceback message. Commented Mar 17, 2020 at 13:21
  • tokens = nltk.word_tokenize(x) is the cause of the error. Here x is a pandas Series, but nltk.word_tokenize() expects a string. One thing you can do is iterate over x and pass each string to nltk.word_tokenize() in turn. Commented Mar 17, 2020 at 13:24
  • @0buz Sorry, I should have clarified that it's this line: tokens = nltk.word_tokenize(x) Commented Mar 17, 2020 at 13:24
  • @quamrana I have updated the question with a link to the full stack trace error Commented Mar 17, 2020 at 13:26
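
A minimal sketch of why the call fails, assuming (as the error message suggests) that the tokenizer ends up handing its argument to Python's regex machinery. It uses only the standard `re` module and a plain list standing in for the `abstract` column; the sample strings and variable names are illustrative, not taken from the asker's file.

```python
import re

# A plain list stands in for the 'abstract' column (illustrative data).
rows = ["Allergic diseases are often triggered", "TSLP has important roles"]

# Passing the whole collection fails: regex functions accept one string at a time.
try:
    re.findall(r"\w+", rows)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)

# Passing a single entry works.
first_tokens = re.findall(r"\w+", rows[0])
print(first_tokens)
```

The same pattern applies to `nltk.word_tokenize`: it takes one string per call, so a whole column has to be fed to it row by row or joined into one string first.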

2 Answers

tokens = [nltk.word_tokenize(line) for line in x ]
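
A small sketch of what the list comprehension above produces and how to print it one row at a time. To keep it runnable without downloading NLTK data, it uses a regex-based stand-in tokenizer; `simple_tokenize` and `tokenize_rows` are hypothetical helper names, and in the real script `nltk.word_tokenize` would be passed as `tokenize` instead.

```python
import re


def simple_tokenize(text):
    # Stand-in tokenizer (an assumption for this sketch): words and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_rows(rows, tokenize=simple_tokenize):
    """Return one token list per input row."""
    return [tokenize(row) for row in rows]


# Illustrative rows standing in for the 'abstract' column.
abstracts = [
    "Allergic diseases are often triggered by allergens.",
    "TSLP has important roles in immunity.",
]

tokens_per_row = tokenize_rows(abstracts)

# The result is a list of lists, so each row can be printed separately.
for index, tokens in enumerate(tokens_per_row):
    print(index, tokens)
```

Printing the outer list directly shows everything as one nested block; looping over it gives one clearly separated token list per cell.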

7 Comments

Thanks, this worked! The input text is in a CSV with one piece of text per cell. How can I make sure the output is printed per cell, not just as one block of text?
I do not understand. If you write the expected output I can have a look.
There are multiple entries in the CSV file. The script you suggested tokenizes them, but the output is printed as one block of text with no delimiters. It would be very helpful if each cell of text in the CSV were returned like so: ['Allergic', 'diseases', 'are', 'often', 'triggered', 'by', 'environmental', 'allergens', 'that', 'induce', 'dominant', 'type', '2', 'immune', 'responses', ',', 'characterized', 'by', 'the', 'infiltrated', 'T-helper', 'type', '2', '(', 'TH2', ')', 'lymphocytes', ',', 'eosinophils'], followed by the next entry and so on, with clear separations.
You can follow the answer by @0buz. Concatenate the lines into one big text and then tokenize.
I think the solution I suggested should already return each cell's content separately, e.g. [[first cell tokens], [second cell tokens], [third cell tokens], ...]

The nltk documentation gives an example of nltk.word_tokenize usage where you may notice "sentence" is a string.

In your situation, x is a pandas Series (of strings), which you need to turn into a single string before passing it to nltk.word_tokenize.

One way to deal with this is to create your nltk "sentence" from x:

x = data.loc[:, 'abstract']
sentence = ' '.join(x)
tokens = nltk.word_tokenize(sentence)

EDIT: Try this as per further comments (remember that the result is a Series holding one list of tokens per row, to be accessed accordingly):

tokens = x.apply(nltk.word_tokenize)
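
A hedged sketch of the per-row `apply` approach on a small DataFrame. It assumes pandas is installed and uses a regex-based stand-in (`simple_tokenize`, a hypothetical helper) in place of `nltk.word_tokenize` so it runs without NLTK data. It also drops missing cells first, on the assumption that the column may contain some: a missing value is not a string and would raise the same TypeError.

```python
import re

import pandas as pd


def simple_tokenize(text):
    # Stand-in for nltk.word_tokenize (an assumption), so no NLTK data is needed.
    return re.findall(r"\w+|[^\w\s]", text)


# Illustrative data; the middle cell is deliberately missing.
data = pd.DataFrame(
    {"abstract": ["TSLP has important roles.", None, "Stat5 is activated."]}
)

# Drop missing cells first: a missing value is not a string.
abstracts = data["abstract"].dropna()

# One token list per remaining row; assignment aligns on the index,
# so the row with the missing abstract gets NaN in the new column.
data["tokens"] = abstracts.apply(simple_tokenize)

for index, tokens in data["tokens"].dropna().items():
    print(index, tokens)
```

Keeping the tokens in a column next to the original text makes it easy to check each abstract against its own token list.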

7 Comments

So do you want one set of tokens per abstract cell?
Is there any way to deconstruct it back into a series of strings after it has been tokenized? Thanks
Yes that would be perfect!
Please see my EDIT and let me know how it goes.
Just tried your edit and it works, but weirdly only halfway through the text: Phenotypic analysis of these cells revealed that they are at the pro-B cell stage of differentiation and express cell surface markers characteristic of pro-B cells cultured in IL-7. TSLP can replace the activity of IL-7 in supporting the progression of B lymphocytes from uncommitted bipotential precursors. In the absence of either TSLP or IL-7, the progeny of cells that give rise to mature B lymphocytes fail to develop from these bipotential precursors. [To, examine, the, role, of, gamma, c, in,] ...