How do I fix the error "cannot use a string pattern on a bytes-like object"?

Question

I am trying to read a PDF file and convert it to text by following this tutorial, but I keep getting an error. Here is my Python code:

import PyPDF2 
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

pdfFileObj = open(filename,'rb')
#The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
#discerning the number of pages will allow us to parse through all the pages
num_pages = pdfReader.numPages
count = 0
text = ""
#The while loop will read each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count +=1
    text += pageObj.extractText()
 
if text != "":
   text = text
 
else:
   text = textract.process(fileurl, method='tesseract', language='eng')
 
 
tokens = word_tokenize(text)
 
punctuations = ['(',')',';',':','[',']',',']
 
stop_words = stopwords.words('english')
 
keywords = [word for word in tokens if not word in stop_words and not word in punctuations]

The error I keep getting is:

tokens = word_tokenize(text)

TypeError: cannot use a string pattern on a bytes-like object

How can I fix the error?

Possible duplicate of TypeError: can't use a string pattern on a bytes-like object in re.findall()
– MyNameIsCaleb, Commented Sep 25, 2019 at 2:34
Check the duplicate. word_tokenize uses regex on the backend, so this solution will work for you as well.
– MyNameIsCaleb, Commented Sep 25, 2019 at 2:34
@MyNameIsCaleb I reviewed the answer you referenced, but I don't know how to apply it to my situation.
– e.iluf, Commented Sep 25, 2019 at 2:35

MyNameIsCaleb · Accepted Answer · 2019-09-25 02:42:39Z

3

You are passing bytes, but word_tokenize needs a string because it uses regex in the backend. In your code the bytes come from the textract.process fallback, which returns bytes rather than str.
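To see where the message comes from, here is a minimal sketch that reproduces it with the standard re module alone. The pattern and the sample bytes are made up for illustration and are not what word_tokenize uses internally; the point is only that a str pattern cannot be applied to a bytes object.

```python
import re

# A str pattern, standing in for the regexes a tokenizer might compile.
pattern = re.compile(r"\w+")

data = b"some bytes"  # illustrative sample, not real textract output

try:
    pattern.findall(data)          # str pattern applied to bytes
    message = None
except TypeError as exc:
    message = str(exc)             # the same TypeError as in the question

# Decoding first gives the regex a str to work on.
words = pattern.findall(data.decode("utf-8"))
print(message)
print(words)
```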

Decode the bytes before tokenizing by changing the word_tokenize line to:

tokens = word_tokenize(text.decode("utf-8"))
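One caveat: in Python 3 a str has no decode method, so calling text.decode unconditionally raises AttributeError whenever the PyPDF2 branch already produced a string. A small helper that decodes only when needed avoids that. This is a sketch, not part of any library: the name ensure_text is hypothetical, and UTF-8 is assumed as the encoding of the bytes.

```python
def ensure_text(value, encoding="utf-8"):
    """Return value as str, decoding only if it is bytes-like.

    Hypothetical helper; the UTF-8 default is an assumption about
    how the bytes were encoded.
    """
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding)
    return value


# A str (as from the PyPDF2 branch) passes through unchanged.
from_pypdf = ensure_text("already a string")

# Bytes (as from the textract fallback) are decoded to str.
from_fallback = ensure_text(b"raw bytes from OCR")

print(type(from_pypdf).__name__, type(from_fallback).__name__)
```

With this in place, the tokenizing line in the question could be written as tokens = word_tokenize(ensure_text(text)), which works for either branch.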
