I am trying to read a PDF file and convert it to text by following this tutorial, but I keep getting an error. Here is my Python code:

import PyPDF2 
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

pdfFileObj = open(filename,'rb')
#The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
#discerning the number of pages will allow us to parse through all the pages
num_pages = pdfReader.numPages
count = 0
text = ""
#The while loop will read each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count +=1
    text += pageObj.extractText()
 
if text != "":
   text = text
 
else:
   text = textract.process(fileurl, method='tesseract', language='eng')
 
 
tokens = word_tokenize(text)
 
punctuations = ['(',')',';',':','[',']',',']
 
stop_words = stopwords.words('english')
 
keywords = [word for word in tokens if not word in stop_words and not word in punctuations]

The error I keep getting is:

tokens = word_tokenize(text)

TypeError: cannot use a string pattern on a bytes-like object

How can I fix the error?

  • Which version of python are you using? Commented Sep 25, 2019 at 2:32
  • Possible duplicate of TypeError: can't use a string pattern on a bytes-like object in re.findall() Commented Sep 25, 2019 at 2:34
  • Check the duplicate. word_tokenize uses regex on the backend so this solution will work for you as well. Commented Sep 25, 2019 at 2:34
  • @MyNameIsCaleb I reviewed the answer you referenced but I don't know how to apply to my situation Commented Sep 25, 2019 at 2:35
  • Try this: tokens = word_tokenize(text.decode("utf-8")) Commented Sep 25, 2019 at 2:36

1 Answer

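When the textract fallback runs, text holds bytes (textract.process returns bytes, not str), but word_tokenize needs a string because it uses regex in the backend.

Change the tokenizing line so it decodes the bytes first: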

tokens = word_tokenize(text.decode("utf-8"))
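
One caveat with the one-line fix: if the PyPDF2 branch produced the text, it is already a str, and calling .decode on a str raises AttributeError in Python 3. A minimal sketch of a guard that handles both cases is below. The helper name ensure_text is hypothetical, and the re pattern is only a stand-in for the regexes word_tokenize uses internally, so the snippet runs without NLTK or textract installed.

```python
import re


def ensure_text(data, encoding="utf-8"):
    """Return data as str, decoding it first when it is bytes (hypothetical helper)."""
    if isinstance(data, (bytes, bytearray)):
        # errors="replace" keeps going if the OCR output has invalid bytes
        return data.decode(encoding, errors="replace")
    return data


# Stand-in for what textract.process(...) hands back: a bytes object.
raw = b"Scanned page text, recovered by OCR."

# A string pattern, standing in for the tokenizer's internal regexes.
pattern = re.compile(r"\w+")

error_message = None
try:
    pattern.findall(raw)  # string pattern applied to bytes
except TypeError as exc:
    error_message = str(exc)  # same kind of TypeError as in the question

# Decoding first makes the same pattern work.
words = pattern.findall(ensure_text(raw))
print(error_message)
print(words)
```

In the question's script, this would amount to calling word_tokenize(ensure_text(text)) in place of word_tokenize(text), assuming the textract output is UTF-8 encoded.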