7

I can't read a file, and I don't understand why:

f = open("test/test.pdf", "r")
data = list(f.read())
print data

Returns : []

I would like to open a PDF, extract every byte, and put them in a list.

What's wrong with my code ? :(

Thanks,

1
  • How many bytes are actually in test/test.pdf? Commented Mar 23, 2010 at 1:58

4 Answers

13
f = open("test/test.pdf", "rb")

You must include the pseudo-mode "b" for binary when reading and writing binary files on Windows. Otherwise the OS silently translates what it considers to be "line endings", which corrupts the data you read or write.
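As a minimal sketch of what the question asks for (every byte of the file in a list), the following reads in binary mode and converts the result with bytearray, which yields integer byte values on both Python 2.6+ and Python 3. The helper name read_bytes and the throwaway demo file are assumptions made for illustration; substitute your own path.

```python
import os
import tempfile

def read_bytes(path):
    """Return the file's contents as a list of integer byte values."""
    # "rb" disables any line-ending translation, so bytes arrive unchanged.
    with open(path, "rb") as f:
        return list(bytearray(f.read()))

# Demo on a small throwaway file (hypothetical content, not a valid PDF).
sample = b"%PDF-1.5\r\n%\xe2\xe3\xcf\xd3\r\n"
fd, demo_path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as out:
    out.write(sample)

data = read_bytes(demo_path)
os.remove(demo_path)
print(data[:5])  # [37, 80, 68, 70, 45], the byte values of "%PDF-"
```

Note that both the "\r" and the "\n" survive the round trip, which is the point of the "b" flag.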


1

Jonathan is correct that you should be opening the file in binary mode if you are on Windows.

However, a PDF file will start with "%PDF-", which would at least be read in regardless of whether you are using binary mode or not.

So it appears to me that your "test/test.pdf" is an empty file.
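One quick way to test that suspicion is to ask the file system for the size before reading anything. This is only a sketch; describe_file is a hypothetical helper, and the demo uses a temporary empty file rather than the asker's actual path.

```python
import os
import tempfile

def describe_file(path):
    """Report whether a path is missing, empty, or how many bytes it holds."""
    if not os.path.exists(path):
        return "missing"
    size = os.path.getsize(path)
    return "empty" if size == 0 else "%d bytes" % size

# Demo: mkstemp creates a zero-length file, much like the suspected test.pdf.
fd, empty_path = tempfile.mkstemp(suffix=".pdf")
os.close(fd)
status = describe_file(empty_path)
os.remove(empty_path)
print(status)  # empty
```

If describe_file("test/test.pdf") reports "empty", then read() returning nothing is expected, whatever mode the file is opened in.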


1
  • As best as I understand the PDF format, a PDF file shouldn't be a binary file. It should be a text file that may contain lots of binary blobs. I could be wrong.
  • On Windows, if you are opening a binary file, you need to include b in the mode of your file, i.e. open(filename, "rb").
    • On Unix-like systems, the b doesn't hurt anything, though it does not mean anything.
  • Always use a context manager with your files. That is to say, instead of writing f = open("test/test.pdf", "rb"), say with open("test/test.pdf", "rb") as f:. This will ensure your file always gets closed.
  • list(f.read()) is not likely to be useful code very often. f.read() returns a str and calling list on it makes a list of the characters (one-byte strings). This is very seldom needed.
  • Binary or text or whatever, read should work. Are you positive that there is anything in test/test.pdf? Python does not seem to think there is.
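To illustrate the context-manager and binary-mode points together, here is a small sketch that reads only the first five bytes and compares them with the "%PDF-" signature. The function name looks_like_pdf and the temporary demo file are assumptions for illustration, not part of any library.

```python
import os
import tempfile

def looks_like_pdf(path):
    """Return True if the file begins with the PDF signature '%PDF-'."""
    with open(path, "rb") as f:
        header = f.read(5)
    # Leaving the with block closes the file, even if read() had raised.
    assert f.closed
    return header == b"%PDF-"

# Demo on a throwaway file that carries only a PDF-style header line.
fd, demo_path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as out:
    out.write(b"%PDF-1.5\r\n")

is_pdf = looks_like_pdf(demo_path)
os.remove(demo_path)
print(is_pdf)  # True
```

An empty file makes f.read(5) return an empty result, so the function returns False, which is another cheap way to spot the situation described in the question.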


0

What platform are you running on?

Using Python 2.6 on Windows XP, I get:

f = open("14500lf.pdf", "r")
data = list(f.read())
print data
['%', 'P', 'D', 'F', '-', '1', '.', '5', '\r', '%', '\xe2', '\xe3', '\xcf', '\xd3', '\n', '1', ' ', '0', ' ', 'o', 'b', 'j', '<', '<', '/', 'C', 'o', 'n', 't', 'e', 'n', 't', 's', ' ', '3', ' ', '0', ' ', 'R', '/', 'T', 'y', 'p', 'e', '/', 'P', 'a', 'g', 'e', '/', 'P', 'a', 'r', 'e', 'n', 't', ' ', '8', '7', ' ', '0', ' ', 'R', '/', 'T', 'h', 'u', 'm', 'b', ' ', '7', '1', ' ', '0', ' ', 'R', '/', 'R', 'o', 't', 'a', 't', 'e', ' ', '0', '/', 'M', 'e', 'd', 'i', 'a', 'B', 'o', 'x', '[', '0', ' ', '0', ' ', '6', '1', '2', ' ', '7', '9', '2', ']', '/', 'C', 'r', 'o', 'p', 'B', 'o', 'x', '[', '0', ' ', '0', ' ', '6', '1', '2', ' ', '7', '9', '2', ']', '/', 'R', 'e', 's', 'o', 'u', 'r', 'c', 'e', 's', ' ', '2', ' ', '0', ' ', 'R', '>', '>', '\r', 'e', 'n', 'd', 'o', 'b', 'j', '\r', '2', ' ', '0', ' ', 'o', 'b', 'j', '<', '<', '/', 'C', 'o', 'l', 'o', 'r', 'S', 'p', 'a', 'c', 'e', '<', '<', '/', 'D', 'e', 'f', 'a', 'u', 'l', 't', 'R', 'G', 'B', ' ', '1', '0', '0', ' ', '0', ' ', 'R', '>', '>', '/', 'F', 'o', 'n', 't', '<', '<', '/', 'F', '5', ' ', '9', '6', ' ', '0', ' ', 'R', '/', 'F', '7', ' ', '9', '7', ' ', '0', ' ', 'R', '/', 'F', '9', ' ', '1', '0', '6', ' ', '0', ' ', 'R', '/', 'F', '1', '1', ' ', '1', '0', '7', ' ', '0', ' ', 'R', '/', 'F', '1', '4', ' ', '1', '1', '1', ' ', '0', ' ', 'R', '/', 'F', '1', '6', ' ', '1', '1', '6', ' ', '0', ' ', 'R', '/', 'F', '1', '7', ' ', '1', '1', '7', ' ', '0', ' ', 'R', '/', 'F', '1', '3', ' ', '1', '1', '2', ' ', '0', ' ', 'R', '>', '>', '/', 'P', 'r', 'o', 'c', 'S', 'e', 't', '[', '/', 'P', 'D', 'F', '/', 'T', 'e', 'x', 't', ']', '>', '>', '\r', 'e', 'n', 'd', 'o', 'b', 'j', '\r', '3', ' ', '0', ' ', 'o', 'b', 'j', '<', '<', '/', 'L', 'e', 'n', 'g', 't', 'h', ' ', '4', ' ', '0', ' ', 'R', '/', 'F', 'i', 'l', 't', 'e', 'r', '/', 'F', 'l', 'a', 't', 'e', 'D', 'e', 'c', 'o', 'd', 'e', '>', '>', 's', 't', 'r', 'e', 'a', 'm', '\n', 'H', '\x89', '\xa4', 'W', '\xd9', 'r', 'T', '\xc9', '\x11', '\xfd', '\x82', '\xfb', '\x0f', '\xf5', 
'\xd8', '\n', '\x8f', '\x8a', '\xda', '\x97', 'G', '!', '\x04', '\x06', '\x03']

That is from a PDF I happen to have on my desktop (it's an IC datasheet for the LTC1450).

Using "rb" (Read Binary):

f = open("14500lf.pdf", "rb")
data = list(f.read())
print data
['%', 'P', 'D', 'F', '-', '1', '.', '5', '\r', '%', '\xe2', '\xe3', '\xcf', '\xd3', '\r', '\n', '1', ' ', '0', ' ', 'o', 'b', 'j', '<', '<', '/', 'C', 'o', 'n', 't', 'e', 'n', 't', 's', ' ', '3', ' ', '0', ' ', 'R', '/', 'T', 'y', 'p', 'e', '/', 'P', 'a', 'g', 'e', '/', 'P', 'a', 'r', 'e', 'n', 't', ' ', '8', '7', ' ', '0', ' ', 'R', '/', 'T', 'h', 'u', 'm', 'b', ' ', '7', '1', ' ', '0', ' ', 'R', '/', 'R', 'o', 't', 'a', 't', 'e', ' ', '0', '/', 'M', 'e', 'd', 'i', 'a', 'B', 'o', 'x', '[', '0', ' ', '0', ' ', '6', '1', '2', ' ', '7', '9', '2', ']', '/', 'C', 'r', 'o', 'p', 'B', 'o', 'x', '[', '0', ' ', '0', ' ', '6', '1', '2', ' ', '7', '9', '2', ']', '/', 'R', 'e', 's', 'o', 'u', 'r', 'c', 'e', 's', ' ', '2', ' ', '0', ' ', 'R', '>', '>', '\r', 'e',

....Snip a few thousand lines...

'9', '1', ' ', '0', ' ', 'R', '/', 'I', 'D', '[', '<', 'd', 'd', '3', 'd', '2', '8', '5', 'e', '1', 'd', '9', '0', '4', '6', 'e', '1', 'f', '6', 'e', '7', '0', '8', 'b', 'd', '8', 'e', '4', 'f', '9', 'b', '1', '3', '>', '<', '4', '3', '8', 'a', '7', '7', '2', '3', 'f', 'b', '2', '9', 'e', '7', '4', '6', 'a', '4', 'd', '4', '1', '6', 'a', 'f', '7', '6', '2', 'd', '8', '0', '9', '5', '>', ']', '>', '>', '\r', '\n', 's', 't', 'a', 'r', 't', 'x', 'r', 'e', 'f', '\r', '\n', '2', '9', '0', '2', '6', '9', '\r', '\n', '%', '%', 'E', 'O', 'F', '\r', '\n']

