How can I send a 4 byte header from Java and read it in Python?

Question

I am trying to send a message via TCP sockets from a Java application and read it in Python 2.7. I want the first 4 bytes to specify the message length, so that I can do:

header = socket.recv(4)
message_length = struct.unpack(">L", header)[0]  # unpack returns a 1-tuple
message = socket.recv(message_length)

on the Python end.

Java side:

out = new PrintWriter(new BufferedWriter(new OutputStreamWriter(socket.getOutputStream())),true);
byte[] bytes = ByteBuffer.allocate(4).putInt(message_length).array();
String header = new String(bytes, Charset.forName("UTF-8"));
String message_w_header = header.concat(message);
out.print(message_w_header);

This works for some message lengths (10, 102 characters) but fails for others (for example, 1017 characters). For a failing value, if I print each byte of the header on both sides I get:

Java:
Bytes 0 0 3 -7
Length 1017
Hex string 3f9

Python:
Bytes 0 0 3 -17
Length 1007
Hex string \x00\x00\x03\xef

I think this has something to do with bytes being signed in Java and unsigned in Python, but I can't figure out what I should do to make it work.

You can't decode arbitrary binary (raw 32-bit integer) as UTF-8. Instead of putting the message length into a String, put the message and header into a byte[].
– univerio, Commented Mar 27, 2014 at 18:27

jfs · Accepted Answer · 2014-03-30 05:15:42Z


The issue is on the Java side: b'\x03\xf9' is not a valid UTF-8 byte sequence:

>>> b'\x03\xf9'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 1: invalid start byte

It seems that new String(bytes, Charset.forName("UTF-8")) uses a 'replace' error handler: the invalid byte is replaced with the Unicode replacement character '\ufffd', and b'\xef' is the first of the three bytes of that character encoded in UTF-8:

>>> b'\x03\xf9'.decode('utf-8', 'replace').encode('utf-8')
b'\x03\xef\xbf\xbd'

that is why you receive b'\x03\xef' instead of b'\x03\xf9' in Python.
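To make the arithmetic concrete, here is a small Python 3 sketch that imitates the round trip through a Java String. It assumes that the Java writer encodes the String back to UTF-8 (the platform default charset is not stated in the question), and the helper name header_after_string_round_trip is made up for this illustration:

```python
import struct

def header_after_string_round_trip(length):
    """Imitate bytes -> String (UTF-8, invalid bytes replaced) -> bytes (UTF-8).

    Assumption: the Java side writes the String out as UTF-8 again.
    Returns the first 4 bytes, which is what the receiver treats as the header.
    """
    header = struct.pack(">L", length)  # 4-byte big-endian unsigned length
    mangled = header.decode("utf-8", "replace").encode("utf-8")
    return mangled[:4]

for length in (10, 102, 1017):
    received = header_after_string_round_trip(length)
    decoded = struct.unpack(">L", received)[0]
    print(length, repr(received), decoded)
```

For 10 and 102 every header byte is below 0x80, so it is plain ASCII and survives the round trip unchanged. For 1017 the byte 0xf9 is replaced, the receiver sees 0xef in its place, and the header decodes as 0x3ef = 1007, which matches the value reported in the question.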

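To fix it, send bytes in Java instead of Unicode text: write the 4-byte length and the encoded message bytes directly to the socket's OutputStream (for example, DataOutputStream.writeInt writes an int as four big-endian bytes) rather than going through a String and a PrintWriter.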

Unrelated: sock.recv(n) may return fewer than n bytes. If the socket is blocking, you could create a file-like object using file = sock.makefile('rb') and call file.read(n) to read exactly n bytes (it returns fewer only if the connection is closed first).
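If you would rather stay with recv, a loop along the following lines is one way to collect exactly n bytes. This is only a sketch for a blocking socket; recv_exact and recv_message are hypothetical helper names, not part of the socket module:

```python
import socket
import struct

def recv_exact(sock, n):
    """Read exactly n bytes from a blocking socket.

    Raises EOFError if the peer closes the connection before n bytes arrive.
    """
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = sock.recv(remaining)  # may return fewer bytes than requested
        if not chunk:                 # empty result means the peer closed
            raise EOFError("connection closed with %d bytes still expected" % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def recv_message(sock):
    """Read one message framed by a 4-byte big-endian length header."""
    (length,) = struct.unpack(">L", recv_exact(sock, 4))
    return recv_exact(sock, length)
```

With these helpers, message = recv_message(sock) replaces the three lines from the question. Note that the header must carry the number of encoded bytes in the body, which can differ from the number of characters when the text is not pure ASCII.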



1 Comment

user3117037 Over a year ago

Thank you for the detailed explanation. It seems that I misled myself by assuming that there should be a 'magical' way to do it with character streams, given that they were introduced later in Java.
