I am trying to send a message over a TCP socket from a Java application and read it in Python 2.7. I want the first 4 bytes to specify the message length, so that I can do:

header = socket.recv(4)
message_length = struct.unpack(">L", header)[0]  # unpack returns a tuple
message = socket.recv(message_length)

on the Python end.

Java side:

out = new PrintWriter(new BufferedWriter(new OutputStreamWriter(socket.getOutputStream())), true);
byte[] bytes = ByteBuffer.allocate(4).putInt(message_length).array();
String header = new String(bytes, Charset.forName("UTF-8"));
String message_w_header = header.concat(message);
out.print(message_w_header);

This works for some message lengths (10 or 102 characters) but fails for others (for example, 1017 characters). For the failing length, printing the value of each header byte on both sides gives:

Java:
Bytes 0 0 3 -7
Length 1017
Hex string 3f9

Python:
Bytes 0 0 3 -17
Length 1007
Hex string \x00\x00\x03\xef

I think this has something to do with bytes being signed in Java and unsigned in Python, but I can't figure out what I should do to make it work.

  • In the Java code, what is the type of out? Commented Mar 27, 2014 at 16:47
  • PrintWriter, edited code. Commented Mar 27, 2014 at 16:57
  • Are you sure Python is using UTF-8? Commented Mar 27, 2014 at 17:29
  • Changing default encoding doesn't make a difference. Commented Mar 27, 2014 at 18:15
  • You can't decode arbitrary binary (raw 32-bit integer) as UTF-8. Instead of putting the message length into a String, put the message and header into a byte[]. Commented Mar 27, 2014 at 18:27

1 Answer

The issue is on the Java side. b'\x03\xf9' is not a valid UTF-8 byte sequence:

>>> b'\x03\xf9'.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 1: invalid start byte

It seems that new String(bytes, Charset.forName("UTF-8")) uses a 'replace' error handler: the invalid byte is replaced with the Unicode replacement character '\ufffd', and b'\xef' is the first of the three bytes of that character encoded in UTF-8:

>>> b'\x03\xf9'.decode('utf-8', 'replace').encode('utf-8')
b'\x03\xef\xbf\xbd'

That is why you receive b'\x03\xef' instead of b'\x03\xf9' in Python.
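
As a rough illustration of that round trip, here is a small Python sketch that mimics (it does not run) the Java behaviour by decoding the 4-byte header as UTF-8 with replacement and encoding it again. The helper name through_utf8_string is made up for this example, and the assumption is that the Java String conversion behaves like Python's 'replace' handler, as it appears to above:

```python
import struct


def through_utf8_string(raw):
    """Decode raw bytes as UTF-8 with replacement, then encode them again.

    Hypothetical helper: it only imitates what appears to happen when the
    header bytes are turned into a Java String and written out as text.
    """
    return raw.decode("utf-8", "replace").encode("utf-8")


for length in (10, 102, 1017):
    header = struct.pack(">L", length)       # the 4 bytes Java intended to send
    on_wire = through_utf8_string(header)    # what ends up on the socket
    received = struct.unpack(">L", on_wire[:4])[0]  # what recv(4) decodes
    print(length, repr(header), repr(on_wire), received)
```

Lengths 10 and 102 survive because every header byte is below 0x80 and is therefore plain ASCII. For 1017 the byte 0xf9 is replaced, the header grows to six bytes, and the first four decode to 1007, which matches the value seen on the Python side.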

To fix it, send bytes in Java instead of Unicode text.
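
For reference, this is a sketch in Python of the byte layout the Java side should put on the wire: a 4-byte big-endian length followed by the encoded message. The function name build_frame is hypothetical. Note that the length here is the number of encoded bytes, which differs from the number of characters as soon as the message contains non-ASCII text. On the Java side, something like DataOutputStream.writeInt followed by a write of message.getBytes("UTF-8") should produce the same layout, but that is a suggestion, not something taken from the question.

```python
import struct


def build_frame(text):
    """Return a length-prefixed frame for a text message.

    Hypothetical helper for illustration: the payload is the UTF-8
    encoding of the text, and the header is its byte length packed
    as an unsigned 32-bit big-endian integer.
    """
    payload = text.encode("utf-8")
    return struct.pack(">L", len(payload)) + payload


frame = build_frame(u"a" * 1017)
print(repr(frame[:4]))  # the header bytes for a 1017-byte payload
```

Because the header is written as raw bytes and never passes through a character encoder, the byte 0xf9 arrives unchanged.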

Unrelated: sock.recv(n) may return fewer than n bytes. If the socket is blocking, you could create a file-like object using file = sock.makefile('rb') and call file.read(n) to read exactly n bytes.
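
A minimal sketch of such a reader, assuming a blocking socket and the 4-byte big-endian header from the question. The names read_exact and read_message are made up for this example, and socket.socketpair() is used only to have a connected pair to demonstrate with:

```python
import socket
import struct


def read_exact(stream, n):
    """Read exactly n bytes from a file-like object or raise EOFError."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("connection closed after %d of %d bytes" % (len(data), n))
    return data


def read_message(stream):
    """Read one length-prefixed message: 4-byte big-endian length, then payload."""
    (length,) = struct.unpack(">L", read_exact(stream, 4))
    return read_exact(stream, length)


# Demonstration with a local socket pair standing in for the Java sender.
left, right = socket.socketpair()
payload = b"x" * 1017
left.sendall(struct.pack(">L", len(payload)) + payload)
left.close()

stream = right.makefile("rb")
message = read_message(stream)
print(len(message))

stream.close()
right.close()
```

The check in read_exact matters because read(n) on the file object returns a short result only when the peer closes the connection early, and that case should not be mistaken for a complete message.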

1 Comment

Thank you for the detailed explanation. It seems that I misled myself by assuming that there should be a 'magical' way to do it with character streams, given that they were introduced later in Java.
