Java: Converting UTF-8 to String

Question

When I run the following program:

public static void main(String args[]) throws Exception
{
    byte str[] = {(byte)0xEC, (byte)0x96, (byte)0xB4};
    String s = new String(str, "UTF-8");
}

on Linux and inspect the value of s in jdb, I correctly get:

 s = "어"

On Windows, I incorrectly get:

s = "?"

My byte sequence is a valid UTF-8 encoding of a Korean character. Why would it produce two very different results?

The Windows command prompt cannot display UTF-8 characters unless you change the code page using chcp, and you also need a font that can display those characters.
– user330315, Commented Oct 2, 2012 at 21:21
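
From the Java side, one way to avoid depending on the platform default encoding is to wrap System.out in a PrintStream with an explicit charset. The following is a minimal sketch rather than a guaranteed fix: the class name Utf8Console and the helper encodeForConsole are hypothetical, and the console itself must still be on a UTF-8 code page (chcp 65001 on Windows) and use a font that contains Hangul glyphs.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Console {
    // Hypothetical helper: returns the bytes a UTF-8 PrintStream would send to the console.
    static byte[] encodeForConsole(String text) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            PrintStream out = new PrintStream(buffer, true, "UTF-8");
            out.print(text);
            out.flush();
            return buffer.toByteArray();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] str = {(byte) 0xEC, (byte) 0x96, (byte) 0xB4};
        String s = new String(str, "UTF-8");
        // Write through a PrintStream with an explicit charset instead of the platform default.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(s);
    }
}
```

Whether the character then renders correctly still depends on the terminal; the sketch only controls which bytes Java writes.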

Tomasz Nurkiewicz · Accepted Answer · 2012-10-02 21:22:16Z

3

It correctly prints "어" on my computer (Ubuntu Linux), as described in Code Table Korean Hangul. The Windows command prompt is known to have issues with encoding, so don't worry about it.

Your code is fine.

1 Comment

kujawk Over a year ago

My mistake. The Korean characters were displaying properly in my Emacs text buffer, so I naturally assumed that they would display properly in the Emacs shell buffer. As folks pointed out, they do not.

Bozho · Accepted Answer · 2012-10-02 21:20:11Z

1

It gives 어 for me. This means your console is probably not configured to display UTF-8 and it is a printing/display problem, rather than a problem with representation.
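
One plausible source of the "?" is the output encoding step: when Java encodes a string into a charset that has no mapping for a character, String.getBytes substitutes that charset's replacement byte, which is "?" for US-ASCII. Below is a minimal sketch of that effect, assuming the console encoding behaves like US-ASCII for this character. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class ReplacementDemo {
    // Encode to a charset that cannot represent Hangul, then decode what such a console would show.
    static String asShownByAsciiConsole(String text) {
        byte[] encoded = text.getBytes(StandardCharsets.US_ASCII);
        return new String(encoded, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] str = {(byte) 0xEC, (byte) 0x96, (byte) 0xB4};
        String s = new String(str, StandardCharsets.UTF_8);
        // The in-memory string is a single UTF-16 code unit, U+C5B4.
        System.out.println(s.length());
        System.out.println(Integer.toHexString(s.charAt(0)));
        // The unmappable character is replaced only when encoding for output.
        System.out.println(asShownByAsciiConsole(s));
    }
}
```

The string in memory is the same in both cases; only the encoded output differs.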

Sergey Kalinichenko · Accepted Answer · 2012-10-02 21:21:20Z

1

You get the correct string; it is the Windows console that does not display it correctly.

Here is a link to an article that discusses a way to make Java console produce correct Unicode output using JNI.

Dan Bliss · Accepted Answer · 2012-10-02 21:35:51Z

0

JDB is displaying the data incorrectly. The code works the same on both Windows and Linux. Try running this more definitive test:

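import java.nio.charset.StandardCharsets;

public class Utf8HexDump {
    public static void main(String[] args) {
        byte[] str = {(byte) 0xEC, (byte) 0x96, (byte) 0xB4};
        String s = new String(str, StandardCharsets.UTF_8);
        // Print each UTF-16 code unit in hex, so the result does not depend on console encoding.
        for (int i = 0; i < s.length(); i++) {
            System.out.println(Integer.toHexString(s.charAt(i)));
        }
    }
}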

This prints the hex value of every character in the string. It prints "c5b4" on both Windows and Linux, which is the code point U+C5B4 for "어".
