
I am trying to write a simple program for this interview question:

Write a function that checks for a valid Unicode byte sequence. A Unicode sequence is encoded as:

  • the first byte indicates the number of subsequent bytes ('11110000' means 4 subsequent data bytes)
  • data bytes start with '10xxxxxx'
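For concreteness, this is what those bit patterns look like for a real multi-byte character, using the platform's own UTF-8 encoder (a sketch; the class name is made up, and note that in standard UTF-8 the run of leading 1s gives the *total* length of the sequence, not just the count of subsequent bytes):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Peek {
    public static void main(String[] args) {
        // '\u20AC' is the euro sign, which UTF-8 encodes as three bytes
        byte[] b = "\u20AC".getBytes(StandardCharsets.UTF_8);
        for (byte x : b) {
            System.out.println(Integer.toBinaryString(x & 0xFF));
        }
        // prints:
        // 11100010   <- lead byte: three leading 1s, so a 3-byte sequence
        // 10000010   <- continuation byte, starts with 10
        // 10101100   <- continuation byte, starts with 10
    }
}
```

(`StandardCharsets` is Java 7+; on older JDKs use `getBytes("UTF-8")` and handle the checked exception.)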

public static void main(String[] args)
{
    System.out.println(checkUnicode(new byte[] {(byte) 'c'}));
}

    /**
     * Write a function that checks for a valid Unicode byte sequence. A Unicode
     * sequence is encoded as: the first byte indicates the number of subsequent
     * bytes ('11110000' means 4 subsequent data bytes); data bytes start with
     * '10xxxxxx'.
     *
     * @param unicodeChar the byte sequence to validate
     * @return true if the sequence is valid, false otherwise
     */
public static boolean checkUnicode(byte[] unicodeChar)
{
    byte b = unicodeChar[0];
    int len = 0;

    int temp = (int) b << 1;
    while ((int) temp << 1 == 0)
    {
        len++;
    }
    System.out.println(len);

    if (unicodeChar.length == len)
    {
        for (int i = 1; i < len; i++)
        {
            // Check if the most significant 2 bits in the byte are '10'
            // 0xC0 is 11000000 in binary
            // 10000000 in binary is 128 in decimal
            if (((int) unicodeChar[i] & 0xC0) == 128)
            {
                continue;
            }
            else
            {
                return false;
            }
        }
        return true;
    }
    else
    {
        return false;
    }
}

The output I get is:
99
false  

Changed the conversion from char to byte array based on Chris Jester-Young's comment.

Can someone point me in the right direction?

Thanks

Made some modifications based on input from Ted Hopp.

P.S.: I got the question from some forum, and I think it wasn't posted there correctly; however, I still decided to solve it and use it as is, to avoid obfuscating it further, since I did not understand it completely either!

  • @Dani: It refers to the number of leading 1s. The OP is basically asked to validate the well-formedness of a UTF-8 string. Commented Jun 5, 2011 at 3:27
  • Why shove a unary number into a binary format? Commented Jun 5, 2011 at 3:33
  • @Dani: What? (You should read up on how UTF-8 works, then you will understand the question better.) Commented Jun 5, 2011 at 3:34
  • Could you provide some UTF-8 strings in binary string format so we can test our solutions? Commented Jun 5, 2011 at 3:54
  • If this is a real interview question ... rather than just practice ... you shouldn't be asking for help! Commented Jun 5, 2011 at 4:34

4 Answers


Here's an enterprise-level solution for your enterprise-level job:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.BitSet;

public static void main(String[] args) {
    if (args.length == 0 || args[0] == null || (args[0] = args[0].trim()).isEmpty()) {
        System.out.println("No argument passed or argument empty!");
        return;
    }

    String arg = args[0];
    System.out.println("arg: " + arg + ", arg len: " + arg.length());

    BitSet bs = new BitSet(arg.length());
    for (int i = 0; i < arg.length(); i++) {
        if (arg.charAt(i) == '1') {
            bs.set(i, true); 
        }
    }
    ByteBuffer bb = ByteBuffer.wrap(bs.toByteArray());
    Charset cs = Charset.forName("UTF-8");
    CharsetDecoder csd = cs.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);

    try {
        CharBuffer cb = csd.decode(bb);
        String uns = cb.toString();
        System.out.println("Got unicode string of len " + uns.length() + ": " + uns + " from " + arg + " -- no errors!");
    } catch (CharacterCodingException cce) {
        System.out.println("Invalid UTF-8 unicode string! " + cce.getMessage());
    }
}

Verification:

public static void test() {
    StringBuilder sb = new StringBuilder();
    byte[] byt = "stupid interview".getBytes();
    BitSet byt1 = fromByteArray(byt);
    for (int i = 0; i < byt1.size(); i++) {
        sb.append(byt1.get(i) ? "1" : "0");
    }
    String[] st = new String[] { sb.toString() };
    main(st);
}

public static BitSet fromByteArray(byte[] bytes) {
    BitSet bits = new BitSet();
    for (int i=0; i<bytes.length*8; i++) {
        if ((bytes[bytes.length-i/8-1]&(1<<(i%8))) > 0) {
            bits.set(i);
        }
    }
    return bits;
}

Output:

11001110001011101010111000001110100101100010011000000100100101100111011000101110101001100100111001101110100101101010011011101110
arg: 11001110001011101010111000001110100101100010011000000100100101100111011000101110101001100100111001101110100101101010011011101110, arg len: 128
{0, 1, 4, 5, 6, 10, 12, 13, 14, 16, 18, 20, 21, 22, 28, 29, 30, 32, 35, 37, 38, 42, 45, 46, 53, 56, 59, 61, 62, 65, 66, 67, 69, 70, 74, 76, 77, 78, 80, 82, 85, 86, 89, 92, 93, 94, 97, 98, 100, 101, 102, 104, 107, 109, 110, 112, 114, 117, 118, 120, 121, 122, 124, 125, 126}
Got unicode string of len 16: stupid interview from 11001110001011101010111000001110100101100010011000000100100101100111011000101110101001100100111001101110100101101010011011101110 -- no errors!

3 Comments

For some reason I am unable to see BitSet.valueOf in the source code for java.util.BitSet, so I'm unable to verify this. I will accept the solution once I get it to work. The question was asked by Google, per the forum. I myself found it too contrived: I would have to look up the APIs and conversions a few times online before I could get it to work, so I wonder how I could answer it in an interview!
Oh, valueOf is in 1.7. Oops. They seem to be here: java2s.com/Open-Source/Java-Document/6.0-JDK-Core/…. Guess I took these methods for granted since I'm using 1.7. Just stumbled upon them :)
Anyway, if you want to avoid it, you'll need to create a ByteBuffer large enough for the conversion (ceil(input length / 8)), and then do the bitwise operations to build each byte as you fill the ByteBuffer: divide (for the byte number) and modulo (for the position inside the byte).
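A sketch of that manual packing, for pre-1.7 JDKs without BitSet.valueOf/toByteArray (it assumes MSB-first bit order within each byte, i.e. the string's leftmost bit becomes the byte's high bit, which is *not* the convention BitSet.toByteArray uses; the class and method names are made up):

```java
public class BitPacker {
    static byte[] packBits(String bits) {
        byte[] out = new byte[(bits.length() + 7) / 8]; // ceil(len / 8) bytes
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                // byte number by division, bit position inside the byte by modulo
                out[i / 8] |= (byte) (1 << (7 - i % 8));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "11000011" packs into the single byte 0xC3
        System.out.printf("%02X%n", packBits("11000011")[0] & 0xFF);
    }
}
```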

First, the documentation of UTF-8 provided in the question is wrong. There is no such thing as "a valid Unicode byte sequence" without specifying the encoding; a safe assumption is that they meant UTF-8. Second (and more important), 11110000 does not indicate 4 more bytes of data: the four "1" bits before the first "0" bit indicate a total of 4 bytes (that is, 3 subsequent bytes, not 4, each starting with "10"). The rules are described well in the Wikipedia article on UTF-8.

Third, converting a character to a string and calling getBytes is a good approach, but you need to specify the encoding as an argument to getBytes. (For the character 'c', though, this isn't going to make a difference.)

I don't know what you are trying to do in your code, but you need to count how many '1' bits there are before the first '0' bit. Your code doesn't do anything like that.
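A minimal sketch of that bit-counting approach (structural check only: it does not reject overlong encodings or surrogate code points, which full UTF-8 validation also requires; the class and method names are just for illustration):

```java
public class Utf8Check {
    public static boolean isValidUtf8(byte[] bytes) {
        int i = 0;
        while (i < bytes.length) {
            int lead = bytes[i] & 0xFF;
            int len;
            if ((lead & 0x80) == 0)         len = 1; // 0xxxxxxx: single byte
            else if ((lead & 0xE0) == 0xC0) len = 2; // 110xxxxx: one continuation
            else if ((lead & 0xF0) == 0xE0) len = 3; // 1110xxxx: two continuations
            else if ((lead & 0xF8) == 0xF0) len = 4; // 11110xxx: three continuations
            else return false;                       // bare 10xxxxxx lead, or >4 leading 1s
            if (i + len > bytes.length) return false; // truncated sequence
            for (int j = 1; j < len; j++) {
                if ((bytes[i + j] & 0xC0) != 0x80) return false; // must be 10xxxxxx
            }
            i += len;
        }
        return true;
    }

    public static void main(String[] args) {
        // E2 82 AC is a complete 3-byte sequence; E2 82 alone is truncated
        System.out.println(isValidUtf8(new byte[] {(byte) 0xE2, (byte) 0x82, (byte) 0xAC})); // true
        System.out.println(isValidUtf8(new byte[] {(byte) 0xE2, (byte) 0x82}));              // false
    }
}
```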

UPDATE: I wouldn't actually bother trying to analyze the bit structure. I'd just feed the bytes to a CharsetDecoder and see if it chokes:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public static boolean checkUnicode(byte[] unicodeChar)
{
    try {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        // test only for malformed input; ignore unmappable Unicode characters
        decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.decode(ByteBuffer.wrap(unicodeChar));
        return true;
    }
    catch (CharacterCodingException ex)
    {
        return false;
    }
}

3 Comments

As I understand it, this is an interview question, and the assumption is hypothetical; it doesn't necessarily reflect reality.
@Op - Oh, I imagine that this interview question fully reflects reality. I'd expect that a job there will frequently involve working from ambiguous, sloppily worded specifications.
@Stephen - Yes, in an interview it is better to be more tactful than some of us are here on the forum. Perhaps part of the interview goals was to test whether the applicant knew enough to ask for clarification. :)

Re how to convert your characters to bytes, you can just cast directly:

byte[] b = new byte[] {(byte) 0xe2, (byte) 0x82, (byte) 0xac};

Or, as a shorthand:

byte[] b = {(byte) 0xe2, (byte) 0x82, (byte) 0xac};

7 Comments

@Ted: It doesn't? It seems to work well enough for me (I updated my answer to use '\u00ff'). Of course, it won't do what you expect from \u0100 onwards, but that's outside the scope of this exercise.
char is 2 bytes long. To get the bytes you have to: int i = <some unicode char>; byte a1 = (byte)((i & 0xff00) >> 8); byte a0 = (byte)i;
@Op De Cirkel: In the OP's question, the string being validated contains octets only (value < 256). The OP wanted a way to represent these octets in Java as bytes, given that he has chars (all with value < 256) on hand. In this circumstance, the truncation involved in casting to byte is appropriate.
@Chris - the UTF-8 encoding of the Unicode character is the two-byte sequence 0xC3 0xBF (binary 11000011 10111111). The single byte 0xff is not a legal UTF-8 encoding of anything in Unicode. The exercise specified Unicode. Also, where does it say that all characters are < 256?
@Ted: Obviously 0xff is not a legal byte to have in UTF-8. All the bytes in UTF-8 (not the code points they otherwise reconstitute to) are < 256.

You can use Character.toCodePoint() to get an int, and then int to byte should be easy.
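A small sketch of that route (the class name is illustrative): Character.toCodePoint combines a surrogate pair into a single code point, and the UTF-8 bytes can then be obtained by round-tripping through a String:

```java
import java.nio.charset.StandardCharsets;

public class CodePointDemo {
    public static void main(String[] args) {
        // surrogate pair for U+1F600 (a 4-byte character in UTF-8)
        char high = '\uD83D', low = '\uDE00';
        int cp = Character.toCodePoint(high, low);
        System.out.printf("U+%X%n", cp); // U+1F600

        byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 4
    }
}
```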

