
I am trying to write a simple program for this interview question:

Write a function that checks for a valid Unicode byte sequence. A Unicode sequence is encoded as:

  • the first byte indicates the number of subsequent bytes ('11110000' means 4 subsequent data bytes)
  • data bytes start with '10xxxxxx'
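For concreteness, this is what those bit patterns look like for a real multi-byte character, using the platform's own UTF-8 encoder (a sketch; the class name is made up, and note that in standard UTF-8 the run of leading 1s gives the *total* length of the sequence, not just the count of subsequent bytes):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Peek {
    public static void main(String[] args) {
        // '\u20AC' is the euro sign, which UTF-8 encodes as three bytes
        byte[] b = "\u20AC".getBytes(StandardCharsets.UTF_8);
        for (byte x : b) {
            System.out.println(Integer.toBinaryString(x & 0xFF));
        }
        // prints:
        // 11100010   <- lead byte: three leading 1s, so a 3-byte sequence
        // 10000010   <- continuation byte, starts with 10
        // 10101100   <- continuation byte, starts with 10
    }
}
```

(`StandardCharsets` is Java 7+; on older JDKs use `getBytes("UTF-8")` and handle the checked exception.)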

public static void main(String[] args)
{
    System.out.println(checkUnicode(new byte[] {(byte) 'c'}));
}

    /**
     * Write a function that checks for a valid Unicode byte sequence. A Unicode
     * sequence is encoded as: the first byte indicates the number of subsequent
     * bytes ('11110000' means 4 subsequent data bytes); data bytes start with
     * '10xxxxxx'.
     *
     * @param unicodeChar the byte sequence to validate
     * @return true if the sequence is valid, false otherwise
     */
public static boolean checkUnicode(byte[] unicodeChar)
{
    byte b = unicodeChar[0];
    int len = 0;

    int temp = (int) b << 1;
    while ((int) temp << 1 == 0)
    {
        len++;
    }
    System.out.println(len);

    if (unicodeChar.length == len)
    {
        for (int i = 1; i < len; i++)
        {
            // Check if the most significant 2 bits in the byte are '10'
            // 0xC0 is 11000000 in binary
            // 10000000 in binary is 128 in decimal
            if (((int) unicodeChar[i] & 0xC0) == 128)
            {
                continue;
            }
            else
            {
                return false;
            }
        }
        return true;
    }
    else
    {
        return false;
    }
}

The output I get is:
99
false  

Changed the conversion from char to byte array based on Chris Jester-Young's comment.

Can someone point me in the right direction?

Thanks

Made some modifications based on input from Ted Hopp.

P.S.: I got the question from some forum, and I think it wasn't posted there correctly; however, I still decided to solve it and use it as is, to avoid obfuscating it further, since I did not understand it completely either!

  • @Dani: It refers to the number of leading 1s. The OP is basically asked to validate the well-formedness of a UTF-8 string. Commented Jun 5, 2011 at 3:27
  • Why shove a unary number into a binary format? Commented Jun 5, 2011 at 3:33
  • @Dani: What? (You should read up on how UTF-8 works, then you will understand the question better.) Commented Jun 5, 2011 at 3:34
  • Could you provide some UTF-8 strings in binary string format so we can test our solutions? Commented Jun 5, 2011 at 3:54
  • If this is a real interview question ... rather than just practice ... you shouldn't be asking for help! Commented Jun 5, 2011 at 4:34

4 Answers


Here's an enterprise-level solution for your enterprise-level job:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.BitSet;

public static void main(String[] args) {
    if (args.length == 0 || args[0] == null || (args[0] = args[0].trim()).isEmpty()) {
        System.out.println("No argument passed or argument empty!");
        return;
    }

    String arg = args[0];
    System.out.println("arg: " + arg + ", arg len: " + arg.length());

    BitSet bs = new BitSet(arg.length());
    for (int i = 0; i < arg.length(); i++) {
        if (arg.charAt(i) == '1') {
            bs.set(i, true); 
        }
    }
    ByteBuffer bb = ByteBuffer.wrap(bs.toByteArray());
    Charset cs = Charset.forName("UTF-8");
    CharsetDecoder csd = cs.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);

    try {
        CharBuffer cb = csd.decode(bb);
        String uns = cb.toString();
        System.out.println("Got unicode string of len " + uns.length() + ": " + uns + " from " + arg + " -- no errors!");
    } catch (CharacterCodingException cce) {
        System.out.println("Invalid UTF-8 unicode string! " + cce.getMessage());
    }
}

Verification:

public static void test() {
    StringBuilder sb = new StringBuilder();
    byte[] byt = "stupid interview".getBytes();
    BitSet byt1 = fromByteArray(byt);
    for (int i = 0; i < byt1.size(); i++) {
        sb.append(byt1.get(i) ? "1" : "0");
    }
    String[] st = new String[] { sb.toString() };
    main(st);
}

public static BitSet fromByteArray(byte[] bytes) {
    BitSet bits = new BitSet();
    for (int i=0; i<bytes.length*8; i++) {
        if ((bytes[bytes.length-i/8-1]&(1<<(i%8))) > 0) {
            bits.set(i);
        }
    }
    return bits;
}

Output:

11001110001011101010111000001110100101100010011000000100100101100111011000101110101001100100111001101110100101101010011011101110
arg: 11001110001011101010111000001110100101100010011000000100100101100111011000101110101001100100111001101110100101101010011011101110, arg len: 128
{0, 1, 4, 5, 6, 10, 12, 13, 14, 16, 18, 20, 21, 22, 28, 29, 30, 32, 35, 37, 38, 42, 45, 46, 53, 56, 59, 61, 62, 65, 66, 67, 69, 70, 74, 76, 77, 78, 80, 82, 85, 86, 89, 92, 93, 94, 97, 98, 100, 101, 102, 104, 107, 109, 110, 112, 114, 117, 118, 120, 121, 122, 124, 125, 126}
Got unicode string of len 16: stupid interview from 11001110001011101010111000001110100101100010011000000100100101100111011000101110101001100100111001101110100101101010011011101110 -- no errors!

3 Comments

For some reason I am unable to see BitSet.valueOf in the source code for java.util.BitSet, so I'm unable to verify this. I will accept the solution once I get it to work. The question was asked by Google, per the forum. I myself found it too contrived: I would have to look up the APIs and conversions a few times online before I could get it to work, so I wonder how I could answer it in an interview!
Oh, valueOf is in 1.7. Oops. They seem to be here: java2s.com/Open-Source/Java-Document/6.0-JDK-Core/…. Guess I took these methods for granted since I'm using 1.7. Just stumbled upon them :)
Anyway, if you want to avoid it, you'll need to create a ByteBuffer large enough for the conversion (ceil(input length / 8)), and then do the bitwise operations to build each byte as you fill the ByteBuffer: divide (for the byte number) and modulo (for the position inside the byte).
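A sketch of that manual packing, for pre-1.7 JDKs without BitSet.valueOf/toByteArray (it assumes MSB-first bit order within each byte, i.e. the string's leftmost bit becomes the byte's high bit, which is *not* the convention BitSet.toByteArray uses; the class and method names are made up):

```java
public class BitPacker {
    static byte[] packBits(String bits) {
        byte[] out = new byte[(bits.length() + 7) / 8]; // ceil(len / 8) bytes
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') {
                // byte number by division, bit position inside the byte by modulo
                out[i / 8] |= (byte) (1 << (7 - i % 8));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "11000011" packs into the single byte 0xC3
        System.out.printf("%02X%n", packBits("11000011")[0] & 0xFF);
    }
}
```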

First, the documentation of UTF-8 provided in the question is wrong. There is no such thing as "a valid Unicode byte sequence" without specifying the encoding; a safe assumption is that they meant UTF-8. Second (and more important), 11110000 does not indicate 4 more bytes of data: the four "1" bits before the first "0" bit indicate a total of 4 bytes (that is, 3 subsequent bytes, not 4, each starting with "10"). The rules are described well in the Wikipedia article on UTF-8.

Third, converting a character to a string and calling getBytes is a good approach, but you need to specify the encoding as an argument to getBytes. (For the character 'c', though, this isn't going to make a difference.)

I don't know what you are trying to do in your code, but you need to count how many '1' bits there are before the first '0' bit. Your code doesn't do anything like that.
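A minimal sketch of that bit-counting approach (structural check only: it does not reject overlong encodings or surrogate code points, which full UTF-8 validation also requires; the class and method names are just for illustration):

```java
public class Utf8Check {
    public static boolean isValidUtf8(byte[] bytes) {
        int i = 0;
        while (i < bytes.length) {
            int lead = bytes[i] & 0xFF;
            int len;
            if ((lead & 0x80) == 0)         len = 1; // 0xxxxxxx: single byte
            else if ((lead & 0xE0) == 0xC0) len = 2; // 110xxxxx: one continuation
            else if ((lead & 0xF0) == 0xE0) len = 3; // 1110xxxx: two continuations
            else if ((lead & 0xF8) == 0xF0) len = 4; // 11110xxx: three continuations
            else return false;                       // bare 10xxxxxx lead, or >4 leading 1s
            if (i + len > bytes.length) return false; // truncated sequence
            for (int j = 1; j < len; j++) {
                if ((bytes[i + j] & 0xC0) != 0x80) return false; // must be 10xxxxxx
            }
            i += len;
        }
        return true;
    }

    public static void main(String[] args) {
        // E2 82 AC is a complete 3-byte sequence; E2 82 alone is truncated
        System.out.println(isValidUtf8(new byte[] {(byte) 0xE2, (byte) 0x82, (byte) 0xAC})); // true
        System.out.println(isValidUtf8(new byte[] {(byte) 0xE2, (byte) 0x82}));              // false
    }
}
```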

UPDATE: I wouldn't actually bother trying to analyze the bit structure. I'd just feed the bytes to a CharsetDecoder and see if it chokes:

import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public static boolean checkUnicode(byte[] unicodeChar)
{
    try {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        // test only for malformed input; ignore unmappable Unicode characters
        decoder.onUnmappableCharacter(CodingErrorAction.IGNORE);
        decoder.onMalformedInput(CodingErrorAction.REPORT);
        decoder.decode(ByteBuffer.wrap(unicodeChar));
        return true;
    }
    catch (CharacterCodingException ex)
    {
        return false;
    }
}

3 Comments

As I understand it, this is an interview question, and the assumption is hypothetical; it doesn't necessarily reflect reality.
@Op - Oh, I imagine that this interview question fully reflects reality. I'd expect that a job there will frequently involve working from ambiguous, sloppily worded specifications.
@Stephen - Yes, in an interview it is better to be more tactful than some of us are here on the forum. Perhaps part of the interview goals was to test whether the applicant knew enough to ask for clarification. :)

Re how to convert your characters to bytes, you can just cast directly:

byte[] b = new byte[] {(byte) 0xe2, (byte) 0x82, (byte) 0xac};

Or, as a shorthand:

byte[] b = {(byte) 0xe2, (byte) 0x82, (byte) 0xac};

7 Comments

@Ted: It doesn't? It seems to work well enough for me (I updated my answer to use '\u00ff'). Of course, it won't do what you expect from \u0100 onwards, but that's outside the scope of this exercise.
char is 2 bytes long. To get the bytes you have to: int i = <some unicode char>; byte a1 = (byte)((i & 0xff00) >> 8); byte a0 = (byte)i;
@Op De Cirkel: In the OP's question, the string being validated contains octets only (value < 256). The OP wanted a way to represent these octets in Java as bytes, given that he has chars (all with value < 256) on hand. In this circumstance, the truncation involved in casting to byte is appropriate.
@Chris - the UTF-8 encoding of the Unicode character is the two-byte sequence 0xC3 0xBF (binary 11000011 10111111). The single byte 0xff is not a legal UTF-8 encoding of anything in Unicode. The exercise specified Unicode. Also, where does it say that all characters are < 256?
@Ted: Obviously 0xff is not a legal byte to have in UTF-8. All the bytes in UTF-8 (not the code points they otherwise reconstitute to) are < 256.

You can use Character.toCodePoint() to get an int, and then int to byte should be easy.
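A small sketch of that route (the class name is illustrative): Character.toCodePoint combines a surrogate pair into a single code point, and the UTF-8 bytes can then be obtained by round-tripping through a String:

```java
import java.nio.charset.StandardCharsets;

public class CodePointDemo {
    public static void main(String[] args) {
        // surrogate pair for U+1F600 (a 4-byte character in UTF-8)
        char high = '\uD83D', low = '\uDE00';
        int cp = Character.toCodePoint(high, low);
        System.out.printf("U+%X%n", cp); // U+1F600

        byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length); // 4
    }
}
```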

