0

One of my friends got this interview question. In addition, he was told he could assume the characters were letters a to z (upper or lower case). I wrote the following, but I can't figure out how to use the assumption about the limited characters (a to z) the string contains. Am I using this assumption without realizing it or can I make use of it?

  public static String compress(String str){
    int count = 1;
    char c = str.charAt(0);
    StringBuffer result = new StringBuffer();

    for (int i = 1; i < str.length();i++){
      if (str.charAt(i) == c){
        count++;
      }
      else{
        String to_add = c + String.valueOf(count);
        result.append(to_add);
        count = 1;
        c = str.charAt(i);
      }
    }
    // last character
    String to_add = c + String.valueOf(count);
    result.append(to_add);

    String result_str = result.toString();

    // Check whether the compressed string is
    // actually smaller than the original one
    if (result_str.length() < str.length()){
      return result_str;
    }
    else{
      return str;
    }
  }
3
  • Knowing that there are limited characters a-z (26) makes it possible to encode 32 characters in 26 bytes without using more advanced forms of compression algorithms. Commented Dec 28, 2014 at 11:20
  • what should be output of mixed case - say AAAaaaBBBcc == 5A3B2C ? Commented Dec 28, 2014 at 11:33
  • @user1428716 It should be A3a3B3c2 Commented Dec 28, 2014 at 13:34

2 Answers 2

0

Assign each character to a number, eg a = 1, z = 26. So, to represent these 26 characters you need at least 5 bits.

You can now use 2 bytes (16 bits) to store a triplet of characters. This requires 1/3 less bytes than the initial one byte per character (if ascii). To store a triplet of characters read bits from your bytes (for example left to right).

  1. First five bits of the first byte will represent the first character
  2. The next three bits of the first byte, concatenated with the first two bits of the second byte represent the second
  3. the next five bits from second byte represent the third character
  4. there is one bit left (ignore it)

*To slightly improve in compression size, if your String's length % 3 = 1, then for the last character of your String you can use one byte only as you don't have another triplet.

**You can get if a specific bit is set on a byte using the algorithm from this post, which is:

public byte getBit(byte b, int position)
{
   return (b >> position) & 1;
}

***You can set a bit to a byte using the algorithms from this post, which are:

to set a bit (set it to one)

b = b | (1 << position);

To unset a bit (set it to zero):

b = b & ~(1 << position);

****Using maths (least common multiple of 5 and 8), you could even slightly improve in compression size if you used 5 bytes = 40bits, which can represent 8 characters (8x5=40).

Then you would store octets of characters and there are no bits to ignore now. For the last characters of your String, depending on (string size % 8), you could again use less bytes.

*****Using the last 5-byte approach you get 3/8 less size, which is better than 1/3 of the 3-byte approach.

Sign up to request clarification or add additional context in comments.

Comments

0

'a' to 'Z' is 2*26=52 distinct characters, and it fits in 6-bits (2^6=64). You could just pack the code-points into sextets.

OTOH, RLE (what you have coded) works only for repetitions. If you have input like abcde it would turn into 1a1b1c1d1e or something alike, which is highly inefficient and you can hardly call it compression.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.