Compressing a string in java with limited characters allowed

Question

One of my friends got this interview question. In addition, he was told he could assume the characters were letters a to z (upper or lower case). I wrote the following, but I can't figure out how to use the assumption about the limited characters (a to z) the string contains. Am I using this assumption without realizing it or can I make use of it?

  public static String compress(String str){
    int count = 1;
    char c = str.charAt(0);
    StringBuffer result = new StringBuffer();

    for (int i = 1; i < str.length();i++){
      if (str.charAt(i) == c){
        count++;
      }
      else{
        String to_add = c + String.valueOf(count);
        result.append(to_add);
        count = 1;
        c = str.charAt(i);
      }
    }
    // last character
    String to_add = c + String.valueOf(count);
    result.append(to_add);

    String result_str = result.toString();

    // Check whether the compressed string is
    // actually smaller than the original one
    if (result_str.length() < str.length()){
      return result_str;
    }
    else{
      return str;
    }
  }

Knowing that there are limited characters a-z (26) makes it possible to encode 32 characters in 26 bytes without using more advanced forms of compression algorithms. — mac
– mac, Commented Dec 28, 2014 at 11:20
what should be output of mixed case - say AAAaaaBBBcc == 5A3B2C ? — user1428716
– user1428716, Commented Dec 28, 2014 at 11:33

Community · Accepted Answer · 2017-05-23 12:05:52Z

Assign each character to a number, eg a = 1, z = 26. So, to represent these 26 characters you need at least 5 bits.

You can now use 2 bytes (16 bits) to store a triplet of characters. This requires 1/3 less bytes than the initial one byte per character (if ascii). To store a triplet of characters read bits from your bytes (for example left to right).

First five bits of the first byte will represent the first character
The next three bits of the first byte, concatenated with the first two bits of the second byte represent the second
the next five bits from second byte represent the third character
there is one bit left (ignore it)

*To slightly improve in compression size, if your String's length % 3 = 1, then for the last character of your String you can use one byte only as you don't have another triplet.

**You can get if a specific bit is set on a byte using the algorithm from this post, which is:

public byte getBit(byte b, int position)
{
   return (b >> position) & 1;
}

***You can set a bit to a byte using the algorithms from this post, which are:

to set a bit (set it to one)

b = b | (1 << position);

To unset a bit (set it to zero):

b = b & ~(1 << position);

****Using maths (least common multiple of 5 and 8), you could even slightly improve in compression size if you used 5 bytes = 40bits, which can represent 8 characters (8x5=40).

Then you would store octets of characters and there are no bits to ignore now. For the last characters of your String, depending on (string size % 8), you could again use less bytes.

*****Using the last 5-byte approach you get 3/8 less size, which is better than 1/3 of the 3-byte approach.

milan · Accepted Answer · 2014-12-29 13:33:19Z

0

'a' to 'Z' is 2*26=52 distinct characters, and it fits in 6-bits (2^6=64). You could just pack the code-points into sextets.

OTOH, RLE (what you have coded) works only for repetitions. If you have input like abcde it would turn into 1a1b1c1d1e or something alike, which is highly inefficient and you can hardly call it compression.

edited Dec 29, 2014 at 13:33

answered Dec 28, 2014 at 11:24

milan

2,5452 gold badges27 silver badges42 bronze badges

Collectives™ on Stack Overflow

Compressing a string in java with limited characters allowed

2 Answers 2

Comments

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related