
In a related question about Unicode handling in .NET, Jon Skeet stated:

If you're happy ignoring surrogate pairs, UTF-16 has some nice properties, basically due to the size per code unit being constant. You know how much space to allocate for a given number of code units…

But how do you know what the code unit size is, or even whether an encoding uses a variable number of code units per code point?

At first I thought it could be determined easily by calling the GetMaxCharCount(nBytes) and GetMaxByteCount(nChars) methods of the System.Text.Encoding instance in question. For example, given 8 input bytes, we should get no more than 8, 4, and 2 decoded characters for ASCII / UTF-8, UTF-16 / UCS-2, and UTF-32 / UCS-4, respectively; and given 8 input characters, we should get 8 bytes for ASCII and some other number for the other encodings, which would reveal whether their character size is constant or variable. However, those methods return results that are hardly useful:

         GetMaxCharCount(8)   GetMaxByteCount(8)
-------------------------------------------------
ASCII         8 chars              9 bytes   <--- Leftover chars in ASCII? O_o
UTF-8         9 chars             27 bytes
UTF-16        5 chars             18 bytes
UTF-32        6 chars             36 bytes   <--- More chars than UTF-16? O_o
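For reference, the table above is simply the result of calling those two methods with a count of 8 on the four standard encodings; a minimal program to reproduce it (output formatting is mine):

    using System;
    using System.Text;

    class MaxCountTable
    {
        static void Main()
        {
            // Worst-case counts reported by each encoding for 8 input bytes
            // (GetMaxCharCount) and 8 input chars (GetMaxByteCount).
            var encodings = new (string Name, Encoding Enc)[]
            {
                ("ASCII",  Encoding.ASCII),
                ("UTF-8",  Encoding.UTF8),
                ("UTF-16", Encoding.Unicode),
                ("UTF-32", Encoding.UTF32),
            };

            Console.WriteLine("{0,-8}{1,20}{2,20}",
                "", "GetMaxCharCount(8)", "GetMaxByteCount(8)");

            foreach (var (name, enc) in encodings)
            {
                Console.WriteLine("{0,-8}{1,20}{2,20}",
                    name, enc.GetMaxCharCount(8), enc.GetMaxByteCount(8));
            }
        }
    }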

This behavior is intentional, though, as their documentation clearly says:

Note that GetMaxCharCount considers the worst case for leftover bytes from a previous encoder operation. For most code pages, passing a value of 0 to this method retrieves values greater than or equal to 1. GetMaxCharCount(N) is not necessarily the same value as N * GetMaxCharCount(1).

Note that GetMaxByteCount considers potential leftover surrogates from a previous decoder operation. Because of the decoder, passing a value of 1 to the method retrieves 2 for a single-byte encoding, such as ASCII. You should use the IsSingleByte property if this information is necessary. GetMaxByteCount(N) is not necessarily the same value as N * GetMaxByteCount(1).
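In other words, the worst case always assumes leftover state from a previous encoder or decoder operation, which is easy to confirm; the expected values in the comments come from the quoted documentation and from extrapolating the pattern in the table above:

    using System;
    using System.Text;

    class WorstCaseDemo
    {
        static void Main()
        {
            Console.WriteLine(Encoding.ASCII.GetMaxByteCount(1));    // 2, per the docs quoted above
            Console.WriteLine(Encoding.ASCII.IsSingleByte);          // True
            Console.WriteLine(Encoding.Unicode.GetMaxByteCount(1));  // 4 = (1 + 1) * 2, matching 18 = (8 + 1) * 2 in the table
            Console.WriteLine(Encoding.Unicode.IsSingleByte);        // False
        }
    }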

What is not so clear is how those (or other?) functions can be applied to determining the code unit size dynamically, rather than from a hardcoded lookup table covering a limited set of encodings. The only viable rule I found is “if IsSingleByte, then the code unit size is 1 byte and the character size is constant” (sketched below), but if the problem concerned single-byte encodings only, it would not need solving at all. So what is the general solution for arbitrary encodings?
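That partial rule boils down to something like the following (the helper name is mine), which is clearly not a general answer:

    using System.Text;

    static class EncodingProperties
    {
        // A single-byte encoding has a 1-byte code unit and a constant
        // character size; everything else remains undetermined here,
        // which is exactly the open question.
        public static int? TryGetFixedCodeUnitSize(Encoding encoding) =>
            encoding.IsSingleByte ? (int?)1 : null;
    }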

  • For IsSingleByte=False encodings, the only way I can think of is to loop through every Unicode code point (or maybe just a sampling of the more commonly used code points) and encode those using Encoding.GetBytes(), then analyze the bytes. For most encodings, the encoded forms of the ASCII characters U+0000–U+007F will tell you the code unit size (EBCDIC encodings might be weird), and then you can see whether the non-ASCII code points encode to larger multiples of that size. Commented Jul 10, 2015 at 23:51
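A rough sketch of the probing approach described in this comment might look like the following (the class name, the sampled code points, and the use of exception fallbacks are arbitrary choices, and EBCDIC-style code pages could indeed break the one-code-unit-per-ASCII-character assumption):

    using System;
    using System.Linq;
    using System.Text;

    static class CodeUnitProbe
    {
        // Encode a sample of code points and infer the code unit size from the
        // ASCII range U+0020..U+007E, then check whether non-ASCII samples
        // also fit into exactly one code unit.
        public static (int CodeUnitSize, bool LooksFixedWidth) Probe(Encoding encoding)
        {
            // Clone with exception fallbacks so unmappable characters throw
            // instead of being silently replaced.
            Encoding enc = Encoding.GetEncoding(
                encoding.CodePage,
                EncoderFallback.ExceptionFallback,
                DecoderFallback.ExceptionFallback);

            // For most encodings, each printable ASCII character takes
            // exactly one code unit.
            int codeUnitSize = Enumerable.Range(0x20, 0x5F)   // U+0020..U+007E
                .Select(cp => enc.GetByteCount(char.ConvertFromUtf32(cp)))
                .Min();

            // Arbitrary non-ASCII samples: Latin-1, Cyrillic, CJK, supplementary plane.
            int[] samples = { 0x00E9, 0x0416, 0x4E2D, 0x1F600 };
            bool looksFixed = samples.All(cp =>
            {
                try
                {
                    return enc.GetByteCount(char.ConvertFromUtf32(cp)) == codeUnitSize;
                }
                catch (EncoderFallbackException)
                {
                    return true; // not representable in this encoding; inconclusive
                }
            });

            return (codeUnitSize, looksFixed);
        }
    }

Under those assumptions, UTF-32 should come back as a fixed 4-byte code unit, while UTF-16 and UTF-8 should be flagged as variable width by the supplementary-plane and accented-Latin samples, respectively.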
