From the course: Developing Unicode-Aware Applications in Go

Encoding text

- [Instructor] In the early days of computing, people faced an interesting challenge: how can you encode every written language in a computer? After a while, the solution that everybody now uses emerged, and it's called Unicode. Unicode is designed to support the use of text in all of the world's writing systems. Basically, Unicode is a big table, and this table assigns a number to every character. Some of them are not visible characters, but they still get a number. So let's scroll down a little bit more. Now we see punctuation marks and other things, and here we start seeing the English letters. The English letter A is 65, or in hexadecimal it's 41, and in binary it's 01000001. So now we have a big table, and for every language, for every character, we have a number.
If you look at Go's source code, you can see that the Unicode tables basically encode this information. So now we have version 15, and if you go to the history of this file, you're going to see it being upgraded from Unicode 13 to 15, and before that to 13, and to 12, et cetera, et cetera. So sometimes a new release of Go gives you a more up-to-date version of the Unicode tables.
So in practice, if you have a string like "I love Go", with a heart standing for "love", then 49 is I, 20 is the space, then these digits are for the heart, then 20 again for the second space, 47 is the capital G, and 6f is the lowercase o. When we talk about these, we have several names for the characters themselves. We people call them characters. In the Unicode specification, they are known as code points. And in Go we call them runes. And what we have below, the binary representation in bytes, is now the encoding, and specifically this one is the UTF-8 encoding.
So one problem is solved: for every letter we now have a number. But if we go back to the list of characters and go down a bit, way below English, we're going to see characters whose numbers take more than a single byte. Every digit here in hexadecimal is what is known as a nibble. A nibble is half a byte.
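To make the difference between code points and encoded bytes concrete, here is a minimal sketch that prints each rune of a string next to its UTF-8 bytes in hexadecimal. It is not the code shown in the video. The helper names `utf8Hex` and `runeCount` are hypothetical, and the heart is assumed to be U+2665 (BLACK HEART SUIT), since the transcript does not say which heart character appears on screen.

```go
package main

import "fmt"

// utf8Hex returns the UTF-8 bytes of s as hex, one byte (two nibbles) per group.
// Hypothetical helper, not part of any library.
func utf8Hex(s string) string {
	return fmt.Sprintf("% x", s)
}

// runeCount returns the number of code points (runes) in s.
func runeCount(s string) int {
	return len([]rune(s))
}

func main() {
	// Assumption: U+2665 stands in for the heart used in the course.
	text := "I ♥ Go"

	// Ranging over a string yields runes (code points), not bytes.
	for _, r := range text {
		// %c: the character, %U: its code point, % x: its UTF-8 bytes.
		fmt.Printf("%c %U % x\n", r, r, string(r))
	}

	fmt.Println("bytes:", len(text))       // length of the UTF-8 encoding
	fmt.Println("runes:", runeCount(text)) // number of code points
	fmt.Println("hex:  ", utf8Hex(text))
}
```

With this particular heart, the string has 6 runes but 8 bytes: the ASCII characters take one byte each, and the heart takes three bytes (e2 99 a5) in UTF-8.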
So every two digits are a single byte, and we see that this one is encoded as two bytes. And now we have another interesting issue: how do we encode these bytes over the wire? There are two main methods to do it. One is known as Big Endian and the other as Little Endian. If you take the number 1023, which is 03FF in hexadecimal, in Big Endian first we put the 03 and then the FF. But if you put it in Little Endian, first we put the FF and then the 03. So when we talk about Unicode, we usually talk about two things: giving every letter or character a number, and how we encode these numbers as a sequence of bytes.
Let's have a look at it in code. First, the encoding. If I have the text "Go", I can go over the characters inside it, and I'm going to print each character and also the number for the character. So Debug, Start Without Debugging, and let's have a look at the Debug Console. You see the capital G is 71 and the lowercase o is 111. So this is the encoding: every character has its own value.
When we talk about endianness, let me hide this for a minute, we can use the encoding/binary package to work with it. So I can print out the maximum value of a single byte, and this is going to be 255. Then I'm going to take a number which is bigger than a single byte, and I'm going to print it out. And then I'm going to use binary to write the byte representation, first in Big Endian and then in Little Endian, and print them out. When you give fmt.Printf the %x verb and the argument being passed is a byte slice, you get a string representation where every two digits are a single byte. So let's run this one: Start Without Debugging, and let's have a look at the Debug Console. We see the maximum byte is 255. 1023 is 3FF. In Big Endian, it's 03FF, so two bytes. And in Little Endian, first comes the FF and then the 03.
By the way, if you're curious about the terms Big Endian and Little Endian, they come from the book "Gulliver's Travels". Let's search here for Endian, and here's the description. There are two religious sects that are fighting a war. One of them cracks the soft-boiled egg from the little end; these are known as Little Endians. The others crack it from the big end; these are known as Big Endians. And this was adopted by computer geeks, who have a really weird sense of humor.
