TIL: Unicode and UTF-8

2023-10-06

I was curious today about how UTF-8 encoding actually works. It’s really interesting and not terribly complicated once you break it down. UTF-8 is backward compatible with the ASCII encoding and uses a variable number of bytes (one to four) to represent different characters. I’ll keep this brief and give a few examples:

Dec      Hex    Chr
63       3F     ?   <- ASCII goes from 0 to 127 (dec)
128161   1F4A1  💡  <- Unicode extends beyond 127; also written as a "codepoint": `U+1F4A1`, which is just the hex value
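You can poke at this from Python, where `ord()` returns a character’s code point and `str.encode()` returns its UTF-8 bytes. A minimal sketch using only the standard library:

```python
# ASCII characters keep their single-byte encoding in UTF-8
print(ord("?"), hex(ord("?")))    # 63 0x3f
print("?".encode("utf-8"))        # b'?'  (one byte)

# Characters beyond 127 get multi-byte sequences
print(ord("💡"), hex(ord("💡")))  # 128161 0x1f4a1
print("💡".encode("utf-8"))       # b'\xf0\x9f\x92\xa1'  (four bytes)
```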

Some symbols are built from two code points using something called “variation sequences”: http://unicode.org/faq/vs.html

Decimal: 9724   65039
Hex:     25FC   FE0F
Chr:     ◼️            <- black medium square (U+25FC) followed by variation selector-16 (U+FE0F)
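A quick way to see that ◼️ really is two code points (again just a sketch in plain Python):

```python
s = "◼️"  # U+25FC BLACK MEDIUM SQUARE + U+FE0F VARIATION SELECTOR-16
print(len(s))                    # 2 -> two code points, not one
print([hex(ord(c)) for c in s])  # ['0x25fc', '0xfe0f']
print(s.encode("utf-8"))         # b'\xe2\x97\xbc\xef\xb8\x8f'  (three bytes each)
```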

You can tell how many bytes a character uses by looking at the binary of its first byte. Let’s take 💡 as an example:

Decimal: 128161
Hex:     1F4A1
Bytes:   F0          9F         92         A1
Binary:  11110000    10011111   10010010   10100001
         |           |
         |           |> a leading 10 marks the continuation bytes in a sequence
         |
         |> the leading 11110 marks it as a 4-byte sequence
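
To make those leading bits concrete, here’s a small Python sketch (`utf8_seq_len` is just a helper name I made up for illustration) that reads the first byte’s high bits to work out the sequence length:

```python
def utf8_seq_len(first_byte: int) -> int:
    """Number of bytes in a UTF-8 sequence, based on the first byte's leading bits."""
    if first_byte >> 7 == 0b0:        # 0xxxxxxx -> ASCII, 1 byte
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx -> 2-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx -> 3-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx -> 4-byte sequence
        return 4
    raise ValueError("continuation byte (10xxxxxx) or invalid start byte")

data = "💡".encode("utf-8")
print([f"{b:08b}" for b in data])  # ['11110000', '10011111', '10010010', '10100001']
print(utf8_seq_len(data[0]))       # 4
```

A real decoder also validates that each continuation byte actually starts with 10 and that the sequence isn’t overlong, but the length check itself is just these leading bits.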