Character Encoding Made Simple

Computers read bytes, but human beings don't; what we read are characters. So we use encoding standards like ASCII and Unicode to map characters to bytes.

An ASCII character fits in a single byte (8 bits), but most Unicode characters cannot. That is why encoding forms/schemes like UTF-8 and UTF-16 are used.
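For example, here is a minimal sketch in Python 3 (any language with Unicode support would show the same behaviour):

    # An ASCII character fits in a single byte...
    print("A".encode("ascii"))        # b'A' -> 1 byte

    # ...but a character outside ASCII cannot be encoded that way.
    try:
        "€".encode("ascii")
    except UnicodeEncodeError as e:
        print(e)                      # 'ascii' codec can't encode character '\u20ac' ...

    # Encoding forms like UTF-8 and UTF-16 handle it instead.
    print("€".encode("utf-8"))        # b'\xe2\x82\xac' -> 3 bytes
    print("€".encode("utf-16-be"))    # b'\x20\xac'     -> 2 bytes (one 16-bit code unit)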

ASCII

The original ASCII table is encoded on 7 bits, so it has 128 characters.

Nowadays most readers/editors use an “extended” ASCII table (such as ISO 8859-1), which is encoded on 8 bits and provides 256 characters (including Á, Ä, é, è and other characters useful for European languages, as well as some mathematical signs and other symbols).
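A quick Python 3 sketch of the difference (the exact characters available depend on which 8-bit code page you pick):

    # 7-bit ASCII covers code points 0-127.
    print("A".encode("ascii"))     # b'A' (code point 65 fits in 7 bits)

    # ISO 8859-1 (Latin-1) uses the full 8 bits, so accented letters fit in one byte...
    print("é".encode("latin-1"))   # b'\xe9' (code point 233)

    # ...but it is still limited to 256 values, so characters outside it fail.
    try:
        "Ω".encode("latin-1")
    except UnicodeEncodeError as e:
        print(e)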

Unicode

Unicode assigns every character in every language a unique number called a code point.
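In Python 3, ord() returns a character's code point, which makes this easy to see (a minimal sketch):

    # ord() returns a character's Unicode code point; hex() shows the familiar U+ form.
    for ch in ["A", "é", "€", "😀"]:
        print(ch, hex(ord(ch)))
    # A   0x41
    # é   0xe9
    # €   0x20ac
    # 😀  0x1f600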

UCS-2

Originally, Unicode was intended to have a fixed-width 16-bit encoding: UCS-2 (Universal Character Set in 2 Bytes). Early adopters of Unicode, like Java and Windows NT, built their libraries around 16-bit strings. Later, the scope of Unicode was expanded to include historical characters, which would require more than the 65,536 code points a 16-bit encoding would support. To allow the additional characters to be represented, Unicode Transformation Format (UTF) was introduced.

UTF-32

Unicode Transformation Format in 32 bits. It is a fixed-width encoding that uses exactly 32 bits (4 bytes) per Unicode code point. Each 32-bit value in UTF-32 represents one code point and is exactly equal to that code point’s numerical value.
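A small Python 3 sketch illustrating this (using the BOM-less big-endian variant so the bytes are easy to read):

    # In UTF-32, every code point occupies exactly 4 bytes,
    # and the 32-bit value equals the code point itself.
    for ch in ["A", "€", "😀"]:
        data = ch.encode("utf-32-be")          # big-endian, no BOM
        print(ch, data.hex(), len(data), "bytes")
    # A   00000041  4 bytes
    # €   000020ac  4 bytes
    # 😀  0001f600  4 bytes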

UTF-16

The encoding is variable-length: code points are encoded with one or two 16-bit code units. This makes UTF-16 more space-efficient than UTF-32, but 16 bits can’t represent all Unicode characters, so “surrogate pairs” were introduced. A surrogate pair uses two 16-bit code units to represent a single character, and a surrogate code unit can be detected by looking at its first 6 bits.
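A minimal Python 3 sketch, encoding one character from the Basic Multilingual Plane and one character above U+FFFF, then detecting the surrogate pair:

    # A BMP character needs one 16-bit code unit...
    print("€".encode("utf-16-be").hex())   # 20ac      -> one code unit

    # ...while a character above U+FFFF needs a surrogate pair (two code units).
    data = "😀".encode("utf-16-be")
    print(data.hex())                      # d83dde00  -> two code units

    # A code unit whose top 6 bits are 110110 (0xD800-0xDBFF) is a high surrogate,
    # and 110111 (0xDC00-0xDFFF) is a low surrogate.
    high = int.from_bytes(data[:2], "big")
    low = int.from_bytes(data[2:], "big")
    print(hex(high), hex(low))             # 0xd83d 0xde00
    print(0xD800 <= high <= 0xDBFF, 0xDC00 <= low <= 0xDFFF)   # True True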

UTF-8

Like UTF-16, UTF-8 is variable-length: code points are encoded with one, two, three, or four bytes. UTF-8 is the dominant character encoding for the World Wide Web, accounting for 88.7% of all Web pages as of March 2017.
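And the same experiment in UTF-8 (a minimal Python 3 sketch):

    # UTF-8 uses 1 to 4 bytes per code point; the first 128 code points
    # are encoded exactly like ASCII, in a single byte.
    for ch in ["A", "é", "€", "😀"]:
        data = ch.encode("utf-8")
        print(ch, data.hex(), len(data), "byte(s)")
    # A   41        1 byte(s)
    # é   c3a9      2 byte(s)
    # €   e282ac    3 byte(s)
    # 😀  f09f9880  4 byte(s)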
