Sat. Jan 21st, 2023

Features and uses of common character sets

ASCII

The original and most basic character set is ASCII – the American Standard Code for Information Interchange.

It allows for 128 different characters, which are encoded using the values 0 to 127 (they therefore require 7 bits per character).

The codes from 0 to 31 are control characters – non-printing instructions such as ‘new line’, ‘cursor left’ and ‘cursor down’. The remaining codes represent punctuation symbols, digits, and the lower case and upper case letters of the Roman alphabet.

The main advantage of ASCII is its compactness: only 7 bits per character. Its main disadvantage is that many languages worldwide do not use the Roman alphabet, and there are also symbols such as emoji to consider. Clearly, 128 characters is nowhere near enough to represent all of these.
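For example, here is a small Python sketch (my own illustration, not part of any spec): ord() and chr() convert between characters and their ASCII code values, and trying to encode a non-ASCII character shows the 128-character limit being hit.

    # Each ASCII character maps to a value in the range 0–127 (7 bits).
    print(ord("A"))               # 65
    print(chr(97))                # a
    print("Hi!".encode("ascii"))  # b'Hi!' – one byte per character

    # Characters outside the 128-character set cannot be encoded at all.
    try:
        "é".encode("ascii")
    except UnicodeEncodeError as error:
        print(error)              # 'ascii' codec can't encode character '\xe9' ...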

Unicode

Unicode is designed to keep the advantages of ASCII but also solve its shortcomings. How is this possible?

Unicode defines several encodings of different widths (Unicode Transformation Formats), such as:

  • UTF-8 (a variable-width encoding using between one and four 8-bit units per character)
  • UTF-16 (a variable-width encoding using one or two 16-bit units per character)
  • UTF-32 (a fixed-width encoding using 32 bits per character)

Unicode is described as a superset of ASCII – what this means is that it:

  1. Is much larger than ASCII
  2. Contains all of the ASCII characters, ensuring cross-compatibility

Given the general use of a byte as the basic unit of storage, ASCII text is generally stored as a series of 8-bit values, despite only actually requiring 7 bits per character. For these 128 characters, UTF-8 produces exactly the same bytes, so ASCII and UTF-8 are functionally identical over the ASCII range.
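A quick Python check of that compatibility (the sample text is my own): encoding plain English text as ASCII and as UTF-8 gives exactly the same bytes.

    text = "Hello, ASCII!"

    as_ascii = text.encode("ascii")
    as_utf8 = text.encode("utf-8")

    # For the characters 0–127 the two encodings produce identical bytes,
    # so any valid ASCII file is also valid UTF-8.
    print(as_ascii == as_utf8)      # True
    print(as_utf8.decode("ascii"))  # Hello, ASCII!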

The major advantage of Unicode is the sheer size and flexibility of the encoding scheme: there are more than enough code points available to represent every character from every language in the world.

The disadvantage is storage: as soon as you move beyond the ASCII range within UTF-8, or beyond UTF-8 to a wider encoding, each character needs more than one byte.
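To put rough numbers on that, here is a Python sketch (the sample strings are just illustrative) comparing how many bytes the same text needs under each encoding – the ‘-le’ codec variants are used so the counts are not inflated by a byte-order mark:

    samples = ["hello", "héllo", "日本語", "😀"]

    for text in samples:
        sizes = {
            "utf-8": len(text.encode("utf-8")),
            "utf-16": len(text.encode("utf-16-le")),
            "utf-32": len(text.encode("utf-32-le")),
        }
        print(text, sizes)

    # 'hello' needs 5 bytes in UTF-8 but 20 in UTF-32; the emoji needs
    # 4 bytes in every encoding.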

Interesting footnote: UTF-8 is self-describing. The leading bits of the first byte of a character tell a decoder how many bytes that character occupies, so the bytes of one character can never be confused with the start of another. A one-byte character starts with ‘0’, followed by the 7-bit ASCII value: 8 bits in total, identical to ASCII.

A two-byte character starts with ‘110’ – this indicates that one continuation byte follows, leaving 11 bits to define the character.

A three-byte character starts with ‘1110’ – indicating two continuation bytes and 16 usable bits.

Finally, a four-byte character starts with ‘11110’ – meaning three continuation bytes follow and 21 bits are used to define the character. Every continuation byte starts with ‘10’.
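You can see those prefixes directly with a short Python sketch (the characters are chosen arbitrarily) that prints the bit pattern of every byte in each character's UTF-8 encoding:

    # Print each character's UTF-8 bytes in binary to show the
    # '0', '110', '1110', '11110' prefixes and the '10' continuation bytes.
    for ch in ["A", "é", "€", "😀"]:
        encoded = ch.encode("utf-8")
        bits = " ".join(f"{byte:08b}" for byte in encoded)
        print(f"{ch}  {len(encoded)} byte(s): {bits}")

    # A  1 byte(s): 01000001
    # é  2 byte(s): 11000011 10101001
    # €  3 byte(s): 11100010 10000010 10101100
    # 😀  4 byte(s): 11110000 10011111 10011000 10000000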