Unicode

tags: Systems

The agreement between the two computers about the correspondence between letters and numbers is what Unicode standardizes.

In terms of Unicode, h is:

An abstract character named LATIN SMALL LETTER H.

This character has the corresponding number 0x68

Which is a code point in notation U+0068.

The first Unicode version 1.0 was published in October 1991 and had 7,161 characters “associated”. The latest version provides codes for 149,186 characters “associated”.
Total characters: 2¹⁶ + 2²⁰ - 2¹¹ = 1,112,064(~1mn) assignable code points.
- 2¹⁶ : Plane 0 (BMP) (U+0000 - U+FFFF)
- 2²⁰ : Plane 1-16 (astral planes/supplementary planes) (U+10000 - U+10FFFF)
- 2¹¹ : Code points from U+D800 - U+DFFF which are used to encode surrogate pairs in UTF-16

Codespace and blocks

The Unicode codespace is divided into 17 planes, numbered 0 to 16.
The Unicode Standard defines a codespace (a set of numerical values ranging from 0 through 10FFFF) called code points and denoted as U+0000 through U+10FFFF (See blocks.txt )
Plane 0 is the Basic Multilingual Plane (BMP)
- Single code unit in UTF-16 encoding
- Encoded in 1, 2 or 3 bytes in UTF-8.
Planes 1 through 16 are supplementary planes
- Accessed as surrogate pairs in UTF-16
- Encoded in four bytes in UTF-8
Within each plane, characters are allocated within named blocks of related characters.
- Characters required for a given script may be spread out over several different blocks.

Code units

How unicode is interpreted by the computer: code units

Code unit is a bit sequence used to encode each character within a given encoding form.

The character encoding is what transforms abstract code points into physical bits: code units.
Popular encodings are UTF-8, UTF-16 and UTF-32 (UTF = Unicode Transformation Format)

Encodings

UTF-16

Windows tends to use UTF-16
It uses 16 bit code units (16 bits = 2 bytes).
variable-length encoding
Surrogate pairs are a UTF-16 thing.
- The main hazard of UTF-16 is that it leads to people believing they are handling unicode correctly, when often they don’t properly decode surrogate pairs, etc.
- Surrogate pair is a representation for a single abstract character that consists of a sequence of code units of two 16-bit code units, where the first value of the pair is a high-surrogate code unit and the second value is a low-surrogate code unit.
Code points
- Code points from BMP are encoded using a single code unit of 16-bit
- Code points from astral planes are encoded using two code units of 16-bit each(U+D800 - U+DFFF). (surrogate pairs) This allows encoding of over one million additional characters.
  - High surrogate: Unicode code point from range U+D800 to U+DBFF gets combined with another unicode code point
  - Low surrogate: Unicode code point from range U+DC00 to U+DFFF to generate a whole new character

UTF-32

It uses 32 bit code units (32 bits = 4 bytes).
An encoding where “all characters [are] matched with just one code point”, and it’s UTF-32.
Pros
- character is the same length. (fixed length encoding). So you know right away where you can split it without cutting any letters in half.
Cons
- Takes too much space. i.e downloading that text takes four times as long compared to UTF-8.

UTF-8

Mac OS X and Linux use UTF-8.
variable-length encoding
It uses 8 bit code units (8 bits = 1 byte).
It uses sequences of up to four code units to encode one code point.
Pros
- UTF-8 has the advantage that it uses the least amount of space if your characters are mostly in the basic Latin alphabet and punctuation. Webpages use UTF-8 because of this.
Cons
- If you want to split the text into pieces, you have to be careful not to break up the text in the middle of a character.
- It’s also not possible to find the 100th character without going through the first 99 (since they could be different lengths).

Language usage

Javascript

Strings are represented fundamentally as sequences of UTF-16 code units.
every code unit is exact 16 bits long. (maximum of 2¹⁶, or 65,536 possible characters)
This character set is called the basic multilingual plane (BMP)
To avoid ambiguity, the two parts of the pair must be between 0xD800 and 0xDFFF, and these code units are not used to encode single-code-unit characters.
- split by UTF-16 code units and will separate surrogate pairs.
Surrogate pairs, combining marks and grapheme are tough to handle in JavaScript.
Where ECMAScript operations interpret String values, each element is interpreted as a single UTF-16 code unit. i.e The length of a String is the number of elements. always think of string in JavaScript as a sequence of code units.
Most of the JavaScript string methods are not Unicode-aware.
What every JavaScript developer should know about Unicode

Every element of a string is interpreted by the engine as a code unit.

The way a string is rendered does not provide a deterministic way to decide what code units (that represent code points) it contains.

console.log('cafe\u0301'); // => 'café'
console.log('café');       // => 'café'
// 'cafe\u0301' and 'café' literals have slightly different code units,
// but both are rendered the same sequence of symbols café.

Escape sequences

Escape sequences in strings are used to express code units based on code point numbers.

Hexadecimal escape sequence
Unicode escape sequence
Code point escape sequence(new)

mogoz