The following table shows how different ranges of Unicode characters are encoded to bytes using seven different types of encoding. Byte Order Marks (BOMs) were not emitted for simplicity. Some notes and observations:
- ASCII encoding can only handle characters in the range U+0000 to U+007F. It simply replaces each out-of-range character with a question mark (U+003F), so information is lost.
- UTF-7 produces unreadable bytes for characters outside the ASCII range, but, surprisingly, the bytes can still be decoded back to the original characters without any information loss.
- The last column of the UTF-16 rows shows the surrogate pairs used to encode characters outside the Basic Multilingual Plane (BMP).
- The cleanest of the encodings is arguably UTF-32 BE, because every possible character is clearly represented as a single 32-bit integer with the bytes in big-endian order (making them easier to read). UTF-32 encoding is rarely used in practice because it takes the most space, but it could be useful for exchanging characters between different exotic computer systems or alien lifeforms.
Encoding | ABC | èéê | אבג | ｨｩｪ | 😀😁😂 |
---|---|---|---|---|---|
ASCII | 41 42 43 | 3f3f3f (???) | 3f3f3f (???) | 3f3f3f (???) | 3f3f3f3f3f3f (??????) |
UTF-7 | 41 42 43 | 2b41 4f67 4136 5144 712d | 2b42 6441 4630 5158 532d | 2b2f 326a 2f61 6639 712d | 2b32 4433 6541 4e67 3933 6748 5950 6434 432d |
UTF-8 | 41 42 43 | c3a8 c3a9 c3aa | d790 d791 d792 | efbda8 efbda9 efbdaa | f09f9880 f09f9881 f09f9882 |
UTF-16 LE | 4100 4200 4300 | e800 e900 ea00 | d005 d105 d205 | 68ff 69ff 6aff | 3dd8:00de 3dd8:01de 3dd8:02de |
UTF-16 BE | 0041 0042 0043 | 00e8 00e9 00ea | 05d0 05d1 05d2 | ff68 ff69 ff6a | d83d:de00 d83d:de01 d83d:de02 |
UTF-32 LE | 41000000 42000000 43000000 | e8000000 e9000000 ea000000 | d0050000 d1050000 d2050000 | 68ff0000 69ff0000 6aff0000 | 00f60100 01f60100 02f60100 |
UTF-32 BE | 00000041 00000042 00000043 | 000000e8 000000e9 000000ea | 000005d0 000005d1 000005d2 | 0000ff68 0000ff69 0000ff6a | 0001f600 0001f601 0001f602 |
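Most of the rows above can be reproduced with a few lines of code. The following is a minimal Python sketch, not the tool that generated the table; the helper names `encode_hex` and `surrogate_pair` are made up for illustration. One difference to be aware of: Python's ASCII encoder with `errors="replace"` substitutes one `?` per code point, so the three emoji produce three question marks, whereas the six in the table suggest an encoder that works on UTF-16 code units.

```python
def encode_hex(text: str, codec: str) -> str:
    """Encode text with the named codec and return space-separated hex bytes."""
    # errors="replace" only matters when the codec cannot represent a
    # character (ASCII in this example); it becomes "?" (0x3f).
    return text.encode(codec, errors="replace").hex(" ")


def surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a surrogate pair")
    offset = code_point - 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10 bits.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


if __name__ == "__main__":
    # The halfwidth katakana sample is written with escapes (U+FF68..U+FF6A).
    samples = ["ABC", "èéê", "אבג", "\uff68\uff69\uff6a", "😀😁😂"]
    # The explicit -le / -be codec names do not emit a byte order mark.
    codecs = ["ascii", "utf-7", "utf-8",
              "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]
    for codec in codecs:
        for sample in samples:
            print(f"{codec:10} {sample}  {encode_hex(sample, codec)}")

    high, low = surrogate_pair(0x1F600)
    print(f"U+1F600 -> {high:04x}:{low:04x}")
```

The `surrogate_pair` helper shows where the `d83d:de00` style values in the UTF-16 rows come from: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves.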
Characters with code points U+0000 to U+001F are known as the C0 control code set. Those code points have no character glyphs associated with them, but if you want to represent them in documentation you can use the corresponding Unicode control picture glyphs (U+2400 to U+241F) shown in the table below. The glyph column will only display correctly if the font your browser is using includes those characters.
Char Hex | Char Dec | Name | Label | Glyph Point | Glyph |
---|---|---|---|---|---|
0 | 0 | NUL | NULL | U+2400 | ␀ |
1 | 1 | SOH | START OF HEADING | U+2401 | ␁ |
2 | 2 | STX | START OF TEXT | U+2402 | ␂ |
3 | 3 | ETX | END OF TEXT | U+2403 | ␃ |
4 | 4 | EOT | END OF TRANSMISSION | U+2404 | ␄ |
5 | 5 | ENQ | ENQUIRY | U+2405 | ␅ |
6 | 6 | ACK | ACKNOWLEDGE | U+2406 | ␆ |
7 | 7 | BEL | BELL | U+2407 | ␇ |
8 | 8 | BS | BACKSPACE | U+2408 | ␈ |
9 | 9 | HT | HORIZONTAL TABULATION | U+2409 | ␉ |
0A | 10 | LF | LINE FEED | U+240A | ␊ |
0B | 11 | VT | VERTICAL TABULATION | U+240B | ␋ |
0C | 12 | FF | FORM FEED | U+240C | ␌ |
0D | 13 | CR | CARRIAGE RETURN | U+240D | ␍ |
0E | 14 | SO | SHIFT OUT | U+240E | ␎ |
0F | 15 | SI | SHIFT IN | U+240F | ␏ |
10 | 16 | DLE | DATA LINK ESCAPE | U+2410 | ␐ |
11 | 17 | DC1 | DEVICE CONTROL ONE | U+2411 | ␑ |
12 | 18 | DC2 | DEVICE CONTROL TWO | U+2412 | ␒ |
13 | 19 | DC3 | DEVICE CONTROL THREE | U+2413 | ␓ |
14 | 20 | DC4 | DEVICE CONTROL FOUR | U+2414 | ␔ |
15 | 21 | NAK | NEGATIVE ACKNOWLEDGE | U+2415 | ␕ |
16 | 22 | SYN | SYNCHRONOUS IDLE | U+2416 | ␖ |
17 | 23 | ETB | END OF TRANSMISSION BLOCK | U+2417 | ␗ |
18 | 24 | CAN | CANCEL | U+2418 | ␘ |
19 | 25 | EM | END OF MEDIUM | U+2419 | ␙ |
1A | 26 | SUB | SUBSTITUTE | U+241A | ␚ |
1B | 27 | ESC | ESCAPE | U+241B | ␛ |
1C | 28 | FS | FILE SEPARATOR | U+241C | ␜ |
1D | 29 | GS | GROUP SEPARATOR | U+241D | ␝ |
1E | 30 | RS | RECORD SEPARATOR | U+241E | ␞ |
1F | 31 | US | UNIT SEPARATOR | U+241F | ␟ |
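Because each glyph sits at a fixed offset of 0x2400 from its control code, the substitution is easy to automate. Here is a minimal Python sketch; the function name `show_controls` is hypothetical, and leaving every other character untouched is simply a choice made for this example.

```python
def show_controls(text: str) -> str:
    """Replace each C0 control character (U+0000..U+001F) with its glyph."""
    # The glyph for control code N lives at U+2400 + N, as in the table above.
    return "".join(
        chr(0x2400 + ord(ch)) if ord(ch) < 0x20 else ch
        for ch in text
    )


if __name__ == "__main__":
    # CR, LF, TAB and NUL become visible symbols instead of invisible codes.
    print(show_controls("line one\r\n\tline two\x00"))
```

This can be handy when logging or documenting raw input, since the control codes become visible without changing the rest of the text.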
Code points U+0080-U+009F are defined as C1 control codes in the ISO standards. They originally had no display glyph, but most of them were 'borrowed' for use in the Windows CP-1252 encoding. The following table shows a mapping from the CP-1252 encoding to the Unicode equivalents. Note that the characters in the CP-1252 column may appear as different characters if your browser is set to use a language other than English. They exactly match the Unicode equivalents in my copy of Chrome set to use the English (Australia) language.
Dec | Hex | CP-1252 | Name | Unicode | Glyph |
---|---|---|---|---|---|
128 | 0x80 | | Euro sign | U+20AC | € |
129 | 0x81 | | - | - | - |
130 | 0x82 | | Single low-9 quotation mark | U+201A | ‚ |
131 | 0x83 | | Florin sign | U+0192 | ƒ |
132 | 0x84 | | Double low-9 quotation mark | U+201E | „ |
133 | 0x85 | | Horizontal ellipsis | U+2026 | … |
134 | 0x86 | | Dagger | U+2020 | † |
135 | 0x87 | | Double dagger | U+2021 | ‡ |
136 | 0x88 | | Modifier letter circumflex accent | U+02C6 | ˆ |
137 | 0x89 | | Per mille sign | U+2030 | ‰ |
138 | 0x8A | | Latin capital letter S with caron | U+0160 | Š |
139 | 0x8B | | Single left pointing quotation mark | U+2039 | ‹ |
140 | 0x8C | | Latin capital ligature OE | U+0152 | Œ |
141 | 0x8D | | - | - | - |
142 | 0x8E | | Latin capital letter Z with caron | U+017D | Ž |
143 | 0x8F | | - | - | - |
144 | 0x90 | | - | - | - |
145 | 0x91 | | Left single quotation mark | U+2018 | ‘ |
146 | 0x92 | | Right single quotation mark | U+2019 | ’ |
147 | 0x93 | | Left double quotation mark | U+201C | “ |
148 | 0x94 | | Right double quotation mark | U+201D | ” |
149 | 0x95 | | Bullet | U+2022 | • |
150 | 0x96 | | En dash | U+2013 | – |
151 | 0x97 | | Em dash | U+2014 | — |
152 | 0x98 | | Small tilde | U+02DC | ˜ |
153 | 0x99 | | Trade mark sign | U+2122 | ™ |
154 | 0x9A | | Latin small letter S with caron | U+0161 | š |
155 | 0x9B | | Single right pointing quotation mark | U+203A | › |
156 | 0x9C | | Latin small ligature OE | U+0153 | œ |
157 | 0x9D | | - | - | - |
158 | 0x9E | | Latin small letter Z with caron | U+017E | ž |
159 | 0x9F | | Latin capital letter Y with diaeresis | U+0178 | Ÿ |
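The mapping above can be checked programmatically. This sketch assumes Python's built-in `cp1252` codec, which treats the five unassigned bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) as undecodable; other implementations, such as web browsers, may instead pass them through as the corresponding C1 control codes. The function name `cp1252_to_unicode` is illustrative.

```python
from typing import Optional


def cp1252_to_unicode(byte_value: int) -> Optional[str]:
    """Return the character CP-1252 assigns to a byte, or None if unassigned."""
    try:
        return bytes([byte_value]).decode("cp1252")
    except UnicodeDecodeError:
        # Python's codec has no mapping for this byte.
        return None


if __name__ == "__main__":
    for b in range(0x80, 0xA0):
        ch = cp1252_to_unicode(b)
        if ch is None:
            print(f"{b} | 0x{b:02X} | - | -")
        else:
            print(f"{b} | 0x{b:02X} | U+{ord(ch):04X} | {ch}")
```

The rows printed with dashes correspond to the rows marked `-` in the table.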