Encodings

The following table shows how different ranges of Unicode characters are encoded to bytes using seven different types of encoding. Byte Order Marks (BOMs) were not emitted for simplicity. Some notes and observations:

  • ASCII encoding can only handle characters in the range U+0000 to U+007F. It replaces each out-of-range character with a question mark (U+003F), so information is lost.
  • UTF-7 produces unreadable bytes for characters outside the ASCII range (a run of modified Base64 between '+' and '-'), but the bytes can still be decoded back to the original characters without any information loss.
  • The last column of the UTF-16 rows shows the surrogate pairs used to encode characters outside the Basic Multilingual Plane (BMP).
  • The cleanest of the encodings is arguably UTF-32 BE, because every possible character is represented as a single 32-bit integer with the bytes in big-endian order (making them easier to read). UTF-32 encoding is rarely used in practice because it takes the most space, but it could be useful for exchanging characters between different exotic computer systems or alien lifeforms.
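The notes above can be checked with a short script. This is a minimal sketch in Python, which is an assumption, since the article does not say which tool produced its table. The codec names in `CODECS` are Python's, chosen to approximate the seven encodings, and the helper names `encode_all` and `round_trips` are hypothetical.

```python
# A few of the sample strings from the table's column headings.
SAMPLES = ["ABC", "èéê", "אבג", "😀😁😂"]

# Python codec names assumed to correspond to the seven encodings.
# The explicit -le / -be variants do not emit a byte order mark.
CODECS = ["ascii", "utf-7", "utf-8",
          "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]


def encode_all(text):
    """Return {codec: hex string}, replacing characters a codec cannot encode."""
    return {codec: text.encode(codec, errors="replace").hex() for codec in CODECS}


def round_trips(text, codec):
    """True if encoding and then decoding gives back the original text."""
    data = text.encode(codec, errors="replace")
    return data.decode(codec) == text


if __name__ == "__main__":
    for sample in SAMPLES:
        # Label each sample by its code points so the output stays ASCII-only.
        print(" ".join(f"U+{ord(ch):04X}" for ch in sample))
        for codec, hex_bytes in encode_all(sample).items():
            status = "lossless" if round_trips(sample, codec) else "LOSSY"
            print(f"  {codec:<10} {hex_bytes:<40} {status}")
```

Only the ASCII rows should be reported as lossy for the non-ASCII samples. Python replaces one question mark per code point, so the three emoji give three question marks here rather than the six in the table. The six are consistent with a runtime that counts UTF-16 code units, though that is an inference.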
| Encoding | ABC | èéê | אבג | ｨｩｪ | 😀😁😂 |
|---|---|---|---|---|---|
| ASCII | 41 42 43 | 3f3f3f (???) | 3f3f3f (???) | 3f3f3f (???) | 3f3f3f3f3f3f (??????) |
| UTF-7 | 41 42 43 | 2b41 4f67 4136 5144 712d | 2b42 6441 4630 5158 532d | 2b2f 326a 2f61 6639 712d | 2b32 4433 6541 4e67 3933 6748 5950 6434 432d |
| UTF-8 | 41 42 43 | c3a8 c3a9 c3aa | d790 d791 d792 | efbda8 efbda9 efbdaa | f09f9880 f09f9881 f09f9882 |
| UTF-16 LE | 4100 4200 4300 | e800 e900 ea00 | d005 d105 d205 | 68ff 69ff 6aff | 3dd8:00de 3dd8:01de 3dd8:02de |
| UTF-16 BE | 0041 0042 0043 | 00e8 00e9 00ea | 05d0 05d1 05d2 | ff68 ff69 ff6a | d83d:de00 d83d:de01 d83d:de02 |
| UTF-32 LE | 41000000 42000000 43000000 | e8000000 e9000000 ea000000 | d0050000 d1050000 d2050000 | 68ff0000 69ff0000 6aff0000 | 00f60100 01f60100 02f60100 |
| UTF-32 BE | 00000041 00000042 00000043 | 000000e8 000000e9 000000ea | 000005d0 000005d1 000005d2 | 0000ff68 0000ff69 0000ff6a | 0001f600 0001f601 0001f602 |
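The surrogate pairs in the UTF-16 rows follow from simple arithmetic on the code point. The sketch below is in Python, with a hypothetical helper name `to_surrogate_pair`. It subtracts 0x10000 and splits the remaining 20 bits into two 10-bit halves.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF is encoded as a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)   # first emoji in the table
print(f"{high:04x}:{low:04x}")           # prints d83d:de00
```

The result matches the UTF-16 BE cell for the first emoji. The UTF-16 LE cell holds the same two 16-bit values with the bytes of each swapped.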
Control Characters

Characters with code points U+0000 to U+001F are known as the C0 control code set. Those points have no character glyphs associated with them, but if you want to represent them in documentation then you can use their corresponding Unicode glyphs shown in the table below. The glyph column will only display correctly if the font your browser is using includes those characters.

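| Char Hex | Char Dec | Name | Label | Glyph Point | Glyph |
|---|---|---|---|---|---|
| 00 | 0 | NUL | NULL | U+2400 | ␀ |
| 01 | 1 | SOH | START OF HEADING | U+2401 | ␁ |
| 02 | 2 | STX | START OF TEXT | U+2402 | ␂ |
| 03 | 3 | ETX | END OF TEXT | U+2403 | ␃ |
| 04 | 4 | EOT | END OF TRANSMISSION | U+2404 | ␄ |
| 05 | 5 | ENQ | ENQUIRY | U+2405 | ␅ |
| 06 | 6 | ACK | ACKNOWLEDGE | U+2406 | ␆ |
| 07 | 7 | BEL | BELL | U+2407 | ␇ |
| 08 | 8 | BS | BACKSPACE | U+2408 | ␈ |
| 09 | 9 | HT | HORIZONTAL TABULATION | U+2409 | ␉ |
| 0A | 10 | LF | LINE FEED | U+240A | ␊ |
| 0B | 11 | VT | VERTICAL TABULATION | U+240B | ␋ |
| 0C | 12 | FF | FORM FEED | U+240C | ␌ |
| 0D | 13 | CR | CARRIAGE RETURN | U+240D | ␍ |
| 0E | 14 | SO | SHIFT OUT | U+240E | ␎ |
| 0F | 15 | SI | SHIFT IN | U+240F | ␏ |
| 10 | 16 | DLE | DATA LINK ESCAPE | U+2410 | ␐ |
| 11 | 17 | DC1 | DEVICE CONTROL ONE | U+2411 | ␑ |
| 12 | 18 | DC2 | DEVICE CONTROL TWO | U+2412 | ␒ |
| 13 | 19 | DC3 | DEVICE CONTROL THREE | U+2413 | ␓ |
| 14 | 20 | DC4 | DEVICE CONTROL FOUR | U+2414 | ␔ |
| 15 | 21 | NAK | NEGATIVE ACKNOWLEDGE | U+2415 | ␕ |
| 16 | 22 | SYN | SYNCHRONOUS IDLE | U+2416 | ␖ |
| 17 | 23 | ETB | END OF TRANSMISSION BLOCK | U+2417 | ␗ |
| 18 | 24 | CAN | CANCEL | U+2418 | ␘ |
| 19 | 25 | EM | END OF MEDIUM | U+2419 | ␙ |
| 1A | 26 | SUB | SUBSTITUTE | U+241A | ␚ |
| 1B | 27 | ESC | ESCAPE | U+241B | ␛ |
| 1C | 28 | FS | FILE SEPARATOR | U+241C | ␜ |
| 1D | 29 | GS | GROUP SEPARATOR | U+241D | ␝ |
| 1E | 30 | RS | RECORD SEPARATOR | U+241E | ␞ |
| 1F | 31 | US | UNIT SEPARATOR | U+241F | ␟ |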
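Because each glyph point in the table is U+2400 plus the control character's own value, the substitution is easy to automate. The following is a minimal Python sketch. The names `C0_TO_PICTURE` and `show_controls` are hypothetical, and it covers only U+0000 to U+001F.

```python
# Translation table: each C0 code point (0x00..0x1F) maps to U+2400 plus its value.
C0_TO_PICTURE = {code: 0x2400 + code for code in range(0x20)}


def show_controls(text):
    """Return text with C0 control characters replaced by their visible glyphs."""
    return text.translate(C0_TO_PICTURE)


visible = show_controls("name\tvalue\r\n")
# visible now holds U+2409 in place of the tab, and U+240D U+240A in place of CR LF.
```

As with the table, the result only displays correctly in a font that includes the U+2400 block.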
Latin-1 Supplement 0x80-0x9F

Code points U+0080-U+009F are defined as C1 control codes in the ISO standards. They originally had no display glyph, but most of them were 'borrowed' for use in the Windows CP-1252 encoding. The following table shows a mapping from the CP-1252 encoding to the Unicode equivalents. Note that the characters in the CP-1252 column may appear as different characters if your browser is set to use a language other than English. They exactly match the Unicode equivalents in my copy of Chrome set to use the English (Australia) language.

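| Dec | Hex | CP-1252 | Name | Unicode | Glyph |
|---|---|---|---|---|---|
| 128 | 0x80 | € | Euro sign | U+20AC | € |
| 129 | 0x81 |  | - | - | - |
| 130 | 0x82 | ‚ | Single low quotation | U+201A | ‚ |
| 131 | 0x83 | ƒ | Florin sign | U+0192 | ƒ |
| 132 | 0x84 | „ | Double low quotation | U+201E | „ |
| 133 | 0x85 | … | Horizontal ellipsis | U+2026 | … |
| 134 | 0x86 | † | Dagger | U+2020 | † |
| 135 | 0x87 | ‡ | Double dagger | U+2021 | ‡ |
| 136 | 0x88 | ˆ | Modifier letter circumflex accent | U+02C6 | ˆ |
| 137 | 0x89 | ‰ | Per mille sign | U+2030 | ‰ |
| 138 | 0x8A | Š | Latin capital letter S with caron | U+0160 | Š |
| 139 | 0x8B | ‹ | Single left pointing quotation mark | U+2039 | ‹ |
| 140 | 0x8C | Œ | Latin capital ligature OE | U+0152 | Œ |
| 141 | 0x8D |  | - | - | - |
| 142 | 0x8E | Ž | Latin capital letter Z with caron | U+017D | Ž |
| 143 | 0x8F |  | - | - | - |
| 144 | 0x90 |  | - | - | - |
| 145 | 0x91 | ‘ | Left single quotation mark | U+2018 | ‘ |
| 146 | 0x92 | ’ | Right single quotation mark | U+2019 | ’ |
| 147 | 0x93 | “ | Left double quotation mark | U+201C | “ |
| 148 | 0x94 | ” | Right double quotation mark | U+201D | ” |
| 149 | 0x95 | • | Bullet | U+2022 | • |
| 150 | 0x96 | – | En dash | U+2013 | – |
| 151 | 0x97 | — | Em dash | U+2014 | — |
| 152 | 0x98 | ˜ | Small tilde | U+02DC | ˜ |
| 153 | 0x99 | ™ | Trade mark sign | U+2122 | ™ |
| 154 | 0x9A | š | Latin small letter S with caron | U+0161 | š |
| 155 | 0x9B | › | Single right pointing quotation mark | U+203A | › |
| 156 | 0x9C | œ | Latin small ligature OE | U+0153 | œ |
| 157 | 0x9D |  | - | - | - |
| 158 | 0x9E | ž | Latin small letter Z with caron | U+017E | ž |
| 159 | 0x9F | Ÿ | Latin capital letter Y with diaeresis | U+0178 | Ÿ |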
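The mapping can be regenerated by decoding each byte on its own. This sketch uses Python's built-in `cp1252` codec, which is an assumption about tooling. Other runtimes may treat the five unassigned bytes differently, for example by passing them through as C1 controls. The function name `cp1252_c1_mapping` is hypothetical.

```python
def cp1252_c1_mapping():
    """Map each byte 0x80..0x9F to the code point cp1252 assigns it, or None."""
    mapping = {}
    for byte in range(0x80, 0xA0):
        try:
            mapping[byte] = ord(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            mapping[byte] = None          # no assignment in Python's codec
    return mapping


if __name__ == "__main__":
    for byte, code_point in cp1252_c1_mapping().items():
        target = "-" if code_point is None else f"U+{code_point:04X}"
        print(f"{byte} 0x{byte:02X} {target}")
```

Decoding the same bytes with the `latin-1` codec instead gives back U+0080 to U+009F unchanged, which is the C1 control interpretation described above.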