Testing - Text Encodings

Encodings

The following table shows how different ranges of Unicode characters are encoded to bytes using seven different types of encoding. Byte Order Marks (BOMs) were not emitted for simplicity. Some notes and observations:

ASCII encoding can only handle characters in the range U+0000 to U+007F. It simply replaces out-of-range characters with a question mark (U+003F), so information is lost.
UTF-7 produces bytes that look incomprehensible for characters outside the ASCII range (they are a base64 form of the UTF-16 code units, wrapped in "+" and "-"), but the bytes can be decoded back to the original characters without any information loss.
In the UTF-16 rows, the last column shows the surrogate pairs used to encode characters outside the Basic Multilingual Plane (BMP).
The cleanest of the encodings is arguably UTF-32 BE, because every possible character is represented as a single 32‑bit integer with the bytes in big-endian order (making them easier to read). UTF-32 encoding is rarely used in practice because it takes the most space, but it could be useful for exchanging characters between different exotic computer systems or alien lifeforms.

	ABC	èéê	אבג	ｨｩｪ	😀😁😂
ASCII	41 42 43	3f3f3f (???)	3f3f3f (???)	3f3f3f (???)	3f3f3f3f3f3f (??????)
UTF-7	41 42 43	2b41 4f67 4136 5144 712d	2b42 6441 4630 5158 532d	2b2f 326a 2f61 6639 712d	2b32 4433 6541 4e67 3933 6748 5950 6434 432d
UTF-8	41 42 43	c3a8 c3a9 c3aa	d790 d791 d792	efbda8 efbda9 efbdaa	f09f9880 f09f9881 f09f9882
UTF-16 LE	4100 4200 4300	e800 e900 ea00	d005 d105 d205	68ff 69ff 6aff	3dd8:00de 3dd8:01de 3dd8:02de
UTF-16 BE	0041 0042 0043	00e8 00e9 00ea	05d0 05d1 05d2	ff68 ff69 ff6a	d83d:de00 d83d:de01 d83d:de02
UTF-32 LE	41000000 42000000 43000000	e8000000 e9000000 ea000000	d0050000 d1050000 d2050000	68ff0000 69ff0000 6aff0000	00f60100 01f60100 02f60100
UTF-32 BE	00000041 00000042 00000043	000000e8 000000e9 000000ea	000005d0 000005d1 000005d2	0000ff68 0000ff69 0000ff6a	0001f600 0001f601 0001f602
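
The table was not necessarily produced this way, but most of its cells can be reproduced with a short script. The following is a minimal sketch using Python's built-in codecs; the helper names `hex_cells` and `surrogate_pair` are made up for illustration. The ASCII row is left out because Python's "replace" error handler substitutes one question mark per character, so the emoji cell would not match the six question marks shown above. Surrogate pairs are printed here without the colon separator used in the table.

```python
# Sketch: reproduce some cells of the encoding table with Python's codecs.
# The "-le" and "-be" codec names do not emit a byte order mark.
SAMPLES = ["ABC", "èéê", "אבג", "ｨｩｪ", "😀😁😂"]
ENCODINGS = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"]


def hex_cells(text, encoding):
    """Encode each character separately and join the hex strings with spaces."""
    return " ".join(ch.encode(encoding).hex() for ch in text)


def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


if __name__ == "__main__":
    for encoding in ENCODINGS:
        print(encoding, *(hex_cells(s, encoding) for s in SAMPLES), sep="\t")

    # UTF-7 output looks opaque, but decoding it restores the original text.
    utf7_bytes = "èéê".encode("utf-7")
    print(utf7_bytes, utf7_bytes.decode("utf-7"))

    # The surrogate pair for U+1F600, matching the UTF-16 BE emoji cell.
    high, low = surrogate_pair(0x1F600)
    print(f"{high:04x}:{low:04x}")
```

Encoding one character at a time keeps the output aligned with the table's space-separated cells.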

Control Characters

Characters with code points U+0000 to U+001F are known as the C0 control code set. Those code points have no glyphs of their own, but if you want to represent them in documentation, you can use the corresponding Unicode glyphs shown in the table below. The glyph column will only display correctly if the font your browser is using includes those characters.

Char Hex	Char Dec	Name	Label	Glyph Point	Glyph
0	0	NUL	NULL	U+2400	␀
1	1	SOH	START OF HEADING	U+2401	␁
2	2	STX	START OF TEXT	U+2402	␂
3	3	ETX	END OF TEXT	U+2403	␃
4	4	EOT	END OF TRANSMISSION	U+2404	␄
5	5	ENQ	ENQUIRY	U+2405	␅
6	6	ACK	ACKNOWLEDGE	U+2406	␆
7	7	BEL	BELL	U+2407	␇
8	8	BS	BACKSPACE	U+2408	␈
9	9	HT	HORIZONTAL TABULATION	U+2409	␉
0A	10	LF	LINE FEED	U+240A	␊
0B	11	VT	VERTICAL TABULATION	U+240B	␋
0C	12	FF	FORM FEED	U+240C	␌
0D	13	CR	CARRIAGE RETURN	U+240D	␍
0E	14	SO	SHIFT OUT	U+240E	␎
0F	15	SI	SHIFT IN	U+240F	␏
10	16	DLE	DATA LINK ESCAPE	U+2410	␐
11	17	DC1	DEVICE CONTROL ONE	U+2411	␑
12	18	DC2	DEVICE CONTROL TWO	U+2412	␒
13	19	DC3	DEVICE CONTROL THREE	U+2413	␓
14	20	DC4	DEVICE CONTROL FOUR	U+2414	␔
15	21	NAK	NEGATIVE ACKNOWLEDGE	U+2415	␕
16	22	SYN	SYNCHRONOUS IDLE	U+2416	␖
17	23	ETB	END OF TRANSMISSION BLOCK	U+2417	␗
18	24	CAN	CANCEL	U+2418	␘
19	25	EM	END OF MEDIUM	U+2419	␙
1A	26	SUB	SUBSTITUTE	U+241A	␚
1B	27	ESC	ESCAPE	U+241B	␛
1C	28	FS	FILE SEPARATOR	U+241C	␜
1D	29	GS	GROUP SEPARATOR	U+241D	␝
1E	30	RS	RECORD SEPARATOR	U+241E	␞
1F	31	US	UNIT SEPARATOR	U+241F	␟
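
The glyph code points in the table follow a simple pattern: each one is the control code plus 0x2400. As a rough sketch of how that could be used to make control characters visible in logs or documentation (the function names `control_picture` and `make_visible` are hypothetical):

```python
def control_picture(code):
    """Return the visible glyph for a C0 control code (0x00 to 0x1F)."""
    if not 0x00 <= code <= 0x1F:
        raise ValueError(f"not a C0 control code: {code:#x}")
    # The glyphs start at U+2400 and are laid out in the same order as the codes.
    return chr(0x2400 + code)


def make_visible(text):
    """Replace every C0 control character in text with its visible glyph."""
    return "".join(
        control_picture(ord(ch)) if ord(ch) <= 0x1F else ch for ch in text
    )


if __name__ == "__main__":
    print(make_visible("one\ttwo\r\n"))
```

As with the table, the result only displays properly in a font that includes these glyphs.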

Latin-1 Supplement 0x80-0x9F

Code points U+0080-U+009F are defined as C1 control codes in the ISO standards. They originally had no display glyph, but most of them were 'borrowed' for use in the Windows CP-1252 encoding. The following table shows a mapping from the CP-1252 encoding to the Unicode equivalents. Note that the characters in the CP-1252 column may appear as different characters if your browser is set to use a language other than English. They exactly match the Unicode equivalents in my copy of Chrome set to use the English (Australia) language.

Dec	Hex	CP-1252	Name	Unicode	Glyph
128	0x80	€	Euro sign	U+20AC	€
129	0x81		-	-	-
130	0x82	‚	Single low quotation	U+201A	‚
131	0x83	ƒ	Florin sign	U+0192	ƒ
132	0x84	„	Double low quotation	U+201E	„
133	0x85	…	Horizontal ellipsis	U+2026	…
134	0x86	†	Dagger	U+2020	†
135	0x87	‡	Double dagger	U+2021	‡
136	0x88	ˆ	Modifier letter circumflex accent	U+02C6	ˆ
137	0x89	‰	Per mille sign	U+2030	‰
138	0x8A	Š	Latin capital letter S with caron	U+0160	Š
139	0x8B	‹	Single left pointing quotation mark	U+2039	‹
140	0x8C	Œ	Latin capital ligature OE	U+0152	Œ
141	0x8D		-	-	-
142	0x8E	Ž	Latin capital letter Z with caron	U+017D	Ž
143	0x8F		-	-	-
144	0x90		-	-	-
145	0x91	‘	Left single quotation mark	U+2018	‘
146	0x92	’	Right single quotation mark	U+2019	’
147	0x93	“	Left double quotation mark	U+201C	“
148	0x94	”	Right double quotation mark	U+201D	”
149	0x95	•	Bullet	U+2022	•
150	0x96	–	En dash	U+2013	–
151	0x97	—	Em dash	U+2014	—
152	0x98	˜	Small tilde	U+02DC	˜
153	0x99	™	Trade mark sign	U+2122	™
154	0x9A	š	Latin small letter S with caron	U+0161	š
155	0x9B	›	Single right pointing quotation mark	U+203A	›
156	0x9C	œ	Latin small ligature OE	U+0153	œ
157	0x9D		-	-	-
158	0x9E	ž	Latin small letter Z with caron	U+017E	ž
159	0x9F	Ÿ	Latin capital letter Y with diaeresis	U+0178	Ÿ
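
The mapping above can be checked by decoding each byte with a CP-1252 codec. This is a small sketch that assumes Python's built-in `cp1252` codec, which rejects the five unassigned bytes; other decoders may instead pass them through as C1 control codes. The helper name `cp1252_to_unicode` is made up for illustration.

```python
def cp1252_to_unicode(byte_value):
    """Return the code point CP-1252 assigns to a byte, or None if unassigned."""
    try:
        return ord(bytes([byte_value]).decode("cp1252"))
    except UnicodeDecodeError:
        # Python's codec leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined.
        return None


if __name__ == "__main__":
    for byte_value in range(0x80, 0xA0):
        code_point = cp1252_to_unicode(byte_value)
        if code_point is None:
            label = "-"
        else:
            label = f"U+{code_point:04X}\t{chr(code_point)}"
        print(byte_value, f"0x{byte_value:02X}", label, sep="\t")
```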

Nancy Street Network – Text Encodings	www.orthogonal.com.au