UTF-16
UCS Transformation Format—16-bit (UTF-16) is a 16-bit Unicode character encoding. It encodes a sequence of Unicode code points in a sequence of unsigned 16-bit integers (which we'll call code units).
Code points up to U+FFFF use one code unit, while higher code points use two. UTF-16 is not capable of encoding code points in the reserved range U+D800 to U+DFFF, but that range is not required for full Unicode support.
Format
For code points up to U+FFFF, the encoding is done in the trivial way (U+0000=0x0000, U+0001=0x0001, ..., U+FFFF=0xffff).
For higher code points, first subtract 0x10000 to get a 20-bit number. Add the 10 most significant bits of that number to 0xd800, and use that as the first code unit. Add the 10 least significant bits to 0xdc00, and use that as the second code unit.
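To make the arithmetic concrete, here is a minimal Python sketch of this mapping. The function name encode_code_point is illustrative, not from any specification:

```python
def encode_code_point(cp):
    """Map one Unicode code point to a list of UTF-16 code units."""
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("reserved code points U+D800..U+DFFF cannot be encoded")
    if cp <= 0xFFFF:
        return [cp]                      # trivial case: one code unit
    v = cp - 0x10000                     # a 20-bit value
    return [0xD800 + (v >> 10),          # 10 most significant bits
            0xDC00 + (v & 0x3FF)]        # 10 least significant bits

# Example: U+1D11E (above U+FFFF) encodes as two code units.
assert encode_code_point(0x1D11E) == [0xD834, 0xDD1E]
```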
Byte-oriented encodings
There are two byte-oriented flavors of UTF-16: UTF-16BE (big-endian) and UTF-16LE (little-endian). Each code unit uses 2 bytes, so each code point is encoded using either 2 or 4 bytes.
Each code unit is serialized according to the desired byte order; the order of the code units themselves is never changed. (This means that for UTF-16LE, a 4-byte code point is stored in what amounts to a mixed-endian byte order: the bytes are swapped within each code unit, but the first code unit still comes first.)
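A short sketch of this serialization, reusing the code units from the example above (the helper name code_units_to_bytes is ours):

```python
import struct

def code_units_to_bytes(code_units, big_endian=True):
    """Serialize UTF-16 code units as bytes in the chosen byte order."""
    fmt = ">H" if big_endian else "<H"   # unsigned 16-bit, big- or little-endian
    return b"".join(struct.pack(fmt, u) for u in code_units)

units = [0xD834, 0xDD1E]                               # U+1D11E
assert code_units_to_bytes(units, big_endian=True)  == b"\xd8\x34\xdd\x1e"
# UTF-16LE swaps bytes within each code unit but keeps the unit order:
assert code_units_to_bytes(units, big_endian=False) == b"\x34\xd8\x1e\xdd"
```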
Files encoded in UTF-16BE or UTF-16LE often begin with a Byte Order Mark to distinguish them from each other, and to help distinguish them from other character encodings.
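Since the BOM is the code point U+FEFF, it serializes to the bytes 0xFE 0xFF in UTF-16BE and 0xFF 0xFE in UTF-16LE, which allows a simple detection sketch (the function name is ours):

```python
def detect_utf16_bom(data):
    """Guess the byte order from a leading BOM; None if there is no BOM."""
    if data.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    return None   # no BOM; the byte order must be known from other context

assert detect_utf16_bom(b"\xff\xfe\x41\x00") == "UTF-16LE"   # BOM + "A"
```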
Relationship with UCS-2
UTF-16 is compatible with UCS-2, in the sense that a UTF-16 decoder can correctly decode UCS-2, provided that the UCS-2 data does not use any reserved Unicode code points. The reverse is not true: a UCS-2 decoder cannot safely decode UTF-16.
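A minimal surrogate-aware decoder makes the difference concrete; a UCS-2 decoder is the same loop without the pairing branch (the function name is ours):

```python
def decode_utf16(code_units):
    """Decode UTF-16 code units into code points, pairing surrogates."""
    out, i = [], 0
    while i < len(code_units):
        u = code_units[i]
        if 0xD800 <= u <= 0xDBFF and i + 1 < len(code_units) \
                and 0xDC00 <= code_units[i + 1] <= 0xDFFF:
            # Combine a surrogate pair into one code point above U+FFFF.
            out.append(0x10000 + ((u - 0xD800) << 10)
                               + (code_units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out

# A UCS-2 decoder would read [0xD834, 0xDD1E] as two reserved code points;
# a UTF-16 decoder combines them into one:
assert decode_utf16([0xD834, 0xDD1E]) == [0x1D11E]
```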
Specifications
- RFC 2781 (2000-02)
- Unicode 6.0, Chapter 3 (2011) — §3.9 D91; §3.10 D96–D98, Table 3-9
- ISO/IEC 10646:2003 Annex Q (2003)