UTF-16

File Format
Name	UTF-16
Ontology	Electronic File Formats > Character encoding > Unicode > UTF-16
IANA charset	UTF-16
IANA aliases	csUTF16
IANA MIBenum	1015

UCS Transformation Format, 16-bit (UTF-16) is a variable-width Unicode character encoding based on 16-bit units. It encodes a sequence of Unicode code points as a sequence of unsigned 16-bit integers (which we'll call code units).

Code points up to U+FFFF use one code unit, while higher code points use two. UTF-16 supports the full range of the modern Unicode standard: code points U+0000 to U+D7FF, and U+E000 to U+10FFFF.

Format

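For code points up to U+FFFF, the single code unit is simply the code point's value (U+0000=0x0000, U+0001=0x0001, ..., U+FFFF=0xffff). The range U+D800 to U+DFFF is excluded: it contains no encodable code points, because those code unit values are reserved for the two-unit form described next.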

For higher code points, first subtract 0x10000 to get a 20-bit number. Add the 10 most significant bits of that number to 0xd800, and use that as the first code unit. Add the 10 least significant bits to 0xdc00, and use that as the second code unit.
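The two cases above can be sketched in Python. This is a minimal illustration rather than a reference implementation, and the function name encode_code_point is made up for this example.

```python
def encode_code_point(cp):
    """Return the list of UTF-16 code units for one Unicode code point.

    Hypothetical helper for illustration; not part of any library.
    """
    # U+D800..U+DFFF and anything above U+10FFFF cannot be encoded.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not encodable in UTF-16: %#x" % cp)

    # Code points up to U+FFFF: one code unit, equal to the code point.
    if cp <= 0xFFFF:
        return [cp]

    # Higher code points: subtract 0x10000 to get a 20-bit number, then
    # split it into its 10 most and 10 least significant bits.
    v = cp - 0x10000
    first = 0xD800 + (v >> 10)
    second = 0xDC00 + (v & 0x3FF)
    return [first, second]


if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0x10000, 0x1F600, 0x10FFFF):
        units = encode_code_point(cp)
        print("U+%04X ->" % cp, " ".join("0x%04x" % u for u in units))
```

For example, U+1F600 minus 0x10000 is 0xF600; its upper 10 bits are 0x03D and its lower 10 bits are 0x200, giving the code units 0xd83d and 0xde00.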

Descriptions of UTF-16 often use the term surrogate pair. This tends to create unnecessary confusion, perhaps because whether a surrogate pair is one code point or two depends on context in subtle ways. See UCS-2 for the historical reasons that this term exists.

Byte-oriented encodings

There are two byte-oriented flavors of UTF-16: UTF-16BE (big-endian) and UTF-16LE (little-endian). Each code unit uses 2 bytes, so each code point is encoded using either 2 or 4 bytes. IANA has assigned the charset identifiers UTF-16BE (MIBenum 1013) and UTF-16LE (MIBenum 1014) to these variants.

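Each code unit is written in the desired byte order, but the order of the code units themselves is not changed. This means that in UTF-16LE, a 4-byte code point ends up in a kind of mixed-endian order: each code unit is stored least significant byte first, yet the code unit carrying the most significant bits still comes first.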
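To make the byte layout concrete, here is a small Python sketch that serializes the same pair of code units in both byte orders. The helper name code_units_to_bytes is an assumption made for this example; only the standard struct module is used.

```python
import struct


def code_units_to_bytes(units, little_endian):
    """Serialize 16-bit code units in the given byte order.

    Hypothetical helper for illustration. The order of the code units
    is preserved; only the two bytes inside each unit are swapped.
    """
    prefix = "<" if little_endian else ">"
    return struct.pack(prefix + "H" * len(units), *units)


# U+1F600 as UTF-16 code units (first unit, then second unit).
units = [0xD83D, 0xDE00]

be = code_units_to_bytes(units, little_endian=False)
le = code_units_to_bytes(units, little_endian=True)

if __name__ == "__main__":
    print("UTF-16BE:", be.hex())  # d83dde00
    print("UTF-16LE:", le.hex())  # 3dd800de
```

In the little-endian output the bytes of each code unit are reversed, but 0xd83d still precedes 0xde00, which is the mixed-endian effect described above.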

Files encoded in UTF-16 often begin with a Byte Order Mark (BOM) to indicate their byte order, and to help distinguish them from other character encodings. But we need to be careful with terminology here, because the Unicode standard has decreed^[1] that the term "UTF-16BE" does not mean the same thing as "UTF-16 with big-endian byte order" (and similarly for UTF-16LE). A file encoded in "UTF-16" may use a BOM, but a file labeled as "UTF-16BE" or "UTF-16LE" should not use a BOM.
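As a rough sketch of how a reader might handle the BOM, the Python below checks the first two bytes of the data. The BOM is the code point U+FEFF, so it appears as the bytes FE FF in big-endian order and FF FE in little-endian order. The function name split_bom is made up for this example, and what to do when no BOM is present is left to the caller.

```python
import codecs


def split_bom(data):
    """Return (byte_order, rest) for data that may start with a BOM.

    Hypothetical helper for illustration. byte_order is "big",
    "little", or None when no BOM is found; rest is the data with
    any BOM removed.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little", data[2:]
    return None, data


if __name__ == "__main__":
    # Python's "utf-16" codec writes a BOM; "utf-16-be" and
    # "utf-16-le" do not, matching the labeling rule above.
    for label in ("utf-16", "utf-16-be", "utf-16-le"):
        order, rest = split_bom("A".encode(label))
        print(label, "->", order, rest.hex())
```

Note that a two-byte check like this cannot tell UTF-16 apart from every other encoding; it only recognizes the two UTF-16 BOM byte sequences.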

Relationship with UCS-2

UTF-16 is compatible with UCS-2, in the sense that a UTF-16 decoder can correctly decode UCS-2, provided that the UCS-2 data does not use any reserved Unicode code points (U+D800 to U+DFFF). The reverse is not true: a UCS-2 decoder cannot necessarily decode UTF-16, because it treats the two code units of a higher code point as two separate values.
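The asymmetry can be illustrated with two toy decoders in Python. Both function names are made up for this example, and the UTF-16 decoder is deliberately lenient: it passes unpaired values in the reserved range through unchanged, which is a simplification chosen for this sketch.

```python
def decode_ucs2(units):
    """Toy UCS-2 decoder: every code unit is taken as one code point."""
    return list(units)


def decode_utf16(units):
    """Toy UTF-16 decoder: combines two-unit sequences into one code point.

    Hypothetical helper for illustration. Unpaired units in the
    0xD800..0xDFFF range are passed through unchanged.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_first = 0xD800 <= u <= 0xDBFF
        has_second = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_first and has_second:
            # Reverse of the encoding step: rebuild the 20-bit number
            # and add 0x10000 back.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out


if __name__ == "__main__":
    plain = [0x0041, 0x20AC]   # "A" and the euro sign
    pair = [0xD83D, 0xDE00]    # U+1F600 in UTF-16
    print([hex(c) for c in decode_utf16(plain)], [hex(c) for c in decode_ucs2(plain)])
    print([hex(c) for c in decode_utf16(pair)], [hex(c) for c in decode_ucs2(pair)])
```

On data without reserved code points the two decoders agree. On the two-unit sequence, the UTF-16 decoder recovers U+1F600 while the UCS-2 decoder reports two reserved values.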

Specifications

RFC 2781 (2000-02)
Unicode 6.0, Chapter 3 (2011) — §3.9 D91; §3.10 D96–D98, Table 3-9
ISO/IEC 10646:2003 Annex Q (2003)

References

[1] Unicode 6.2, §2.6
