UCS-2
Revision as of 21:49, 11 September 2013
UCS-2 (as originally formulated) is the trivial 16-bit Unicode encoding. It is considered to be obsolete.
The terms UCS-2 and UTF-16 are sometimes used interchangeably, because the data formats end up being identical in most real-world situations.
UCS-2 was at one time the only popular Unicode encoding, so there was little need to distinguish between the terms Unicode and UCS-2. If an old format specification says that text is encoded in "Unicode", it probably means UCS-2.
Original format
UCS-2 encodes a sequence of Unicode code points in a sequence of unsigned 16-bit integers, one code point per integer, in the obvious way (U+0000=0x0000, U+0001=0x0001, ..., U+FFFF=0xffff). It is only capable of encoding code points up to U+FFFF, and does not support the higher code points (U+10000 through U+10FFFF).
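The mapping described above can be sketched in a few lines of Python. This is only an illustration, not a reference implementation; the function names `ucs2_encode` and `ucs2_decode` are made up for this example.

```python
def ucs2_encode(text):
    """Map each code point to one unsigned 16-bit integer.

    Hypothetical helper: raises ValueError for code points above U+FFFF,
    since plain UCS-2 has no way to represent them.
    """
    units = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError(f"U+{cp:04X} cannot be represented in UCS-2")
        units.append(cp)
    return units


def ucs2_decode(units):
    """Inverse mapping: each 16-bit integer is one code point."""
    return "".join(chr(u) for u in units)
```

For example, `ucs2_encode("A\u20ac")` yields `[0x0041, 0x20AC]`, while a string containing U+1F600 is rejected.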
Since it is often necessary to encode code points into bytes, instead of 16-bit integers, there are two flavors of UCS-2 which do that: UCS-2BE (big-endian) and UCS-2LE (little-endian).
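As a rough sketch of the difference between the two byte orders, the following Python uses the standard `struct` module to serialize a list of 16-bit integers. The helper names `ucs2_to_bytes` and `ucs2_from_bytes` and the `big_endian` parameter are assumptions made for this example.

```python
import struct


def ucs2_to_bytes(units, big_endian=True):
    """Serialize 16-bit integers as UCS-2BE (default) or UCS-2LE bytes."""
    fmt = (">" if big_endian else "<") + f"{len(units)}H"
    return struct.pack(fmt, *units)


def ucs2_from_bytes(data, big_endian=True):
    """Parse UCS-2BE or UCS-2LE bytes back into 16-bit integers."""
    if len(data) % 2:
        raise ValueError("UCS-2 byte data must have an even length")
    fmt = (">" if big_endian else "<") + f"{len(data) // 2}H"
    return list(struct.unpack(fmt, data))
```

With the input `[0x0041, 0x20AC]`, the big-endian form is the bytes `00 41 20 AC` and the little-endian form is `41 00 AC 20`.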
Surrogate pairs
[Note: This is not the orthodox way of explaining surrogate pairs.]
Although UCS-2 does not support code points beyond U+FFFF, a hack called surrogate pairs was invented to allow such code points to safely pass through UCS-2 systems (e.g. file formats, databases, programming languages) in many cases.
When using this hack, a code point beyond U+FFFF is encoded as two code points in the reserved range U+D800 through U+DFFF. Each of these code points is then called a surrogate, and together they form a surrogate pair. Assuming that the reserved code points are not used in any other way, this format is identical to UTF-16.
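The arithmetic behind a surrogate pair can be sketched as follows. It follows the UTF-16 rule that the first (high) surrogate falls in U+D800 through U+DBFF and the second (low) surrogate in U+DC00 through U+DFFF, each carrying 10 bits of the code point minus 0x10000. The function names are hypothetical.

```python
def to_surrogate_pair(cp):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError(f"U+{cp:04X} is not encoded as a surrogate pair")
    offset = cp - 0x10000             # 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, U+1F600 becomes the pair 0xD83D, 0xDE00, and U+10000 becomes 0xD800, 0xDC00.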
When converting from UCS-2 to another encoding, it is a good idea to be aware of this hack. In most cases, sequences that look like surrogate pairs should be interpreted as such. One should keep in mind that UCS-2 systems generally aren't aware of UTF-16's rules, so it might be a bad idea to blindly interpret UCS-2 data as if it were UTF-16. For example, a string containing only the single code unit 0xDFFF is valid UCS-2, but invalid UTF-16.
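One possible way to act on this advice is a lenient decoder that combines well-formed surrogate pairs but passes unpaired surrogates through unchanged instead of rejecting them. The sketch below is one such policy, not the only reasonable one; `decode_ucs2_lenient` is a name invented for this example.

```python
def decode_ucs2_lenient(units):
    """Convert UCS-2 code units to a list of code points.

    A high surrogate (0xD800..0xDBFF) immediately followed by a low
    surrogate (0xDC00..0xDFFF) is combined into one code point above
    U+FFFF. Any other unit, including a lone surrogate, is kept as-is,
    because it is still valid UCS-2.
    """
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out
```

Under this policy, `[0x0041, 0xD83D, 0xDE00]` decodes to `[0x41, 0x1F600]`, whereas the single unit `[0xDFFF]` is returned unchanged. A strict UTF-16 decoder would reject the latter.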