UCS-2

From Just Solve the File Format Problem
(Difference between revisions)
 
(7 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
{{FormatInfo
 
{{FormatInfo
 
|formattype=electronic
 
|formattype=electronic
|subcat=Character Encodings
+
|subcat=Character encoding
 +
|subcat2=Unicode
 +
|charset=ISO-10646-UCS-2
 +
|charsetaliases=csUnicode
 +
|mibenum=1000
 
}}
 
}}
'''UCS-2''' is the trivial 16-bit [[Unicode]] encoding. It is considered to be obsolete.
+
'''UCS-2''' is the trivial 16-bit [[Unicode]] encoding. It was defined in versions of Unicode prior to 2.0, and is now considered to be obsolete. It was standardized as part of ISO 10646.
  
It was at one time the only popular Unicode encoding, so there was little need to distinguish between the terms ''Unicode'' and ''UCS-2''. If an old format specification says that text is encoded in "Unicode", it probably means UCS-2.
+
UCS-2 was at one time the only popular Unicode encoding, so there was little need to distinguish between the terms ''Unicode'' and ''UCS-2''. If an old format specification says that text is encoded in "Unicode", it probably means UCS-2.
  
 +
There may be some disagreement about precisely what "UCS-2" means. Besides the original definition, it could mean the encoding with ''surrogate pairs'' allowed (see below), or it could be an old name for the encoding whose current version is [[UTF-16]].
 +
 +
The terms ''UCS-2'' and ''UTF-16'' are sometimes used interchangeably, even though they really shouldn't be. However, the data formats do end up being identical in many real-world situations.
 +
 +
== Original format ==
 
UCS-2 encodes a sequence of Unicode code points in a sequence of unsigned 16-bit integers, one code point per integer, in the obvious way (U+0000=0x0000, U+0001=0x0001, ..., U+FFFF=0xffff). It is only capable of encoding code points up to U+FFFF, and does not support the higher code points (U+10000 through U+10FFFF).
 
UCS-2 encodes a sequence of Unicode code points in a sequence of unsigned 16-bit integers, one code point per integer, in the obvious way (U+0000=0x0000, U+0001=0x0001, ..., U+FFFF=0xffff). It is only capable of encoding code points up to U+FFFF, and does not support the higher code points (U+10000 through U+10FFFF).
  
 
Since it is often necessary to encode code points into bytes, instead of 16-bit integers, there are two flavors of UCS-2 which do that: UCS-2BE (big-[[Endianness|endian]]) and UCS-2LE (little-endian).
 
Since it is often necessary to encode code points into bytes, instead of 16-bit integers, there are two flavors of UCS-2 which do that: UCS-2BE (big-[[Endianness|endian]]) and UCS-2LE (little-endian).
  
== See also ==
+
== Surrogate pairs ==
 +
''[Note: This is not the orthodox way of explaining surrogate pairs.]''
  
 +
Although UCS-2 does not support code points beyond U+FFFF, a hack called ''surrogate pairs'' was invented to allow such code points to safely pass through UCS-2 systems (e.g. file formats, databases, strings in some programming languages) in many cases.
 +
 +
When using this hack, a code point beyond U+FFFF is encoded as two code points in the reserved range U+D800 through U+DFFF. Each of these code points is then called a ''surrogate'', and together they form a ''surrogate pair''. Assuming that the reserved code points are not used in any other way, this format is identical to [[UTF-16]].
 +
 +
When converting from UCS-2 to another encoding, it is a good idea to be aware of this hack. In most cases, sequences that look like surrogate pairs should be interpreted as such. One should keep in mind that UCS-2 systems generally aren't aware of UTF-16's rules, so it might be a bad idea to blindly interpret UCS-2 data as if it were UTF-16. For example, a string containing only the single code unit 0xDFFF is valid UCS-2, but invalid UTF-16.
 +
 +
== See also ==
 
* [[UTF-16]]
 
* [[UTF-16]]
 
* [[Byte Order Mark]]
 
* [[Byte Order Mark]]
  
 
== External links ==
 
== External links ==
 
 
* [http://www.unicode.org/faq/basic_q.html#14 Unicode FAQ: What is the difference between UCS-2 and UTF-16?]
 
* [http://www.unicode.org/faq/basic_q.html#14 Unicode FAQ: What is the difference between UCS-2 and UTF-16?]

Latest revision as of 02:35, 21 May 2019

File Format
Name UCS-2
Ontology
IANA charset ISO-10646-UCS-2
IANA aliases csUnicode
IANA MIBenum 1000

UCS-2 is the trivial 16-bit Unicode encoding. It was defined in versions of Unicode prior to 2.0, and is now considered to be obsolete. It was standardized as part of ISO 10646.

UCS-2 was at one time the only popular Unicode encoding, so there was little need to distinguish between the terms Unicode and UCS-2. If an old format specification says that text is encoded in "Unicode", it probably means UCS-2.

There may be some disagreement about precisely what "UCS-2" means. Besides the original definition, it could mean the encoding with surrogate pairs allowed (see below), or it could be an old name for the encoding whose current version is UTF-16.

The terms UCS-2 and UTF-16 are sometimes used interchangeably, even though they really shouldn't be. However, the data formats do end up being identical in many real-world situations.

Original format

UCS-2 encodes a sequence of Unicode code points in a sequence of unsigned 16-bit integers, one code point per integer, in the obvious way (U+0000=0x0000, U+0001=0x0001, ..., U+FFFF=0xffff). It is only capable of encoding code points up to U+FFFF, and does not support the higher code points (U+10000 through U+10FFFF).

Since it is often necessary to encode code points into bytes, instead of 16-bit integers, there are two flavors of UCS-2 which do that: UCS-2BE (big-endian) and UCS-2LE (little-endian).
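As an illustration of the original format, here is a minimal Python sketch that writes each code point as one unsigned 16-bit integer in either byte order and rejects anything above U+FFFF. The function names ucs2_encode and ucs2_decode and the byteorder parameter are hypothetical, chosen for this example only; they do not come from any standard or library.

```python
import struct


def ucs2_encode(text: str, byteorder: str = "big") -> bytes:
    """Encode text as UCS-2BE ("big") or UCS-2LE ("little").

    Hypothetical helper: one 16-bit unit per code point, no surrogate pairs.
    """
    fmt = ">H" if byteorder == "big" else "<H"
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Original UCS-2 cannot represent U+10000 through U+10FFFF.
            raise ValueError(f"U+{cp:04X} cannot be encoded in UCS-2")
        out += struct.pack(fmt, cp)
    return bytes(out)


def ucs2_decode(data: bytes, byteorder: str = "big") -> str:
    """Decode UCS-2BE or UCS-2LE bytes, one code point per 16-bit unit."""
    if len(data) % 2:
        raise ValueError("UCS-2 data must contain an even number of bytes")
    fmt = ">H" if byteorder == "big" else "<H"
    # Each unit is taken at face value, including values in D800-DFFF.
    return "".join(chr(unit) for (unit,) in struct.iter_unpack(fmt, data))
```

With these helpers, ucs2_encode("A", "big") gives the bytes 00 41, while ucs2_encode("A", "little") gives 41 00.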

Surrogate pairs

[Note: This is not the orthodox way of explaining surrogate pairs.]

Although UCS-2 does not support code points beyond U+FFFF, a hack called surrogate pairs was invented to allow such code points to safely pass through UCS-2 systems (e.g. file formats, databases, strings in some programming languages) in many cases.

When using this hack, a code point beyond U+FFFF is encoded as two code points in the reserved range U+D800 through U+DFFF: the first taken from U+D800 through U+DBFF, the second from U+DC00 through U+DFFF. Each of these code points is then called a surrogate, and together they form a surrogate pair. Assuming that the reserved code points are not used in any other way, this format is identical to UTF-16.
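The arithmetic behind a surrogate pair can be sketched as follows, using the UTF-16 rule: subtract 0x10000 from the code point, put the upper 10 bits of the result in the first surrogate (based at 0xD800) and the lower 10 bits in the second (based at 0xDC00). The function names are hypothetical and exist only for this example.

```python
from typing import Tuple


def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into two 16-bit surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points beyond U+FFFF need a surrogate pair")
    offset = cp - 0x10000            # a 20-bit value
    high = 0xD800 + (offset >> 10)   # upper 10 bits: D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)  # lower 10 bits: DC00..DFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a first (high) and second (low) surrogate into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, U+1F600 becomes the two units 0xD83D and 0xDE00, and recombining them yields 0x1F600 again.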

When converting from UCS-2 to another encoding, it is a good idea to be aware of this hack. In most cases, sequences that look like surrogate pairs should be interpreted as such. One should keep in mind that UCS-2 systems generally aren't aware of UTF-16's rules, so it might be a bad idea to blindly interpret UCS-2 data as if it were UTF-16. For example, a string containing only the single code unit 0xDFFF is valid UCS-2, but invalid UTF-16.
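One possible way to follow this advice is a lenient converter that combines sequences that look like surrogate pairs but passes unpaired surrogates through unchanged rather than rejecting them. The sketch below is only one policy among several (a converter could instead substitute U+FFFD or raise an error), and the function name is hypothetical.

```python
from typing import Iterable, List


def ucs2_units_to_code_points(units: Iterable[int]) -> List[int]:
    """Convert 16-bit UCS-2 units to code points, tolerating lone surrogates.

    Assumed policy for this sketch: a first surrogate (D800..DBFF) directly
    followed by a second surrogate (DC00..DFFF) is combined into one code
    point; any other unit, including an unpaired surrogate, is kept as-is.
    """
    seq = list(units)
    out: List[int] = []
    i = 0
    while i < len(seq):
        unit = seq[i]
        if (0xD800 <= unit <= 0xDBFF and i + 1 < len(seq)
                and 0xDC00 <= seq[i + 1] <= 0xDFFF):
            out.append(0x10000 + ((unit - 0xD800) << 10) + (seq[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(unit)
            i += 1
    return out
```

Here ucs2_units_to_code_points([0xD83D, 0xDE00]) returns [0x1F600], while the single unit [0xDFFF] is returned unchanged. By contrast, a strict UTF-16 decoder such as Python's own rejects that input: b"\xff\xdf".decode("utf-16-le") raises UnicodeDecodeError.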

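See also

* UTF-16
* Byte Order Mark

External links

* Unicode FAQ: What is the difference between UCS-2 and UTF-16? (http://www.unicode.org/faq/basic_q.html#14)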
