Unicode
Dan Tobias (Talk | contribs)
Revision as of 23:49, 11 September 2013
Unicode is a system for representing characters numerically, and is the basis for various character encodings including the popular UTF-8. It was devised beginning in 1987, and the first version of its standard was published in 1991. Subsequent revisions have continually expanded its character repertoire.
Unicode was devised in reaction both to the unwieldy multiplicity of character sets that had arisen to cover various subsets of the many characters left out of the English-centric ASCII set, and to the clumsiness of the proposed ISO 10646 character set, which was the awkward product of international politics rather than a sensible technical design. It has been successful to the point where just about all technical standards dealing with characters are now defined with regard to Unicode code points, with even older proprietary encodings cross-referenced to the Unicode characters they encode.
Early versions of Unicode attempted to be a 16-bit character encoding, in which characters from a potential repertoire of 65,536 code points could be represented as 16-bit (2-byte) unsigned integers. The "big-endian vs. little-endian" problem meant there were two possible byte streams corresponding to a particular document; these could be disambiguated with the Byte Order Mark character (U+FEFF), which came out as the hexadecimal byte sequence FE FF or FF FE depending on which byte order was used. However, from an early time, users of Unicode preferred encodings that did not take up two bytes for every character, even common ones from the ASCII repertoire, so variable-byte-length encodings came into common use. Ultimately, the Unicode standard expanded its repertoire using multiple "planes", so that even 16 bits weren't enough to encode every character.
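The byte-order problem can be sketched in a few lines of Python (assuming a standard Python 3 interpreter); the BOM and an ordinary character each serialize to different byte sequences under the two orders:

```python
# The Byte Order Mark (U+FEFF) serializes differently per byte order.
bom = "\ufeff"
print(bom.encode("utf-16-be").hex())  # feff (big-endian)
print(bom.encode("utf-16-le").hex())  # fffe (little-endian)

# The same 16-bit code unit for 'A' (U+0041) in each order:
print("A".encode("utf-16-be").hex())  # 0041
print("A".encode("utf-16-le").hex())  # 4100
```

A decoder that sees FE FF or FF FE at the start of a stream can thus tell which byte order the rest of the document uses.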
At its root, Unicode is simply an assignment of numeric values to characters: a huge number of characters from various writing systems (modern or ancient), as well as special symbols of many types, are each given a number. Usually in Unicode charts these are given as hexadecimal numbers rather than the decimal ones most humans tend to prefer (computer geeks are funny that way), but they can be expressed in any base you want; in HTML you can use ampersand encoding with either decimal or hexadecimal numbers to express a Unicode character (e.g., &#65; or &#x41; both produce "A"). (However, aside from the simple numeric assignments, Unicode also contains various rules regarding such things as character composition with separate diacritical elements and left-to-right vs. right-to-left character positioning, so things can get a bit more complex than just converting a series of numbers into characters.)
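As a quick sketch using Python's standard html module (any HTML processor behaves the same way), the decimal and hexadecimal numeric character references decode to the same character:

```python
import html

# Decimal and hexadecimal numeric character references for U+0041 ("A").
print(html.unescape("&#65;"))   # A
print(html.unescape("&#x41;"))  # A

# A non-ASCII example: U+00E9, LATIN SMALL LETTER E WITH ACUTE.
print(html.unescape("&#233;"))  # é
```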
The first 128 Unicode code points, 0-127, correspond to the same code points in ASCII (including both printable characters and the C0 controls). The next 128, 128-255, correspond to the same points in ISO 8859-1 (including the C1 controls at 128-159); since ISO 8859-1 in turn matches ASCII at 0-127, the entire first 256 Unicode code points are identical to ISO 8859-1.
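This correspondence can be verified directly in Python (a sketch, assuming a standard Python 3 interpreter): decoding any single byte as ISO 8859-1 yields the Unicode code point with the same number.

```python
# Every byte value 0-255 maps to the identically numbered code point
# when interpreted as ISO 8859-1 (Latin-1).
for n in range(256):
    assert bytes([n]).decode("latin-1") == chr(n)

# e.g. byte 0xE9 (decimal 233) is U+00E9, LATIN SMALL LETTER E WITH ACUTE:
print(bytes([0xE9]).decode("latin-1"))  # é
```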
Encodings
Once numbers are assigned to characters, they can be encoded as sequences of bytes in various ways, as defined in the specifications of particular character encodings.
The most common Unicode encodings are UTF-8, UTF-16, and UTF-32. See Character Encodings for a longer list.
There is no encoding named simply "Unicode". If a format specification says that text is encoded in "Unicode", it probably means UTF-16 or UCS-2. If the document is related to Microsoft Windows, it probably means UTF-16LE.
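To sketch how the same text comes out under these encodings, here is a short Python example (the codec names are Python's standard names for these encodings):

```python
text = "A\u00e9\U0001F4A9"  # 'A', 'é', and U+1F4A9 PILE OF POO

for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le"):
    data = text.encode(codec)
    print(f"{codec}: {len(data)} bytes -> {data.hex()}")

# UTF-8 uses 1-4 bytes per character; UTF-16 uses 2 or 4 (U+1F4A9,
# being outside the first plane, needs a surrogate pair); UTF-32
# always uses exactly 4 bytes per code point.
```

Note the "-le" suffix: that is the Windows-style little-endian UTF-16 mentioned above, with no Byte Order Mark prepended.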
Notes
And if you think Unicode is full of crap, you've got some support with this character (U+1F4A9, "Pile of Poo").
References
- Unicode official site -- has lots of standards documents and code charts
- Wikipedia entry on Unicode
- Wikipedia list of Unicode Characters
- Shapecatcher - Site for finding Unicode characters by drawing them
- Tool for converting Unicode into other character formats such as UTF-8, HTML, etc.
- Unicode search at Fileformat.Info
- ConScript Unicode Registry: an unofficial registry for Private Use Area blocks used for constructed scripts (e.g., Klingon), not part of the official Unicode standard