Character encoding

From Just Solve the File Format Problem
(Difference between revisions)
(Specific character sets or encodings)
(36 intermediate revisions by 4 users not shown)
Line 1: Line 1:
 
{{FormatInfo
 
{{FormatInfo
 
|formattype=electronic
 
|formattype=electronic
|thiscat=Character Encodings
+
|thiscat=Character encoding
 
|image=Characters.png
 
|image=Characters.png
 
}}
 
}}
 +
'''Character encodings''' are methods of representing the characters of text, usually as numeric values that computers store as bits and bytes, but sometimes in other forms (e.g., [[Braille]] represents them as patterns of raised dots). They are sometimes also called "character sets", but strictly speaking a character set is merely a repertoire of characters: the list of characters supported by some system, protocol, or file format, with no inherent order or numbering. A character encoding assigns a specific value (in some coding system) to each character. In practice the distinction blurs. There are multiple levels of abstraction ([[Unicode]] defines a set of characters and assigns each a numeric code point, but leaves it to more specific encodings such as [[UTF-8]] to define the bits and bytes that represent them in a file), and some protocols use parameter names such as 'charset' to indicate which character ''encoding'' is in use, so the terminology is inconsistent even in technical usage. This section documents character sets and encodings of all sorts.
  
See [[Fonts]] for their renditions as seen on screens and printouts.
+
See [[Fonts]] for how encoded characters are rendered on screens and printouts. The visual form of a character is known as a "glyph", and a font is a set of glyphs mapped onto the abstractly defined characters of the character set underlying a character encoding.
 +
 
 +
== Specific character sets or encodings ==
  
 
* [[Adobe Standard Encoding]]
 
* [[Adobe Standard Encoding]]
 +
* [[Amstrad CP/M Plus character set]]
 
* [[ANSEL]]
 
* [[ANSEL]]
 
** [[MARC-8]]
 
** [[MARC-8]]
Line 16: Line 20:
 
* [[Baudot code]]
 
* [[Baudot code]]
 
* [[Braille]]
 
* [[Braille]]
 +
** [[BRF]]
 +
** [[Nemeth Code]]
 +
** [[Taylor Code]]
 
* [[Compucolor character set]]
 
* [[Compucolor character set]]
 +
* DEC (Digital Equipment Corporation)
 +
** [[DEC Special Graphics Character Set]]
 +
** [[PDP-1 alphanumeric codes]]
 
* [[EBCDIC]]
 
* [[EBCDIC]]
** [[CP037]]
+
** [[CP037]] · [[CP285]] · [[CP424]] · [[CP500]] · [[CP875]] · [[CP1026]] · [[CP1047]] · [[CP1140]] · [[CP1148]] · [[CP1155]] · [[CP4971]] · [[CP9067]] · [[CP12712]] · [[EBCDIC 6-Bit]]
** [[CP285]]
+
* [[Flag semaphore]]
** [[CP424]]
+
** [[CP500]]
+
** [[CP875]]
+
** [[CP1026]]
+
** [[CP1047]]
+
** [[CP1140]]
+
** [[CP1148]]
+
** [[CP1155]]
+
** [[CP4971]]
+
** [[CP9067]]
+
** [[CP12712]]
+
** [[EBCDIC 6-Bit]]
+
 
* [[GB 2312]]
 
* [[GB 2312]]
* [[IBM PC code pages]]
+
* IBM: See EBCDIC, MS-DOS Encodings, and APL Code Page elsewhere in this list
 
* [[ISO 646]]
 
* [[ISO 646]]
** [[ISO 646-CA]] (Canada / French)
+
** [[ISO 646-CA]] · [[ISO 646-CA-2]] · [[ISO 646-CH]] · [[ISO 646-CN]] · [[ISO 646-CU]] · [[ISO 646-DE]] · [[ISO 646-DK]] · [[ISO 646-FI]] · [[ISO 646-FR]] · [[ISO 646-GB]] · [[ISO 646-HU]] · [[ISO 646-IRV]] · [[ISO 646-IT]] · [[ISO 646-JP]] · [[ISO 646-JP OCR-B]] · [[ISO 646-KR]] · [[ISO 646-MT]] · [[ISO 646-NL]] · [[ISO 646-NO]] · [[ISO 646-NO-2]] · [[ISO 646-PT]] · [[ISO 646-SE]] · [[ISO 646-SE-2]] · [[ISO 646-US]] · [[ISO 646-YU]]
** [[ISO 646-CA-2]] (Canada / French)
+
** [[ISO 646-CH]] (Switzerland)
+
** [[ISO 646-CN]] (China / Basic Latin)
+
** [[ISO 646-CU]] (Cuba / Spanish)
+
** [[ISO 646-DE]] (Germany)
+
** [[ISO 646-DK]] (Denmark)
+
** [[ISO 646-FI]] (Finland)
+
** [[ISO 646-FR]] (France)
+
** [[ISO 646-GB]] (Great Britain)
+
** [[ISO 646-HU]] (Hungary)
+
** [[ISO 646-IRV]] (International Reference Version)
+
** [[ISO 646-IT]] (Italy)
+
** [[ISO 646-JP]] (Japan / Romaji)
+
** [[ISO 646-JP OCR-B]] (Japan / Romaji)
+
** [[ISO 646-KR]] (Korea / Latin)
+
** [[ISO 646-MT]] (Malta)
+
** [[ISO 646-NL]] (Netherlands)
+
** [[ISO 646-NO]] (Norway)
+
** [[ISO 646-NO-2]] (Norway)
+
** [[ISO 646-PT]] (Portugal)
+
** [[ISO 646-SE]] (Sweden)
+
** [[ISO 646-SE-2]] (Sweden)
+
** [[ISO 646-US]] (Same as [[ASCII]])
+
** [[ISO 646-YU]] (Yugoslavia)
+
 
* [[ISO 2022]]
 
* [[ISO 2022]]
 
* [[ISO 8859]]
 
* [[ISO 8859]]
** [[ISO 8859-1]] (Latin-1)
+
** [[ISO 8859-1]] · [[ISO 8859-2]] · [[ISO 8859-3]] · [[ISO 8859-4]] · [[ISO 8859-5]] · [[ISO 8859-6]] · [[ISO 8859-7]] · [[ISO 8859-8]] · [[ISO 8859-9]] · [[ISO 8859-10]] · [[ISO 8859-11]] · [[ISO 8859-13]] · [[ISO 8859-14]] · [[ISO 8859-15]] · [[ISO 8859-16]]
** [[ISO 8859-2]] (Latin-2, Central/East European)
+
** [[ISO 8859-3]] (Latin-3, Esperanto, Galician, Maltese, and Turkish)
+
** [[ISO 8859-4]] (Latin-4, Scandinavian and Baltic)
+
** [[ISO 8859-5]] (Cyrillic)
+
** [[ISO 8859-6]] (Arabic)
+
** [[ISO 8859-7]] (Modern Greek)
+
** [[ISO 8859-8]] (Hebrew)
+
** [[ISO 8859-9]] (Latin-5, Turkish)
+
** [[ISO 8859-10]] (Latin-6, Lappish, Nordic, and Inuit)
+
** [[ISO 8859-11]] (Thai)
+
** [[ISO 8859-13]] (Latin-7, Baltic Rim)
+
** [[ISO 8859-14]] (Celtic)
+
** [[ISO 8859-15]] (Latin-9, Latin-1 with a Euro sign)
+
** [[ISO 8859-16]] (Romanian)
+
 
* [[JIS]]
 
* [[JIS]]
 
** [[JIS X 0201]]
 
** [[JIS X 0201]]
Line 86: Line 46:
 
** [[KOI8-U]] (Ukraine)
 
** [[KOI8-U]] (Ukraine)
 
* [[Macintosh encodings]]
 
* [[Macintosh encodings]]
** [[MacCE]]
+
** [[MacCE]] · [[MacCyrillic]] · [[MacDingbats]] · [[MacGreek]] · [[MacGujarati]] · [[MacGurmukhi]] · [[MacIcelandic]] · [[MacRoman]] · [[MacRomania]] · [[MacSymbol]] · [[MacThai]] · [[MacTurkish]]
** [[MacCyrillic]]
+
* [[Mattel Aquarius character set]]
** [[MacDingbat]]
+
** [[MacGreek]]
+
** [[MacGujarati]]
+
** [[MacGurmukhi]]
+
** [[MacIceland]]
+
** [[MacRoman]]
+
** [[MacRomania]]
+
** [[MacSymbol]]
+
** [[MacThai]]
+
** [[MacTurkish]]
+
** [[MacUkraine]]
+
 
* [[Morse code]]
 
* [[Morse code]]
* [[MS-DOS encodings]]
+
* [[MS-DOS encodings]] (IBM PC code pages)
** [[MS-DOS Latin US ]]
+
** [[CP437]] · [[CP737]] · [[CP775]] · [[CP850]] · [[CP851]] · [[CP852]] · [[CP855]] · [[CP857]] · [[CP860]] · [[CP861]] · [[CP862]] · [[CP863]] · [[CP864]] · [[CP865]] · [[CP866]] · [[CP869]]
** [[MS-DOS Greek ]]
+
* [[Palm OS character set]]
** [[MS-DOS Baltic Rim ]]
+
** [[MS-DOS Latin-1 ]]
+
** [[MS-DOS Greek 1 ]]
+
** [[MS-DOS Latin-2 ]]
+
** [[MS-DOS Cyrillic ]]
+
** [[MS-DOS Turkish ]]
+
** [[MS-DOS Portuguese ]]
+
** [[MS-DOS Icelandic ]]
+
** [[MS-DOS Hebrew ]]
+
** [[MS-DOS French Canada ]]
+
** [[MS-DOS Arabic ]]
+
** [[MS-DOS Nordic ]]
+
** [[MS-DOS Cyrillic CIS 1 ]]
+
** [[MS-DOS Greek 2 ]]
+
 
* [[PETSCII]] (or PET ASCII or CBM ASCII; used by Commodore computers)
 
* [[PETSCII]] (or PET ASCII or CBM ASCII; used by Commodore computers)
 
* [[Unicode]]
 
* [[Unicode]]
 +
** [[BOCU-1]]
 +
** [[CESU-8]]
 +
** [[GB18030]]
 +
** [[Punycode]]
 +
** [[SCSU]]
 +
** [[UCS-2]]
 
** [[UTF-1]]
 
** [[UTF-1]]
 
** [[UTF-7]]
 
** [[UTF-7]]
 
** [[UTF-8]]
 
** [[UTF-8]]
** [[CESU-8]]
 
** [[UTF-EBCDIC]]
 
 
** [[UTF-9]]
 
** [[UTF-9]]
 
** [[UTF-16]]
 
** [[UTF-16]]
** [[UCS-2]]
 
 
** [[UTF-18]]
 
** [[UTF-18]]
 
** [[UTF-32]] (UCS-4)
 
** [[UTF-32]] (UCS-4)
** [[GB18030]]
+
** [[UTF-EBCDIC]]
** [[Punycode]]
+
** [[WTF-8]]
 
* [[VISCII]]
 
* [[VISCII]]
 
* [[Windows encodings]]
 
* [[Windows encodings]]
** [[Windows 1252]] (ISO 8859-1 plus additional characters)
+
** [[Windows 1250]] · [[Windows 1251]] · [[Windows 1252]] · [[Windows 1253]] · [[Windows 1254]] · [[Windows 1255]] · [[Windows 1256]] · [[Windows 1257]] · [[Windows 1258]]
** [[Windows 1255]] (Hebrew)
+
* [[ZSCII]], used in Infocom games
** [[Windows 1256]] (Arabic, Farsi, Urdu)
+
** [[Windows 1257]] (Baltic Rim)
+
** [[Windows 1258]] (Vietnamese)
+
  
 
== Format details ==
 
== Format details ==
Line 143: Line 78:
 
* [[C0 controls]] (ASCII control characters, 7 bit)
 
* [[C0 controls]] (ASCII control characters, 7 bit)
 
* [[C1 controls]] (extended control characters, 8 bit)
 
* [[C1 controls]] (extended control characters, 8 bit)
 +
 +
== Character encoding naming and numbering systems ==
 +
* [[Code page identifier]]
 +
* [[IANA character set name]]
  
 
== Character escape codes ==
 
== Character escape codes ==
Line 149: Line 88:
 
* [[Backslash escapes]] (used in various programming and markup languages)
 
* [[Backslash escapes]] (used in various programming and markup languages)
 
* [[HTML character references]] (entities and numeric values)
 
* [[HTML character references]] (entities and numeric values)
 +
See also [[ANSI escape code]].
 +
 +
==Character memory storage types==
 +
===C++===
 +
* [[char (C++)]] at least 8 bits
 +
* [[char16_t]] no less than 16 bits, no less than char
 +
* [[char32_t]] no less than 32 bits, no less than char16_t
 +
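* [[wchar_t]] implementation-defined width, large enough to hold any character of the largest extended character set among the supported locales (commonly 16 bits on Windows and 32 bits on most Unix-like systems)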
 +
 +
====GLib library====
 +
* [[gunichar]] Unicode code point, stored in a fixed-width 32-bit unsigned integer
 +
 +
===Java Virtual Machine===
 +
* [[char (Java)]] exactly 16 bits, holding one [[UTF-16]] code unit (originally [[UCS-2]])
 +
* [[java.lang.Character]] exactly 16 bits, holding one [[UTF-16]] code unit (originally [[UCS-2]]), wrapped in an object
 +
 +
====Scala====
 +
Both char and scala.Char are wrappers around [[Java bytecode|JVM]]'s original types.
 +
 +
===.NET Framework===
 +
* [[System.Char]] exactly 16 bits, holding one [[UTF-16]] code unit (this is C#'s char)
 +
 +
===QuickBasic===
 +
There is no single-character data type. The STRING type holds up to 32,767 characters, each assumed to occupy 1 byte.
  
 
== Tools ==
 
== Tools ==
 +
* [http://www.gnu.org/software/libiconv/ GNU libiconv]
 +
* [https://apr.apache.org/ APR-iconv]
 +
* [http://site.icu-project.org/ ICU]
 +
* [https://github.com/pinard/Recode Recode]
 
* [http://www.kreativekorp.com/software/recode/ Kreative Recode: software to convert character encodings]
 
* [http://www.kreativekorp.com/software/recode/ Kreative Recode: software to convert character encodings]
  
 
== Commentary and satire ==
 
== Commentary and satire ==
 
* [http://www.joelonsoftware.com/articles/Unicode.html The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)] by [http://en.wikipedia.org/wiki/Joel_Spolsky Joel Spolsky]
 
* [http://www.joelonsoftware.com/articles/Unicode.html The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)] by [http://en.wikipedia.org/wiki/Joel_Spolsky Joel Spolsky]
[[Category:Character Encodings]]
+
* [http://moriel.smarterthanthat.com/tips/the-language-double-take-dealing-with-bidirectional-text-or-wait-tahw/ The Language Double-Take: Dealing with Bidirectional Text (or: Wait, ?tahW)]
 
* [http://geoff.greer.fm/2012/08/12/character-encoding-bugs-are-%F0%9D%92%9Cwesome/ Character encoding bugs are 𝒜wesome!]
 
* [http://geoff.greer.fm/2012/08/12/character-encoding-bugs-are-%F0%9D%92%9Cwesome/ Character encoding bugs are 𝒜wesome!]
 
* [http://xkcd.com/1209/ xkcd: Encoding]
 
* [http://xkcd.com/1209/ xkcd: Encoding]
 +
* [http://www.collegehumor.com/article/6872071/8-new-and-necessary-punctuation-marks 8 New Punctuation Marks We Desperately Need]
 +
* [http://mentalfloss.com/article/50380/8-symbols-we-turned-words 8 Symbols We Turned Into Words]
 +
* [https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name I Can Text You A Pile of Poo, But I Can’t Write My Name]
  
 
== Other external links ==
 
== Other external links ==

Revision as of 14:07, 29 January 2018

File Format
Name Character encoding
Ontology


Character encodings are methods of representing the characters of text, usually as numeric values that computers store as bits and bytes, but sometimes in other forms (e.g., Braille represents them as patterns of raised dots). They are sometimes also called "character sets", but strictly speaking a character set is merely a repertoire of characters: the list of characters supported by some system, protocol, or file format, with no inherent order or numbering. A character encoding assigns a specific value (in some coding system) to each character. In practice the distinction blurs. There are multiple levels of abstraction (Unicode defines a set of characters and assigns each a numeric code point, but leaves it to more specific encodings such as UTF-8 to define the bits and bytes that represent them in a file), and some protocols use parameter names such as 'charset' to indicate which character encoding is in use, so the terminology is inconsistent even in technical usage. This section documents character sets and encodings of all sorts.

See Fonts for how encoded characters are rendered on screens and printouts. The visual form of a character is known as a "glyph", and a font is a set of glyphs mapped onto the abstractly defined characters of the character set underlying a character encoding.
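
The layering described above (a character, its Unicode code point, and the bytes a particular encoding produces) can be illustrated with a minimal Python sketch. The sample character is an arbitrary choice made for illustration, and the variable names are not taken from any standard.

```python
# Illustrative sketch: one abstract character, several byte representations.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE (arbitrary sample character)

code_point = ord(ch)              # the Unicode code point, independent of any encoding
utf8 = ch.encode("utf-8")         # the bytes UTF-8 assigns to that code point
utf16 = ch.encode("utf-16-be")    # the bytes UTF-16 (big-endian) assigns
latin1 = ch.encode("iso-8859-1")  # the single byte ISO 8859-1 assigns

print(f"U+{code_point:04X}")  # U+00E9
print(utf8.hex())             # c3a9
print(utf16.hex())            # 00e9
print(latin1.hex())           # e9
```

The same code point yields different byte sequences under each encoding, which is why a file's bytes cannot be interpreted without knowing which encoding produced them.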

Specific character sets or encodings

Format details

Character encoding naming and numbering systems

Character escape codes

(used to enter characters in various systems and formats)

See also ANSI escape code.
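
As a rough illustration of how escape codes stand in for characters that are awkward to type or transmit, the following Python sketch writes the same string using a backslash escape and using HTML character references. It relies only on the standard library's html module; the sample word is an arbitrary choice.

```python
import html

# The same four-character string, entered three different ways.
backslash = "caf\u00e9"                      # backslash escape in a Python string literal
from_entity = html.unescape("caf&eacute;")   # HTML named character reference
from_numeric = html.unescape("caf&#233;")    # HTML decimal numeric reference
from_hex = html.unescape("caf&#xE9;")        # HTML hexadecimal numeric reference

print(backslash == from_entity == from_numeric == from_hex)  # True

# Going the other way: escape anything that does not fit in ASCII.
print(backslash.encode("ascii", "backslashreplace"))   # b'caf\\xe9'
print(backslash.encode("ascii", "xmlcharrefreplace"))  # b'caf&#233;'
```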

Character memory storage types

C++

  • char (C++) at least 8 bits
  • char16_t no less than 16 bits, no less than char
  • char32_t no less than 32 bits, no less than char16_t
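  • wchar_t implementation-defined width, large enough to hold any character of the largest extended character set among the supported locales (commonly 16 bits on Windows and 32 bits on most Unix-like systems)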
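
The width of these types can vary between platforms. One hedged way to inspect the local C ABI without compiling anything is Python's ctypes module, which exposes the platform's char and wchar_t; it has no direct equivalents of char16_t or char32_t, so those are omitted from this sketch.

```python
import ctypes

# Sizes are reported in bytes, as seen by the C ABI that this Python build uses.
sizes = {
    "char": ctypes.sizeof(ctypes.c_char),      # always 1 by definition
    "wchar_t": ctypes.sizeof(ctypes.c_wchar),  # platform-dependent
}

for name, size in sizes.items():
    print(f"{name}: {size} byte(s)")
```

The output differs between systems, which is the point: code that assumes a particular wchar_t width is not portable.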

GLib library

  • gunichar Unicode code point, stored in a fixed-width 32-bit unsigned integer

Java Virtual Machine

Scala

Both char and scala.Char are wrappers around JVM's original types.
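
Because a JVM char is exactly 16 bits wide, a character outside the Basic Multilingual Plane cannot fit in one char and is instead stored as two 16-bit units (a surrogate pair). The Python sketch below mimics that storage by splitting UTF-16 output into 16-bit units; the helper name utf16_code_units is hypothetical and exists only for this illustration.

```python
def utf16_code_units(text):
    """Return the 16-bit UTF-16 code units for text, as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# A BMP character occupies one 16-bit unit.
print([hex(u) for u in utf16_code_units("A")])           # ['0x41']

# U+1D49C (MATHEMATICAL SCRIPT CAPITAL A) needs a surrogate pair.
print([hex(u) for u in utf16_code_units("\U0001D49C")])  # ['0xd835', '0xdc9c']
```

In a 16-bit char model, the second string has a length of 2 even though it holds a single character.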

.NET Framework

QuickBasic

There is no single-character data type. The STRING type holds up to 32,767 characters, each assumed to occupy 1 byte.

Tools

Commentary and satire

Other external links

References
