Character encoding

File Format
Name	Character encoding
Ontology	Electronic File Formats > Character encoding

Latest revision as of 13:38, 28 August 2023

Character encodings are methods of representing the characters of text, usually as numeric values that computers can store as bits and bytes, but sometimes in other media (e.g., Braille represents them as patterns of raised dots). They are sometimes also called "character sets", but strictly speaking a character set is merely a repertoire: the list of characters supported by some system, protocol, or file format, with no inherent order or numbering. A character encoding assigns a specific value (in some coding system) to each character. In practice the distinction blurs, because there are multiple levels of abstraction: Unicode defines a set of characters and assigns each a numeric code point, but leaves it to more specific encodings such as UTF-8 to define the bits and bytes that represent them in a file. Some protocols even use parameter names such as 'charset' to indicate which character encoding is in use, so the terminology is loose even in technical usage. This section documents character sets and encodings of all kinds.
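The layers described above can be seen with a small Python sketch. It is illustrative only: the sample character and the variable names are arbitrary choices, not part of any standard. One character has one Unicode code point, but different encodings turn it into different bytes.

```python
# One abstract character, its Unicode code point, and its byte
# representation under three different encodings.
text = "é"  # LATIN SMALL LETTER E WITH ACUTE

code_point = ord(text)                  # the number Unicode assigns
utf8_bytes = text.encode("utf-8")       # two bytes in UTF-8
latin1_bytes = text.encode("latin-1")   # one byte in ISO 8859-1
utf16_bytes = text.encode("utf-16-be")  # one 16-bit unit in UTF-16

print(f"U+{code_point:04X}")  # U+00E9
print(utf8_bytes.hex())       # c3a9
print(latin1_bytes.hex())     # e9
print(utf16_bytes.hex())      # 00e9
```

The code point stays the same throughout; only the stored bytes depend on the encoding chosen.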

See Fonts for how characters are rendered on screens and printouts. The visual form of a character is called a "glyph", and a font is a set of glyphs mapped onto the more abstractly defined characters of the character set that underlies a character encoding.

Specific character sets or encodings

Adobe Standard Encoding
Amstrad CP/M Plus character set
ANSEL
- MARC-8
APL code page
- CP293 · CP907
Apple II character set
ARMSCII
ASCII
ATASCII (used by Atari computers)
Baudot code
Big5
- Windows Big5
Braille
- BRF
- Nemeth Code
- Taylor Code
CNS 11643
Compucolor character set
DEC (Digital Equipment Corporation)
- DEC Special Graphics Character Set
- PDP-1 alphanumeric codes
EBCDIC
- CP001 · CP037 · CP037-2 · CP256 · CP285 · CP293 · CP423 · CP424 · CP500 · CP875 · CP1026 · CP1047 · CP1140 · CP1146 · CP1148 · CP1155 · CP4971 · CP8616 · CP9067 · CP12712 · EBCDIC 6-Bit · UTF-EBCDIC
Flag semaphore
GB 2312
IBM: See EBCDIC, MS-DOS Encodings, and APL Code Page elsewhere in this list
ISO 646
- ISO 646-CA · ISO 646-CA2 · ISO 646-CH · ISO 646-CN · ISO 646-CU · ISO 646-DE · ISO 646-DK · ISO 646-ES · ISO 646-ES2 · ISO 646-FI · ISO 646-FR · ISO 646-FR1 · ISO 646-GB · ISO 646-HU · ISO 646-IE · ISO 646-IRV · ISO 646-IS · ISO 646-IT · ISO 646-JP · ISO 646-JP OCR-B · ISO 646-KR · ISO 646-MT · ISO 646-NL · ISO 646-NO · ISO 646-NO2 · ISO 646-PL · ISO 646-PT · ISO 646-PT2 · ISO 646-SE · ISO 646-SE2 · ISO 646-US · ISO 646-TW · ISO 646-YU
ISO 2022
- ISO 2022-CN · ISO 2022-CN-EXT · ISO 2022-JP · ISO 2022-JP-1 · ISO 2022-JP-2 · ISO 2022-JP-3 · ISO 2022-JP-2004 · ISO 2022-KR
ISO 8859
- ISO 8859-1 · ISO 8859-2 · ISO 8859-3 · ISO 8859-4 · ISO 8859-5 · ISO 8859-6 · ISO 8859-7 · ISO 8859-8 · ISO 8859-9 · ISO 8859-10 · ISO 8859-11 · ISO 8859-13 · ISO 8859-14 · ISO 8859-15 · ISO 8859-16
ISO-IR-165
JIS
- JIS X 0201
- JIS X 0208
- JIS X 0212
- JIS X 0213
- Shift-JIS
KS X 1001
KOI7
KOI8
- KOI8-CS (Czechoslovakia)
- KOI8-R (Russia)
- KOI8-U (Ukraine)
Macintosh encodings
- MacCE · MacCyrillic · MacDingbats · MacGreek · MacGujarati · MacGurmukhi · MacIcelandic · MacRoman · MacRomanian · MacSymbol · MacThai · MacTurkish
Mattel Aquarius character set
Morse code
MS-DOS encodings (IBM PC code pages)
- CP437 · CP737 · CP775 · CP850 · CP851 · CP852 · CP855 · CP857 · CP860 · CP861 · CP862 · CP863 · CP864 · CP865 · CP866 · CP869 · CP872 · CP17248
Palm OS character set
PETSCII (or PET ASCII or CBM ASCII; used by Commodore computers)
TRON code
Unicode
- BOCU-1
- CESU-8
- GB18030
- Punycode
- SCSU
- UCS-2
- UTF-1
- UTF-7
- UTF-8
- UTF-9
- UTF-16
- UTF-18
- UTF-32 (UCS-4)
- UTF-EBCDIC
- WTF-8
VISCII
Windows encodings
- Windows 1250 · Windows 1251 · Windows 1252 · Windows 1253 · Windows 1254 · Windows 1255 · Windows 1256 · Windows 1257 · Windows 1258 · Windows Big5 (Windows 950)
YUSCII
ZSCII, used in Infocom games
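
Why the choice among these encodings matters can be sketched in Python: the same byte value means different characters under different encodings. The codec names below are Python's own names for ISO 8859-1, Windows 1252, CP437, and KOI8-R. The byte value 0xE9 is an arbitrary example.

```python
# Decode one byte under four of the encodings listed above.
raw = bytes([0xE9])

decoded = {
    codec: raw.decode(codec)
    for codec in ("latin-1", "cp1252", "cp437", "koi8_r")
}
for codec, char in decoded.items():
    print(f"{codec:8} -> {char!r}")

# ASCII defines only 0x00-0x7F, so this byte is rejected outright.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print("valid ASCII:", ascii_ok)
```

ISO 8859-1 and Windows 1252 agree on this byte, while the DOS and KOI8 code pages assign it to other characters, which is how mislabelled text ends up garbled.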

Format details

Byte Order Mark
C0 controls (ASCII control characters, 7 bit)
C1 controls (extended control characters, 8 bit)
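
As a rough illustration of how a byte order mark is used, the sketch below checks the start of a byte string against the BOM constants in Python's standard codecs module. The function name sniff_bom and the table layout are hypothetical. The check is only a heuristic: a missing BOM says nothing about the encoding, and the bytes FF FE 00 00 could in principle also be UTF-16 text that begins with a NUL character.

```python
import codecs

# BOM constants from the standard library. UTF-32 comes first because
# its little-endian mark (FF FE 00 00) starts with the UTF-16 one (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print(sniff_bom(b"\xef\xbb\xbfhello"))  # ('utf-8', 3)
print(sniff_bom(b"hello"))              # (None, 0)
```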

Character encoding naming and numbering systems

Code page identifier
IANA character set name

Character escape codes

(used to enter characters in various systems and formats)

Alt codes (DOS/Windows)
Backslash escapes (used in various programming and markup languages)
HTML character references (entities and numeric values)
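
Two of these escape mechanisms can be tried directly in Python. This is a minimal sketch: the sample character is arbitrary, and html.unescape is the standard-library function for decoding HTML character references.

```python
import html

# Four spellings of the same character, U+00E9 (é).
from_backslash = "\u00e9"               # backslash escape in a string literal
from_named = html.unescape("&eacute;")  # HTML named character reference
from_decimal = html.unescape("&#233;")  # HTML decimal numeric reference
from_hex = html.unescape("&#xE9;")      # HTML hexadecimal numeric reference

print(from_backslash == from_named == from_decimal == from_hex)  # True
```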

Character memory storage types

C++

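char (C++) at least 8 bits
char16_t at least 16 bits, and no smaller than char
char32_t at least 32 bits, and no smaller than char16_t
wchar_t implementation-defined width, large enough to hold any character of the largest extended character set among the supported locales (commonly 16 bits on Windows and 32 bits on Linux and macOS)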
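
For a quick look at these widths on a particular machine, Python's ctypes module exposes the platform's C char and wchar_t as c_char and c_wchar. This is only a sketch: it reports the C ABI sizes of the machine it runs on, and it assumes a C++ compiler on the same platform uses the same sizes. ctypes has no counterpart for char16_t or char32_t.

```python
import ctypes

# Sizes in bytes of the platform's C character types.
char_bytes = ctypes.sizeof(ctypes.c_char)
wchar_bytes = ctypes.sizeof(ctypes.c_wchar)

print(f"char:    {char_bytes * 8} bits")
print(f"wchar_t: {wchar_bytes * 8} bits")
```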

GLib library

gchar (8 bit, same as C char)
guchar (8 bit, same as C unsigned char)
gunichar (32 bit, holds one Unicode code point)
gunichar2 (16 bit, holds one UTF-16 code unit)

Java Virtual Machine

char (Java) exactly 16 bits, holding one UTF-16 code unit (originally UCS-2; a character outside the Basic Multilingual Plane takes two chars, a surrogate pair)
java.lang.Character the same 16-bit value wrapped in an object
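
A 16-bit character type can hold only one 16-bit code unit, so characters beyond U+FFFF do not fit in a single value. The Python sketch below counts the UTF-16 code units a string needs. The helper name utf16_code_units is hypothetical, and the sample characters are arbitrary.

```python
def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

bmp_char = "\u00e9"         # é, inside the Basic Multilingual Plane
astral_char = "\U0001D49C"  # MATHEMATICAL SCRIPT CAPITAL A, outside it

print(utf16_code_units(bmp_char))             # 1
print(utf16_code_units(astral_char))          # 2
print(astral_char.encode("utf-16-be").hex())  # d835dc9c
```

The second character is one code point but two code units, so a length counted in 16-bit units reports it as 2.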

.NET Framework

System.Char exactly 16 bits, holding one UTF-16 code unit (this is C#'s char)

Pascal

Traditionally, Pascal stored 8-bit characters (in system-specific character sets). Some implementations also had a 'string' type: an array of characters whose zeroth element held the number of characters in the string, which limited the length to 255. Newer Pascal implementations have a variety of other types (see http://wiki.freepascal.org/Character_and_string_types).
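
The length-prefixed layout described above can be sketched in Python. The function names and the default latin-1 encoding are assumptions made for illustration. Real Pascal implementations used whatever 8-bit character set the system provided.

```python
def to_short_string(text: str, encoding: str = "latin-1") -> bytes:
    """Pack text as a length byte followed by the character bytes."""
    data = text.encode(encoding)
    if len(data) > 255:
        raise ValueError("a single length byte can describe at most 255 characters")
    return bytes([len(data)]) + data

def from_short_string(raw: bytes, encoding: str = "latin-1") -> str:
    """Read the length from element zero, then decode that many bytes."""
    length = raw[0]
    return raw[1:1 + length].decode(encoding)

packed = to_short_string("Hi")
print(packed)                     # b'\x02Hi'
print(from_short_string(packed))  # Hi
```

Because the length lives in one byte, the 255-character ceiling follows directly from the layout.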

QuickBasic

There is no dedicated single-character data type. The STRING type holds up to 32767 characters, each assumed to occupy 1 byte.

Scala

scala.Char (usually written Char) is a 16-bit value type that corresponds to the JVM's primitive char.

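Tools

GNU libiconv (http://www.gnu.org/software/libiconv/)
APR-iconv (https://apr.apache.org/)
ICU (http://site.icu-project.org/)
Recode (https://github.com/pinard/Recode)
Kreative Recode: software to convert character encodings (http://www.kreativekorp.com/software/recode/)
Binary to text converter (https://cryptii.com/pipes/binary-to-text)
Text to binary converter (https://www.convertbinary.com/)
Universal Cyrillic decoder (https://2cyr.com/decode/?lang=en; mirror: https://www.accent.bg/decode/)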

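Commentary and satire

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky (http://www.joelonsoftware.com/articles/Unicode.html)
The Language Double-Take: Dealing with Bidirectional Text (or: Wait, ?tahW) (http://moriel.smarterthanthat.com/tips/the-language-double-take-dealing-with-bidirectional-text-or-wait-tahw/)
Character encoding bugs are 𝒜wesome! (http://geoff.greer.fm/2012/08/12/character-encoding-bugs-are-%F0%9D%92%9Cwesome/)
xkcd: Encoding (http://xkcd.com/1209/)
8 New Punctuation Marks We Desperately Need (http://www.collegehumor.com/article/6872071/8-new-and-necessary-punctuation-marks)
8 Symbols We Turned Into Words (http://mentalfloss.com/article/50380/8-symbols-we-turned-words)
I Can Text You A Pile of Poo, But I Can’t Write My Name (https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name)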

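Other external links

Lots of character encoding charts (http://www.kreativekorp.com/charset/)
The Evolution of Character Codes, 1874–1968 (http://www.transbay.net/~enf/ascii/ascii.pdf)
IANA official character set list (https://www.iana.org/assignments/character-sets/character-sets.xhtml)
International register of escape sequences (https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ISO-IR.pdf)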

References

Ken Lunde, CJKV Information Processing, O'Reilly 2008, ISBN 978-0-596-51447-1 (has lots of information on encodings and Unicode in general, not only for CJKV locales)
IBM 3270 character set reference (1987)


@@ Line 1: / Line 1: @@
-{|
+{{FormatInfo
-|[[File Formats]]
+|formattype=electronic
-| >
+|thiscat=Character encoding
-|[[Electronic File Formats]]
+|image=Characters.png
-| >
+}}
-|Character Encoding
+'''Character Encodings''' are methods of representing characters of text, usually as numeric values which can be stored on computers as bits and bytes, but sometimes in other things (e.g., [[Braille]] represents them as patterns of raised dots). Sometimes they're also referred to as "character sets", but purists will make a distinction in that, strictly speaking, a character set is merely a repertoire of characters, the list of characters supported by some system, protocol, or file format, without it necessarily having any inherent order or numbering system. A character encoding assigns specific values (in some coding system) to each character. However, the distinction can get vague and fuzzy; there are multiple levels of abstraction ([[Unicode]] includes a set of defined characters as well as assigned numeric code points for each, but leaves it to other more specific encodings such as [[UTF-8]] to define the specific bits/bytes that represent them in a file), and some protocols even use parameter names such as 'charset' to indicate which character ''encoding'' is in use, so the terminology can slip and slide even in "tech" uses. This section documents all the various sorts of character sets/encodings of any sort.
-|}
+See [[Fonts]] for the renditions of character encodings as seen on screens and printouts. The appearance of a character is known as a "glyph", and a font consists of a set of glyphs mapped onto the more abstractly-defined characters as included in the character set that is part of a character encoding.
+== Specific character sets or encodings ==
+* [[Adobe Standard Encoding]]
+* [[Amstrad CP/M Plus character set]]
+* [[ANSEL]]
+** [[MARC-8]]
+* [[APL code page]]
+** [[CP293]] · [[CP907]]
+* [[Apple II character set]]
+* [[ARMSCII]]
 * [[ASCII]]
-** [[PET ASCII]] (or PETSCII or CBM-ASCII; used by Commodore computers)
+* [[ATASCII]] (used by Atari computers)
 * [[Baudot code]]
+* [[Big5]]
+** [[Windows Big5]]
 * [[Braille]]
+** [[BRF]]
+** [[Nemeth Code]]
+** [[Taylor Code]]
+* [[CNS 11643]]
+* [[Compucolor character set]]
+* DEC (Digital Equipment Corporation)
+** [[DEC Special Graphics Character Set]]
+** [[PDP-1 alphanumeric codes]]
 * [[EBCDIC]]
-* [[IBM PC code pages]]
+** [[CP001]] · [[CP037]] · [[CP037-2]] · [[CP256]] · [[CP285]] · [[CP293]] · [[CP423]] · [[CP424]] · [[CP500]] · [[CP875]] · [[CP1026]] · [[CP1047]] · [[CP1140]] · [[CP1146]] · [[CP1148]] · [[CP1155]] · [[CP4971]] · [[CP8616]] · [[CP9067]] · [[CP12712]] · [[EBCDIC 6-Bit]] · [[UTF-EBCDIC]]
+* [[Flag semaphore]]
+* [[GB 2312]]
+* IBM: See EBCDIC, MS-DOS Encodings, and APL Code Page elsewhere in this list
 * [[ISO 646]]
-** [[ISO 646-CA]] (Canada / French)
+** [[ISO 646-CA]] · [[ISO 646-CA2]] · [[ISO 646-CH]] · [[ISO 646-CN]] · [[ISO 646-CU]] · [[ISO 646-DE]] · [[ISO 646-DK]] · [[ISO 646-ES]] ·   [[ISO 646-ES2]] · [[ISO 646-FI]] · [[ISO 646-FR]] · [[ISO 646-FR1]] · [[ISO 646-GB]] · [[ISO 646-HU]] · [[ISO 646-IE]] · [[ISO 646-IRV]] ·  [[ISO 646-IS]] · [[ISO 646-IT]] · [[ISO 646-JP]] · [[ISO 646-JP OCR-B]] · [[ISO 646-KR]] · [[ISO 646-MT]] · [[ISO 646-NL]] · [[ISO 646-NO]] · [[ISO 646-NO2]] · [[ISO 646-PL]] · [[ISO 646-PT]] · [[ISO 646-PT2]] · [[ISO 646-SE]] · [[ISO 646-SE2]] · [[ISO 646-US]] · [[ISO 646-TW]] · [[ISO 646-YU]]
-** [[ISO 646-CA-2]] (Canada / French)
+* [[ISO 2022]]
-** [[ISO 646-CH]] (Switzerland)
+** [[ISO 2022-CN]] · [[ISO 2022-CN-EXT]] · [[ISO 2022-JP]] · [[ISO 2022-JP-1]] · [[ISO 2022-JP-2]] · [[ISO 2022-JP-3]] · [[ISO 2022-JP-2004]] · [[ISO 2022-KR]]
-** [[ISO 646-CN]] (China / Basic Latin)
-** [[ISO 646-CU]] (Cuba / Spanish)
-** [[ISO 646-DE]] (Germany)
-** [[ISO 646-DK]] (Denmark)
-** [[ISO 646-FI]] (Finland)
-** [[ISO 646-FR]] (France)
-** [[ISO 646-GB]] (Great Britain)
-** [[ISO 646-HU]] (Hungary)
-** [[ISO 646-IRV]] (International Reference Version)
-** [[ISO 646-IT]] (Italy)
-** [[ISO 646-JP]] (Japan / Romaji)
-** [[ISO 646-JP OCR-B]] (Japan / Romaji)
-** [[ISO 646-KR]] (Korea / Latin)
-** [[ISO 646-MT]] (Malta)
-** [[ISO 646-NL]] (Netherlands)
-** [[ISO 646-NO]] (Norway)
-** [[ISO 646-NO-2]] (Norway)
-** [[ISO 646-PT]] (Portugal)
-** [[ISO 646-SE]] (Sweden)
-** [[ISO 646-SE-2]] (Sweden)
-** [[ISO 646-US]] (Same as [[ASCII]])
-** [[ISO 646-YU]] (Yugoslavia)
 * [[ISO 8859]]
-** [[ISO 8859-1]] (Latin-1)
+** [[ISO 8859-1]] · [[ISO 8859-2]] · [[ISO 8859-3]] · [[ISO 8859-4]] · [[ISO 8859-5]] · [[ISO 8859-6]] · [[ISO 8859-7]] · [[ISO 8859-8]] · [[ISO 8859-9]] · [[ISO 8859-10]] · [[ISO 8859-11]] · [[ISO 8859-13]] · [[ISO 8859-14]] · [[ISO 8859-15]] · [[ISO 8859-16]]
-** [[ISO 8859-2]] (Latin-2, Central/East European)
+* [[ISO-IR-165]]
-** [[ISO 8859-3]] (Latin-3, Esperanto, Galician, Maltese, and Turkish)
-** [[ISO 8859-4]] (Latin-4, Scandinavian and Baltic)
-** [[ISO 8859-5]] (Cyrillic)
-** [[ISO 8859-6]] (Arabic)
-** [[ISO 8859-7]] (Modern Greek)
-** [[ISO 8859-8]] (Hebrew)
-** [[ISO 8859-9]] (Latin-5, Turkish)
-** [[ISO 8859-10]] (Latin-6, Lappish, Nordic, and Inuit)
-** [[ISO 8859-11]] (Thai)
-** [[ISO 8859-13]] (Latin-7, Baltic Rim)
-** [[ISO 8859-14]] (Celtic)
-** [[ISO 8859-15]] (Latin-9, Latin-1 with a Euro sign)
-** [[ISO 8859-16]] (Romanian)
 * [[JIS]]
 ** [[JIS X 0201]]
 ** [[JIS X 0208]]
+** [[JIS X 0212]]
+** [[JIS X 0213]]
 ** [[Shift-JIS]]
+* [[KS X 1001]]
+* [[KOI7]]
 * [[KOI8]]
 ** [[KOI8-CS]] (Czechoslovakia)
@@ Line 64: / Line 57: @@
 ** [[KOI8-U]] (Ukraine)
 * [[Macintosh encodings]]
-** [[MacCE]]
+** [[MacCE]] · [[MacCyrillic]] · [[MacDingbats]] · [[MacGreek]] · [[MacGujarati]] · [[MacGurmukhi]] · [[MacIcelandic]] · [[MacRoman]] · [[MacRomanian]] · [[MacSymbol]] · [[MacThai]] · [[MacTurkish]]
-** [[MacCyrillic]]
+* [[Mattel Aquarius character set]]
-** [[MacDingbat]]
-** [[MacGreek]]
-** [[MacGujarati]]
-** [[MacGurmukhi]]
-** [[MacIceland]]
-** [[MacRoman]]
-** [[MacRomania]]
-** [[MacSymbol]]
-** [[MacThai]]
-** [[MacTurkish]]
-** [[MacUkraine]]
 * [[Morse code]]
+* [[MS-DOS encodings]] (IBM PC code pages)
+** [[CP437]] · [[CP737]] · [[CP775]] · [[CP850]] · [[CP851]] · [[CP852]] · [[CP855]] · [[CP857]] · [[CP860]] · [[CP861]] · [[CP862]] · [[CP863]] · [[CP864]] · [[CP865]] · [[CP866]] · [[CP869]] · [[CP872]] · [[CP17248]]
+* [[Palm OS character set]]
+* [[PETSCII]] (or PET ASCII or CBM ASCII; used by Commodore computers)
+* [[TRON code]]
 * [[Unicode]]
+** [[BOCU-1]]
+** [[CESU-8]]
+** [[GB18030]]
+** [[Punycode]]
+** [[SCSU]]
+** [[UCS-2]]
 ** [[UTF-1]]
-** [[UTF-16]]
-** [[UTF-8]]
 ** [[UTF-7]]
+** [[UTF-8]]
+** [[UTF-9]]
+** [[UTF-16]]
+** [[UTF-18]]
+** [[UTF-32]] (UCS-4)
 ** [[UTF-EBCDIC]]
+** [[WTF-8]]
 * [[VISCII]]
 * [[Windows encodings]]
-** [[Windows 1252]] (ISO 8859-1 plus additional characters)
+** [[Windows 1250]] · [[Windows 1251]] · [[Windows 1252]] · [[Windows 1253]] · [[Windows 1254]] · [[Windows 1255]] · [[Windows 1256]] · [[Windows 1257]] · [[Windows 1258]] · [[Windows Big5]] (Windows 950)
-** [[Windows 1255]] (Hebrew)
+* [[YUSCII]]
-** [[Windows 1256]] (Arabic, Farsi, Urdu)
+* [[ZSCII]], used in Infocom games
-** [[Windows 1257]] (Baltic Rim)
-** [[Windows 1258]] (Vietnamese)
-== External links ==
+== Format details ==
+* [[Byte Order Mark]]
+* [[C0 controls]] (ASCII control characters, 7 bit)
+* [[C1 controls]] (extended control characters, 8 bit)
+== Character encoding naming and numbering systems ==
+* [[Code page identifier]]
+* [[IANA character set name]]
+== Character escape codes ==
+(used to enter characters in various systems and formats)
+* [[Alt codes]] (DOS/Windows)
+* [[Backslash escapes]] (used in various programming and markup languages)
+* [[HTML character references]] (entities and numeric values)
+See also [[ANSI escape code]].
+==Character memory storage types==
+===C++===
+* [[char (C++)]] at least 8 bits
+* [[char16_t]] no less than 16 bits, no less than char
+* [[char32_t]] no less than 32 bits, no less than char16_t
+* [[wchar_t]] whatever the largest block of addressable memory happens to be on the system
+====GLib library====
+* [[gchar]] (8 bit, same as C char)
+* [[guchar]] (8 bit, same as C unsigned char)
+* [[gunichar]] Unicode character (32 bit)
+* [[gunichar2]] (16 bit)
+===Java Virtual Machine===
+* [[char (Java)]] exactly 16 bits that represent [[UCS2]]
+* [[java.lang.Character]] exactly 16 bits that represent [[UCS2]] wrapped in an object
+===.Net framework===
+* [[System.Char]] exactly 16 bits that represent [[UCS2]] (This is C#'s char)
+===Pascal===
+Traditionally, Pascal stored 8-bit characters (in system-specific character sets), and some implementations also had a 'string' type that was an array of characters with the zeroth element containing the number of characters in the string (maximum 255). Newer Pascal implementations have a variety of other types. [http://wiki.freepascal.org/Character_and_string_types]
+===QuickBasic===
+There is no single character datatype. (There's STRING that holds up to 32767 characters assumed to be 1 byte each).
+====Scala====
+Both char and scala.Char are wrappers around [[Java bytecode|JVM]]'s original types.
+== Tools ==
+* [http://www.gnu.org/software/libiconv/ GNU libiconv]
+* [https://apr.apache.org/ APR-iconv]
+* [http://site.icu-project.org/ ICU]
+* [https://github.com/pinard/Recode Recode]
+* [http://www.kreativekorp.com/software/recode/ Kreative Recode: software to convert character encodings]
+* [https://cryptii.com/pipes/binary-to-text Binary to text converter]
+* [https://www.convertbinary.com/ Text to binary converter]
+* [https://2cyr.com/decode/?lang=en Universal Cyrillic decoder] ([https://www.accent.bg/decode/ mirror])
+== Commentary and satire ==
+* [http://www.joelonsoftware.com/articles/Unicode.html The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)] by [http://en.wikipedia.org/wiki/Joel_Spolsky Joel Spolsky]
+* [http://moriel.smarterthanthat.com/tips/the-language-double-take-dealing-with-bidirectional-text-or-wait-tahw/ The Language Double-Take: Dealing with Bidirectional Text (or: Wait, ?tahW)]
+* [http://geoff.greer.fm/2012/08/12/character-encoding-bugs-are-%F0%9D%92%9Cwesome/ Character encoding bugs are 𝒜wesome!]
+* [http://xkcd.com/1209/ xkcd: Encoding]
+* [http://www.collegehumor.com/article/6872071/8-new-and-necessary-punctuation-marks 8 New Punctuation Marks We Desperately Need]
+* [http://mentalfloss.com/article/50380/8-symbols-we-turned-words 8 Symbols We Turned Into Words]
+* [https://modelviewculture.com/pieces/i-can-text-you-a-pile-of-poo-but-i-cant-write-my-name I Can Text You A Pile of Poo, But I Can’t Write My Name]
+== Other external links ==
+* [http://www.kreativekorp.com/charset/ Lots of character encoding charts]
 * [http://www.transbay.net/~enf/ascii/ascii.pdf The Evolution of Character Codes, 1874–1968]
-* [http://www.kreativekorp.com/charset/ Collection of character encodings]]
+* [http://www.kreativekorp.com/charset/ Collection of character encodings]
+* [https://www.iana.org/assignments/character-sets/character-sets.xhtml IANA official character set list]
+* [https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ISO-IR.pdf International register of escape sequences]
+== References ==
+* Ken Lunde, ''CJKV Information Processing'', O'Reilly 2008, ISBN 978-0-596-51447-1 (has lots of information on encodings and Unicode in general, not only for CJKV locales)
+* [http://archive.org/details/bitsavers_ibm3270GA2SetReferenceApr87_34686991 IBM 3270 character set reference (1987)]