Character encoding

From Just Solve the File Format Problem
(Difference between revisions)
(Specific character sets or encodings)
(10 intermediate revisions by one user not shown)
Line 33: Line 33:
 
** [[PDP-1 alphanumeric codes]]
 
** [[PDP-1 alphanumeric codes]]
 
* [[EBCDIC]]
 
* [[EBCDIC]]
** [[CP001]] · [[CP037]] · [[CP285]] · [[CP293]] · [[CP423]] · [[CP424]] · [[CP500]] · [[CP875]] · [[CP1026]] · [[CP1047]] · [[CP1140]] · [[CP1148]] · [[CP1155]] · [[CP4971]] · [[CP8616]] · [[CP9067]] · [[CP12712]] · [[EBCDIC 6-Bit]] · [[UTF-EBCDIC]]
+
** [[CP001]] · [[CP037]] · [[CP037-2]] · [[CP256]] · [[CP285]] · [[CP293]] · [[CP423]] · [[CP424]] · [[CP500]] · [[CP875]] · [[CP1026]] · [[CP1047]] · [[CP1140]] · [[CP1146]] · [[CP1148]] · [[CP1155]] · [[CP4971]] · [[CP8616]] · [[CP9067]] · [[CP12712]] · [[EBCDIC 6-Bit]] · [[UTF-EBCDIC]]
 
* [[Flag semaphore]]
 
* [[Flag semaphore]]
 
* [[GB 2312]]
 
* [[GB 2312]]
Line 50: Line 50:
 
** [[JIS X 0213]]
 
** [[JIS X 0213]]
 
** [[Shift-JIS]]
 
** [[Shift-JIS]]
 +
* [[KS X 1001]]
 +
* [[KOI7]]
 
* [[KOI8]]
 
* [[KOI8]]
 
** [[KOI8-CS]] (Czechoslovakia)
 
** [[KOI8-CS]] (Czechoslovakia)
Line 55: Line 57:
 
** [[KOI8-U]] (Ukraine)
 
** [[KOI8-U]] (Ukraine)
 
* [[Macintosh encodings]]
 
* [[Macintosh encodings]]
** [[MacCE]] · [[MacCyrillic]] · [[MacDingbats]] · [[MacGreek]] · [[MacGujarati]] · [[MacGurmukhi]] · [[MacIcelandic]] · [[MacRoman]] · [[MacRomania]] · [[MacSymbol]] · [[MacThai]] · [[MacTurkish]]
+
** [[MacCE]] · [[MacCyrillic]] · [[MacDingbats]] · [[MacGreek]] · [[MacGujarati]] · [[MacGurmukhi]] · [[MacIcelandic]] · [[MacRoman]] · [[MacRomanian]] · [[MacSymbol]] · [[MacThai]] · [[MacTurkish]]
 
* [[Mattel Aquarius character set]]
 
* [[Mattel Aquarius character set]]
 
* [[Morse code]]
 
* [[Morse code]]
 
* [[MS-DOS encodings]] (IBM PC code pages)
 
* [[MS-DOS encodings]] (IBM PC code pages)
** [[CP437]] · [[CP737]] · [[CP775]] · [[CP850]] · [[CP851]] · [[CP852]] · [[CP855]] · [[CP857]] · [[CP860]] · [[CP861]] · [[CP862]] · [[CP863]] · [[CP864]] · [[CP865]] · [[CP866]] · [[CP869]]
+
** [[CP437]] · [[CP737]] · [[CP775]] · [[CP850]] · [[CP851]] · [[CP852]] · [[CP855]] · [[CP857]] · [[CP860]] · [[CP861]] · [[CP862]] · [[CP863]] · [[CP864]] · [[CP865]] · [[CP866]] · [[CP869]] · [[CP872]] · [[CP17248]]
 
* [[Palm OS character set]]
 
* [[Palm OS character set]]
 
* [[PETSCII]] (or PET ASCII or CBM ASCII; used by Commodore computers)
 
* [[PETSCII]] (or PET ASCII or CBM ASCII; used by Commodore computers)
Line 132: Line 134:
 
* [https://github.com/pinard/Recode Recode]
 
* [https://github.com/pinard/Recode Recode]
 
* [http://www.kreativekorp.com/software/recode/ Kreative Recode: software to convert character encodings]
 
* [http://www.kreativekorp.com/software/recode/ Kreative Recode: software to convert character encodings]
 +
* [https://cryptii.com/pipes/binary-to-text Binary to text converter]
 +
* [https://www.convertbinary.com/ Text to binary converter]
  
 
== Commentary and satire ==
 
== Commentary and satire ==
Line 146: Line 150:
 
* [http://www.transbay.net/~enf/ascii/ascii.pdf The Evolution of Character Codes, 1874–1968]
 
* [http://www.transbay.net/~enf/ascii/ascii.pdf The Evolution of Character Codes, 1874–1968]
 
* [http://www.kreativekorp.com/charset/ Collection of character encodings]
 
* [http://www.kreativekorp.com/charset/ Collection of character encodings]
 +
* [https://www.iana.org/assignments/character-sets/character-sets.xhtml IANA official character set list]
 +
* [https://www.itscj.ipsj.or.jp/itscj_english/iso-ir/ISO-IR.pdf International register of escape sequences]
  
 
== References ==
 
== References ==
 
* Ken Lunde, ''CJKV Information Processing'', O'Reilly 2008, ISBN 978-0-596-51447-1 (has lots of information on encodings and Unicode in general, not only for CJKV locales)
 
* Ken Lunde, ''CJKV Information Processing'', O'Reilly 2008, ISBN 978-0-596-51447-1 (has lots of information on encodings and Unicode in general, not only for CJKV locales)
 
* [http://archive.org/details/bitsavers_ibm3270GA2SetReferenceApr87_34686991 IBM 3270 character set reference (1987)]
 
* [http://archive.org/details/bitsavers_ibm3270GA2SetReferenceApr87_34686991 IBM 3270 character set reference (1987)]

Revision as of 14:28, 21 June 2019

File Format
Name Character encoding
Ontology


Character encodings are methods of representing the characters of text, usually as numeric values that computers can store as bits and bytes, but sometimes in other forms (e.g., Braille represents them as patterns of raised dots). They are sometimes also called "character sets", but purists draw a distinction: strictly speaking, a character set is merely a repertoire of characters, the list of characters supported by some system, protocol, or file format, with no inherent order or numbering. A character encoding assigns a specific value (in some coding system) to each character. The distinction can get fuzzy, however. There are multiple levels of abstraction: Unicode defines both a set of characters and a numeric code point for each, but leaves it to more specific encodings such as UTF-8 to define the actual bits and bytes that represent them in a file. Some protocols even use parameter names such as 'charset' to indicate which character encoding is in use, so the terminology slips even in technical usage. This section documents character sets and encodings of all sorts.
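
The layering between character, code point, and encoded bytes can be illustrated with a short Python sketch. The sample character, the helper name encodings_of, and the choice of encodings are illustrative assumptions, not something taken from this page; the codec names are those of Python's standard library.

```python
# One abstract character, one Unicode code point, several byte representations.

def encodings_of(ch: str, names: tuple) -> dict:
    """Return the bytes that each named encoding uses for the character ch.

    Hypothetical helper for illustration only.
    """
    return {name: ch.encode(name) for name in names}


ch = "é"  # LATIN SMALL LETTER E WITH ACUTE, chosen only as an example

# The code point is the number Unicode assigns to the character.
code_point = ord(ch)
print(f"code point: U+{code_point:04X}")

# Each encoding maps that same character to its own byte sequence.
for name, raw in encodings_of(ch, ("utf-8", "utf-16-le", "latin-1", "cp437")).items():
    print(f"{name:10} {raw.hex()}")
```

The code point stays the same throughout; only the stored bytes differ from one encoding to the next, which is why a file's bytes cannot be interpreted without knowing which encoding was used.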

See Fonts for the renditions of character encodings as seen on screens and printouts. The appearance of a character is known as a "glyph", and a font consists of a set of glyphs mapped onto the more abstractly-defined characters as included in the character set that is part of a character encoding.

Contents

Specific character sets or encodings

Format details

Character encoding naming and numbering systems

Character escape codes

(used to enter characters in various systems and formats)

See also ANSI escape code.

Character memory storage types

C++

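  • char (C++) at least 8 bits
  • char16_t at least 16 bits, and no smaller than char
  • char32_t at least 32 bits, and no smaller than char16_t
  • wchar_t implementation-defined width, large enough for any character in the largest extended character set the implementation supports (commonly 16 bits on Windows and 32 bits on most Unix-like systems)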
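
As a rough illustration of why the width of the storage type matters, the following Python sketch counts how many 8-bit, 16-bit, and 32-bit code units the same text needs in UTF-8, UTF-16, and UTF-32. Pairing those three forms with char, char16_t, and char32_t reflects common usage and is an assumption here, since a C++ char may hold data in any 8-bit encoding. The function name code_units and the sample strings are hypothetical.

```python
def code_units(text: str) -> dict:
    """Count the code units needed to store text in each Unicode encoding form."""
    return {
        # 8-bit units, as typically held in char
        "utf-8": len(text.encode("utf-8")),
        # 16-bit units, as typically held in char16_t (2 bytes per unit)
        "utf-16": len(text.encode("utf-16-le")) // 2,
        # 32-bit units, as typically held in char32_t (4 bytes per unit)
        "utf-32": len(text.encode("utf-32-le")) // 4,
    }


# Samples: ASCII letter, accented Latin letter, euro sign, and an emoji
# that lies outside the Basic Multilingual Plane.
for sample in ("A", "é", "€", "😀"):
    print(repr(sample), code_units(sample))
```

Only the 32-bit form uses exactly one unit per code point for every sample; the narrower forms need several units for characters outside their single-unit range.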

Glib library

  • gunichar a single Unicode code point, stored as a fixed-width 32-bit value (a typedef for guint32)

Java Virtual Machine

.Net framework

Pascal

Traditionally, Pascal stored 8-bit characters (in system-specific character sets), and some implementations also had a 'string' type that was an array of characters with the zeroth element containing the number of characters in the string (maximum 255). Newer Pascal implementations have a variety of other types. [1]
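
The length-prefixed layout described above can be sketched in Python as follows. The function names and the default latin-1 encoding are assumptions made for the example, standing in for whatever system-specific 8-bit character set a given Pascal implementation used.

```python
def to_pascal_string(text: str, encoding: str = "latin-1") -> bytes:
    """Pack text as a length byte (element zero) followed by the characters."""
    data = text.encode(encoding)
    if len(data) > 255:
        # The length must fit in a single byte, hence the 255-character limit.
        raise ValueError("a length-prefixed string holds at most 255 characters")
    return bytes([len(data)]) + data


def from_pascal_string(raw: bytes, encoding: str = "latin-1") -> str:
    """Read the length from element zero, then decode that many characters."""
    length = raw[0]
    return raw[1:1 + length].decode(encoding)


packed = to_pascal_string("Hello")
print(packed)                      # length byte 5, then the five characters
print(from_pascal_string(packed))  # round-trips back to the original text
```

Because the count lives in a single byte, the format needs no terminator character, but it cannot represent strings longer than 255 characters.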

QuickBasic

There is no single-character datatype. The STRING type holds up to 32,767 characters, each assumed to occupy one byte.

Scala

Both Char and scala.Char name the same type, which corresponds to the JVM's primitive char type (a 16-bit value holding one UTF-16 code unit).

Tools

Commentary and satire

Other external links

References
