UTF-8

From Just Solve the File Format Problem
{{FormatInfo
|formattype=electronic
|subcat=Character encoding
|subcat2=Unicode
|charset=UTF-8
|charsetaliases=csUTF8
|mibenum=106
}}
'''UCS Transformation Format—8-bit''' ('''UTF-8''') is a byte-oriented [[Unicode]] [[Character Encodings|character encoding]]. It is compatible with [[ASCII]]: byte values 0–127 (00–7F hexadecimal) represent the equivalent ASCII characters and are never used in any other context, so they never appear inside a multi-byte sequence.

The byte values FE and FF never occur in UTF-8. A document may begin with an optional [[Byte Order Mark]], which UTF-8 encodes as the bytes EF BB BF. Since UTF-8 has no "endianness," this is not actually a byte order indicator, but it can be treated as a signature indicating that the document is UTF-8 encoded.

UTF-8 is most compact for text that makes heavy use of the Roman alphabet. For other scripts it may be less compact than [[UTF-16]]; for example, most Chinese, Japanese, and Korean characters take 3 bytes in UTF-8 but 2 in UTF-16.

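The size trade-off described above can be checked with a short Python sketch. The sample strings and the helper name <code>encoded_sizes</code> are illustrative assumptions, not part of any specification; the UTF-16 count uses the <code>utf-16-le</code> codec so that no byte order mark is included.

```python
# Compare encoded sizes of short samples in UTF-8 and UTF-16.
# "utf-16-le" is used so that no byte order mark is added to the count.
samples = {
    "ascii": "Hello, world",
    "greek": "Καλημέρα κόσμε",
    "japanese": "こんにちは 世界",
}


def encoded_sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


for name, text in samples.items():
    utf8_len, utf16_len = encoded_sizes(text)
    print(f"{name}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")

# ASCII text is byte-for-byte identical in ASCII and UTF-8.
assert "Hello, world".encode("utf-8") == "Hello, world".encode("ascii")
```

Pure ASCII text should come out half the size in UTF-8, while text made up mostly of 3-byte characters should come out larger than in UTF-16.
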
== Format ==

A Unicode code point is encoded as 1, 2, 3, or 4 bytes. (Early versions of UTF-8 defined sequences longer than 4 bytes, but these are not valid in Unicode.) Code points U+0000 to U+007F use 1 byte, U+0080 to U+07FF use 2, U+0800 to U+FFFF use 3, and U+10000 to U+10FFFF use 4.

The version described in the previous paragraph may be called UTF-M-8, to distinguish it from UTF-G-8, the original specification (at first simply called "UTF-8"), which extends to 31 bits. An extension to 63 bits, allegedly supported in [[Perl]], is called UTF-E-8. The name "UTF-8" by itself often refers to UTF-M-8.

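The byte-length ranges in this section can be sketched in Python as follows. This is a minimal illustration of the 4-byte form only, not a reference implementation. The function names <code>utf8_length</code> and <code>encode_code_point</code> are hypothetical, and the lead and continuation byte bit patterns, as well as the exclusion of surrogate code points, are taken from the standard UTF-8 definition rather than from the text above. Each result is checked against Python's built-in encoder.

```python
def utf8_length(code_point):
    """Number of bytes UTF-8 uses for one code point (4-byte maximum)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("outside the Unicode range U+0000..U+10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not encoded in UTF-8")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


def encode_code_point(code_point):
    """Encode one code point by hand, using lead and continuation bytes."""
    length = utf8_length(code_point)
    if length == 1:
        return bytes([code_point])
    # Lead-byte prefixes: 110xxxxx, 1110xxxx, 11110xxx.
    lead_prefix = {2: 0xC0, 3: 0xE0, 4: 0xF0}[length]
    out = []
    value = code_point
    # Peel off six payload bits at a time for the continuation bytes (10xxxxxx).
    for _ in range(length - 1):
        out.append(0x80 | (value & 0x3F))
        value >>= 6
    # Whatever bits remain go into the lead byte.
    out.append(lead_prefix | value)
    return bytes(reversed(out))


# Compare the hand-rolled encoder with Python's built-in one.
for cp in (0x41, 0xE9, 0x20AC, 0x1D49C):
    assert encode_code_point(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X}: {utf8_length(cp)} byte(s), {encode_code_point(cp).hex(' ')}")
```

For U+1D49C (𝒜) the sketch yields the bytes F0 9D 92 9C, the same sequence that appears percent-encoded in one of the external links on this page.
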
== In MySQL ==

[[MySQL]] [https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html calls full UTF-8 utf8mb4], having unfortunately already used the name 'utf8' for a limited subset that allows at most three bytes per character and so covers only the BMP (excluding characters past U+FFFF, or 65535 decimal). This continues a long computer-industry tradition of mangling character encoding standards, from [[PETSCII]] to serving [[Windows 1252]] as [[ISO 8859-1]].

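As a rough illustration of the three-byte limit, the following Python sketch tests whether a string stays within the BMP and would therefore fit the limited subset. The function names are hypothetical, and the code does not talk to MySQL or model how it handles characters outside the subset; it only applies the U+FFFF boundary stated above.

```python
def fits_three_byte_subset(text):
    """True if every character is at or below U+FFFF (at most 3 UTF-8 bytes)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


def supplementary_characters(text):
    """List the characters past U+FFFF, which need 4 bytes in UTF-8."""
    return [ch for ch in text if ord(ch) > 0xFFFF]


for sample in ("café", "こんにちは 世界", "𝒜wesome"):
    if fits_three_byte_subset(sample):
        print(f"{sample!r}: fits in three bytes per character")
    else:
        extras = ", ".join(f"U+{ord(ch):04X}" for ch in supplementary_characters(sample))
        print(f"{sample!r}: needs four-byte sequences for {extras}")
```
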
== See also ==
* [[Byte Order Mark]]
* [[CESU-8]]
* [[WTF-8]]

== Specifications ==
 
== External links ==
 
* [[Wikipedia:UTF-8]]
* [http://doc.cat-v.org/plan_9/4th_edition/papers/utf Hello World or Καλημέρα κόσμε or こんにちは 世界] ([http://plan9.bell-labs.com/sys/doc/utf.pdf PDF])
* [http://geoff.greer.fm/2012/08/12/character-encoding-bugs-are-%F0%9D%92%9Cwesome/ Character encoding bugs are 𝒜wesome!]
* [http://doc.cat-v.org/bell_labs/utf-8_history The history of UTF-8 as told by Rob Pike]

[[Category:Text encoding]]

Latest revision as of 19:09, 23 August 2020

