UTF-8

From Just Solve the File Format Problem

Revision as of 12:34, 25 June 2014

File Format
Name UTF-8
Ontology

UCS Transformation Format, 8-bit (UTF-8) is a byte-oriented Unicode character encoding. It offers good compatibility with ASCII: code points 0–127 (00–7F hexadecimal) are encoded as single bytes with the same values as the equivalent ASCII characters, and those byte values never appear as part of a multi-byte sequence.
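The ASCII compatibility described above can be checked with a short Python sketch. The helper names `ascii_bytes` and `ascii_chars` and the sample string are made up for illustration; only the built-in `str.encode` is relied on.

```python
def ascii_bytes(text: str) -> bytes:
    """Return the bytes below 0x80 found in the UTF-8 encoding of text."""
    return bytes(b for b in text.encode("utf-8") if b < 0x80)


def ascii_chars(text: str) -> bytes:
    """Return only the ASCII characters of text, encoded as ASCII."""
    return "".join(ch for ch in text if ord(ch) < 0x80).encode("ascii")


# Hypothetical sample mixing ASCII with 2-byte and 3-byte characters.
sample = "na\u00efve caf\u00e9 \u2615"

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
print("plain ASCII".encode("utf-8") == "plain ASCII".encode("ascii"))

# Bytes below 0x80 come only from the ASCII characters themselves,
# never from the inside of a multi-byte sequence.
print(ascii_bytes(sample) == ascii_chars(sample))
```

Both comparisons should print True, which is what lets ASCII-oriented tools search UTF-8 data for bytes such as a slash or a newline without hitting false matches.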

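UTF-8 is most efficient for text that consists mostly of ASCII characters, such as text in scripts based on the Roman alphabet. For other scripts it can be less compact than UTF-16: most Chinese, Japanese, and Korean characters, for example, take 3 bytes in UTF-8 but only 2 in UTF-16.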
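As a rough illustration of this size trade-off, the following Python sketch compares encoded lengths for two short strings. The function name `encoded_sizes` and the sample strings are assumptions chosen for the example; `utf-16-le` is used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Hypothetical samples: ASCII-only text and five hiragana characters.
english = "Hello, world"
japanese = "\u3053\u3093\u306b\u3061\u306f"

print(encoded_sizes(english))   # (12, 24)
print(encoded_sizes(japanese))  # (15, 10)
```

Real documents often mix scripts with ASCII markup, digits, and whitespace, so the balance for a given file may differ from these isolated samples.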

Format

A Unicode code point is encoded as either 1, 2, 3, or 4 bytes. (Early versions of UTF-8 defined sequences with more than 4 bytes, but they are obsolete.) Code points U+0000 to U+007F use 1 byte, U+0080 to U+07FF use 2, U+0800 to U+FFFF use 3, and U+10000 to U+10FFFF use 4.
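The ranges above can be turned into a minimal encoder sketch in Python. The function name `utf8_encode` is hypothetical. The lead-byte and continuation-byte bit patterns, and the rejection of surrogate code points, are not stated in this article; they follow the standard UTF-8 definition. In practice the built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates are reserved for UTF-16 and are not valid in UTF-8.
        raise ValueError("surrogate code points cannot be encoded")
    if cp <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Check the boundary of each range against Python's built-in encoder.
for cp in (0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {len(utf8_encode(cp))} byte(s)")
```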

In MySQL

MySQL calls full UTF-8 utf8mb4 (see https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html), after making the unfortunate move of using the name 'utf8' for a limited subset that allows at most three bytes per character. That subset covers only the BMP and excludes every character past U+FFFF (65535 decimal). This continues a long computer-industry tradition of mangling character encoding standards, from PETSCII to serving Windows-1252 as ISO 8859-1.
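A small Python sketch can flag text that would not fit in the three-byte subset described above. The function name `needs_utf8mb4` is hypothetical and is not part of MySQL or any MySQL client library; the check relies only on the U+FFFF limit stated in this section.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if text contains a code point beyond the BMP (above U+FFFF)."""
    return any(ord(ch) > 0xFFFF for ch in text)


# Hypothetical samples: accented Latin text, and an emoji at U+1F600.
print(needs_utf8mb4("caf\u00e9"))          # False
print(needs_utf8mb4("smile \U0001F600"))   # True
```

A check like this only says whether the four-byte form is required. How MySQL reacts when such text is written to a column using the three-byte character set depends on the server version and SQL mode, and is not covered here.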

See also

Specifications

External links
