UTF-8

From Just Solve the File Format Problem
(Difference between revisions)
Line 3: Line 3:
 
|subcat=Character Encodings
 
|subcat=Character Encodings
 
}}
 
}}
'''[[UCS]] Transformation Format—8-bit''' (UTF-8) is a [[Unicode]] character encoding. Codes 0-127 (00-7F hexadecimal) represent the equivalent [[ASCII]] characters, and these codes in a UTF-8 stream are never used in any other context. Codes FE and FF are never used, except in the optional [[Byte Order Mark]] at the beginning of a document. In UTF-8 the BOM is encoded as the bytes 0xEF, 0xBB, 0xBF. Since UTF-8 has no "endianness," this is not actually a byte order indicator but can be treated as a signature indicating the document is UTF-8 encoded.
 
  
UTF-8 is best suited for scripts that make heavy use of the Roman alphabet. With other scripts it may not provide as efficient an encoding as [[UTF-16]] or [[UTF-32]].
+
'''UCS Transformation Format—8-bit''' ('''UTF-8''') is a byte-oriented [[Unicode]] [[Character Encodings|character encoding]]. It offers good compatibility with [[ASCII]], because codes 0–127 (00–7F hexadecimal) represent the equivalent ASCII characters, and these codes are never used in any other context.
 +
 
 +
UTF-8 is most efficient with scripts that make heavy use of the Roman alphabet. With other scripts it may not provide as efficient an encoding as [[UTF-16]].
 +
 
 +
== Format ==
 +
 
 +
A Unicode code point is encoded as either 1, 2, 3, or 4 bytes. (Early versions of UTF-8 defined sequences with more than 4 bytes, but they are obsolete.) Code points U+0000 to U+007F use 1 byte, U+0080 to U+07FF use 2, U+0800 to U+FFFF use 3, and U+10000 to U+10FFFF use 4.
 +
 
 +
== See also ==
 +
* [[Byte Order Mark]]
 +
* [[CESU-8]]
  
 
== Specifications ==
 
== Specifications ==

Revision as of 20:46, 22 February 2013

File Format
Name UTF-8
Ontology

UCS Transformation Format—8-bit (UTF-8) is a byte-oriented Unicode character encoding. It is backward compatible with ASCII: byte values 0–127 (00–7F hexadecimal) represent the equivalent ASCII characters, and these byte values never appear as part of a multi-byte sequence.
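The ASCII compatibility described above can be checked with a short Python sketch. The helper names `same_as_ascii` and `high_bytes_only_in_multibyte` are made up for this illustration; only the standard `str.encode` method is used.

```python
def same_as_ascii(text: str) -> bool:
    """Return True if pure-ASCII text yields identical bytes in ASCII and UTF-8."""
    return text.encode("ascii") == text.encode("utf-8")


def high_bytes_only_in_multibyte(text: str) -> bool:
    """Return True if no byte below 0x80 occurs inside a multi-byte sequence."""
    for ch in text:
        encoded = ch.encode("utf-8")
        # A character outside the ASCII range encodes to 2 or more bytes,
        # and every one of those bytes should be 0x80 or higher.
        if len(encoded) > 1 and any(b < 0x80 for b in encoded):
            return False
    return True


print(same_as_ascii("Hello, UTF-8!"))
print(high_bytes_only_in_multibyte("naïve café, 日本語"))
```

Both calls should report `True`: ASCII text is unchanged by UTF-8 encoding, and the bytes 00–7F never show up inside the encoding of a non-ASCII character.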

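UTF-8 is most compact for text that consists mainly of ASCII characters, such as scripts that make heavy use of the Roman alphabet. For other scripts it can be less compact than UTF-16: for example, most Chinese, Japanese, and Korean characters take 3 bytes in UTF-8 but 2 bytes in UTF-16.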
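A rough way to see the size trade-off is to encode the same string both ways and compare byte counts. This is a minimal sketch; `encoded_sizes` is a hypothetical helper, and `utf-16-le` is used so that Python does not prepend a Byte Order Mark to the UTF-16 output.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes of text in UTF-8 and UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" avoids the 2-byte BOM that plain "utf-16" would add.
        "utf-16": len(text.encode("utf-16-le")),
    }


for sample in ("hello", "Привет", "日本語"):
    print(sample, encoded_sizes(sample))
```

For these samples, the Roman-alphabet string is half the size in UTF-8, the Cyrillic string is the same size in both encodings, and the Japanese string is larger in UTF-8 (9 bytes against 6).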


Format

A Unicode code point is encoded as 1, 2, 3, or 4 bytes. (Early versions of UTF-8 defined sequences of more than 4 bytes, but these are obsolete.) Code points U+0000 to U+007F use 1 byte, U+0080 to U+07FF use 2, U+0800 to U+FFFF use 3, and U+10000 to U+10FFFF use 4.
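The ranges above can be turned into a small encoder. The following Python sketch is illustrative only: the function name `utf8_encode` is made up here, and the lead-byte and continuation-byte bit patterns are taken from the standard UTF-8 definition (RFC 3629) rather than from the text above. It also rejects the surrogate range U+D800 to U+DFFF, which that definition excludes.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points are not encoded in UTF-8")
    if cp <= 0x7F:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(utf8_encode(0x41).hex())     # 41
print(utf8_encode(0x20AC).hex())   # e282ac
print(utf8_encode(0x1F600).hex())  # f09f9880
```

The result can be compared against Python's built-in encoder, for example `utf8_encode(0x20AC) == "€".encode("utf-8")`.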

See also

Specifications

External links
