Byte Order Mark

From Just Solve the File Format Problem
(Difference between revisions)
(Rewrote & expanded)
Line 6: Line 6:
 
==Introduction==
 
The '''Byte Order Mark''' (BOM) is a header sometimes added to text formats such as [[CSV]] so that applications can recognize the right [[Character Encodings|character encoding]]. It was designed to deal with the [[Endianness|"big-endian vs. little-endian"]] problem of expressing multi-byte numeric data, where some systems put the highest-order byte first and others put it last. This affects encodings whose code units span more than one byte, such as 16-bit encodings. The BOM has been allocated a character position (U+FEFF) in the [[Unicode]] character set (officially named ZERO WIDTH NO-BREAK SPACE, which is invisible when printed and takes up no space). The value with the two bytes of the 16-bit code point reversed (U+FFFE) is reserved, and Unicode guarantees that it will never be assigned a meaning. This means that if the reversed version is encountered, the file is known to be in the opposite byte order from the one assumed, and processing should restart at the beginning of the file with the proper order set. Usually it is only processed in this manner if it is the first character of the file; otherwise processing could get stuck in an infinite loop if both versions appear somewhere in the file.
+
A '''Byte Order Mark''' ('''BOM''') is a strategically placed U+FEFF (ZERO WIDTH NO-BREAK SPACE) character at the beginning of a [[Unicode]] text file, or other block of Unicode text.
  
Hence, if you examine the raw bytes of a file you believe to be a 16-bit-encoded text file and its first two bytes are FE and FF in that order, this indicates that the order is big-endian, while if the first two bytes are FF then FE, it is little-endian.
+
There are two main schools of thought as to its purpose:
 +
# Its purpose is to identify the [[endianness]] of a file whose [[Character Encodings|encoding]] is otherwise already known (particularly useful with [[UTF-16]]).
 +
# Its purpose is more general: to help computer programs guess the encoding of a file, even if they have no external information about what its encoding might be. Thus, the term "byte order mark" is something of a misnomer.
  
Some [[UTF-8]] files (including [[CSV]] files) are written with a leading BOM consisting of 3 bytes: <code>EF BB BF</code>.
+
The idea of a BOM is undeniably a ''hack'', but its benefits sometimes outweigh its drawbacks.
  
"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature." [http://en.wikipedia.org/wiki/Byte-order_mark#cite_note-2]
+
To make false positives less likely, the U+FFFE code point (U+FEFF with its two bytes swapped) is permanently reserved as a noncharacter, and will never be assigned a meaning.
  
== References ==
+
Other usage of the U+FEFF character is deprecated, and U+2060 WORD JOINER is suggested instead.
* [http://en.wikipedia.org/wiki/Byte-order_mark Byte-order mark (Wikipedia)]
+
 
 +
== Byte patterns of common BOMs ==
 +
 
 +
A file beginning with bytes <code>0xFE 0xFF</code> is probably encoded in [[UTF-16]] with big-endian byte order.
 +
 
 +
<code>0xFF 0xFE</code> suggests [[UTF-16]] with little-endian byte order.
 +
 
 +
<code>0xEF 0xBB 0xBF</code> suggests [[UTF-8]].
 +
 
 +
== UTF-8 ==
 +
 
 +
Whether [[UTF-8]] files should ever use a BOM is a contentious issue. A good case can be made for either side of the argument. But note that if you need to read files written by third-party applications, that ship has sailed: existing UTF-8 files often do use a BOM.
 +
 
 +
== External links ==
 +
* [[Wikipedia:Byte order mark|Wikipedia article]]
 +
* [http://www.unicode.org/charts/PDF/UFE70.pdf Unicode code chart FE70–FEFF]
 +
* [http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf Unicode standard, Ch. 2]
  
 
[[Category:File format details]]
 

Revision as of 20:22, 22 February 2013

File Format
Name Byte Order Mark
Ontology

Introduction

A Byte Order Mark (BOM) is a strategically placed U+FEFF (ZERO WIDTH NO-BREAK SPACE) character at the beginning of a Unicode text file, or other block of Unicode text.

There are two main schools of thought as to its purpose:

  1. Its purpose is to identify the endianness of a file whose encoding is otherwise already known (particularly useful with UTF-16).
  2. Its purpose is more general: to help computer programs guess the encoding of a file, even if they have no external information about what its encoding might be. Thus, the term "byte order mark" is something of a misnomer.

The idea of a BOM is undeniably a hack, but its benefits sometimes outweigh its drawbacks.

To make false positives less likely, the U+FFFE code point (U+FEFF with its two bytes swapped) is permanently reserved as a noncharacter, and will never be assigned a meaning.
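
As a rough illustration of why reserving U+FFFE helps, the Python sketch below reads the first 16-bit unit of a buffer as if it were big-endian. Seeing 0xFEFF suggests the guess was right; seeing 0xFFFE suggests the bytes are in the other order. The function name `detect_utf16_order` is made up for this example, and the sketch assumes the data is already believed to be UTF-16.

```python
BOM = "\ufeff"  # U+FEFF, the character used as a byte order mark

# Encode the BOM with an explicit byte order. The "-be"/"-le" codecs
# do not add a BOM of their own, so these are exactly two bytes each.
bom_be = BOM.encode("utf-16-be")  # bytes FE FF
bom_le = BOM.encode("utf-16-le")  # bytes FF FE


def detect_utf16_order(data: bytes):
    """Guess the byte order of presumed UTF-16 data from its first two bytes.

    Hypothetical helper: returns "big", "little", or None if no BOM is found.
    """
    if len(data) < 2:
        return None
    # Interpret the first code unit as big-endian.
    unit = int.from_bytes(data[:2], "big")
    if unit == 0xFEFF:
        return "big"      # BOM read correctly: the data is big-endian
    if unit == 0xFFFE:
        return "little"   # reserved value: the bytes must be swapped
    return None           # no BOM; byte order must come from elsewhere
```

Because U+FFFE can never be a real character, a leading 0xFFFE is hard to explain as anything other than a byte-swapped BOM.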

Other usage of the U+FEFF character is deprecated, and U+2060 WORD JOINER is suggested instead.

Byte patterns of common BOMs

A file beginning with bytes 0xFE 0xFF is probably encoded in UTF-16 with big-endian byte order.

0xFF 0xFE suggests UTF-16 with little-endian byte order.

0xEF 0xBB 0xBF suggests UTF-8.
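
The three patterns above can be checked with a few lines of Python. This is only a sketch: the function name `sniff_bom` and the shape of its return value are assumptions made for illustration, and a match is a hint rather than proof of the encoding. The BOM constants come from Python's standard `codecs` module.

```python
import codecs

# (BOM bytes, suggested encoding) for the three patterns listed above.
# Note: FF FE is also the start of a UTF-32 little-endian BOM
# (FF FE 00 00); this sketch does not try to tell the two apart.
BOM_TABLE = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return (suggested_encoding, bom_length), or (None, 0) if no BOM matches."""
    for bom, encoding in BOM_TABLE:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0
```

A caller might use the result like this: `encoding, skip = sniff_bom(raw)`, then `raw[skip:].decode(encoding)` if `encoding` is not `None`, and fall back to some default otherwise.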

UTF-8

Whether UTF-8 files should ever use a BOM is a contentious issue. A good case can be made for either side of the argument. But note that if you need to read files written by third-party applications, that ship has sailed: existing UTF-8 files often do use a BOM.
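
For readers who have to accept UTF-8 files both with and without a BOM, one possible approach in Python is the standard `utf-8-sig` codec, which drops a leading BOM when decoding and otherwise behaves like plain UTF-8. The helper name `decode_utf8_tolerant` below is invented for this sketch.

```python
def decode_utf8_tolerant(data: bytes) -> str:
    """Decode UTF-8 bytes, silently dropping a leading BOM if one is present."""
    # "utf-8-sig" strips EF BB BF at the start of the input when decoding;
    # input without a BOM is decoded exactly as with "utf-8".
    return data.decode("utf-8-sig")


with_bom = b"\xef\xbb\xbfname,value"
without_bom = b"name,value"

# Both inputs decode to the same text.
same = decode_utf8_tolerant(with_bom) == decode_utf8_tolerant(without_bom)

# With the plain "utf-8" codec the BOM survives as a U+FEFF character
# at the start of the string, which can corrupt the first CSV field.
leaked = with_bom.decode("utf-8")
```

The same codec name can be passed to `open(path, encoding="utf-8-sig")` when reading text files directly.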

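External links

  * Wikipedia article (Byte order mark)
  * Unicode code chart FE70–FEFF: http://www.unicode.org/charts/PDF/UFE70.pdf
  * Unicode standard, Ch. 2: http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf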
