Byte Order Mark

From Just Solve the File Format Problem
(Difference between revisions)
(Put the more detailed description here where it belongs instead of just linking to the CSV article.)
(Byte patterns of common BOMs)
 
(10 intermediate revisions by 2 users not shown)
Line 1: Line 1:
==Introduction==
+
{{FormatInfo
 +
|formattype=electronic
 +
|subcat=Character encoding
 +
}}
 +
A '''Byte Order Mark''' ('''BOM''') is a U+FEFF (ZERO WIDTH NO-BREAK SPACE) character deliberately placed at the very beginning of a [[Unicode]] text file or other block of Unicode text.
  
The '''Byte Order Mark''' (BOM) is a header that is sometimes added to textual formats such as [[CSV]] so that applications can recognize the correct character encoding. It was designed to deal with the "big-endian vs. little-endian" problem of expressing multi-byte numeric data, where some systems put the highest-order byte first and others put it last. This affects 16-bit character encodings. The BOM has been allocated a code point in the [[Unicode]] character set (U+FEFF); the code point formed by reversing its two bytes (U+FFFE) is reserved and guaranteed never to be given a different meaning by Unicode. This means that if the reversed version is encountered, the file is known to be in the opposite byte order from the one previously assumed.
+
== Discussion ==
  
Some [[UTF-8]] files (including [[CSV]] files) are written with a prepended BOM consisting of 3 bytes: <code>EF BB BF</code>.
+
There are two main schools of thought as to its purpose:
 +
# Its purpose is to identify the [[endianness]] of a file whose [[Character Encodings|encoding]] is otherwise already known (particularly useful with [[UTF-16]]).
 +
# Its purpose is more general: to help computer programs guess the encoding of a file, even if they have no external information about what its encoding might be. Thus, the term "byte order mark" is something of a misnomer.
  
"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature." [http://en.wikipedia.org/wiki/Byte-order_mark#cite_note-2]
+
The idea of a BOM is undeniably a ''hack'', but its benefits sometimes outweigh its drawbacks.
  
== References ==
+
To make false positives less likely, the byte-swapped value U+FFFE is permanently reserved as a noncharacter and will never be assigned to a meaningful character.
* [http://en.wikipedia.org/wiki/Byte-order_mark Byte-order mark (Wikipedia)]
+
 
 +
Any other use of the U+FEFF character is deprecated; U+2060 WORD JOINER is suggested instead.
 +
 
 +
== Byte patterns of common BOMs ==
 +
 
 +
A file beginning with bytes <code>0xFE 0xFF</code> is probably encoded in [[UTF-16]] with big-endian byte order.
 +
 
 +
<code>0xFF 0xFE</code> suggests [[UTF-16]] with little-endian byte order.
 +
 
 +
<code>0xEF 0xBB 0xBF</code> suggests [[UTF-8]].
 +
 
 +
<code>0x0E 0xFE 0xFF</code> suggests [[SCSU]].
 +
 
 +
== UTF-8 ==
 +
 
 +
Whether [[UTF-8]] files should ever use a BOM is a contentious issue. A good case can be made for either side of the argument. But note that if you need to read files written by third-party applications, that ship has sailed: existing UTF-8 files often do use a BOM.
 +
 
 +
== External links ==
 +
* [[Wikipedia:Byte order mark|Wikipedia article]]
 +
* [http://www.unicode.org/charts/PDF/UFE70.pdf Unicode code chart FE70–FEFF]
 +
* [http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf Unicode standard, Ch. 2]
  
 
[[Category:File format details]]
 

Latest revision as of 01:11, 26 March 2025

File Format
Name Byte Order Mark
Ontology

A Byte Order Mark (BOM) is a U+FEFF (ZERO WIDTH NO-BREAK SPACE) character deliberately placed at the very beginning of a Unicode text file or other block of Unicode text.


Discussion

There are two main schools of thought as to its purpose:

  1. Its purpose is to identify the endianness of a file whose encoding is otherwise already known (particularly useful with UTF-16).
  2. Its purpose is more general: to help computer programs guess the encoding of a file, even if they have no external information about what its encoding might be. Thus, the term "byte order mark" is something of a misnomer.

The idea of a BOM is undeniably a hack, but its benefits sometimes outweigh its drawbacks.

To make false positives less likely, the byte-swapped value U+FFFE is permanently reserved as a noncharacter and will never be assigned to a meaningful character.
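
As a rough illustration of why the reserved U+FFFE value is useful, the following Python sketch reads the first 16-bit code unit of a buffer under an assumed byte order and reports whether that assumption looks correct, reversed, or unconfirmed. The function name `check_utf16_order` and its return strings are hypothetical, chosen only for this example.

```python
def check_utf16_order(data: bytes, assumed: str = "big") -> str:
    """Classify the first 16-bit code unit of `data` under an assumed byte order.

    `assumed` is "big" or "little", as accepted by int.from_bytes.
    Returns "ok", "swapped", "none", or "too short" (hypothetical labels).
    """
    if len(data) < 2:
        return "too short"
    first_unit = int.from_bytes(data[:2], assumed)
    if first_unit == 0xFEFF:
        # The BOM reads correctly, so the assumed byte order matches the data.
        return "ok"
    if first_unit == 0xFFFE:
        # U+FFFE is never a real character, so seeing it here means the
        # data is in the opposite byte order from the one assumed.
        return "swapped"
    # No BOM at all: the byte order can be neither confirmed nor refuted.
    return "none"


# The same little-endian data, checked under both assumptions.
sample = b"\xff\xfeA\x00"
print(check_utf16_order(sample, "big"))
print(check_utf16_order(sample, "little"))
```

Note that a result of "none" says nothing about the byte order; it only means no BOM was present to check against.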

Any other use of the U+FEFF character is deprecated; U+2060 WORD JOINER is suggested instead.

Byte patterns of common BOMs

A file beginning with bytes 0xFE 0xFF is probably encoded in UTF-16 with big-endian byte order.

0xFF 0xFE suggests UTF-16 with little-endian byte order.

0xEF 0xBB 0xBF suggests UTF-8.

0x0E 0xFE 0xFF suggests SCSU.
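
The patterns above could be checked with a small sniffing routine. This is a minimal Python sketch, not a complete detector: the names `BOM_PATTERNS` and `guess_encoding_from_bom` are hypothetical, and only the four signatures listed here are covered. In particular, 0xFF 0xFE is also how a UTF-32 little-endian BOM (0xFF 0xFE 0x00 0x00) begins, which this sketch does not try to distinguish.

```python
# (signature bytes, suggested encoding), limited to the patterns listed above.
# Three-byte signatures come first so a longer match is tried before a shorter one.
BOM_PATTERNS = [
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\x0e\xfe\xff", "SCSU"),
    (b"\xfe\xff", "UTF-16 (big-endian)"),
    (b"\xff\xfe", "UTF-16 (little-endian)"),
]


def guess_encoding_from_bom(data: bytes):
    """Return (suggested encoding, BOM length), or (None, 0) if no BOM matches.

    A match is only a hint: arbitrary binary data can start with the same bytes.
    """
    for signature, name in BOM_PATTERNS:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0


encoding, bom_length = guess_encoding_from_bom(b"\xef\xbb\xbfname,value\n")
print(encoding, bom_length)
```

Returning the BOM length alongside the guess lets a caller skip the signature before decoding the rest of the data.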

UTF-8

Whether UTF-8 files should ever use a BOM is a contentious issue. A good case can be made for either side of the argument. But note that if you need to read files written by third-party applications, that ship has sailed: existing UTF-8 files often do use a BOM.
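
For readers who need to accept both kinds of file, one possible approach in Python is the standard `utf-8-sig` codec, which drops a leading BOM when one is present and otherwise decodes as ordinary UTF-8. The sketch below is only an illustration; the helper name `decode_utf8_lenient` and the sample data are made up for this example.

```python
import codecs


def decode_utf8_lenient(data: bytes) -> str:
    """Decode UTF-8 data whether or not it starts with the EF BB BF signature."""
    # "utf-8-sig" strips one leading BOM if present; without a BOM it
    # behaves like the plain "utf-8" codec.
    return data.decode("utf-8-sig")


with_bom = codecs.BOM_UTF8 + "name,value\n".encode("utf-8")
without_bom = "name,value\n".encode("utf-8")

# Both inputs decode to the same text.
print(decode_utf8_lenient(with_bom) == decode_utf8_lenient(without_bom))

# The plain "utf-8" codec keeps the BOM as a leading U+FEFF character,
# which is how a stray BOM ends up glued to the first field of a file.
print(with_bom.decode("utf-8").startswith("\ufeff"))
```

Whether to write a BOM back out when saving is a separate decision that this sketch does not address.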

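External links

  * Wikipedia article (Byte order mark)
  * Unicode code chart FE70–FEFF: http://www.unicode.org/charts/PDF/UFE70.pdf
  * Unicode standard, Ch. 2: http://www.unicode.org/versions/Unicode6.2.0/ch02.pdf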
