Byte Order Mark
Latest revision as of 01:11, 26 March 2025
A Byte Order Mark (BOM) is a strategically placed U+FEFF (ZERO WIDTH NO-BREAK SPACE) character at the beginning of a Unicode text file, or of any other block of Unicode text.
Discussion
There are two main schools of thought as to its purpose:
- Its purpose is to identify the endianness of a file whose encoding is otherwise already known (particularly useful with UTF-16).
- Its purpose is more general: to help computer programs guess the encoding of a file, even if they have no external information about what its encoding might be. Thus, the term "byte order mark" is something of a misnomer.
The idea of a BOM is undeniably a hack, but its benefits sometimes outweigh its drawbacks.
To make false positives less likely, the code point U+FFFE, which is what U+FEFF looks like when its two bytes are read in the wrong order, is permanently reserved as a noncharacter and will never be assigned a meaning.
Use of U+FEFF for any other purpose (that is, as an actual zero width no-break space within text) is deprecated; U+2060 WORD JOINER is recommended instead.
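The role of the reserved U+FFFE code point can be illustrated with a short Python sketch. It serializes U+FEFF as a 16-bit unit in both byte orders, then reads the big-endian form with the wrong byte order. The helper name `swap16` is hypothetical and exists only for this illustration.

```python
import struct

BOM = 0xFEFF  # code point used as the byte order mark


def swap16(value: int) -> int:
    """Return a 16-bit value with its two bytes exchanged (illustrative helper)."""
    return ((value & 0xFF) << 8) | (value >> 8)


# Serialize U+FEFF as one 16-bit unit in each byte order.
big_endian = struct.pack(">H", BOM)     # bytes FE FF
little_endian = struct.pack("<H", BOM)  # bytes FF FE

# A reader that assumes the wrong byte order sees U+FFFE. Because that
# code point is reserved, the reader can conclude the bytes are swapped.
misread = struct.unpack("<H", big_endian)[0]
print(hex(misread))          # 0xfffe
print(hex(swap16(misread)))  # 0xfeff
```

Since U+FFFE can never be legitimate text, a decoder that encounters it at the start of a 16-bit stream has good reason to swap the bytes of every following unit.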
Byte patterns of common BOMs
A file beginning with bytes 0xFE 0xFF is probably encoded in UTF-16 with big-endian byte order.
0xFF 0xFE suggests UTF-16 with little-endian byte order.
0xEF 0xBB 0xBF suggests UTF-8.
0x0E 0xFE 0xFF suggests SCSU.
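The patterns above can be checked mechanically. The following is a minimal Python sketch, not a complete encoding detector: the function name `sniff_bom` and the table `BOM_SIGNATURES` are assumptions made for this example, and the table covers only the four patterns listed here.

```python
from typing import Optional

# (byte prefix, suggested encoding). Three-byte prefixes come first so
# that a longer signature is always tested before a shorter one.
BOM_SIGNATURES = [
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\x0e\xfe\xff", "SCSU"),
    (b"\xfe\xff", "UTF-16 (big-endian)"),
    (b"\xff\xfe", "UTF-16 (little-endian)"),
]


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding suggested by a leading BOM, or None if none matches."""
    for prefix, name in BOM_SIGNATURES:
        if data.startswith(prefix):
            return name
    return None


print(sniff_bom(b"\xff\xfeh\x00i\x00"))  # UTF-16 (little-endian)
print(sniff_bom(b"plain ASCII text"))    # None
```

As the wording "probably" and "suggests" indicates, a match is only a hint: a file in some other encoding may happen to begin with the same bytes, and a file without a BOM returns `None` here regardless of its actual encoding.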
UTF-8
Whether UTF-8 files should ever use a BOM is a contentious issue. A good case can be made for either side of the argument. But note that if you need to read files written by third-party applications, that ship has sailed: existing UTF-8 files often do use a BOM.
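For code that must accept UTF-8 input both with and without a BOM, one possible approach in Python is the standard `utf-8-sig` codec, which drops a single leading BOM when decoding and otherwise behaves like plain UTF-8. The wrapper name `decode_utf8` is hypothetical; this is a sketch of one option, not a recommendation for either side of the argument.

```python
import codecs


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping one leading BOM if present."""
    return data.decode("utf-8-sig")


with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")
without_bom = "hello".encode("utf-8")

# Both inputs decode to the same text.
print(decode_utf8(with_bom) == decode_utf8(without_bom))  # True

# The plain "utf-8" codec keeps the BOM as a leading U+FEFF character.
print(with_bom.decode("utf-8").startswith("\ufeff"))      # True
```

The second check shows why the distinction matters: with the plain codec the BOM survives as an invisible first character, which can break comparisons or parsers that expect the text to start with a specific token.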

