Byte Order Mark

From Just Solve the File Format Problem

Latest revision as of 22:01, 4 March 2016


A Byte Order Mark (BOM) is a strategically placed U+FEFF (ZERO WIDTH NO-BREAK SPACE) character at the beginning of a Unicode text file, or other block of Unicode text.

Discussion

There are two main schools of thought as to its purpose:

  1. Its purpose is to identify the endianness of a file whose encoding is otherwise already known (particularly useful with UTF-16).
  2. Its purpose is more general: to help computer programs guess the encoding of a file, even if they have no external information about what its encoding might be. Thus, the term "byte order mark" is something of a misnomer.

The idea of a BOM is undeniably a hack, but its benefits sometimes outweigh its drawbacks.

To make false positives less likely, U+FFFE (the value obtained by reading U+FEFF with its two bytes swapped) is permanently reserved as a noncharacter, and will never be a meaningful code point.
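
The byte-swap relationship can be checked with a minimal Python sketch. The helper name first_code_unit is hypothetical and only serves this illustration; it reads the first two bytes of a buffer as one 16-bit code unit in a chosen byte order.

```python
BOM = "\ufeff"  # U+FEFF, the character used as a byte order mark

def first_code_unit(data: bytes, byteorder: str) -> int:
    """Interpret the first two bytes as a 16-bit code unit ('big' or 'little')."""
    return int.from_bytes(data[:2], byteorder)

# The same character serialized in both UTF-16 byte orders.
big_endian_bytes = BOM.encode("utf-16-be")     # bytes FE FF
little_endian_bytes = BOM.encode("utf-16-le")  # bytes FF FE

# Read with the matching byte order, the code unit is 0xFEFF.
# Read with the wrong byte order, it is 0xFFFE, which is never a valid character.
right = first_code_unit(little_endian_bytes, "little")
wrong = first_code_unit(little_endian_bytes, "big")
```

A reader that sees 0xFFFE as the first code unit can therefore conclude that it has assumed the wrong byte order, rather than that the text begins with some legitimate character.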

Use of the U+FEFF character for any other purpose (that is, as an actual zero width no-break space) is deprecated, and U+2060 WORD JOINER is suggested instead.

Byte patterns of common BOMs

A file beginning with bytes 0xFE 0xFF is probably encoded in UTF-16 with big-endian byte order.

0xFF 0xFE suggests UTF-16 with little-endian byte order.

0xEF 0xBB 0xBF suggests UTF-8.
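
The three patterns above could be checked along the following lines. This is a sketch only: the function name sniff_bom and the table _BOMS are hypothetical, and the result is a guess rather than a guarantee, since a file may begin with these bytes by coincidence. Other encodings (such as UTF-32, whose little-endian BOM also begins with 0xFF 0xFE) are deliberately not handled here.

```python
import codecs

# Hypothetical lookup table for this sketch: BOM byte pattern -> encoding guess.
# The codecs constants hold the byte sequences listed above.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)

def sniff_bom(data: bytes):
    """Return (encoding_guess, bom_length), or (None, 0) if no known BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def decode_with_bom(data: bytes, default: str = "utf-8") -> str:
    """Decode using the sniffed encoding, skipping the BOM bytes themselves.

    The fallback encoding is an assumption of this sketch, not a rule.
    """
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or default)
```

For example, decode_with_bom(b"\xff\xfeh\x00i\x00") treats the input as little-endian UTF-16 and returns the two-character string "hi".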

UTF-8

Whether UTF-8 files should ever use a BOM is a contentious issue. A good case can be made for either side of the argument. But note that if you need to read files written by third-party applications, that ship has sailed: existing UTF-8 files often do use a BOM.
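
For a reader that must accept UTF-8 files both with and without a BOM, one possible approach in Python is the standard "utf-8-sig" codec, which drops a leading BOM when decoding and otherwise behaves like plain UTF-8. The function name read_utf8_text is hypothetical and used only for this sketch.

```python
def read_utf8_text(data: bytes) -> str:
    """Decode UTF-8 bytes, discarding a leading BOM if one is present."""
    # "utf-8-sig" removes EF BB BF at the start of the input when decoding;
    # input without a BOM is decoded exactly as with "utf-8".
    return data.decode("utf-8-sig")

with_bom = b"\xef\xbb\xbfhello"
without_bom = b"hello"

# Both inputs produce the same text.
same = read_utf8_text(with_bom) == read_utf8_text(without_bom)

# By contrast, the plain "utf-8" codec keeps the BOM as a U+FEFF character.
kept = with_bom.decode("utf-8")
```

The contrast with the plain codec is the usual source of trouble: a stray U+FEFF at the start of the decoded text is invisible when displayed but can break comparisons and parsers.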
