Byte Order Mark
Revision as of 13:38, 27 November 2012
Introduction
The Byte Order Mark (BOM) is a short marker that is sometimes placed at the start of text files, including formats such as CSV, so that applications can recognize the correct character encoding. It was designed to deal with the "big-endian vs. little-endian" problem of expressing multi-byte numeric data, where some systems put the highest-order byte first and others put it last. This affects 16-bit character encodings such as UTF-16, where each code unit occupies two bytes. The BOM has been allocated a character position (U+FEFF) in the Unicode character set, while the value obtained by reversing its two bytes (U+FFFE) is reserved and guaranteed never to be given any other meaning by Unicode. This means that if the reversed version is encountered, the file is known to be in the opposite byte order from the one assumed, and processing should restart at the beginning of the file with the correct order set. Usually the BOM is only treated this way when it is the first character of the file; otherwise processing could get stuck in an infinite loop if both versions appeared somewhere in the file.
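The byte-order check described above can be sketched in a few lines of Python. This is a minimal illustration, not a complete encoding detector: the function name `detect_utf16_byte_order` is hypothetical, it looks only at the first two bytes, and it assumes the data is 16-bit text (a little-endian UTF-32 BOM also begins with FF FE and would be misread here).

```python
import codecs


def detect_utf16_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unknown' from a leading 16-bit BOM.

    Hypothetical helper for illustration; only the first two bytes
    are inspected, matching the rule that the BOM is honored only
    at the very start of the file.
    """
    prefix = data[:2]
    if prefix == codecs.BOM_UTF16_BE:  # bytes FE FF: U+FEFF read high byte first
        return "big"
    if prefix == codecs.BOM_UTF16_LE:  # bytes FF FE: U+FEFF stored low byte first
        return "little"
    return "unknown"  # no BOM present; the byte order must be known some other way
```

A reader that assumed big-endian order and saw FF FE would interpret those bytes as U+FFFE, which is exactly the signal that the opposite order applies.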
Some UTF-8 files (including CSV files) are written with a prepended BOM consisting of three bytes, EF BB BF, which is the UTF-8 encoding of U+FEFF.
"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature." [1]

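When reading UTF-8 data that may or may not begin with the EF BB BF signature, a consumer typically drops those three bytes before further processing. The following Python sketch shows one way to do that; the helper name `strip_utf8_bom` and the sample CSV bytes are made up for illustration.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if one is present.

    Hypothetical helper for illustration; bytes elsewhere in the
    data are left untouched.
    """
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Illustrative sample: a CSV header line written with a UTF-8 BOM.
raw = b"\xef\xbb\xbfname,value\r\n"
text = strip_utf8_bom(raw).decode("utf-8")
```

Python's standard `utf-8-sig` codec performs the same stripping while decoding, so `raw.decode("utf-8-sig")` is a shorter alternative when the data is going to be decoded anyway.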
