Byte Order Mark
Dan Tobias (Talk | contribs) |
Revision as of 14:49, 9 February 2013
Introduction
The Byte Order Mark (BOM) is a short signature that is sometimes placed at the start of text files, such as CSV files, so that applications can recognize the correct character encoding. It was designed to deal with the "big-endian vs. little-endian" problem of expressing multi-byte numeric data, where some systems put the highest-order byte first and others put it last. This affects 16-bit character encodings such as UTF-16.
The BOM has been allocated a character position (U+FEFF) in the Unicode character set, officially named "ZERO WIDTH NO-BREAK SPACE"; it is invisible when rendered and occupies no width. The value with the two bytes of the 16-bit code point reversed (U+FFFE) is a noncharacter: Unicode guarantees that it will never be assigned a meaning. This means that if the reversed value is encountered, the file is known to be in the opposite byte order from the one previously assumed, and processing should restart at the beginning of the file with the correct order set. Usually the BOM is only processed in this manner if it is the first character of the file; otherwise processing could get stuck in an infinite loop if both versions appear somewhere in the file.
Hence, if you examine the raw bytes of a file you believe to be a 16-bit-encoded text file and its first two bytes are FE and FF in that order, this indicates that the order is big-endian, while if the first two bytes are FF then FE, it is little-endian.
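The check described above can be sketched in Python. This is a minimal illustration, not a complete encoding detector: the function name `utf16_byte_order` is hypothetical, and the code assumes the data is already believed to be 16-bit-encoded text, as stated above.

```python
import codecs
from typing import Optional


def utf16_byte_order(data: bytes) -> Optional[str]:
    """Guess the byte order of 16-bit-encoded text from its first two bytes.

    Returns "big", "little", or None when no BOM is present.
    Assumes the caller already believes the data is 16-bit-encoded text.
    """
    prefix = data[:2]
    if prefix == codecs.BOM_UTF16_BE:  # bytes FE FF
        return "big"
    if prefix == codecs.BOM_UTF16_LE:  # bytes FF FE
        return "little"
    return None


# Build two small samples by prepending a BOM to encoded text.
big_sample = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
little_sample = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

print(utf16_byte_order(big_sample))
print(utf16_byte_order(little_sample))
print(utf16_byte_order(b"hi"))
```

Only the first two bytes are examined, which matches the convention that the BOM is honored only at the very start of the file.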
Some UTF-8 files (including CSV files) are written with a prepended BOM consisting of 3 bytes: EF BB BF.
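As a rough sketch of how a program might cope with such files, the Python example below removes the three-byte prefix when it is present. The helper name `strip_utf8_bom` and the sample CSV content are made up for illustration; the `utf-8-sig` codec is part of Python's standard library and skips a leading BOM when decoding.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical CSV content written with a UTF-8 BOM in front.
raw = codecs.BOM_UTF8 + "name,value\n".encode("utf-8")

# Decoding as plain UTF-8 keeps the BOM as a leading U+FEFF character.
with_bom = raw.decode("utf-8")

# Decoding with "utf-8-sig" drops a leading BOM if there is one.
without_bom = raw.decode("utf-8-sig")

print(repr(with_bom[0]))
print(repr(without_bom))
```

If the BOM is left in place, the first CSV field name begins with an invisible U+FEFF character, which can cause confusing mismatches when the name is compared against an expected header.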
"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature." [1]