Byte Order Mark

File Format
Name	Byte Order Mark
Ontology	Electronic File Formats, Character Encodings, Byte Order Mark

Revision as of 14:49, 9 February 2013

Introduction

The Byte Order Mark (BOM) is a short signature sometimes placed at the start of text files, such as CSV, so that applications can recognize the right character encoding. It was designed to deal with the "big-endian vs. little-endian" problem of expressing multi-byte numeric data, where some systems put the highest-order byte first and others put it last. This affects encodings whose code units are wider than one byte, such as UTF-16 and UTF-32.

The BOM has been allocated a character position (U+FEFF) in the Unicode character set. Its official name is "ZERO WIDTH NO-BREAK SPACE"; it is invisible and has no width, and its older use to prevent a line break between adjacent characters is deprecated in favor of U+2060 WORD JOINER. The value with the two bytes reversed (U+FFFE) is a noncharacter: it is reserved and guaranteed never to be assigned a meaning by Unicode. This means that if the reversed version is encountered, the file is known to be in the opposite byte order from the one previously assumed, and processing should restart at the beginning of the file with the proper order set. Usually it is only processed in this manner if it is the first character of the file; otherwise processing could enter an infinite loop if both versions appear somewhere in the file.

Hence, if you examine the raw bytes of a file you believe to be a UTF-16-encoded text file and its first two bytes are FE FF, in that order, the byte order is big-endian; if the first two bytes are FF FE, it is little-endian.
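The check described above can be sketched in a few lines of Python. This is only an illustration, not a complete encoding detector: the function name utf16_byte_order is made up for this example, and it assumes the caller already believes the data is UTF-16. The BOM constants come from Python's standard codecs module.

```python
import codecs
from typing import Optional


def utf16_byte_order(data: bytes) -> Optional[str]:
    """Return "big", "little", or None if the data has no UTF-16 BOM.

    Hypothetical helper for illustration. It only inspects the first
    two bytes and assumes the data is UTF-16; note that the UTF-32
    little-endian BOM (FF FE 00 00) also begins with FF FE.
    """
    prefix = data[:2]
    if prefix == codecs.BOM_UTF16_BE:  # bytes FE FF
        return "big"
    if prefix == codecs.BOM_UTF16_LE:  # bytes FF FE
        return "little"
    return None
```

For example, utf16_byte_order(b"\xfe\xff\x00A") reports big-endian, while data that starts with neither byte pair yields None and the byte order has to be determined some other way.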

Some UTF-8 files (including CSV files) are written with a prepended BOM consisting of 3 bytes: EF BB BF, which is the UTF-8 encoding of U+FEFF.
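A reader that wants to tolerate this signature can drop it while decoding. The following is a minimal sketch in Python, assuming the input really is UTF-8; the function name decode_utf8_text is hypothetical, while the "utf-8-sig" codec and codecs.BOM_UTF8 are part of the Python standard library.

```python
import codecs


def decode_utf8_text(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping one leading BOM if it is present.

    Hypothetical helper for illustration. The "utf-8-sig" codec skips
    a leading EF BB BF and otherwise behaves like plain "utf-8".
    """
    return data.decode("utf-8-sig")


def has_utf8_bom(data: bytes) -> bool:
    """Report whether the data starts with the 3-byte UTF-8 signature."""
    return data.startswith(codecs.BOM_UTF8)  # bytes EF BB BF
```

Decoding the same bytes with the plain "utf-8" codec would instead keep U+FEFF as the first character of the result, which is a common cause of a stray invisible character in the first field of a CSV file.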

"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature." [1]

References

Byte-order mark (Wikipedia)
