Endianness

From Just Solve the File Format Problem

Endianness is the order of bytes in multi-byte numeric quantities in binary formats. OK, that sounds pretty "techie"... let's step back a bit and explain the concept.

Traditionally, computer memory (and other forms of short- or long-term storage, including disks and tapes) is divided into bytes (or "octets") consisting of eight bits. A bit is a binary digit (1 or 0), the smallest unit of computer data, but bits are grouped in batches of 8 when it comes to storage. However, computer processors have long been capable of dealing with more than 8 bits at once (that's why they are called "32-bit processors", and so on). That's good, since if you want to do math on numbers higher than 255, you need more bits.

But when you have a number taking up more than one byte, such as a two-byte integer (which can go up to 65,535, or run from -32,768 through 32,767 if it is a signed integer), you have the problem of what order to put the bytes in. Do you put the higher-order part first (the bits representing the larger parts of the number) or the lower-order part? Putting the higher-order part first is called "big-endian"; putting the lower-order part first is called "little-endian".
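As a small aside on the ranges quoted above, the following Python sketch shows where the limits of a two-byte integer come from, and how the same two bytes read differently as signed or unsigned. The variable names are made up for illustration.

```python
# Two bytes give 16 bits, so 2**16 distinct values.
unsigned_max = 2**16 - 1   # 65535
signed_min = -(2**15)      # -32768
signed_max = 2**15 - 1     # 32767

# The same two bytes (FF FF) interpreted both ways.
# Byte order does not matter here because both bytes are identical.
raw = bytes([0xFF, 0xFF])
as_unsigned = int.from_bytes(raw, byteorder="big", signed=False)
as_signed = int.from_bytes(raw, byteorder="big", signed=True)

print(as_unsigned)  # 65535
print(as_signed)    # -1
```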

The way we write numbers on paper is "big-endian"; a number starts with the highest part. In "65535", the first digit is the 6, which stands for 60,000, and is the largest part of the number. The last digit, the final "5", only stands for 5 (with no multiple), and is the smallest part. If it had been written "53556", it would be "little-endian".

That is in the decimal system, but the same holds in other bases. In hexadecimal, that number is "FFFF"; you can't tell by looking at it whether it is big- or little-endian, because all the digits happen to be the same, but the first "F" still stands for the highest value.

To take a different hexadecimal number where you can tell which digit is which, "ABCD", the A stands for 10 * 16 * 16 * 16, while the D stands for 13; the A is the bigger part of the number. If it is stored in two consecutive bytes of computer memory, one of the bytes would have the hex value AB, and the other CD. If the storage on that particular computer is big-endian, the order would be AB CD, which corresponds to how the number is written as a four-digit hex number. However, if it is little-endian, the order is reversed, and it would be CD AB.
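The AB CD versus CD AB layout can be checked with a short Python sketch. It uses the built-in `int.to_bytes` and `int.from_bytes` methods; the variable names are only illustrative.

```python
value = 0xABCD

# Lay the same number out in two bytes, once in each byte order.
big = value.to_bytes(2, byteorder="big")
little = value.to_bytes(2, byteorder="little")

print(big.hex(" ").upper())     # AB CD
print(little.hex(" ").upper())  # CD AB

# Reading the bytes back gives the original number only if the
# reader assumes the same order the writer used.
right = int.from_bytes(big, byteorder="big")      # 0xABCD
wrong = int.from_bytes(big, byteorder="little")   # 0xCDAB
print(hex(right), hex(wrong))
```

Note that only the order of the bytes changes. The two hex digits inside each byte (A then B, C then D) stay in the same order either way.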

By that standard, the big-endian format seems to make more sense, but in fact the little-endian style is more commonly used; the Intel chips used in most PCs store numbers this way, as did some of the other chips in early personal computers, such as the Z80 and 6502. The other way, big-endian, is used on some mainframes and minicomputers, as well as the 68000 and PowerPC chips that were used in Macs prior to their switch to Intel chips. However, the big-endian format is standard for network protocols, where it is known as "network byte order".
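To make "network byte order" concrete, here is a minimal Python sketch using the standard `struct` module, where the `!` prefix means network (big-endian) order, `>` and `<` force big- and little-endian, and `=` uses the byte order of the machine running the code. The port number is just a hypothetical sample value.

```python
import struct
import sys

port = 8080  # hypothetical sample value; 8080 is 0x1F90 in hex

network = struct.pack("!H", port)  # network byte order (big-endian)
big = struct.pack(">H", port)      # explicitly big-endian
little = struct.pack("<H", port)   # explicitly little-endian
native = struct.pack("=H", port)   # whatever this machine uses

print(network.hex(" "))  # 1f 90
print(little.hex(" "))   # 90 1f

# sys.byteorder reports the order of the machine running the code.
print(sys.byteorder)
```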

Processor chips set the standard for the arrangement of numbers by requiring them to be in a particular order when doing math on them. Developers creating data formats to be used with that chip then tend to store numbers in the same order, to make it easy to do math on the stored values when they are retrieved. So if they have to set a standard for a file format using binary-encoded numbers, they'll probably follow the way their favorite chip does it. This causes problems when the data is interchanged with systems of a different architecture. You end up with some file format standards using one method, others using the opposite, and a few having two variants, one with each endianness.

Of course, computers are much faster now than in the days when these file formats started being defined, so it's no longer that big a deal to flip the bytes to translate numbers from a format that is different from what the current computer understands. You just have to be sure you know which way the data is stored to begin with so you don't flip it the wrong way.
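Flipping the bytes is only a few operations. The sketch below shows one way to do it in Python, and what goes wrong when the byte order is guessed incorrectly. The function names `swap16` and `swap32` and the sample field are hypothetical.

```python
def swap16(value: int) -> int:
    """Swap the two bytes of a 16-bit unsigned integer."""
    return ((value & 0xFF) << 8) | ((value >> 8) & 0xFF)


def swap32(value: int) -> int:
    """Reverse the four bytes of a 32-bit unsigned integer."""
    return int.from_bytes(value.to_bytes(4, "big"), "little")


print(hex(swap16(0xABCD)))      # 0xcdab
print(hex(swap32(0x12345678)))  # 0x78563412

# A hypothetical two-byte field read from a file: the bytes 01 00.
stored = bytes([0x01, 0x00])
as_little = int.from_bytes(stored, "little")  # 1 if the file is little-endian
as_big = int.from_bytes(stored, "big")        # 256 if the file is big-endian
print(as_little, as_big)
```

Both readings are valid numbers, so nothing in the data itself tells you that you flipped it the wrong way; the format's documentation has to say which order applies.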

The Byte Order Mark is used in Unicode encodings such as UTF-16, where byte order matters, to signal which order the bytes are stored in.
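As a rough illustration, the Python sketch below checks the first two bytes of some data for a UTF-16 Byte Order Mark, using the BOM constants from the standard `codecs` module. The function name `detect_utf16_order` is made up for this example, and the check is deliberately simplified: it only looks at UTF-16 and does not try to tell it apart from other encodings whose marks begin with the same bytes.

```python
import codecs


def detect_utf16_order(data: bytes):
    """Return "big", "little", or None based on a leading UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "little"
    return None  # no BOM; the order must be known some other way


text = "Hi"
be_data = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
le_data = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

print(be_data.hex(" "))  # fe ff 00 48 00 69
print(le_data.hex(" "))  # ff fe 48 00 69 00

print(detect_utf16_order(be_data))  # big
print(detect_utf16_order(le_data))  # little

# Python's generic "utf-16" codec reads the BOM and picks the order itself.
decoded_be = be_data.decode("utf-16")
decoded_le = le_data.decode("utf-16")
```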

Endianness can be the subject of tech "holy wars"; in fact, the terms "big-endian" and "little-endian" come from a section of Gulliver's Travels recounting a fictional holy war over which end of an egg is proper to break first.
