Mbox

mbox is the format typically used on Unix-like systems for storing collections of e-mail messages, dating back to the early days of Unix and UUCP network connections. It has several minor variants, but they all consist of a series of messages in Internet e-mail message format (RFC 822 and its successors) appended together, including headers and body. The entire file is supposed to be in 7-bit ASCII, with any characters outside that range encoded in some manner (in accordance with the various other standards for character encoding in messages and their component parts and attachments). Traditional Unix systems generally didn't use file extensions, so many mbox files have none, but some other systems (e.g., Eudora) use a .mbx extension.

Some problems and incompatibilities of this format stem from the (rather shortsighted, in hindsight) design decision to mark the boundary between messages with a "From" line inserted by the mailer program: messages are split wherever the characters "From " (with a trailing space) appear at the beginning of a line. This separator is distinct from the "From:" line in the headers of a message, which has a colon after the keyword. In the separator line, the keyword is followed by the originating mailbox name (originally a UUCP "bang path", a series of nodes separated by exclamation points showing how the message got from its originating node to its current location), then the date and possibly other information.
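The splitting rule described above can be sketched in a few lines of Python. This is only an illustration, not how any particular mailer is implemented; the function name split_mbox and the sample addresses are made up for the example, and the sketch ignores escaping entirely.

```python
def split_mbox(text):
    """Split mbox text into (separator_line, message_lines) pairs.

    A new message starts at every line beginning with "From " (with the
    trailing space). Header lines such as "From: ..." do not match,
    because the fifth character is a colon rather than a space.
    """
    messages = []
    current = None
    for line in text.splitlines():
        if line.startswith("From "):
            current = (line, [])
            messages.append(current)
        elif current is not None:
            current[1].append(line)
    return messages


# Hypothetical two-message mailbox used only for this example.
sample = (
    "From alice@example.org Mon Jan  1 00:00:00 2024\n"
    "From: Alice <alice@example.org>\n"
    "Subject: first\n"
    "\n"
    "Hello.\n"
    "\n"
    "From bob@example.org Tue Jan  2 00:00:00 2024\n"
    "From: Bob <bob@example.org>\n"
    "Subject: second\n"
    "\n"
    "Hi.\n"
)

for separator, lines in split_mbox(sample):
    print(separator, "->", len(lines), "lines")
```

Any unescaped body line that happens to start with "From " would be treated as a separator by this code, which is exactly the weakness the format has to work around.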

Because "From" is a common English word which often appears in the body of messages, sometimes at the beginning of lines, "escaping" has to be done by the mailer programs to alter such lines so they are not mistaken for message breakpoints. This is usually done by prefixing the line with a greater-than sign (>).

This is where the variant formats come in. The original escaping system merely added the prefix character when "From " appeared at the start of a line, so if it was already escaped as ">From ", no further escaping was done. This made the escaping non-reversible, since there was no way to distinguish the case where a "From " was escaped (and the character should be stripped on reading or export) from cases where a greater-than sign was already present (e.g., when the "From " is part of a quoted message in a reply, with all lines prefixed with angle brackets) and should be left alone.
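The loss of information can be shown with a small Python sketch. The function names are invented for this example, and the unescape step is an assumption about what a reader that tries to undo the escaping might do; the original scheme defines no reliable way to reverse it.

```python
def mboxo_escape(line):
    """Prefix ">" only when the line starts with a bare "From "."""
    return ">" + line if line.startswith("From ") else line


def mboxo_unescape(line):
    """Naive reversal: strip one ">" from any line starting ">From "."""
    return line[1:] if line.startswith(">From ") else line


body = [
    "From here on, things change.",  # needs escaping
    ">From the quoted reply",        # already had a ">" in the original
]

stored = [mboxo_escape(line) for line in body]
restored = [mboxo_unescape(line) for line in stored]

for original, kept, back in zip(body, stored, restored):
    print(repr(original), "->", repr(kept), "->", repr(back))
```

Both lines are stored as ">From ...", so a reader cannot tell which one was escaped. Stripping the ">" corrupts the quoted line, and leaving it in place corrupts the other one.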

Some "improved" mbox formats solve this by always adding a ">" sign to lines with "From " either at the start of a line or with one or more ">" signs between the start of the line and the "From ". Then, on reading, exporting, or otherwise handling the messages, one ">" sign is stripped from the beginning of a line that contains "From " after one or more ">" signs. Thus, the ">" sign count increases by one on encoding, and decreases by one on decoding, and everything works in a perfectly reversible way if all software cooperates and doesn't encode too many times without decoding (which would result in a ">"-sign pileup), or decode too many times (which would strip more ">" signs than necessary and leave a bare "From " in a message that didn't have one to begin with).
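A minimal Python sketch of this reversible scheme follows. The function names and regular expressions are illustrative choices rather than code from any actual mailer.

```python
import re

# A line needs escaping if it is "From " preceded by zero or more ">".
FROM_LINE = re.compile(r"^>*From ")
# A line can be unescaped if "From " is preceded by one or more ">".
QUOTED_FROM_LINE = re.compile(r"^>+From ")


def mboxrd_escape(line):
    """Add one ">" to lines matching ^>*From (with trailing space)."""
    return ">" + line if FROM_LINE.match(line) else line


def mboxrd_unescape(line):
    """Strip one ">" from lines matching ^>+From (with trailing space)."""
    return line[1:] if QUOTED_FROM_LINE.match(line) else line


body = ["From a", ">From b", ">>From c", "Fromage", "> From d"]
stored = [mboxrd_escape(line) for line in body]
restored = [mboxrd_unescape(line) for line in stored]

print(stored)
print(restored == body)
```

Each of the first three lines gains exactly one ">" when stored and loses exactly one when read back, while "Fromage" and "> From d" are left alone because they do not match the pattern. Calling mboxrd_unescape twice on a stored line reproduces the over-decoding problem described above.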

Often, mail software doesn't strip any of the ">" signs, so those characters wind up intruding into plain-text messages wherever the word "From" occurs at the start of a line. Sometimes these artifacts have even made it into print, in magazines, newspapers, and books whose articles were at some point e-mailed (and not adequately proofread). They turn up on web pages too (see this lyric page for an example).

The known variants (in terminology dating to the 1990s):


 * mboxo: The "original" mbox format, which escapes "From " lines in a non-reversible manner. (LOCFDD)
 * mboxrd: Version using "reversible" escaping where the ">" signs are piled on and are supposed to be stripped later; named after its inventor, Rahul Dhesi. Thunderbird uses this type of file. (LOCFDD)
 * mboxcl: From Unix System V; uses "Content-Length" headers in the individual messages to determine where to find the next message, so it doesn't need to scan for "From " lines. However, this format still escapes "From " lines in message bodies (in the non-reversible manner of mboxo) anyway, making the whole thing seem pretty pointless. (LOCFDD)
 * mboxcl2: A variant of mboxcl which uses "Content-Length" headers to find messages, and doesn't do any "From " escaping. This avoids corrupting message bodies, but could get into trouble if a message has a missing or incorrect "Content-Length" header, or if the file is processed by a mailer or utility that expects to be able to split messages by "From " lines. (LOCFDD)
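To illustrate how a reader can locate messages by length rather than by scanning bodies, here is a rough Python sketch of an mboxcl2-style reader. It rests on several assumptions that are not taken from any specification: the header block ends at the first blank line, "Content-Length" counts exactly the bytes of the body that follows, and messages are separated by a blank line. The names read_mboxcl2 and make_message are invented for the example.

```python
def read_mboxcl2(data):
    """Yield (separator, header_block, body) from mboxcl2-style bytes."""
    pos = 0
    while pos < len(data):
        # Skip blank lines between messages (assumed layout).
        while data.startswith(b"\n", pos):
            pos += 1
        if pos >= len(data):
            break
        if not data.startswith(b"From ", pos):
            raise ValueError("expected a 'From ' separator at offset %d" % pos)
        sep_end = data.index(b"\n", pos)
        separator = data[pos:sep_end]
        hdr_end = data.index(b"\n\n", sep_end)
        header_block = data[sep_end + 1:hdr_end]

        length = None
        for header in header_block.split(b"\n"):
            name, _, value = header.partition(b":")
            if name.strip().lower() == b"content-length":
                length = int(value.strip())
        if length is None:
            raise ValueError("message has no Content-Length header")

        body_start = hdr_end + 2
        body = data[body_start:body_start + length]
        yield separator, header_block, body
        # Jump straight past the body; its contents are never scanned.
        pos = body_start + length


def make_message(sender, body):
    """Build one hypothetical message with a matching Content-Length."""
    return b"".join([
        b"From ", sender, b" Mon Jan  1 00:00:00 2024\n",
        b"Content-Length: ", str(len(body)).encode("ascii"), b"\n",
        b"\n",
        body,
        b"\n",
    ])


body1 = b"Line one.\nFrom here on, nothing is escaped.\n"
body2 = b"Second body.\n"
sample = (make_message(b"alice@example.org", body1)
          + make_message(b"bob@example.org", body2))

messages = list(read_mboxcl2(sample))
print(len(messages))
```

The unescaped "From " line inside the first body does not confuse this reader, but a tool that splits on "From " lines would see three messages, and a wrong "Content-Length" value would make this reader raise an error or return a truncated body.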

Sample files

 * https://telparia.com/fileFormatSamples/archive/mbox

Links

 * RFC 4155 (application for MIME type, defines format)
 * Wikipedia article
 * Qmail man page
 * Python library to handle mailbox formats
 * More discussion of mbox format versions
 * Info on fixing corrupt mbox files of the form used by Mozilla Thunderbird