Plain text

Plain text files (also known by the extension TXT) consist of characters encoded sequentially in some particular character encoding. Plain text files contain no formatting information other than white space characters. Some data formats (usually those intended to be human-readable) are based on plain text; see Text-based data for some structured formats that are stored in plain text (and hence can be opened in a plain text editor if no more specific program is available).

Traditionally, ASCII was used much of the time for maximum interoperability, though many platform-specific character sets were also in use. For non-English text an encoding supporting a broader character repertoire is needed, often UTF-8 nowadays. Note that if the file consists only of 7-bit ASCII characters, the bytes of the file are identical in US-ASCII, ISO-8859-1, UTF-8, and a number of other encodings, so such a file can be identified as any of these depending on what is most convenient for a particular application. It is only when characters outside this repertoire are used that encoding-specific details need to be considered. Some formats, such as HTML and XML, provide escape sequences (such as ampersands used for character references and entities) that allow special characters to be referenced within the document while leaving the document itself entirely ASCII.
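As a quick illustration of the ASCII-compatibility point, the following Python sketch encodes a pure-ASCII string and checks that the same bytes decode identically under three encodings, then shows the encodings diverging once a non-ASCII character appears. The sample strings are made up for the example.

```python
# Sample text is arbitrary; any string of 7-bit ASCII characters behaves the same way.
ascii_text = "Hello, plain text!\n"
raw = ascii_text.encode("ascii")

# The same bytes decode to the same string under each of these encodings.
decoded = {enc: raw.decode(enc) for enc in ("ascii", "iso-8859-1", "utf-8")}
all_same = all(text == ascii_text for text in decoded.values())

# Once a non-ASCII character appears, the byte sequences differ per encoding.
non_ascii = "caf\u00e9"                          # "café"
latin1_bytes = non_ascii.encode("iso-8859-1")    # é is the single byte E9
utf8_bytes = non_ascii.encode("utf-8")           # é is the two bytes C3 A9
```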

Another point of contention or incompatibility in text-file formats is the convention for line and paragraph breaks. Depending on the system the file was created on or intended to be viewed on, a line break may be a Carriage Return (CR, ASCII 0D hex) and a Linefeed (LF, ASCII 0A hex) together (usually in that order, though in rare cases in the opposite order), or just one of those characters alone. Text viewing or editing programs that are not cross-platform-friendly can badly mishandle files that use a different line break convention than the program expects, so you might see lines overwriting one another instead of going to the next line, peculiar control characters showing up within the file, or other strangeness. Files with linefeed alone are often referred to as "UNIX mode" (and the linefeed, in this context, as NL for Newline). Files with carriage return alone are referred to as "Mac mode" (though the convention was also common on other early platforms such as the Apple II and Commodore 64, and is no longer used on current Macs). The CR+LF format is called "DOS", "PC", or "Windows" mode (though it was used in various mainframes and network protocols as well).
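A minimal Python sketch of how a program might count the three common line break conventions in a file's raw bytes and convert everything to linefeed-only. The function names are invented for this example, and it assumes an ASCII-compatible 8-bit encoding (it would not be correct for UTF-16 or UTF-32 data); the rare LF+CR order is not handled.

```python
def count_line_breaks(data: bytes) -> dict:
    """Count CR+LF pairs, lone LFs, and lone CRs in raw bytes."""
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        "LF": data.count(b"\n") - crlf,   # linefeeds not preceded by CR
        "CR": data.count(b"\r") - crlf,   # carriage returns not followed by LF
    }


def normalize_to_lf(data: bytes) -> bytes:
    """Convert CR+LF and lone CR line breaks to LF ("UNIX mode")."""
    # Replace CR+LF first so that its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

For example, `count_line_breaks(b"a\r\nb\nc\rd")` reports one break of each kind, which is a hint that the file has passed through several systems.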

Files may also use hard line breaks to keep line length within a fixed number of columns (usually 80, but other values such as 40 or 65 are sometimes used), or just have line breaks at the end of paragraphs and expect systems to word-wrap long lines. Encountering files that follow a different convention than you expect may result in lines running far off the right of the screen and requiring horizontal scrolling, or else in short, choppy lines. Many text editors have a "paragraph reformat" command to bring paragraphs into compliance with your desired conventions.
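What a "paragraph reformat" command does can be sketched in Python with the standard `textwrap` module. This is only an illustration, not how any particular editor implements it; the function names are invented, and it assumes paragraphs are separated by a blank line and that lines end with a plain linefeed.

```python
import textwrap


def reformat_paragraphs(text: str, width: int = 80) -> str:
    """Re-wrap each blank-line-separated paragraph to at most `width` columns."""
    paragraphs = text.split("\n\n")
    # Collapse existing line breaks and runs of spaces, then wrap again.
    wrapped = [textwrap.fill(" ".join(p.split()), width=width) for p in paragraphs]
    return "\n\n".join(wrapped)


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines so each paragraph becomes one long line."""
    paragraphs = text.split("\n\n")
    return "\n\n".join(" ".join(p.split()) for p in paragraphs)
```

Calling `reformat_paragraphs(text, width=40)` gives the hard-wrapped convention at 40 columns, while `unwrap_paragraphs(text)` gives the one-line-per-paragraph convention that relies on the viewer to word-wrap.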

Most operating systems include a simple text editor (e.g., Windows Notepad) which can open text files, but many other text editors exist (and computer people sometimes have "holy wars" over which one is best). Some of the common text editors are Emacs, vi, and UltraEdit. In the earlier days of computing, there was less distinction between text editors and word processors than there is now, as word processors generally used a format that was mostly plain text and could even be completely plain text if you refrained from using special embedded commands and features. However, modern word processors such as Microsoft Word default to program-specific save formats that have little resemblance to plain text, unless you go out of your way to "Save As" .txt. A common "newbie error" is to create or edit plain text files in such a program, leaving the files in a proprietary format that breaks other programs that expect to find plain text.

Creating artwork using text characters is known as ASCII Art, or other variants such as ANSI Art if special control or escape codes are used in addition to the plain text characters.

Extension
The traditional extension for text files is .txt, but lots of other extensions have been used. Occasionally, on systems permitting extensions longer than three letters, .text has been used, and .asc (for ASCII) has also had some use; .doc has also sometimes been used for files "documenting" something (like the manual accompanying a piece of downloaded software), but that went out of common use once that extension became associated with Microsoft Word's DOC format.

Identification
UTF-32 text files are usually detected by starting with the byte order mark (BOM), consisting of the bytes FF FE 00 00 (little endian) or 00 00 FE FF (big endian). In some cases UTF-32 files may occur without the BOM; however, only 0x00000000 to 0x0000D7FF and 0x0000E000 to 0x0010FFFF are valid ranges for dwords, while 0x0000D800 to 0x0000DFFF and 0x00110000 to 0xFFFFFFFF are invalid.
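A minimal Python sketch of UTF-32 detection, assuming the standard BOM bytes (FF FE 00 00 for little endian, 00 00 FE FF for big endian) and the rule that every 32-bit value must be a Unicode scalar value (0 to 0xD7FF or 0xE000 to 0x10FFFF). The function names and the returned labels are choices made for this example.

```python
def _is_scalar_value(value: int) -> bool:
    """True for code points that are valid in UTF-32 (surrogates excluded)."""
    return value <= 0xD7FF or 0xE000 <= value <= 0x10FFFF


def detect_utf32(data: bytes):
    """Return 'utf-32-le', 'utf-32-be', or None if the data is not UTF-32."""
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "utf-32-le"
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32-be"
    # No BOM: the length must be a multiple of four and every dword valid.
    if len(data) == 0 or len(data) % 4 != 0:
        return None
    for byteorder, name in (("little", "utf-32-le"), ("big", "utf-32-be")):
        if all(
            _is_scalar_value(int.from_bytes(data[i:i + 4], byteorder))
            for i in range(0, len(data), 4)
        ):
            return name
    return None
```

Because the little-endian UTF-32 BOM begins with the same two bytes as the little-endian UTF-16 BOM, a detector would normally run this check before any UTF-16 check.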

UTF-16 text files are usually detected by starting with the byte order mark (BOM), consisting of the bytes FF FE (little endian) or FE FF (big endian). However, in some cases UTF-16 files may occur without the BOM, in which case detection is not guaranteed to be reliable. Still, the line feed U+000A in its byte-reversed form U+0A00 is not assigned in Unicode 15.0, and null bytes are unlikely to occur in other text encodings, so the presence of word-aligned 0A 00 or 00 0A can rule out 8-bit encodings and one of the two byte orders, and may therefore be used for UTF-16 detection. On the other hand, the bytes 0D 0A read in little-endian form give U+0A0D, which is not assigned in Unicode 15.0 either, but this byte pair is a common newline in 8-bit encodings, so it is not evidence of UTF-16. The detection of UCS-2 text works similarly, since UCS-2 is the precursor of UTF-16; UTF-16 introduced surrogate pairs formed by D800 to DBFF followed by DC00 to DFFF, with other combinations of D800 to DFFF being invalid.
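The BOM check and the line feed heuristic can be sketched in Python as follows. This is an illustrative heuristic only, not a reliable detector: the function name is invented, it assumes UTF-32 has already been ruled out (the little-endian UTF-32 BOM also starts with FF FE), and it gives up when no word-aligned line feed is found.

```python
def detect_utf16(data: bytes):
    """Return 'utf-16-le', 'utf-16-be', or None when nothing points to UTF-16."""
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    # No BOM: collect the 16-bit-aligned byte pairs and look for a line feed.
    words = {data[i:i + 2] for i in range(0, len(data) - 1, 2)}
    has_le_lf = b"\x0a\x00" in words   # U+000A stored little endian
    has_be_lf = b"\x00\x0a" in words   # U+000A stored big endian
    if has_le_lf and not has_be_lf:
        return "utf-16-le"
    if has_be_lf and not has_le_lf:
        return "utf-16-be"
    return None
```

A file without any line feeds, or one containing both patterns, returns `None` here and would need further heuristics.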

ASCII-only text files may be detected by verifying that every byte in the file is in the range 00 to 7F.
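In Python this check is a one-liner; the sketch below spells it out and notes the built-in equivalent. The function name is invented for the example.

```python
def is_ascii_only(data: bytes) -> bool:
    """True if every byte is in the 7-bit range 0x00 to 0x7F."""
    return all(byte <= 0x7F for byte in data)


# bytes.isascii() (Python 3.7+) performs the same test.
sample = b"Hello, world!\n"
same_answer = is_ascii_only(sample) == sample.isascii()
```

Note that an empty file passes this test, so a caller may want to treat zero-length input separately.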

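UTF-8 text files may be detected by the presence of any bytes from 80 to FF, the absence of null bytes (if UTF-16 hasn't been ruled out yet), or by verifying that the file is valid UTF-8. UTF-8 has many error cases; the only valid bit patterns are:

 * 0xxxxxxx (where x forms U+0000 to U+007F)
 * 110xxxxx 10xxxxxx (where x forms U+0080 to U+07FF, but not U+0000 to U+007F)
 * 1110xxxx 10xxxxxx 10xxxxxx (where x forms U+0800 to U+D7FF or U+E000 to U+FFFF, but not U+0000 to U+07FF or U+D800 to U+DFFF)
 * 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (where x forms U+10000 to U+10FFFF, but not U+0000 to U+FFFF or U+110000 to U+1FFFFF)

UTF-8 text files may also start with the UTF-8 byte order mark (EF BB BF).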
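A hand-written validator makes the error cases concrete. The Python sketch below follows the standard UTF-8 rules: one to four byte sequences, continuation bytes of the form 10xxxxxx, and rejection of overlong forms, surrogates (U+D800 to U+DFFF), and values above U+10FFFF. The function name is invented; in practice `data.decode("utf-8")` in strict mode performs the same validation.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Check that `data` consists only of well-formed UTF-8 sequences."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead <= 0x7F:                      # 0xxxxxxx: single-byte ASCII
            i += 1
            continue
        if 0xC0 <= lead <= 0xDF:              # 110xxxxx: two-byte sequence
            length, code, minimum = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:            # 1110xxxx: three-byte sequence
            length, code, minimum = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF7:            # 11110xxx: four-byte sequence
            length, code, minimum = 4, lead & 0x07, 0x10000
        else:                                 # stray continuation or invalid lead
            return False
        if i + length > n:                    # sequence truncated at end of data
            return False
        for byte in data[i + 1:i + length]:
            if byte & 0xC0 != 0x80:           # continuation must be 10xxxxxx
                return False
            code = (code << 6) | (byte & 0x3F)
        # Reject overlong forms, surrogates, and values beyond U+10FFFF.
        if code < minimum or code > 0x10FFFF or 0xD800 <= code <= 0xDFFF:
            return False
        i += length
    return True
```

A file that begins with EF BB BF passes this check as well, since the BOM is itself a valid three-byte sequence encoding U+FEFF.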

When a file is known to be a plain text file but UTF-32, UTF-16, ASCII, and UTF-8 have already been ruled out, only 8-bit encodings or mixed single-byte/double-byte encodings (such as Shift JIS) remain. In this case, the only thing left (other than applying complex heuristics) is to use the regional or system text encoding, such as CP1252, CP1250, CP437, CP852, etc.
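The fallback step might look like the following Python sketch. The function name and the default of CP1252 are assumptions made for this example; a real program would take the fallback from the system or regional settings. Undefined bytes in the chosen code page are replaced rather than raising an error.

```python
def decode_with_fallback(data: bytes, fallback: str = "cp1252") -> str:
    """Try UTF-8 first; otherwise decode with a regional 8-bit code page."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Some code pages leave a few byte values undefined, so replace
        # those with U+FFFD instead of failing.
        return data.decode(fallback, errors="replace")


# The same byte means different characters in different code pages,
# which is why the choice of fallback matters.
legacy_bytes = b"caf\xe9"
as_cp1252 = decode_with_fallback(legacy_bytes, "cp1252")
as_cp437 = decode_with_fallback(legacy_bytes, "cp437")
```

Here the byte E9 decodes to "é" under CP1252 but to a different character under CP437, so a wrong guess produces readable but incorrect text rather than an error.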

Software

 * Textract: extract text from various document formats

Links and References

 * Text file (Wikipedia)
 * textfiles.com: a site full of old text files
 * Less: a Unix/Linux text file pager (for viewing files)
 * Scenario for discussion: Text files.
 * Always bet on text