Plain text files (also known by the extension TXT) consist of characters encoded sequentially in some particular character encoding. Plain text files contain no formatting information other than white space characters. Some data formats (usually those intended to be human-readable) are based on plain text; see Text-based data for some structured formats that are stored in plain text (and hence can be opened in a plain text editor if no more specific program is available).
Traditionally, ASCII was used for maximum interoperability, though many platform-specific character sets were also in use. Non-English text needs an encoding supporting a broader character repertoire, nowadays usually UTF-8. Note that if the file consists only of 7-bit ASCII characters, its bytes are identical in US-ASCII, ISO-8859-1, UTF-8, and a number of other encodings, so such a file can be identified as any of these depending on what is most convenient for a particular application. It is only when characters outside this repertoire are used that encoding-specific details need to be considered. Some formats, such as HTML and XML, provide escape mechanisms (such as ampersand-based character references and entities) that allow special characters to be represented within the document while keeping the document itself entirely ASCII.
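As a rough illustration (my own example, not part of the original text), Python's standard "xmlcharrefreplace" error handler produces exactly this kind of ASCII-only output, replacing any character outside ASCII with a numeric character reference:

    # Characters outside ASCII are rewritten as HTML/XML numeric character
    # references, so the resulting bytes are pure ASCII.
    text = "naïve café"
    ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
    print(ascii_bytes)  # b'na&#239;ve caf&#233;'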
Another point of contention or incompatibility in text-file formats is the convention for line and paragraph breaks. Depending on what system the file was created on or intended to be viewed on, line breaks may be done as Carriage Return (ASCII 0D hex) and Linefeed (ASCII 0A hex) together (usually in that order, though in rare cases in the opposite order), or as just one of those characters alone. Text viewing or editing programs that are not cross-platform-friendly can garble things badly when handling files with a different line break convention than the program expects: you might see lines overwriting one another instead of continuing on the next line, peculiar control characters showing up within the file, or other strangeness. Files with linefeed alone are often referred to as "UNIX mode" (and the linefeed, in this context, referred to as NL for Newline), while files with carriage return alone are referred to as "Mac mode" (though this convention was also common on other early platforms such as the Apple II and Commodore 64, and is no longer used on current Macs), and the CR+LF format is called "DOS" or "PC" or "Windows" mode (though it was used in various mainframes and network protocols as well).
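As a rough sketch (the function and file name here are my own, not any standard), the convention a file uses can be guessed by counting the different line-break sequences:

    # Guess which line-break convention a file uses by counting CR, LF, and CR+LF.
    def guess_newline_convention(data: bytes) -> str:
        crlf = data.count(b"\r\n")
        lf = data.count(b"\n") - crlf      # bare LF ("UNIX mode")
        cr = data.count(b"\r") - crlf      # bare CR ("Mac mode")
        counts = {"DOS/Windows (CR+LF)": crlf, "UNIX (LF)": lf, "Mac/legacy (CR)": cr}
        return max(counts, key=counts.get) if any(counts.values()) else "no line breaks"

    with open("example.txt", "rb") as f:   # "example.txt" is a placeholder name
        print(guess_newline_convention(f.read()))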
Files may also use hard line breaks to keep line length within a fixed number of columns (usually 80, but other values such as 40 or 65 are used sometimes), or just have line breaks at the end of paragraphs and expect systems to word-wrap long lines; encountering files of a different convention than you expect may result in lines running way off to the right of the screen and requiring horizontal scrolling, or else short, choppy lines. Many text editors have a "paragraph reformat" command to bring paragraphs into compliance with your desired conventions.
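For example, a minimal sketch of such a reformat using Python's standard textwrap module (the widths chosen here are arbitrary):

    import textwrap

    # A paragraph that was hard-wrapped at a narrow width by some other program.
    hard_wrapped = "This paragraph was wrapped\nat a narrow width by some\nother program."
    one_line = " ".join(hard_wrapped.split())   # undo the hard line breaks
    print(textwrap.fill(one_line, width=40))    # re-wrap at 40 columns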
Most operating systems include a simple text editor (e.g., Windows Notepad) which can open text files, but many other text editors exist (and computer people sometimes have "holy wars" over which one is best). Some of the common text editors are Emacs, vi, and UltraEdit. In the earlier days of computing, there was less distinction between text editors and word processors than there is now, as word processors generally used a format that was mostly plain text and could even be completely plain text if you refrained from using special embedded commands and features. However, modern word processors such as Microsoft Word default to program-specific save formats that bear little resemblance to plain text, unless you go out of your way to "Save As" .txt. A common "newbie error" is to attempt to create or edit plain text files in such a program, leaving the files in a proprietary format that messes up the operation of other programs that expect to find plain text.
The traditional extension for text files is .txt, but lots of other extensions have been used. Occasionally, on systems permitting extensions longer than three letters, .text has been used, and .asc for ASCII has also had some use; .doc has also sometimes been used for files "documenting" something (like the manual accompanying a piece of downloaded software), but that went out of common use once that extension became associated with Microsoft Word's DOC format.
UTF-32 text files are arrays of 32-bit integers representing Unicode code points and are usually detected by starting with the Byte Order Mark (BOM), the code point 0x0000FEFF, stored as the bytes FF FE 00 00 (for little endian) or 00 00 FE FF (for big endian). In some cases UTF-32 files may occur without the BOM; even then, only values from 0x00000000 through 0x0010FFFF are valid for the dwords, so values such as 0xFFFFFFFF are invalid and rule out UTF-32.
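A hedged sketch of this detection in Python (my own illustration; a real detector would also handle BOM-less files and other invalid ranges):

    import struct

    def looks_like_utf32(data: bytes) -> bool:
        if len(data) % 4 != 0:
            return False
        if data.startswith(b"\xff\xfe\x00\x00"):
            endian = "<"                      # little endian BOM
        elif data.startswith(b"\x00\x00\xfe\xff"):
            endian = ">"                      # big endian BOM
        else:
            return False                      # BOM-less detection not attempted here
        # Every dword must be a value no greater than 0x0010FFFF.
        dwords = struct.unpack(f"{endian}{len(data) // 4}I", data)
        return all(value <= 0x0010FFFF for value in dwords)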
UTF-16 text files are arrays of 16-bit integers representing code units and are usually detected by starting with the byte order mark (BOM), the code unit 0xFEFF, stored as the bytes FF FE (for little endian) or FE FF (for big endian). However, in some cases UTF-16 files may occur without the BOM, in which case detection is not guaranteed to be reliable. Still, the line feed (0x000A) with its bytes reversed (0x0A00) is not a character in Unicode 15.1, and null bytes are unlikely to occur in other text encodings, so the presence of word-aligned 00 0A or 0A 00 can rule out 8-bit encodings as well as one of the two endiannesses, and therefore may be used for UTF-16 detection. On the other hand, the bytes 0D 0A read as little endian form U+0A0D, which is not in Unicode 15.1 either, but which is a common newline (CR+LF) in 8-bit encodings. The detection of UCS-2 text works similarly, since UCS-2 is the precursor of UTF-16; UTF-16 added surrogate pairs, formed by a high surrogate in the range 0xD800 to 0xDBFF followed by a low surrogate in the range 0xDC00 to 0xDFFF, with any other use of code units in the range 0xD800 to 0xDFFF being invalid.
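The BOM and newline heuristics above might be sketched like this (my own illustration, not a definitive detector):

    def guess_utf16(data: bytes) -> str | None:
        if data.startswith(b"\xff\xfe"):
            return "utf-16-le"
        if data.startswith(b"\xfe\xff"):
            return "utf-16-be"
        # BOM-less: look for a word-aligned line feed (0A 00 or 00 0A).
        words = [data[i:i + 2] for i in range(0, len(data) - 1, 2)]
        if b"\x0a\x00" in words:
            return "utf-16-le"                # LF stored low byte first
        if b"\x00\x0a" in words:
            return "utf-16-be"                # LF stored high byte first
        return None                           # inconclusive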
ASCII-only text files may be detected by verifying that the file consists entirely of bytes in the range 0x00 through 0x7F; bytes from 0x80 through 0xFF are not used in ASCII encoding, and null characters (0x00) are not typically found in plain text; null bytes are much more likely to appear in UTF-16 or UTF-32 text.
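In code, the check is short (a sketch under the assumptions above):

    def looks_like_ascii(data: bytes) -> bool:
        # All bytes must be 7-bit, and null bytes are treated as "not plain text".
        return all(b < 0x80 for b in data) and b"\x00" not in data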
UTF-8 text files may be detected by the presence of at least one byte in the range 0x80 through 0xFF (to avoid reporting UTF-8 for ASCII-only files), the absence of null bytes (if UTF-16 and UTF-32 haven't been ruled out yet), and verifying that the file is valid UTF-8. UTF-8 has many error cases; the only valid sequences are 0xxxxxxx (where the x bits form a code point up to 0x7F), 110xxxxx 10xxxxxx (where the x bits form a code point up to 0x07FF, but not an overlong encoding of a lower value), 1110xxxx 10xxxxxx 10xxxxxx (where the x bits form a code point up to 0xFFFF, but not an overlong encoding or a surrogate in 0xD800 to 0xDFFF), and 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (where the x bits form a code point up to 0x10FFFF, but not an overlong encoding or a value above 0x10FFFF up to 0x1FFFFF). UTF-8 text files may also start with the UTF-8 byte order mark (EF BB BF), but should still be verified for validity.
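Rather than re-checking the bit patterns by hand, a sketch like the following can lean on Python's strict UTF-8 decoder, which already rejects overlong encodings, surrogates, and values above 0x10FFFF (again, my own illustration):

    def looks_like_utf8(data: bytes) -> bool:
        if not any(b >= 0x80 for b in data):
            return False                      # pure ASCII; better reported as ASCII
        if b"\x00" in data:
            return False                      # null bytes suggest UTF-16/UTF-32 instead
        try:
            data.decode("utf-8")              # a leading BOM (EF BB BF) decodes fine
            return True
        except UnicodeDecodeError:
            return False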
When a file is known to be a plain text file but UTF-32, UTF-16, ASCII, and UTF-8 have already been ruled out, only 8-bit encodings or mixed single-byte/double-byte encodings (such as Shift JIS) remain. In this case, the only thing left (other than applying complex heuristics) is to fall back to the regional or system text encoding, such as CP1252, CP1250, CP437, CP852, etc.
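A final fallback might simply use the system's preferred encoding (a sketch; which legacy code page applies is platform- and locale-specific):

    import locale

    def decode_with_fallback(data: bytes) -> str:
        fallback = locale.getpreferredencoding(False)   # e.g. "cp1252" on many Windows systems
        return data.decode(fallback, errors="replace")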