Plain text

From Just Solve the File Format Problem

Latest revision as of 21:22, 10 September 2024

File Format
Name: Plain text
Ontology
Extension(s): .txt, .text, .doc, .asc, (none), many others
MIME Type(s): text/plain
PRONOM: x-fmt/111
Wikidata ID: Q1145976

Plain text files (also known by the extension TXT) consist of characters encoded sequentially in some particular character encoding. Plain text files contain no formatting information other than white space characters. Some data formats (usually those intended to be human-readable) are based on plain text; see Text-based data for some structured formats that are stored in plain text (and hence can be opened in a plain text editor if no more specific program is available).

Traditionally, ASCII was used much of the time for maximum interoperability, though many platform-specific character sets were also in use. For non-English text an encoding supporting a broader character repertoire is needed, often UTF-8 nowadays. Note that if the file consists only of 7-bit ASCII characters, the bytes of the file are identical in us-ascii, ISO-8859-1, UTF-8, and a number of other encodings, so such a file can be identified as any of these depending on what is most convenient for a particular application. It is only when characters out of this repertoire are used that encoding-specific details need be considered. Some formats, such as HTML and XML, provide some sort of escape sequences (such as ampersands used for character references and entities) allowing special characters to be referenced within the document while leaving the document itself entirely ASCII.
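The point about 7-bit ASCII files being byte-identical across several encodings can be checked directly. The following is a minimal sketch; the function name `decodes_identically` and the default list of encodings are assumptions chosen for illustration.

```python
def decodes_identically(data: bytes,
                        encodings=("ascii", "iso-8859-1", "utf-8")) -> bool:
    """Return True if data decodes to the same string under every encoding."""
    results = set()
    for name in encodings:
        try:
            results.add(data.decode(name))
        except UnicodeDecodeError:
            # At least one encoding cannot represent these bytes at all.
            return False
    return len(results) == 1


print(decodes_identically(b"Hello, world\n"))
print(decodes_identically("caf\u00e9".encode("utf-8")))
```

The first call uses only 7-bit characters, so every encoding agrees; the second contains a non-ASCII character, so the encodings diverge.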

Another point of contention or incompatibility in text-file formats is the conventions for line and paragraph breaks. Depending on what system the file was created on or intended to be viewed on, line breaks may be done as Carriage Return (ASCII 0D hex) and Linefeed (ASCII 0A hex) together (usually in that order, though in rare cases in the opposite order), or just one of those characters alone. Some text viewing or editing programs that are not cross-platform-friendly will really mess up badly in attempting to view/edit files using a different line break convention than the program expects, so you might see lines overwriting one another instead of going to the next line, or peculiar control characters show up within the file, or other strangeness. Files with linefeed alone are often referred to as "UNIX mode" (and the linefeed, in this context, referred to as NL for Newline), while files with carriage return alone are referred to as "Mac mode" (though it's also common in other early platforms such as the Apple II and Commodore 64, and no longer used in current Macs), while the CR+LF format is called "DOS" or "PC" or "Windows" mode (though it was used in various mainframes and network protocols as well).
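As a rough illustration of telling these conventions apart, the sketch below counts each kind of line break in a byte string and reports the most common one. The function names and the labels "crlf", "lf", and "cr" are assumptions for this example, and the rare reversed LF+CR order is simply counted as one lone LF plus one lone CR.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CR+LF pairs, lone LF bytes, and lone CR bytes."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,                      # "DOS"/"Windows" mode
        "lf": data.count(b"\n") - crlf,    # "UNIX mode"
        "cr": data.count(b"\r") - crlf,    # old "Mac mode"
    }


def guess_line_ending(data: bytes):
    """Return the most frequent convention, or None if there are no breaks."""
    counts = count_line_endings(data)
    if not any(counts.values()):
        return None
    return max(counts, key=counts.get)
```

A file with mixed conventions will still get a single answer here, so a real tool may want to report the full counts instead.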

Files may also use hard line breaks to keep line length within a fixed number of columns (usually 80, but other values such as 40 or 65 are used sometimes), or just have line breaks at the end of paragraphs and expect systems to word-wrap long lines; encountering files of a different convention than you expect may result in lines running way off to the right of the screen and requiring horizontal scrolling, or else short, choppy lines. Many text editors have a "paragraph reformat" command to bring paragraphs into compliance with your desired conventions.
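A "paragraph reformat" command can be approximated with Python's standard `textwrap` module. This is only a sketch: it assumes paragraphs are separated by a blank line, and the function name `reformat_paragraphs` is made up for the example.

```python
import textwrap


def reformat_paragraphs(text: str, width: int = 80) -> str:
    """Re-wrap each blank-line-separated paragraph to the given width."""
    paragraphs = text.split("\n\n")
    wrapped = []
    for paragraph in paragraphs:
        # Collapse existing hard line breaks and runs of spaces first.
        joined = " ".join(paragraph.split())
        wrapped.append(textwrap.fill(joined, width=width))
    return "\n\n".join(wrapped)


print(reformat_paragraphs("aaa bbb ccc ddd\n\neee", width=7))
```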

Most operating systems include a simple text editor (e.g., Windows Notepad) which can open text files, but many other text editors exist (and computer people sometimes have "holy wars" over which one is best). Some of the common text editors are EMACS, vi, and UltraEdit. In the earlier days of computing, there was less distinction between text editors and word processors than there is now, as word processors generally used a format that was mostly plain text and could even be completely plain text if you refrained from using special embedded commands and features. However, modern word processors such as Microsoft Word default to using program-specific save formats that have little resemblance to plain text, unless you go out of your way to "Save As" .txt. A common "newbie error" is to attempt to create or edit plain text files in such a program, leaving the files as proprietarily-formatted in a way that messes up the operation of other programs that expect to find plain text.

Creating artwork using text characters is known as ASCII Art, or other variants such as ANSI Art if special control or escape codes are used in addition to the plain text characters.

Extension

The traditional extension for text files is .txt, but lots of other extensions have been used. Occasionally on systems permitting extensions longer than three letters, .text has been used, and .asc for ASCII has also had some use; .doc has also sometimes been used for files "documenting" something (like the manual accompanying a piece of downloaded software), but that went out of common use once that extension became associated with Microsoft Word's DOC format.

Identification

UTF-32 text files are arrays of 32-bit integers representing Unicode code points. They are usually detected by a leading Byte Order Mark (BOM): the bytes FF FE 00 00 for little endian or 00 00 FE FF for big endian, both encoding 0x0000FEFF. UTF-32 files sometimes occur without the BOM. In that case the value ranges help: only 0x00000000 to 0x0000D7FF and 0x0000E000 to 0x0010FFFF are valid dwords; 0x0000D800 to 0x0000DFFF and 0x00110000 to 0xFFFFFFFF are invalid.
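The UTF-32 checks can be sketched as follows. This is an illustration rather than a definitive detector: the function name `detect_utf32` and the returned labels are assumptions, and the BOM-less branch simply accepts the first byte order in which every dword falls in a valid range.

```python
import struct


def detect_utf32(data: bytes):
    """Return 'utf-32-le', 'utf-32-be', or None if neither looks plausible."""
    # BOM check: 0x0000FEFF stored in either byte order.
    if data[:4] == b"\xff\xfe\x00\x00":
        return "utf-32-le"
    if data[:4] == b"\x00\x00\xfe\xff":
        return "utf-32-be"
    # Without a BOM, the length must be a whole number of dwords.
    if len(data) == 0 or len(data) % 4 != 0:
        return None
    for name, order in (("utf-32-le", "<"), ("utf-32-be", ">")):
        dwords = struct.unpack(f"{order}{len(data) // 4}I", data)
        # Valid dwords: 0x0000-0xD7FF and 0xE000-0x10FFFF.
        if all(d <= 0xD7FF or 0xE000 <= d <= 0x10FFFF for d in dwords):
            return name
    return None
```

Because the little-endian UTF-32 BOM begins with the same two bytes as the little-endian UTF-16 BOM, it seems sensible to run a check like this before any UTF-16 test.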

UTF-16 text files are arrays of 16-bit integers representing code units. They are usually detected by a leading byte order mark (BOM): the bytes FF FE for little endian or FE FF for big endian, both encoding 0xFEFF. UTF-16 files sometimes occur without the BOM, in which case detection is not guaranteed to be reliable. A useful heuristic is that the line feed is 0x000A while its byte reversal, 0x0A00, is not an assigned character as of Unicode 16.0, and null bytes are unlikely to occur in other text encodings. The presence of word-aligned 00 0A or 0A 00 can therefore rule out 8-bit encodings and one of the two byte orders, and so may be used for UTF-16 detection. By contrast, the bytes 0D 0A read as a little-endian word form U+0A0D, which is also unassigned as of Unicode 16.0, but this byte pair is a common newline in 8-bit encodings, so a word-aligned 0D 0A points to an 8-bit encoding rather than UTF-16. Detection of UCS-2 text works similarly, since UCS-2 is the precursor of UTF-16. UTF-16 introduced surrogate pairs, formed by 0xD800 to 0xDBFF followed by 0xDC00 to 0xDFFF; any other combination of 0xD800 to 0xDFFF is invalid.
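A minimal sketch of the BOM check and the line-feed heuristic is shown below. The name `guess_utf16` is an assumption, and the function deliberately returns None when the evidence is missing or contradictory, since BOM-less detection is not reliable.

```python
def guess_utf16(data: bytes):
    """Return 'utf-16-le', 'utf-16-be', or None when there is no clear sign."""
    if data[:2] == b"\xff\xfe":
        return "utf-16-le"
    if data[:2] == b"\xfe\xff":
        return "utf-16-be"
    le_hits = be_hits = 0
    # Look only at word-aligned byte pairs.
    for i in range(0, len(data) - 1, 2):
        pair = data[i:i + 2]
        if pair == b"\x0a\x00":      # U+000A stored little endian
            le_hits += 1
        elif pair == b"\x00\x0a":    # U+000A stored big endian
            be_hits += 1
    if le_hits and not be_hits:
        return "utf-16-le"
    if be_hits and not le_hits:
        return "utf-16-be"
    return None
```

A stricter version could also confirm that every value from 0xD800 to 0xDFFF appears as part of a well-formed surrogate pair.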

ASCII-only text files may be detected by verifying that every byte in the file is in the range 0x01 to 0x7F. The bytes 0x80 to 0xFF are not used in the ASCII encoding, and null characters (0x00) are not typically found in plain text; null bytes are much more likely to indicate UTF-16 or UTF-32 text.
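The ASCII-only test amounts to a single range check over the bytes. The function name `is_ascii_text` is an assumption for this sketch, and an empty input is treated as passing.

```python
def is_ascii_text(data: bytes) -> bool:
    """True if every byte is in 0x01-0x7F (no nulls, no high bytes)."""
    return all(0x01 <= byte <= 0x7F for byte in data)
```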

UTF-8 text files may be detected by the presence of at least one byte from 0x80 to 0xFF (to avoid processing ASCII-only files as UTF-8), the absence of null bytes (if UTF-16 and UTF-32 haven't been ruled out yet), and verifying that the file is valid UTF-8. UTF-8 has many error cases; the only valid bit patterns are:

* 0xxxxxxx, where x forms 0x00 to 0x7F
* 110xxxxx 10xxxxxx, where x forms 0x0080 to 0x07FF (but not 0x00 to 0x7F)
* 1110xxxx 10xxxxxx 10xxxxxx, where x forms 0x0800 to 0xD7FF or 0xE000 to 0xFFFF (but not 0x0000 to 0x07FF or 0xD800 to 0xDFFF)
* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, where x forms 0x10000 to 0x10FFFF (but not 0x0000 to 0xFFFF or 0x110000 to 0x1FFFFF)

UTF-8 text files may also start with the UTF-8 byte order mark (EF BB BF), but should still be verified for validity.
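The validity rules can be written out by hand, as in the sketch below. It is meant to mirror the bit patterns and ranges described here, not to replace a library decoder; the names `is_valid_utf8` and `looks_like_utf8` are assumptions for the example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Check lead/continuation bytes, overlong forms, surrogates, and range."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                      # 0xxxxxxx
            i += 1
            continue
        if 0xC0 <= lead <= 0xDF:             # 110xxxxx
            length, value, minimum = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:           # 1110xxxx
            length, value, minimum = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF7:           # 11110xxx
            length, value, minimum = 4, lead & 0x07, 0x10000
        else:                                # stray 10xxxxxx or 0xF8-0xFF
            return False
        if i + length > n:                   # sequence cut off at end of data
            return False
        for byte in data[i + 1:i + length]:
            if byte & 0xC0 != 0x80:          # continuation must be 10xxxxxx
                return False
            value = (value << 6) | (byte & 0x3F)
        # Reject overlong forms, surrogates, and values above 0x10FFFF.
        if value < minimum or value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            return False
        i += length
    return True


def looks_like_utf8(data: bytes) -> bool:
    """Combine the three checks: a high byte, no null bytes, valid UTF-8."""
    return any(b >= 0x80 for b in data) and 0 not in data and is_valid_utf8(data)
```

In practice `data.decode("utf-8")` performs the same validation and raises `UnicodeDecodeError` on failure; the explicit loop is shown only to make the rules visible.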

When a file is known to be a plain text file but UTF-32, UTF-16, ASCII, and UTF-8 were already ruled out, only 8-bit encodings or mixed single byte/double byte encodings (such as Shift JIS) remain. In this case, the only thing left (other than applying complex heuristics) is to use the regional or system text encoding, such as CP1252, CP1250, CP437, or CP852.
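The final fallback step might look like the sketch below, which tries a caller-supplied list of regional encodings in order. The function name `decode_with_fallback` and the default candidate list are assumptions; a real application would pick candidates based on the system or regional settings.

```python
def decode_with_fallback(data: bytes, candidates=("cp1252", "cp437")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            # Some byte is undefined in this code page; try the next one.
            continue
    raise ValueError("no candidate encoding could decode the data")
```

Note that a clean decode only shows the bytes are legal in that code page, not that the resulting text is what the author intended.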

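See also

Software

* Textract: extract text from various document formats (http://textract.readthedocs.org/en/latest/)

Sample files

* dexvert samples: text/txt
* dexvert samples: text/utf16Text

Links and References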
