Softdisk Text Compressor

From Just Solve the File Format Problem
Revision as of 22:32, 9 December 2012

File Format
Name Softdisk Text Compressor
Ontology
Extension(s) .ctx

The Softdisk Text Compressor was yet another format used by Softdisk Publishing on its diskmagazines in the 1990s. It was not released to the public, but was used as an internal utility to prepare text files for publication, with the decompressor embedded in the "shell" program used to display the articles when the diskmagazine issue was read. It was developed in 1993 and used on some issues of Softdisk PC. It was a simple compression routine designed to squeeze a few bytes out of typical English-language ASCII files at a time when saving a handful of bytes still mattered. The diskmagazines were published on floppy disks and were expected to run directly from the disk without being installed on a hard drive, so shrinking a file by one 1K disk block could be critical to resolving a "disk full" error during the deadline crunch to get an issue out. (You kids have it so easy with your terabyte hard drives...)

The basic technique was to represent specific character strings (taken from a fixed, hand-generated list of character sequences common in the text files used by Softdisk) as single bytes in the compressed file. The bytes used for this purpose were those corresponding to ASCII control characters (other than CR and LF) and 8-bit characters (#128-#254). Character #255 was reserved as an escape: it signaled either an escaped special character to be treated literally or a request to repeat a character, as shown below.

The format was as follows:

The first six bytes were flag bytes to indicate the file was of this format; they were expected to be (hex) 03 43 54 30 31 31, which was Control-C followed by CT001.

Next was the original filename of the uncompressed file, as a variable-length null-terminated string.
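A minimal Python sketch of reading these two header fields might look like the following. The function name is hypothetical, and the signature constant uses the hex values exactly as listed above; note that those bytes spell "CT011" in ASCII, so which of the two spellings real files carry is an assumption here. Decoding the filename as Latin-1 is likewise only a convenient guess at the character set.

```python
import io

# Flag bytes copied from the hex listing in the text. Assumption: the hex
# listing is authoritative (these bytes read as Control-C + "CT011").
SIGNATURE = bytes([0x03, 0x43, 0x54, 0x30, 0x31, 0x31])


def read_signature_and_name(stream):
    """Check the six flag bytes, then read the null-terminated filename."""
    magic = stream.read(len(SIGNATURE))
    if magic != SIGNATURE:
        raise ValueError("not a Softdisk Text Compressor file")

    name = bytearray()
    while True:
        b = stream.read(1)
        if not b:
            raise ValueError("unexpected end of file inside filename")
        if b == b"\x00":
            break  # the null terminator ends the filename
        name += b

    # Assumption: Latin-1 is used only because it maps every byte losslessly.
    return name.decode("latin-1")
```

After the call, the stream is positioned at the first byte of the substitution tables.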

Then followed a table of fixed character sequences which were to be substituted for particular characters in the file. First there were 29 sequences of up to 5 characters, which were the uncompressed strings corresponding to the file input characters #0 - #9, #11 - #12, and #14 - #31 (decimal), in other words the C0 controls except for carriage return and linefeed. Each such sequence terminated at 5 characters or when a null (#0) was encountered, whichever came first.
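The termination rule can be sketched in Python as below. This takes the description literally: an entry that reaches five characters ends there, with no trailing null to skip, which is an assumption about the on-disk layout. The function names are hypothetical. Because the ranges listed above cover 30 code values while the text gives 29 sequences, the sketch returns the entries in file order and leaves their assignment to code values open.

```python
import io

MAX_ENTRY_LEN = 5  # "up to 5 characters" per the description


def read_short_entry(stream):
    """Read one entry: stop at a null byte or after five characters."""
    entry = bytearray()
    while len(entry) < MAX_ENTRY_LEN:
        b = stream.read(1)
        if not b:
            raise ValueError("unexpected end of file inside table")
        if b == b"\x00":
            break  # null terminator (consumed, not stored)
        entry += b
    return bytes(entry)


def read_short_table(stream, count=29):
    """Read `count` entries in file order (29 is the count the text gives)."""
    return [read_short_entry(stream) for _ in range(count)]
```

For example, the bytes `the `, null, `and`, null, `ing t` would yield the three entries `the `, `and`, and `ing t`, the last one ending at the five-character limit rather than at a null.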

Next was a table of two-character sequences, 127 of them corresponding to the file input characters #128-#254.
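As a rough illustration, the two-character table could be loaded as follows. The sketch assumes each entry occupies exactly two bytes with no terminator (254 bytes in total) and that the entries appear in ascending order of code value; neither detail is stated outright above, and the function name is hypothetical.

```python
import io

FIRST_CODE = 128
LAST_CODE = 254  # 127 codes in total; #255 is the escape character


def read_pair_table(stream):
    """Map each code #128-#254 to its two-byte replacement string."""
    table = {}
    for code in range(FIRST_CODE, LAST_CODE + 1):
        pair = stream.read(2)  # assumption: fixed width, no terminator
        if len(pair) != 2:
            raise ValueError("truncated two-character table")
        table[code] = pair
    return table
```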

After this came the compressed data itself. The characters (bytes) were to be read one at a time, and treated as follows:

If it is a CR (#13), output a line break. (Linefeeds were stripped and ignored to save that precious one byte per line.)

If it is #0-#9, #11-#12, or #14-#31 (other control characters), replace it with the corresponding string sequence in the up-to-five-character table that was read in earlier.

If it is #128-#254, replace it with the corresponding string sequence in the two-character table that was read in earlier.

If it is #255, read the next character (byte). If it is an ASCII printable character (#32-#127), subtract 30 from the byte value and consider this quantity n. Then read one more character c, and output that character n times. (This is handy for encoding sequences of repeated characters such as dashed lines.)

If the character after the #255 character is anything else, output it as a literal. This allows control characters and high-bit characters to be included in the file, though they become two-character sequences in the "compressed" data which could make the file actually get bigger on compression if there are many characters of this sort.
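The rules above can be put together in a short Python sketch. It is an illustration under stated assumptions, not the original decompressor: the function name and the dictionary-based tables are hypothetical, ordinary characters #32-#127 are assumed to be copied through unchanged, a stray linefeed in the data is assumed to be skipped, and the line break is written as CR LF.

```python
def decode_data(data, short_table, pair_table, newline=b"\r\n"):
    """Expand compressed bytes using the two substitution tables.

    short_table and pair_table are dicts mapping a code value to its
    replacement bytes (a hypothetical representation for this sketch).
    """
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if b == 13:
            out += newline              # CR marks a line break
        elif b == 10:
            continue                    # assumption: stray linefeeds are skipped
        elif b < 32:
            out += short_table[b]       # control code: up-to-five-character string
        elif b < 128:
            out.append(b)               # assumption: plain text is copied as-is
        elif b < 255:
            out += pair_table[b]        # high-bit code: two-character string
        else:
            if i >= len(data):
                raise ValueError("escape at end of data")
            nxt = data[i]
            i += 1
            if 32 <= nxt <= 127:
                n = nxt - 30            # repeat count
                if i >= len(data):
                    raise ValueError("repeat count without a character")
                out += bytes([data[i]]) * n
                i += 1
            else:
                out.append(nxt)         # escaped literal
    return bytes(out)
```

With a short table mapping #1 to `the ` and a pair table mapping #128 to `th`, the bytes #1, `cat`, #13, #255, #35, `-` decode to `the cat`, a line break, and five dashes, since 35 - 30 = 5.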

This compression routine can be used on any files in an 8-bit character encoding, but works best on ones limited to 7-bit ASCII.

See also

Sample files
