Sol BASIC tokenized file

Sol was a line of computers in the late 1970s, the most popular of which was the SOL-20. It was one of the S-100 bus computers of that era which, if you added a disk drive, could run the CP/M operating system, but it was often used with cassette data storage instead. It had a version of the BASIC programming language (not in ROM; you had to load it). When you saved a BASIC program to tape or disk, you could add a parameter to the SAVE command to make it save the program as plain text, which was more suitable for transfer to other systems. The default save mode, however, was the more compact (but less transferable) tokenized form. On cassette, the low-level format was Kansas City standard (or possibly CUTS).

Documenting the format
No documentation of the specific tokenized format appeared to be readily accessible (but see below). It is possible, however, to piece it together with the help of the Solace emulator (linked below), which does a great job of imitating a SOL-20 computer under MS Windows, even down to saving a BASIC program into a file that imitates the form in which it would have been written to cassette on a real SOL-20. With a bit of "geek detective" work, one can then figure out how the data is structured. (First you have to figure out how to do anything in the emulator at all: it meticulously imitates everything on the SOL, including the annoyances, such as the need to enter all commands in uppercase and the need to load BASIC before using BASIC programs.)

If you enter this program:

10 PRINT "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abc"
20 FOR I=1 TO 10
30 PRINT I,I*4
40 NEXT I
50 PRINT "Done."
60 END

then save it to a "virtual cassette" (and use the "File" menu of the "virtual cassette player" window to save that to a real disk file on your computer; it will have a .SVT extension), you get the following. The format is specific to the emulator. To some extent it represents what would be written to a cassette on a real SOL-20, but it can't entirely be relied upon in this regard, since some of its content is emulator-specific:

C 29
H PROG C2 005B 1AD9 0000
D 2E0A0089224142434445464748494A4B
D 4C4D4E4F505152535455565758595A30
D 313233343536373839616263220D0B14
D 008849F5319E31300D0A1E0089492C49
D E2340D06280081490D0C32008922446F
D 6E652E220D053C008D0D0120
C 10

The lines starting with "C" and "H" appear to be part of the filesystem (which include the name the program was saved as, "PROG"), so any documentation on their format belongs in the Filesystem section (or would if they were the true tape format of a SOL-20 rather than just an emulated version that might differ). The "D" lines encode the program file data itself, as a series of hexadecimal digits. Take them in pairs to get the successive bytes of the tokenized BASIC file.
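The pairing of hex digits into bytes can be sketched in a few lines of Python. This is only a minimal sketch: it assumes each data record is a line consisting of "D", whitespace, and a run of hex digits, as in the dump above, and it ignores every other record type. The function name svt_data_bytes is made up for this example and is not part of Solace.

```python
def svt_data_bytes(svt_text: str) -> bytes:
    """Concatenate the hex payload of every "D" line into one byte string."""
    payload = bytearray()
    for line in svt_text.splitlines():
        fields = line.split()
        # Assumption: only "D" records carry file data; "C" and "H" are skipped.
        if len(fields) == 2 and fields[0] == "D":
            payload += bytes.fromhex(fields[1])
    return bytes(payload)


# The example dump from this article.
SAMPLE_SVT = """\
C 29
H PROG C2 005B 1AD9 0000
D 2E0A0089224142434445464748494A4B
D 4C4D4E4F505152535455565758595A30
D 313233343536373839616263220D0B14
D 008849F5319E31300D0A1E0089492C49
D E2340D06280081490D0C32008922446F
D 6E652E220D053C008D0D0120
C 10
"""

data = svt_data_bytes(SAMPLE_SVT)
print(len(data))       # 92 bytes of data in the example
print(data[:4].hex())  # 2e0a0089
```

The resulting byte string is the tokenized BASIC file as the emulator stored it, ready to be split into program lines.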

The file appears to be a series of program lines, each terminated by the carriage return character (hex 0D). The first byte of each program line (ignore the physical "D" lines in the representation of the file) gives the number of bytes the line takes up, counting the length byte itself and the terminating 0D; you can quickly skip to the next line by going forward that number of bytes. In the example, the first line has length 2E hex (46), and the second line starts 46 bytes after the first. Since a 0D byte can also turn up inside a line number (line 13 would be stored as 0D 00), the length byte is the more reliable way to find line boundaries. In the example, the last program line is followed by the bytes 01 20, which presumably mark the end of the program, though their exact meaning is not confirmed.

The next two bytes are the line number, represented as a 2-byte unsigned integer (little-endian), so line 10 is stored as 0A 00. Then follows the tokenized program code itself. ASCII printable characters represent themselves in literal strings, variable names, and numeric constants; the latter are simply stored as a series of ASCII digits instead of being encoded as integer or floating-point numbers as in some other BASICs. Some symbols, like quote marks and commas, are also represented by their plain ASCII values, though others (such as the equal sign, F5 hex in the example) have their own token representations, apparently to signal that they are operators or functions with special meaning. The keywords of BASIC each have a byte in the high-bit-set range (80 to FF hex) representing them; for instance, 89 hex is "PRINT".

All spaces other than ones within quoted strings are stripped, as they are unnecessary to the syntax. They are added back when the program is listed.
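As an illustration, the layout described above can be walked with a short Python sketch. Everything here rests on the single example in this article: the token table contains only the seven values that can be read off that example (it is not a complete list for either BASIC), the rule for putting spaces back around keywords is a simplification invented for this sketch rather than the behaviour of the real LIST command, and treating a length byte below 4 as the end of the program is an assumption. The names split_lines and detokenize are hypothetical.

```python
# Tokens read off the sample program only; NOT a complete token table.
SAMPLE_TOKENS = {
    0x81: "NEXT", 0x88: "FOR", 0x89: "PRINT", 0x8D: "END",
    0x9E: "TO", 0xE2: "*", 0xF5: "=",
}


def split_lines(data: bytes):
    """Yield (line_number, body) for each program line, using the length byte."""
    pos = 0
    while pos < len(data):
        length = data[pos]
        # Assumption: a real line needs length byte + 2 number bytes + 0D,
        # so anything shorter (like the trailing 01) ends the program.
        if length < 4 or pos + length > len(data):
            break
        record = data[pos:pos + length]
        if record[-1] != 0x0D:
            raise ValueError(f"line at offset {pos} does not end in 0D")
        yield int.from_bytes(record[1:3], "little"), record[3:-1]
        pos += length


def detokenize(body: bytes) -> str:
    """Expand token bytes outside quoted strings; unknown ones become <XX>."""
    text = ""
    in_string = False
    for b in body:
        if b == 0x22:  # a double quote toggles string mode
            in_string = not in_string
        if b < 0x80 or in_string:
            text += chr(b)
            continue
        word = SAMPLE_TOKENS.get(b, f"<{b:02X}>")
        if word.isalpha():  # simplified spacing rule: pad keywords only
            if text and not text.endswith(" "):
                text += " "
            text += word + " "
        else:
            text += word
    return text.strip()


# The bytes of the example file (the "D" lines joined together).
data = bytes.fromhex(
    "2E0A0089224142434445464748494A4B4C4D4E4F505152535455565758595A30"
    "313233343536373839616263220D0B14008849F5319E31300D0A1E0089492C49"
    "E2340D06280081490D0C32008922446F6E652E220D053C008D0D0120"
)
listing = [f"{number} {detokenize(body)}" for number, body in split_lines(data)]
print("\n".join(listing))
```

Run on the example bytes, this prints the six program lines as they were entered. With a different program, any keyword not in the small table shows up as a placeholder such as <A0>.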

In this manner, it should be possible to build a list of all the tokens by writing a program that uses all of them and seeing how it ends up when saved.
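A small helper can take some of the tedium out of that approach by listing every high-bit byte that occurs outside quoted strings in a saved program. The sketch below assumes the length-prefixed line layout described above and, as a guess, treats a length byte below 4 as the end of the program. The function name tokens_used is made up for this example. Matching each byte to a keyword still has to be done by comparing against the program as typed.

```python
def tokens_used(data: bytes) -> set:
    """Return the set of token bytes (80 to FF hex) found outside strings."""
    found = set()
    pos = 0
    # Assumption: a length byte below 4 cannot be a real line, so stop there.
    while pos < len(data) and data[pos] >= 4:
        length = data[pos]
        # Skip the length byte and 2-byte line number; drop the final 0D.
        body = data[pos + 3:pos + length - 1]
        in_string = False
        for b in body:
            if b == 0x22:  # a double quote toggles string mode
                in_string = not in_string
            elif b >= 0x80 and not in_string:
                found.add(b)
        pos += length
    return found


# The bytes of the example file from this article.
data = bytes.fromhex(
    "2E0A0089224142434445464748494A4B4C4D4E4F505152535455565758595A30"
    "313233343536373839616263220D0B14008849F5319E31300D0A1E0089492C49"
    "E2340D06280081490D0C32008922446F6E652E220D053C008D0D0120"
)
print(" ".join(f"{b:02X}" for b in sorted(tokens_used(data))))
# 81 88 89 8D 9E E2 F5
```

For the example these are the bytes for NEXT, FOR, PRINT, END, TO, the multiplication sign, and the equal sign, in that order.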

But even better...
It's not necessary to go to all this work to find out the tokens, though, since the person who created the emulator (Jim Battle) has done the detective work already. The token list for BASIC-80 (the standard Sol BASIC) can be found within the code of one of his Perl scripts (distributed in a ZIP archive, which you need to unzip), and he has produced a similar list for the different tokens of the "Extended BASIC" that was also available for the SOL-20.

Note that BASIC-80 and Extended BASIC (the two major BASICs used on Sol computers) use entirely different token lists.

BASIC-80 tokens
Blank values indicate either that the token is unused or is used for something unknown.

Extended BASIC tokens
Blank values indicate either that the token is unused or is used for something unknown.

Software

 * Solace: Sol emulator for Windows
 * JSMESS in-browser emulator of Sol-20
 * Some utilities

Documentation

 * Sol manuals, including several about BASIC