Sol BASIC tokenized file

Sol was a line of computers in the late 1970s, the most popular of which was the SOL-20. It was one of the S-100 bus computers of that era which, if you added a disk drive, could run the CP/M operating system, but was often used with cassette data storage instead. It had a version of the BASIC programming language (not in ROM; you had to load it). When you saved a BASIC program to tape or disk, you could add a parameter to the SAVE command to make it save the program as plain text, which was more suitable for transfer to other systems. However, the default save mode was the more compact (but less transferable) tokenized form.

No documentation of the tokenized format appears to be readily accessible, but it can be pieced together with the help of the Solace emulator (linked below). Solace does a great job of imitating a SOL-20 computer under MS-Windows, even down to saving a BASIC program into a file that mirrors what would have been written to cassette on a real SOL-20. With a bit of "geek detective" work, one can then work out how the data is structured. (First you have to figure out how to do anything in the emulator at all... it meticulously imitates everything on the SOL, including the things that are a pain in the butt, like the need to enter all commands in uppercase and the need to load BASIC before using BASIC programs.)

If you enter this program:

10 PRINT "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789abc"
20 FOR I=1 TO 10
30 PRINT I,I*4
40 NEXT I
50 PRINT "Done."
60 END

then save it to a "virtual cassette" (and use the "File" menu of the "virtual cassette player" window to save that to a real disk file on your computer; it will have a .SVT extension), you get this (in a format specific to the emulator, but a representation of what would be written to a cassette on a real SOL-20):

C 29
H PROG C2 005B 1AD9 0000
D 2E0A0089224142434445464748494A4B
D 4C4D4E4F505152535455565758595A30
D 313233343536373839616263220D0B14
D 008849F5319E31300D0A1E0089492C49
D E2340D06280081490D0C32008922446F
D 6E652E220D053C008D0D0120

C 10

The lines starting with "C" and "H" appear to be part of the filesystem (which include the name the program was saved as, "PROG"), so any documentation on their format belongs in the Filesystem section. The "D" lines encode the program file data itself, as a series of hexadecimal digits. Take them in pairs to get the successive bytes of the tokenized BASIC file.
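As a rough illustration of the "take them in pairs" step, here is a minimal Python sketch that collects the hex digits from the "D" lines and turns them into bytes. The function name svt_data_bytes is made up for this example, and the code assumes only what the sample shows: one record per line, with "D" followed by a single run of hex digits. Anything else about the .SVT format is not covered here.

```python
def svt_data_bytes(svt_text: str) -> bytes:
    """Concatenate the hex payload of every 'D' line into raw bytes."""
    chunks = []
    for line in svt_text.splitlines():
        fields = line.split()
        # Only 'D' lines carry file data; 'C' and 'H' lines are skipped.
        if len(fields) == 2 and fields[0] == "D":
            chunks.append(bytes.fromhex(fields[1]))
    return b"".join(chunks)


# The sample dump from this article, one record per line.
SAMPLE = """\
C 29
H PROG C2 005B 1AD9 0000
D 2E0A0089224142434445464748494A4B
D 4C4D4E4F505152535455565758595A30
D 313233343536373839616263220D0B14
D 008849F5319E31300D0A1E0089492C49
D E2340D06280081490D0C32008922446F
D 6E652E220D053C008D0D0120

C 10
"""

data = svt_data_bytes(SAMPLE)
print(len(data), data[:4].hex())  # prints: 92 2e0a0089
```

The first four bytes (2E 0A 00 89) are the start of the first tokenized program line.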

The data appears to be a series of program lines, each terminated by the carriage return character (hex 0D). The first byte of each line gives the number of bytes the line takes up, counting the length byte itself and the terminating carriage return (the first line of the sample starts with 2E hex and is 46 bytes long), so you can skip to the next line by going forward that number of bytes. The next two bytes are the line number, stored as a 2-byte unsigned integer (little-endian). Then follows the tokenized program code itself. In the sample, the last line is followed by the two bytes 01 20; their meaning is unclear, though they may mark the end of the program.

Within the program code, ASCII printable characters represent themselves in literal strings, variable names, and numeric constants. Numeric constants are simply stored as a series of ASCII digits rather than being encoded as integer or floating-point numbers as some other BASICs do. Some symbols, like the quote marks and the comma, are also stored as their plain ASCII values, though others have their own tokens, apparently to signal that they are operators or functions with special meaning; the equal sign, for example, appears as F5 hex in the sample. Each BASIC keyword is represented by a single byte in the high-bit-set range from 80 to FF hex; for instance, 89 hex is "PRINT" and 88 hex is "FOR". All spaces other than those within quoted strings are stripped, as they are unnecessary to the syntax; they are added back when the program is listed.
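The layout described above can be sketched in Python as follows. This is only an illustration under several assumptions: the token table holds just the seven values that can be read off the sample program (the full table is unknown), the length byte is taken to count itself and the trailing carriage return (which matches the sample), and any record shorter than four bytes, such as the trailing 01 20, is treated as the end of the program. The names parse_lines and detokenize are invented for this example, and the spacing rule is a guess at readable output, not the real LIST behavior.

```python
# Token values inferred from the sample program only; the full table is unknown.
TOKENS = {
    0x81: "NEXT", 0x88: "FOR", 0x89: "PRINT", 0x8D: "END",
    0x9E: "TO", 0xE2: "*", 0xF5: "=",
}


def parse_lines(data: bytes):
    """Yield (line_number, body_bytes) for each tokenized program line."""
    pos = 0
    while pos < len(data):
        length = data[pos]
        # Smallest real line: length byte + 2 line-number bytes + CR.
        # Assumption: anything shorter (e.g. the trailing 01 20) ends the program.
        if length < 4 or pos + length > len(data):
            break
        number = int.from_bytes(data[pos + 1:pos + 3], "little")
        yield number, data[pos + 3:pos + length - 1]  # drop the trailing 0x0D
        pos += length


def detokenize(body: bytes) -> str:
    """Expand token bytes to text, padding keywords with spaces."""
    out = []
    in_string = False
    for b in body:
        if b == 0x22:  # a quote mark toggles string-literal mode
            in_string = not in_string
        if b >= 0x80 and not in_string:
            word = TOKENS.get(b, f"<{b:02X}>")  # unknown tokens shown as <hex>
            if word.isalpha():
                if out and not out[-1].endswith(" "):
                    word = " " + word
                word += " "
            out.append(word)
        else:
            out.append(chr(b))
    return "".join(out).strip()


SAMPLE = bytes.fromhex(
    "2E0A0089224142434445464748494A4B"
    "4C4D4E4F505152535455565758595A30"
    "313233343536373839616263220D0B14"
    "008849F5319E31300D0A1E0089492C49"
    "E2340D06280081490D0C32008922446F"
    "6E652E220D053C008D0D0120"
)

listing = [f"{number} {detokenize(body)}" for number, body in parse_lines(SAMPLE)]
print("\n".join(listing))
```

Run on the sample bytes, this prints the six-line program as it was typed in. A token that is not in the table would show up as its hex value in angle brackets.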

In this manner, it should be possible to build a list of all the tokens by writing a program that uses all of them and seeing how it ends up when saved.
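One way to go about that survey is sketched below in Python: walk the saved bytes and note every high-bit byte that occurs outside a quoted string, along with the line numbers where it occurs, then compare against the program as typed to put names to the values. The function name survey_tokens is made up for this example, and the code relies on the same assumptions about the layout as the description above (length byte counts itself and the carriage return; a record shorter than four bytes ends the program). The sample bytes are lines 20 to 40 of the program in this article.

```python
from collections import defaultdict


def survey_tokens(data: bytes) -> dict:
    """Map each high-bit byte found outside string literals to the
    line numbers on which it appears."""
    seen = defaultdict(list)
    pos = 0
    # Stop at a too-short record (assumed end marker) or a truncated line.
    while pos < len(data) and data[pos] >= 4 and pos + data[pos] <= len(data):
        length = data[pos]
        number = int.from_bytes(data[pos + 1:pos + 3], "little")
        in_string = False
        for b in data[pos + 3:pos + length - 1]:
            if b == 0x22:  # quote mark: entering or leaving a string
                in_string = not in_string
            elif b >= 0x80 and not in_string:
                seen[b].append(number)
        pos += length
    return dict(seen)


# Lines 20, 30 and 40 of the sample, followed by the trailing 01 20.
SAMPLE = bytes.fromhex(
    "0B14008849F5319E31300D"
    "0A1E0089492C49E2340D"
    "06280081490D"
    "0120"
)

found = survey_tokens(SAMPLE)
for byte, lines in sorted(found.items()):
    print(f"{byte:02X} used on lines {lines}")
```

Matching each reported byte against the source line it came from (for example, 9E on line 20, which reads FOR I=1 TO 10) is then a manual step.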

Software

 * Solace: Sol emulator for Windows

Documentation

 * Sol manuals, including several about BASIC