Atari BASIC tokenized file

Atari BASIC was used on the Atari 400 and 800 computers, two of the many systems competing for the home computer market in the late '70s and early '80s. While Atari considered using an adapted Microsoft BASIC like some other manufacturers, they ultimately chose an independently developed BASIC instead, so many characteristics of this BASIC (including its manner of tokenization) differ greatly from the other BASICs of the time.

To understand the differences in the file format, a quick review of other BASICs is helpful. Some BASICs, notably Tiny BASIC, performed very little conversion of the program code typed in by the user. They read the line number, if present, and converted it to a 16-bit integer value, which made it easier to search for in GOTOs and the like. The rest of the line was stored exactly as it had been typed, in ASCII with a trailing CR. Microsoft-like BASICs went slightly further: they likewise converted the line number to a 16-bit value, and then converted the first word on the line to an 8-bit value, the token. This made it easier to look up the runtime code associated with the instruction, because the interpreter did not have to convert it from text form. The rest of the line, as in Tiny BASIC, was left as the original text.
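The two storage schemes described above can be compared with a rough Python sketch. Everything concrete here is an assumption made for illustration: the function names, the little-endian byte order of the line number, the token values in KEYWORD_TOKENS, and the use of a CR terminator in both functions (kept identical only so the outputs are easy to compare). Real interpreters differed in all of these details.

```python
import struct

# Hypothetical keyword table: actual token values vary between BASIC dialects.
KEYWORD_TOKENS = {"PRINT": 0x99, "GOTO": 0x89}

def store_tiny(line: str) -> bytes:
    """Tiny BASIC style: binary line number, then the text as typed, then CR."""
    number, _, rest = line.partition(" ")
    # Assumption: 16-bit little-endian line number.
    return struct.pack("<H", int(number)) + rest.encode("ascii") + b"\r"

def store_ms_like(line: str) -> bytes:
    """Microsoft-like style as described in the text: the first word becomes a token."""
    number, _, rest = line.partition(" ")
    word, sep, tail = rest.partition(" ")
    token = KEYWORD_TOKENS.get(word.upper())
    if token is None:
        # Unknown first word: leave the line as text.
        body = rest.encode("ascii")
    else:
        # Replace only the first word; the remainder stays as typed.
        body = bytes([token]) + (sep + tail).encode("ascii")
    return struct.pack("<H", int(number)) + body + b"\r"
```

For example, store_tiny('10 PRINT "HI"') keeps the word PRINT as five ASCII bytes, while store_ms_like replaces it with a single byte and leaves the quoted text unchanged.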

In contrast, Atari BASIC tokenized every item in the line to an internal format, thereby eliminating any runtime parsing. For instance, any numbers in the code were converted to their six-byte floating-point format and put into memory in that form, with a lead token to indicate a numeric constant, eliminating the need to do any conversion at runtime. So while one can read the tokenized form of Tiny BASIC or Microsoft BASIC with some ability to understand what is going on, Atari BASIC files are almost entirely binary. The only things that can be read as-is are string literals. String constants are marked by the byte 0F (hex), followed by a byte giving the string length (0-255), then the characters of the string itself. Numeric constants are marked by 0E (hex), followed by six bytes holding a floating-point value.
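As an illustration of the two constant encodings, here is a minimal Python sketch. The marker bytes 0E and 0F and the field sizes come from the description above. The internal layout of the six-byte value is not spelled out in this article, so the decoder assumes the commonly documented Atari layout: a first byte holding the sign in bit 7 and an excess-64 power-of-100 exponent in the low seven bits, followed by five bytes of packed BCD digits. The function names are hypothetical.

```python
from decimal import Decimal

NUMERIC_CONSTANT = 0x0E  # marker followed by six bytes
STRING_CONSTANT = 0x0F   # marker followed by a length byte and the characters

def decode_atari_float(raw: bytes) -> Decimal:
    """Decode six bytes, assuming sign/exponent byte plus five packed-BCD bytes."""
    if len(raw) != 6:
        raise ValueError("expected exactly six bytes")
    sign = 1 if raw[0] & 0x80 else 0
    power_of_100 = (raw[0] & 0x7F) - 64
    digits = []
    for byte in raw[1:]:
        digits.extend((byte >> 4, byte & 0x0F))  # two BCD digits per byte
    # The first BCD byte is the integer part, so the remaining eight
    # digits sit after the decimal point.
    return Decimal((sign, tuple(digits), 2 * power_of_100 - 8))

def read_constant(buf: bytes, pos: int):
    """Return (value, next_position) for the constant whose marker is at pos."""
    marker = buf[pos]
    if marker == NUMERIC_CONSTANT:
        return decode_atari_float(buf[pos + 1:pos + 7]), pos + 7
    if marker == STRING_CONSTANT:
        length = buf[pos + 1]
        start = pos + 2
        # Characters are returned as raw bytes (ATASCII), not decoded.
        return buf[start:start + length], start + length
    raise ValueError(f"not a constant marker: {marker:#04x}")
```

Returning a Decimal rather than a float keeps the decimal digits exact, which seems a natural fit for a BCD-based format.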

Additionally, Atari BASIC split its tokens into two groups. Statements were the first item on any line (or of each sub-statement following a colon) and had their own library of 256 tokens in the Statement Name Tokens, while other items on the line were taken from another 256-entry table for functions, operators, and variables. The variables in the program were itemized in a variable table stored at the beginning of the program, so that a reference to a variable in the program used only a single-byte token representing the name. There were 128 positions in the token list for variables (comprising the values with the high bit set), meaning that only 128 different variables could be used in a program.
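The split of the non-statement token space can be sketched as follows. The sketch relies only on what is stated above: values with the high bit set refer to variables, and 0E and 0F introduce numeric and string constants. The function name and the returned labels are hypothetical, and the remaining low values are lumped together because their individual meanings are not listed here.

```python
def classify_operand(token: int):
    """Classify a one-byte token that follows the statement token."""
    if not 0 <= token <= 0xFF:
        raise ValueError("token must fit in one byte")
    if token & 0x80:
        # High bit set: the low seven bits index the variable table (0-127).
        return ("variable", token & 0x7F)
    if token == 0x0E:
        return ("numeric constant", None)
    if token == 0x0F:
        return ("string constant", None)
    # Everything else in the low half is an operator or function token.
    return ("operator or function", token)
```

A token of 0x80 would therefore refer to the first entry in the variable table and 0xFF to the 128th, which is where the 128-variable limit comes from.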

Literal characters are in ATASCII, Atari's not-quite-ASCII character set.

Software

 * Atari Memopad: detokenizes BASIC programs among other functions
 * gfalist: detokenizes GFA-BASIC programs from Atari

Sample files

 * https://telparia.com/fileFormatSamples/text/gfaBASICAtari/