GW-BASIC tokenized file

GW-BASIC tokenized files store programs written in the dialect of the BASIC programming language used on IBM PC compatibles, in the days when an interpreted BASIC was routinely included on personal computers as shipped from the factory. Originally the IBM PC had versions of BASIC called BASIC and BASICA, the latter being an "advanced" BASIC with a few more features. Part of it was in ROM, and part was loaded from disk. Other manufacturers' PC compatibles (or "clones") didn't have the ROM BASIC, but used a BASIC from Microsoft which was compatible with it. This went by a few manufacturer-specific names but was generically known as GW-BASIC. Claims vary about what the GW stands for: either the initials of a Microsoft employee (Greg Whitten) involved in adapting it from Bill Gates' original CP/M BASIC, or possibly "Gee Whiz".

Like most BASICs of its era, BASIC/BASICA/GW-BASIC used a tokenized format to save its programs, rather than plain-text source code. Printable ASCII characters (space through tilde) generally stood for themselves (except when part of a multi-byte sequence), but other bytes had different meanings. The "high-bit" bytes from #128-#255 stood for the various BASIC commands (some as single bytes, others as part of two-byte sequences), while some of the control characters had special meanings, including signifying the start of a binary-encoded sequence encapsulating a numeric constant. A null (#0) byte marked the end of a program line. Each line began with four header bytes: two bytes containing the offset of the next line as a little-endian integer, followed by two bytes containing the line number as a little-endian integer.
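As an illustration of the four-byte line header described above, the following is a minimal Python sketch that decodes the next-line offset and the line number at a given position. The function name parse_line_header and the sample bytes are made up for this example and do not come from any real file. The sketch deliberately does not search for the terminating null byte, since binary-encoded numeric constants inside a line may themselves contain zero bytes.

```python
import struct

def parse_line_header(buf, pos=0):
    """Decode the 4-byte header of a tokenized program line.

    Returns (next_line_offset, line_number, body_start), where body_start
    is the index of the first byte of the tokenized line content.
    Hypothetical helper for illustration only.
    """
    # "<HH" = two unsigned 16-bit little-endian integers
    next_offset, line_number = struct.unpack_from("<HH", buf, pos)
    return next_offset, line_number, pos + 4

# Made-up sample: next-line offset 0x120B, line number 10, one body byte, null
sample = bytes([0x0B, 0x12, 0x0A, 0x00]) + b"X" + b"\x00"
print(parse_line_header(sample))  # (4619, 10, 4)
```

The stored next-line offset is returned as-is; the sketch makes no assumption about how it relates to positions within the file.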

Files saved to disk are preceded by a single byte indicating whether the program is protected: 0FEh if protected, 0FFh if not. Files saved to cassette tape omit this byte, because byte 9 of the cassette header holds the protection status. The file is terminated with Ctrl-Z (1Ah).
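The disk-file framing described above can be sketched in Python as follows. The function name split_disk_file is hypothetical, and the sketch only reads the protection flag and strips a trailing Ctrl-Z if one is present; it does not attempt to interpret the body, and it does not handle cassette images, which lack the leading byte.

```python
def split_disk_file(data):
    """Split a disk-saved tokenized file into (protected, body).

    Hypothetical helper: checks the leading flag byte (0xFE = protected,
    0xFF = not protected) and removes one trailing Ctrl-Z (0x1A) if present.
    """
    if not data:
        raise ValueError("empty file")
    flag = data[0]
    if flag == 0xFF:
        protected = False
    elif flag == 0xFE:
        protected = True
    else:
        raise ValueError("unexpected first byte: 0x%02X" % flag)
    body = data[1:]
    if body.endswith(b"\x1a"):
        body = body[:-1]  # drop the end-of-file marker
    return protected, body
```

A first byte other than 0FEh or 0FFh is treated as an error here, which is a design choice of the sketch rather than documented interpreter behavior.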

Tokens
Blank entries in the token lists are unused, or at least unknown.

As noted, some of the tokens are preceded or followed by other bytes representing other symbols, which are suppressed when the program is listed (so they are "invisible"). These are presumably there to make parsing by the interpreter easier.

List 2: 2nd-byte tokens following FD
These are preceded by an FD (hex) byte.

List 3: 2nd-byte tokens following FE
These are preceded by an FE (hex) byte.

List 4: 2nd-byte tokens following FF
These are preceded by an FF (hex) byte.
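The split between single-byte tokens and the FD, FE and FF prefixed two-byte tokens could be handled along these lines. This is only a sketch: read_token and PREFIXES are hypothetical names, no keyword names are mapped, and treating every other high-bit byte as a single-byte token is a simplifying assumption, since some codes are unused or unknown.

```python
# Prefix bytes that introduce a two-byte token (Lists 2 to 4)
PREFIXES = (0xFD, 0xFE, 0xFF)

def read_token(buf, pos):
    """Read one keyword token starting at buf[pos].

    Returns ((prefix, code), next_pos); prefix is None for a single-byte
    token. Hypothetical helper for illustration only.
    """
    first = buf[pos]
    if first in PREFIXES:
        if pos + 1 >= len(buf):
            raise ValueError("truncated two-byte token")
        return (first, buf[pos + 1]), pos + 2
    if first >= 0x80:
        # Assumption: any other high-bit byte is a single-byte token
        return (None, first), pos + 1
    raise ValueError("byte 0x%02X is not a keyword token" % first)
```

A decoder would then look up the returned (prefix, code) pair in the appropriate token list to find the keyword text.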

Format documentation

 * GW-BASIC tokenised program format

Sample files

 * Many files on Between Heaven and Hell Version II CD, such as /200/104/*.bas

Software

 * Bascat: decode GW-BASIC tokenized files

Other links and references

 * Wikipedia article: GW-BASIC
 * Source code to GW-BASIC from 1983