BWTC32Key

From Just Solve the File Format Problem
Revision as of 23:14, 28 November 2023 by St. GIGA

File Format
Name BWTC32Key
Ontology
Extension(s) .b3k
Compression Always Lossless
Extended From bzip
Magic Bytes 0xFEFF4D00, "bwtc", "nomo", "dfsm", "fenw", 0x4D01
Spec https://github.com/stgiga/bwtc32key
Spec Availability Free
Reference Implementation https://stgiga.github.io/bwtc32key
Endianness Big-Endian
TPM Encryption
Error Resilience Yes
Patent License Unencumbered
Developed By stgiga
Maintained By stgiga
Released 2019

BWTC32Key is a single-file compression tool and format with optional encryption and text-armored output. It also introduced the ".B3K" file extension, which the format was the first to use.

Discussion

The code is based upon specific JavaScript implementations of Base32768, AES, SHA-256, and a spiritual successor to the original bzip format. It runs as pure JavaScript with no dependencies and is housed in the HTML frontend as a single monolithic program.

The output of the encoder is a text string. B3K files are always UTF-16 big-endian text documents, bearing the Byte Order Mark, that contain that string. The string is a version of Base32768 that uses Hangul syllable blocks and Han ideographs, so that fonts support it while the size in bytes stays down. As a result, the string looks essentially like a Korean message, but in a different style. The file starts with a header of 0xFEFF4D00 and ends with a trailer of 0x4D01. Files can be concatenated, but to reverse that, one must use a text editor to extract the needed portion, because of the way the original program currently works.
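The framing described above can be checked at the byte level. The following is a minimal Python sketch, not the reference implementation (which is JavaScript). The names `frame_b3k` and `classify_b3k` are hypothetical, and the sketch assumes the header is the BOM followed by the character U+4D00 and that the trailer character U+4D01 directly follows the payload text. The UTF-8 signature is simply what results from transcoding the BOM and U+4D00 to UTF-8.

```python
# Byte-level framing of a .B3K file, per the header and trailer given above.
BOM = b"\xfe\xff"
HEADER = BOM + "\u4d00".encode("utf-16-be")       # FE FF 4D 00
TRAILER = "\u4d01".encode("utf-16-be")            # 4D 01
UTF8_SIGNATURE = "\ufeff\u4d00".encode("utf-8")   # EF BB BF E4 B4 80


def frame_b3k(payload_text: str) -> bytes:
    """Wrap payload text as UTF-16BE with BOM, header and trailer (assumed layout)."""
    return HEADER + payload_text.encode("utf-16-be") + TRAILER


def classify_b3k(data: bytes) -> str:
    """Classify raw file bytes by their first and last bytes only."""
    if data.startswith(UTF8_SIGNATURE):
        # The file was transcoded to UTF-8 at some point.
        return "utf-8 transcoded"
    if len(data) % 2 == 0 and data.startswith(HEADER) and data.endswith(TRAILER):
        return "ok"
    return "unrecognized"
```

A check like this only inspects the framing; it says nothing about whether the Base32768 text between header and trailer is decodable.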

All of the code is stream and chunk compatible, including the AES256 implementation, which uses counter (CTR) mode. As of 2023 the password field accepts UTF-8, and it has never had a length limit. Encryption can also be effectively skipped by leaving the password field blank or at its default, which lets the format be used where encryption would not be useful, such as in an image compression format.

The format was written in pure JavaScript and is entirely FOSS. The author started writing it at age 15 (possibly as early as about 14) and had definitely finished it by the time they turned 17. This does show in the code. The program's compression and encryption functionality coincidentally harks back to PackIt from the Classic Mac OS days, which featured similar sequential concatenation and compression of multiple files and forks into an archive, as well as encryption, all far more primitive and inefficient than BWTC32Key.

The final Base32768 step is essentially the antithesis of the original BinHex. Instead of doubling the size of the binary input by encoding it in base 16, Base32768 makes the AES256-CTR encrypted BWTC archive only 16/15 of its compressed size, assuming the output text file is encoded as UTF-16BE with BOM. Neither the AES step nor the Base32768 step adds padding. The output file uses the ".B3K" extension instead of the .txt extension normally used for plain text documents.

The Base32768-decoded, decrypted, compressed archive carries several US-ASCII magic numbers. The BWTC archive itself is marked "bwtc". The NoModel step is marked "nomo". It comes before a DefSum step marked "dfsm" (at a compression level of five or lower, which the code calls "Fast mode"), followed by the Fenwick tree-based range coder step marked "fenw" (at a compression level higher than 5). All 8-bit and 7-bit US-ASCII variants of these magic numbers are allowed. Each magic number must be present at its own step of decompression, which runs last, after the Base32768 decoding and the AES256-CTR decryption have completed. Everything has to be in the right place and order for the program to work, and the magic numbers exist to make the program safer from data corruption and other errors that may occur at any point during execution.

Because the BWTC compressor is very simple compared to even the original bzip, and because the 256-bit AES used runs in counter mode, which needs no padding at all, the BWTC32Key format is very slim.
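The 16/15 figure follows from packing 15 bits into each 16-bit UTF-16 code unit. A small Python sketch of the arithmetic is given here; the function names are hypothetical, the BOM, header and trailer are left out, and the way the real encoder handles the final partial character is assumed rather than taken from the specification.

```python
def base32768_utf16_bytes(n_bytes: int) -> int:
    """UTF-16 bytes needed to carry n_bytes of binary at 15 bits per code unit."""
    chars = (n_bytes * 8 + 14) // 15   # ceiling of (bits / 15)
    return chars * 2                   # each character is one 2-byte code unit


def hex_bytes(n_bytes: int) -> int:
    """Size of the same data in base 16 stored as one byte per hex digit."""
    return n_bytes * 2


compressed = 15_000
as_b3k = base32768_utf16_bytes(compressed)   # 16/15 of the input size
as_hex = hex_bytes(compressed)               # double the input size
```

If the same characters were stored as UTF-8 instead, most of them would take three bytes each, which is why UTF-8 transcoding erodes the savings.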

As a text-based format that closely resembles human text, it can be used in most places where text is exclusively required. Its resemblance to authentic human scripts also lets it be inserted into written works as if it were a Korean passage of actual writing, with no traces a human could readily distinguish. Because it is stream compatible, it can also be broadcast, sending data through live channels as a stream one could opt into.

Another feature is that it will not decode corrupt input: it fails rather than computing anything. Decoding fails if the magic number of the BWTC archive ("bwtc") or any of the other magic numbers is not present in the compressed data, whether because of corruption or the wrong key. It also fails if the Base32768 text has junk inserted, is malformed, or is otherwise undecodable, or if the UTF16BE (with BOM) encoding of that text is itself corrupted. This ensures that neither the decoder nor the system creates corrupted files, which can help prevent damage if something like a firmware blob or an executable was affected.

The format does not care about file information of any kind, which is why it can be used as a chunk or stream format where file info is not needed. This is only possible because encryption is optional, so data that would be pointless and unwise to encrypt (such as uncompressed image data, typeface data, soundbank data, and open streams) can be used with the format. As mentioned before, live input streams and real-time data input are allowed. In other implementations, multiple files could be encoded and then concatenated, optionally with different keys per file, given the right logic.
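The magic-number checks described above amount to refusing to continue as soon as an expected marker is missing. A hedged Python sketch of that idea follows. It assumes each magic number sits at the very start of the data handed to its step, which this article does not confirm, and the names `CorruptInputError` and `expect_magic` are hypothetical.

```python
# The four ASCII magic numbers named in this article.
MAGICS = (b"bwtc", b"nomo", b"dfsm", b"fenw")


class CorruptInputError(ValueError):
    """Raised instead of producing output from damaged or wrongly keyed data."""


def expect_magic(stage_data: bytes, magic: bytes) -> bytes:
    """Return the data after the magic number, or fail before doing any work.

    Assumption: the magic is the first four bytes of the stage's input.
    """
    if magic not in MAGICS:
        raise ValueError(f"unknown magic number: {magic!r}")
    if not stage_data.startswith(magic):
        raise CorruptInputError(f"missing magic number {magic!r}")
    return stage_data[len(magic):]
```

With a wrong key, the decrypted bytes are effectively random, so the first check for "bwtc" would almost always fail and no output file would be written.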

The format is well suited to tarred input, because tar is a text-like, block- and stream-based format with many null bytes, all of which makes it very similar to the BWTC32Key format itself. The two are a match made in heaven, so tarring the input is a good idea.
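Since the format stores only a single stream of bytes, several files can be bundled into one tar archive first. The sketch below uses Python's standard `tarfile` module to build such an archive in memory; the function name `bundle_as_tar` is hypothetical, and passing the result to a BWTC32Key encoder is left out because no Python encoder is described here.

```python
import io
import tarfile
from typing import Mapping


def bundle_as_tar(files: Mapping[str, bytes]) -> bytes:
    """Pack named byte strings into one uncompressed tar archive in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)          # tar headers must carry the size
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


tar_bytes = bundle_as_tar({"a.txt": b"hello", "b.bin": b"\x00\x01\x02"})
```

The resulting bytes are a multiple of 512 in length and padded with null bytes, which is the block structure the paragraph above refers to.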

The Korean-looking output does not interfere with actual Korean, because in a message running alongside the data, Zhuyin (also known as Bopomofo) can replace the Han ideographs (Hanja) and Hangul Jamo can replace the Korean syllable blocks (Hangul). Korean communication therefore remains possible alongside it, and the data can be used in textual messages as if it were native text without locking anyone's language out. This means it can be placed seamlessly in something like a written work in txt format. This is superficially similar to using PGP on part of a written work, in regard to how the headers and text encoding work. BWTC32Key is its own animal entirely, though, and ports may require accurate emulation of its details to ensure that nothing breaks. Coincidentally, the letter "K" in the ".B3K" file extension can be read as a reference to the format's Korean-like output, much as the format's intentional use of UTF16BE lets the "BE" of UTF16BE line up alphabetically with the "B3" of ".B3K". On systems with all-caps filenames, the ".B3K" extension has the added bonus of properly abbreviating the full name of the format under most rules for capitalizing acronyms.

Some things to note: In 2022, the program gained support for generating and reading its own files without the user having to paste the string manually. This also ensured that the files are written correctly. UTF-8 harms the compression savings, and a file starting with 0xEFBBBFE4B480 has been wrongly transcoded to UTF-8. The first bytes of a .B3K file should be 0xFEFF4D00, and the ending bytes should be 0x4D01. Reading .B3K files that do not have a header of 0xFEFF4D00 and a trailer of 0x4D01 is allowed, but an encoder should never generate anything else. That said, because big-endian Unicode goes against many processor architectures in modern use, this is not intended to be an immovable directive. UTF-16BE is simply the preferred encoding.

The "application/octet-stream" MIME type may fit for now. A dedicated one is planned, but it is not intended to be implemented in a way that breaks existing code. The place where "application/octet-stream" shows up in the reference implementation is the "MIME type" box, which is actually the MIME type of a decoded file, not an encoded one. Since the format stores only the data of a file, the user specifies the file name, extension, and MIME type to "download" when decoding. If one wants multiple files, attributes, names, and dates, tarring the input beforehand is advised. In short, the developer never gave the program a MIME type.

When a .B3K is downloaded with the button in the reference implementation, the code saves whatever is in the output string box as a UTF-16BE text document and assigns it a MIME type of "text/plain". Samsung Internet on Android names the downloaded file ".B3K.txt", but nothing else does. Because anything put into the box gets saved, a text editor is not needed to encrypt text: put the text in the box, change the extension in the "B3K filename" box to .txt or the extension of another text-based format, and download. Then upload that file with the desired key. The decoder does not care what is uploaded into the box, and it reads whatever it is fed as UTF-16BE. So, after decryption, the output file can be uploaded into the .B3K uploader, and the text will appear. The program can therefore operate as a textual cipher. Forcing the extension to .B3K in the Samsung case would actually reduce functionality, albeit unintended functionality, of which the program has a lot. One of the best examples is password generation and storage, alongside many other cryptographic and additional uses.

Also in 2022, the compression level became user-settable rather than hard-coded at 9. It can be set higher than 9, which can yield extra savings if used cleverly, though one should test that the output successfully decompresses at the chosen overdriven level. It can also be set below one, even lower than 0.5, which helps when speed is the priority over anything else. The compression level in this and in the bzip-family archivers is multiplied by 100,000 to get the block size, so non-integral levels work if the program allows them, and the developer of BWTC32Key sees no need to restrict the level to integers between 1 and 9. That year, a fix was also found for the reference implementation's problems on newer macOS versions, which were due to their stripping of the Byte Order Mark. The program was made to behave correctly in that situation.
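The level-to-block-size relationship described above is a single multiplication. A minimal Python sketch is given here; the function name `block_size` is hypothetical, and both the rounding to a whole number of bytes and the rejection of non-positive levels are assumptions rather than documented behavior of the reference implementation.

```python
def block_size(level: float) -> int:
    """Block size in bytes for a compression level, at 100,000 bytes per level."""
    if level <= 0:
        # Assumption: a zero or negative block size is not meaningful.
        raise ValueError("compression level must be positive")
    return round(level * 100_000)


default_block = block_size(9)      # the formerly hard-coded level
fast_block = block_size(0.5)       # a fractional level below one
overdriven_block = block_size(12)  # a level above the usual 1 to 9 range
```

Larger blocks give the compressor more context at the cost of time and memory, which is why very low levels favor speed.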

Additionally, in 2023, fixes were made that do not break compatibility, and a side benefit of the fix was Unicode support in the password box. Thus, in the years since the 2019 version, various limitations of the program were lifted through further development.

Links
