BWTC32Key

From Just Solve the File Format Problem

Revision as of 14:01, 1 March 2020

File Format
Name BWTC32Key
Ontology
Extension(s) .b3k
Compression Always Lossless
Extended From bzip
Magic Bytes 0xFEFF4D00, "bwtc", "nomo", "dfsm", "fenw", 0x4D01
Spec https://github.com/sentogiga/bwtc32key
Spec Availability Free
Reference Implementation https://sentogiga.github.io/bwtc32key
Endianness Big-Endian
TPM Encryption
Error Resilience Yes
Patent License Unencumbered
Developed By stgiga
Maintained By stgiga
Released 2019

BWTC32Key is a single-file compression tool and format with optional encryption and text-armored output. It also introduced the ".B3K" file extension.

Discussion

The code is built on specific JavaScript implementations of Base32768, AES, SHA-256, and a spiritual successor to the original bzip format. It runs as pure JavaScript with no dependencies and is housed in the HTML frontend as a single monolithic program.

The output of the encoder is a text string. B3K files are always UTF-16 big-endian text documents, with a Byte Order Mark, that contain this string. The string is a variant of Base32768 that uses Hangul syllable blocks and Han ideographs, which gives good font support while keeping the size in bytes down. As a result, the output superficially resembles a Korean message written in an unusual style. The file starts with the header 0xFEFF4D00 (the BOM followed by U+4D00) and ends with the trailer 0x4D01. B3K files can be concatenated, but the original program currently cannot split them again; to recover one member, extract the portion you need with a text editor.
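The framing described above can be sketched in a few lines of Python. This is only an illustration, not the reference implementation: the function names are hypothetical, the payload is a placeholder string rather than real Base32768 output, and the splitter assumes that U+4D00 and U+4D01 never occur inside a payload.

```python
import re

BOM = "\ufeff"      # encodes to FE FF in UTF-16BE
HEADER = "\u4d00"   # encodes to 4D 00
TRAILER = "\u4d01"  # encodes to 4D 01


def frame(payload: str) -> bytes:
    """Wrap an already-encoded payload string in the B3K framing (hypothetical helper)."""
    return (BOM + HEADER + payload + TRAILER).encode("utf-16-be")


def unframe(data: bytes) -> str:
    """Check header and trailer, then return the payload string."""
    if len(data) % 2 != 0:
        raise ValueError("UTF-16 data must have an even number of bytes")
    if not data.startswith(b"\xfe\xff\x4d\x00"):
        raise ValueError("missing 0xFEFF4D00 header")
    if not data.endswith(b"\x4d\x01"):
        raise ValueError("missing 0x4D01 trailer")
    return data[4:-2].decode("utf-16-be")


def split_concatenated(data: bytes) -> list[str]:
    """Recover each payload from byte-wise concatenated B3K files.

    Assumption: U+4D00 and U+4D01 appear only as header and trailer.
    """
    text = data.decode("utf-16-be")
    return re.findall("\u4d00([^\u4d00\u4d01]*)\u4d01", text)


one = frame("가나다")   # placeholder payload, not real encoder output
two = frame("라마")
print(unframe(one))
print(split_concatenated(one + two))
```

The splitter does in code what the text editor workaround does by hand: it looks for each header and trailer pair and keeps what lies between them.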

All of the code is stream and chunk compatible, including the AES256 implementation, which uses counter (CTR) mode. The password field accepts only 8-bit ASCII to minimize character set headaches, but because there is no password length limit, UTF-7 or Punycode can be used to express non-Latin passwords. Encryption is also optional: leaving the password field blank or at its default leaves the encryption feature unused, so the format can serve in settings where encryption would not be useful, such as an image compression format.
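As a rough illustration of the UTF-7 and Punycode workaround, the following Python sketch turns a non-Latin password into an ASCII-only string using the codecs in Python's standard library. The helper name is hypothetical, and the sketch says nothing about how BWTC32Key itself turns a password into an AES key.

```python
def to_ascii_password(password: str, scheme: str = "punycode") -> str:
    """Return an ASCII-only form of the password (hypothetical helper).

    ASCII passwords pass through unchanged; anything else is encoded
    with Python's built-in "punycode" or "utf-7" codec.
    """
    if password.isascii():
        return password
    if scheme not in ("punycode", "utf-7"):
        raise ValueError("scheme must be 'punycode' or 'utf-7'")
    return password.encode(scheme).decode("ascii")


ascii_form = to_ascii_password("비밀번호")
print(ascii_form)

# The same transformation has to be applied again when decoding,
# otherwise the two sides end up with different password strings.
assert ascii_form.encode("ascii").decode("punycode") == "비밀번호"
```

Whichever scheme is chosen, encoder and decoder need to agree on it, since the format itself only ever sees the ASCII string.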

The reference implementation is written in pure JavaScript and is entirely FOSS. The author began it at age 15 and had finished it by the time they turned 17, which does show in the code. The program's combination of compression and encryption coincidentally harks back to PackIt from the Classic Mac OS days, which offered similar sequential concatenation and compression of multiple files and forks into an archive, as well as encryption, all far more primitive and inefficient than BWTC32Key.

The final Base32768 step is essentially the antithesis of the original BinHex. Instead of doubling the size of the binary input by encoding it in base 16, Base32768 yields output that is only 16/15 the size of the compressed data as it stood before the inherently padding-free AES and Base32768 steps. This assumes the output text file is encoded as UTF-16BE with a BOM; that file uses the ".B3K" extension rather than the .txt extension normally used for plain text documents.

The pipeline carries several US-ASCII magic numbers. The Base32768-decoded, decrypted BWTC archive contains "bwtc"; the NoModel step uses "nomo"; the DefSum step uses "dfsm"; and the Fenwick Tree-based range coder step uses "fenw". All 8-bit and 7-bit US-ASCII variants of these magic numbers are allowed. Each must be present at its respective step of decompression, which runs last, after Base32768 decoding and AES256-CTR decryption have completed. Every stage must succeed, with everything in the right place and sequence, for the program to work. The magic numbers exist to make the program safer against data corruption and other errors that may occur at any point during execution.

Because the BWTC compressor is very simple compared to even the original bzip, and because the 256-bit AES variant runs in counter mode, which needs no padding at all, the BWTC32Key format is very slim.
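The 16/15 figure follows from each 16-bit UTF-16 code unit carrying 15 bits of data. The Python sketch below compares that with base 16 output stored as one byte per hex digit. It is an estimate only: the function names are hypothetical, and the exact handling of a trailing partial group in BWTC32Key's Base32768 variant is not modelled.

```python
def hex_size(n_bytes: int) -> int:
    """Size of base 16 output stored as one byte per hex digit."""
    return 2 * n_bytes


def b3k_size_estimate(n_bytes: int) -> int:
    """Estimated B3K file size for n_bytes of compressed (and encrypted) data.

    Assumes 15 data bits per 16-bit code unit, plus three extra code
    units for the BOM, the U+4D00 header and the U+4D01 trailer.
    """
    code_units = -(-n_bytes * 8 // 15)  # ceiling division
    framing_units = 3
    return 2 * (code_units + framing_units)


n = 15000
print(hex_size(n))           # 30000
print(b3k_size_estimate(n))  # 16006
```

For 15000 input bytes the payload alone is 16000 bytes, which is exactly 16/15 of the input; the remaining 6 bytes are the fixed framing.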

As a text-based format that closely resembles human text, it can be used in most places where text is exclusively required. Because it is similar in spirit to authentic human scripts, it can be embedded in written works as if it were a Korean passage, with no traces that a human reader could readily distinguish. Because it is stream compatible, it can also be broadcast through live channels as a data stream that receivers opt into.

Another feature is that the decoder never produces output from corrupt input. It fails if the magic number of the BWTC archive ("bwtc") is not present in the compressed data, whether because of corruption or the wrong key, and it fails if the Base32768 text has junk inserted or is not properly formed or decodable. It also fails if the corruption affects the Base32768 text data itself or its UTF-16BE (with BOM) encoding. This ensures that corrupted files are not created by the decoder, which can help prevent damage to your system if something like a firmware blob or an executable was affected.

The format does not care about file information of any kind, which is why it can be used as a chunk or stream format where file info is not needed. This is only possible because encryption is optional, so data that would be pointless and unwise to encrypt (such as uncompressed image data, typeface data, soundbank data, and open streams) can still be used with the format. As mentioned before, live input streams and real-time data input are allowed. In other implementations, with the right logic, multiple files could be encoded and then concatenated, optionally with a different key per file.
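The fail-fast behaviour can be pictured as a check that runs before a stage does any work. The Python sketch below is not the reference implementation: the names are hypothetical, and it assumes that each magic number sits at the very start of its stage's input, which the description above does not actually specify.

```python
# Magic numbers named in the article, in the order the stages are listed.
STAGE_MAGICS = (b"bwtc", b"nomo", b"dfsm", b"fenw")


class CorruptInputError(ValueError):
    """Raised when a stage does not find the magic number it expects."""


def strip_magic(stage_input: bytes, magic: bytes) -> bytes:
    """Return the data after the magic number, or fail without output.

    Assumption: the magic number is a prefix of the stage input.
    """
    if not stage_input.startswith(magic):
        raise CorruptInputError(f"expected magic {magic!r}")
    return stage_input[len(magic):]


# A wrong key or corrupted text would leave garbage here instead of "bwtc".
try:
    strip_magic(b"\x93\x1f\x00\x42 garbage", STAGE_MAGICS[0])
except CorruptInputError as err:
    print("refused:", err)
```

Checking before decoding means a bad key or damaged text ends in an error instead of a silently wrong output file.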

The format is well suited to tarred input because tar is a text-like, block- and stream-based format that contains many null bytes, all of which makes it very similar to the BWTC32Key format itself. Tarring the input before encoding is therefore a good idea.
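One way to prepare tarred input is to build the archive in memory and hand the resulting bytes to an encoder. The Python sketch below uses only the standard `tarfile` module; the helper name and file names are made up for illustration, and the BWTC32Key encoder itself is not shown.

```python
import io
import tarfile


def tar_in_memory(files: dict[str, bytes]) -> bytes:
    """Pack name -> content pairs into an uncompressed tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


archive = tar_in_memory({"a.txt": b"hello", "b.txt": b"world"})

# tar pads headers and data to 512-byte blocks, so small archives
# consist largely of null bytes.
null_fraction = archive.count(0) / len(archive)
print(len(archive) % 512, round(null_fraction, 2))
```

The long runs of null padding are the property the paragraph above refers to.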

The Korean-looking output does not interfere with actual Korean, because in a message running alongside the data, Zhuyin and Hangul Jamo can replace the Han ideographs (Hanja) and Korean syllable blocks (Hangul) respectively. Korean communication therefore remains possible alongside the data, so the format can be used in textual messages as if it were native text without locking anyone's language out. This means it can be placed seamlessly in something like a written work in txt format. In how the headers and text encoding work, this is superficially similar to using PGP on part of a written work. BWTC32Key is its own animal entirely though, and ports may need to emulate its details accurately to avoid errors.

Links
