Talk:EBZip

From Just Solve the File Format Problem

ebzip/ebuzip caveats

Intended as a place to note down issues, tips, and tricks for eb-4.4.3 and/or ebu-4.5-20220808.

Hardcoded to look specifically for a file that begins with "CATALOGS" (case-insensitive) or "CATALOGS.ebz"

Custom dictionaries don't always follow the standard convention for storing their contents, such as having "START.ebz" at the top level. Even when they do, they may also compress "CATALOGS" as "CATALOGS.ebz". eb-4.4.3 specifically looks for either "CATALOGS" or "CATALOGS.ebz" and, per the user's request, decompresses or compresses the files listed within. For example, ebzip -u decompresses and renames any file that is listed in "CATALOGS" or "CATALOGS.ebz" and carries the ".ebz" extension, ignoring all other files in the same path; it will not decompress "CATALOGS.ebz" itself even if instructed to. There are ways to work around that:

Decompress "CATALOGS.ebz"

In my case,

  • Rename the decompressed "HONMON" to something like "HONMON.old",
  • Copy "CATALOGS.ebz" there, naming the copy "HONMON.ebz",
  • Run the usual ebzip -u from the top level, which contains "CATALOGS.ebz" but no "HONMON.ebz",
  • Once decompressed, move the newly decompressed "HONMON" (not "HONMON.old") back to where "CATALOGS.ebz" was and rename it "CATALOGS",
  • Rename "HONMON.old" back to "HONMON".

Optionally, delete "CATALOGS.ebz", as it is now redundant.
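The steps above can be sketched as a POSIX shell function. Everything here is illustrative: the subbook layout ("$1/DATA") varies per book, the function name is hypothetical, and ebzip (eb-4.4.3) is assumed to be on PATH:

```shell
# Sketch of the CATALOGS.ebz workaround. Call from the book's top level,
# which holds CATALOGS.ebz; $1 is the subbook directory (layout assumed).
decompress_catalogs() (
  set -e
  subdir=$1
  mv "$subdir/DATA/HONMON" "$subdir/DATA/HONMON.old"  # park the real HONMON
  cp CATALOGS.ebz "$subdir/DATA/HONMON.ebz"           # disguise the catalog
  ebzip -u                                            # decompresses "HONMON.ebz"
  mv "$subdir/DATA/HONMON" CATALOGS                   # recover the catalog
  mv "$subdir/DATA/HONMON.old" "$subdir/DATA/HONMON"  # restore the real HONMON
  rm -f CATALOGS.ebz                                  # now redundant
)
```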

Decompress other .ebz that were not listed

Same as above, but instead of copying, you can simply move the file, because:

  • ebzip -u will only look for "CATALOGS" or "CATALOGS.ebz", and
  • It saves having to manually delete the redundant file(s) afterwards.

Maliciously/Erroneously renamed files

ebuzip specifically looks at the .ebz extension to determine a file's compression. For example:

  • If "CATALOGS.ebz" was compressed but renamed to "CATALOGS", running ebuzip -i shows the error message "ebuzip: failed to read a catalog file".
  • If "HONMON.ebz" was compressed but renamed to "HONMON", running ebuzip -i reports the file as not compressed.

This bug likely also affects the original ebzip.

Zero-length (un)compressed files not overwritten without manual intervention

When a path contains zero-length files that are listed in "CATALOGS" or "CATALOGS.ebz", ebzip -u, for instance, may not overwrite them; this can happen even with ebzip -uf. To fix this when decompressing, delete the zero-length files, after making sure each has a compressed duplicate with the .ebz extension, then re-run the command. When compressing, the opposite applies: look for zero-length duplicates with the .ebz extension, delete them, and re-run the utility to compress.

On modern Linux environments, this can be done as follows:

Enumerating

 find . -size 0 -type f

Deleting

 find . -size 0 -type f -exec sudo rm -iv "{}" \;

Care must be taken to avoid any potential losses when running this command.
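The decompression-side cleanup described above can be narrowed so that only safe files are touched; a sketch in POSIX shell (the helper name is hypothetical), which deletes a zero-length file only when a non-empty .ebz twin exists, so ebzip -u can regenerate it:

```shell
# Sketch: delete zero-length decompressed files only when a non-empty
# compressed twin (same name plus .ebz) exists in the same directory.
# Run from the book's top level; review the output before trusting it.
purge_zero_length() {
  find . -type f -size 0 ! -name '*.ebz' | while read -r f; do
    if [ -s "$f.ebz" ]; then   # a non-empty .ebz twin exists
      rm -v "$f"               # safe: ebzip -u will recreate it
    fi
  done
}
```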

Anonymoususer852 (talk) 15:02, 5 September 2025 (UTC)

LZMA benchmarks against other compression utilities

In response to the ebu maintainer's blog entry,[1] I decided to run a couple of benchmarks with different dictionaries, books, and other reference materials on my end. The packages used and their versions are as follows: kernel-6.15.8, zlib-1.3.1, libdeflate-1.24, xz-5.4.2, dwarfs-0.12.4, squashfs-tools-4.7.1.

zlib versus libdeflate versus LZMA benchmark (single electronic book)

For the following, file sizes are as reported by ls -l, so sizes are given in blocks[2]. The dictionaries tested with no compression, zlib, libdeflate, or XZ are tarballed for consistency. For the compressed file systems, XZ's LZMA2 is used. Both zlib and libdeflate have a maximum compression level of 5 (anything beyond 5 results in an error), whereas XZ goes up to level 9 and offers further options such as --extreme along with various other tweaks; I will keep it simple and use xz -9evv.
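As an illustration of the methodology, each tarball row could be produced along these lines; a hedged sketch (the function name is mine), assuming GNU tar and xz are installed and "$1" is the dictionary directory being measured:

```shell
# Sketch of producing the tarball + xz numbers: a plain, uncompressed
# tarball for consistency, then single-threaded LZMA2 at the extreme
# preset. xz replaces "$1.tar" with "$1.tar.xz".
pack_xz() {
  tar -cf "$1.tar" "$1"
  xz -9evv -T1 "$1.tar"
}
```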

DWARFS example parameters:

 mkdwarfs --input . --output=../JMdict_eng_2021-02-26_UTC.dwarfs --block-size-bits=26 --compression=lzma:level=9:extreme --schema-compression=lzma:level=9:extreme --metadata-compression=lzma:level=9:extreme --no-history \
 --pack-metadata=all,force --file-hash=sha3-512 --no-history-timestamp --no-create-timestamp --no-history-command-line

SquashFS example parameters:

 mksquashfs . ../JMdict_eng_2021-02-26_UTC.squashfs -b 1048576 -comp xz -Xdict-size 100% -progress -info -no-xattrs

Percentages are rounded to two decimal places (otherwise rounded up). Compression ratios are given as a percentage of the uncompressed size.

 Method                         kanjidic      NHKAccent         Eijiro  Kenkyusha eiwa  Kenkyusha waei
                                                                           daijiten v6     daijiten v5
 No compression               130,385,920  1,578,055,680  1,374,771,200     484,945,920     328,192,000
 ebuzip -z -l 5 (zlib)         42,035,200  1,393,920,000    379,975,680     117,596,160     109,721,600
   % of uncompressed               32.24%         88.33%         27.63%          24.25%          33.43%
 ebuzip -z -l 5 (libdeflate)   40,171,520  1,383,505,920    353,269,760     107,765,760     102,113,280
   % of uncompressed               30.81%         87.67%         25.70%          22.22%          31.11%
 xz -9evv -T1 (liblzma)        21,607,608  1,350,900,536    197,134,768      55,220,788      50,109,852
   % of uncompressed               16.57%         85.61%         14.34%          11.39%          15.27%
 xz -9evv -T0 (liblzma)        21,607,616  1,351,560,168    202,934,760      56,670,664      50,747,640
   % of uncompressed               16.57%         85.65%         14.76%          11.69%          15.46%
 mkdwarfs (liblzma)            22,145,746  1,353,699,364    212,307,176      60,043,728      54,541,980
   % of uncompressed               16.99%         85.78%         15.44%          12.38%          16.62%
 mksquashfs (liblzma)          31,793,152  1,367,302,144    272,068,608      81,805,312     160,956,416
   % of uncompressed               24.38%         86.64%         19.79%          16.87%          49.04%

XZ compression results are provided for the sake of discussion only; more on this in the conclusion. While XZ shows extremely good results against all other utilities, its seek times are abysmally poor, worse than DWARFS with LZMA2 on machines with weaker hardware. Regardless, the tests confirm that liblzma compresses better than both zlib and libdeflate.

NHKAccent is the exception here, with poor compression results. This is because the electronic book contains multimedia files, which do not compress well since they are already compressed.

It is worth noting that compression/archiving utilities built on liblzma can take advantage of multi-threading by default, at the expense of producing a potentially larger file. This, however, is a small trade-off for effectively making full use of the available CPUs in a given environment.

Modern compressed file system with multiple electronic books shootout

The author of ebu mentioned the smartphone use case; I personally find that smartphones are limited in performance as well, not just in storage capacity. SquashFS was, or still is, used in Android devices,[3] so it also makes sense to try it on non-system files.

For these tests, the sample data consists of ~189 dictionaries in a single directory. Both DWARFS and SquashFS are configured to use XZ's LZMA2 where possible, as it is known to be one of the best freely available compression methods (especially on Unix-like platforms). The DWARFS and SquashFS parameters are the same as in the prior section. Because of the inherent way ebuzip works, requiring "CATALOGS" to be in the top directory, both variants of ebuzip (zlib and libdeflate) are omitted from this test. Besides, running a single compression job per dictionary, which would exclude other files in that path (e.g. "START"), would make the overall process time-consuming, if it weren't already tedious enough to require extensive scripting.
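For what it's worth, the scripting alluded to could start as simple as this sketch, assuming each dictionary sits in its own subdirectory with its catalog at that subdirectory's top level, and that ebuzip is on PATH (the function name is hypothetical):

```shell
# Minimal sketch of a per-dictionary compression loop: every subdirectory
# of the current directory is assumed to be one book with CATALOGS at its
# top level; each gets its own ebuzip run.
compress_all() {
  for book in */; do
    ( cd "$book" && ebuzip -q -z -l 5 )  # one job per dictionary
  done
}
```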

 Method              Size (blocks)
 tar (uncompressed)  29,014,026,240
 DWARFS              10,613,886,184
 SquashFS            12,269,518,848

Where DWARFS shines is its ability to also save space by segmenting data, in addition to file de-duplication. Even with the small penalty I imposed on DWARFS by using sha3-512 as the hash, it still compresses the collection better, and faster, than SquashFS. Then again, the focus here is on overall file size, not file-creation performance.

Conclusion

Now, of course, there are plenty of other compression utilities available, such as, but not limited to, Zstandard, BZip3, and LZO. The various results I have read indicate that while these trade final size for better performance, they fall behind in final size. The benchmarks showcased here focus on getting the smallest possible compressed file with modern compression utilities and modern PC hardware, so speed-wise performance testing is out of scope, as are ideas like ZIP bombs or nested compressed files.

I also tried to run mkfs.erofs in order to include it in the modern compressed file system with multiple dictionaries shootout. With version 1.8.10, no matter which parameters I set, LZMA level=9 or level=109 (109 being the extreme variant), and even with a lowered dictsize=, the system would run out of memory (16 GB of RAM) before file system creation completed.

These benchmarks in no way discredit or disavow the use of libdeflate over zlib. If file size is a concern, more modern compression algorithms and/or compression techniques could be considered. Modern EPWING viewer programs generally don't care whether an electronic book is compressed with zlib or not compressed at all, and there are many ways of handling decompression transparently in modern Unix-like environments. Moreover, ever-growing hardware performance and the affordability of storage in modern computing call into question the use of what are widely considered legacy compression algorithms for data archival. Last but not least, while single-threaded compression (as seen with xz -9evv -T1 above) can be quite beneficial in reducing overhead, especially with LZMA2, it is also a slow process that may not translate well to modern, demanding contexts, where time-honored traditions are cast away in favor of practical, convenient, modernized, and easy-to-use solutions.

  1. EPWINGを圧縮する (libdeflateで) / Compressing EPWING via libdeflate - Kazuhiro's blog
  2. file block size - difference between stat and ls - Unix & Linux - Stack Exchange
  3. SquashFS explanation for Android system - Android Enthusiasts - Stack Exchange

Anonymoususer852 (talk) 18:46, 2 September 2025 (UTC)

The viability of ebuzip as a possible drop-in replacement candidate

As covered on the main page, EBUZip uses the more modern libdeflate, so I decided to test this, once again using the same JMDict as the subject of these experiments. On my modern Linux machine, I ran ebuzip -i on the path that contains JMdict as a subdirectory.

Using the sample JMDict tarball as is

 ==> /home/nobody/EPWING-research/JMDict-temp/a/JMdict_eng_2021-02-26_UTC/JMdict/DATA/HONMON.ebz <==
 130332672 -> 42407287 bytes (32.5%, ebzip level 5 compression)
 
 ==> /home/nobody/EPWING-research/JMDict-temp/a/JMdict_eng_2021-02-26_UTC/JMdict/GAIJI/GAI16H.ebz <==
 4096 -> 529 bytes (12.9%, ebzip level 5 compression)
 
 ==> /home/nobody/EPWING-research/JMDict-temp/a/JMdict_eng_2021-02-26_UTC/CATALOGS <==
 2048 bytes (not compressed)
 

This was likely compressed using ebzip with zlib. Running ebuzip -u shows the following:

 ==> uncompress /home/nobody/EPWING-research/JMDict-temp/a/JMdict_eng_2021-02-26_UTC/JMdict/DATA/HONMON.ebz <==
 output to ./JMdict/DATA/HONMON
 51.5% done (67108864 / 130332672 bytes)
 completed (130332672 / 130332672 bytes)
 ==> uncompress /home/nobody/EPWING-research/JMDict-temp/a/JMdict_eng_2021-02-26_UTC/JMdict/GAIJI/GAI16H.ebz <==
 output to ./JMdict/GAIJI/GAI16H
 completed (4096 / 4096 bytes)
 ==> copy /home/nobody/EPWING-research/JMDict-temp/a/JMdict_eng_2021-02-26_UTC/CATALOGS <==
 output to ./CATALOGS
 the input and output files are the same, skipped.
 

Then I ran ebuzip -q -z -l 5, which recompresses at the same level (5) via libdeflate, with output silenced. After a few minutes I was back at my prompt, so I ran ebuzip -i:

 ==> /home/nobody/EPWING-research/JMDict-temp/a/JMdict_eng_2021-02-26_UTC/JMdict/DATA/HONMON.ebz <==
 130332672 -> 40120400 bytes (30.8%, ebzip level 5 compression)
 
 ==> /home/nobody/EPWING-research/JMDict-temp/a/JMdict_eng_2021-02-26_UTC/JMdict/GAIJI/GAI16H.ebz <==
 4096 -> 513 bytes (12.5%, ebzip level 5 compression)
 
 ==> /home/nobody/EPWING-research/JMDict-temp/a/JMdict_eng_2021-02-26_UTC/CATALOGS <==
 2048 bytes (not compressed)
 

EBUZip built with zlib: decompressing the same sample compressed JMDict, recompressing it with zlib, then finally recompressing it with libdeflate

I then decided to experiment with JMDict using an ebuzip compiled against zlib, in a separate path: first decompress the original zlib level 5 JMDict, then recompress it again using zlib level 5, before finally (after decompressing once more) recompressing it using libdeflate. Running ebuzip -i shows:

 ==> /home/nobody/EPWING-research/JMDict-temp/b/JMdict_eng_2021-02-26_UTC/JMdict/DATA/HONMON.ebz <==
 130332672 -> 41985166 bytes (32.2%, ebzip level 5 compression)
 
 ==> /home/nobody/EPWING-research/JMDict-temp/b/JMdict_eng_2021-02-26_UTC/JMdict/GAIJI/GAI16H.ebz <==
 4096 -> 529 bytes (12.9%, ebzip level 5 compression)
 
 ==> /home/nobody/EPWING-research/JMDict-temp/b/JMdict_eng_2021-02-26_UTC/CATALOGS <==
 2048 bytes (not compressed)
 

As you can see, HONMON.ebz has a different overall size/compression ratio compared to the sample compressed JMDict. The zlib project is still being maintained, so it's not terribly surprising to see an ever-so-slight 0.3% improvement in compression ratio.

Again, decompressing it using the same ebuzip -u,

 ==> uncompress /home/nobody/EPWING-research/JMDict-temp/b/JMdict_eng_2021-02-26_UTC/JMdict/DATA/HONMON.ebz <==
 output to ./JMdict/DATA/HONMON
 51.5% done (67108864 / 130332672 bytes)
 completed (130332672 / 130332672 bytes)
 ==> uncompress /home/nobody/EPWING-research/JMDict-temp/b/JMdict_eng_2021-02-26_UTC/JMdict/GAIJI/GAI16H.ebz <==
 output to ./JMdict/GAIJI/GAI16H
 completed (4096 / 4096 bytes)
 ==> copy /home/nobody/EPWING-research/JMDict-temp/b/JMdict_eng_2021-02-26_UTC/CATALOGS <==
 output to ./CATALOGS
 the input and output files are the same, skipped.
 

Finally, running ebuzip -q -z -l 5 followed by ebuzip -i:

 ==> /home/nobody/EPWING-research/JMDict-temp/b/JMdict_eng_2021-02-26_UTC/JMdict/DATA/HONMON.ebz <==
 130332672 -> 40120400 bytes (30.8%, ebzip level 5 compression)
 
 ==> /home/nobody/EPWING-research/JMDict-temp/b/JMdict_eng_2021-02-26_UTC/JMdict/GAIJI/GAI16H.ebz <==
 4096 -> 513 bytes (12.5%, ebzip level 5 compression)
 
 ==> /home/nobody/EPWING-research/JMDict-temp/b/JMdict_eng_2021-02-26_UTC/CATALOGS <==
 2048 bytes (not compressed)
 

Observation notes

ebuzip does not indicate:

  • Whether compression was done by ebzip, and
  • Which library was used, i.e. whether the file was compressed with zlib or subsequently with libdeflate.

Therefore, no matter which of the two different ebuzip builds was used, the reported results look the same overall; only the compressed sizes differ.

As my aim with either ebzip or ebuzip is simply to decompress the files before recompressing them all with LZMA via the likes of DWARFS or SquashFS, I have not tested whether a JMDict compressed by EBUZip via libdeflate loads successfully in a general e-book viewer. But considering that decompressing and recompressing shows the file can still be read, even though I used ebuzip here instead of the original ebzip, I am going to assume it will load successfully.

Anonymoususer852 (talk) 15:55, 5 September 2025 (UTC)
