ZIP

From Just Solve the File Format Problem
(Difference between revisions)
(Specifications)
 
(31 intermediate revisions by 7 users not shown)
Line 1: Line 1:
:''Not to be confused with [[Zip disk]], an unrelated disk cartridge unit.''
 
 
{{FormatInfo
 
{{FormatInfo
 
|formattype=electronic
 
|formattype=electronic
Line 12: Line 11:
 
|released=1989
 
|released=1989
 
}}
 
}}
'''ZIP''' is one of the most popular file compression formats. It was created in 1989 as the native format of the PKZIP program, which was introduced by Phil Katz in the wake of a lawsuit (which he lost) against him by the makers of the then-popular [[ARC (compression format)|ARC]] program (and file format) for copyright and trademark infringement in an earlier program PKARC which had been file-compatible with ARC.  This resulted in Katz creating a new file format, which rapidly overtook ARC in popularity (to a large extent because of BBS sysops, then the primary users of such compression, resenting the lawsuit). Many programs have been released for a variety of operating systems to compress and decompress ZIP files, and native support for the format is built into several popular operating systems.
+
'''ZIP''' is one of the most popular file compression/archiving formats.
  
ZIP implementations vary in their support for features in the specification from PKWARE<ref>http://www.pkware.com/documents/casestudies/APPNOTE.TXT</ref>, particularly features added since version 2 (1993), some of which are protected by patents and require licensing.  Many implementations limit the use of compression to the [[DEFLATE]] algorithm, introduced with version 2. Extensions incorporated into the specification that have been widely adopted are: long filenames; large files (using a technique known as ZIP64); and filenames in [[UTF-8]].  In 2011 work began on an interoperable subset of the latest APPNOTE.TXT with the intention of publication as ISO/IEC 21320-1, Document Container File -- Part 1: Core.  As of November 2012, a discussion draft is available<ref>http://kikaku.itscj.ipsj.or.jp/sc34/open/1855.pdf</ref>.  Designed to promote interoperable implementations, the draft ISO/IEC 21320-1 prohibits compression other than using [[DEFLATE]], segmentation or multiple volumes, and features that are subject to patents.
+
== Disambiguation ==
 +
There are many other things, related or not, whose names use the word "Zip". Only a few are listed here.
  
While .zip is the usual file extension, ZIP-formatted files can be found with many other extensions since a number of other file formats use ZIP compression but store their files in application-specific extensions. See [[:Category:ZIP based file formats]] for a list of such formats.
+
* [[Zip disk]], an unrelated disk cartridge unit
 +
* Z-language Interpreter Program (ZIP) - See [[Z-code]].
 +
* Zip-Archiv - See [[ZAR (Zip-Archiv)]] (an unrelated format).
 +
* "ZIP compression" can sometimes refer to [[DEFLATE]], or something based on it ([[zlib]], [[Gzip]]).
  
== See also ==
+
See also the [[#See also|"See also"]] section, elsewhere on this page.
* [[PKLITE]]
+
* [[Self-extracting ZIP]]
+
* [[Zipx]]
+
  
== Disambiguation ==
+
== Discussion ==
The term "ZIP compression" is sometimes misleadingly used to mean [[DEFLATE]] (which is by far the most common compression scheme used in ZIP files). In such cases, the compressed data format could turn out to be raw [[DEFLATE]], or [[zlib]], or [[gzip]].
+
ZIP was created in 1989 as the native format of the [[PKZIP]] program, which was introduced by Phil Katz (with co-creator Gary Conway) in the wake of a lawsuit (which he lost) against him by the makers of the then-popular [[ARC (compression format)|ARC]] program (and file format) for copyright and trademark infringement in an earlier program [[PKARC]] which had been file-compatible with ARC. This resulted in Katz creating a new file format, which rapidly overtook ARC in popularity (to a large extent because of BBS sysops, then the primary users of such compression, resenting the lawsuit). Many programs have been released for a variety of operating systems to compress and decompress ZIP files, and native support for the format is built into several popular operating systems.
  
== Identification ==
+
ZIP implementations vary in their support for features in the specification from PKWARE (see "APPNOTE" in the Specifications section below), particularly features added since version 2 (1993)<!--, some of which are protected by patents and require licensing-->. Many implementations limit the use of compression to the [[DEFLATE]] algorithm, introduced with version 2. Extensions incorporated into the specification that have been widely adopted include large files (using a technique known as ZIP64), and filenames in [[UTF-8]].
The byte sequence <code>'P' 'K' 0x05 0x06</code> (the "end of central directory signature") appears somewhere in the file, usually beginning exactly 22 bytes from the end of the file. However, it will appear earlier if the file contains a "ZIP file comment" (common in the BBS era, but rare today), or for various other reasons. There seems to be no theoretical limit to how far back you may have to search for the signature, but some software limits it to around 64KB, which is the maximum length of a comment.
+
  
Most ZIP files (but not [[self-extracting ZIP]] files) happen to begin with <code>'P' 'K' 0x03 0x04</code>. This is not a global file signature, but is the signature that appears once for every compressed file inside the ZIP file. Some ZIP-based formats are designed such that they necessarily begin in this way. But in general, it is even legal for a ZIP file to contain zero files, and such a ZIP file would not contain this signature at all.
+
An interoperable subset of ZIP has been defined, and published as ''ISO/IEC 21320-1: Document Container File'' (refer to the Specifications section below). Designed to promote interoperable implementations, it prohibits various features, including compression other than [[DEFLATE]], multiple volumes, and various encryption-related features.
 +
 
 +
While .zip is the usual file extension, ZIP-formatted files can be found with many other extensions since a number of other file formats use ZIP compression but store their files in application-specific extensions. See [[:Category:ZIP based file formats]] for a list of such formats.
  
== Compression ==
+
== Format details ==
 +
=== Compression ===
 
Each file in a ZIP file is compressed using one of a number of compression algorithms. Only compression types 0 (uncompressed) and 8 (DEFLATE) are likely to be seen in modern portable ZIP files. In old ZIP files, types 1 (Shrink) and 6 (Implode) are common.
 
Each file in a ZIP file is compressed using one of a number of compression algorithms. Only compression types 0 (uncompressed) and 8 (DEFLATE) are likely to be seen in modern portable ZIP files. In old ZIP files, types 1 (Shrink) and 6 (Implode) are common.
  
Line 41: Line 43:
 
|0 || Uncompressed
 
|0 || Uncompressed
 
|-
 
|-
|1 || Shrink ([[LZW]]) || Used by PKZIP prior to v2.0.
+
|1 || Shrink || [[LZW]]. Used by PKZIP 0.x and 1.x.
 
|-
 
|-
|2–5 || Reduce || Used by PKZIP v0.x.
+
|2–5 || Reduce || [[LZ77]] + prediction. Used by PKZIP v0.x. See also [[SCRNCH]].
 
|-
 
|-
|6 || Implode || Used by PKZIP v1.x.
+
|6 || Implode || [[LZ77 with Huffman coding|LZ77 + Huffman]]. Used by PKZIP v1.x.
 
|-
 
|-
 
|7 || Tokenized || Never used?
 
|7 || Tokenized || Never used?
 
|-
 
|-
|8 || [[DEFLATE]] || Used by PKZIP v2.0+.
+
|8 || [[DEFLATE]] || [[LZ77 with Huffman coding|LZ77 + Huffman]]. Used by PKZIP v2.0+.
 
|-
 
|-
|9 || Deflate64, a.k.a. Enhanced Deflate || Defined in ZIP specification v2.1+.
+
|9 || Deflate64, a.k.a. Enhanced Deflate || Format version 2.1+.
 
|-
 
|-
|10 || [[PKWARE DCL Implode|PKWARE Data Compression Library (DCL) Imploding]] (old IBM TERSE)
+
|10 || [[PKWARE DCL Implode]] (old IBM TERSE) || Format version 2.5+.
 
|-
 
|-
|12 || [[Bzip2]] || Defined in ZIP specification v4.6+.
+
|12 || [[Bzip2]] || Format version 4.6+.
 
|-
 
|-
 
|14 || [[LZMA]] (EFS) || Defined in ZIP specification v6.3+.
 
|14 || [[LZMA]] (EFS) || Defined in ZIP specification v6.3+.
 
|-
 
|-
|16 || IBM z/OS CMPSC
+
|16 || IBM z/OS CMPSC || Defined in ZIP specification v6.3.5+.
 
|-
 
|-
|18 || IBM TERSE (new)
+
|18 || IBM [[TERSE]] (new) || Defined in ZIP specification v6.2.2+.
 
|-
 
|-
|19 || IBM LZ77 z Architecture (PFS)
+
|19 || IBM LZ77 z Architecture (PFS) || Defined in ZIP specification v6.3.5+.
 
|-
 
|-
|94 || [[MP3]] || Supported by WinZip 21+.
+
|93 || [[Zstandard]] || Defined in ZIP specification v6.3.8+.
 
|-
 
|-
|95 || [[XZ]] || Supported by WinZip 18+.
+
|94 || [[MP3]] || Defined in ZIP specification v6.3.8+. Supported by WinZip 21+.
 
|-
 
|-
|96 || [[JPEG]] variant
+
|95 || [[XZ]] || Defined in ZIP specification v6.3.8+. Supported by WinZip 18+.
 +
|-
 +
|96 || [[JPEG]] variant || Defined in ZIP specification v6.3.5+.
 
|-
 
|-
 
|97 || [[WavPack]] || Defined in ZIP specification v6.3.2+.
 
|97 || [[WavPack]] || Defined in ZIP specification v6.3.2+.
Line 75: Line 79:
 
|98 || [[PPMd]] version I, Rev 1 || Defined in ZIP specification v6.3+.
 
|98 || [[PPMd]] version I, Rev 1 || Defined in ZIP specification v6.3+.
 
|-
 
|-
|99 || AES / AE-x encryption marker
+
|99 || AES / AE-x encryption marker || Defined in ZIP specification v6.3.5+.
 
|}
 
|}
  
== Extensible data fields ==
+
=== Extensible data fields ===
 
Each member file of a ZIP file may have one or more ''extensible data fields'' (or ''extra fields''), containing arbitrary data. Each field is tagged with a 16-bit identifier. Extra fields are normally used for platform-specific or filesystem-specific metadata, or to work around limitations of the original ZIP format. They are not normally used for application-specific data.
 
Each member file of a ZIP file may have one or more ''extensible data fields'' (or ''extra fields''), containing arbitrary data. Each field is tagged with a 16-bit identifier. Extra fields are normally used for platform-specific or filesystem-specific metadata, or to work around limitations of the original ZIP format. They are not normally used for application-specific data.
  
 
Most of the extra fields in use are documented in the ZIP "APPNOTE" specification, or by the Info-ZIP software (e.g. the proginfo/extrafld.txt file in the Zip program's source distribution).
 
Most of the extra fields in use are documented in the ZIP "APPNOTE" specification, or by the Info-ZIP software (e.g. the proginfo/extrafld.txt file in the Zip program's source distribution).
  
=== Known extensible data fields ===
+
Known extensible data fields:
 
{| class="wikitable"
 
{| class="wikitable"
 
!ID !! Owner !! Description !! Reference (identification) !! Reference (details)
 
!ID !! Owner !! Description !! Reference (identification) !! Reference (details)
Line 190: Line 194:
 
|-
 
|-
 
|0x7875 "<code>ux</code>" || || Info-ZIP Unix (new) || Info-ZIP || Info-ZIP
 
|0x7875 "<code>ux</code>" || || Info-ZIP Unix (new) || Info-ZIP || Info-ZIP
 +
|-
 +
|0xa11e || || Data Stream Alignment || APPNOTE || APPNOTE
 
|-
 
|-
 
|0xa220 || || Microsoft Open Packaging Growth Hint || APPNOTE || APPNOTE
 
|0xa220 || || Microsoft Open Packaging Growth Hint || APPNOTE || APPNOTE
Line 197: Line 203:
 
|0xfd4a || || SMS/QDOS || APPNOTE ||
 
|0xfd4a || || SMS/QDOS || APPNOTE ||
 
|}
 
|}
 +
 +
=== Multi-part archives ===
 +
The ZIP format has supported archives consisting of multiple files from day one, though this feature does not appear to have been utilized by PKZIP until floppy disk spanning features were added in the v2.xx series.
 +
 +
The first fragment of a multi-part archive usually begins with signature bytes {{magic|'P' 'K' 0x07 0x08}}. This signature can be present even for single-part archives, if the disk spanning feature was enabled but turned out not to be needed.
 +
 +
The APPNOTE documentation (at least v4.5+) also mentions signature {{magic|'P' 'K' 0x30 0x30}}, calling it the "temporary spanning marker", but more research is needed to understand when it is used.
 +
 +
Fragments of a multi-part archive that are neither first nor last may not have any identifying signatures.
 +
 +
Only the last fragment necessarily contains the usual {{magic|'P' 'K' 0x05 0x06}} signature.
 +
 +
=== Self-extracting archives ===
 +
Refer to [[Self-extracting ZIP]].
 +
 +
=== Character encoding ===
 +
In general there is [https://twitter.com/tef/status/436555938879655937 no official file name encoding for ZIP files], and non-ASCII filenames [https://stackoverflow.com/questions/106367/add-non-ascii-file-names-to-zip-in-java are not generally well supported]. The original implementation specified [[CP437|IBM Code Page 437]] for filenames, but as many characters cannot be expressed in that encoding, the filename bytes have often been interpreted using the current system codepage (implementation-dependent behaviour). There is a flag to specify [[UTF-8]] as the encoding, but it is not supported in all major clients (e.g. Windows Explorer).
 +
 +
== Identification ==
 +
The byte sequence <code>'P' 'K' 0x05 0x06</code> (the "end of central directory signature") appears somewhere in the file, usually beginning exactly 22 bytes from the end of the file. However, it will appear earlier if the file contains a "ZIP file comment" (common in the BBS era, but rare today), or for various other reasons. There seems to be no theoretical limit to how far back you may have to search for the signature, but some software limits it to around 64KB, which is the maximum length of a comment.
 +
 +
Most ZIP files (but not [[self-extracting ZIP]] files) happen to begin with <code>'P' 'K' 0x03 0x04</code>. This is not a global file signature, but is the signature that appears once for every compressed file inside the ZIP file. Some ZIP-based formats are designed such that they necessarily begin in this way. But in general, it is even legal for a ZIP file to contain zero files, and such a ZIP file would not contain this signature at all.
 +
 +
Refer to the [[#Multi-part archives]] section, elsewhere on this page, for additional relevant information.
 +
 +
Phil Katz has thus managed to get his initials at the start of a large number of files on many millions of computers and devices, given how many file formats are based on ZIP (even if they use different extensions). He died in 2000, but this memorial to him will live on indefinitely.
 +
 +
== See also ==
 +
* [[Self-extracting ZIP]]
 +
* [[Zipx]]
 +
* [[SOF (Spectrum Software)]] - Variant format
 +
* [[DEFLATE]]
 +
* [[:Category:ZIP based file formats]]
 +
* [[:Category:PKWARE]]
  
 
== Specifications ==
 
== Specifications ==
Line 209: Line 249:
 
** Bundled with PKZIP software through v1.93 - refer to the [[#Software]] section below.
 
** Bundled with PKZIP software through v1.93 - refer to the [[#Software]] section below.
 
* [http://www.iana.org/assignments/media-types/application/zip IANA registration for application/zip in July 1993] (corresponds to version 2 of APPNOTE.TXT)
 
* [http://www.iana.org/assignments/media-types/application/zip IANA registration for application/zip in July 1993] (corresponds to version 2 of APPNOTE.TXT)
* [http://kikaku.itscj.ipsj.or.jp/sc34/open/1855.pdf November 2012 working draft of ISO/IEC WD 21320-1, Document Container File -- Part 1: Core] Intended as restricted subset of APPNOTE 6.3.3 designed to promote interoperability.
+
* [https://www.iso.org/standard/60101.html ISO/IEC 21320-1: Document Container File] (see also {{LoCFDD|fdd000361}})
* [http://kikaku.itscj.ipsj.or.jp/sc34/open/1886.pdf February 2013 committee draft of ISO/IEC CD 21320-1, Document Container File -- Part 1: Core] Essentially the same as November 2012 working draft except that it mandates use of the UTF-8 indicator.
+
 
* [http://apple2.org.za/gswv/a2zine/GS.WorldView/Resources/The.MacShrinkIt.Project/ARCHIVES.TXT Archive format info, including ZIP] (from 1989, when ZIP was newly released)
 
* [http://apple2.org.za/gswv/a2zine/GS.WorldView/Resources/The.MacShrinkIt.Project/ARCHIVES.TXT Archive format info, including ZIP] (from 1989, when ZIP was newly released)
 
* [http://www.textfiles.com/programming/FORMATS/arc_fmts.txt ZIP file header format (among other archive types)]
 
* [http://www.textfiles.com/programming/FORMATS/arc_fmts.txt ZIP file header format (among other archive types)]
 
* [http://rescene.wikidot.com/torrentzip TorrentZip]
 
* [http://rescene.wikidot.com/torrentzip TorrentZip]
* Note that in general there is [https://twitter.com/tef/status/436555938879655937 no official file name encoding for ZIP files], and non ASCII filenames [http://stackoverflow.com/questions/106367/add-non-ascii-file-names-to-zip-in-java are not generally well supported]. The original implementation specified [[CP437|IBM Code Page 437]] for filenames, but as many characters cannot be expressed in that encoding, the filename bytes have often been interpreted using the current system codepage (implementation dependent behaviour). There is a flag to specify [[UTF-8]] as the encoding, but it is not supported in all major clients (e.g. Windows Explorer).
 
 
* [http://www.opensource.apple.com/source/zip/zip-6/unzip/unzip/proginfo/extra.fld Info-ZIP's "extra fields" documentation]
 
* [http://www.opensource.apple.com/source/zip/zip-6/unzip/unzip/proginfo/extra.fld Info-ZIP's "extra fields" documentation]
  
Line 222: Line 260:
  
 
== Software ==
 
== Software ==
* [http://www.info-zip.org/ Info-ZIP]: [http://www.info-zip.org/Zip.html Zip], [http://www.info-zip.org/UnZip.html UnZip]
+
* [http://www.info-zip.org Info-ZIP]: [http://www.info-zip.org/Zip.html Zip], [http://www.info-zip.org/UnZip.html UnZip]
 
* [[7-Zip]]
 
* [[7-Zip]]
 
* [[zlib]] - The zlib library does not support ZIP format, but it is distributed with "minizip" code that supports most ZIP files.
 
* [[zlib]] - The zlib library does not support ZIP format, but it is distributed with "minizip" code that supports most ZIP files.
 
* [http://www.nih.at/libzip/ libzip] - Uses zlib.
 
* [http://www.nih.at/libzip/ libzip] - Uses zlib.
* [http://www.libarchive.org/ libarchive] - Uses zlib.
+
* [http://www.libarchive.org libarchive] - Uses zlib.
* [http://zziplib.sourceforge.net/ zziplib]
+
* [http://zziplib.sourceforge.net zziplib]
 
** [http://search.cpan.org/~vspader/Archive-ZZip-0.13/ZZip/ZZip.pm Archive::ZZip]: Perl bindings for zziplib
 
** [http://search.cpan.org/~vspader/Archive-ZZip-0.13/ZZip/ZZip.pm Archive::ZZip]: Perl bindings for zziplib
 
* [https://github.com/richgel999/miniz miniz]
 
* [https://github.com/richgel999/miniz miniz]
* PKZIP
+
* [[PKZIP]]
** For DOS: {{CdTextfiles|swinnund/disk3/FILEUTIL/PKZ080.ZIP|0.80-beta}} · {{CdTextfiles|rbbsv3n1/pool/pkz090.exe|0.90}} · {{CdTextfiles|rbbsv3n1/pool/pkz092.exe|0.92}} · {{CdTextfiles|rbbsv3n1/pool/pkz101.exe|1.01}} · {{CdTextfiles|rbbsv3n1/pool/pkz102.exe|1.02}} · '''{{CdTextfiles|rbbsv3n1/pool/pkz110.exe|1.10}}''' · {{CdTextfiles|somuch/smsharew2/UTIL2/PKZ193A.EXE|1.93-alpha}} · {{CdTextfiles|20mnn/ARCHIVE/PKZ204C.EXE|2.04c}} · {{CdTextfiles|megmonster/ARCHIVE/PKZ204E.EXE|2.04e}} · '''{{CdTextfiles|simtel/simtel20/MSDOS/ZIP/PKZ204G.EXE|2.04g}}''' · {{CdTextfiles|simtel/simtel0101/simtel/arcers/pk250dos.exe|2.50}}
+
** For DOS: {{CdTextfiles|rbbsv3n1/pool/pkz110.exe|1.10}} · {{CdTextfiles|simtel/simtel20/MSDOS/ZIP/PKZ204G.EXE|2.04g}} · {{CdTextfiles|simtel/simtel0101/simtel/arcers/pk250dos.exe|2.50}}
 +
** For Windows console: [{{SACFTPURL|pack|pkzc400s.exe}} 4.00]
 +
** See [[PKZIP#Software]] for more versions.
 
* [[Konvertor]]
 
* [[Konvertor]]
 +
* {{Deark}} (for analysis, or converting old compression methods)
 +
* [{{SACFTPURL|pack|pcdezip.zip}} PCDEZIP] - Source code and DOS binary to decompress ''Deflate'' and all older methods (1993)
 +
* {{XAD}}
 +
* [[Xelitan Zip]]
  
 
== Sample files ==
 
== Sample files ==
 
* https://github.com/corkami/pocs/tree/master/zip
 
* https://github.com/corkami/pocs/tree/master/zip
 
* Examples that use the uncommon "Reduce" compression scheme: {{CdTextfiles|ccbwindows93/CORELDRA/VISA_CRD.ZIP|VISA_CRD.ZIP}}, {{CdTextfiles|librisbritannia/GRAPHICS/CLIPMAC/1608A.ZIP|1608A.ZIP}} → D1-MAC.ZIP
 
* Examples that use the uncommon "Reduce" compression scheme: {{CdTextfiles|ccbwindows93/CORELDRA/VISA_CRD.ZIP|VISA_CRD.ZIP}}, {{CdTextfiles|librisbritannia/GRAPHICS/CLIPMAC/1608A.ZIP|1608A.ZIP}} → D1-MAC.ZIP
 
+
* {{DexvertSamples|archive/zip}}
== References ==
+
<references/>
+
  
 
== Links ==
 
== Links ==
* [[Wikipedia:Zip (file format)|Wikipedia: Zip (file format)]]
+
* [[Wikipedia:Zip (file format)|Wikipedia: ZIP (file format)]]
 
* [[Wikipedia:PKZIP|Wikipedia: PKZIP]]
 
* [[Wikipedia:PKZIP|Wikipedia: PKZIP]]
 
* [http://research.swtch.com/zip Zip files all the way down] (creating an infinitely-regressed ZIP file)
 
* [http://research.swtch.com/zip Zip files all the way down] (creating an infinitely-regressed ZIP file)
 
* [http://imgur.com/a/PbN8H#1 ZIP101 an archive walkthrough]
 
* [http://imgur.com/a/PbN8H#1 ZIP101 an archive walkthrough]
 
* [http://literarymachin.es/deepzoom-osd-server/ Serve deepzoom images from a zip archive with openseadragon]
 
* [http://literarymachin.es/deepzoom-osd-server/ Serve deepzoom images from a zip archive with openseadragon]
* [https://stackoverflow.com/questions/20762094/how-are-zlib-gzip-and-zip-related-what-do-they-have-in-common-and-how-are-they/20765054#20765054 How are zlib, gzip and Zip related? What do they have in common and how are they different?] - Response to StackOverflow question by zlib/gzip co-creator Mark Adler  
+
* [https://stackoverflow.com/questions/20762094/how-are-zlib-gzip-and-zip-related-what-do-they-have-in-common-and-how-are-they/20765054#20765054 How are zlib, gzip and Zip related? What do they have in common and how are they different?] - Response to StackOverflow question by zlib/gzip co-creator Mark Adler
 +
* [https://www.bitsgalore.org/2020/03/11/does-microsoft-onedrive-export-large-ZIP-files-that-are-corrupt Does Microsoft OneDrive export large ZIP files that are corrupt?] - Discusses an issue where large ZIP files generated by Microsoft OneDrive result in read errors when they are opened with tools like Info-Zip and 7-Zip
 +
* [https://blog.archive.org/2019/02/13/zip-is-broken-except-its-not-except-it-is/ ZIP is Broken, Except it’s Not, Except it Is]
 +
* [https://www.hanshq.net/zip.html Zip Files: History, Explanation and Implementation] and [https://www.hanshq.net/zip2.html The Legacy Zip Compression Methods]
 +
* [https://games.greggman.com/game/zip-rant/ Zip - How not to design a file format.]
 +
* [https://gynvael.coldwind.pl/download.php?f=TenThousandSecurityPitfalls_theZIPfileFormat.pdf Ten thousand security pitfalls: The ZIP file format by Gynvael Coldwind (2018)]
 +
 
 
[[Category:Compression]]
 
[[Category:Compression]]
 
[[Category:Metaformats]]
 
[[Category:Metaformats]]
 
[[Category:ZIP based file formats]]
 
[[Category:ZIP based file formats]]
 
[[Category:PKWARE]]
 
[[Category:PKWARE]]

Latest revision as of 20:59, 5 October 2024

File Format
Name ZIP
Ontology
Extension(s) .zip
MIME Type(s) application/zip
LoCFDD fdd000354, fdd000355, fdd000362, fdd000361
PRONOM x-fmt/263
UTI com.pkware.zip-archive
Wikidata ID Q136218
Kaitai Struct Spec zip.ksy
Released 1989

ZIP is one of the most popular file compression/archiving formats.

== Disambiguation ==

There are many other things, related or not, whose names use the word "Zip". Only a few are listed here.

  • Zip disk, an unrelated disk cartridge unit
  • Z-language Interpreter Program (ZIP) - See Z-code.
  • Zip-Archiv - See ZAR (Zip-Archiv) (an unrelated format).
  • "ZIP compression" can sometimes refer to DEFLATE, or something based on it (zlib, Gzip).

See also the "See also" section, elsewhere on this page.

== Discussion ==

ZIP was created in 1989 as the native format of the PKZIP program, introduced by Phil Katz (with co-creator Gary Conway). Katz wrote it in the wake of a lawsuit, which he lost, brought by the makers of the then-popular ARC program (and file format) for copyright and trademark infringement in his earlier program PKARC, which had been file-compatible with ARC. The new file format rapidly overtook ARC in popularity, to a large extent because BBS sysops, then the primary users of such compression, resented the lawsuit. Many programs have been released for a variety of operating systems to compress and decompress ZIP files, and native support for the format is built into several popular operating systems.

ZIP implementations vary in their support for features in the specification from PKWARE (see "APPNOTE" in the Specifications section below), particularly features added since version 2 (1993). Many implementations limit the use of compression to the DEFLATE algorithm, introduced with version 2. Extensions incorporated into the specification that have been widely adopted include large files (using a technique known as ZIP64), and filenames in UTF-8.

An interoperable subset of ZIP has been defined, and published as ISO/IEC 21320-1: Document Container File (refer to the Specifications section below). Designed to promote interoperable implementations, it prohibits various features, including compression other than DEFLATE, multiple volumes, and various encryption-related features.

While .zip is the usual file extension, ZIP-formatted files can be found with many other extensions since a number of other file formats use ZIP compression but store their files in application-specific extensions. See Category:ZIP based file formats for a list of such formats.
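Because the extension is not a reliable indicator, software usually inspects the content instead. A minimal sketch using Python's standard `zipfile.is_zipfile`, which checks the data rather than the name; the `looks_like_zip` wrapper and the in-memory sample (including its member name) are illustrative assumptions, not part of any particular ZIP-based format:

```python
import io
import zipfile


def looks_like_zip(data):
    """Return True if the bytes are recognised as a ZIP archive by zipfile."""
    return zipfile.is_zipfile(io.BytesIO(data))


# Build a tiny archive in memory, standing in for a file that carries
# some application-specific extension instead of .zip.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("mimetype", "application/example")  # hypothetical member
container = buf.getvalue()

print(looks_like_zip(container))
print(looks_like_zip(b"plain text, not an archive at all"))
```

A check like this only says that the container is ZIP; telling the ZIP-based formats apart still requires looking at the member names or contents.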

== Format details ==

=== Compression ===

Each file in a ZIP file is compressed using one of a number of compression algorithms. Only compression types 0 (uncompressed) and 8 (DEFLATE) are likely to be seen in modern portable ZIP files. In old ZIP files, types 1 (Shrink) and 6 (Implode) are common.

{| class="wikitable"
!Code !! Compression scheme !! Notes and references
|-
|0 || Uncompressed
|-
|1 || Shrink || LZW. Used by PKZIP 0.x and 1.x.
|-
|2–5 || Reduce || LZ77 + prediction. Used by PKZIP v0.x. See also SCRNCH.
|-
|6 || Implode || LZ77 + Huffman. Used by PKZIP v1.x.
|-
|7 || Tokenized || Never used?
|-
|8 || DEFLATE || LZ77 + Huffman. Used by PKZIP v2.0+.
|-
|9 || Deflate64, a.k.a. Enhanced Deflate || Format version 2.1+.
|-
|10 || PKWARE DCL Implode (old IBM TERSE) || Format version 2.5+.
|-
|12 || Bzip2 || Format version 4.6+.
|-
|14 || LZMA (EFS) || Defined in ZIP specification v6.3+.
|-
|16 || IBM z/OS CMPSC || Defined in ZIP specification v6.3.5+.
|-
|18 || IBM TERSE (new) || Defined in ZIP specification v6.2.2+.
|-
|19 || IBM LZ77 z Architecture (PFS) || Defined in ZIP specification v6.3.5+.
|-
|93 || Zstandard || Defined in ZIP specification v6.3.8+.
|-
|94 || MP3 || Defined in ZIP specification v6.3.8+. Supported by WinZip 21+.
|-
|95 || XZ || Defined in ZIP specification v6.3.8+. Supported by WinZip 18+.
|-
|96 || JPEG variant || Defined in ZIP specification v6.3.5+.
|-
|97 || WavPack || Defined in ZIP specification v6.3.2+.
|-
|98 || PPMd version I, Rev 1 || Defined in ZIP specification v6.3+.
|-
|99 || AES / AE-x encryption marker || Defined in ZIP specification v6.3.5+.
|}
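To see which of these codes an archive actually uses, the per-member method can be read from the central directory without decompressing anything. A rough sketch with Python's standard `zipfile` module, where `ZipInfo.compress_type` holds the numeric code; the `METHOD_NAMES` subset, the `list_methods` helper and the demo archive are assumptions made for illustration:

```python
import io
import zipfile

# A few codes from the table above; anything else is reported as "other".
METHOD_NAMES = {
    0: "Uncompressed",
    1: "Shrink",
    6: "Implode",
    8: "DEFLATE",
    9: "Deflate64",
    12: "Bzip2",
    14: "LZMA",
}


def list_methods(source):
    """Return (member name, method code, method name) for each member."""
    with zipfile.ZipFile(source) as zf:
        return [
            (info.filename, info.compress_type,
             METHOD_NAMES.get(info.compress_type, "other"))
            for info in zf.infolist()
        ]


def build_demo():
    """Create a small in-memory archive with one stored and one deflated member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("stored.txt", b"hello", compress_type=zipfile.ZIP_STORED)
        zf.writestr("deflated.txt", b"hello " * 100,
                    compress_type=zipfile.ZIP_DEFLATED)
    buf.seek(0)
    return buf


for row in list_methods(build_demo()):
    print(row)
```

Listing works even for methods that the library cannot decompress, since the code is only looked up, not acted on.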

=== Extensible data fields ===

Each member file of a ZIP file may have one or more extensible data fields (or extra fields), containing arbitrary data. Each field is tagged with a 16-bit identifier. Extra fields are normally used for platform-specific or filesystem-specific metadata, or to work around limitations of the original ZIP format. They are not normally used for application-specific data.

Most of the extra fields in use are documented in the ZIP "APPNOTE" specification, or by the Info-ZIP software (e.g. the proginfo/extrafld.txt file in the Zip program's source distribution).

Known extensible data fields:

{| class="wikitable"
!ID !! Owner !! Description !! Reference (identification) !! Reference (details)
|-
|0x0001 || PKWARE || Zip64 extended information || APPNOTE || APPNOTE, Info-ZIP
|-
|0x0007 || PKWARE || AV Info || APPNOTE ||
|-
|0x0008 || PKWARE || Reserved for extended language encoding data (PFS) || APPNOTE ||
|-
|0x0009 || PKWARE || OS/2 || APPNOTE || APPNOTE, Info-ZIP
|-
|0x000a || PKWARE || NTFS || APPNOTE || APPNOTE, Info-ZIP
|-
|0x000c || PKWARE || OpenVMS || APPNOTE || APPNOTE, Info-ZIP
|-
|0x000d || PKWARE || UNIX || APPNOTE || APPNOTE, Info-ZIP
|-
|0x000e || PKWARE || Reserved for file stream and fork descriptors || APPNOTE ||
|-
|0x000f || PKWARE || Patch Descriptor || APPNOTE || APPNOTE, Info-ZIP
|-
|0x0014 || PKWARE || PKCS#7 Store for X.509 Certificates || APPNOTE || APPNOTE, Info-ZIP
|-
|0x0015 || PKWARE || X.509 Certificate ID and Signature for individual file || APPNOTE || APPNOTE, Info-ZIP
|-
|0x0016 || PKWARE || X.509 Certificate ID for Central Directory || APPNOTE || APPNOTE, Info-ZIP
|-
|0x0017 || PKWARE || Strong Encryption Header || APPNOTE || APPNOTE
|-
|0x0018 || PKWARE || Record Management Controls || APPNOTE || APPNOTE
|-
|0x0019 || PKWARE || PKCS#7 Encryption Recipient Certificate List || APPNOTE || APPNOTE
|-
|0x0020 || PKWARE || Reserved for Timestamp || APPNOTE ||
|-
|0x0021 || PKWARE || Policy Decryption Key || APPNOTE || APPNOTE
|-
|0x0022 || PKWARE || Smartcrypt Key Provider || APPNOTE || APPNOTE
|-
|0x0023 || PKWARE || Smartcrypt Policy Key Data || APPNOTE || APPNOTE
|-
|rowspan="2"|0x0065 || PKWARE || MVS / IBM S/390 (Z390) attributes - uncompressed || APPNOTE || APPNOTE
|-
|PKWARE || OS/400 / AS/400 (I400) attributes - uncompressed || APPNOTE || APPNOTE
|-
|0x0066 || PKWARE || Reserved for IBM S/390 (Z390), AS/400 (I400) attributes - compressed || APPNOTE ||
|-
|0x07c8 || || Macintosh (Info-ZIP Macintosh, old) || APPNOTE || Info-ZIP
|-
|0x2605 || || ZipIt Macintosh || APPNOTE || APPNOTE, Info-ZIP
|-
|0x2705 || || ZipIt Macintosh 1.3.5+ (w/o full filename) || APPNOTE || APPNOTE, Info-ZIP
|-
|0x2805 || || ZipIt Macintosh 1.3.5+ || APPNOTE || APPNOTE
|-
|0x334d "<code>M3</code>" || || Info-ZIP Macintosh || APPNOTE || Info-ZIP
|-
|0x4154 "<code>TA</code>" || || Tandem NSK || Info-ZIP || Info-ZIP
|-
|0x4341 "<code>AC</code>" || || Acorn/SparkFS || APPNOTE || Info-ZIP
|-
|0x4453 "<code>SE</code>" || || Windows NT security descriptor (binary ACL) || APPNOTE || Info-ZIP
|-
|0x4690 || PKWARE || POSZIP 4690 (reserved) || APPNOTE ||
|-
|0x4704 || || VM/CMS || APPNOTE || Info-ZIP
|-
|0x470f || || MVS || APPNOTE || Info-ZIP
|-
|0x4854 "<code>TH</code>" || || Theos (old) || Info-ZIP || Info-ZIP
|-
|0x4b46 "<code>FK</code>" || || FWKCS MD5 || APPNOTE || APPNOTE, Info-ZIP
|-
|0x4c41 "<code>AL</code>" || || OS/2 access control list (text ACL) || APPNOTE || Info-ZIP
|-
|0x4d49 "<code>IM</code>" || || Info-ZIP OpenVMS || APPNOTE || Info-ZIP
|-
|0x4d63 "<code>cM</code>" || || Macintosh SmartZIP || Info-ZIP || Info-ZIP
|-
|0x4f4c "<code>LO</code>" || || Xceed original location || APPNOTE ||
|-
|0x5350 "<code>PS</code>" || || (Observed in some Psion files.) || ||
|-
|0x5356 "<code>VS</code>" || || AOS/VS (binary ACL) || APPNOTE || Info-ZIP
|-
|0x5455 "<code>UT</code>" || || Extended timestamp || APPNOTE || Info-ZIP
|-
|0x554e "<code>NU</code>" || || Xceed unicode || APPNOTE ||
|-
|0x5855 "<code>UX</code>" || || Info-ZIP UNIX (original, also OS/2, NT, etc.) || APPNOTE || Info-ZIP
|-
|0x6375 "<code>uc</code>" || || Info-ZIP Unicode Comment || APPNOTE || APPNOTE, Info-ZIP
|-
|0x6542 "<code>Be</code>" || || BeOS (BeBox, PowerMac, etc.) || APPNOTE || Info-ZIP
|-
|0x6854 "<code>Th</code>" || || Theos || Info-ZIP || Info-ZIP
|-
|0x7075 "<code>up</code>" || || Info-ZIP Unicode Path || APPNOTE || APPNOTE, Info-ZIP
|-
|0x7441 "<code>At</code>" || || AtheOS || Old Info-ZIP || Old Info-ZIP (e.g. zip v2.32 [1])
|-
|0x756e "<code>nu</code>" || || ASi UNIX || APPNOTE || Info-ZIP
|-
|0x7855 "<code>Ux</code>" || || Info-ZIP Unix (previous new) || APPNOTE || Info-ZIP
|-
|0x7875 "<code>ux</code>" || || Info-ZIP Unix (new) || Info-ZIP || Info-ZIP
|-
|0xa11e || || Data Stream Alignment || APPNOTE || APPNOTE
|-
|0xa220 || || Microsoft Open Packaging Growth Hint || APPNOTE || APPNOTE
|-
|0xfb4a || || SMS/QDOS || Info-ZIP || Info-ZIP
|-
|0xfd4a || || SMS/QDOS || APPNOTE ||
|}
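The extra-field area of a member is a simple sequence of records: per APPNOTE, each starts with the 16-bit identifier followed by a 16-bit payload size, both little-endian, then the payload itself. The following is a sketch of walking that sequence; the `parse_extra` helper and the hand-built sample blob are illustrative assumptions (in Python, the raw bytes for a real member are available as `ZipInfo.extra`):

```python
import struct


def parse_extra(data):
    """Split a raw extra-field blob into a list of (id, payload) pairs."""
    fields = []
    pos = 0
    while pos + 4 <= len(data):
        field_id, size = struct.unpack_from("<HH", data, pos)
        pos += 4
        payload = data[pos:pos + size]
        if len(payload) < size:
            break  # truncated record: stop rather than guess
        fields.append((field_id, payload))
        pos += size
    return fields


# Hand-built sample: a 5-byte 0x5455 ("UT") record followed by a
# 2-byte 0xa11e record. The payload bytes are placeholders.
blob = (struct.pack("<HHBI", 0x5455, 5, 1, 0)
        + struct.pack("<HH", 0xA11E, 2) + b"\x00\x00")

for field_id, payload in parse_extra(blob):
    print(hex(field_id), len(payload))
```

Unknown identifiers can be skipped safely because the size is always present, which is what makes the mechanism extensible.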

=== Multi-part archives ===

The ZIP format has supported archives split across multiple files from day one, though this feature does not appear to have been utilized by PKZIP until floppy disk spanning features were added in the v2.xx series.

The first fragment of a multi-part archive usually begins with signature bytes 'P' 'K' 0x07 0x08. This signature can be present even for single-part archives, if the disk spanning feature was enabled but turned out not to be needed.

The APPNOTE documentation (at least v4.5+) also mentions signature 'P' 'K' 0x30 0x30, calling it the "temporary spanning marker", but more research is needed to understand when it is used.

Fragments of a multi-part archive that are neither first nor last may not have any identifying signatures.

Only the last fragment necessarily contains the usual 'P' 'K' 0x05 0x06 signature.
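The signatures mentioned in this section can be told apart by looking at the first four bytes of a fragment. A small sketch, assuming the caller has already read the start of the file; the `describe_start` helper and its description strings are hypothetical names chosen for this example:

```python
# Four-byte signatures named on this page, keyed by their raw bytes.
LEADING_SIGNATURES = {
    b"PK\x03\x04": "local file header (ordinary archive start)",
    b"PK\x07\x08": "spanning marker (first fragment, or spanning enabled but unused)",
    b"PK00": "temporary spanning marker",  # 'P' 'K' 0x30 0x30
    b"PK\x05\x06": "end of central directory (archive with no members)",
}


def describe_start(head):
    """Describe the first four bytes of a file or fragment."""
    return LEADING_SIGNATURES.get(bytes(head[:4]), "no recognised ZIP signature")


print(describe_start(b"PK\x07\x08PK\x03\x04"))
print(describe_start(b"MZ\x90\x00"))
```

A result of "no recognised ZIP signature" does not rule out ZIP data: middle fragments and self-extracting archives are expected to land there.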

=== Self-extracting archives ===

Refer to Self-extracting ZIP.

=== Character encoding ===

In general there is no official file name encoding for ZIP files, and non-ASCII filenames are not generally well supported. The original implementation specified IBM Code Page 437 for filenames, but as many characters cannot be expressed in that encoding, the filename bytes have often been interpreted using the current system codepage (implementation-dependent behaviour). There is a flag to specify UTF-8 as the encoding, but it is not supported in all major clients (e.g. Windows Explorer).
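A reader therefore has to pick a decoding per member. The sketch below assumes the UTF-8 flag is bit 11 (0x0800) of the general purpose bit flag, as APPNOTE defines it, and falls back to a caller-chosen legacy code page otherwise; the `decode_name` helper and its `fallback` parameter are illustrative, and guessing the right fallback for a given archive remains a heuristic:

```python
UTF8_FLAG = 0x0800  # general purpose bit 11: name and comment are UTF-8


def decode_name(raw, flag_bits, fallback="cp437"):
    """Decode raw filename bytes according to the member's flag bits."""
    if flag_bits & UTF8_FLAG:
        return raw.decode("utf-8")
    # No flag: CP437 per the original specification, unless the caller
    # knows the archive was written with some other system code page.
    return raw.decode(fallback)


print(decode_name("é.txt".encode("utf-8"), UTF8_FLAG))
print(decode_name(b"\x82.txt", 0))
print(decode_name(b"\xe9.txt", 0, fallback="cp1252"))
```

All three calls yield the same name, which shows why the same bytes can display differently depending on what the extracting tool assumes.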

== Identification ==

The byte sequence 'P' 'K' 0x05 0x06 (the "end of central directory signature") appears somewhere in the file, usually beginning exactly 22 bytes from the end of the file. However, it will appear earlier if the file contains a "ZIP file comment" (common in the BBS era, but rare today), or for various other reasons. There seems to be no theoretical limit to how far back you may have to search for the signature, but some software limits it to around 64KB, which is the maximum length of a comment.
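The backwards search described here can be sketched as follows. This version limits the search to 22 bytes plus the maximum comment length, and accepts a candidate only if its comment-length field (the last two bytes of the 22-byte record) accounts exactly for the rest of the file. That strictness is an assumption of this sketch: it rejects archives with trailing data, which some tools tolerate. The `find_eocd` name and the demo archive are hypothetical:

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"
EOCD_MIN = 22          # size of the record with an empty comment
MAX_COMMENT = 0xFFFF   # comment length is a 16-bit field


def find_eocd(data):
    """Return the offset of the end of central directory record, or -1."""
    start = max(0, len(data) - (EOCD_MIN + MAX_COMMENT))
    pos = data.rfind(EOCD_SIG, start)
    while pos != -1:
        if pos + EOCD_MIN <= len(data):
            (comment_len,) = struct.unpack_from("<H", data, pos + 20)
            if pos + EOCD_MIN + comment_len == len(data):
                return pos
        # Look for an earlier occurrence that ends before this one does.
        pos = data.rfind(EOCD_SIG, start, pos + 3)
    return -1


# Demo archive with a ZIP file comment, so the record is not at len - 22.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"hi")
    zf.comment = b"hello BBS"
demo = buf.getvalue()

print(find_eocd(demo), len(demo))
```

Checking the comment length also guards against a stray signature that happens to occur inside the comment text.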

Most ZIP files (but not self-extracting ZIP files) happen to begin with 'P' 'K' 0x03 0x04. This is not a global file signature, but is the signature that appears once for every compressed file inside the ZIP file. Some ZIP-based formats are designed such that they necessarily begin in this way. But in general, it is even legal for a ZIP file to contain zero files, and such a ZIP file would not contain this signature at all.
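The zero-member case is easy to reproduce. A short sketch using Python's standard `zipfile` module to write an archive with no members and inspect the resulting bytes; the variable names are arbitrary:

```python
import io
import zipfile

# Write an archive without adding any members.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w"):
    pass
empty = buf.getvalue()

print(len(empty))                 # total size in bytes
print(empty[:4])                  # leading signature
print(b"PK\x03\x04" in empty)     # is the per-member signature present?
print(zipfile.ZipFile(io.BytesIO(empty)).namelist())
```

The output consists of the end of central directory record alone, so a detector that only tests for a leading `'P' 'K' 0x03 0x04` would miss it.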

Refer to the Multi-part archives section, elsewhere on this page, for additional relevant information.

Phil Katz has thus managed to get his initials at the start of a large number of files on many millions of computers and devices, given how many file formats are based on ZIP (even if they use different extensions). He died in 2000, but this memorial to him will live on indefinitely.

== See also ==

== Specifications ==

== Metaformat files ==

== Software ==

== Sample files ==

== Links ==
