WARC

From Just Solve the File Format Problem

Revision as of 07:48, 23 September 2019

File Format
Name WARC
Ontology
Extension(s) .warc, .warc.gz
MIME Type(s) application/warc, application/warc-fields
PRONOM fmt/289

WARC is an archive file format that has been the predominant format for Web (HTTP) archives since 2009 (as of 2019); it is also used for archives of documents collected through other protocols (e.g. FTP), and could technically be used to store a collection of ordinary files. It is the successor to the ARC format. It was developed under the auspices of the International Internet Preservation Consortium, and standardized as ISO 28500:2009, Information and documentation -- WARC file format. WARC was developed as an extension to ARC in part to provide better capabilities for managing Web archives for the long term, allowing for capture of more metadata about the circumstances of archiving.

WARC files are often compressed using gzip, resulting in a .warc.gz extension. Where the .warc.gz file needs to be randomly accessed (for example, as part of a web archive that is browsable page by page), it consists of one gzip member for each WARC record, concatenated together (the result is still a valid gzip file). This allows any single record to be read by seeking to its byte offset and, when the entire file is decompressed, still yields the original WARC.
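The per-record gzip layout described above can be illustrated with a minimal Python sketch that uses only the standard library. The function names write_per_record_gzip and read_record_at are hypothetical, and the sample records are simplified placeholders rather than complete, valid WARC records (they omit mandatory header fields); the point is only to show how concatenated gzip members allow access by offset.

```python
import gzip
import io
import zlib


def write_per_record_gzip(records):
    """Compress each record as its own gzip member and concatenate them.

    Returns the combined bytes and the byte offset at which each member starts.
    """
    buf = io.BytesIO()
    offsets = []
    for rec in records:
        offsets.append(buf.tell())      # where this record's gzip member begins
        buf.write(gzip.compress(rec))   # one independent gzip member per record
    return buf.getvalue(), offsets


def read_record_at(blob, offset):
    """Decompress only the single gzip member that starts at `offset`."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    # Decompression stops at the end of the first member; any bytes that
    # follow (the later records) are left in decomp.unused_data.
    return decomp.decompress(blob[offset:])


# Simplified placeholder records (assumption: NOT complete WARC records).
records = [
    b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nfirst\r\n\r\n",
    b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 6\r\n\r\nsecond\r\n\r\n",
    b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nthird\r\n\r\n",
]

blob, offsets = write_per_record_gzip(records)

# Random access: read the second record without touching the first.
second = read_record_at(blob, offsets[1])

# Decompressing the whole file reproduces the uncompressed record stream.
whole = gzip.decompress(blob)
```

In practice the offsets would usually come from a separate index rather than being kept in memory; that part is outside this sketch.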

There is also a specification for a Web Archive Metadata File. Another metadata format used with WARC files is CDX.

