WARC
From Just Solve the File Format Problem
Revision as of 13:31, 13 January 2015
WARC (Web ARChive) is the successor to the ARC (Internet Archive) format. It is standardized as ISO 28500:2009, Information and documentation -- WARC file format, and was developed under the auspices of the International Internet Preservation Consortium. WARC was designed as an extension to ARC, in part to provide better capabilities for managing Web archives over the long term by allowing the capture of more metadata about the circumstances of archiving.
WARC files are often compressed using gzip, resulting in a .warc.gz extension.
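Because the same content may arrive either as a plain .warc file or as a gzip-compressed .warc.gz file, a reader usually has to handle both. The following is a minimal sketch in Python, not a reference implementation: the function names `open_warc` and `first_record_version` are hypothetical, and the code assumes that the first record begins with a version line such as `WARC/1.0`, as laid out in the specification drafts. It detects gzip by its two-byte magic number rather than by trusting the file extension.

```python
import gzip

# The first two bytes of any gzip stream (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def open_warc(path):
    """Open a WARC file for binary reading, decompressing gzip if present.

    Hypothetical helper: the file content is inspected instead of the
    extension, so a mislabeled file is still handled.
    """
    with open(path, "rb") as probe:
        magic = probe.read(2)
    if magic == GZIP_MAGIC:
        # gzip.open also reads files made of several concatenated gzip members.
        return gzip.open(path, "rb")
    return open(path, "rb")


def first_record_version(path):
    """Return the version line of the first record, e.g. 'WARC/1.0'.

    Assumption: the file starts with a record whose first line is a
    'WARC/<version>' marker. Anything else raises ValueError.
    """
    with open_warc(path) as stream:
        line = stream.readline().strip()
    if not line.startswith(b"WARC/"):
        raise ValueError("not a WARC file: unexpected first line %r" % line)
    return line.decode("ascii")
```

Calling `first_record_version` on either a compressed or an uncompressed file should give the same version string. For real work, one of the dedicated tools listed under Tools is likely a better choice than this sketch.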
There is also a specification for a Web Archive Metadata File.
Specifications
- Draft of ISO-DIS 28500, as circulated for ISO ballot and approval.
- WARC, Web ARChive file format, from Library of Congress resource on Sustainability of Digital Formats
- Working drafts for WARC specification
Sample files
- Test WARC Files: a warc.gz file from the Internet Archive.
Tools
- WARC Tools (in Python)
- Some history of the Python tools is available on the COPTR wiki.
- warcat: Tool and library for handling Web ARChive (WARC) files.