ARC (Internet Archive)
From Just Solve the File Format Problem
Revision as of 16:34, 26 November 2012
The Internet Archive ARC format is different from and not compatible with the better-known ARC format popular on bulletin board systems in the 1980s. It consists of a series of uncompressed files or data streams (taken directly as received from the Web or another source) combined in a single file with headers giving the original location (URL) from which the data was retrieved, and the archive date.
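As an illustration of the per-record headers, the following minimal sketch parses one header line. It assumes the version-1 layout described in the Internet Archive documentation (URL, IP address, archive date as YYYYMMDDhhmmss, content type, and length, separated by single spaces); the sample line, the `ArcHeader` type, and the `parse_arc_v1_header` function are hypothetical names made up for this example and are not part of any library.

```python
from datetime import datetime
from typing import NamedTuple


class ArcHeader(NamedTuple):
    """Fields of one ARC record header (assumed version-1 layout)."""
    url: str
    ip_address: str
    archive_date: datetime
    content_type: str
    length: int


def parse_arc_v1_header(line: str) -> ArcHeader:
    """Split a space-separated header line into its five assumed fields."""
    url, ip, date, ctype, length = line.rstrip("\n").split(" ")
    return ArcHeader(
        url=url,
        ip_address=ip,
        # Archive date is assumed to be a 14-digit YYYYMMDDhhmmss timestamp.
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=ctype,
        length=int(length),
    )


# Made-up header line for illustration only.
sample = "http://example.com/ 192.0.2.1 20121126163400 text/html 1234\n"
header = parse_arc_v1_header(sample)
```

The `length` field tells a reader how many bytes of captured data follow the header line, which is how consecutive records in a single file can be told apart.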
The Internet Archive ARC format has a successor format, WARC (ISO 28500:2009, Information and documentation -- WARC file format).
By default, the Internet Archive crawler, Heritrix, writes ARC files in compressed form using gzip, with the extension .arc.gz. In a compressed ARC file, each ARC record is gzipped individually, and the resulting gzip members are concatenated into one file.
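Because each record is its own gzip member, a reader can recover records one at a time instead of decompressing the whole file. The sketch below shows one way to do this with Python's standard `zlib` module. The two "records" are made-up stand-ins rather than real ARC records, and `iter_gzip_members` is a hypothetical helper written for this example.

```python
import gzip
import zlib


def iter_gzip_members(data: bytes):
    """Yield the decompressed bytes of each gzip member in `data`."""
    while data:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(wbits=31)
        member = decomp.decompress(data)
        member += decomp.flush()
        yield member
        # Bytes left over after this member belong to the next one.
        data = decomp.unused_data


# Made-up stand-in records, each compressed separately and concatenated,
# mimicking the layout of an .arc.gz file.
records = [b"record one\n", b"record two\n"]
arc_gz = b"".join(gzip.compress(r) for r in records)

members = list(iter_gzip_members(arc_gz))
```

Ordinary gzip tools treat concatenated members as a single stream, so `gzip.decompress(arc_gz)` also works and returns all records joined together, without the record boundaries.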