File Format
Internet Archive metadata

The Internet Archive stores some metadata with its archived files. Some of the files you may encounter alongside Internet Archive items:


Filename ending in _files.xml

<file name="Creative_Computing_v11_n06_1985_Jun.cbz" source="original">
<format>Comic Book ZIP</format>

This lists basic data about each file in the item, in XML format, along with several checksums/hashes for data integrity checking.

format: Specifies the file format. Hopefully it's one you can look up on this site.

mtime: Unix-style timestamp of file

size: Size in bytes. The example file is over a gigabyte; Internet Archive items can get huge!

md5: MD5 hash value

crc32: CRC-32 checksum

sha1: SHA-1 hash value
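
The fields above are enough to check a downloaded file against its recorded size and hashes. The following is a minimal Python sketch, not an official tool. It assumes that each file is described by a file element with name and source attributes and child elements named after the fields above; the function names are made up for illustration. The digests are computed in chunks because the files can be very large.

```python
import hashlib
import xml.etree.ElementTree as ET
import zlib
from datetime import datetime, timezone


def parse_files_xml(xml_text):
    """Return {file name: {field: text}} for every <file> element found."""
    root = ET.fromstring(xml_text)
    entries = {}
    for file_el in root.iter("file"):
        # Child elements (format, mtime, size, md5, crc32, sha1) become keys.
        fields = {child.tag: (child.text or "").strip() for child in file_el}
        fields["source"] = file_el.get("source", "")
        entries[file_el.get("name")] = fields
    return entries


def checksums(stream, chunk_size=1 << 20):
    """Compute size, MD5, SHA-1 and CRC-32 of a binary stream in chunks."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    crc = 0
    size = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        md5.update(chunk)
        sha1.update(chunk)
        crc = zlib.crc32(chunk, crc)
        size += len(chunk)
    return {
        "size": str(size),
        "md5": md5.hexdigest(),
        "sha1": sha1.hexdigest(),
        # Assumption: the crc32 field is written as 8 hexadecimal digits.
        "crc32": format(crc & 0xFFFFFFFF, "08x"),
    }


def verify(stream, fields):
    """True if every recorded size/digest present in `fields` matches."""
    actual = checksums(stream)
    return all(
        fields[key].lower() == actual[key] for key in actual if key in fields
    )


def mtime_to_datetime(fields):
    """Convert the Unix-style mtime field to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(fields["mtime"]), tz=timezone.utc)
```

To check a real download, open it in binary mode and pass the file object to verify() together with the matching entry from parse_files_xml().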

Filename ending in _meta.xml

<title>Creative Computing Magazine (June 1985) Volume 11 Number 06</title>
<publicdate>2012-12-15 20:58:26</publicdate>
<addeddate>2012-12-15 20:58:26</addeddate>

Various metadata for the item, also in XML format.

identifier: A unique identifier for the item

date: Original creation/release date of the item. YYYY-MM in this case; other items may use a full YYYY-MM-DD date.

language: Language code of item (human language, not computer language!)

mediatype: texts for scanned books/magazines; other values include audio, movies, software, and image

title: A (human-readable) title of the item

collection: Designates which group of items it is part of

publicdate: Date this archive version was made public

uploader: Identifies who uploaded it

addeddate: Date it was added (in this case the same as the public date, and a few seconds before the timestamp in the first file above)

collection: A second collection element; an item can be part of more than one collection.

identifier-access: The URL where it can be found on Internet Archive

identifier-ark: An ARK (Archival Resource Key), another persistent identifier for the item. ARK is an open identifier scheme rather than a proprietary protocol.
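
Because a field such as collection can appear more than once, a reader of this file should keep every value rather than overwrite earlier ones. Here is a minimal Python sketch of that idea. It assumes the fields are direct children of a single root element; the function names are hypothetical, and the timestamp helper only covers the "YYYY-MM-DD HH:MM:SS" form seen in the example above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from datetime import datetime


def parse_meta_xml(xml_text):
    """Return {field name: [values...]} so repeated fields are preserved."""
    root = ET.fromstring(xml_text)
    fields = defaultdict(list)
    for child in root:
        fields[child.tag].append((child.text or "").strip())
    return dict(fields)


def parse_ia_timestamp(value):
    """Parse a publicdate/addeddate value such as '2012-12-15 20:58:26'."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
```

Every value comes back as a list, so a field that occurs once (title, say) is read as meta["title"][0], while meta["collection"] may hold several entries.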

Filename ending in _meta.txt

ETag: "21bf35870c1d94c8a0b18b315ad5008c"
accept: */*
authorization: LOW D7JBDnum0KwPVX5x:REDACTED_BY_IA_S3
connection: close
content-length: 1221828790
expect: 100-continue
user-agent: curl/7.21.6 (x86_64-pc-linux-gnu) libcurl/7.21.6 OpenSSL/1.0.0e zlib/ libidn/1.22 librtmp/2.3
x-amz-auto-make-bucket: 1
x-archive-meta-date: 1985-06
x-archive-meta-language: eng
x-archive-meta-mediatype: texts
x-archive-meta-title: Creative Computing Magazine (June 1985) Volume 11 Number 06
x-archive-meta01-collection: creativecomputing
x-upload-date: 2012-12-15T20:59:11.000Z

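Various metadata from the above files, organized in HTTP header format. Judging by the authorization, user-agent, and x-amz-* lines, these appear to be the request headers from the original upload; the x-archive-meta-* lines carry the same values as the fields in the _meta.xml file.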
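
Since the file is a plain list of "name: value" lines, it is easy to read without any HTTP library. The Python sketch below is one way to do it, with hypothetical function names. It assumes that a numbered prefix such as x-archive-meta01- is simply a numbered variant of x-archive-meta-, which is a guess based on the collection line in the example.

```python
import re

# Matches x-archive-meta-<field> and numbered forms like x-archive-meta01-<field>.
META_HEADER = re.compile(r"^x-archive-meta(\d*)-(.+)$")


def parse_meta_txt(text):
    """Return {lower-cased header name: value} for each 'name: value' line."""
    headers = {}
    for line in text.splitlines():
        # Split on the first colon only; values such as dates contain colons.
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def archive_metadata(headers):
    """Collect the x-archive-meta* headers as {field: [values...]}."""
    fields = {}
    for name in sorted(headers):
        match = META_HEADER.match(name)
        if match:
            fields.setdefault(match.group(2), []).append(headers[name])
    return fields
```

For the example above, archive_metadata() would pick out the date, language, mediatype, title, and collection values and ignore the transport headers such as connection and expect.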

Filename ending in _meta.sqlite

An SQLite version of the metadata.
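
The table layout of this file is not described here, so the safest first step is to ask the database what it contains. The following Python sketch opens the file read-only and lists its tables with their CREATE statements; the function name is hypothetical, and nothing is assumed about the schema itself.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def list_tables(db_path):
    """Return [(table name, CREATE statement)] for an SQLite file."""
    # A read-only URI avoids creating an empty database if the path is wrong.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        return conn.execute(
            "SELECT name, sql FROM sqlite_master "
            "WHERE type = 'table' ORDER BY name"
        ).fetchall()
```

Once the table names are known, ordinary SELECT queries can be used to look at the rows.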

