Internet Archive metadata

The Internet Archive stores some metadata with its archived files. Some of the files you may encounter alongside Internet Archive items:

Filename ending in _files.xml
 <format>Comic Book ZIP</format>
 <mtime>1355605150</mtime>
 <size>1221828790</size>
 <md5>21bf35870c1d94c8a0b18b315ad5008c</md5>
 <crc32>18e75ee6</crc32>
 <sha1>80893211dc53a1931e745d15005a6a40bf7d1422</sha1>

This has some basic data about the file in XML format, along with multiple checksums/hashes for data integrity checking.

format: Specifies the file format. Hopefully it's one you can look up on this site.

mtime: Unix timestamp (seconds since 1970-01-01 00:00:00 UTC) of the file's last modification

size: Size in bytes. Yes, the one in the example file above is over a gigabyte; those Internet Archive things can get huge!

md5: MD5 hash value

crc32: CRC-32 checksum

sha1: SHA-1 hash value
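
The fields above can be used to check a downloaded copy. What follows is a minimal sketch rather than an official tool: it assumes each entry is wrapped in a file element, and the wrapper, its name attribute, the tiny "hello" payload and the helper names (checksums, verify, mtime_to_utc) are all made up for the illustration.

```python
import hashlib
import zlib
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Hypothetical entry describing the 5-byte payload b"hello". The wrapper
# element and the file name are assumptions; only the mtime value is taken
# from the example above.
ENTRY_XML = """<file name="example.bin">
  <mtime>1355605150</mtime>
  <size>5</size>
  <md5>5d41402abc4b2a76b9719d911017c592</md5>
  <crc32>3610a686</crc32>
  <sha1>aaf4c61ddcc5e8a2dabede0f3b482cd9aea9434d</sha1>
</file>"""


def checksums(data: bytes) -> dict:
    """Compute size, MD5, CRC-32 and SHA-1 of data as lowercase strings."""
    return {
        "size": str(len(data)),
        "md5": hashlib.md5(data).hexdigest(),
        "crc32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


def verify(entry: ET.Element, data: bytes) -> dict:
    """Compare each recorded field with the value computed from data.

    Returns {field: True/False}; fields missing from the entry are skipped.
    """
    result = {}
    for field, value in checksums(data).items():
        recorded = entry.findtext(field)
        if recorded is not None:
            result[field] = recorded.strip().lower() == value
    return result


def mtime_to_utc(entry: ET.Element) -> datetime:
    """Interpret the mtime field as seconds since the Unix epoch, in UTC."""
    return datetime.fromtimestamp(int(entry.findtext("mtime")), tz=timezone.utc)


entry = ET.fromstring(ENTRY_XML)
print(verify(entry, b"hello"))
print(mtime_to_utc(entry).isoformat())  # 2012-12-15T20:59:10+00:00
```

For a file as large as the one in the example, it would be better to read it in chunks and feed each chunk to the hash objects (and to zlib.crc32 with a running value) instead of loading everything into memory.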

Filename ending in _meta.xml
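 <identifier>creativecomputing-1985-06</identifier>
 <date>1985-06</date>
 <language>eng</language>
 <mediatype>texts</mediatype>
 <title>Creative Computing Magazine (June 1985) Volume 11 Number 06</title>
 <collection>creativecomputing</collection>
 <publicdate>2012-12-15 20:58:26</publicdate>
 <uploader>jscott@archive.org</uploader>
 <addeddate>2012-12-15 20:58:26</addeddate>
 <collection>computermagazines</collection>
 <identifier-access>http://archive.org/details/creativecomputing-1985-06</identifier-access>
 <identifier-ark>ark:/13960/t3hx2n91q</identifier-ark>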

Various metadata for the file, also in XML format.

identifier: A unique identifier for the item

date: Original creation/release date of the item. YYYY-MM in this case; other items may use a different precision, such as YYYY or YYYY-MM-DD.

language: Language code of item (human language, not computer language!)

mediatype: texts for scanned books and magazines; other values include audio, movies, image and software

title: A (human-readable) title of the item

collection: Designates which group of items it is part of

publicdate: Date this archive version was made public

uploader: Identifies who uploaded it

addeddate: Date it was added (in this case the same as the public date, and 44 seconds before the mtime in the _files.xml example above, assuming both are UTC)

collection: A second collection element; an item can belong to more than one collection.

identifier-access: The URL where it can be found on Internet Archive

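identifier-ark: An ARK (Archival Resource Key), a persistent identifier in an open (not proprietary) scheme. The number after ark:/ (13960 here) is the Name Assigning Authority Number of the organization that issued it.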
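
Since a field such as collection can occur more than once, a reader of this file probably wants to keep every value rather than the last one seen. The sketch below shows one way to do that with the standard library. It assumes the fields sit directly under a single root element (called metadata here, which is an assumption), and load_meta is a hypothetical helper name; the sample is a shortened copy of the example above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Shortened sample; the root element name is an assumption.
SAMPLE = """<metadata>
  <identifier>creativecomputing-1985-06</identifier>
  <date>1985-06</date>
  <language>eng</language>
  <mediatype>texts</mediatype>
  <title>Creative Computing Magazine (June 1985) Volume 11 Number 06</title>
  <collection>creativecomputing</collection>
  <publicdate>2012-12-15 20:58:26</publicdate>
  <collection>computermagazines</collection>
  <identifier-ark>ark:/13960/t3hx2n91q</identifier-ark>
</metadata>"""


def load_meta(xml_text: str) -> dict:
    """Return {field: [values...]}, keeping repeated fields in document order."""
    fields = defaultdict(list)
    for child in ET.fromstring(xml_text):
        fields[child.tag].append((child.text or "").strip())
    return dict(fields)


meta = load_meta(SAMPLE)
print(meta["title"][0])
print(meta["collection"])  # ['creativecomputing', 'computermagazines']
```

Storing every field as a list keeps single-valued and multi-valued fields uniform, at the cost of a [0] when only one value is expected.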

Filename ending in _meta.txt
 ETag: "21bf35870c1d94c8a0b18b315ad5008c"
 accept: */*
 authorization: LOW D7JBDnum0KwPVX5x:REDACTED_BY_IA_S3
 connection: close
 content-length: 1221828790
 expect: 100-continue
 host: s3.us.archive.org
 user-agent: curl/7.21.6 (x86_64-pc-linux-gnu) libcurl/7.21.6 OpenSSL/1.0.0e zlib/1.2.3.4 libidn/1.22 librtmp/2.3
 x-amz-auto-make-bucket: 1
 x-archive-meta-date: 1985-06
 x-archive-meta-language: eng
 x-archive-meta-mediatype: texts
 x-archive-meta-title: Creative Computing Magazine (June 1985) Volume 11 Number 06
 x-archive-meta01-collection: creativecomputing
 x-upload-date: 2012-12-15T20:59:11.000Z

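Various metadata from the above files, organized in HTTP header format. Judging by the curl user-agent and the s3.us.archive.org host, these appear to be the request headers of the original upload: the x-archive-meta headers carry the same values as _meta.xml, while ETag and content-length match the md5 and size in _files.xml.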
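
Because the file is just "Name: value" lines, it can be read with a few lines of code. The sketch below is only an illustration: parse_headers and item_metadata are hypothetical helper names, the sample is a shortened copy of the example above, and treating a numbered name such as x-archive-meta01-collection as one of possibly several values of collection is an assumption about the naming scheme.

```python
import re

# Shortened copy of the example above, one header per line.
SAMPLE = """\
ETag: "21bf35870c1d94c8a0b18b315ad5008c"
content-length: 1221828790
x-archive-meta-date: 1985-06
x-archive-meta-language: eng
x-archive-meta-mediatype: texts
x-archive-meta-title: Creative Computing Magazine (June 1985) Volume 11 Number 06
x-archive-meta01-collection: creativecomputing
x-upload-date: 2012-12-15T20:59:11.000Z
"""

# Matches x-archive-meta-<field> and numbered forms like x-archive-meta01-<field>.
META_HEADER = re.compile(r"^x-archive-meta(\d*)-(.+)$", re.IGNORECASE)


def parse_headers(text: str) -> dict:
    """Split 'Name: value' lines into a dict keyed by lowercased header name."""
    headers = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def item_metadata(headers: dict) -> dict:
    """Collect the x-archive-meta* headers into {field: [values...]}."""
    fields = {}
    for name, value in headers.items():
        match = META_HEADER.match(name)
        if match:
            fields.setdefault(match.group(2), []).append(value)
    return fields


headers = parse_headers(SAMPLE)
meta = item_metadata(headers)
print(headers["etag"].strip('"'))  # 21bf35870c1d94c8a0b18b315ad5008c
print(meta["collection"])          # ['creativecomputing']
```

The ETag value, once its quotes are stripped, can be compared directly with the md5 field from _files.xml.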

Filename ending in _meta.sqlite
An SQLite version of the metadata.
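
The table layout of this file is not described here, so a reasonable first step is to ask SQLite itself what the file contains. The sketch below lists every table and its columns using Python's sqlite3 module; list_tables is a hypothetical helper name, and the demonstration runs on a throwaway database with a made-up table, not on a real _meta.sqlite.

```python
import os
import sqlite3
import tempfile


def list_tables(path: str) -> dict:
    """Return {table_name: [column names...]} for an SQLite database file."""
    # Note: sqlite3.connect creates an empty database if path does not exist.
    conn = sqlite3.connect(path)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        return {
            # Column 1 of each PRAGMA table_info row is the column name.
            name: [col[1] for col in conn.execute(f'PRAGMA table_info("{name}")')]
            for name in names
        }
    finally:
        conn.close()


# Demonstration on a temporary database; the table and its columns are
# invented for this example and say nothing about the real schema.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_meta.sqlite")
    demo = sqlite3.connect(demo_path)
    demo.execute("CREATE TABLE demo (k TEXT, v TEXT)")
    demo.commit()
    demo.close()
    tables = list_tables(demo_path)

print(tables)  # {'demo': ['k', 'v']}
```

Pointing list_tables at a downloaded _meta.sqlite should show which tables to query next.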

Links

 * Internet Archive Metadata (Official documentation)
 * The Internet Archive S3 API Documentation (one way of accessing Internet Archive materials)
 * Python interface to archive.org
 * Jim On IA Item Structure And Perms (WebEx Aug 2016)-.wmv: Internet Archive-produced "class" on items and metadata, apparently intended primarily for employees, but released to the public also
 * Data Mining the Internet Archive Collection