Internet Archive metadata

File Format
Name	Internet Archive metadata
Ontology	Electronic File Formats > Archiving > Internet Archive metadata

The Internet Archive stores some metadata with its archived files. Some of the files you may encounter alongside Internet Archive items:

Filename ending in _files.xml

<file name="Creative_Computing_v11_n06_1985_Jun.cbz" source="original">
<format>Comic Book ZIP</format>
<mtime>1355605150</mtime>
<size>1221828790</size>
<md5>21bf35870c1d94c8a0b18b315ad5008c</md5>
<crc32>18e75ee6</crc32>
<sha1>80893211dc53a1931e745d15005a6a40bf7d1422</sha1>
</file>

This has some basic data about the file in XML format, along with multiple checksums/hashes for data integrity checking.

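format: Specifies the file format. Hopefully it's one you can look up on this site.

mtime: Unix-style timestamp (seconds since 1970-01-01 00:00:00 UTC) of the file's last modification. The value in the example corresponds to 2012-12-15 20:59:10 UTC.

size: Size in bytes. Yes, the one in the example file above is over a gigabyte (about 1.22 GB); those Internet Archive things can get huge!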

md5: MD5 hash value

crc32: CRC-32 checksum

sha1: SHA-1 hash value
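The fields above are enough to check a downloaded copy against what the Archive recorded. The following is a minimal Python sketch, not an official tool: the function names (parse_file_entry, checksums, verify) are made up for illustration, and it assumes the crc32 field is the ordinary CRC-32 (as computed by zlib) written as eight lowercase hex digits.

```python
import hashlib
import zlib
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# The <file> element from the example above.
FILE_XML = """<file name="Creative_Computing_v11_n06_1985_Jun.cbz" source="original">
<format>Comic Book ZIP</format>
<mtime>1355605150</mtime>
<size>1221828790</size>
<md5>21bf35870c1d94c8a0b18b315ad5008c</md5>
<crc32>18e75ee6</crc32>
<sha1>80893211dc53a1931e745d15005a6a40bf7d1422</sha1>
</file>"""


def parse_file_entry(elem):
    """Turn one <file> element into a dict of its attributes and child fields."""
    entry = dict(elem.attrib)
    for child in elem:
        entry[child.tag] = (child.text or "").strip()
    return entry


def checksums(path, chunk_size=1 << 20):
    """Compute size, md5, sha1 and crc32 of a local file in a single pass."""
    md5, sha1, crc, size = hashlib.md5(), hashlib.sha1(), 0, 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
            crc = zlib.crc32(chunk, crc)
            size += len(chunk)
    return {
        "size": str(size),
        "md5": md5.hexdigest(),
        "sha1": sha1.hexdigest(),
        # Assumption: crc32 is stored as 8 lowercase hex digits.
        "crc32": format(crc & 0xFFFFFFFF, "08x"),
    }


def verify(entry, path):
    """Return the names of recorded fields that do not match the local file."""
    actual = checksums(path)
    return [k for k, v in actual.items() if k in entry and entry[k].lower() != v]


entry = parse_file_entry(ET.fromstring(FILE_XML))
modified = datetime.fromtimestamp(int(entry["mtime"]), tz=timezone.utc)
print(entry["name"], entry["format"], modified.isoformat())
```

With a downloaded copy on disk, verify(entry, "Creative_Computing_v11_n06_1985_Jun.cbz") would return an empty list if size and all three checksums agree, or the names of the fields that differ.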

Filename ending in _meta.xml

<metadata>
<identifier>creativecomputing-1985-06</identifier>
<date>1985-06</date>
<language>eng</language>
<mediatype>texts</mediatype>
<title>Creative Computing Magazine (June 1985) Volume 11 Number 06</title>
<collection>creativecomputing</collection>
<publicdate>2012-12-15 20:58:26</publicdate>
<uploader>jscott@archive.org</uploader>
<addeddate>2012-12-15 20:58:26</addeddate>
<collection>computermagazines</collection>
<identifier-access>http://archive.org/details/creativecomputing-1985-06</identifier-access>
<identifier-ark>ark:/13960/t3hx2n91q</identifier-ark>
</metadata>

Various metadata for the file, also in XML format.

identifier: A unique identifier for the item

date: Original creation/release date of the item. YYYY-MM in this case; other items may use just YYYY or a full YYYY-MM-DD.

language: Language code of the item (human language, not computer language!). Here it is the three-letter code eng, for English.

mediatype: texts for stuff like scanned books/magazines; other values cover things like audio and video (audio and movies, for example)

title: A (human-readable) title of the item

collection: Designates which group of items it is part of

publicdate: Date this archive version was made public

uploader: Identifies who uploaded it

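addeddate: Date it was added (in this case the same as the public date, and a little under a minute before the mtime in the _files.xml example above)

collection: A second collection entry. An item can be part of more than one collection, so this element may appear several times.

identifier-access: The URL where it can be found on Internet Archive

identifier-ark: Yet another identifier, this time an Archival Resource Key (ARK). ARK is an open persistent-identifier scheme rather than a proprietary protocol; the number after ark:/ identifies the organization that assigned the name, and the rest is the name itself.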
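Because collection can occur more than once, a reader that stores one value per tag would silently drop data. The sketch below shows one way to read the example in Python, keeping every value in a list; parse_meta is a hypothetical helper name, not part of any Internet Archive library.

```python
import xml.etree.ElementTree as ET

# The <metadata> example from above.
META_XML = """<metadata>
<identifier>creativecomputing-1985-06</identifier>
<date>1985-06</date>
<language>eng</language>
<mediatype>texts</mediatype>
<title>Creative Computing Magazine (June 1985) Volume 11 Number 06</title>
<collection>creativecomputing</collection>
<publicdate>2012-12-15 20:58:26</publicdate>
<uploader>jscott@archive.org</uploader>
<addeddate>2012-12-15 20:58:26</addeddate>
<collection>computermagazines</collection>
<identifier-access>http://archive.org/details/creativecomputing-1985-06</identifier-access>
<identifier-ark>ark:/13960/t3hx2n91q</identifier-ark>
</metadata>"""


def parse_meta(xml_text):
    """Map each tag to a list of values so repeated tags are not lost."""
    meta = {}
    for child in ET.fromstring(xml_text):
        meta.setdefault(child.tag, []).append((child.text or "").strip())
    return meta


meta = parse_meta(META_XML)
print(meta["title"][0])
print(meta["collection"])  # ['creativecomputing', 'computermagazines']

# Split the ARK into the assigning authority number and the assigned name.
_, authority, name = meta["identifier-ark"][0].split("/", 2)
print(authority, name)  # 13960 t3hx2n91q
```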

Filename ending in _meta.txt

ETag: "21bf35870c1d94c8a0b18b315ad5008c"
accept: */*
authorization: LOW D7JBDnum0KwPVX5x:REDACTED_BY_IA_S3
connection: close
content-length: 1221828790
expect: 100-continue
host: s3.us.archive.org
user-agent: curl/7.21.6 (x86_64-pc-linux-gnu) libcurl/7.21.6 OpenSSL/1.0.0e zlib/1.2.3.4 libidn/1.22 librtmp/2.3
x-amz-auto-make-bucket: 1
x-archive-meta-date: 1985-06
x-archive-meta-language: eng
x-archive-meta-mediatype: texts
x-archive-meta-title: Creative Computing Magazine (June 1985) Volume 11 Number 06
x-archive-meta01-collection: creativecomputing
x-upload-date: 2012-12-15T20:59:11.000Z

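Mostly a repeat of the metadata in the files above, laid out as HTTP headers. These look like the request headers from the original upload through the S3-like API: the ETag matches the md5 value in _files.xml, content-length matches size, and the x-archive-meta-* headers carry the fields that show up in _meta.xml.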
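A rough Python sketch of reading this format follows. It uses a shortened copy of the example, and it assumes (from this one sample only) that item metadata travels in headers named x-archive-meta-FIELD or x-archive-metaNN-FIELD; parse_headers and archive_meta are hypothetical helper names.

```python
import re

# A shortened copy of the example above.
META_TXT = """ETag: "21bf35870c1d94c8a0b18b315ad5008c"
content-length: 1221828790
host: s3.us.archive.org
x-archive-meta-date: 1985-06
x-archive-meta-language: eng
x-archive-meta-mediatype: texts
x-archive-meta-title: Creative Computing Magazine (June 1985) Volume 11 Number 06
x-archive-meta01-collection: creativecomputing
x-upload-date: 2012-12-15T20:59:11.000Z
"""

# Assumption: an optional number may follow "meta", as in meta01-collection.
META_HEADER = re.compile(r"^x-archive-meta(\d*)-(.+)$")


def parse_headers(text):
    """Split 'Name: value' lines into a dict keyed by lowercased header name."""
    headers = {}
    for line in text.splitlines():
        # Split on the first colon only; values such as dates contain colons too.
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def archive_meta(headers):
    """Group the x-archive-meta* headers by metadata field name."""
    fields = {}
    for name in sorted(headers):
        match = META_HEADER.match(name)
        if match:
            fields.setdefault(match.group(2), []).append(headers[name])
    return fields


headers = parse_headers(META_TXT)
print(headers["etag"].strip('"'))  # same value as <md5> in _files.xml
print(archive_meta(headers))
```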

Filename ending in _meta.sqlite

An SQLite version of the metadata.
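The table layout inside this file is not described here, so the Python sketch below makes no assumption about it: it only asks SQLite itself which tables and columns exist. list_tables is a hypothetical helper name, and the file name in the usage note is a guess based on the naming pattern of the other files.

```python
import sqlite3
import sys


def list_tables(db_path):
    """Return {table name: [column names]} for every table in the database."""
    # Note: sqlite3.connect creates an empty database if the path does not exist.
    con = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        tables = {}
        for name in names:
            # PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
            columns = con.execute('PRAGMA table_info("%s")' % name.replace('"', '""'))
            tables[name] = [col[1] for col in columns]
        return tables
    finally:
        con.close()


if __name__ == "__main__" and len(sys.argv) > 1:
    for table, columns in list_tables(sys.argv[1]).items():
        print(table, columns)
```

Saved as, say, list_tables.py, it could be run against a downloaded file such as creativecomputing-1985-06_meta.sqlite to see what the database actually contains before writing any queries.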

Links

The Internet Archive S3 API Documentation (one way of accessing Internet Archive materials)
Python interface to archive.org
Data Mining the Internet Archive Collection
