Internet Archive metadata
The Internet Archive stores some metadata with its archived files. Some of the files you may encounter alongside Internet Archive items:
Filename ending in _files.xml
<file name="Creative_Computing_v11_n06_1985_Jun.cbz" source="original">
  <format>Comic Book ZIP</format>
  <mtime>1355605150</mtime>
  <size>1221828790</size>
  <md5>21bf35870c1d94c8a0b18b315ad5008c</md5>
  <crc32>18e75ee6</crc32>
  <sha1>80893211dc53a1931e745d15005a6a40bf7d1422</sha1>
</file>
This has some basic data about the file in XML format, along with multiple checksums/hashes for data integrity checking.
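format: Specifies the file format. Hopefully it's one you can look up on this site.
mtime: Unix timestamp (seconds since 1970-01-01 00:00:00 UTC) of the file's last modification; the value in the example, 1355605150, is 2012-12-15 20:59:10 UTC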
size: Size in bytes. Yes, the one in the example file above is over a gigabyte; those Internet Archive things can get huge!
md5: MD5 hash value
crc32: CRC-32 checksum
sha1: SHA-1 hash value
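The checksum fields lend themselves to an integrity check on a downloaded file. The following is a minimal sketch, assuming the file entries sit inside a wrapping root element (called files here; the excerpt above shows only a single file entry) and that the hash values are lowercase hex. The function names parse_files_xml, compute_checksums, verify and mtime_to_utc are made up for this illustration.

```python
import hashlib
import zlib
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# The entry from the example above, wrapped in an assumed <files> root element.
SAMPLE = """<files>
  <file name="Creative_Computing_v11_n06_1985_Jun.cbz" source="original">
    <format>Comic Book ZIP</format>
    <mtime>1355605150</mtime>
    <size>1221828790</size>
    <md5>21bf35870c1d94c8a0b18b315ad5008c</md5>
    <crc32>18e75ee6</crc32>
    <sha1>80893211dc53a1931e745d15005a6a40bf7d1422</sha1>
  </file>
</files>"""


def parse_files_xml(xml_text):
    """Return {file name: {field: text}} for every <file> element."""
    root = ET.fromstring(xml_text)
    entries = {}
    for node in root.iter("file"):
        fields = {child.tag: (child.text or "").strip() for child in node}
        fields["source"] = node.get("source", "")
        entries[node.get("name")] = fields
    return entries


def compute_checksums(path, chunk_size=1 << 20):
    """Read the file in chunks so multi-gigabyte items do not fill memory."""
    md5, sha1, crc, size = hashlib.md5(), hashlib.sha1(), 0, 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
            crc = zlib.crc32(chunk, crc)
            size += len(chunk)
    return {
        "size": str(size),
        "md5": md5.hexdigest(),
        "sha1": sha1.hexdigest(),
        "crc32": format(crc & 0xFFFFFFFF, "08x"),
    }


def verify(path, expected):
    """Compare a local file against one parsed entry, field by field."""
    actual = compute_checksums(path)
    return {key: actual[key] == expected.get(key, "").lower() for key in actual}


def mtime_to_utc(mtime):
    """Convert the mtime field (seconds since the Unix epoch) to a UTC datetime."""
    return datetime.fromtimestamp(int(mtime), tz=timezone.utc)


entries = parse_files_xml(SAMPLE)
entry = entries["Creative_Computing_v11_n06_1985_Jun.cbz"]
print(mtime_to_utc(entry["mtime"]))  # 2012-12-15 20:59:10+00:00
```

Calling verify("Creative_Computing_v11_n06_1985_Jun.cbz", entry) on a downloaded copy would return a dictionary with one True or False per field, which makes it easy to see whether only the size or also the hashes disagree.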
Filename ending in _meta.xml
<metadata>
  <identifier>creativecomputing-1985-06</identifier>
  <date>1985-06</date>
  <language>eng</language>
  <mediatype>texts</mediatype>
  <title>Creative Computing Magazine (June 1985) Volume 11 Number 06</title>
  <collection>creativecomputing</collection>
  <publicdate>2012-12-15 20:58:26</publicdate>
  <uploader>jscott@archive.org</uploader>
  <addeddate>2012-12-15 20:58:26</addeddate>
  <collection>computermagazines</collection>
  <identifier-access>http://archive.org/details/creativecomputing-1985-06</identifier-access>
  <identifier-ark>ark:/13960/t3hx2n91q</identifier-ark>
</metadata>
Descriptive metadata for the item as a whole (rather than for an individual file), also in XML format.
identifier: A unique identifier for the item
date: Original creation/release date of item. YYYY-MM in this case; other items may use a more or less precise form, such as YYYY or YYYY-MM-DD.
language: Language code of item (human language, not computer language!)
mediatype: texts for things like scanned books and magazines; other values include audio, movies, image, and software
title: A (human-readable) title of the item
collection: Designates which group of items it is part of
publicdate: Date this archive version was made public
uploader: Identifies who uploaded it
addeddate: Date and time it was added (in this case the same as the public date, and, if read as UTC, 44 seconds before the mtime timestamp in the _files.xml example above)
collection: A second collection; the element can be repeated, so an item can be part of more than one collection.
identifier-access: The URL where it can be found on Internet Archive
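identifier-ark: An ARK (Archival Resource Key), a persistent identifier scheme used by libraries and archives. It is an open scheme, not a proprietary one; the number after "ark:/" (13960 here) identifies the organization that assigned the name, and the rest is the name it assigned to this item.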
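Because an element such as collection can appear more than once, a parser that stores one value per tag would silently drop data. Here is a minimal sketch that collects every tag into a list instead, and that also works out the gap between addeddate and the mtime from the _files.xml example. It assumes addeddate is in UTC, which the file itself does not state; parse_meta_xml and seconds_after_added are names invented for this illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from datetime import datetime, timezone

SAMPLE_META = """<metadata>
  <identifier>creativecomputing-1985-06</identifier>
  <date>1985-06</date>
  <language>eng</language>
  <mediatype>texts</mediatype>
  <title>Creative Computing Magazine (June 1985) Volume 11 Number 06</title>
  <collection>creativecomputing</collection>
  <publicdate>2012-12-15 20:58:26</publicdate>
  <uploader>jscott@archive.org</uploader>
  <addeddate>2012-12-15 20:58:26</addeddate>
  <collection>computermagazines</collection>
  <identifier-access>http://archive.org/details/creativecomputing-1985-06</identifier-access>
  <identifier-ark>ark:/13960/t3hx2n91q</identifier-ark>
</metadata>"""


def parse_meta_xml(xml_text):
    """Return {tag: [values...]} so that repeated tags keep all their values."""
    root = ET.fromstring(xml_text)
    fields = defaultdict(list)
    for child in root:
        fields[child.tag].append((child.text or "").strip())
    return dict(fields)


def seconds_after_added(addeddate, mtime):
    """Seconds from addeddate to a Unix mtime, ASSUMING addeddate is in UTC."""
    added = datetime.strptime(addeddate, "%Y-%m-%d %H:%M:%S")
    added = added.replace(tzinfo=timezone.utc)
    return int(mtime) - int(added.timestamp())


meta = parse_meta_xml(SAMPLE_META)
print(meta["collection"])  # ['creativecomputing', 'computermagazines']
print(seconds_after_added(meta["addeddate"][0], 1355605150))  # 44
```

Keeping every value as a list, even for tags that occur once, means calling code never has to check whether it got a string or a list.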
Filename ending in _meta.txt
ETag: "21bf35870c1d94c8a0b18b315ad5008c"
accept: */*
authorization: LOW D7JBDnum0KwPVX5x:REDACTED_BY_IA_S3
connection: close
content-length: 1221828790
expect: 100-continue
host: s3.us.archive.org
user-agent: curl/7.21.6 (x86_64-pc-linux-gnu) libcurl/7.21.6 OpenSSL/1.0.0e zlib/1.2.3.4 libidn/1.22 librtmp/2.3
x-amz-auto-make-bucket: 1
x-archive-meta-date: 1985-06
x-archive-meta-language: eng
x-archive-meta-mediatype: texts
x-archive-meta-title: Creative Computing Magazine (June 1985) Volume 11 Number 06
x-archive-meta01-collection: creativecomputing
x-upload-date: 2012-12-15T20:59:11.000Z
These look like the HTTP request headers from the original upload, which in this example was made with curl against the Internet Archive's S3-style endpoint (s3.us.archive.org). The x-archive-meta-* headers carry the same fields that appear in _meta.xml, while the ETag and content-length values match the md5 and size in _files.xml.
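The item metadata can be recovered from these headers by stripping the x-archive-meta prefix. The sketch below assumes the file holds one "name: value" header per line, and that a numbered prefix such as x-archive-meta01- is a way of supplying several values for the same field; parse_meta_txt and archive_metadata are hypothetical helper names, and only a subset of the headers is included in the sample.

```python
import re

SAMPLE_TXT = """ETag: "21bf35870c1d94c8a0b18b315ad5008c"
content-length: 1221828790
host: s3.us.archive.org
x-archive-meta-date: 1985-06
x-archive-meta-language: eng
x-archive-meta-mediatype: texts
x-archive-meta-title: Creative Computing Magazine (June 1985) Volume 11 Number 06
x-archive-meta01-collection: creativecomputing
x-upload-date: 2012-12-15T20:59:11.000Z"""

# Matches "x-archive-meta-<field>" and "x-archive-meta<digits>-<field>".
META_HEADER = re.compile(r"x-archive-meta(\d*)-(.+)")


def parse_meta_txt(text):
    """Return {lowercased header name: value}; a repeated name keeps its last value."""
    headers = {}
    for line in text.splitlines():
        # Split on the first colon only, since values may contain colons too.
        name, sep, value = line.partition(":")
        if sep and name.strip():
            headers[name.strip().lower()] = value.strip()
    return headers


def archive_metadata(headers):
    """Collect x-archive-meta* headers into {field: [values...]}."""
    fields = {}
    for name in sorted(headers):
        match = META_HEADER.fullmatch(name)
        if match:
            fields.setdefault(match.group(2), []).append(headers[name])
    return fields


headers = parse_meta_txt(SAMPLE_TXT)
print(headers["etag"].strip('"'))   # 21bf35870c1d94c8a0b18b315ad5008c
print(archive_metadata(headers)["collection"])  # ['creativecomputing']
```

Stripping the quotes from the ETag gives a value that can be compared directly with the md5 field in _files.xml.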
Filename ending in _meta.sqlite
An SQLite version of the metadata.
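The table layout of this file is not described here, so the safest first step is to ask the database itself what it contains. This is a minimal sketch using Python's standard sqlite3 module; it makes no assumption about table or column names, and describe_sqlite is a name made up for this illustration.

```python
import sqlite3
from pathlib import Path


def describe_sqlite(path):
    """Return {table name: [column names...]} for an SQLite file, opened read-only."""
    # A file: URI with mode=ro avoids creating or modifying the file by accident.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        layout = {}
        for table in tables:
            # PRAGMA table_info yields one row per column; index 1 is the column name.
            columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            layout[table] = [column[1] for column in columns]
        return layout
    finally:
        conn.close()
```

Calling describe_sqlite on a downloaded _meta.sqlite file would list its tables and columns, after which ordinary SELECT queries can be written against whatever layout turns up.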
Links
- The Internet Archive S3 API Documentation (one way of accessing Internet Archive materials)
- Python interface to archive.org
- Data Mining the Internet Archive Collection