Internet Archive metadata

From Just Solve the File Format Problem
 
== Links ==
* [http://anonymoushash.vmbrasseur.com/2014/01/05/announcing-documentation-for-the-internet-archive-s3-api/ The Internet Archive S3 API Documentation] (one way of accessing Internet Archive materials)
* [https://pypi.python.org/pypi/internetarchive/0.5.4 Python interface to archive.org]
* [http://programminghistorian.org/lessons/data-mining-the-internet-archive Data Mining the Internet Archive Collection]

[[Category:Metadata]]
[[Category:Internet Archive]]

Revision as of 02:10, 1 May 2014

File Format
Name Internet Archive metadata
Ontology

The Internet Archive stores some metadata with its archived files. Some of the files you may encounter alongside Internet Archive items:

Filename ending in _files.xml

<file name="Creative_Computing_v11_n06_1985_Jun.cbz" source="original">
<format>Comic Book ZIP</format>
<mtime>1355605150</mtime>
<size>1221828790</size>
<md5>21bf35870c1d94c8a0b18b315ad5008c</md5>
<crc32>18e75ee6</crc32>
<sha1>80893211dc53a1931e745d15005a6a40bf7d1422</sha1>
</file>

This has some basic data about the file in XML format, along with multiple checksums/hashes for data integrity checking.

format: Specifies the file format. Hopefully it's one you can look up on this site.

mtime: Last-modified time of the file, as a Unix timestamp (seconds since 1970-01-01 00:00:00 UTC). The value in the example above corresponds to 2012-12-15 20:59:10 UTC.

size: Size in bytes. Yes, the example file above is about 1.2 GB; Internet Archive items can get huge!

md5: MD5 hash value

crc32: CRC-32 checksum

sha1: SHA-1 hash value
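The fields above are enough to check a downloaded file against its recorded entry. The following is a minimal sketch in Python, not an official tool: the function names are hypothetical, and it assumes that the crc32 value is the standard zlib CRC-32 written as eight lowercase hex digits. It parses the sample entry shown earlier and converts its mtime to a UTC date.

```python
import hashlib
import zlib
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# The <file> element from the example above.
SAMPLE = """<file name="Creative_Computing_v11_n06_1985_Jun.cbz" source="original">
<format>Comic Book ZIP</format>
<mtime>1355605150</mtime>
<size>1221828790</size>
<md5>21bf35870c1d94c8a0b18b315ad5008c</md5>
<crc32>18e75ee6</crc32>
<sha1>80893211dc53a1931e745d15005a6a40bf7d1422</sha1>
</file>"""


def parse_file_entry(xml_text):
    """Turn one <file> element into a dict of its attributes and child values."""
    elem = ET.fromstring(xml_text)
    entry = dict(elem.attrib)
    for child in elem:
        entry[child.tag] = (child.text or "").strip()
    return entry


def compute_checksums(path, chunk_size=1 << 20):
    """Read the file once, in chunks, and return size, md5, crc32 and sha1."""
    md5, sha1, crc, size = hashlib.md5(), hashlib.sha1(), 0, 0
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
            crc = zlib.crc32(chunk, crc)
            size += len(chunk)
    return {
        "size": str(size),
        "md5": md5.hexdigest(),
        # Assumption: crc32 is stored as 8 lowercase hex digits.
        "crc32": format(crc & 0xFFFFFFFF, "08x"),
        "sha1": sha1.hexdigest(),
    }


def verify(path, entry):
    """Return the names of the fields whose recorded value differs from the file."""
    actual = compute_checksums(path)
    return [k for k, v in actual.items() if k in entry and entry[k].lower() != v]


entry = parse_file_entry(SAMPLE)
modified = datetime.fromtimestamp(int(entry["mtime"]), tz=timezone.utc)
print(entry["name"], entry["format"], modified.isoformat())
```

Calling verify(path, entry) on a local copy of the file returns an empty list when size and all three checksums agree. Reading in chunks matters here because files of this size should not be loaded into memory in one piece.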

Filename ending in _meta.xml

<metadata>
<identifier>creativecomputing-1985-06</identifier>
<date>1985-06</date>
<language>eng</language>
<mediatype>texts</mediatype>
<title>Creative Computing Magazine (June 1985) Volume 11 Number 06</title>
<collection>creativecomputing</collection>
<publicdate>2012-12-15 20:58:26</publicdate>
<uploader>jscott@archive.org</uploader>
<addeddate>2012-12-15 20:58:26</addeddate>
<collection>computermagazines</collection>
<identifier-access>http://archive.org/details/creativecomputing-1985-06</identifier-access>
<identifier-ark>ark:/13960/t3hx2n91q</identifier-ark>
</metadata>

Various metadata for the item as a whole (rather than for one file in it), also in XML format.

identifier: A unique identifier for the item

date: Original creation or release date of the item. YYYY-MM in this case; other items may give just YYYY or a full YYYY-MM-DD.

language: Language code of item (human language, not computer language!)

mediatype: Broad type of the item. texts covers scanned books and magazines; other values include audio, movies, image, and software.

title: A (human-readable) title of the item

collection: Designates which group of items it is part of

publicdate: Date this archive version was made public

uploader: Identifies who uploaded it

addeddate: Date and time the item was added (in this case the same as publicdate, and about 44 seconds before the mtime in the _files.xml example above, if the value is in UTC)

collection: A second collection. The element can repeat, because an item can be part of more than one collection.

identifier-access: The URL where it can be found on Internet Archive

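identifier-ark: An Archival Resource Key (ARK), a persistent identifier for the item. ARK is an open identifier scheme rather than a proprietary protocol; the number after ark:/ (13960 here) identifies the organization that assigned it.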
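Because collection appears twice, code that reads this file should not assume one value per tag. The sketch below (the function name is hypothetical) parses the example above with Python's standard library and keeps every value of every element in a list.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The <metadata> element from the example above.
SAMPLE_META = """<metadata>
<identifier>creativecomputing-1985-06</identifier>
<date>1985-06</date>
<language>eng</language>
<mediatype>texts</mediatype>
<title>Creative Computing Magazine (June 1985) Volume 11 Number 06</title>
<collection>creativecomputing</collection>
<publicdate>2012-12-15 20:58:26</publicdate>
<uploader>jscott@archive.org</uploader>
<addeddate>2012-12-15 20:58:26</addeddate>
<collection>computermagazines</collection>
<identifier-access>http://archive.org/details/creativecomputing-1985-06</identifier-access>
<identifier-ark>ark:/13960/t3hx2n91q</identifier-ark>
</metadata>"""


def parse_meta(xml_text):
    """Map each tag to the list of its values, in document order."""
    fields = defaultdict(list)
    for child in ET.fromstring(xml_text):
        fields[child.tag].append((child.text or "").strip())
    return dict(fields)


meta = parse_meta(SAMPLE_META)
print(meta["identifier"][0])
print(meta["collection"])
```

Single-valued fields such as identifier come back as one-element lists, so callers index with [0]; that keeps the handling of repeated and non-repeated tags uniform.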

Filename ending in _meta.txt

ETag: "21bf35870c1d94c8a0b18b315ad5008c"
accept: */*
authorization: LOW D7JBDnum0KwPVX5x:REDACTED_BY_IA_S3
connection: close
content-length: 1221828790
expect: 100-continue
host: s3.us.archive.org
user-agent: curl/7.21.6 (x86_64-pc-linux-gnu) libcurl/7.21.6 OpenSSL/1.0.0e zlib/1.2.3.4 libidn/1.22 librtmp/2.3
x-amz-auto-make-bucket: 1
x-archive-meta-date: 1985-06
x-archive-meta-language: eng
x-archive-meta-mediatype: texts
x-archive-meta-title: Creative Computing Magazine (June 1985) Volume 11 Number 06
x-archive-meta01-collection: creativecomputing
x-upload-date: 2012-12-15T20:59:11.000Z

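The HTTP request headers from the upload of the item through the Internet Archive's S3-style interface (note the host and user-agent lines). The x-archive-meta-* headers carry the same values that appear in the _meta.xml file above, the ETag matches the md5 in the _files.xml entry, and content-length matches its size.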
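A file in this layout can be read line by line as name: value pairs. The sketch below uses a subset of the example headers and pulls out the x-archive-meta fields. The function names are hypothetical, and it assumes that a numbered prefix such as x-archive-meta01- is a way of sending several values for the same field, so the number is used only for ordering.

```python
import re

# A subset of the header lines from the example above.
SAMPLE_HEADERS = '''ETag: "21bf35870c1d94c8a0b18b315ad5008c"
content-length: 1221828790
host: s3.us.archive.org
x-archive-meta-date: 1985-06
x-archive-meta-language: eng
x-archive-meta-mediatype: texts
x-archive-meta-title: Creative Computing Magazine (June 1985) Volume 11 Number 06
x-archive-meta01-collection: creativecomputing
x-upload-date: 2012-12-15T20:59:11.000Z'''

# Matches x-archive-meta-<field> and x-archive-meta<NN>-<field>.
META_HEADER = re.compile(r"^x-archive-meta(\d*)-(.+)$")


def parse_headers(text):
    """Split 'name: value' lines on the first colon; names are lower-cased."""
    headers = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def archive_metadata(headers):
    """Collect x-archive-meta headers into {field: [values]}, ordered by index."""
    found = []
    for name, value in headers.items():
        match = META_HEADER.match(name)
        if match:
            index = int(match.group(1) or 0)
            found.append((match.group(2), index, value))
    result = {}
    for field, _, value in sorted(found, key=lambda item: (item[0], item[1])):
        result.setdefault(field, []).append(value)
    return result


headers = parse_headers(SAMPLE_HEADERS)
fields = archive_metadata(headers)
print(fields["collection"], fields["date"])
print(headers["etag"].strip('"'))
```

Splitting on the first colon only is deliberate: values such as the x-upload-date timestamp contain colons of their own.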

Filename ending in _meta.sqlite

An SQLite version of the metadata.
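The table layout of this file is not described here, so a reasonable first step is to ask the database itself what it contains. The sketch below (the function name is hypothetical, and nothing is assumed about the schema) lists every table together with the statement that created it, using Python's built-in sqlite3 module.

```python
import sqlite3


def list_tables(db_path):
    """Return {table_name: CREATE statement} for every table in the database."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT name, sql FROM sqlite_master "
            "WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return dict(rows)
```

Pass it the path of a downloaded _meta.sqlite file. Note that sqlite3.connect creates an empty database if the path does not exist, so check the path first.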

Links
