Internet Archive metadata
Latest revision as of 09:32, 10 September 2019
The Internet Archive stores some metadata with its archived files. Some of the files you may encounter alongside Internet Archive items:
Filename ending in _files.xml
<file name="Creative_Computing_v11_n06_1985_Jun.cbz" source="original">
  <format>Comic Book ZIP</format>
  <mtime>1355605150</mtime>
  <size>1221828790</size>
  <md5>21bf35870c1d94c8a0b18b315ad5008c</md5>
  <crc32>18e75ee6</crc32>
  <sha1>80893211dc53a1931e745d15005a6a40bf7d1422</sha1>
</file>
This has some basic data about the file in XML format, along with multiple checksums/hashes for data integrity checking.
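format: Specifies the file format. Hopefully it's one you can look up on this site.
mtime: Unix-style timestamp of the file (seconds since 1970-01-01 00:00:00 UTC); the value in the example above is 2012-12-15 20:59:10 UTC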
size: Size in bytes. Yes, the one in the example file above is over a gigabyte; those Internet Archive things can get huge!
md5: MD5 hash value
crc32: CRC-32 checksum
sha1: SHA-1 hash value
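The fields above can be used to check a downloaded file against its _files.xml entry. The following is a minimal Python sketch, not an official tool: the `<files>` wrapper element, the function names, and the assumption that crc32 is stored as eight lowercase hex digits (as in the example) are illustrative guesses rather than documented behavior.

```python
import hashlib
import zlib
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Entry copied from the example above; the <files> wrapper is an assumption.
FILES_XML = """<files>
  <file name="Creative_Computing_v11_n06_1985_Jun.cbz" source="original">
    <format>Comic Book ZIP</format>
    <mtime>1355605150</mtime>
    <size>1221828790</size>
    <md5>21bf35870c1d94c8a0b18b315ad5008c</md5>
    <crc32>18e75ee6</crc32>
    <sha1>80893211dc53a1931e745d15005a6a40bf7d1422</sha1>
  </file>
</files>"""


def parse_files_xml(text):
    """Return {file name: {field: value}} for every <file> element."""
    root = ET.fromstring(text)
    entries = {}
    for node in root.iter("file"):
        fields = {child.tag: (child.text or "").strip() for child in node}
        fields["source"] = node.get("source")
        entries[node.get("name")] = fields
    return entries


def compute_checksums(data):
    """Compute the three checksums as lowercase hex strings."""
    return {
        "md5": hashlib.md5(data).hexdigest(),
        # Assumed format: unsigned 32-bit value, zero-padded to 8 hex digits.
        "crc32": format(zlib.crc32(data) & 0xFFFFFFFF, "08x"),
        "sha1": hashlib.sha1(data).hexdigest(),
    }


def verify(data, fields):
    """True if the size and every recorded checksum match the given bytes."""
    if int(fields["size"]) != len(data):
        return False
    computed = compute_checksums(data)
    return all(fields[k].lower() == v for k, v in computed.items() if k in fields)


entry = parse_files_xml(FILES_XML)["Creative_Computing_v11_n06_1985_Jun.cbz"]
# mtime is a Unix timestamp, so it converts directly to a UTC datetime.
modified = datetime.fromtimestamp(int(entry["mtime"]), tz=timezone.utc)
print(entry["format"], modified.isoformat())
```

`verify` reads the whole file into memory, which is impractical for a gigabyte-sized item like the one above; a real checker would feed the hash objects in chunks.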
Filename ending in _meta.xml
<metadata>
  <identifier>creativecomputing-1985-06</identifier>
  <date>1985-06</date>
  <language>eng</language>
  <mediatype>texts</mediatype>
  <title>Creative Computing Magazine (June 1985) Volume 11 Number 06</title>
  <collection>creativecomputing</collection>
  <publicdate>2012-12-15 20:58:26</publicdate>
  <uploader>jscott@archive.org</uploader>
  <addeddate>2012-12-15 20:58:26</addeddate>
  <collection>computermagazines</collection>
  <identifier-access>http://archive.org/details/creativecomputing-1985-06</identifier-access>
  <identifier-ark>ark:/13960/t3hx2n91q</identifier-ark>
</metadata>
Various metadata for the item as a whole (rather than for one file within it), also in XML format.
identifier: A unique identifier for the item
date: Original creation/release date of item. YYYY-MM in this case; other items use other ISO 8601 precisions such as YYYY or YYYY-MM-DD.
language: Language code of item (human language, not computer language!)
mediatype: texts for stuff like scanned books/magazines; other values include audio, movies, image, software and data
title: A (human-readable) title of the item
collection: Designates which group of items it is part of
publicdate: Date this archive version was made public
uploader: Identifies who uploaded it
addeddate: Date it was added (in this case the same as the public date, and, assuming the date is in UTC, 44 seconds before the mtime in the _files.xml example above)
collection: Another collection; an item can be part of more than one of them, so this element can repeat.
identifier-access: The URL where it can be found on Internet Archive
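identifier-ark: An ARK (Archival Resource Key) for the item. ARK is an open persistent-identifier scheme rather than a proprietary protocol; the number after ark:/ (here 13960) identifies the organization that assigned the name.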
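Because some elements such as collection can appear more than once, a reader of _meta.xml should not assume one value per tag. The following Python sketch shows one way to handle that, using the example above as input; the function names are illustrative, not part of any Internet Archive library.

```python
import xml.etree.ElementTree as ET

# Copied from the example above.
META_XML = """<metadata>
  <identifier>creativecomputing-1985-06</identifier>
  <date>1985-06</date>
  <language>eng</language>
  <mediatype>texts</mediatype>
  <title>Creative Computing Magazine (June 1985) Volume 11 Number 06</title>
  <collection>creativecomputing</collection>
  <publicdate>2012-12-15 20:58:26</publicdate>
  <uploader>jscott@archive.org</uploader>
  <addeddate>2012-12-15 20:58:26</addeddate>
  <collection>computermagazines</collection>
  <identifier-access>http://archive.org/details/creativecomputing-1985-06</identifier-access>
  <identifier-ark>ark:/13960/t3hx2n91q</identifier-ark>
</metadata>"""


def parse_meta_xml(text):
    """Map each tag to a list of its values, in document order."""
    root = ET.fromstring(text)
    fields = {}
    for child in root:
        fields.setdefault(child.tag, []).append((child.text or "").strip())
    return fields


def first_value(fields, tag, default=None):
    """Convenience accessor for tags expected to occur once."""
    values = fields.get(tag)
    return values[0] if values else default


meta = parse_meta_xml(META_XML)
print(first_value(meta, "title"))
print(meta["collection"])
```

Storing every field as a list keeps both collection values; a plain tag-to-string dictionary would silently keep only the last one.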
Filename ending in _meta.txt
ETag: "21bf35870c1d94c8a0b18b315ad5008c"
accept: */*
authorization: LOW D7JBDnum0KwPVX5x:REDACTED_BY_IA_S3
connection: close
content-length: 1221828790
expect: 100-continue
host: s3.us.archive.org
user-agent: curl/7.21.6 (x86_64-pc-linux-gnu) libcurl/7.21.6 OpenSSL/1.0.0e zlib/1.2.3.4 libidn/1.22 librtmp/2.3
x-amz-auto-make-bucket: 1
x-archive-meta-date: 1985-06
x-archive-meta-language: eng
x-archive-meta-mediatype: texts
x-archive-meta-title: Creative Computing Magazine (June 1985) Volume 11 Number 06
x-archive-meta01-collection: creativecomputing
x-upload-date: 2012-12-15T20:59:11.000Z
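Various metadata from the above files, organized in HTTP header format. Judging by the host, user-agent and x-archive-meta-* lines, these appear to be headers recorded from the S3 API upload that created the item.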
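The header layout is simple enough to read with a few lines of code. The following Python sketch assumes the file holds one "name: value" pair per line, and treats both x-archive-meta-NAME and numbered x-archive-metaNN-NAME headers as item metadata; both points are inferred from the example rather than taken from a specification, and the function names are hypothetical.

```python
import re

# A subset of the example above, one header per line (assumed layout).
META_TXT = """ETag: "21bf35870c1d94c8a0b18b315ad5008c"
authorization: LOW D7JBDnum0KwPVX5x:REDACTED_BY_IA_S3
content-length: 1221828790
host: s3.us.archive.org
x-archive-meta-date: 1985-06
x-archive-meta-language: eng
x-archive-meta-mediatype: texts
x-archive-meta-title: Creative Computing Magazine (June 1985) Volume 11 Number 06
x-archive-meta01-collection: creativecomputing
x-upload-date: 2012-12-15T20:59:11.000Z"""

# Matches x-archive-meta-NAME and x-archive-metaNN-NAME.
META_HEADER = re.compile(r"^x-archive-meta(\d*)-(.+)$")


def parse_meta_txt(text):
    """Return {lowercased header name: value}, splitting on the first colon."""
    headers = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip blank or malformed lines
        headers[name.strip().lower()] = value.strip()
    return headers


def archive_metadata(headers):
    """Collect the x-archive-meta* headers as {field: [values]}."""
    meta = {}
    for name, value in headers.items():
        match = META_HEADER.match(name)
        if match:
            meta.setdefault(match.group(2), []).append(value)
    return meta


headers = parse_meta_txt(META_TXT)
item_meta = archive_metadata(headers)
print(item_meta)
```

Splitting on the first colon only matters for values that contain colons themselves, such as the authorization and x-upload-date lines.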
Filename ending in _meta.sqlite
An SQLite version of the metadata.
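This page does not describe the tables inside the SQLite file, so the safest first step is to ask the database itself what it contains. The following Python sketch lists each table and its columns using only the standard sqlite3 module; the function name, the demo file name, and the demo table are hypothetical and say nothing about the real layout of an Internet Archive _meta.sqlite file.

```python
import os
import sqlite3
import tempfile


def list_tables(path):
    """Return {table name: [column names]} for the SQLite database at path."""
    conn = sqlite3.connect(path)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        layout = {}
        for name in names:
            # Quote the identifier so unusual table names do not break the PRAGMA.
            quoted = '"' + name.replace('"', '""') + '"'
            # Index 1 of each PRAGMA table_info row is the column name.
            layout[name] = [col[1] for col in
                            conn.execute(f"PRAGMA table_info({quoted})")]
        return layout
    finally:
        conn.close()


# Demo on a throwaway database; the "demo" table is made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "example_meta.sqlite")
    conn = sqlite3.connect(demo_path)
    conn.execute("CREATE TABLE demo (key TEXT, value TEXT)")
    conn.commit()
    conn.close()
    demo_layout = list_tables(demo_path)
    print(demo_layout)
```

Note that sqlite3.connect creates an empty database if the path does not exist, so check that the file is present before pointing `list_tables` at a real download.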
See also
- Web Archive Metadata File
Links
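- Internet Archive Metadata (Official documentation): https://archive.org/services/docs/api/metadata-schema/index.html
- The Internet Archive S3 API Documentation (one way of accessing Internet Archive materials): http://anonymoushash.vmbrasseur.com/2014/01/05/announcing-documentation-for-the-internet-archive-s3-api/
- Python interface to archive.org: https://pypi.python.org/pypi/internetarchive/0.5.4
- Jim On IA Item Structure And Perms (WebEx Aug 2016)-.wmv: Internet Archive-produced "class" on items and metadata, apparently intended primarily for employees, but also released to the public: https://archive.org/download/ArchiveAcademyTranscoded_201704/Jim%20On%20IA%20Item%20Structure%20And%20Perms%20%28WebEx%20Aug%202016%29-.wmv
- Data Mining the Internet Archive Collection: http://programminghistorian.org/lessons/data-mining-the-internet-archive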