EPUB

Description
ePub is an open e-book format defined by the International Digital Publishing Forum (IDPF), formerly known as the Open eBook Forum. It is based on XHTML and XML along with optional CSS stylesheets. Its predecessor was the OEB standard.

Quoted from the IDPF web site:


 * '.epub' is the file extension of an XML format for reflowable digital books and publications. '.epub' is composed of three open standards, the Open Publication Structure (OPS), Open Packaging Format (OPF) and Open Container Format (OCF), produced by the IDPF. '.epub' allows publishers to produce and send a single digital publication file through distribution and offers consumers interoperability between software/hardware for unencrypted reflowable digital books and other publications. The Open eBook Publication Structure or 'OEB', originally produced in 1999, is the precursor to OPS.

The intent of ePub is to serve both as a source file format and an end user format. For this reason the files are collected into a container for easy dissemination and use. This container is a zip file whose extension has been renamed to .epub. It has one special requirement: it must include an uncompressed file named "mimetype", while the rest of the data in the file may be compressed. An ePub reader should be capable of reading the content in its compressed format. http://wiki.mobileread.com/wiki/EPUB
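
The container layout described above can be sketched with Python's standard zipfile module. This is a minimal illustration, not a validator: the function names are hypothetical, the member list is made up, and the rule that "mimetype" is the first entry and contains application/epub+zip comes from the OCF specification rather than from the text above.

```python
import zipfile

# Media type string the OCF spec requires inside the "mimetype" entry.
EPUB_MIMETYPE = b"application/epub+zip"


def write_minimal_container(fileobj, members):
    """Write an EPUB-style ZIP: stored 'mimetype' first, then deflated members.

    `members` maps archive names to str/bytes content (hypothetical input).
    """
    with zipfile.ZipFile(fileobj, "w") as zf:
        # The mimetype entry is written first and left uncompressed.
        zf.writestr(zipfile.ZipInfo("mimetype"), EPUB_MIMETYPE,
                    compress_type=zipfile.ZIP_STORED)
        for name, data in members.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)


def check_mimetype_entry(fileobj):
    """Return a list of problems found with the mimetype entry (empty if none)."""
    problems = []
    with zipfile.ZipFile(fileobj) as zf:
        infos = zf.infolist()
        if not infos or infos[0].filename != "mimetype":
            return ["first entry is not 'mimetype'"]
        first = infos[0]
        if first.compress_type != zipfile.ZIP_STORED:
            problems.append("'mimetype' is compressed")
        if zf.read(first) != EPUB_MIMETYPE:
            problems.append("unexpected 'mimetype' content")
    return problems
```

Calling check_mimetype_entry on a path or file object returns an empty list when the entry looks right. It says nothing about the rest of the package, for which a full validator such as ePubCheck is the better tool.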

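The Amazon Kindle does not read ePub files natively (its own formats include MOBI and the variants AZW / AZW3), but since 2022 Amazon has accepted ePub as an input format and converts it for the device, while phasing out MOBI uploads.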

Version 2
The IDPF specification page contains the specifications for this format. In particular, check the version 2.0.1 OPS and OPF specifications and the version 1.0.1 OCF specification. The informational documents are also quite useful in understanding the standard's intent and content.

An alternative XML syntax called DTBook was available as an option in this version, but this was removed in EPUB 3.

Version 3
ePub version 3 is the current major version of the standard; EPUB 3.0 was approved by the IDPF as a Recommended Specification in October 2011.

In version 2.0.1 there were three defining documents: the OPF (Open Packaging Format), the OCF (Open Container Format), and the OPS (Open Publication Structure). The OPS referenced a DAISY standard for the NCX file. The 3.0 standard has four defining documents with new names. The OPF becomes EPUB Publications. The OCF keeps its name, and the OPS, which received the most changes, becomes EPUB Content Documents. This now also covers navigation, replacing the old NCX specification, which is no longer used. A fourth document, Media Overlays, covers a feature that is new in ePub version 3. http://wiki.mobileread.com/wiki/EPub_3

Digital Rights Management & Encryption
When preserving ePub files, it is important to know what if any rights restrictions and encryptions are present, so that action can be taken to ensure the content can still be accessed in the future. Unfortunately, there are currently no reliable tools we can use to automate this analysis. A general introduction to copy protection and ePub can be found here.

Adobe attracted controversy early in 2014 by announcing a "new, improved" DRM scheme for their version of EPUB files, which has the "feature" of being incompatible with the many e-readers that support their old DRM.

How it works
According to section 3.2 ("OCF ZIP Container") of the EPUB 3 spec. (section 4 in the 2.0.1 spec.): "Conforming OCF ZIP Containers MUST NOT use the encryption features defined by the ZIP format..." i.e. the ZIP container of a valid EPUB can always be opened and its entries listed regardless of the DRM scheme in use, even though individual resources inside it may be encrypted.

Section 2.5.5 ("Rights Management"), or 3.5.6 in the 2.0.1 spec, states that: "An OPTIONAL file with the name “rights.xml” within the “META-INF” directory at the root level of the container file system is a reserved name in a valid OCF container." It adds: "The rights.xml file MUST NOT be encrypted." and "When the rights.xml file is not present, the OCF container provides no information indicating any part of the container is rights governed." That is, the presence of the META-INF/rights.xml file is an indicator that DRM is likely in use (however, its absence is apparently not an indication of the contrary).

Section 2.5.2 ("Encryption"), or 3.5.5 in the 2.0.1 spec, states that: "...if any resource within the container is encrypted, “encryption.xml” MUST be present to indicate that the resource is encrypted and provide information on how it is encrypted." Consequently the existence of this file indicates that one or more encrypted files exist within the EPUB. However, as the rights information is essentially decoupled from the encryption scheme, an EPUB can be:


 * Encrypted but not strictly governed by rights management.
 * Governed by rights management but not encrypted.
 * Both encrypted and governed by rights management.

In the latter two cases, though, the rights governance may not be clear from the EPUB structure.
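
The presence checks described in this section can be sketched as follows. This is only a rough first-pass indicator built on the two reserved file names quoted above; the function name and the wording of the summaries are hypothetical, and, as noted, the absence of rights.xml does not prove that a file is free of rights restrictions.

```python
import zipfile

# Reserved names inside the OCF container, as quoted from the spec above.
RIGHTS_XML = "META-INF/rights.xml"
ENCRYPTION_XML = "META-INF/encryption.xml"


def protection_hints(epub):
    """Report which of the two reserved files exist in an EPUB container.

    `epub` is a path or a binary file object. The summary strings are
    illustrative labels, not terms taken from the specification.
    """
    with zipfile.ZipFile(epub) as zf:
        names = set(zf.namelist())

    has_rights = RIGHTS_XML in names
    has_encryption = ENCRYPTION_XML in names

    if has_rights and has_encryption:
        summary = "encrypted and rights-governed"
    elif has_encryption:
        summary = "encrypted; no rights file"
    elif has_rights:
        summary = "rights-governed; no encryption file"
    else:
        summary = "no rights or encryption file in container"

    return {
        "rights_xml": has_rights,
        "encryption_xml": has_encryption,
        "summary": summary,
    }
```

Because the ZIP container itself must not be encrypted, listing the entry names like this should work on any valid EPUB, whatever DRM scheme is applied to the resources inside it.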

Furthermore, note that not all encryption schemes are 'bad', i.e. requiring individual private key information - some are mere 'obfuscation' (some samples with obfuscated fonts can be found here: http://idpf.github.io/epub3-samples/samples.html). Any reliable risk identification tool will have to take this into account.
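
One way a tool might separate obfuscation from real encryption is to read the algorithm identifiers declared in META-INF/encryption.xml. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the two algorithm URIs treated as font obfuscation (the IDPF and Adobe font-mangling identifiers) are not taken from the text above. Any other algorithm is simply reported as not recognised as obfuscation, which is not the same as confirming DRM.

```python
import xml.etree.ElementTree as ET
import zipfile

# XML Encryption namespace used by the elements in encryption.xml.
ENC_NS = "{http://www.w3.org/2001/04/xmlenc#}"

# Assumption: identifiers treated here as font obfuscation, not DRM.
OBFUSCATION_ALGORITHMS = {
    "http://www.idpf.org/2008/embedding",  # IDPF font obfuscation
    "http://ns.adobe.com/pdf/enc#RC",      # Adobe font obfuscation
}


def encrypted_resources(epub):
    """List (resource URI, algorithm, is_obfuscation) for each declared entry.

    Returns an empty list when the container has no encryption.xml.
    """
    with zipfile.ZipFile(epub) as zf:
        if "META-INF/encryption.xml" not in zf.namelist():
            return []
        root = ET.fromstring(zf.read("META-INF/encryption.xml"))

    results = []
    for data in root.iter(ENC_NS + "EncryptedData"):
        method = data.find(ENC_NS + "EncryptionMethod")
        ref = data.find(f"{ENC_NS}CipherData/{ENC_NS}CipherReference")
        algorithm = method.get("Algorithm") if method is not None else None
        uri = ref.get("URI") if ref is not None else None
        results.append((uri, algorithm, algorithm in OBFUSCATION_ALGORITHMS))
    return results
```

An EPUB in which every listed resource is flagged as obfuscation is probably low risk for preservation purposes, while any other algorithm calls for closer inspection. ElementTree is not hardened against malicious XML, so untrusted files may warrant a safer parser.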

Digital watermarks
Commercial e-books often have embedded "watermarks" that allow a particular copy to be traced to the person who originally purchased it. Several schemes have been used, most controversially ones that actually alter the text of the book to distinguish copies (possibly harming their literary quality in the process). Other schemes simply involve embedded numbers or images in places where they aren't very noticeable. One variety uses bar codes embedded in data: URLs.


 * What an e-book watermark looks like (external link)

Software

 * Cool Reader
 * Azardi ePub3 Reader
 * Readium - Open source library for reading ePub versions 2 and 3 (including Google Chrome Web App)
 * ePubCheck - ePub validator (supports versions 2 and 3)
 * Python wrappers for EpubCheck
 * libepubgen: EPUB generator for librevenge framework
 * Pandoc: Document format conversion swiss-army knife
 * Sigil: multi-platform EPUB editor
 * Ebooklib: Python library for reading/writing EPUB, including EPUB 3
 * calibre

Online utilities

 * Online ePub validator (based on ePubCheck)
 * EPUB validator / best-practice checker (requires registration to access)
 * epubtest: Online EPUB resources

Sample files

 * ePub 3 samples - intended to demonstrate features of ePub 3
 * Azardi ePub 3 samples
 * Azardi Fixed Layout ePub 3 samples
 * Guy de Maupassant Short Stories in 5 formats - includes ePub 3
 * ePub testsuite - Apparently by IDPF, under construction and so far without documentation (status October 2013)
 * Hindawi sample articles (publisher of open-access scientific journals, most of which offer content in ePub format)
 * Journal of Neuroinflammation - Open access scientific journal that offers content in ePub format
 * ePubs from Lippincott Williams & Wilkins journals
 * Homeland, by Cory Doctorow
 * EPub test suite by IDPF, includes scripts for making one's own sample files
 * Example file of DTBook variant of EPUB 2
 * ePub KB policy testing examples
 * https://telparia.com/fileFormatSamples/document/epub/

Links

 * EPUB for archival preservation - Blog post with link to report (2012) by KB / National Library of the Netherlands
 * EPUB for archival preservation: an update - Update (2013) to KB report
 * How to package an epub file using InfoZip
 * Policy-based assessment of EPUB with Epubcheck
 * The Best EPUB reader for Windows?
 * Advancing Portable Documents for the Open Web Platform: EPUB-WEB (W3C White Paper)
 * The Convergence of EPUB and the Web (Presentation)
 * EPUB Format Preservation Assessment
 * “Radical” changes in EPUB 3.1
 * The future of EPUB? A first look at the EPUB 3.1 Editor’s draft
 * (Current) Fixed Layout eBooks Considered Harmful
 * Library of Congress preservation status: EPUB 3.0.1
 * EPUB (Electronic publication) Version 3 Preservation