ePub is an open format defined by the Open eBook Forum of the International Digital Publishing Forum (IDPF). It is based on XHTML and XML along with optional CSS stylesheets. Its predecessor was the OEB standard.
Quoted from the IDPF web site:
- '.epub' is the file extension of an XML format for reflowable digital books and publications. '.epub' is composed of three open standards, the Open Publication Structure (OPS), Open Packaging Format (OPF) and Open Container Format (OCF), produced by the IDPF. '.epub' allows publishers to produce and send a single digital publication file through distribution and offers consumers interoperability between software/hardware for unencrypted reflowable digital books and other publications. The Open eBook Publication Structure or 'OEB', originally produced in 1999, is the precursor to OPS.
The intent of ePub is to serve both as a source file format and an end user format. For this reason the files are collected into a container for easy dissemination and use. This container is generally a zip file whose extension has been renamed to .epub. It has one special requirement: it must include an uncompressed mimetype file, while the rest of the data in the file is normally compressed. An ePub reader should be capable of reading the content in its compressed format. http://wiki.mobileread.com/wiki/EPUB
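The container requirement described above can be checked with a few lines of Python. This is a minimal sketch, not a full validator. The function name check_mimetype_entry is made up for illustration, and two of the checks rest on assumptions drawn from the OCF specification rather than from the text above: that the mimetype file is the first entry in the archive, and that its content is exactly application/epub+zip.

```python
import zipfile

# Assumed from the OCF specification: the exact content of the mimetype file.
EPUB_MIMETYPE = b"application/epub+zip"


def check_mimetype_entry(source):
    """Return a list of problems found with the container's mimetype entry.

    `source` is a path or a binary file-like object, as accepted by
    zipfile.ZipFile. An empty list means no problem was found.
    """
    problems = []
    with zipfile.ZipFile(source) as zf:
        entries = zf.infolist()
        # Assumption: the mimetype file has to be the first archive entry.
        if not entries or entries[0].filename != "mimetype":
            problems.append("mimetype is not the first entry in the archive")
            return problems
        info = entries[0]
        # ZIP_STORED means the entry was written without compression.
        if info.compress_type != zipfile.ZIP_STORED:
            problems.append("mimetype entry is compressed")
        if zf.read(info) != EPUB_MIMETYPE:
            problems.append("mimetype content is not application/epub+zip")
    return problems
```

Calling check_mimetype_entry("book.epub") on a well-formed file should return an empty list; the path here is only a placeholder.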
The IDPF specification page contains the specifications for this format. In particular check the version 2.0.1 OPS and OPF specifications and the version 1.01 OCF specifications. The informational documents are also quite useful in understanding the standard's intent and content.
ePub version 3 is the newest version of the standard and has now been recommended by the IDPF standards committee.
In version 2.0.1 there were three defining documents: the OPF (Open Packaging Format), the OCF (Open Container Format), and the OPS (Open Publication Structure). The OPS referenced a DAISY standard for the NCX file. The new 3.0 standard has four defining documents with new names. The OPF becomes the ePub Publications standard. The OCF remains the same, and the OPS, which received the most changes, becomes ePub Content Documents. This document now also covers what the old NCX specification handled; the NCX itself is no longer used. A fourth document is concerned with Media Overlays, a new feature of ePub version 3. http://wiki.mobileread.com/wiki/EPub_3
Digital Rights Management & Encryption
When preserving ePub files, it is important to know what, if any, rights restrictions and encryption are present, so that action can be taken to ensure the content can still be accessed in the future. Unfortunately, there are currently no reliable tools we can use to automate this analysis. A general introduction to copy protection and ePub can be found here.
How it works
According to section 3.2 ("OCF ZIP Container") of the EPUB 3 spec. (section 4 in the 2.0.1 spec.):
"Conforming OCF ZIP Containers MUST NOT use the encryption features defined by the ZIP format..."
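i.e. the ZIP container of a valid EPUB can always be opened and its entries listed, regardless of the DRM scheme in use. Any encryption is applied to individual resources inside the container, not to the archive itself.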
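Whether a container breaks this rule can be checked from the ZIP metadata alone. The sketch below uses Python's standard zipfile module and relies on the ZIP convention that bit 0 of an entry's general purpose flags marks it as encrypted. The function name zip_encrypted_entries is hypothetical.

```python
import zipfile


def zip_encrypted_entries(source):
    """Return the names of entries that use ZIP-level encryption.

    `source` is a path or a binary file-like object. Bit 0 of the general
    purpose flag field marks an entry as encrypted by the ZIP format itself;
    a conforming OCF container should yield an empty list.
    """
    with zipfile.ZipFile(source) as zf:
        return [info.filename for info in zf.infolist() if info.flag_bits & 0x1]
```

This only inspects the archive's directory, so it works without any password and without decompressing any content.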
Section 2.5.5 ("Rights Management"), 3.5.6 in the 2.0.1 spec, states that:
"An OPTIONAL file with the name “rights.xml” within the “META-INF” directory at the root level of the container file system is a reserved name in a valid OCF container."
"The rights.xml file MUST NOT be encrypted."
"When the rights.xml file is not present, the OCF container provides no information indicating any part of the container is rights governed."
i.e. the presence of the META-INF/rights.xml file is an indicator that DRM is likely in use (however, its absence is apparently not an indication of the contrary).
Section 2.5.2 ("Encryption") - 3.5.5 in 2.0.1 - states that:
"...if any resource within the container is encrypted, “encryption.xml” MUST be present to indicate that the resource is encrypted and provide information on how it is encrypted."
Consequently, the existence of this file indicates that one or more encrypted files exist within the EPUB. However, as the rights information is essentially decoupled from the encryption scheme, an EPUB may be any of the following:
- Encrypted but not strictly governed by rights management.
- Governed by rights management but not encrypted.
- Both encrypted and governed by rights management.
Though in these latter two cases the rights governance may not be clear from the EPUB structure.
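The combinations above can be told apart, at least at the level of what the container declares, by looking for the two reserved files in META-INF. The following is a rough sketch only: the function name classify_protection and the label strings are made up for illustration, and as noted, a missing rights.xml does not prove that no rights management applies.

```python
import zipfile


def classify_protection(source):
    """Label an EPUB by which reserved META-INF files it contains.

    `source` is a path or a binary file-like object. The result describes
    only what the container declares, not what actually governs the content.
    """
    with zipfile.ZipFile(source) as zf:
        names = set(zf.namelist())
    has_rights = "META-INF/rights.xml" in names
    has_encryption = "META-INF/encryption.xml" in names
    if has_rights and has_encryption:
        return "encrypted and rights-governed"
    if has_encryption:
        return "encrypted, no rights information declared"
    if has_rights:
        return "rights-governed, no encryption declared"
    return "no encryption or rights information declared"
```

A preservation workflow could run this over a collection to flag files that need closer manual inspection.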
Furthermore, note that not all encryption schemes are 'bad', i.e. require individual private key information; some are mere 'obfuscation' (some samples with obfuscated fonts can be found here: http://code.google.com/p/epub-samples/wiki/SamplesListing). Any reliable risk identification tool will have to take this into account.
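One way a tool might separate obfuscation from real encryption is to read the algorithm identifiers listed in META-INF/encryption.xml. The sketch below assumes that the file follows the W3C XML Encryption vocabulary (EncryptedData, EncryptionMethod, CipherReference). It also assumes that font obfuscation is identified by the two algorithm URIs listed in the code, which come from my understanding of the IDPF and Adobe schemes and should be verified against the specifications. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace of the W3C XML Encryption elements used inside encryption.xml.
ENC_NS = "http://www.w3.org/2001/04/xmlenc#"

# Assumed identifiers for font obfuscation (not taken from the text above).
OBFUSCATION_ALGORITHMS = {
    "http://www.idpf.org/2008/embedding",  # IDPF font obfuscation
    "http://ns.adobe.com/pdf/enc#RC",      # Adobe font obfuscation
}


def encrypted_resources(source):
    """Map each resource listed in encryption.xml to its algorithm URI.

    Returns an empty dict when the container has no encryption.xml.
    """
    with zipfile.ZipFile(source) as zf:
        try:
            raw = zf.read("META-INF/encryption.xml")
        except KeyError:  # the file is absent from the archive
            return {}
    result = {}
    for data in ET.fromstring(raw).iter(f"{{{ENC_NS}}}EncryptedData"):
        method = data.find(f"{{{ENC_NS}}}EncryptionMethod")
        ref = data.find(f"{{{ENC_NS}}}CipherData/{{{ENC_NS}}}CipherReference")
        if ref is not None:
            algorithm = method.get("Algorithm") if method is not None else None
            result[ref.get("URI")] = algorithm
    return result


def only_obfuscated(source):
    """True if every listed resource uses a known obfuscation algorithm."""
    algorithms = set(encrypted_resources(source).values())
    return bool(algorithms) and algorithms <= OBFUSCATION_ALGORITHMS
```

Any algorithm outside the obfuscation set is treated as real encryption here, which errs on the side of flagging a file for review.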
Commercial e-books often have embedded "watermarks" that allow a particular copy to be traced to the person who originally purchased it. Several schemes have been used, most controversially ones that actually alter the text of the book to distinguish copies (possibly harming their literary quality in the process). Other schemes simply involve embedded numbers or images in places where they aren't very noticeable. One variety uses bar codes embedded in file: URLs.