PDF

PDF (Portable Document Format), based on PostScript and originally developed by Adobe, has many subsets.

In addition to the 'full function' specification, ISO 32000-1:2008 (equivalent to PDF 1.7), there are PDF/X, PDF/A, PDF/E, PDF/VT and PDF/UA, all of which are ISO specifications.

PDF profiles (formalized subsets) include the following:


 * PDF/A (optimized for preservation)
   * PDF/A-1 (ISO 19005-1:2005)
   * PDF/A-2 (ISO 19005-2:2011)
   * PDF/A-3 (ISO 19005-3:2012) (extends PDF/A-2 by allowing embedded files of any type)
 * PDF/E (ISO 24517-1:2008) (for engineering workflows)
 * PDF/UA (ISO 14289-1) (making documents accessible through assistive technologies)
 * PDF/VT (ISO 16612-2) (support for variable document printing)
 * PDF/X (support for prepress graphics exchange)
   * PDF/X-1 (ISO 15930-1:2001)
   * PDF/X-1a (ISO 15930-4:2003)
   * PDF/X-2 (ISO 15930-5:2003)
   * PDF/X-3 (ISO 15930-6:2003)
 * Tagged PDF

Also see: extension PDF (this shows PRONOM codes for the various versions)

Identification
The majority of PDF files can be identified by a fixed header, e.g. "%PDF-1.4"; however, older documents show a number of variations.
 * Some can start with "%!PS-Adobe-N.n PDF-M.m" instead, as described here.
 * After PDF 1.7, Adobe kept the major and minor version numbers fixed and identified later releases by extension level; the public version from Adobe after 1.7 was "1.7 Adobe Extension Level 3".
 * For the PDF/A family of formats, conformance is declared via an embedded XMP metadata fragment.
 * Some older files from Mac OS may be wrapped in the AppleSingle/AppleDouble formats. This is a general issue, so should perhaps be documented elsewhere. For more information, see:
   * http://en.wikipedia.org/wiki/AppleSingle_and_AppleDouble_formats
   * http://tools.ietf.org/rfc/rfc1740.txt
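
The header and PDF/A checks described above can be sketched in a few lines of Python. This is a rough illustration, not a validator: the function names are hypothetical, the 1024-byte search window is an assumption about how lenient a reader should be, and the PDF/A check only finds XMP packets stored uncompressed. It assumes the XMP uses the pdfaid:part and pdfaid:conformance properties, serialized either as attributes or as elements, and it reports only what the file claims, not whether the file actually conforms.

```python
import re

# "%PDF-M.m" is the usual header; "%!PS-Adobe-N.n PDF-M.m" is the older variant.
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")
OLD_HEADER_RE = re.compile(rb"%!PS-Adobe-\d\.\d PDF-(\d)\.(\d)")
# pdfaid:part / pdfaid:conformance may appear as XML attributes or as elements.
PDFA_PART_RE = re.compile(rb"pdfaid:part\s*(?:=\s*[\"']|>)\s*(\d)")
PDFA_CONF_RE = re.compile(rb"pdfaid:conformance\s*(?:=\s*[\"']|>)\s*([A-Za-z])")


def sniff_pdf_version(data: bytes, window: int = 1024):
    """Return (major, minor) from the header, or None if no PDF header is found.

    Only the first `window` bytes are searched (an assumed tolerance for
    leading junk, not a rule taken from this page).
    """
    head = data[:window]
    match = OLD_HEADER_RE.search(head) or HEADER_RE.search(head)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def sniff_pdfa_claim(data: bytes):
    """Return a string such as '1b' if uncompressed XMP declares PDF/A, else None."""
    part = PDFA_PART_RE.search(data)
    if part is None:
        return None
    conf = PDFA_CONF_RE.search(data)
    level = conf.group(1).decode("ascii").lower() if conf else ""
    return part.group(1).decode("ascii") + level
```

Typical use would be to read the file as bytes, e.g. `data = open(path, "rb").read()`, and pass the result to both functions. Files wrapped in AppleSingle/AppleDouble would need unwrapping first.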

Compression
Images in PDF documents may use the following compression schemes:
 * LZW
 * Flate (zlib)
 * RunLength
 * CCITTFax (CCITT Group 3 and CCITT Group 4)
 * JBIG2
 * DCT (JPEG)
 * JPX (part of the JPEG2000 standard)
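
In the file itself, each of these schemes appears as a filter name in a stream dictionary (for example /FlateDecode or /DCTDecode). The Python sketch below counts those names in the raw bytes as a quick survey of which schemes a file uses. It is deliberately crude and the function name is hypothetical: the filters apply to any stream, not only images, and the scan will miss names written with #-escapes, abbreviated names in inline images, and dictionaries stored inside compressed object streams.

```python
import re

# Filter names as written in PDF stream dictionaries, mapped to the schemes above.
FILTERS = {
    b"LZWDecode": "LZW",
    b"FlateDecode": "Flate (zlib)",
    b"RunLengthDecode": "RunLength",
    b"CCITTFaxDecode": "CCITTFax",
    b"JBIG2Decode": "JBIG2",
    b"DCTDecode": "DCT (JPEG)",
    b"JPXDecode": "JPX (JPEG 2000)",
}
FILTER_RE = re.compile(rb"/(" + b"|".join(FILTERS) + rb")\b")


def count_filters(data: bytes) -> dict:
    """Count occurrences of each known compression filter name in raw PDF bytes."""
    counts = {}
    for match in FILTER_RE.finditer(data):
        scheme = FILTERS[match.group(1)]
        counts[scheme] = counts.get(scheme, 0) + 1
    return counts
```

A full PDF parser would give more reliable results, since it can resolve object streams and tell image streams apart from other content.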

Digital Rights Management & Encryption
PDF has two types of 'encryption': a 'user' password limits the ability to open the document, and a 'creator' password (called the 'owner' password in the PDF specification) limits other rights, such as printing and copying. The former case, where a password is required to open the file, is the main preservation concern, as our users will not be able to open a PDF encrypted in this way (unless the password can be cracked, which may be problematic both technically and legally).

However, the latter case also causes problems, because the PDF is encrypted here too, but with a special known user password of "" (an empty string, which is not the same as no password). The document is therefore encrypted in both cases, and you can only tell which is which by attempting to decrypt the PDF using the default password "". Some PDF analysis tools (notably JHOVE) do not implement the relevant decryption workflow, and so cannot distinguish between the two types of encryption.

An example of the decryption test workflow can be found here: https://gist.github.com/anjackson/5237071
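
For illustration only, the empty-password test might look like the following in Python. This is not the linked gist, just a sketch under some assumptions: it relies on the third-party pypdf library being installed (AES-encrypted files may also need an extra cryptography backend), and the function names and result strings are made up for this example.

```python
def classify_encryption(is_encrypted: bool, empty_password_opens: bool) -> str:
    """Map the two observations onto the cases described above."""
    if not is_encrypted:
        return "not encrypted"
    if empty_password_opens:
        return "restrictions only (empty user password)"
    return "password required to open"


def check_pdf(path: str) -> str:
    """Classify a PDF file by trying to decrypt it with the empty password."""
    # Assumption: pypdf is available; it is imported here so that
    # classify_encryption() can be used without it.
    from pypdf import PdfReader

    reader = PdfReader(path)
    if not reader.is_encrypted:
        return classify_encryption(False, False)
    # decrypt() returns a falsy value when the password is rejected.
    opened = bool(reader.decrypt(""))
    return classify_encryption(True, opened)
```

Only files reported as "password required to open" are unreadable without the password; the "restrictions only" files can still be opened and rendered.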

Software

 * Adobe Reader views PDF files, either as a standalone program or a browser plugin.
 * Firefox 19.0 and later include a built-in PDF viewer.
 * Tabula: convert tabular data in PDFs to CSV
 * mPDF: convert HTML to PDF

Sample files

 * PDF Cabinet of Horrors - sample PDF files in corrupted or otherwise problematic formats
 * Adobe PDF Test Suites - various PDF test suites on Adobe Acrobat Engineering site

Other links

 * Jailbreaking the PDF (discussion)
 * Jailbreaking the PDF (technical aspects: glyph processing)
 * Jailbreaking the PDF hackathon