PDF

Portable Document Format (PDF) is a document file format originally from Adobe, based on PostScript. It has many subsets.

As well as the 'full function' ISO 32000-1:2008 (or PDF 1.7), there are also PDF/X, PDF/A, PDF/E, PDF/VT and PDF/UA, all of which are ISO specifications.

PDF profiles (formalized subsets) include the following:


 * PDF/A (optimized for preservation)
   * PDF/A-1 (ISO 19005-1:2005)
   * PDF/A-2 (ISO 19005-2:2011)
   * PDF/A-3 (ISO 19005-3:2012) (extends PDF/A-2 by allowing embedded files of any type)
 * PDF/E (ISO 24517-1:2008) (for engineering workflows)
 * PDF/UA (ISO 14289-1) (making documents accessible through assistive technologies)
 * PDF/VT (ISO 16612-2) (support for variable document printing)
 * PDF/X (support for prepress graphics exchange)
   * PDF/X-1 (ISO 15930-1:2001)
   * PDF/X-1a (ISO 15930-4:2003)
   * PDF/X-2 (ISO 15930-5:2003)
   * PDF/X-3 (ISO 15930-6:2003)
 * Tagged PDF

Identification
The majority of PDF files can be identified by a fixed header, e.g. "%PDF-1.4". However, older documents show a number of variations.
 * Some can start with "%!PS-Adobe-N.n PDF-M.m" instead, as described here.
 * Since PDF 1.7, the major and minor version numbers have been fixed; for example, the public version from Adobe after 1.7 was "1.7 Adobe Extension Level 3".
 * For the PDF/A families of formats, their conformance is declared via an embedded (XMP) metadata fragment.
 * Some older files from Mac OS may be wrapped in the AppleSingle/AppleDouble formats. This is a general issue, so it should perhaps be documented elsewhere. For more information, see:
   * http://en.wikipedia.org/wiki/AppleSingle_and_AppleDouble_formats
   * http://tools.ietf.org/rfc/rfc1740.txt
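The header checks above can be sketched in a few lines of Python. This is a minimal illustration, not a validator: the function names sniff_pdf_version and sniff_pdfa_claim are hypothetical, the header is only matched at the very start of the data, and the PDF/A check assumes the XMP packet is stored uncompressed and uses the pdfaid:part and pdfaid:conformance properties. A match only shows what the file claims, not that it actually conforms.

```python
import re

# The two header forms described above; N.n and M.m are single-digit pairs.
_HEADER_RE = re.compile(
    rb"(?:%PDF-(\d)\.(\d)|%!PS-Adobe-\d\.\d PDF-(\d)\.(\d))"
)

# PDF/A identification properties as they appear in an XMP packet, either as
# XML attributes (pdfaid:part="1") or as elements (<pdfaid:part>1</pdfaid:part>).
_PDFA_PART_RE = re.compile(rb'pdfaid:part\s*(?:=\s*["\x27]|>)\s*(\d)')
_PDFA_CONF_RE = re.compile(rb'pdfaid:conformance\s*(?:=\s*["\x27]|>)\s*([A-Za-z])')


def sniff_pdf_version(head):
    """Return the 'M.m' version declared at the very start of `head`, or None."""
    m = _HEADER_RE.match(head)
    if m is None:
        return None
    # Exactly one of the two alternatives matched, so pick its groups.
    major = m.group(1) or m.group(3)
    minor = m.group(2) or m.group(4)
    return (major + b"." + minor).decode("ascii")


def sniff_pdfa_claim(data):
    """Return (part, conformance) claimed in uncompressed XMP, or None."""
    part = _PDFA_PART_RE.search(data)
    if part is None:
        return None
    conf = _PDFA_CONF_RE.search(data)
    level = conf.group(1).decode("ascii").upper() if conf else None
    return int(part.group(1)), level
```

In practice the first function would be given the first few bytes read from a file opened in binary mode, and the second the whole file contents. Files wrapped in AppleSingle/AppleDouble would need unwrapping first, which this sketch does not attempt.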

Compression
Images in PDF documents may use the following compression schemes:
 * LZW
 * Flate (zlib)
 * RunLength
 * CCITTFax (CCITT Group 3 and CCITT Group 4)
 * JBIG2
 * DCT (JPEG)
 * JPX (part of the JPEG 2000 standard)
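Each scheme in the list above corresponds to a stream filter name in the PDF syntax (for example /FlateDecode or /DCTDecode). As a rough sketch, the following Python counts those names in the raw bytes of a file. The function name count_filter_names is hypothetical, and the approach is only a heuristic: it does not parse the PDF, it counts filters on any stream (not just images), and it will miss names that sit inside compressed object streams.

```python
import re
from collections import Counter

# Filter names from the PDF syntax, mapped to the schemes listed above.
FILTER_NAMES = {
    b"LZWDecode": "LZW",
    b"FlateDecode": "Flate (zlib)",
    b"RunLengthDecode": "RunLength",
    b"CCITTFaxDecode": "CCITTFax",
    b"JBIG2Decode": "JBIG2",
    b"DCTDecode": "DCT (JPEG)",
    b"JPXDecode": "JPX (JPEG 2000)",
}

# A PDF name object ending in "Decode", e.g. /FlateDecode.
_NAME_RE = re.compile(rb"/([A-Za-z0-9]+Decode)\b")


def count_filter_names(data):
    """Count occurrences of the known compression filter names in raw bytes."""
    counts = Counter()
    for m in _NAME_RE.finditer(data):
        label = FILTER_NAMES.get(m.group(1))
        if label is not None:  # ignore filters not in the list, e.g. ASCII85Decode
            counts[label] += 1
    return counts
```

A result such as a high JBIG2 or JPX count can be a useful first hint about which decoders a renderer will need, but a proper PDF parser is required for a reliable inventory.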

Digital Rights Management & Encryption
PDF has two types of 'encryption': a 'user' password limits the ability to open the document, and an 'owner' (or 'creator') password limits other rights, such as printing and copying. The former case, where a password is required to open the file, is the main preservation concern, as our users will not be able to open a PDF encrypted in this way (unless the password can be cracked, which may be problematic both technically and legally).

However, the latter case also causes problems, because the PDF is encrypted here too, but with a special known user password of "" (an empty string, which is not the same as no password). The document is therefore encrypted in both cases, and the only way to tell them apart is to attempt to decrypt the PDF using the default password "". Some PDF analysis tools (notably JHOVE) do not implement the relevant decryption workflow, and so cannot distinguish between the two types of encryption.

An example of the decryption test workflow can be found here: https://gist.github.com/anjackson/5237071
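The rights restricted by the owner password (printing, copying and so on) are recorded in the encryption dictionary as a signed 32-bit integer, /P, in which each permission is a single bit. The following Python is a small sketch of decoding that value, assuming /P has already been read from the file by some other means. The function name decode_permissions and the short labels are hypothetical; the bit positions follow the standard security handler's permission table, with bit 1 as the least significant bit. It does not perform the decryption test itself.

```python
# Bit positions (1 = least significant) of the user access permissions in /P.
PERMISSION_BITS = {
    3: "print",
    4: "modify",
    5: "copy",
    6: "annotate",
    9: "fill_forms",
    10: "extract_for_accessibility",
    11: "assemble",
    12: "print_high_quality",
}


def decode_permissions(p):
    """Map each permission label to True (allowed) or False (restricted)."""
    flags = p & 0xFFFFFFFF  # view the signed value as 32 unsigned bits
    return {
        name: bool((flags >> (bit - 1)) & 1)
        for bit, name in PERMISSION_BITS.items()
    }
```

For example, decode_permissions(-4) reports every right as allowed, while decode_permissions(-3904) reports every right as restricted. These flags are only a request to conforming viewers; whether a file can be opened at all still depends on the user password test described above.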

Some of the most locked-up PDFs anywhere can be found at the ANSI IBR Standards Portal, which has made certain standards documents that are incorporated into legislation available for browsing, but only through a convoluted procedure that involves downloading a special plug-in and filling out a registration form that must be completed again in every browsing session.

A "Protected PDF" (PPDF) format is reportedly used by Microsoft's Azure Rights Management Service for sharing files securely within a workgroup.

Specifications

 * Adobe PDF References - Contains links to every version of the PDF Reference published by Adobe (starting with PDF 1.0) as well as associated errata, addenda and tech notes.
 * Other sources of the above documents:
   * PDF Reference and Adobe Extensions to the PDF Specification - Adobe page linking to the specification for PDF 1.7 (equivalent to ISO 32000-1:2008) and two Adobe extensions that are expected to be incorporated into ISO 32000-2. These extensions include support for geospatial features and for 3-D content using U3D and PRC formats.
   * Adobe PDF Reference Archives - Archive of specifications for earlier Adobe versions of PDF, starting with Version 1.3.

Software

 * Adobe Reader views PDF files, either as a standalone program or a browser plugin.
 * Firefox (since version 19.0) includes a built-in PDF reader.
 * Tabula: convert tabular data in PDFs to CSV
 * mPDF: convert HTML to PDF
 * PDF24 creator
 * Apache PDFBox is an open-source PDF library that includes a PDF/A validator
 * pdfium: Open source PDF rendering engine
 * Textract: extract text from various document formats including PDF
 * pdf2svg (in JavaScript)
 * Programming with PDFMiner
 * PDFBox PDF/A Validator
 * PyPDF2

Online utilities

 * PDF to Kindle converter

Sample files

 * PDF Cabinet of Horrors - sample PDF files in corrupted or otherwise problematic formats
 * Adobe PDF Test Suites - various PDF test suites on Adobe Acrobat Engineering site
 * Homeland by Cory Doctorow
 * Sample document saved from Windows Word 2007
 * Quine PDF; contains its own TeX source

Format info

 * Portable Document Format (Wikipedia)
 * Forensics Wiki: PDF
 * Adobe Acrobat Engineering site - Dedicated Adobe site with lots of technical information, including a history of PDF and Acrobat, conforming viewers and test files.
 * PDF/A in a Nutshell 2.0 – online edition
 * Inside the PDF File Format
 * PDF101 an Adobe document walkthrough

Validation

 * PDF Validation: Dream or Yawn? - Presentation on possibilities of an open-source PDF validator
 * The pitfalls of protocol design: Attempting to write a formally verified PDF parser

Jailbreaking

 * Jailbreaking the PDF hackathon
 * Jailbreaking the PDF (discussion)
 * Jailbreaking the PDF (technical aspects: glyph processing)

Commentary

 * The Network is the Format: PDF and the Long-term Use of Digital Content - Article by Sheila Morrissey of ITHAKA on the challenges of preserving PDF files, based on experience. She illustrates the challenge of defining a "sufficient sub-graph of the network of information about a digital object, for effective future use."
 * The PDF’s Place in a History of Paper Knowledge: An Interview with Lisa Gitelman
 * Portable Document Format on OPF File Format Risk Registry - Lists various long-term accessibility issues in PDF and how to detect them using Apache Preflight.
 * Adobe Portable Document Format - Inventory of long-term preservation risks - Report by KB/ National Library of the Netherlands.
 * The uses and abuses of PDF
 * Apple’s Preview: Still not safe for work
 * Preserving the Grey Literature Explosion: PDF/A and the Digital Archive
 * Ensuring long-term access: PDF validation with JHOVE?
 * Researchers: it's time to ditch the PDF

Miscellaneous

 * PDF/A Competence Center
 * What preservation risks are associated with the PDF file format? - Q&A thread from Libraries and Information Sciences Stack Exchange (archived)
 * Recognizing Corrupt and Malformed PDF Files
 * Flight MH370 data was released as a PDF, but somebody extracted it to CSV to make it more useful for data analysis.
 * PDFy - free host for publicly viewable PDFs, backed up automatically to Internet Archive
 * UK judge says ‘freedom of information’ means choice of digital file format
 * The Chimera Quine; or, the ISO PDF
 * PDF information and links for attendees of a conference on PDF
 * Does JHOVE validate PDF/A files?
 * Methods of Repairing Corrupted or Damaged PDFs
 * How do I dump embedded ICC profile information in PDF? (command line or GUI tools)
 * How to check PDF pages for resolution (DPI) of embedded images?
 * A Fast Preprocessing Method for Table Boundary Detection: Narrowing Down the Sparse Lines using Solely Coordinate Information