PDF

From Just Solve the File Format Problem
== Specifications ==
* [http://www.adobe.com/devnet/pdf/pdf_reference.html PDF Reference and Adobe Extensions to the PDF Specification] Adobe page linking to specification for PDF 1.7 (equivalent to ISO 32000-1:2008) and two Adobe extensions that are expected to be incorporated into ISO 32000-2. These extensions include support for geospatial features and for 3-D content using [[U3D]] and [[PRC]] formats.
* [http://www.adobe.com/devnet/pdf/pdf_reference_archive.html Adobe PDF Reference Archives.] Archive of specifications for earlier Adobe versions of PDF, starting with Version 1.3.
  
 
== Software ==

== Sample files ==
* [http://craphound.com/homeland/Cory_Doctorow_-_Homeland.pdf Homeland by Cory Doctorow]
  
== Links ==
* [http://en.wikipedia.org/wiki/Portable_Document_Format Portable Document Format (Wikipedia)]
* [http://www.pdfa.org/ PDF/A Competence Center]
* [http://www.portico.org/digital-preservation/wp-content/uploads/2012/11/TheNetworkIsTheFormat.pdf The Network is the Format: PDF and the Long-term Use of Digital Content] Article by Sheila Morrissey of ITHAKA on the challenges of preserving PDF files based on experience. She illustrates the challenge of defining a "sufficient sub-graph of the network of information about a digital object, for effective future use."
* [http://www.digitalpreservation.gov/formats/fdd/fdd000030.shtml PDF (Portable Document Format) from Library of Congress resource on Sustainability of Digital Formats] Links to individual pages for Adobe chronological versions 1.3 through 1.7 and for several versions approved as ISO standards.
* [http://acroeng.adobe.com/wp/ Adobe Acrobat Engineering site] - Dedicated Adobe site with lots of technical information, including a history of PDF and Acrobat, conforming viewers and test files.
* [http://wiki.opf-labs.org/display/TR/Portable+Document+Format Portable Document Format on OPF File Format Risk Registry] - Lists various long-term accessibility issues in PDF and how to detect them using Apache Preflight.
* [http://www.pdfa.org/2013/04/pdfa-in-a-nutshell-2_0/ PDF/A in a Nutshell 2.0 – online edition]
* [http://www.forensicswiki.org/wiki/PDF Forensics Wiki: PDF]
* [http://www.infinitepartitions.com/cgi-bin/showarticle.cgi?article=art019 Inside the PDF File Format]
* [http://blogs.ch.cam.ac.uk/pmr/2013/05/28/jailbreaking-the-pdf-a-wonderful-hackathon-and-a-community-leap-forward-for-freedom-1/ Jailbreaking the PDF (discussion)]
* [http://blogs.ch.cam.ac.uk/pmr/2013/05/28/jailbreaking-the-pdf-2-technical-aspects-glyph-processing/ Jailbreaking the PDF (technical aspects: glyph processing)]

Revision as of 18:03, 4 January 2014

File Format
Name PDF
Ontology
Extension(s) .pdf
MIME Type(s) application/pdf
LoCFDD fdd000146, others
PRONOM fmt/276, others

PDF (Portable Document Format) is a document format originally developed by Adobe and based on PostScript. It has many formalized subsets.

In addition to the 'full function' ISO 32000-1:2008 (equivalent to PDF 1.7), there are PDF/X, PDF/A, PDF/E, PDF/VT and PDF/UA, all of which are ISO specifications.

PDF profiles (formalized subsets) include the following:

  • PDF/A (optimized for preservation)
    • PDF/A-1 (ISO 19005-1:2005)
    • PDF/A-2 (ISO 19005-2:2011)
    • PDF/A-3 (ISO 19005-3:2012) (extends PDF/A-2 by allowing embedded files of any type)
  • PDF/E (ISO 24517-1:2008) (for engineering workflows)
  • PDF/UA (ISO 14289-1) (making documents accessible through assistive technologies)
  • PDF/VT (ISO 16612-2) (support for variable document printing)
  • PDF/X (support for prepress graphics exchange)
    • PDF/X-1 (ISO 15930-1:2001)
    • PDF/X-1a (ISO 15930-4:2003)
    • PDF/X-2 (ISO 15930-5:2003)
    • PDF/X-3 (ISO 15930-6:2003)
  • Tagged PDF

Identifiers

Format             PRONOM              LoCFDD
PDF                -                   fdd000146
PDF 1.0            fmt/14              fdd000316
PDF 1.1            fmt/15              -
PDF 1.2            fmt/16              -
PDF 1.3            fmt/17              -
PDF 1.4            fmt/18              fdd000122
PDF 1.5            fmt/19              fdd000123
PDF 1.6            fmt/20              fdd000276
PDF 1.7            fmt/276             fdd000277
PDF 1.7, Ext. 3    -                   fdd000313
PDF/A              -                   fdd000318
PDF/A-1            -                   fdd000125
PDF/A-1a           fmt/95              fdd000251
PDF/A-1b           fmt/354             fdd000252
PDF/A-2            -                   fdd000319
PDF/A-2a           fmt/476             fdd000320
PDF/A-2b           fmt/477             fdd000322
PDF/A-2u           fmt/478             fdd000321
PDF/A-3a           fmt/479             fdd000360
PDF/A-3b           fmt/480             -
PDF/A-3u           fmt/481             -
PDF/X-1            fmt/144, fmt/145    fdd000124
PDF/X-1a           fmt/157, fmt/146    -
PDF/X-2            fmt/147             -
PDF/X-3            fmt/158, fmt/148    -
PDF/X-4            fmt/488             -
PDF/X-4p           fmt/489             -
PDF/X-5g           fmt/490             -
PDF/X-5pg          fmt/491             -
PDF/X-5n           fmt/492             -
PDF/UA-1           -                   fdd000350
PDF/E-1            fmt/493             -
PDF, Geospatial    -                   fdd000315
GeoPDF 2.2         -                   fdd000312

Identification

Most PDF files can be identified by a fixed header such as "%PDF-1.4", but older documents show a number of variations.

  • Some start with "%!PS-Adobe-N.n PDF-M.m" instead, as described here.
  • Since PDF 1.7, the major and minor version numbers have been frozen: the next public version from Adobe after 1.7 was "1.7 Adobe Extension Level 3".
  • PDF/A files declare their conformance via an embedded XMP metadata fragment.
  • Some older files from Mac OS may be wrapped in the AppleSingle/AppleDouble formats. This is a general issue, so it should perhaps be documented elsewhere. For more information, see:
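
The header and PDF/A checks described above can be sketched in Python roughly as follows. This is a minimal illustration, not a complete identifier: the function names are hypothetical, the header is assumed to sit at the very start of the file, and the XMP packet is assumed to be stored uncompressed in a single-byte encoding. The `pdfaid:part` and `pdfaid:conformance` property names come from the PDF/A identification schema rather than from the text above, and finding them only shows what the file claims, not that it actually conforms.

```python
import re
from typing import Optional

# Common case: "%PDF-M.m" at the very start of the file.
_PDF_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")
# Older variant: "%!PS-Adobe-N.n PDF-M.m".
_PS_HEADER = re.compile(rb"%!PS-Adobe-\d\.\d PDF-(\d)\.(\d)")

# PDF/A claim in XMP, in element form (<pdfaid:part>1</pdfaid:part>)
# or attribute form (pdfaid:part="1").
_PDFA_PART = re.compile(rb"pdfaid:part(?:>|=[\"'])\s*(\d)")
_PDFA_CONF = re.compile(rb"pdfaid:conformance(?:>|=[\"'])\s*([A-Za-z])")


def sniff_pdf_version(head: bytes) -> Optional[str]:
    """Return the 'M.m' version from the header bytes, or None if absent."""
    for pattern in (_PDF_HEADER, _PS_HEADER):
        match = pattern.match(head)  # match() anchors at offset 0
        if match:
            return f"{match.group(1).decode()}.{match.group(2).decode()}"
    return None


def declared_pdfa_level(data: bytes) -> Optional[str]:
    """Return the PDF/A level the file claims (e.g. 'PDF/A-1b'), or None."""
    part = _PDFA_PART.search(data)
    if part is None:
        return None
    conf = _PDFA_CONF.search(data)
    level = conf.group(1).decode().lower() if conf else ""
    return f"PDF/A-{part.group(1).decode()}{level}"
```

In use, `sniff_pdf_version` would be given the first few bytes of a file and `declared_pdfa_level` the whole file contents. A file that reports version 1.7 may still be an Adobe extension-level file, since the header version stopped changing after 1.7.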

Compression

Images in PDF documents may use the following compression schemes:

Digital Rights Management & Encryption

PDF has two types of 'encryption': a 'user' password limits the ability to open the document, and a 'creator' password (called the owner password in the PDF specification) limits other rights, such as printing and copying. The former case, where a password is required to open the file, is the main preservation concern, as our users will not be able to open a PDF encrypted in this way (unless the password can be cracked, which may be problematic both technically and legally). However, the latter case also causes problems, because the PDF is encrypted here too, but with a special known user password of "" (an empty string, which is not the same as no password). The document is therefore encrypted in both cases, and you can only tell which is which by attempting to decrypt the PDF using the special default password "". Some PDF analysis tools (notably JHOVE) do not implement the relevant decryption workflow, and so cannot distinguish between the two types of encryption.

An example of the decryption test workflow can be found here: https://gist.github.com/anjackson/5237071
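
The decision logic of that test can be sketched in Python as follows. This is not the code from the gist above, only an illustration of the same idea. The function names and the three result labels are hypothetical, and the `reader` object is assumed to behave like the `PdfReader` class of the third-party pypdf package: an `is_encrypted` attribute, and a `decrypt(password)` method whose result is falsy when the password is rejected.

```python
def classify_encryption(reader) -> str:
    """Classify a PDF by how its encryption affects opening it.

    `reader` is any object with an `is_encrypted` attribute and a
    `decrypt(password)` method that returns a falsy value on failure
    (an assumption modelled on pypdf's PdfReader).
    """
    if not reader.is_encrypted:
        return "unencrypted"
    # Try the special default user password: the empty string.
    if reader.decrypt(""):
        return "restricted"  # opens without a password; only rights are limited
    return "locked"  # a real user password is required to open the file


def classify_file(path: str) -> str:
    """Hypothetical convenience wrapper; needs the third-party pypdf package."""
    # Imported lazily so classify_encryption itself has no dependency.
    from pypdf import PdfReader

    return classify_encryption(PdfReader(path))
```

Only files classified as "locked" fall into the main preservation concern described above. Depending on the library and the encryption algorithm used in the file, the decryption attempt may also raise an error instead of returning a result, which a real workflow would need to handle.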

Some of the most locked-up PDFs anywhere can be found at the ANSI IBR Standards Portal, which has made certain standards documents that are incorporated into legislation available for browsing, but only through a convoluted procedure that involves downloading a special plug-in and filling out a registration form that must be filled out again in every browsing session.

Specifications

Software

Sample files

Links
