PDF

From Just Solve the File Format Problem
Line 7: Line 7:
 
|pronom={{PRONOM|fmt/276}}, others
 
}}
'''PDF''', portable document format, based on [[PostScript]] and originally from Adobe, has many subsets.  
+
'''Portable Document Format''' ('''PDF''') is a document file format originally from Adobe, based on [[PostScript]]. It has many subsets.
  
 
As well as the 'full function' ISO 32000-1:2008 (or PDF 1.7), there are also PDF/X, PDF/A, PDF/E, PDF/VT and PDF/UA, all of which are ISO specifications.
Line 26: Line 26:
 
** PDF/X-3 (ISO 15930-6:2003)
 
* Tagged PDF
 +
 +
A PDF 2.0 spec (ISO 32000-2) was published in July 2017, with some new features as well as clarification of conformance with existing features.
  
 
== Identifiers ==
Line 52: Line 54:
 
|-
 
|PDF 1.7, Ext. 3 ||  || {{LoCFDD|fdd000313}}
 +
|-
 +
|PDF 2.0 || {{PRONOM|fmt/1129}}
 
|-
 
|PDF/A    ||  || {{LoCFDD|fdd000318}}
Line 129: Line 133:
  
 
A "Protected PDF" (PPDF) format is [http://www.eweek.com/mobile/microsoft-enterprise-mobility-suite-cozies-up-to-office.html reportedly] used by Microsoft's Azure Rights Management Service for sharing files securely within a workgroup.
 +
 +
== Document redaction ==
 +
 +
Attempts by technically inexperienced users to obscure content in PDF files occasionally make the news. A common misconception is that overlaying a section of text with a solid black shape, or setting it to white-on-white text, before distributing the document makes the redacted sections unavailable. It does not: text obscured in these ways is easy to recover, often by simply dragging the mouse over it to select it. This happened in a [http://www.sun-sentinel.com/opinion/fl-op-editorial-judge-elizabeth-scherer-20180823-story.html 2018 Florida case] connected with the school shooting there: parts of the school district's report about the shooter were badly redacted, and a local newspaper disclosed them. A judge then threatened to punish the paper and impose prior restraint on its future publications over this "hacking", raising a range of legal and constitutional issues.
  
 
== Specifications ==
 
* [http://acroeng.adobe.com/wp/?page_id=321 Adobe PDF References]  Contains links to every version of the PDF Reference published by Adobe (starting with PDF 1.0) as well as associated errata, addenda and tech notes.
 
* Other sources of the above documents:
** [http://www.adobe.com/devnet/pdf/pdf_reference.html PDF Reference and Adobe Extensions to the PDF Specification] Adobe page linking to specification for PDF 1.7 (equivalent to ISO 32000-1:2008) and two Adobe extensions that are expected to be incorporated into ISO 32000-2. These extensions include support for geospatial features and for 3-D content using [[U3D]] and [[PRC]] formats.  
+
** [http://www.adobe.com/devnet/pdf/pdf_reference.html PDF Reference and Adobe Extensions to the PDF Specification] Adobe page linking to specification for PDF 1.7 (equivalent to ISO 32000-1:2008) and two Adobe extensions that are expected to be incorporated into ISO 32000-2. These extensions include support for geospatial features and for 3-D content using [[U3D]] and [[Adobe PRC|PRC]] formats.  
 
** [http://www.adobe.com/devnet/pdf/pdf_reference_archive.html Adobe PDF Reference Archives.] Archive of specifications for earlier Adobe versions of PDF, starting with Version 1.3.
 +
* [https://www.iso.org/standard/51502.html ISO 32000-1:2008]: PDF 1.7 (not free to download)
 +
* [https://www.iso.org/standard/63534.html ISO 32000-2:2017]: PDF 2.0 (not free to download)
  
 
== Software ==
Line 143: Line 153:
 
* [http://en.pdf24.org/ PDF24 creator]
 
* [http://pdfbox.apache.org/ Apache PDFBox] is an open-source PDF library that includes a PDF/A validator
* [https://code.google.com/p/pdfium/ pdfium: Open source PDF rendering engine]
+
* [https://pdfium.googlesource.com/pdfium/ pdfium: Open source PDF rendering engine]
 
* [http://textract.readthedocs.org/en/latest/ Textract: extract text from various document formats including PDF]
 
* [https://github.com/pramodhkp/pdf2svg/ pdf2svg (in JavaScript)]
 
* [https://euske.github.io/pdfminer/programming.html Programming with PDFMiner]
 
* [https://github.com/friesey/preservation-tools/releases/tag/v0.1_alpha_PDFBox_Statistics PDFBox PDF/A Validator]
 +
* [https://pypi.python.org/pypi/PyPDF2/1.24 PyPDF2]
 +
* [https://github.com/sumatrapdfreader Sumatra PDF Reader]
 +
* [https://chrome.google.com/webstore/detail/pdf-viewer/oemmndcbldboiebfnladdacbdfmadadm?hl=en PDF viewer for Chrome]
 +
* [http://verapdf.org/software/ veraPDF library (PDF validator)]
 +
* [http://www.metachris.com/pdfx/ PDFx - Extract metadata and URLs from PDFs, and download all referenced PDFs]
 +
* [https://github.com/ANSSI-FR/caradoc Caradoc: PDF parser and validator]
 +
* [https://github.com/uds-datalab/PDBF PDBF: Create documents that are simultaneously valid PDF, HTML, and VirtualBox OVA.]
 +
* [https://blog.didierstevens.com/programs/pdf-tools/ PDF Tools]
 +
* [https://www.tracker-software.com/product/pdf-xchange-viewer PDF-XChange Viewer]
  
 
== Online utilities ==
 
* [http://www.pdf4kindle.com/ PDF to Kindle converter]
 +
* [https://pdftables.com/ PDF to Excel (and some other formats)]
  
 
== Sample files ==
Line 157: Line 177:
 
* [http://craphound.com/homeland/Cory_Doctorow_-_Homeland.pdf Homeland by Cory Doctorow]
 
* [http://www.dan.info/sampledata/msword/testing.pdf Sample document saved from Windows Word 2007]
* [https://code.google.com/p/corkami/source/browse/trunk/src/pdf/quine.pdf?spec=svn1907&r=1907 Quine PDF; contains its own TeX source]
+
* [https://github.com/corkami/pocs/blob/master/pdf/quine.pdf Quine PDF; contains its own TeX source]
 +
* [https://www.alchemistowl.org/pocorgtfo/pocorgtfo08.pdf Newsletter designed to work as PDF, ZIP, or shell script]
 +
* [https://github.com/veraPDF/veraPDF-corpus veraPDF corpus]
 +
* [https://github.com/osnr/horrifying-pdf-experiments Horrifying PDF Experiments]
 +
* [https://github.com/mozilla/pdf.js/tree/master/test/pdfs Test PDFs used by Mozilla PDF Reader]
 +
* [https://github.com/pdf-association/pdf20examples PDF 2.0 example files by the PDF Association]
 +
 
 +
== See also ==
 +
* [[Ascii85]]
 +
* [[FDF]]
 +
* [[PostScript]]
 +
* [[WWF]]
 +
* [[XFDF]]
  
 
== Links ==
Line 172: Line 204:
 
* [http://vimeopro.com/pdfassociation/technical-conference-europe-2013/video/68945979 PDF Validation: Dream or Yawn?] - Presentation on possibilities of an open-source PDF validator
 
* [http://www.docdroid.net/ciex/5103a198-1.pdf.html The pitfalls of protocol design: Attempting to write a formally verified PDF parser]
 +
* [https://fileformats.wordpress.com/2015/04/22/verapdf/ New open-source file validation project]
  
 
=== Jailbreaking ===
Line 177: Line 210:
 
* [http://blogs.ch.cam.ac.uk/pmr/2013/05/28/jailbreaking-the-pdf-a-wonderful-hackathon-and-a-community-leap-forward-for-freedom-1/ Jailbreaking the PDF (discussion)]
 
* [http://blogs.ch.cam.ac.uk/pmr/2013/05/28/jailbreaking-the-pdf-2-technical-aspects-glyph-processing/ Jailbreaking the PDF (technical aspects: glyph processing)]
 +
* [http://blog.didierstevens.com/2015/04/15/pdf-password-cracking-with-john-the-ripper/ PDF Password Cracking With John The Ripper]
  
 
=== Commentary ===
Line 188: Line 222:
 
* [http://www.pdfa.org/2014/12/ensuring-long-term-access-pdf-validation-with-jhove/ Ensuring long-term access: PDF validation with JHOVE?]
 
* [http://www.theguardian.com/higher-education-network/2015/feb/11/researchers-its-time-to-ditch-the-pdf Researchers: it's time to ditch the PDF]
 +
* [http://wiki.dpconline.org/images/5/51/PDF_Assessment_v1.2_external.pdf PDF Format Preservation Assessment (British Library)]
 +
* [http://www.pdfa.org/2015/06/what-will-pdf-2-0-bring/ What will PDF 2.0 bring?]
 +
* [http://www.digitalpreservation.gov/ndsa/working_groups/documents/NDSA_PDF_A3_report_final022014.pdf?loclr=blogsig The Benefits and Risks of the PDF/A-3 file format for archival institutions]
 +
* [https://nicolastreeten.wordpress.com/2015/09/19/becoming-of-age-pdf/ Becoming of Age: PDF (comic)]
 +
* [http://www.pdfa.org/2016/06/what-does-support-pdf-really-mean/ What does "support PDF" really mean?]
 +
* [http://openpreservation.org/blog/2016/12/09/pdfa-as-a-preferred-sustainable-format-for-spreadsheets/ PDF/A as a preferred, sustainable format for spreadsheets?]
  
 
=== Miscellaneous ===
Line 202: Line 242:
 
* [http://stackoverflow.com/questions/17740175/how-do-i-dump-embedded-icc-profile-information-in-pdf-command-line-or-gui-tool/27464166#27464166 How do I dump embedded ICC profile information in PDF? (command line or GUI tools)]
 
* [http://stackoverflow.com/questions/27938551/how-to-check-pdf-pages-for-resolution-dpi-of-embedded-images/27942530 How to check PDF pages for resolution (DPI) of embedded images?]
 +
* [http://chemxseer.ist.psu.edu/about/digital_library/das08-liu.pdf A Fast Preprocessing Method for Table Boundary Detection: Narrowing Down the Sparse Lines using Solely Coordinate Information]
 +
* [https://github.com/angea/PDF101/tree/master/handcoded/textextract Why text extracting doesn't work for all PDFs]
 +
* [http://stackoverflow.com/questions/29342542/how-can-i-extract-a-javascript-from-a-pdf-file-with-a-command-line-tool/29364036 How can I extract a JavaScript from a PDF file with a command line tool?]
 +
* [http://stackoverflow.com/questions/29331731/postscript-code-to-un-hide-hidden-text-in-pdf/29334742 How to un-hide hidden text in PDF]
 +
* [http://www.pdfa.org/2015/04/infographics-pdfua-and-wcag-2-0/ Infographics: PDF/UA and WCAG 2.0]
 +
* [http://www.prepressure.com/pdf/basics/history The history of PDF] according to prepressure.com, a site for "prepress & print devotees".
 +
* [https://isc.sans.edu/diary/Handling+Special+PDF+Compression+Methods/19597 Handling Special PDF Compression Methods]
 +
* [https://speakerdeck.com/ange/lets-write-a-pdf-file Let's write a PDF file]
 +
* [https://blog.didierstevens.com/2016/06/07/recovering-a-ransomed-pdf/ Recovering a ransomed PDF]
 +
* [https://github.com/digital-preservation/droid/issues/114 PDF version numbers based on deprecated mechanism]
 +
* [https://madfileformatscience.garymcgath.com/2016/09/26/pdf-version/ Figuring out the PDF version is harder than you think]
 +
* [https://www.pdfa.org/slides-and-video-recordings-of-the-pdf-days-europe-2017/ Slides and video recordings of the PDF Days Europe 2017]
 +
* [https://www.pcworld.com/article/2096946/5-cheaper-alternatives-to-acrobat-for-pdf-editing.html 5 cheaper alternatives to Acrobat for PDF editing]
  
 
[[Category:Page description languages]]
 
[[Category:Adobe]]

Revision as of 17:06, 20 October 2018

File Format
Name PDF
Ontology
Extension(s) .pdf
MIME Type(s) application/pdf
LoCFDD fdd000146, others
PRONOM fmt/276, others

Portable Document Format (PDF) is a document file format originally from Adobe, based on PostScript. It has many subsets.

As well as the 'full function' ISO 32000-1:2008 (or PDF 1.7), there are also PDF/X, PDF/A, PDF/E, PDF/VT and PDF/UA, all of which are ISO specifications.

PDF profiles (formalized subsets) include the following:

  • PDF/A (optimized for preservation)
    • PDF/A-1 (ISO 19005-1:2005)
    • PDF/A-2 (ISO 19005-2:2011)
    • PDF/A-3 (ISO 19005-3:2012) (extends PDF/A-2 by allowing embedded files of any type)
  • PDF/E (ISO 24517-1:2008) (for engineering workflows)
  • PDF/UA (ISO 14289-1) (making documents accessible through assistive technologies)
  • PDF/VT (ISO 16612-2) (support for variable document printing)
  • PDF/X (support for prepress graphics exchange)
    • PDF/X-1 (ISO 15930-1:2001)
    • PDF/X-1a (ISO 15930-4:2003)
    • PDF/X-2 (ISO 15930-5:2003)
    • PDF/X-3 (ISO 15930-6:2003)
  • Tagged PDF

A PDF 2.0 spec (ISO 32000-2) was published in July 2017, with some new features as well as clarification of conformance with existing features.

Identifiers

Format | PRONOM | LoCFDD
PDF | | fdd000146
PDF 1.0 | fmt/14 | fdd000316
PDF 1.1 | fmt/15 |
PDF 1.2 | fmt/16 |
PDF 1.3 | fmt/17 |
PDF 1.4 | fmt/18 | fdd000122
PDF 1.5 | fmt/19 | fdd000123
PDF 1.6 | fmt/20 | fdd000276
PDF 1.7 | fmt/276 | fdd000277
PDF 1.7, Ext. 3 | | fdd000313
PDF 2.0 | fmt/1129 |
PDF/A | | fdd000318
PDF/A-1 | | fdd000125
PDF/A-1a | fmt/95 | fdd000251
PDF/A-1b | fmt/354 | fdd000252
PDF/A-2 | | fdd000319
PDF/A-2a | fmt/476 | fdd000320
PDF/A-2b | fmt/477 | fdd000322
PDF/A-2u | fmt/478 | fdd000321
PDF/A-3a | fmt/479 | fdd000360
PDF/A-3b | fmt/480 |
PDF/A-3u | fmt/481 |
PDF/X-1 | fmt/144, fmt/145 | fdd000124
PDF/X-1a | fmt/157, fmt/146 |
PDF/X-2 | fmt/147 |
PDF/X-3 | fmt/158, fmt/148 |
PDF/X-4 | fmt/488 |
PDF/X-4p | fmt/489 |
PDF/X-5g | fmt/490 |
PDF/X-5pg | fmt/491 |
PDF/X-5n | fmt/492 |
PDF/UA-1 | | fdd000350
PDF/E-1 | fmt/493 |
PDF, Geospatial | | fdd000315
GeoPDF 2.2 | | fdd000312

Identification

The majority of PDF files can be identified by a fixed header such as "%PDF-1.4"; older documents, however, show a number of variations.
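
As a rough illustration of header-based identification, the following Python sketch looks for a "%PDF-x.y" marker near the start of a file. The function name `sniff_pdf_version` and the 1024-byte search window are assumptions made for this example (the window is a leniency intended to tolerate leading junk bytes); it is not a full identification routine and does not cover every header variation found in older documents.

```python
import re

# Conventional header marker, e.g. b"%PDF-1.4" or b"%PDF-2.0".
PDF_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def sniff_pdf_version(data: bytes, search_window: int = 1024):
    """Return (major, minor) if a %PDF-x.y marker is found near the start, else None.

    search_window is an assumed tolerance for junk bytes before the header.
    """
    match = PDF_HEADER.search(data[:search_window])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    print(sniff_pdf_version(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n"))
    print(sniff_pdf_version(b"not a pdf"))
```

In practice the bytes would come from reading the first kilobyte or so of the file in binary mode and passing them to the function.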

Compression

Images in PDF documents may use the following compression schemes:

Digital Rights Management & Encryption

PDF has two types of 'encryption': a 'user' password limits the ability to open the document, and a 'creator' password (called the 'owner' password in the PDF specification) limits other rights, such as printing and copying. The former case, where a password is required to open the file, is the main preservation concern, as our users will not be able to open a PDF encrypted in this way (unless the password can be cracked, which may be problematic both technically and legally). However, the latter case also causes problems, because the PDF is encrypted here too, but with a special known user password of "" (an empty string, which is not the same as no password). The document is therefore encrypted in both cases, and the only way to tell which is which is to attempt to decrypt the PDF using the special default password "". Some PDF analysis tools (notably JHOVE) do not implement the relevant decryption workflow, and so cannot distinguish between the two types of encryption.

An example of the decryption test workflow can be found here: https://gist.github.com/anjackson/5237071
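
The decision logic described above can be sketched in Python as follows. This is not the code from the linked gist. The names `declares_encryption` and `classify` are made up for this example, and `try_empty_password` is a hypothetical callback that would have to be supplied by a real PDF library capable of attempting decryption. The byte scan for "/Encrypt" is only a coarse heuristic and can give false positives if that text happens to appear elsewhere in the file.

```python
import re

# An encrypted PDF declares an /Encrypt entry in its trailer (or in the
# cross-reference stream dictionary). \b avoids matching /EncryptMetadata.
ENCRYPT_KEY = re.compile(rb"/Encrypt\b")


def declares_encryption(data: bytes) -> bool:
    """Coarse check: does the raw file contain an /Encrypt key at all?"""
    return ENCRYPT_KEY.search(data) is not None


def classify(data: bytes, try_empty_password) -> str:
    """Classify a PDF following the workflow described in the text.

    try_empty_password is a hypothetical callable: it takes the file bytes and
    returns True if decryption with the empty user password "" succeeds.
    """
    if not declares_encryption(data):
        return "not encrypted"
    if try_empty_password(data):
        return "restrictions only (empty user password)"
    return "password required to open"
```

Keeping the decryption attempt behind a callback means the classification logic stays independent of whichever PDF library performs the actual decryption.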

Some of the most locked-up PDFs anywhere can be found at the ANSI IBR Standards Portal, which has made certain standards documents that are incorporated into legislation available for browsing, but only through a convoluted procedure involving downloading a special plug-in and filling out a registration form that must be re-filled-out in every browsing session.

A "Protected PDF" (PPDF) format is reportedly used by Microsoft's Azure Rights Management Service for sharing files securely within a workgroup.

Document redaction

Attempts by technically inexperienced users to obscure content in PDF files occasionally make the news. A common misconception is that overlaying a section of text with a solid black shape, or setting it to white-on-white text, before distributing the document makes the redacted sections unavailable. It does not: text obscured in these ways is easy to recover, often by simply dragging the mouse over it to select it. This happened in a 2018 Florida case connected with the school shooting there: parts of the school district's report about the shooter were badly redacted, and a local newspaper disclosed them. A judge then threatened to punish the paper and impose prior restraint on its future publications over this "hacking", raising a range of legal and constitutional issues.
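
To see why an overlaid shape hides nothing, consider a simplified page content stream in which the text is drawn first and a filled black rectangle is painted over it afterwards. The Python sketch below uses a made-up, uncompressed stream and a deliberately naive regular expression; the names `CONTENT_STREAM` and `shown_strings` are assumptions for this example. Real files usually compress their streams and may use font encodings that need further decoding, so this only illustrates the principle.

```python
import re

# Hypothetical, uncompressed content stream: the text is shown with Tj,
# then a black rectangle (rg / re / f) is painted over the same area.
CONTENT_STREAM = (
    b"BT /F1 12 Tf 72 700 Td (Jane Doe, witness) Tj ET\n"
    b"0 0 0 rg 70 695 150 16 re f\n"
)

# Literal strings passed to the Tj (show text) operator. Escape sequences
# and nested parentheses are ignored in this simplified pattern.
SHOWN_TEXT = re.compile(rb"\(([^()\\]*)\)\s*Tj")


def shown_strings(stream: bytes):
    """Return the literal strings drawn by Tj operators in a content stream."""
    return [raw.decode("latin-1") for raw in SHOWN_TEXT.findall(stream)]


if __name__ == "__main__":
    # The rectangle only covers the text visually; the string is still there.
    print(shown_strings(CONTENT_STREAM))
```

The rectangle is just another drawing instruction that follows the text, so the string remains in the file for any tool that reads the stream rather than the rendered page.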

Specifications

  • Adobe PDF References: contains links to every version of the PDF Reference published by Adobe (starting with PDF 1.0) as well as associated errata, addenda and tech notes.
  • Other sources of the above documents:
  • ISO 32000-1:2008: PDF 1.7 (not free to download)
  • ISO 32000-2:2017: PDF 2.0 (not free to download)

Software

Online utilities

Sample files

See also

Links

Format info

Validation

Jailbreaking

Commentary

Miscellaneous
