Just Solve the File Format Problem - User contributions [en]

DOCX

2023-10-27T11:36:26Z

Sebras: /* Specs */ Updated dead link to ECMA specification.

{{FormatInfo
|formattype=electronic
|subcat=Document
|subcat2=Word Processor
|subcat3=Microsoft Word
|extensions={{ext|docx}}
|mimetypes={{mimetype|application/vnd.openxmlformats-officedocument.wordprocessingml.document}}
}}
'''DOCX''' is the [[Office Open XML]] (OOXML) word-processing document format, and has been the default file format for documents created by Microsoft Word since Word 2007. The format is based on [[XML]] component files in a container based on the [[ZIP]] format. It replaced the binary [[DOC]] format used in earlier Word versions, and comes in two flavours, 'strict' and 'transitional' (see below).

This format (or the XML components of it) has been referred to as [[WordProcessingML]], a name also used for the standalone XML files the earlier Word 2003 was able to generate.

Graphical inserted elements may be stored in the form of [[DrawingML]], embedded in the XML.

== History ==

DOCX (along with the other Office Open XML document types [[PPTX]] and [[XLSX]]) was initially standardized as ECMA-376 in 2006. Three versions of ECMA-376 have been produced; the second version corresponds to ISO/IEC 29500:2008, approved as an ISO/IEC standard in April 2008. Changes made to the standard between 2008 and 2012 were primarily corrections, either addressing individual defects reported as implementation of the standard proceeded or required to ensure functional interoperability with existing applications. They do not introduce new functionality.

== Format ==

=== High-level structure ===
Like the other "Open XML" formats, this file format actually consists of various files (mostly [[XML]]) compressed into a [[ZIP]] archive, with this fact obscured from the end user by the use of a different file extension.
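The container structure described above can be checked with ordinary ZIP tooling. The following is a minimal sketch in Python, not a validator: it assumes that a DOCX package contains the members <code>[Content_Types].xml</code> and <code>word/document.xml</code>. The latter is only the conventional location of the main document part, and neither name comes from this article. The helper names <code>looks_like_docx</code> and <code>make_toy_package</code> are hypothetical.

```python
import io
import zipfile

# Assumed member names: [Content_Types].xml is the package manifest and
# word/document.xml is the conventional (not guaranteed) main document part.
CONTENT_TYPES = "[Content_Types].xml"
MAIN_PART = "word/document.xml"


def looks_like_docx(data: bytes) -> bool:
    """Return True if data is a ZIP archive holding the two assumed members."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            names = set(zf.namelist())
    except zipfile.BadZipFile:
        return False
    return CONTENT_TYPES in names and MAIN_PART in names


def make_toy_package() -> bytes:
    """Build a tiny in-memory ZIP with placeholder XML (not a usable document)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(CONTENT_TYPES, "<Types/>")
        zf.writestr(MAIN_PART, "<document/>")
    return buf.getvalue()


if __name__ == "__main__":
    print(looks_like_docx(make_toy_package()))
```

Renaming a real <code>.docx</code> file to <code>.zip</code> and opening it with any archive tool shows the same layout.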

=== Strict versus Transitional ===
The OOXML standard defines two format variations: 'strict' and 'transitional' OOXML. The transitional form is not fully specified within the standard documentation, as it is very closely bound to the specific behaviour of Microsoft Office and the older binary formats. The strict form is the fully standardised form, but Microsoft have been slow to fully support OOXML-Strict as the default format for Office documents, leading to interoperability problems. For a more detailed look at the interoperability issues, see [http://blog.gardeviance.org/2013/12/once-more-unto-breach-dear-friends-once.html this blog post]; for some context from 2014 concerning government support for open formats, see [https://twitter.com/swardley/status/436463566410244097 this tweet]. Some more commentary is [http://www.robweir.com/blog/2009/11/asking-right-questions-about-office.html here].

== Specs ==
* [https://www.ecma-international.org/publications-and-standards/standards/ecma-376/ ECMA-376 specification]
* [http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html ISO publicly available standards, including the latest ISO/IEC 29500 specification] (as of November 2012, this is ISO/IEC 29500:2012)
* [http://www.digitalpreservation.gov/formats/fdd/fdd000395.shtml?loclr=blogsig OOXML Format Family -- ISO/IEC 29500 and ECMA 376 (Library of Congress)]
* [http://www.digitalpreservation.gov/formats/fdd/fdd000397.shtml?loclr=blogsig DOCX Transitional (Office Open XML), Library of Congress]
* [http://www.digitalpreservation.gov/formats/fdd/fdd000400.shtml?loclr=blogsig DOCX Strict (Office Open XML), Library of Congress]
* [http://www.digitalpreservation.gov/formats/fdd/fdd000396.shtml?loclr=blogsig Markup Compatibility and Extensibility (Office Open XML), Library of Congress]

== Sample files ==
* [https://www.dan.info/sampledata/msword/testing.docx Windows Word 2007 sample file]
* https://telparia.com/fileFormatSamples/document/docx/

== Software ==
* [http://johnmacfarlane.net/pandoc/ Pandoc: Document format conversion swiss-army knife]
* [https://github.com/jkr/docx2pandoc docx2pandoc: translate DOCX to Pandoc output formats]
* [http://textract.readthedocs.org/en/latest/ Textract: extract text from various document formats including DOCX]
* [https://pypi.python.org/pypi/Python-OOXML/0.12 Python library for parsing Office Open XML files]

== Other links and references ==
* [http://en.wikipedia.org/wiki/Office_Open_XML Office Open XML : Wikipedia]
* [http://support.microsoft.com/kb/924074 How to open new file formats in earlier versions of Microsoft Office]
* [{{ForensicsWikiURL|word_document_%28docx%29}} Forensics Wiki article]
* [http://www.afr.com/p/technology/why_it_might_be_time_to_dump_word_lQriIFyzmyoStP3nijq8bO Why it might be time to dump Word]
* [https://joinup.ec.europa.eu/elibrary/case/complex-singularity-versus-openness Complex singularity versus openness]

[[Category:XML based file formats]]
[[Category:ZIP based file formats]]
[[Category:Microsoft]]

PSD

2023-05-15T16:08:38Z

Sebras: Updated the link to a server that has the file (and it is saved in the Wayback machine for future reference if necessary).

{{FormatInfo
|formattype=electronic
|subcat=Graphics
|extensions={{ext|psd}}
|mimetypes={{mimetype|image/vnd.adobe.photoshop}}
|pronom={{PRONOM|x-fmt/92}}
|locfdd={{LoCFDD|fdd000523}}
|wikidata={{wikidata|Q2141903}}
|released=≥1990
}}
'''PSD''' is the native layered raster graphics file format of the [[Photoshop|Adobe Photoshop]] program line. The format has gone through multiple versions, each being downwards (but not always upwards) compatible.

PSD is a partially-documented proprietary format. It is very large and complex. Despite this, decoding the primary image of a PSD file is often fairly simple. If an application claims to support PSD, that could mean just about anything.

== Format details ==
=== Compression ===
Images are usually compressed with [[PackBits]], or uncompressed. "Zip" compression, which apparently means [[zlib]], is also supported.
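As an illustration of the PackBits scheme mentioned above, here is a minimal decoder sketch in Python. It handles a single PackBits byte stream only; how PSD frames compressed data (per-row lengths, channel order and so on) is outside its scope, and the function name <code>unpack_bits</code> is hypothetical.

```python
def unpack_bits(data: bytes) -> bytes:
    """Decode one PackBits stream.

    Each control byte n is followed by either literal data or one byte to repeat:
      0..127   -> copy the next n + 1 bytes literally
      129..255 -> repeat the next byte 257 - n times
      128      -> no-op, skipped
    """
    out = bytearray()
    i = 0
    while i < len(data):
        n = data[i]
        i += 1
        if n < 128:
            count = n + 1
            out += data[i:i + count]
            i += count
        elif n > 128:
            out += data[i:i + 1] * (257 - n)
            i += 1
        # n == 128: nothing to do
    return bytes(out)


if __name__ == "__main__":
    packed = bytes.fromhex("FE AA 02 80 00 2A FD AA 03 80 00 2A 22 F7 AA")
    print(unpack_bits(packed).hex(" "))
```

The sketch does no error handling: a truncated stream simply yields fewer bytes than expected, so a real decoder would want to check the output length against the expected row size.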

=== Text encoding ===
PSD files often contain both [[Unicode]] and non-Unicode text. It's not clear whether there is a good way to determine the encoding of the non-Unicode text. Some sources claim [[MacRoman]], which is true in many cases, but other encodings have been observed.

== Identification ==
PSD files begin with bytes <code>'8' 'B' 'P' 'S' 0x00 0x01</code>.
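The signature check above can be sketched in a few lines of Python. The function name <code>is_psd</code> is hypothetical, and the check says nothing about whether the rest of the file is well-formed.

```python
# The six leading bytes given above: '8' 'B' 'P' 'S' 0x00 0x01
PSD_MAGIC = b"8BPS\x00\x01"


def is_psd(header: bytes) -> bool:
    """Return True if the buffer starts with the PSD signature bytes."""
    return header[:len(PSD_MAGIC)] == PSD_MAGIC


if __name__ == "__main__":
    print(is_psd(b"8BPS\x00\x01" + bytes(20)))
```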

== See Also ==
* [[PSB]]
* [[PhotoDeluxe]] (PDD)

See [[Photoshop]] for other related formats.

== Specifications ==
* [https://www.adobe.com/devnet-apps/photoshop/fileformatashtml/ Adobe Photoshop File Formats Specification] (current version)
** Other versions (from archive.org): [https://web.archive.org/web/20110109163057/http://www.adobe.com/devnet-apps/photoshop/fileformatashtml/ 2010-07] · [https://web.archive.org/web/20120910224552/http://www.adobe.com/devnet-apps/photoshop/fileformatashtml/ 2012-06] · [https://web.archive.org/web/20121209170357/http://www.adobe.com/devnet-apps/photoshop/fileformatashtml/ 2012-12] · [https://web.archive.org/web/20130725152233/http://www.adobe.com/devnet-apps/photoshop/fileformatashtml/ 2013-06] · [https://web.archive.org/web/20160617040616/https://www.adobe.com/devnet-apps/photoshop/fileformatashtml/ 2013-10] · [https://web.archive.org/web/20160701113240/https://www.adobe.com/devnet-apps/photoshop/fileformatashtml/ 2016-06] · [https://web.archive.org/web/20160917163012/http://www.adobe.com/devnet-apps/photoshop/fileformatashtml/ 2016-08]
* [https://oldschoolprg.x10.mx/downloads/ps6ffspecsv2.pdf Photoshop File Formats Specification V6.0 Release 2]
* [ftp://ftp.ora.com/pub/examples/gff/CDROM/GFF/VENDSPEC/ADOBEPHO/PHOTOSDK.PDF PHOTOSDK.PDF] - Adobe Photoshop 3.0.4 SDK
** [ftp://ftp.ora.com/pub/examples/gff/CDROM/GFF/VENDSPEC/ADOBEPHO/ADOBE.TXT ADOBE.TXT] - Photoshop 3.0.4 File Format
* [ftp://ftp.ora.com/pub/examples/gff/CDROM/GFF/VENDSPEC/ADOBEPHO/PHOTOSHP.TXT PHOTOSHP.TXT] - Adobe Photoshop 2.5 File Format
* [https://github.com/layervault/psd.rb/wiki/Anatomy-of-a-PSD-File PSD.rb: Anatomy of a PSD File]

== Software ==
=== Viewers, editors, and converters ===
* [[Adobe Photoshop]]
* [[ImageMagick]]
* [[Konvertor]]
* [[XnView]]
* [[Tom's Viewer]]

=== Libraries and tools ===
* [https://github.com/layervault/psd.rb PSD.rb] (Ruby)
* [https://pypi.python.org/pypi/psd-tools/ psd-tools] (Python)
* [https://sourceforge.net/projects/libpsd/ Libpsd] (C)
* [https://github.com/alco/psdump psdump] (C++; uses Libpsd)

== Sample files ==
* "Free PSD" websites are abundant. Some examples:
** [http://www.psdking.eu/ PsdKing]
** [http://www.psdgraphics.com/ psdGraphics]
** [http://www.freepik.com/free-psd Freepik → PSD]
* [https://github.com/devbrain/tombexcavator/tree/master/data/PSD tombexcavator samples]
* https://telparia.com/fileFormatSamples/image/psd/ → *.psd

== Links ==
See also [[Photoshop#Links]].

* [[Wikipedia: Adobe Photoshop#File format]]
* [https://www.adobe.com/products/photoshop.html Adobe Photoshop website]
* {{EGFF|psd|Adobe Photoshop File Format Summary}}, from the [[Encyclopedia of Graphics File Formats]]
* [http://git.gnome.org/browse/gimp/tree/plug-ins/file-psd Open-Source PSD import code from GIMP]

=== Commentary ===
* [https://github.com/gco/xee/blob/7aec0d65f776fa59c58eb6cf163b59dd4f1de3bd/XeePhotoshopLoader.m#L108 Rant about PSD format in comments of a program's source code]
* [https://jnack.com/blog/2009/05/04/some_thoughts_about_the_psd_format/ Some thoughts about the PSD format], in response to above's comments

[[Category:Adobe]]
[[Category:Photoshop]]

JPEG XR

2023-03-15T02:47:28Z

Sebras: /* Sample files */ Update sample image list.

{{FormatInfo
|formattype=electronic
|subcat=Graphics
|extensions={{ext|jxr}}, {{ext|hdp}}, {{ext|wdp}}
|locfdd={{LoCFDD|fdd000243}}
|pronom={{PRONOM|fmt/590}}
|mimetypes={{mimetype|image/vnd.ms-photo}}
|released=≤2009
}}
'''JPEG XR''' is an image compression standard and file format. It supports both lossy and lossless compression. It was originally developed by Microsoft and called '''Windows Media Photo''', then '''HD Photo'''.

The JPEG XR file format is very similar to [[TIFF]], though it is not compatible with it. The compression scheme is vaguely similar to the one used by lossy [[JPEG]].

== Identification ==
Files start with bytes <code>49 49 BC 01</code>.
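A minimal Python sketch of this check follows; the function name <code>is_jpeg_xr</code> is hypothetical.

```python
# The four leading bytes given above: 49 49 BC 01
JXR_MAGIC = bytes.fromhex("4949BC01")


def is_jpeg_xr(header: bytes) -> bool:
    """Return True if the buffer starts with the JPEG XR signature bytes."""
    return header[:4] == JXR_MAGIC


if __name__ == "__main__":
    print(is_jpeg_xr(bytes.fromhex("4949BC01") + bytes(12)))
```

A little-endian [[TIFF]] file begins with <code>49 49 2A 00</code> instead, so the third byte is what separates the two formats at this level.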

== Specifications ==
* [http://www.itu.int/rec/T-REC-T.832 ITU-T Rec. T.832]

== Software ==
* [https://jxrlib.codeplex.com/ jxrlib]: JxrDecApp, JxrEncApp
* [[Konvertor]]
* [[XnView]]
* [http://microsoft.com/ie Internet Explorer], starting with version 9

== Sample files ==
* https://telparia.com/fileFormatSamples/image/jpegXR/
* https://web.archive.org/web/20150620065524/https://www.shikino.co.jp/eng/products/ipcore/jpegxr.html <br/> The actual files were found elsewhere on the archived site:
** https://web.archive.org/web/20220416103728/https://www.shikino.co.jp/solution/upfile/FLOWER.wdp.zip
** https://web.archive.org/web/20220416104247/https://www.shikino.co.jp/solution/upfile/SAKURA.wdp.zip
** https://web.archive.org/web/20220416104247/https://www.shikino.co.jp/solution/upfile/SMALLTOMATO.wdp.zip

== Links ==
* [[Wikipedia:JPEG XR|Wikipedia article]]

[[Category:Microsoft]]
[[Category:JPEG (organization)]]
[[Category:TIFF]]

PDF

2022-06-22T04:42:57Z

Sebras: /* Software */ Add another PDF viewer and manipulation tool.

{{FormatInfo
|formattype=electronic
|subcat=Document
|extensions={{ext|pdf}}
|mimetypes={{mimetype|application/pdf}}
|locfdd={{LoCFDD|fdd000030}}, others
|pronom={{PRONOM|fmt/276}}, others
|wikidata={{wikidata|Q42332}}
}}
'''Portable Document Format''' ('''PDF''') is a document file format originally from Adobe, based on [[PostScript]]. It has many subsets.

As well as the 'full function' ISO 32000-1:2008 (or PDF 1.7), there are also PDF/X, PDF/A, PDF/E, PDF/VT and PDF/UA, all of which are ISO specifications.

PDF profiles (formalized subsets) include the following:

* PDF/A (optimized for preservation)
** PDF/A-1 (ISO 19005-1:2005)
** PDF/A-2 (ISO 19005-2:2011)
** PDF/A-3 (ISO 19005-3:2012) (extends PDF/A-2 by allowing embedded files of any type)
** PDF/A-4 (ISO 19005-4:2020)
* PDF/E (ISO 24517-1:2008) (for engineering workflows)
* PDF/UA (ISO 14289-1) (making documents accessible through assistive technologies)
* PDF/VT (ISO 16612-2) (support for variable document printing)
* PDF/X (support for prepress graphics exchange)
** PDF/X-1 (ISO 15930-1:2001)
** PDF/X-1a (ISO 15930-4:2003)
** PDF/X-2 (ISO 15930-5:2003)
** PDF/X-3 (ISO 15930-6:2003)
* Tagged PDF
Some scanner documentation references an apparently fictitious "PDF/L" profile (see Gary McGath's [https://madfileformatscience.garymcgath.com/2018/03/21/pdf-l/ "PDF/L?"]).

A PDF 2.0 spec (ISO 32000-2) was published in 2017-07, with some new features as well as clarification of conformance with existing features.

A PDF/raster draft spec was issued in 2017 as a subset of PDF files containing raster images of scanned documents.

== Identifiers ==
{| class="wikitable"
! Format
! PRONOM
! LoCFDD
|-
|PDF || || {{LoCFDD|fdd000030}}
|-
|PDF 1.0 || {{PRONOM|fmt/14}} ||rowspan="4"| {{LoCFDD|fdd000316}}
|-
|PDF 1.1 || {{PRONOM|fmt/15}}
|-
|PDF 1.2 || {{PRONOM|fmt/16}}
|-
|PDF 1.3 || {{PRONOM|fmt/17}}
|-
|PDF 1.4 || {{PRONOM|fmt/18}} || {{LoCFDD|fdd000122}}
|-
|PDF 1.5 || {{PRONOM|fmt/19}} || {{LoCFDD|fdd000123}}
|-
|PDF 1.6 || {{PRONOM|fmt/20}} || {{LoCFDD|fdd000276}}
|-
|PDF 1.7 || {{PRONOM|fmt/276}} || {{LoCFDD|fdd000277}}
|-
|PDF 1.7, Ext. 3 || || {{LoCFDD|fdd000313}}
|-
|PDF 2.0 || {{PRONOM|fmt/1129}}
|-
|PDF/A || || {{LoCFDD|fdd000318}}
|-
|PDF/A-1 || || {{LoCFDD|fdd000125}}
|-
|PDF/A-1a || {{PRONOM|fmt/95}} || {{LoCFDD|fdd000251}}
|-
|PDF/A-1b || {{PRONOM|fmt/354}} || {{LoCFDD|fdd000252}}
|-
|PDF/A-2 || || {{LoCFDD|fdd000319}}
|-
|PDF/A-2a || {{PRONOM|fmt/476}} || {{LoCFDD|fdd000320}}
|-
|PDF/A-2b || {{PRONOM|fmt/477}} || {{LoCFDD|fdd000322}}
|-
|PDF/A-2u || {{PRONOM|fmt/478}} || {{LoCFDD|fdd000321}}
|-
|PDF/A-3a || {{PRONOM|fmt/479}} ||rowspan="3"| {{LoCFDD|fdd000360}}
|-
|PDF/A-3b || {{PRONOM|fmt/480}}
|-
|PDF/A-3u || {{PRONOM|fmt/481}}
|-
|PDF/A-4 || || {{LoCFDD|fdd000532}}
|-
|PDF/X-1 || {{PRONOM|fmt/144}}, {{PRONOM|fmt/145}} ||rowspan="9"| {{LoCFDD|fdd000124}}
|-
|PDF/X-1a || {{PRONOM|fmt/157}}, {{PRONOM|fmt/146}}
|-
|PDF/X-2 || {{PRONOM|fmt/147}}
|-
|PDF/X-3 || {{PRONOM|fmt/158}}, {{PRONOM|fmt/148}}
|-
|PDF/X-4 || {{PRONOM|fmt/488}}
|-
|PDF/X-4p || {{PRONOM|fmt/489}}
|-
|PDF/X-5g || {{PRONOM|fmt/490}}
|-
|PDF/X-5pg || {{PRONOM|fmt/491}}
|-
|PDF/X-5n || {{PRONOM|fmt/492}}
|-
|PDF/UA-1 || || {{LoCFDD|fdd000350}}
|-
|PDF/E-1 || {{PRONOM|fmt/493}}
|-
|PDF, Geospatial || || {{LoCFDD|fdd000315}}
|-
|GeoPDF 2.2 || || {{LoCFDD|fdd000312}}
|-
|PDF Portfolio || {{PRONOM|fmt/1451}}
|}

== Identification ==
The majority of PDF files can be identified by a fixed header, e.g. "%PDF-1.4". However, older documents have a number of variations.
* Some can start with "%!PS-Adobe-N.n PDF-M.m" instead, as described [http://blog.didierstevens.com/2010/01/21/quickpost-pdf-header-ps-adobe-n-n-pdf-m-m/ here].
* Adobe's own releases after PDF 1.7 kept the major and minor version numbers fixed, i.e. the next public version from Adobe after 1.7 was "1.7 Adobe Extension Level 3".
* For the PDF/A families of formats, their conformance is declared via an embedded ([[XMP]]) metadata fragment.
* Some older files from Mac OS may be wrapped up in the [[AppleSingle]]/[[AppleDouble]] formats. This is a general issue, so should perhaps be documented elsewhere. For more information, see:
** http://en.wikipedia.org/wiki/AppleSingle_and_AppleDouble_formats
** http://tools.ietf.org/rfc/rfc1740.txt
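
The two header forms described above can be recognised with a short Python sketch. It only inspects the very start of the buffer and assumes single-digit major and minor version numbers; it does not look for AppleSingle/AppleDouble wrappers or for XMP metadata, and the version it returns is only what the header declares. The function name <code>pdf_header_version</code> is hypothetical.

```python
import re

# "%PDF-M.m" at the very start of the file
PDF_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")
# "%!PS-Adobe-N.n PDF-M.m" variant seen in some older files
PS_STYLE_HEADER = re.compile(rb"%!PS-Adobe-\d\.\d PDF-(\d)\.(\d)")


def pdf_header_version(head: bytes):
    """Return the declared (major, minor) version, or None if no header matches."""
    for pattern in (PS_STYLE_HEADER, PDF_HEADER):
        m = pattern.match(head)  # match() anchors at the start of the buffer
        if m:
            return int(m.group(1)), int(m.group(2))
    return None


if __name__ == "__main__":
    print(pdf_header_version(b"%PDF-1.4\n"))
```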

== Compression ==
Images in PDF documents may use the following compression schemes:
* [[LZW]]
* Flate ([[zlib]])
* [[Run-length encoding|RunLength]]
* CCITTFax ([[CCITT Group 3]] and [[CCITT Group 4]])
* [[JBIG2]]
* DCT ([[JPEG]])
* [[JPX]] (part of the [[JPEG 2000]] standard)

== Digital Rights Management & Encryption ==
PDF has two types of 'encryption': a 'user' password limits the ability to open the document, and an 'owner' (or 'creator') password limits other rights, like printing, copying, etc. The former case, where a password is required to open the file, is the main preservation concern, as users will not be able to open a PDF encrypted in this way (unless the password can be cracked, which may be problematic both technically and legally). However, the latter case also causes problems, because the PDF is encrypted here too, but with a special known user password of "" (an empty string, which is not the same as no password). The document is therefore encrypted in both cases, and the only way to tell which is which is to attempt to decrypt the PDF using the special default password "". Some PDF analysis tools (notably [[JHOVE]]) do not implement the relevant decryption workflow, and so cannot distinguish between the two types of encryption.

An example of the decryption test workflow can be found here: https://gist.github.com/anjackson/5237071

Some of the most locked-up PDFs anywhere can be found at the [http://ibr.ansi.org/ ANSI IBR Standards Portal], which has made certain standards documents that are incorporated into legislation available for browsing, but only through a convoluted procedure involving downloading a special plug-in and filling out a registration form that must be re-filled-out in every browsing session.

A "Protected PDF" (PPDF) format is [http://www.eweek.com/mobile/microsoft-enterprise-mobility-suite-cozies-up-to-office.html reportedly] used by Microsoft's Azure Rights Management Service for sharing files securely within a workgroup.

== Document redaction ==

Occasionally the attempts of technically-inept users to obscure content in PDF files get in the news. People have sometimes had the mistaken impression that overlaying a section of text with a solid-black shape, or setting it to white-on-white text, before the document is publicly distributed makes the redacted sections unavailable. This is not true: text obscured in such ways is easy to recover, often simply by dragging a mouse over it to highlight it. This happened in a [http://www.sun-sentinel.com/opinion/fl-op-editorial-judge-elizabeth-scherer-20180823-story.html 2018 Florida case] connected with the school shooting there, where some parts of the school district's report about the shooter were badly redacted and then disclosed by a local newspaper. A judge threatened to punish the paper and impose prior restraint on its future publications because of this "hacking", raising all sorts of legal and constitutional issues.

== Web linking ==

When linked on the [[Web]], specific pages of a PDF can be referenced by appending <code>#page=N</code> (where N is the desired page number) as a fragment identifier at the end of the [[URL]]. This is a little-known fact.
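
Building such a link can be sketched in Python as follows. The helper name <code>pdf_page_link</code> is hypothetical, and the sketch assumes that page numbers start at 1 and that any existing fragment should be replaced rather than combined.

```python
from urllib.parse import urldefrag


def pdf_page_link(url: str, page: int) -> str:
    """Return url with its fragment replaced by '#page=N'."""
    if page < 1:
        # Assumption: pages are numbered from 1
        raise ValueError("page numbers start at 1")
    base, _old_fragment = urldefrag(url)  # drop any existing fragment
    return f"{base}#page={page}"


if __name__ == "__main__":
    print(pdf_page_link("https://example.org/report.pdf", 12))
```

Whether the viewer honours the fragment depends on the PDF viewer used by the browser.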

== Specifications ==
* [http://acroeng.adobe.com/wp/?page_id=321 Adobe PDF References] Contains links to every version of the PDF Reference published by Adobe (starting with PDF 1.0) as well as associated errata, addenda and tech notes.
* Other sources of the above documents:
** [http://www.adobe.com/devnet/pdf/pdf_reference.html PDF Reference and Adobe Extensions to the PDF Specification] Adobe page linking to specification for PDF 1.7 (equivalent to ISO 32000-1:2008) and two Adobe extensions that are expected to be incorporated into ISO 32000-2. These extensions include support for geospatial features and for 3-D content using [[U3D]] and [[Adobe PRC|PRC]] formats.
** [http://www.adobe.com/devnet/pdf/pdf_reference_archive.html Adobe PDF Reference Archives.] Archive of specifications for earlier Adobe versions of PDF, starting with Version 1.3.
* [https://www.iso.org/standard/51502.html ISO 32000-1:2008]: PDF 1.7 (not free to download)
* [https://www.iso.org/standard/63534.html ISO 32000-2:2017]: PDF 2.0 (not free to download)
* [https://pdfraster.org/wp-content/uploads/2017/06/PDFraster10_June-2017.pdf Draft PDF/raster spec 1.0]

== Software ==
* [http://get.adobe.com/reader/ Adobe Reader] views PDF files, either as a standalone program or a browser plugin.
* [http://www.mozilla.org/en-US/products/download.html?product=firefox-19.0&os=win&lang=en-US Firefox 19.0] includes a built-in PDF reader.
* [http://source.mozillaopennews.org/en-US/articles/introducing-tabula/ Tabula: convert tabular data in PDFs to CSV]
* [http://www.mpdf1.com/mpdf/index.php mPDF: convert HTML to PDF]
* [https://mupdf.com/ MuPDF PDF viewer and command line mutool for manipulating PDF]
* [http://en.pdf24.org/ PDF24 creator]
* [http://pdfbox.apache.org/ Apache PDFBox] is an open-source PDF library that includes a PDF/A validator
* [https://pdfium.googlesource.com/pdfium/ pdfium: Open source PDF rendering engine]
* [http://textract.readthedocs.org/en/latest/ Textract: extract text from various document formats including PDF]
* [https://github.com/pramodhkp/pdf2svg/ pdf2svg (in JavaScript)]
* [https://euske.github.io/pdfminer/programming.html Programming with PDFMiner]
* [https://github.com/friesey/preservation-tools/releases/tag/v0.1_alpha_PDFBox_Statistics PDFBox PDF/A Validator]
* [https://pypi.python.org/pypi/PyPDF2/1.24 PyPDF2]
* [https://github.com/sumatrapdfreader Sumatra PDF Reader]
* [https://chrome.google.com/webstore/detail/pdf-viewer/oemmndcbldboiebfnladdacbdfmadadm?hl=en PDF viewer for Chrome]
* [http://verapdf.org/software/ veraPDF library (PDF validator)]
* [http://www.metachris.com/pdfx/ PDFx - Extract metadata and URLs from PDFs, and download all referenced PDFs]
* [https://github.com/ANSSI-FR/caradoc Caradoc: PDF parser and validator]
* [https://github.com/uds-datalab/PDBF PDBF: Create documents that are simultaneously valid PDF, HTML, and VirtualBox OVA.]
* [https://blog.didierstevens.com/programs/pdf-tools/ PDF Tools]
* [https://www.tracker-software.com/product/pdf-xchange-viewer PDF-XChange Viewer]
* [https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/ The PDF Toolkit PDFTK]

== Online utilities ==
* [http://www.pdf4kindle.com/ PDF to Kindle converter]
* [https://pdftables.com/ PDF to Excel (and some other formats)]
* [https://www.ilovepdf.com/ I Love PDF: miscellaneous utilities]

== Sample files ==
* [https://github.com/openplanets/format-corpus/tree/master/pdfCabinetOfHorrors PDF Cabinet of Horrors] - sample PDF files in corrupted or otherwise problematic formats
* [http://acroeng.adobe.com/wp/?page_id=10 Adobe PDF Test Suites] - various PDF test suites on Adobe Acrobat Engineering site
* [http://craphound.com/homeland/Cory_Doctorow_-_Homeland.pdf Homeland by Cory Doctorow]
* [https://www.dan.info/sampledata/msword/testing.pdf Sample document saved from Windows Word 2007]
* [https://github.com/corkami/pocs/blob/master/pdf/quine.pdf Quine PDF; contains its own TeX source]
* [https://www.alchemistowl.org/pocorgtfo/pocorgtfo08.pdf Newsletter designed to work as PDF, ZIP, or shell script]
* [https://github.com/veraPDF/veraPDF-corpus veraPDF corpus]
* [https://github.com/osnr/horrifying-pdf-experiments Horrifying PDF Experiments]
* [https://github.com/mozilla/pdf.js/tree/master/test/pdfs Test PDFs used by Mozilla PDF Reader]
* [https://github.com/pdf-association/pdf20examples PDF 2.0 example files by the PDF Association]
* https://telparia.com/fileFormatSamples/document/pdf/

== See also ==
* [[Ascii85]]
* [[FDF]]
* [[KFP]] Preflight Profile
* [[PostScript]]
* [[WWF]]
* [[XFDF]]

== Links ==

=== Format info ===
* [http://en.wikipedia.org/wiki/Portable_Document_Format Portable Document Format (Wikipedia)]
* [http://www.forensicswiki.org/wiki/PDF Forensics Wiki: PDF]
* [http://acroeng.adobe.com/wp/ Adobe Acrobat Engineering site] - Dedicated Adobe site with lots of technical information, including a history of PDF and Acrobat, conforming viewers and test files.
* [http://www.pdfa.org/2013/04/pdfa-in-a-nutshell-2_0/ PDF/A in a Nutshell 2.0 – online edition]
* [http://www.infinitepartitions.com/cgi-bin/showarticle.cgi?article=art019 Inside the PDF File Format]
* [http://imgur.com/a/PbN8H#7 PDF101 an Adobe document walkthrough]

=== Validation ===
* [http://vimeopro.com/pdfassociation/technical-conference-europe-2013/video/68945979 PDF Validation: Dream or Yawn?] - Presentation on possibilities of an open-source PDF validator
* [http://www.docdroid.net/ciex/5103a198-1.pdf.html The pitfalls of protocol design: Attempting to write a formally verified PDF parser]
* [https://fileformats.wordpress.com/2015/04/22/verapdf/ New open-source file validation project]

=== Jailbreaking ===
* [http://scholrev.org/hackathon/ Jailbreaking the PDF hackathon]
* [http://blogs.ch.cam.ac.uk/pmr/2013/05/28/jailbreaking-the-pdf-a-wonderful-hackathon-and-a-community-leap-forward-for-freedom-1/ Jailbreaking the PDF (discussion)]
* [http://blogs.ch.cam.ac.uk/pmr/2013/05/28/jailbreaking-the-pdf-2-technical-aspects-glyph-processing/ Jailbreaking the PDF (technical aspects: glyph processing)]
* [http://blog.didierstevens.com/2015/04/15/pdf-password-cracking-with-john-the-ripper/ PDF Password Cracking With John The Ripper]

=== Commentary ===
* [http://www.portico.org/digital-preservation/wp-content/uploads/2012/11/TheNetworkIsTheFormat.pdf The Network is the Format: PDF and the Long-term Use of Digital Content] Article by Sheila Morrissey of ITHAKA on the challenges of preserving PDF files based on experience. She illustrates the challenge of defining a "sufficient sub-graph of the network of information about a digital object, for effective future use."
* [http://blogs.loc.gov/digitalpreservation/2014/06/the-pdfs-place-in-a-history-of-paper-knowledge-an-interview-with-lisa-gitelman/ The PDF’s Place in a History of Paper Knowledge: An Interview with Lisa Gitelman]
* [http://wiki.opf-labs.org/display/TR/Portable+Document+Format Portable Document Format on OPF File Format Risk Registry] - Lists various long-term accessibility issues in PDF and how to detect them using Apache Preflight.
* [http://www.openplanetsfoundation.org/system/files/PDFInventoryPreservationRisks_0_2_0.pdf Adobe Portable Document Format - Inventory of long-term preservation risks] - Report by KB/ National Library of the Netherlands.
* [http://fileformats.wordpress.com/2014/06/13/abuses-pdf/ The uses and abuses of PDF]
* [http://duff-johnson.com/2014/04/07/apples-preview-still-not-safe-for-work/ Apple’s Preview: Still not safe for work]
* [http://www.niso.org/publications/isq/2013/v25no3/moore/ Preserving the Grey Literature Explosion: PDF/A and the Digital Archive]
* [http://www.pdfa.org/2014/12/ensuring-long-term-access-pdf-validation-with-jhove/ Ensuring long-term access: PDF validation with JHOVE?]
* [http://www.theguardian.com/higher-education-network/2015/feb/11/researchers-its-time-to-ditch-the-pdf Researchers: it's time to ditch the PDF]
* [http://wiki.dpconline.org/images/5/51/PDF_Assessment_v1.2_external.pdf PDF Format Preservation Assessment (British Library)]
* [http://www.pdfa.org/2015/06/what-will-pdf-2-0-bring/ What will PDF 2.0 bring?]
* [http://www.digitalpreservation.gov/ndsa/working_groups/documents/NDSA_PDF_A3_report_final022014.pdf?loclr=blogsig The Benefits and Risks of the PDF/A-3 file format for archival institutions]
* [https://nicolastreeten.wordpress.com/2015/09/19/becoming-of-age-pdf/ Becoming of Age: PDF (comic)]
* [http://www.pdfa.org/2016/06/what-does-support-pdf-really-mean/ What does "support PDF" really mean?]
* [http://openpreservation.org/blog/2016/12/09/pdfa-as-a-preferred-sustainable-format-for-spreadsheets/ PDF/A as a preferred, sustainable format for spreadsheets?]
* [https://www.filingdb.com/pdf-text-extraction What's so hard about PDF text extraction?]
* [https://www.pdfa.org/perfecting-pdf-lexical-analysis/ Perfecting PDF lexical analysis]

=== Miscellaneous ===
* [http://www.pdfa.org/ PDF/A Competence Center]
* [http://web.archive.org/web/20130515073645/http://libraries.stackexchange.com/questions/964/what-preservation-risks-are-associated-with-the-pdf-file-format What preservation risks are associated with the PDF file format?] - Q&A thread from Libraries and Information Sciences Stack Exchange (archived)
* [http://labs.appligent.com/files/2013/03/recognizing_malformed_pdf_f.pdf Recognizing Corrupt and Malformed PDF Files]
* [https://github.com/davetaz/mh370-data Flight MH370 data was released as a PDF, but somebody extracted it to CSV to make it more useful for data analysis.]
* [https://pdf.yt/ PDFy - free host for publicly viewable PDFs, backed up automatically to Internet Archive]
* [http://www.washingtonpost.com/blogs/the-switch/wp/2014/08/05/uk-judge-says-freedom-of-information-means-choice-of-digital-file-format/ UK judge says ‘freedom of information’ means choice of digital file format]
* [http://blogs.perl.org/users/peter_martini/2014/08/the-chimera-quine-or-the-iso-pdf.html The Chimera Quine; or, the ISO PDF]
* [http://openplanetsfoundation.org/blogs/2014-08-12-coming-preserving-pdf-identify-validate-repair-hamburg PDF info/links for attendees of conference on it]
* [http://anjackson.github.io/keeping-codes/experiments/does-jhove-validate-pdfa-files Does JHOVE validate PDF/A files?]
* [http://raywoodcockslatest.wordpress.com/2014/12/04/pdf-repair/ Methods of Repairing Corrupted or Damaged PDFs]
* [http://stackoverflow.com/questions/17740175/how-do-i-dump-embedded-icc-profile-information-in-pdf-command-line-or-gui-tool/27464166#27464166 How do I dump embedded ICC profile information in PDF? (command line or GUI tools)]
* [http://stackoverflow.com/questions/27938551/how-to-check-pdf-pages-for-resolution-dpi-of-embedded-images/27942530 How to check PDF pages for resolution (DPI) of embedded images?]
* [http://chemxseer.ist.psu.edu/about/digital_library/das08-liu.pdf A Fast Preprocessing Method for Table Boundary Detection: Narrowing Down the Sparse Lines using Solely Coordinate Information]
* [https://github.com/angea/PDF101/tree/master/handcoded/textextract Why text extracting doesn't work for all PDFs]
* [http://stackoverflow.com/questions/29342542/how-can-i-extract-a-javascript-from-a-pdf-file-with-a-command-line-tool/29364036 How can I extract a JavaScript from a PDF file with a command line tool?]
* [http://stackoverflow.com/questions/29331731/postscript-code-to-un-hide-hidden-text-in-pdf/29334742 How to un-hide hidden text in PDF]
* [http://www.pdfa.org/2015/04/infographics-pdfua-and-wcag-2-0/ Infographics: PDF/UA and WCAG 2.0]
* [http://www.prepressure.com/pdf/basics/history The history of PDF] according to prepressure.com, a site for "prepress & print devotees".
* [https://isc.sans.edu/diary/Handling+Special+PDF+Compression+Methods/19597 Handling Special PDF Compression Methods]
* [https://speakerdeck.com/ange/lets-write-a-pdf-file Let's write a PDF file]
* [https://blog.didierstevens.com/2016/06/07/recovering-a-ransomed-pdf/ Recovering a ransomed PDF]
* [https://github.com/digital-preservation/droid/issues/114 PDF version numbers based on deprecated mechanism]
* [https://madfileformatscience.garymcgath.com/2016/09/26/pdf-version/ Figuring out the PDF version is harder than you think]
* [https://www.pdfa.org/slides-and-video-recordings-of-the-pdf-days-europe-2017/ Slides and video recordings of the PDF Days Europe 2017]
* [https://www.pcworld.com/article/2096946/5-cheaper-alternatives-to-acrobat-for-pdf-editing.html 5 cheaper alternatives to Acrobat for PDF editing]
* [https://pdfraster.org/ PDF/raster site]
* [https://www.pdfa.org/hunter-bidens-email-and-the-potential-for-deepfakes-with-pdf/ Hunter Biden’s “email” and the potential for deepfakes with PDF]
* [https://www.bitsgalore.org/2021/09/06/pdf-processing-and-analysis-with-open-source-tools PDF processing and analysis with open-source tools]
* [https://www.wowsignal.io/articles/pdf PDF cannot be tokenized]

[[Category:Page description languages]]
[[Category:Adobe]]

LBR

2022-01-29T09:20:38Z

Sebras: /* References */ Add links to two more versions of the specification.

{{FormatInfo
|formattype=electronic
|subcat=Archiving
|extensions={{ext|lbr}}, {{ext|lqr}}, {{ext|lzr}}, {{ext|lyr}}
|wikidata={{wikidata|Q6457314}}
}}
[[LBR]] was a container format popular for distributing [[CP/M]] software, designed by Gary P. Novosielski. Since it had no compression of its own, it was common for individual members of .LBR files to be compressed with [[Squeeze]] (.?Q?), [[Crunch]] (.?Z?), or [[CrLZH]] (.?Y?). Alternatively, the whole library could be compressed with one of these methods (leading to the extensions .LQR, .LZR, .LYR).
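
As a rough illustration of the naming convention above, the following Python sketch guesses the compression method from the middle letter of a three-letter extension. The function and table names are hypothetical, and the check is only a name-based hint: an unrelated extension with Q, Z, or Y in the middle would also match, so it is no substitute for inspecting the file contents.

```python
from typing import Optional

# Middle letter of the extension -> compression method, per the patterns
# .?Q?, .?Z? and .?Y? (names here are illustrative, not from any tool).
MIDDLE_LETTER_TO_METHOD = {"Q": "Squeeze", "Z": "Crunch", "Y": "CrLZH"}


def guess_compression(filename: str) -> Optional[str]:
    """Return a guessed compression method, or None if the name gives no hint."""
    _, dot, ext = filename.rpartition(".")
    if not dot or len(ext) != 3:
        return None  # no extension, or not a three-letter one
    return MIDDLE_LETTER_TO_METHOD.get(ext[1].upper())
```

The same test covers whole-library names such as GAMES.LZR, since .LQR, .LZR and .LYR follow the same middle-letter pattern.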

Under CP/M, the canonical tools for manipulating LBR files were LU.COM and NULU.COM. Other tools, such as NSWP.COM, understood both LBR and some of the closely associated compression formats.

LBR has been implemented on other platforms, including PC/MS-DOS, but the [[LBR (Commodore)|Commodore LBR]] format is unrelated and incompatible. (Platforms of that era were often Balkanized: their file formats were rarely compatible with those of other platforms, even when they served similar purposes or were inspired by, and even named after, formats from other platforms.)

== Identification ==
LBR files have no signature, but they begin with a "Directory Control Entry" that has a fairly strict format: a 0x00 byte, then 11 spaces (0x20), then two 0x00 bytes, then two bytes that are not both 0x00.
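
The byte pattern described above can be checked with a few lines of Python. This is a minimal sketch, not a full validator: the function name is hypothetical, and it tests only the first 16 bytes, so a match means "plausibly LBR" rather than "definitely LBR".

```python
def looks_like_lbr(header: bytes) -> bool:
    """Heuristically test the first 16 bytes of a file for the LBR pattern."""
    if len(header) < 16:
        return False
    return (
        header[0] == 0x00                   # leading 0x00 byte
        and header[1:12] == b" " * 11       # 11 spaces (0x20)
        and header[12:14] == b"\x00\x00"    # two 0x00 bytes
        and header[14:16] != b"\x00\x00"    # next two bytes are not both 0x00
    )
```

To apply it to a file, read the first 16 bytes in binary mode, for example <code>looks_like_lbr(open(path, "rb").read(16))</code>, where <code>path</code> is whatever file is being examined.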

== Tools ==

* [[CFX]] (DOS/Unix)
* [http://www.svgalib.org/rus/lbrate.html lbrate] by Russell Marks, c. 2001 (Unix, GPL2)
* [http://www.seasip.info/Unix/Lar/index.html LAR] (Unix, tar-like interface) by John Elliott, based on Stephen C. Hemminger's original
* [http://www.classiccmp.org/cpmarchives/cpm/mirrors/oak.oakland.edu/pub/cpm/arc-lbr/lu310.com lu310.com] (CP/M software)

== Sample files ==
* [http://www.classiccmp.org/cpmarchives/ftp.php?b=cpm/mirrors/oak.oakland.edu/pub/cpm/ OAK CP/M archive] → .../*.lbr
* https://telparia.com/fileFormatSamples/archive/lbr/

== References ==

* .LBR format definition, Gary P. Novosielski, 1984-08-19 -- available as LUDEF5.DOC in many CP/M archives (e.g., [http://www.retroarchive.org/cpm/cdrom/CPM/UTILS/ARC-LBR/LUDEF5.DOC here])
** [http://www.textfiles.com/programming/FORMATS/ludef5.txt This version] renames the extension to .txt so the browser won't try to launch Microsoft Word to open it.
** '''[http://www.seasip.info/Cpm/ludef5.html HTML version]''' of the above
* .LBR format definition, Gary P. Novosielski, 1984-08-04 -- [http://annex.retroarchive.org/cdrom/nightowl-001/015A/LUDEF4/LUDEF4.DOC an older version of the specification]
* .LBR format definition, Gary P. Novosielski, 1983-08-16 -- contained in [http://cpmarchives.classiccmp.org/cpm/Software/WalnutCD/lambda/soundpot/f/lu300.lbr lu300.lbr]; use a tool such as [[The Unarchiver]] to access the contained files.
* .LBR format definition, Gary P. Novosielski, 1982-11-04 -- [http://cpmarchives.classiccmp.org/cpm/Software/WalnutCD/simtel/sigm/vols100/vol119/ludef1.doc an early version of the specification]
* [[Wikipedia:LBR (file format)|LBR (file format) at Wikipedia]]
* [http://www.textfiles.com/programming/FORMATS/arc-lbr.pro ARC vs LBR comparison (1985)]

[[Category:CP/M]]

LBR

2022-01-29T09:07:17Z

Sebras: /* References */ Added link to older version of the spec.

{{FormatInfo
|formattype=electronic
|subcat=Archiving
|extensions={{ext|lbr}}, {{ext|lqr}}, {{ext|lzr}}, {{ext|lyr}}
|wikidata={{wikidata|Q6457314}}
}}
[[LBR]] was a container format popular for distributing [[CP/M]] software, designed by Gary P. Novosielski. Since it had no compression of its own, it was common for individual members of .LBR files to be compressed with [[Squeeze]] (.?Q?), [[Crunch]] (.?Z?), or [[CrLZH]] (.?Y?). Alternatively, the whole library could be compressed with one of these methods (leading to the extensions .LQR, .LZR, .LYR).

Under CP/M, the canonical tools for manipulating LBR files were LU.COM and NULU.COM. Other tools, such as NSWP.COM, understood both LBR and some of the closely associated compression formats.

LBR has been implemented on other platforms, including PC/MS-DOS, but the [[LBR (Commodore)|Commodore LBR]] format is unrelated and incompatible. (Platforms of that era were often Balkanized: their file formats were rarely compatible with those of other platforms, even when they served similar purposes or were inspired by, and even named after, formats from other platforms.)

== Identification ==
LBR files have no signature, but they begin with a "Directory Control Entry" that has a fairly strict format: a 0x00 byte, then 11 spaces (0x20), then two 0x00 bytes, then two bytes that are not both 0x00.

== Tools ==

* [[CFX]] (DOS/Unix)
* [http://www.svgalib.org/rus/lbrate.html lbrate] by Russell Marks, c. 2001 (Unix, GPL2)
* [http://www.seasip.info/Unix/Lar/index.html LAR] (Unix, tar-like interface) by John Elliott, based on Stephen C. Hemminger's original
* [http://www.classiccmp.org/cpmarchives/cpm/mirrors/oak.oakland.edu/pub/cpm/arc-lbr/lu310.com lu310.com] (CP/M software)

== Sample files ==
* [http://www.classiccmp.org/cpmarchives/ftp.php?b=cpm/mirrors/oak.oakland.edu/pub/cpm/ OAK CP/M archive] → .../*.lbr
* https://telparia.com/fileFormatSamples/archive/lbr/

== References ==

* .LBR format definition, Gary P. Novosielski, 1984-08-19 -- available as LUDEF5.DOC in many CP/M archives (e.g., [http://www.retroarchive.org/cpm/cdrom/CPM/UTILS/ARC-LBR/LUDEF5.DOC here])
** [http://www.textfiles.com/programming/FORMATS/ludef5.txt This version] renames the extension to .txt so the browser won't try to launch Microsoft Word to open it.
** '''[http://www.seasip.info/Cpm/ludef5.html HTML version]''' of the above
* .LBR format definition, Gary P. Novosielski, 1984-08-04 -- [http://annex.retroarchive.org/cdrom/nightowl-001/015A/LUDEF4/LUDEF4.DOC an older version of the specification]
* [[Wikipedia:LBR (file format)|LBR (file format) at Wikipedia]]
* [http://www.textfiles.com/programming/FORMATS/arc-lbr.pro ARC vs LBR comparison (1985)]

[[Category:CP/M]]