I would add Tabula [0] to this list. I’ve used it to extract tabular data from PDFs, especially when acquiring Covid data at the height of the pandemic. It’s MIT-licensed and handles table extraction from PDFs really well.
I am still looking for a good open source templating system that uses PDF files as input, so you are not bound to a special tool or format to generate the templates but can use whatever produces PDF files itself. Does anybody know of one?
All printing workflows use proprietary formats as input and bind you to one tool or producer. Could Scribus help with that?
I have been using XML ===XSLT===> LaTeX ===pdflatex/lualatex===> PDF for more than a decade now. The whole pipeline is driven by a batch file that takes all XML files in an input folder, uses a temp folder for the intermediate LaTeX, and an output folder for the PDFs.
I can produce HTML files from the same XML sources directly: XML ===XSLT===> HTML.
For differences between PDF and HTML versions I have some special tags and attributes in my XML sources.
If I want to change something in the layout, I modify the XSLT script and run the old XML sources through the pipeline again in one go.
There was some up front effort in designing the XML tag system and writing the XSLT scripts. But since my later layout changes were minor, the required tweaks were easy.
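A minimal sketch of what such a driver can look like, here in Python with lxml instead of a batch file (folder and stylesheet names are illustrative, not my actual setup):

    import pathlib
    import subprocess

    from lxml import etree

    # XSLT stylesheet that maps the homegrown XML tags to LaTeX
    transform = etree.XSLT(etree.parse("to_latex.xsl"))

    pathlib.Path("temp").mkdir(exist_ok=True)
    pathlib.Path("output").mkdir(exist_ok=True)

    for xml_file in sorted(pathlib.Path("input").glob("*.xml")):
        # XML -> LaTeX
        tex_file = pathlib.Path("temp") / (xml_file.stem + ".tex")
        tex_file.write_text(str(transform(etree.parse(str(xml_file)))), encoding="utf-8")
        # LaTeX -> PDF
        subprocess.run(["lualatex", "--output-directory=output", str(tex_file)], check=True)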
A quite simple homegrown DTD. Making it as simple as possible keeps the complexity of the XSLT scripts low. Most of it is similar to HTML (<paragraph>, <italics>, <bold>) with a few special attributes to add some semantics or processing hints.[1] Sometimes I use several intermediate steps to produce the final HTML. It can then be useful to version the DTDs with a fixed, required version attribute in the root element, which must appear in the XML files, to avoid applying the wrong XSLT script to an outdated version of my XML sources.[2]
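A single-value enumeration can enforce exactly that, both requiring the attribute and pinning its value (root element name and version here are just examples):

    <!ATTLIST report version (1.4) #REQUIRED>

A validating parser then rejects any source file whose root element omits the attribute or carries a different value.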
For a customer who needed to import large semi-structured legacy Word documents from another company into a database system, I once implemented the following process: The Word documents were converted to a relatively simple homegrown XML format based on the structural elements of the Word documents. The resulting XML documents were manually corrected where the structural elements were incorrect. Some special attributes were added inside the XML documents to associate text passages with already existing database keys. When this was finished, an XSLT script was applied that split the large XML file into smaller ones based on these database keys; a human-readable prefix, the key and a date went into the file name. These files were converted in bulk to LaTeX and then to PDF. Afterwards, I used a little tool to bulk-upload only the fresh PDFs into the correct database entries based on the keys in their filenames.
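That last step is mostly filename parsing; roughly like this, where the naming scheme and upload_pdf() are hypothetical stand-ins:

    import pathlib
    import re

    def upload_pdf(key: str, path: pathlib.Path) -> None:
        ...  # hypothetical stand-in for the actual database upload

    # e.g. "contract_4711_2008-03-15.pdf" -> database key "4711"
    pattern = re.compile(r"^[a-z]+_(?P<key>\d+)_\d{4}-\d{2}-\d{2}\.pdf$")

    for pdf in pathlib.Path("output").glob("*.pdf"):
        match = pattern.match(pdf.name)
        if match:
            upload_pdf(match.group("key"), pdf)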
For one of my side-projects, a C# application, I am using another, object-oriented approach, where I have an abstract base class for reporting and two derived classes, one that outputs HTML and one that outputs LaTeX. The LaTeX output is then fed into lualatex to produce PDFs. You can check out the free Herodotus edition of my (closed-source) Factonaut project at https://www.factonaut.com/ to see it in action.
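The pattern, sketched here in Python rather than C# and not the actual Factonaut code, is roughly:

    from abc import ABC, abstractmethod

    class ReportRenderer(ABC):
        # one abstract method per structural element the reports need
        @abstractmethod
        def heading(self, text: str) -> str: ...

        @abstractmethod
        def paragraph(self, text: str) -> str: ...

    class HtmlRenderer(ReportRenderer):
        def heading(self, text: str) -> str:
            return f"<h1>{text}</h1>"

        def paragraph(self, text: str) -> str:
            return f"<p>{text}</p>"

    class LatexRenderer(ReportRenderer):
        def heading(self, text: str) -> str:
            return f"\\section{{{text}}}"  # the output goes through lualatex later

        def paragraph(self, text: str) -> str:
            return text + "\n\n"

The report logic only talks to the base class, so adding a third output format means adding one more subclass.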
[1] Using parameter entities for re-usability, such as
<!ENTITY % output_attr SYSTEM "output_attr.ent">
<!ATTLIST foo %output_attr; >
<!ATTLIST bar %output_attr; >
in the DTD, referring to an output_attr.ent file that contains the shared attribute definition, for example (attribute name and values here are illustrative, since the DTD is homegrown):
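    <!-- illustrative; the actual attribute definitions depend on the DTD -->
    output (pdf|html|all) "all"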
This is interesting. Is there some kind of extraction of the PDF-generating part of Inkscape, so you do not have to start Inkscape all over again to generate PDF files? Or is there a good and fast command line tool that can produce PDFs from Inkscape SVG files?
Inkscape's format is just SVG, so any tool that can do SVG-to-PDF conversion will do. I've seen cairosvg, ImageMagick and rsvg used for this, but if your deployment allows it, it does make the most sense to use Inkscape from the CLI to avoid any inconsistencies:
$ inkscape rendered.svg --export-pdf=output.pdf
I think there's also a way to do batch processing where you don't need to spin up a new inkscape process for each file (which takes time), but I don't remember how it works anymore.
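One thing worth knowing: with Inkscape 1.0 the export flags changed, so on current versions the equivalent command is

    $ inkscape rendered.svg --export-type=pdf --export-filename=output.pdf

and "inkscape --shell" starts a mode that reads commands from stdin, so many files can be exported from a single Inkscape process; that may be the batch mode referred to above.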
Depending on your needs, Scribus might help with its scripting API[1]; see ScribusGenerator[2] for example.
Scribus and Inkscape can import PDFs, but it would be preferable to use a clean "source" format; PDF is, AFAIK, meant for output, like lossy compressed images.
That's not a great idea. PDFs are a very "final presentation" format. They don't even really have a strict concept of a block of text. There are even tools which will happily put the separate letters in various places and call it a day. I've done something like that (just inserting short content + a signature into a PDF) and do not want to touch PDF editing with a 10ft pole ever again. To preserve sanity, use a different format that actually understands the structure of the content, one step before the PDF is generated.
The problem with having a PDF file as template is that it gets very hard to define how new pages should be created whenever there is too much content to fit on the existing pages. E.g. line items in an invoice template.
What you can do is to use a PDF file for all the static content that appears on each page.
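A sketch of that overlay approach with reportlab and pypdf (file names, coordinates and text are placeholders):

    from io import BytesIO

    from pypdf import PdfReader, PdfWriter
    from reportlab.pdfgen import canvas

    # draw the variable content on an otherwise blank page
    buf = BytesIO()
    c = canvas.Canvas(buf)
    c.drawString(72, 700, "Invoice line items go here")
    c.save()

    # lay the generated content over the static template page
    template_page = PdfReader("static_template.pdf").pages[0]
    template_page.merge_page(PdfReader(buf).pages[0])

    writer = PdfWriter()
    writer.add_page(template_page)
    with open("merged.pdf", "wb") as f:
        writer.write(f)

For multi-page output you repeat this per page, reusing the static page each time, which sidesteps the pagination problem described above.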
Amazing share on the tooling side of things for PDF processing.
More on PDF and why it's hard to do analysis [1]. TL;DR:

"PDF was never really designed as a data input format, but rather, it was designed as an output format giving fine grained control over the resulting document."
By the way, the PDF 1.4 specification (the most widely used version) is over 1000 pages![2] To make it even harder, not everybody follows the spec :)
Yeah I think 95% of the time, the correct way to fix a problem with a PDF is to go back to the tool that generated it, and fix the issue in the input. But, sometimes non-ideal circumstances intervene.
Yes, until very recently, if what Adobe Acrobat/Reader did was different from the spec, then the spec was wrong. It's very hard to tell a client that their document is corrupted when they can read it just fine in Adobe.
I've been looking for a tool that lets you mark up PDF documents with bounding boxes around text and add metadata (like a section number, paragraph number, page number) that you can use for downstream processing and for extracting the text from particular areas. Kind of like a reverse LaTeX document structure.
The closest I've come in the past is manually setting x/y coordinates to do crops in pdftotext, which was pretty time consuming.
I'm sure it wouldn't be too difficult (famous last words) to use annotation objects with structure codes or some such, but I'm surprised that even after all these years there isn't something that lets you do this more simply.
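In the meantime, the "extract text from particular areas" half can at least be scripted, e.g. with pdfplumber (labels and coordinates here are made up):

    import pdfplumber

    # bounding boxes as (x0, top, x1, bottom) in PDF points, tagged with
    # whatever downstream metadata you need
    regions = {"section 1, paragraph 2, page 1": (50, 100, 550, 180)}

    with pdfplumber.open("input.pdf") as pdf:
        page = pdf.pages[0]
        for label, bbox in regions.items():
            print(label, page.within_bbox(bbox).extract_text())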
One interesting thing to note about PDF libraries is that there seems to be a general trend towards A/GPL licenses, presumably because working with PDFs is tedious but they are still the standard for conveying data intended for human consumption/archival.
The commercial licensing for some of these is a little hairy, too. I've seen rev share clauses which is, uh, hilarious.
Depending on what you're doing, check your licenses!
When it comes up for me, I've usually been able to use the mentioned pdftotext from poppler-utils, with just one tweak that isn't mentioned in the article: for some documents, adding the "-layout" option gets the text into a state where I can parse it more easily.
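For reference, that looks like (file names are placeholders):

    $ pdftotext -layout input.pdf output.txt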
For my use case, I ended up converting the PDF to a single PNG image and then running Amazon Textract on it. This allows me to easily convert PDF tables into CSV files, all from within PHP. I would love to find a cheaper (local) option vs AWS, but this works.
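One possibly cheaper, local route, assuming a Java runtime is acceptable: Tabula (mentioned at the top of the thread) has a Python wrapper, tabula-py, that can do the table-to-CSV step offline (file names are placeholders):

    import tabula

    # extract all tables from all pages into a single CSV, locally
    tabula.convert_into("input.pdf", "output.csv", output_format="csv", pages="all")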
PDF processing seems like a security minefield. What are folks doing to mitigate that problem prior to (or as part of) processing? Or as part of any system that accepts PDFs with the intent that they’re shared with other systems and users.
If you're looking for a GUI, the Okular document viewer (by KDE) can sign them just fine, as can LibreOffice (but I don't like how it handles key storage on Linux).
For something much more powerful (where you can tweak every possible signature parameter), take a look at jsignpdf - the UI isn't exactly friendly, but it works more reliably than anything else and I've never seen a reader that didn't accept its signature (Adobe Reader likes to sometimes ignore signatures from other tools for no good reason, but never jsignpdf).
As for libraries, you can use jsignpdf from the CLI, pyHanko for Python, and probably others that I don't know of.
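A minimal pyHanko signing sketch, assuming a PEM key/cert pair (file and field names are placeholders):

    from pyhanko.pdf_utils.incremental_writer import IncrementalPdfFileWriter
    from pyhanko.sign import signers

    signer = signers.SimpleSigner.load("key.pem", "cert.pem")

    with open("doc.pdf", "rb") as inf, open("doc-signed.pdf", "wb") as outf:
        writer = IncrementalPdfFileWriter(inf)
        signers.sign_pdf(
            writer,
            signers.PdfSignatureMetadata(field_name="Signature1"),
            signer=signer,
            output=outf,
        )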
Verifying digital signatures is quite hard. Yeah, checking whether the stored hash is valid is simple, but all the other things are hard: for example, whether the additions made to a signed PDF are actually allowed or not.
Or whether the signature is correct in all the small details, e.g. the algorithm used and the information included.
And then it sometimes depends on the environment. For example, sometimes a digital signature is only considered valid if all the revocation information and all the certificates are included in the PDF.
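pyHanko's validation API gives an idea of how much is involved; a rough sketch, assuming a trust root in ca-cert.pem:

    from pyhanko.keys import load_cert_from_pemder
    from pyhanko.pdf_utils.reader import PdfFileReader
    from pyhanko.sign.validation import validate_pdf_signature
    from pyhanko_certvalidator import ValidationContext

    vc = ValidationContext(trust_roots=[load_cert_from_pemder("ca-cert.pem")])

    with open("doc-signed.pdf", "rb") as f:
        sig = PdfFileReader(f).embedded_signatures[0]
        status = validate_pdf_signature(sig, vc)
        print(status.pretty_print_details())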
I'm still lamenting the demise of the "Impress!ve" PDF presentation tool. It was super cool. Currently unmaintained and spitting errors, unfortunately.
0: https://tabula.technology/