I would add Tabula [0] to this list. I’ve used it to extract tabular data from PDFs, especially when acquiring Covid data at the height of the pandemic. It’s MIT-licensed and handles table extraction from PDFs really well.
I am still looking for a good open source templating system that uses PDF files as input, so you are not bound to a special tool or format to generate the templates but can use whatever produces PDF files itself. Does anybody know of one?
All printing workflows use proprietary formats as input and bind you to one tool or producer. Could Scribus help with that?
I have been using XML ===XSLT===> LaTeX ===pdflatex/lualatex===> PDF for more than a decade now. The whole pipeline is driven by a batch file that takes all XML files in an input folder, uses a temp folder for the intermediate LaTeX, and an output folder for the PDFs.
I can produce HTML files from the same XML sources directly: XML ===XSLT===> HTML.
For differences between PDF and HTML versions I have some special tags and attributes in my XML sources.
If I want to change something in the layout, I modify the XSLT script and run the old XML sources through the pipeline again in one go.
There was some up front effort in designing the XML tag system and writing the XSLT scripts. But since my later layout changes were minor, the required tweaks were easy.
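A minimal sketch of what such a driver can look like, here in Python with lxml instead of a batch file (folder and stylesheet names are illustrative, not my actual setup):

    import pathlib
    import subprocess

    from lxml import etree

    # XSLT stylesheet that maps the homegrown XML tags to LaTeX
    transform = etree.XSLT(etree.parse("to_latex.xsl"))

    pathlib.Path("temp").mkdir(exist_ok=True)
    pathlib.Path("output").mkdir(exist_ok=True)

    for xml_file in sorted(pathlib.Path("input").glob("*.xml")):
        # XML -> LaTeX
        tex_file = pathlib.Path("temp") / (xml_file.stem + ".tex")
        tex_file.write_text(str(transform(etree.parse(str(xml_file)))), encoding="utf-8")
        # LaTeX -> PDF
        subprocess.run(["lualatex", "--output-directory=output", str(tex_file)], check=True)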
A quite simple homegrown DTD. Making it as simple as possible keeps the complexity of the XSLT scripts low. Most of it is similar to HTML (<paragraph>, <italics>, <bold>) with a few special attributes to add some semantics or processing hints.[1] Sometimes I use several intermediate steps to produce the final HTML. It can then be useful to version the DTDs with a fixed, required version attribute in the root element, which must appear in the XML files, to avoid applying the wrong XSLT script to an outdated version of my XML sources.[2]
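A single-value enumeration can enforce exactly that, both requiring the attribute and pinning its value (root element name and version here are just examples):

    <!ATTLIST report version (1.4) #REQUIRED>

A validating parser then rejects any source file whose root element omits the attribute or carries a different value.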
For a customer who needed to import large semi-structured legacy Word documents from another company into a database system, I once implemented the following process: The Word documents were converted to a relatively simple homegrown XML format based on the structural elements of the Word documents. The resulting XML documents were manually corrected where the structural elements were incorrect. Some special attributes were added inside the XML documents to associate text passages with already existing database keys. When this was finished, an XSLT script was applied that split the large XML file into smaller ones based on these database keys; a human-readable prefix, the key and a date went into the file name. These files were converted in bulk to LaTeX and then to PDF. Afterwards, I used a little tool to bulk-upload only the fresh PDFs into the correct database entries based on the keys in their filenames.
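That last step is mostly filename parsing; roughly like this, where the naming scheme and upload_pdf() are hypothetical stand-ins:

    import pathlib
    import re

    def upload_pdf(key: str, path: pathlib.Path) -> None:
        ...  # hypothetical stand-in for the actual database upload

    # e.g. "contract_4711_2008-03-15.pdf" -> database key "4711"
    pattern = re.compile(r"^[a-z]+_(?P<key>\d+)_\d{4}-\d{2}-\d{2}\.pdf$")

    for pdf in pathlib.Path("output").glob("*.pdf"):
        match = pattern.match(pdf.name)
        if match:
            upload_pdf(match.group("key"), pdf)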
For one of my side-projects, a C# application, I am using another, object-oriented approach, where I have an abstract base class for reporting and two derived classes, one that outputs HTML and one that outputs LaTeX. The LaTeX output is then fed into lualatex to produce PDFs. You can check out the free Herodotus edition of my (closed-source) Factonaut project at https://www.factonaut.com/ to see it in action.
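The pattern, sketched here in Python rather than C# and not the actual Factonaut code, is roughly:

    from abc import ABC, abstractmethod

    class ReportRenderer(ABC):
        # one abstract method per structural element the reports need
        @abstractmethod
        def heading(self, text: str) -> str: ...

        @abstractmethod
        def paragraph(self, text: str) -> str: ...

    class HtmlRenderer(ReportRenderer):
        def heading(self, text: str) -> str:
            return f"<h1>{text}</h1>"

        def paragraph(self, text: str) -> str:
            return f"<p>{text}</p>"

    class LatexRenderer(ReportRenderer):
        def heading(self, text: str) -> str:
            return f"\\section{{{text}}}"  # the output goes through lualatex later

        def paragraph(self, text: str) -> str:
            return text + "\n\n"

The report logic only talks to the base class, so adding a third output format means adding one more subclass.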
[1] Using parameter entities for re-usability, such as
<!ENTITY % output_attr SYSTEM "output_attr.ent">
<!ATTLIST foo %output_attr; >
<!ATTLIST bar %output_attr; >
in the DTD, referring to an output_attr.ent file that contains the shared attribute definition, for example (attribute name and values here are illustrative, since the DTD is homegrown):
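    <!-- illustrative; the actual attribute definitions depend on the DTD -->
    output (pdf|html|all) "all"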
This is interesting. Is there some kind of extraction of the PDF-generating part of Inkscape, so you do not have to start Inkscape all over again to generate PDF files? Or is there a good and fast command line tool that can produce PDFs from Inkscape SVG files?
Inkscape's format is just SVG, so any tool that can do SVG-to-PDF conversion will do. I've seen cairosvg, ImageMagick and rsvg used for this, but if your deployment allows it, it does make the most sense to use Inkscape from the CLI to avoid any inconsistencies:
$ inkscape rendered.svg --export-pdf=output.pdf
I think there's also a way to do batch processing where you don't need to spin up a new inkscape process for each file (which takes time), but I don't remember how it works anymore.
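One thing worth knowing: with Inkscape 1.0 the export flags changed, so on current versions the equivalent command is

    $ inkscape rendered.svg --export-type=pdf --export-filename=output.pdf

and "inkscape --shell" starts a mode that reads commands from stdin, so many files can be exported from a single Inkscape process; that may be the batch mode referred to above.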
Depending on your needs, Scribus might help with its scripting API[1]; see ScribusGenerator[2] for example.
Scribus and Inkscape can import PDFs, but it would be preferable to use a clean "source" format; PDF is, AFAIK, meant for output, like lossy compressed images.
That's not a great idea. PDFs are a very "final presentation" format. They don't even really have a strict concept of a block of text. There are even tools which will happily put the separate letters in various places and call it a day. I've done something like that (just inserting short content + a signature into a PDF) and do not want to touch PDF editing with a 10ft pole ever again. To preserve sanity, use a different format that actually understands the structure of the content, one step before the PDF is generated.
The problem with having a PDF file as template is that it gets very hard to define how new pages should be created whenever there is too much content to fit on the existing pages. E.g. line items in an invoice template.
What you can do is to use a PDF file for all the static content that appears on each page.
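A sketch of that overlay approach with reportlab and pypdf (file names, coordinates and text are placeholders):

    from io import BytesIO

    from pypdf import PdfReader, PdfWriter
    from reportlab.pdfgen import canvas

    # draw the variable content on an otherwise blank page
    buf = BytesIO()
    c = canvas.Canvas(buf)
    c.drawString(72, 700, "Invoice line items go here")
    c.save()

    # lay the generated content over the static template page
    template_page = PdfReader("static_template.pdf").pages[0]
    template_page.merge_page(PdfReader(buf).pages[0])

    writer = PdfWriter()
    writer.add_page(template_page)
    with open("merged.pdf", "wb") as f:
        writer.write(f)

For multi-page output you repeat this per page, reusing the static page each time, which sidesteps the pagination problem described above.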
Amazing share on the tooling side of things for PDF processing.
More on PDF and why it's hard to do analysis [1]. TL;DR:

"PDF was never really designed as a data input format, but rather, it was designed as an output format giving fine grained control over the resulting document."
By the way, the PDF 1.4 specification (the most widely used version) is over 1000 pages![2] To make it even harder, not everybody follows the spec :)
Yeah I think 95% of the time, the correct way to fix a problem with a PDF is to go back to the tool that generated it, and fix the issue in the input. But, sometimes non-ideal circumstances intervene.
Yes, until very recently, if what Adobe Acrobat/Reader did was different from the spec, then the spec was wrong. It's very hard to tell a client that their document is corrupted when they can read it just fine in Adobe.
I've been looking for a tool that lets you mark up PDF documents with bounding boxes around text and add metadata (like a section number, paragraph number, page number) that you can use for downstream processing and for extracting the text from particular areas. Kind of like a reverse LaTeX document structure.
The closest I've come in the past is manually setting x/y coordinates to do crops in pdftotext, which was pretty time consuming.
I'm sure it wouldn't be too difficult (famous last words) to use annotation objects with structure codes or some such, but I'm surprised that even after all these years there isn't something that lets you do this more simply.
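In the meantime, the "extract text from particular areas" half can at least be scripted, e.g. with pdfplumber (labels and coordinates here are made up):

    import pdfplumber

    # bounding boxes as (x0, top, x1, bottom) in PDF points, tagged with
    # whatever downstream metadata you need
    regions = {"section 1, paragraph 2, page 1": (50, 100, 550, 180)}

    with pdfplumber.open("input.pdf") as pdf:
        page = pdf.pages[0]
        for label, bbox in regions.items():
            print(label, page.within_bbox(bbox).extract_text())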
One interesting thing to note about PDF libraries is that there seems to be a general trend towards A/GPL licenses, presumably because working with PDFs is tedious but they are still the standard for conveying data intended for human consumption/archival.
The commercial licensing for some of these is a little hairy, too. I've seen rev share clauses which is, uh, hilarious.
Depending on what you're doing, check your licenses!
When it comes up for me, I've usually been able to use the mentioned pdftotext from poppler-utils, with just one tweak that isn't mentioned in the article: for some documents, adding the "-layout" option gets the text into a state where I can parse it more easily.
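For reference, that looks like (file names are placeholders):

    $ pdftotext -layout input.pdf output.txt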
For my use case, I ended up converting the PDF to a single PNG image and then running Amazon Textract on it. This allows me to easily convert PDF tables into CSV files, all from within PHP. I would love to find a cheaper (local) option vs AWS, but this works.
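One possibly cheaper, local route, assuming a Java runtime is acceptable: Tabula (mentioned at the top of the thread) has a Python wrapper, tabula-py, that can do the table-to-CSV step offline (file names are placeholders):

    import tabula

    # extract all tables from all pages into a single CSV, locally
    tabula.convert_into("input.pdf", "output.csv", output_format="csv", pages="all")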
PDF processing seems like a security minefield. What are folks doing to mitigate that problem prior to (or as part of) processing? Or as part of any system that accepts PDFs with the intent that they’re shared with other systems and users.
If you're looking for a GUI, the Okular document viewer (by KDE) can sign them just fine, as can LibreOffice (but I don't like how it handles key storage on Linux).
For something much more powerful (where you can tweak every possible signature parameter), take a look at jsignpdf - the UI isn't exactly friendly, but it works more reliably than anything else and I've never seen a reader that didn't accept its signature (Adobe Reader likes to sometimes ignore signatures from other tools for no good reason, but never jsignpdf).
As for libraries, you can use jsignpdf from the CLI, pyHanko for Python, and probably others that I don't know of.
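A minimal pyHanko signing sketch, assuming a PEM key/cert pair (file and field names are placeholders):

    from pyhanko.pdf_utils.incremental_writer import IncrementalPdfFileWriter
    from pyhanko.sign import signers

    signer = signers.SimpleSigner.load("key.pem", "cert.pem")

    with open("doc.pdf", "rb") as inf, open("doc-signed.pdf", "wb") as outf:
        writer = IncrementalPdfFileWriter(inf)
        signers.sign_pdf(
            writer,
            signers.PdfSignatureMetadata(field_name="Signature1"),
            signer=signer,
            output=outf,
        )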
Verifying digital signatures is quite hard. Yeah, checking whether the stored hash is valid is simple, but all the other things are hard: for example, whether the additions made to a signed PDF are actually allowed or not.
Or whether the signature is correct in all the small details, e.g. the algorithm used and the information included.
And then it sometimes depends on the environment. For example, sometimes a digital signature is only considered valid if all the revocation information and all the certificates are included in the PDF.
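pyHanko's validation API gives an idea of how much is involved; a rough sketch, assuming a trust root in ca-cert.pem:

    from pyhanko.keys import load_cert_from_pemder
    from pyhanko.pdf_utils.reader import PdfFileReader
    from pyhanko.sign.validation import validate_pdf_signature
    from pyhanko_certvalidator import ValidationContext

    vc = ValidationContext(trust_roots=[load_cert_from_pemder("ca-cert.pem")])

    with open("doc-signed.pdf", "rb") as f:
        sig = PdfFileReader(f).embedded_signatures[0]
        status = validate_pdf_signature(sig, vc)
        print(status.pretty_print_details())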
I'm still lamenting the demise of the "Impress!ve" PDF presentation tool. It was super cool. Currently unmaintained and spitting errors, unfortunately.
0: https://tabula.technology/