I noticed that you can use Tesseract as an OCR adapter for rga. Tesseract is wri...

mouldysammich · on Dec 2, 2020

https://github.com/tesseract-ocr/tesseract seems to be written in C++, not Python.

faitswulff · on Dec 2, 2020

Ah, my mistake then.

hobofan · on Dec 2, 2020

I don't think the problem necessarily is that Tesseract is slow, but that the whole process of rendering a PDF to a series of PNGs on which you can then run OCR is slow (which is what it does in the background).
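To make the cost concrete, here is a minimal sketch of that render-then-OCR pipeline. It assumes the `pdftoppm` (poppler-utils) and `tesseract` command-line tools are installed. The function names are made up for illustration, and this is not necessarily how rga's adapter does it internally.

```python
import subprocess
import tempfile
from pathlib import Path


def render_cmd(pdf: Path, prefix: Path, dpi: int = 300) -> list[str]:
    # pdftoppm writes one PNG per page, named <prefix>-<page number>.png
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image: Path) -> list[str]:
    # "stdout" as the output base makes tesseract print the recognized text
    return ["tesseract", str(image), "stdout"]


def ocr_pdf(pdf: Path, dpi: int = 300) -> str:
    """Rasterize every page, then OCR each page image in order."""
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(render_cmd(pdf, Path(tmp) / "page", dpi), check=True)
        texts = []
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order
        for image in sorted(Path(tmp).glob("page-*.png")):
            result = subprocess.run(
                ocr_cmd(image), check=True, capture_output=True, text=True
            )
            texts.append(result.stdout)
    # Join pages with a form feed so page boundaries stay visible
    return "\f".join(texts)
```

Every page costs one rasterization plus one Tesseract process, so the total time grows with the page count no matter how fast the OCR engine itself is.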

undebuggable · on Dec 2, 2020

Converting every page to a raster image and then OCR-ing each one takes hours for PDFs that are hundreds of pages long. This workflow is not suitable for instant search. For PDFs without an OCR text layer, it's worth pregenerating the text.

hobofan · on Dec 2, 2020

That's why rga comes with a cache. I've occasionally used the Tesseract adapter with good success (results-wise), and after the initial rendering and indexing, it's fast enough to use.
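For illustration, the general idea behind such a cache can be sketched as below. This is not rga's actual implementation: the `cached_text` helper, the content-hash key, and the one-file-per-entry layout are all assumptions made for the example.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_text(path: Path, cache_dir: Path, extract: Callable[[Path], str]) -> str:
    """Return the extracted text for `path`, running `extract` only on a cache miss."""
    # Key the entry by file content, so a renamed file still hits the cache
    # and a modified file is re-extracted (assumed scheme, not rga's).
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    entry = cache_dir / f"{digest}.txt"
    if entry.exists():
        return entry.read_text(encoding="utf-8")
    text = extract(path)  # the slow step, e.g. render + OCR
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry.write_text(text, encoding="utf-8")
    return text
```

The first search over a scanned PDF pays the full OCR cost. Later searches only hash the file and read a small text file, which is why the slow step is tolerable in practice.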