
I noticed that you can use Tesseract as an OCR adapter for rga. Tesseract is written in Python, IIRC, and in the OP it comes with a warning that it's slow and not enabled by default. Are there any other fast, reliable OCR libs out there? Or any Rust OCR backends?


https://github.com/tesseract-ocr/tesseract seems to be written in C++, not Python.


Ah, my mistake then.


I don't think the problem necessarily is that Tesseract is slow, but that the whole process of rendering a PDF to a series of PNGs on which you can then run OCR is slow (which is what it does in the background).
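
To make the render-then-OCR pipeline concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not rga's actual adapter code: it assumes the `pdftoppm` (poppler-utils) and `tesseract` command-line tools are installed, and the helper names `render_command`, `ocr_command`, and `ocr_pdf` are made up for this example.

```python
import subprocess
import tempfile
from pathlib import Path


def render_command(pdf_path, out_prefix, dpi=300):
    # pdftoppm writes one PNG per page, named <out_prefix>-<page>.png
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def ocr_command(image_path):
    # Passing "stdout" as the output base makes tesseract print the text
    return ["tesseract", str(image_path), "stdout"]


def page_number(image_path):
    # "page-007.png" -> 7, so pages sort numerically regardless of padding
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, dpi=300):
    """Rasterize every page, then OCR each page image in page order."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(render_command(pdf_path, prefix, dpi), check=True)
        texts = []
        for image in sorted(Path(tmp).glob("page-*.png"), key=page_number):
            result = subprocess.run(
                ocr_command(image), check=True, capture_output=True, text=True
            )
            texts.append(result.stdout)
    return "\n".join(texts)
```

Both stages run once per page, so the total time grows with the page count, and the rasterization step is paid even before any OCR starts.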


The process of converting all pages to raster images and then OCR-ing each one takes hours for PDFs hundreds of pages long. This workflow is not suitable for instant search. For PDFs that have not been OCR-ed, it's worth pregenerating the text.


That's why rga comes with a cache. I've occasionally used the Tesseract adapter with good success (results-wise), and after the initial rendering and indexing, it's fast enough to use.
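
The caching idea can be sketched in a few lines of Python. This is only an illustration of why the second search is fast, not a description of how rga keys or stores its cache: the content-hash key, the one-text-file-per-document layout, and the names `file_digest` and `cached_text` are all assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    # Hash the file contents in chunks so large PDFs need not fit in memory
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def cached_text(path, extract, cache_dir):
    """Return extract(path), reusing a stored result when the contents match."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / (file_digest(path) + ".txt")
    if entry.exists():
        # Cache hit: skip the slow extraction entirely
        return entry.read_text(encoding="utf-8")
    text = extract(path)  # slow path, e.g. rendering plus OCR
    entry.write_text(text, encoding="utf-8")
    return text
```

Here `extract` is any slow text extractor supplied by the caller. The first call pays the full cost; later calls on an unchanged file only hash it and read the stored text.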



