
How accurate is table detection/parsing in PDFs? I found this part the most challenging, and none of the open-source PDF parsers worked well.


Author here. We optionally integrate UniTable, which represents the current state of the art in table recognition. Camelot / Tabula use much simpler, traditional extraction techniques.

UniTable itself has shockingly good accuracy, although we're still working on better table detection, which sometimes negatively affects results.


Is this the UniTable you mentioned? https://github.com/poloclub/unitable


I've been using Camelot, which builds on the lower-level Python PDF libraries, to extract tables from PDFs. I haven't tried anything exotic, but it seems to work. The tables I parse tend to be full-page or the most dominant element on the page.

https://camelot-py.readthedocs.io/en/master/

I like Camelot because it gives me back pandas DataFrames. I don't want markdown; I can make that from a DataFrame if needed.
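That workflow is easy to sketch: Camelot's `read_pdf` returns a list of tables, each exposing a raw pandas DataFrame via `.df`. The sketch below skips the actual `camelot.read_pdf` call (it needs a real PDF) and instead shows the post-processing step on a hypothetical extracted table, since Camelot hands back string cells that still need cleaning:

```python
import pandas as pd

# With Camelot this DataFrame would come from, e.g.:
#   tables = camelot.read_pdf("report.pdf", pages="1")
#   df = tables[0].df
# Here we use a hypothetical stand-in with the raw string cells
# Camelot typically returns.
df = pd.DataFrame(
    {"Region": ["North", "South"], "Revenue": ["£243,234", "£112,050"]}
)

# Strip the currency symbol and thousands separators, then convert
# to integers so the column is usable for analysis.
df["Revenue"] = (
    df["Revenue"]
    .str.replace("£", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)

print(df["Revenue"].sum())  # 355284
```

From there, markdown or CSV output is one call away (`df.to_csv(...)`), which is the appeal of getting DataFrames back rather than pre-rendered markdown.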


Have you checked Surya?


I did, and I had issues when tables had mixed text and numbers.

Example:

£243,234 would come out as £234,

Or £243 234

Or £243,234 (correct).

Some cells weren't even detected.
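Errors like these are easy to flag after extraction with a simple validity check. A minimal sketch (the `check_cell` helper and the regex are hypothetical, not part of any of the tools discussed) that accepts well-formed GBP amounts and rejects the truncated and space-mangled variants above:

```python
import re

# Matches well-formed GBP amounts: "£" followed by 1-3 digits,
# then zero or more comma-separated groups of exactly 3 digits.
CURRENCY = re.compile(r"£\d{1,3}(,\d{3})*")

def check_cell(cell: str) -> bool:
    """Return True if the extracted cell is a well-formed currency value."""
    return CURRENCY.fullmatch(cell.strip()) is not None

print(check_cell("£243,234"))  # True  (correct extraction)
print(check_cell("£234,"))     # False (truncated)
print(check_cell("£243 234"))  # False (comma turned into a space)
```

A check like this won't recover the right value, but it tells you which cells need re-extraction or manual review rather than silently corrupting downstream numbers.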



It worked 100% of the time for me.


Which software?



