
How accurate is table detection/parsing in PDFs? I found this part the most challenging, and none of the open-source PDF parsers worked well.


Author here. We optionally integrate UniTable, which represents the current state of the art in table recognition. Camelot / Tabula use much simpler, traditional extraction techniques.

UniTable itself has shockingly good accuracy, although we're still working on better table detection, which sometimes negatively affects results.


Is this the UniTable you mentioned? https://github.com/poloclub/unitable


I've been using Camelot, which builds on the lower-level Python PDF libraries, to extract tables from PDFs. I haven't tried anything exotic, but it seems to work. The tables I parse tend to be full-page or the most dominant element on the page.

https://camelot-py.readthedocs.io/en/master/

I like Camelot because it gives me back pandas DataFrames. I don't want markdown; I can make that from a DataFrame if needed.
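That workflow is easy to sketch: Camelot's `read_pdf` returns a list of tables, each exposing a raw pandas DataFrame via `.df`. The sketch below skips the actual `camelot.read_pdf` call (it needs a real PDF) and instead shows the post-processing step on a hypothetical extracted table, since Camelot hands back string cells that still need cleaning:

```python
import pandas as pd

# With Camelot this DataFrame would come from, e.g.:
#   tables = camelot.read_pdf("report.pdf", pages="1")
#   df = tables[0].df
# Here we use a hypothetical stand-in with the raw string cells
# Camelot typically returns.
df = pd.DataFrame(
    {"Region": ["North", "South"], "Revenue": ["£243,234", "£112,050"]}
)

# Strip the currency symbol and thousands separators, then convert
# to integers so the column is usable for analysis.
df["Revenue"] = (
    df["Revenue"]
    .str.replace("£", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(int)
)

print(df["Revenue"].sum())  # 355284
```

From there, markdown or CSV output is one call away (`df.to_csv(...)`), which is the appeal of getting DataFrames back rather than pre-rendered markdown.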


Have you checked Surya?


I did, and I had issues when tables had mixed text and numbers.

Example:

£243,234 would come out as £234,

Or £243 234

Or £243,234 (correct).

Some cells weren't even detected.
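Errors like these are easy to flag after extraction with a simple validity check. A minimal sketch (the `check_cell` helper and the regex are hypothetical, not part of any of the tools discussed) that accepts well-formed GBP amounts and rejects the truncated and space-mangled variants above:

```python
import re

# Matches well-formed GBP amounts: "£" followed by 1-3 digits,
# then zero or more comma-separated groups of exactly 3 digits.
CURRENCY = re.compile(r"£\d{1,3}(,\d{3})*")

def check_cell(cell: str) -> bool:
    """Return True if the extracted cell is a well-formed currency value."""
    return CURRENCY.fullmatch(cell.strip()) is not None

print(check_cell("£243,234"))  # True  (correct extraction)
print(check_cell("£234,"))     # False (truncated)
print(check_cell("£243 234"))  # False (comma turned into a space)
```

A check like this won't recover the right value, but it tells you which cells need re-extraction or manual review rather than silently corrupting downstream numbers.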



It worked 100% of the time for me.


Which software?



