Python - parse tables from a PDF document

August 15, 2013

The PDF at this link (http://www.lenovo.com/psref/pdf/psref450.pdf) contains a number of tables.

I'd like to programmatically extract the data and structure of these tables.

Things I've tried: converting the PDF to HTML using

Tika: unfortunately, the tables are converted to space-delimited paragraphs, and since some of the strings contain spaces, it's not possible to split them.
Python's pdfminer: it returned an assertion error due to missing fonts. I suspect the HTML would have been similar to the output from Tika, though I'll need to resolve the missing-fonts issue to confirm this.
Online tools: I tried http://www.zamzar.com/ and a couple of others. The file was either too big to process (for the online services) or generated errors.

I was planning to convert the PDF to HTML and then parse it with BeautifulSoup.

The output can be JSON (e.g. 1 object per table), XML, or any pretty format that maintains the structure.
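To make the JSON option concrete, here is a minimal sketch of one possible shape, with one object per table. It assumes the first extracted row holds the column headers, which the PDF may not guarantee. The function name `table_to_json` and the sample values are made up for illustration.

```python
import json

def table_to_json(name, rows):
    """Serialize one table as a JSON object.

    Assumption: rows[0] is the header row; every later row is a data row.
    """
    header, *body = rows
    records = [dict(zip(header, row)) for row in body]
    return json.dumps({"table": name, "rows": records}, indent=2)

# Made-up sample rows, not taken from the actual PDF.
sample_rows = [
    ["Name", "Value"],
    ["alpha", "beta gamma"],
]
sample_json = table_to_json("example", sample_rows)
print(sample_json)
```

Keeping each cell as its own string is what avoids the space-splitting problem seen with the Tika output.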

You could try PDFBox. The documentation is here:

https://pdfbox.apache.org/1.8/cookbook/textextraction.html

Extend org.apache.pdfbox.pdfviewer.pdfpagedrawer and override the strokePath method. From there you can intercept the drawing operations for horizontal and vertical line segments and use that information to determine the column and row positions. You can then set up text regions to determine which numbers/letters/characters are drawn in which region. Since you know the layout of the regions is tabular, you'll be able to define the tables and tell which column and row each piece of extracted text belongs to using simple algorithms.
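The "simple algorithms" part can be sketched independently of PDFBox. The Python sketch below assumes you have already collected the line segments as `(x0, y0, x1, y1)` tuples and the text fragments as `(x, y, string)` tuples, with y increasing downward. That coordinate convention, the tolerance value, and all function names are assumptions for illustration, not part of any PDFBox API. Vertical lines become column edges, horizontal lines become row edges, and each fragment is dropped into the cell whose edges enclose its position.

```python
from bisect import bisect_right

def cluster(values, tol=1.0):
    """Collapse nearly equal coordinates into one representative each."""
    merged = []
    for v in sorted(values):
        if not merged or v - merged[-1] > tol:
            merged.append(v)
    return merged

def build_grid(segments, tol=1.0):
    """Return (column_edges, row_edges) from axis-aligned segments."""
    xs, ys = [], []
    for x0, y0, x1, y1 in segments:
        if abs(x0 - x1) <= tol:      # vertical line: marks a column edge
            xs.append(x0)
        elif abs(y0 - y1) <= tol:    # horizontal line: marks a row edge
            ys.append(y0)
    return cluster(xs, tol), cluster(ys, tol)

def locate(edges, value):
    """Index i with edges[i] <= value < edges[i + 1], or None if outside."""
    i = bisect_right(edges, value) - 1
    return i if 0 <= i < len(edges) - 1 else None

def fill_table(segments, texts, tol=1.0):
    """Place each (x, y, string) fragment into its grid cell."""
    cols, rows = build_grid(segments, tol)
    table = [["" for _ in range(len(cols) - 1)] for _ in range(len(rows) - 1)]
    for x, y, s in texts:
        c, r = locate(cols, x), locate(rows, y)
        if c is not None and r is not None:
            # Fragments in the same cell are joined in the order given.
            table[r][c] = (table[r][c] + " " + s).strip()
    return table

# Made-up 2x2 grid: three vertical and three horizontal ruling lines.
segments = [
    (0, 0, 0, 40), (50, 0, 50, 40), (100, 0, 100, 40),
    (0, 0, 100, 0), (0, 20, 100, 20), (0, 40, 100, 40),
]
texts = [
    (10, 5, "Name"), (60, 5, "Value"),
    (10, 25, "alpha"), (60, 25, "beta"), (75, 25, "gamma"),
]
table = fill_table(segments, texts)
print(table)
```

Because cells are found from the ruling lines rather than from whitespace, strings that contain spaces stay in a single cell. Tables drawn without ruling lines would need a different way to find the edges.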
