Python - parse tables from a PDF document
The PDF at this link (http://www.lenovo.com/psref/pdf/psref450.pdf) contains a number of tables.
I'd like to programmatically extract the data and structure of these tables.
Things I've tried: converting the PDF to HTML using
- Tika: unfortunately, the tables were converted to space-delimited paragraphs, and since some of the strings contain spaces, it's not possible to split them.
- Python's pdfminer: it returned an assertion error due to missing fonts. I suspect the HTML would have been similar to the output from Tika, though I'll need to resolve the missing-fonts issue to confirm this.
- Online tools: I tried http://www.zamzar.com/ and a couple of others. The file was either too big to process (for the online services) or generated errors.
I was planning to convert the PDF to HTML and parse it with BeautifulSoup.
The output could be JSON (e.g. one object per table), XML, or any other format that maintains the structure.
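To make the target format concrete, here is a minimal sketch of what "one object per table" could look like. It assumes the first extracted row is the header; the function name `table_to_object` and the sample values are made up for illustration and do not come from the PDF.

```python
import json


def table_to_object(name, rows):
    """Turn a list of rows (first row = header) into one JSON-ready object."""
    header, *body = rows
    return {
        "table": name,
        # Pair each body row with the header cells to keep the structure.
        "rows": [dict(zip(header, row)) for row in body],
    }


# Invented sample data, only to show the shape of the output.
sample = [["Model", "CPU"], ["A", "x1"], ["B", "x2"]]
print(json.dumps([table_to_object("example", sample)], indent=2))
```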
You could try PDFBox. The documentation is here:
https://pdfbox.apache.org/1.8/cookbook/textextraction.html
Extend org.apache.pdfbox.pdfviewer.pdfpagedrawer and override the strokePath method. There you can intercept the drawing operations for horizontal and vertical line segments and use that information to determine the column and row positions. You can then set up text regions to determine which numbers/letters/characters are drawn in which region. Since you know the layout of the regions is tabular, you'll be able to define the tables and tell which column and row each piece of extracted text belongs to using simple algorithms.
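The "simple algorithms" part could look roughly like the sketch below. It is plain Python rather than PDFBox code, and it assumes you have already collected the stroked line segments as `(x0, y0, x1, y1)` tuples and the text fragments as `(x, y, text)` tuples, with the origin at the top left and y increasing downward (native PDF coordinates have y increasing upward, so you may need to flip them first). The function names, the tolerance value, and the sample data are all hypothetical.

```python
from bisect import bisect_right


def cluster(values, tol=2.0):
    """Collapse nearly equal coordinates into a sorted list of boundaries."""
    merged = []
    for v in sorted(values):
        if merged and v - merged[-1] <= tol:
            continue  # same ruling line drawn twice or slightly offset
        merged.append(v)
    return merged


def grid_from_segments(segments, tol=2.0):
    """Return (column_edges, row_edges) from axis-aligned line segments."""
    xs, ys = [], []
    for x0, y0, x1, y1 in segments:
        if abs(x0 - x1) <= tol:      # vertical segment -> column boundary
            xs.append((x0 + x1) / 2)
        elif abs(y0 - y1) <= tol:    # horizontal segment -> row boundary
            ys.append((y0 + y1) / 2)
    return cluster(xs, tol), cluster(ys, tol)


def cell_index(edges, coord):
    """Index of the interval between edges that contains coord, or None."""
    i = bisect_right(edges, coord) - 1
    return i if 0 <= i < len(edges) - 1 else None


def build_table(segments, words, tol=2.0):
    """Assign each text fragment to a (row, column) cell of the grid."""
    cols, rows = grid_from_segments(segments, tol)
    table = [[[] for _ in range(len(cols) - 1)] for _ in range(len(rows) - 1)]
    for x, y, text in words:
        r, c = cell_index(rows, y), cell_index(cols, x)
        if r is not None and c is not None:
            table[r][c].append(text)
    return [[" ".join(cell) for cell in row] for row in table]


# Invented sample: a 2x2 grid with five text fragments.
segments = [
    (0, 0, 100, 0), (0, 20, 100, 20), (0, 40, 100, 40),  # horizontal rules
    (0, 0, 0, 40), (50, 0, 50, 40), (100, 0, 100, 40),   # vertical rules
]
words = [(5, 10, "Model"), (55, 10, "CPU"),
         (5, 30, "A"), (55, 30, "Core"), (70, 30, "i5")]
print(build_table(segments, words))
```

Fragments that fall outside the outermost rules are dropped, and fragments in the same cell are joined in the order they were given, so sort them by position first if the extraction order is not reliable.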