ocr - How to preserve document structure in tesseract
I am using tesseract OCR to extract text from an image. Preserving the structure of the document is important to me. Tesseract does not preserve the structure; in fact, it changes the order of the text. The input image is below.
And the output I am getting is as follows:
someto left someto left in middle in middle some tab some tab some space between them some space between them sometext here sometext here this
How can I get the desired output, with the same structure as in the image?
I.e. as follows:
text here text here left left in middle in middle some tab some tab some space between them some space between them
Newer versions of tesseract (3.04) have an option called preserve_interword_spaces, which should do what you want.
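As a minimal sketch of how the option might be passed, the Python snippet below builds a tesseract command line that sets preserve_interword_spaces through the `-c configvar=value` flag. The function names and the file names in the usage note are hypothetical, and the sketch assumes the `tesseract` executable is on the PATH and writes its result to `<output_base>.txt`.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base):
    """Return the argv list for a tesseract run that keeps interword spaces."""
    return [
        "tesseract",
        str(image_path),
        str(output_base),
        # -c sets a single config variable for this run.
        "-c",
        "preserve_interword_spaces=1",
    ]


def ocr_preserving_spaces(image_path, output_base):
    """Run tesseract and return the recognized text.

    Returns None when no tesseract executable is found on the PATH.
    """
    if shutil.which("tesseract") is None:
        return None
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    # tesseract appends ".txt" to the output base for plain-text output.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Calling `ocr_preserving_spaces("page.png", "page_out")` would then run the equivalent of `tesseract page.png page_out -c preserve_interword_spaces=1` and read back `page_out.txt`.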
Note that the number of spaces tesseract detects between words may not be the same between similar lines. So words that are left-aligned with a run of spaces preceding them (as in your example) may not be output that way. The preserve_interword_spaces option does not attempt anything fancy; it merely preserves the spaces that the extraction found. By default, tesseract collapses runs of spaces into one.
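To make the caveat concrete, here is a small illustrative sketch in Python. The two sample lines are made up, not real tesseract output: they carry the same two columns but with gaps of different widths. `collapse_spaces` mimics the default behavior of collapsing runs of spaces into one, and `split_columns` is one possible post-processing step (an assumption, not something the option does for you) that splits on runs of two or more spaces instead of relying on exact character offsets.

```python
import re

# Hypothetical OCR lines: same columns, but different gap widths.
line_a = "some tab      some tab"
line_b = "some tab    some tab"


def collapse_spaces(line):
    """Mimic the default output: every run of spaces becomes one space."""
    return re.sub(r" {2,}", " ", line)


def split_columns(line):
    """Split a line into columns wherever two or more spaces occur."""
    return re.split(r" {2,}", line.strip())


print(collapse_spaces(line_a))  # column boundary is no longer visible
print(split_columns(line_a))
print(split_columns(line_b))
```

Both sample lines split into the same two columns even though their gaps differ, which is why splitting on space runs tends to be more robust than comparing column positions across lines.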
More details on this option are here.