ocr - How to preserve document structure in Tesseract


I am using Tesseract OCR to extract text from an image. Preserving the structure of the document is important to me. Tesseract does not preserve the structure; in fact, it changes the order of the text. The input image is below.

input

The output I am getting is as follows:

someto left someto left  in middle in middle  some tab some tab  some space between them some space between them  sometext here sometext here  this 

How can I get the desired output with the same structure as in the image?

i.e. as follows:

                                                 text here                                                  text here  left left                      in middle                     in middle          some tab         some tab  some space between them                       some space between them                       

Newer versions of Tesseract (3.04) have an option called preserve_interword_spaces, which should do what you want.
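As a rough sketch of how the option might be passed, the snippet below builds a Tesseract command line that sets preserve_interword_spaces through the `-c` config-variable flag and runs it from Python. The helper names and the file name `input.png` are placeholders invented for illustration, and it assumes the `tesseract` binary is installed and on the PATH.

```python
import subprocess


def build_tesseract_command(image_path, output_base="stdout"):
    """Return the argument list for a Tesseract call that keeps inter-word spaces.

    Hypothetical helper for illustration; only the CLI arguments matter.
    """
    return [
        "tesseract",
        image_path,
        output_base,  # "stdout" prints the text instead of writing a .txt file
        "-c",
        "preserve_interword_spaces=1",
    ]


def ocr_preserving_spaces(image_path):
    """Run Tesseract and return the recognized text.

    Assumes the tesseract binary is on PATH; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        build_tesseract_command(image_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Example call (placeholder file name):
# print(ocr_preserving_spaces("input.png"))
```

The equivalent shell invocation would be `tesseract input.png stdout -c preserve_interword_spaces=1`.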

Note that the number of spaces Tesseract detects between words may not always be the same between similar lines. So words that are left-aligned with a run of spaces preceding them (as in your example) may not be output that way. The preserve_interword_spaces option does not attempt to do anything fancy; it merely preserves the spaces that the extraction found. By default, Tesseract collapses runs of spaces into one.
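To make the difference concrete, here is a small illustration of what collapsing runs of spaces does to a line. This only mimics the visible effect on a string; it is not Tesseract's own code, and the sample line is made up to resemble the example in the question.

```python
import re


def collapse_spaces(line):
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", line)


# A line as it might look with the detected spaces preserved (made-up sample).
preserved = "left          in middle          some tab"

# The same line as it would look with runs of spaces collapsed to one.
collapsed = collapse_spaces(preserved)

print(repr(preserved))
print(repr(collapsed))
```

With spaces preserved, the column layout survives only as far as the detected spacing is consistent from line to line, which is why the output may still not align perfectly.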

Details on the preserve_interword_spaces option are here.

