ocr - How to preserve document structure in tesseract

August 15, 2010

I am using Tesseract OCR to extract text from an image. Preserving the structure of the document is important to me. Tesseract does not preserve the structure; in fact, it changes the order of the text. The input image is below.

input

And the output I am getting is as follows:

someto left someto left  in middle in middle  some tab some tab  some space between them some space between them  sometext here sometext here  this

How can I get the desired output, with the same structure as in the image?

i.e. as follows:

                                                 text here                                                  text here  left left                      in middle                     in middle          some tab         some tab  some space between them                       some space between them

Newer versions of Tesseract (3.04) have an option called preserve_interword_spaces, which should do what you want.
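As a rough sketch of how the option might be switched on, the snippet below builds a tesseract command line that sets preserve_interword_spaces through the `-c` config-variable flag and then runs it. The helper names (`build_tesseract_command`, `run_ocr`) and the file names are made up for illustration, and it assumes a `tesseract` binary is available on the PATH.

```python
import subprocess


def build_tesseract_command(image_path, output_base):
    """Return the argument list for a tesseract run that keeps inter-word spaces.

    Hypothetical helper: only the -c flag and the variable name come from tesseract.
    """
    return [
        "tesseract",
        image_path,
        output_base,  # tesseract appends ".txt" to this base name
        "-c", "preserve_interword_spaces=1",
    ]


def run_ocr(image_path, output_base):
    """Run tesseract and return the recognised text (assumes tesseract is installed)."""
    subprocess.run(build_tesseract_command(image_path, output_base), check=True)
    with open(output_base + ".txt", encoding="utf-8") as f:
        return f.read()


# Example usage (file names are placeholders):
# print(run_ocr("input.png", "output"))
```

Building the argument list separately from running it makes it easy to inspect the exact command before handing it to tesseract.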

Note that the number of spaces Tesseract detects between words may not be the same between similar lines. So words that are left-aligned with a run of spaces preceding them (as in your example) may not be output that way; the preserve_interword_spaces option does not attempt anything fancy, it merely preserves the spaces that the extraction found. By default, Tesseract collapses runs of spaces into one.
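To make the difference concrete, here is a small illustration of what collapsing runs of spaces does to a line. It only mimics the described default behaviour with a regular expression; it is not Tesseract's own code, and the sample line is invented.

```python
import re


def collapse_spaces(line):
    """Mimic the default behaviour: squeeze each run of spaces down to one space."""
    return re.sub(r" {2,}", " ", line)


# Invented sample line with wide gaps between the columns.
preserved = "left      in middle     some tab"

print(preserved)                   # spacing kept, as with preserve_interword_spaces=1
print(collapse_spaces(preserved))  # left in middle some tab
```

With the option enabled, the gaps survive only as wide as the extraction measured them, so columns on different lines may still not line up exactly.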

Details on the option are here.
