Welcome to PILtesseract’s documentation!
The piltesseract package is a simple Tesseract-OCR command line wrapper.
piltesseract allows quick conversion of PIL Image.Image instances to text using Tesseract-OCR.
Warning
piltesseract is intended to work only with Tesseract 3.03 or later. Version 3.03 added the ability to pipe images in via stdin, and piltesseract relies on this feature.
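The stdin piping mentioned in the warning can be sketched roughly as follows. This is a minimal illustration, not piltesseract's actual implementation: the helper name `run_ocr_via_stdin` is hypothetical, and it assumes the `tesseract stdin stdout` invocation form, where the image is read from stdin and the text is written to stdout. A small Python child process stands in for tesseract so the sketch runs without it installed.

```python
import subprocess
import sys


def run_ocr_via_stdin(image_bytes, command=("tesseract", "stdin", "stdout")):
    """Pipe encoded image bytes to a command's stdin and return its stdout text.

    Hypothetical helper: the default command asks tesseract to read the
    image from stdin and write the recognized text to stdout.
    """
    completed = subprocess.run(
        list(command),
        input=image_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout.decode("utf-8")


# Stand-in command so the sketch runs without tesseract installed:
# it only reports how many bytes arrived on stdin.
echo_length = (
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(str(len(sys.stdin.buffer.read())))",
)
byte_count = run_ocr_via_stdin(b"fake image bytes", command=echo_length)
```

With a real PIL image, one plausible way to obtain the bytes is to save the image into an `io.BytesIO` buffer (for example with `image.save(buffer, format="PNG")`) and pass `buffer.getvalue()` to the helper with its default command.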
Examples
>>> from PIL import Image
>>> from piltesseract import get_text_from_image
>>> image = Image.open('quickfox.png')
>>> get_text_from_image(image)
'The quick brown fox jumps over the lazy dog'
Without a config file, you can set config variables using optional keywords.
>>> text = get_text_from_image(
...     image,
...     tessedit_ocr_engine_mode=1,  # cube mode enum found in Tesseract-OCR docs
...     cube_debug_level=1
... )
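One way keyword config variables like those above could be passed to the command line is through tesseract's `-c name=value` option. The sketch below only illustrates that mapping; the helper name `config_variables_to_args` is hypothetical, and piltesseract may build its command differently.

```python
def config_variables_to_args(**config_variables):
    """Turn keyword config variables into `-c name=value` argument pairs.

    Hypothetical helper; names are sorted so the output order is stable.
    """
    args = []
    for name, value in sorted(config_variables.items()):
        args.extend(["-c", "{}={}".format(name, value)])
    return args


args = config_variables_to_args(
    tessedit_ocr_engine_mode=1,
    cube_debug_level=1,
)
```

Each variable becomes its own `-c` pair, so the resulting list can be appended to the rest of a tesseract argument list.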
piltesseract.get_text_from_image(image, tesseract_dir_path=u'', stderr=None, psm=3, lang=u'eng', tessdata_dir_path=None, user_words_path=None, user_patterns_path=None, config_name=None, **config_variables)
Uses tesseract to get text from an image.
Apart from image, tesseract_dir_path, and stderr, the arguments mirror those of the official command line. A list of the command line options can be found here: https://tesseract-ocr.googlecode.com/svn/trunk/doc/tesseract.1.html
Parameters:
- image (Image.Image or str) – The image to extract text from, or a path to that image.
- tesseract_dir_path (Optional[str]) – The path to the directory containing the tesseract binary. Defaults to “”, which works if the binary is on the PATH environment variable.
- stderr (Optional[file]) – The file-like object (one that implements write) that the tesseract stderr stream is written to. Defaults to None. You can set it to sys.stderr to see all output easily.
- psm (Optional[int]) – Page Segmentation Mode. Limits Tesseract’s layout analysis (see the Tesseract docs). Default is 3, full analysis.
- lang (Optional[str]) – The language to use. Default is ‘eng’ for English.
- tessdata_dir_path (Optional[str]) – The path to the tessdata directory.
- user_words_path (Optional[str]) – The path to user words file.
- user_patterns_path (Optional[str]) – The path to the user patterns file.
- config_name (Optional[str]) – The name of a config file.
- **config_variables – The config variables for tesseract. A list of config variables can be found here: http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version
Returns: The parsed text.
Return type: str
Raises: subprocess.CalledProcessError – If the tesseract exit status is not a success.
Examples
Examples assume “image” is a picture of the text “ABC123”. See piltesseract tests for working code.
>>> get_text_from_image(image)
'ABC123'
>>> get_text_from_image(image, psm=10)  # single character psm
'A'
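To avoid bare numbers such as 10 in calls like the one above, the page segmentation modes can be given names. The enum below is a sketch and not part of piltesseract: the class `PageSegMode` is hypothetical, and the values follow the modes 0 to 10 as I understand them from the Tesseract 3.x command line documentation, so check them against your installed version.

```python
from enum import IntEnum


class PageSegMode(IntEnum):
    """Hypothetical names for Tesseract 3.x page segmentation modes."""

    OSD_ONLY = 0                # orientation and script detection only
    AUTO_OSD = 1                # automatic segmentation with OSD
    AUTO_ONLY = 2               # automatic segmentation, no OSD or OCR
    AUTO = 3                    # fully automatic segmentation (the default)
    SINGLE_COLUMN = 4           # single column of variable-size text
    SINGLE_BLOCK_VERT_TEXT = 5  # single block of vertically aligned text
    SINGLE_BLOCK = 6            # single uniform block of text
    SINGLE_LINE = 7             # single text line
    SINGLE_WORD = 8             # single word
    CIRCLE_WORD = 9             # single word in a circle
    SINGLE_CHAR = 10            # single character


single_char_psm = int(PageSegMode.SINGLE_CHAR)
```

Passing `psm=int(PageSegMode.SINGLE_CHAR)` rather than the enum member itself keeps the argument a plain integer, whichever way the value is later converted to a string.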
You can use tesseract’s default configs or your own:
>>> get_text_from_image(image, config_name='digits')
'13123'
Without a config file, you can set config variables using optional keywords:
>>> text = get_text_from_image(
...     image,
...     tessedit_char_whitelist='1',
...     tessedit_ocr_engine_mode=1,  # cube mode enum found in Tesseract-OCR docs
... )
>>> text
'1 11 '
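Because get_text_from_image is documented to raise subprocess.CalledProcessError when tesseract exits unsuccessfully, callers may want to catch it. The following is a minimal sketch of one way to do that. The names `describe_ocr_failure` and `failing_ocr` are hypothetical, and a Python child process that exits with status 1 stands in for a failing tesseract run.

```python
import subprocess
import sys


def describe_ocr_failure(run_ocr):
    """Call run_ocr() and return (text, error_message); one of them is None."""
    try:
        return run_ocr(), None
    except subprocess.CalledProcessError as error:
        message = "tesseract exited with status {}".format(error.returncode)
        return None, message


def failing_ocr():
    # Stand-in for get_text_from_image: a child process that exits with 1,
    # which makes check_call raise CalledProcessError.
    subprocess.check_call([sys.executable, "-c", "import sys; sys.exit(1)"])
    return "unreachable"


text, message = describe_ocr_failure(failing_ocr)
```

With piltesseract, the same wrapper could be called as `describe_ocr_failure(lambda: get_text_from_image(image))`.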