Image to Text OCR using Python

OCR stands for Optical Character Recognition: the recognition of written or printed characters by a computer.

OCR converts non-editable text embedded in media such as PDFs, images, and scanned documents into digital text that can be saved and edited on a computer.

Various open-source tools, such as Tesseract, GOCR, and Ocrad, convert images into text.

Each tool uses a different algorithm to recognise and extract text from the source.

For example, Ocrad uses a feature extraction method, whereas the LSTM version of Tesseract uses an LSTM neural network to recognise characters in an image.

Tesseract OCR

Tesseract was originally developed at Hewlett-Packard Laboratories Bristol and at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some more changes made in 1996 to port it to Windows, and some conversion to C++ in 1998. In 2005 Tesseract was open sourced by HP. From 2006 until November 2018 it was developed by Google.

Tesseract has two major versions:

  • Legacy Version 3.0
  • LSTM version 4.1

Both are open source and can be downloaded from the project's GitHub repository. [Tesseract OCR]

Using Tesseract

Since Tesseract OCR is a standalone program, it can be used right after installation by running tesseract commands from the command line or a terminal.

You can also use the Tesseract engine in your Python script through the Python-Tesseract wrapper library.

Installing Tesseract

  • Download and run the Tesseract installer. [Tesseract Installer]
  • Verify the installation by running the following command in Command Prompt or a terminal:
tesseract --version
[Screenshot: Tesseract installed version check]
[Screenshot: Tesseract pre-installed language list command]
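
The same check can be done from a script. The snippet below is a minimal sketch that assumes the tesseract executable is on your PATH; the helper name tesseract_version is made up for this example and is not part of Tesseract.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if not installed."""
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "--version"], capture_output=True, text=True
    )
    # Some builds print the version to stderr rather than stdout.
    output = result.stdout or result.stderr
    return output.splitlines()[0] if output else None
```

Calling tesseract_version() returns None when the executable cannot be found, which usually means the install folder still has to be added to PATH.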

Converting Image to text with Tesseract OCR

  • Open Command Prompt.
  • Use the “cd” command to navigate to the folder where your image is saved.
  • Alternatively, you can use the full path of the image.
  • Run the command:
 tesseract imagename.jpg out

The above command feeds the image file to the Tesseract engine and saves the output in out.txt. Tesseract adds the .txt extension to the output name itself, so pass out rather than out.txt.
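
If you want to run this conversion from a script, one option is to call the command through Python's subprocess module. This is only a sketch: it assumes tesseract is on your PATH, and the helper names build_command and run_tesseract are invented for illustration.

```python
import subprocess
from pathlib import Path


def build_command(image_path, output_base):
    """Return the argument list for: tesseract <image> <output_base>."""
    return ["tesseract", str(image_path), str(output_base)]


def run_tesseract(image_path, output_base="out"):
    """Run Tesseract on one image and return the recognised text.

    Tesseract appends ".txt" to the output base itself, so the text
    ends up in "<output_base>.txt".
    """
    subprocess.run(build_command(image_path, output_base), check=True)
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

For example, run_tesseract("imagename.jpg") would run the command and return the contents of out.txt as a string.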

Example of Tesseract OCR.

Sample Image:

Sample Image Tesseract

Output Text:

CASH RECEIPT
Shop Name

Address: Lorem Ipsum 3/18
Tel: 0987 123 890 5678
Date: MM/DD/YYYY
Manager: Lorem Ipsum

Lorem 2.15
Ipsum 8.75
Dolor sit 3.50

14.40

Using Tesseract for Different Languages:

You can use Tesseract to recognise other languages by passing the -l parameter with the language code. Join several codes with +, as in eng+deu for English and German. The language data for each code must be installed.

tesseract receipt.jpg out -l eng+deu
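
In Python, the pytesseract wrapper accepts the same language string through the lang argument of image_to_string. The sketch below assumes pytesseract and Pillow are installed; join_languages and ocr_with_languages are hypothetical helper names used only for this example.

```python
def join_languages(codes):
    """Join language codes into the form Tesseract expects, e.g. "eng+deu"."""
    codes = [code.strip() for code in codes if code.strip()]
    if not codes:
        raise ValueError("at least one language code is required")
    return "+".join(codes)


def ocr_with_languages(image_path, codes):
    """Recognise text in an image using one or more languages."""
    # Imported here so join_languages() works without the OCR packages.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=join_languages(codes))
```

A call such as ocr_with_languages("receipt.jpg", ["eng", "deu"]) corresponds to the -l eng+deu option above.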

Convert Image Text into a Searchable PDF using Tesseract:

tesseract receipt.jpg out pdf
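
The pytesseract wrapper offers a comparable route through image_to_pdf_or_hocr, which returns the PDF as bytes. The following is a sketch that assumes pytesseract is installed; pdf_path_for and image_to_searchable_pdf are illustrative names, and writing the PDF next to the image is just one possible choice.

```python
from pathlib import Path


def pdf_path_for(image_path):
    """Return the image path with its extension replaced by ".pdf"."""
    return Path(image_path).with_suffix(".pdf")


def image_to_searchable_pdf(image_path):
    """Write a searchable PDF next to the image and return its path."""
    # Imported here so pdf_path_for() has no OCR dependency.
    import pytesseract

    pdf_bytes = pytesseract.image_to_pdf_or_hocr(str(image_path), extension="pdf")
    out_path = pdf_path_for(image_path)
    out_path.write_bytes(pdf_bytes)
    return out_path
```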

Tesseract Engine Modes

Tesseract offers 4 engine modes:

  • 0 = Original Tesseract only.
  • 1 = Neural nets LSTM only.
  • 2 = Tesseract + LSTM.
  • 3 = Default, based on what is available.

You can set the mode by using the --oem parameter in the command:

tesseract receipt.jpg out --oem 1
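
When calling Tesseract through pytesseract, extra command-line options such as --oem go in the config argument. The sketch below mirrors the four modes listed above; the EngineMode enum and the helper names are assumptions made for this example, not part of pytesseract.

```python
from enum import IntEnum


class EngineMode(IntEnum):
    LEGACY_ONLY = 0      # Original Tesseract only
    LSTM_ONLY = 1        # Neural nets LSTM only
    LEGACY_AND_LSTM = 2  # Tesseract + LSTM
    DEFAULT = 3          # Based on what is available


def oem_config(mode):
    """Return the config string that selects an engine mode, e.g. "--oem 1"."""
    # EngineMode() raises ValueError for anything outside 0-3.
    return f"--oem {int(EngineMode(mode))}"


def ocr_with_engine(image_path, mode=EngineMode.DEFAULT):
    """Recognise text in an image with the chosen engine mode."""
    # Imported here so oem_config() works without the OCR packages.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, config=oem_config(mode))
```

Modes 0 and 2 rely on the legacy engine, so they only work if the installed language data includes the legacy model.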

Using the Tesseract OCR Engine in Python for Image to Text Conversion

To use the Tesseract OCR engine in your Python script, you need the Python-Tesseract wrapper library (pytesseract).

Install the pytesseract library using pip:

 pip install pytesseract

In Python, the image_to_string function of the pytesseract library converts an image into text. The function takes an image (a PIL Image object or the path to an image file) as its argument and returns the text found in it, which can be stored in a variable or saved to a text file.

from PIL import Image
import pytesseract
print(pytesseract.image_to_string(Image.open('test.png')))
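
To save the result as a text file instead of printing it, something along these lines should work. It is a sketch that assumes pytesseract and Pillow are installed; save_text and image_to_text_file are made-up helper names, and the Windows path in the comment is an assumption about a typical install location.

```python
from pathlib import Path


def save_text(text, out_path):
    """Write recognised text to a UTF-8 text file and return its path."""
    out_path = Path(out_path)
    out_path.write_text(text, encoding="utf-8")
    return out_path


def image_to_text_file(image_path, out_path="out.txt"):
    """Run OCR on an image and save the recognised text to out_path."""
    # Imported here so save_text() works without the OCR packages.
    from PIL import Image
    import pytesseract

    # If tesseract is not on PATH, point pytesseract at the executable.
    # The path below is an assumption, adjust it to your installation:
    # pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"
    with Image.open(image_path) as image:
        text = pytesseract.image_to_string(image)
    return save_text(text, out_path)
```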

For more Info: PyTesseract Wrapper


Thank you for reading, and happy learning. Drop your suggestions in the comments.
