Python Techniques for Text Extraction From Images
Explore two methods of text extraction from images using Python 3.
Python is one of the most powerful programming languages available today, and it is a popular choice for AI-related tasks such as Optical Character Recognition (OCR).
Its community support is one of the most extensive in 2024. There are numerous libraries and packages in Python that help with the creation of AI software. Today, we are going to look at two methods of text extraction from images using Python 3.
The only prerequisites are a computer, an internet connection, and a Google account, because we are going to do everything on Google Colaboratory (Colab).
Python Techniques for Image-to-Text Conversion
Several Python libraries can help you extract text from an image. Below are two straightforward methods of using such libraries.
1. Tesseract In Google Colab
Tesseract is an open-source OCR engine. You can use it from Python through pytesseract, a wrapper library that calls the engine for you. We are going to show you how to use Tesseract for image-to-text conversion in Google Colab, which is an online tool for running Python code.
The advantage of using Colab is that you don’t have to worry about anything like dependencies or installing massive libraries on your system.
So, let’s see how you can do that.
Installing Tesseract OCR
tesseract-ocr is the system package that provides the Tesseract engine itself. The Python code we write later only calls this engine, so it has to be installed first.
The command for installing it is:
!sudo apt install tesseract-ocr
Normally, you would need to install Tesseract OCR on your own system. “sudo apt install” is a Linux terminal command, and the leading “!” tells Colab to run the line in the shell instead of treating it as Python. With Google Colab, you don’t need to set anything up locally. Simply run this command in a code block and Colab will handle everything else.
When the cell runs, Colab prints the installation log in the cell output.
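To confirm that the engine is actually available after the install, a small check along these lines can help. This is only a sketch and not part of the original walkthrough: the helper name tesseract_version is hypothetical, and it assumes the apt package places an executable called tesseract on the PATH.

```python
import shutil
import subprocess


def tesseract_version():
    """Return the first line of `tesseract --version`, or None if the binary is missing."""
    # shutil.which looks the executable up on PATH, the same way the shell would.
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(
        ["tesseract", "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some releases print the version banner to stderr, so look at both streams.
    output = (result.stdout + result.stderr).strip()
    return output.splitlines()[0] if output else None


print(tesseract_version() or "Tesseract is not installed yet")
```

If the function returns None, the apt step has not completed in the current runtime and should be run again.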
Installing Pytesseract and Pillow
Now, we need to install pytesseract, the Python wrapper for Tesseract. This is a simple matter. All you need to do is write the following command in a code block and run it.
!pip install pytesseract
pip is Python's package installer, and the leading “!” again tells Colab to run the line as a shell command. The name is usually explained as a recursive acronym for “pip installs packages.” It is used to install Python libraries and their dependencies.
After you run this command, Colab prints the installation log in the cell output.
You may have noticed that installing pytesseract also installs Pillow. Pillow is a fork of the Python Imaging Library (PIL). It provides functions for importing, opening, manipulating, and saving image files.
Without Pillow, we cannot provide an image to the program for image-to-text conversion. So, pip automatically installs Pillow as a dependency of pytesseract. Sweet!
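If you want to verify that both packages are importable before going further, a check like the following is one option. It is a sketch using only the standard library; check_packages is a hypothetical helper name, not something pytesseract or Pillow provides.

```python
import importlib.util


def check_packages(import_names):
    """Map each top-level import name to True if Python can find it, else False."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in import_names
    }


# Note that Pillow is imported under the name "PIL", not "pillow".
print(check_packages(["pytesseract", "PIL"]))
```

Any entry that comes back False means the corresponding pip install has not run in the current runtime.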
Image Import Preparation
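Now, we need to import a few modules before working with images. shutil, os, and random are standard-library modules that are handy for managing files in a notebook, although the short example in this section does not call them directly.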
- shutil: Helps you copy, move, and delete files and directories in Python.
- os: Lets you work with the operating system, like navigating files, checking file existence, and executing commands.
- random: Generates random numbers and selections and is useful for things like games, simulations, and statistical sampling.
Import them together with pytesseract:
import pytesseract
import shutil
import os
import random
Then right below them, you need to write the following code:
try:
    from PIL import Image
except ImportError:
    import Image
This code snippet is a common pattern for importing the Image module. Here's what it does:
- It first tries to import the Image module from the PIL package using the from ... import ... syntax. Pillow, the maintained fork of the original Python Imaging Library (PIL), installs itself under the PIL package name, so this line succeeds whenever Pillow is installed.
- If that import fails, it falls back to a top-level import Image, which is how some very old installations of the original PIL exposed the module.
This keeps the code working in environments with either Pillow or a legacy PIL installation, without changing the import statement manually. Since Pillow was installed alongside pytesseract, the first import is the one that runs here.
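The same fallback idea can be written as a small reusable helper. The sketch below is an illustration rather than anything the article's code requires: import_first_available is a hypothetical name, and it simply tries each candidate module in order.

```python
import importlib


def import_first_available(module_names):
    """Import and return the first module in module_names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # this candidate is missing, so try the next one
    raise ImportError(f"None of these modules could be imported: {module_names}")


# Equivalent in spirit to the try/except block above (commented out so the
# sketch also runs where Pillow is not installed):
# Image = import_first_available(["PIL.Image", "Image"])
```

For two candidates, the plain try/except version is shorter and easier to read; a helper like this only pays off when there are several alternatives.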
Now, we are ready to import our image.
Image Import to Colab
To import an image from your device to Colab, you need to write the following snippet of code:
from google.colab import files
uploaded = files.upload()
Running this piece of code lets you select a file from your device and upload it to the Colab runtime. The file is saved in the current working directory, so you can open it later by its file name.
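To avoid hard-coding a file name later, you can read it from the value that files.upload() returns, which is a dictionary keyed by file name. The sketch below assumes that behavior; first_uploaded_name is a hypothetical helper, and the Colab-only lines are left as comments because google.colab exists only inside Colab.

```python
def first_uploaded_name(uploaded):
    """Return the first file name from the dict returned by files.upload()."""
    if not uploaded:
        raise ValueError("No file was uploaded")
    # Dicts preserve insertion order, so this is the first key that was added.
    return next(iter(uploaded))


# In Colab:
# from google.colab import files
# uploaded = files.upload()
# image_name = first_uploaded_name(uploaded)
# print(image_name)
```

The resulting image_name can then be passed to Image.open() in place of a fixed name.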
Text Extraction From Image
To extract text from an image, you need the following two lines. Replace 'sample.png' with the name of the file you uploaded:
extractedInformation = pytesseract.image_to_string(Image.open('sample.png'))
print(extractedInformation)
Here is a simple explanation of this code.
- Image.open('sample.png'): This part opens the image file named "sample.png". The Image.open() function is from the Python Imaging Library (PIL) or Pillow library, which allows you to open and manipulate image files.
- pytesseract.image_to_string(...): This part of the code calls the image_to_string function from the pytesseract package. This function takes an image (in this case, the one opened with Image.open()) as input and extracts the text from it using the Tesseract OCR engine.
- extractedInformation = ...: This assigns the extracted text to the variable named extractedInformation.
- print(extractedInformation): This simply outputs the result, which is the extracted text.
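The two lines can also be wrapped in a function that checks the file first. This is a sketch under a few assumptions: extract_text is a hypothetical name, the lang argument defaults to English ("eng"), and the OCR imports sit inside the function so that the path check still works where the packages are not installed.

```python
import os


def extract_text(image_path, lang="eng"):
    """Run Tesseract OCR on the image at image_path and return the text."""
    if not os.path.isfile(image_path):
        raise FileNotFoundError(f"Image not found: {image_path}")
    # Imported here so a missing file is reported before the OCR packages are needed.
    import pytesseract
    from PIL import Image

    # The with-block closes the image file once OCR is done.
    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang)


# Usage, assuming 'sample.png' was uploaded earlier:
# print(extract_text("sample.png"))
```

Raising FileNotFoundError early gives a clearer message than the error you would otherwise get when the uploaded file name does not match the one in the code.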
For this exercise, we ran the code on a sample image containing text. The printed output was the same as the text in the image.
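OCR output often carries extra whitespace, blank lines, or a trailing page-break character, so a light cleanup step before comparing or storing the text can be useful. The helper below is a sketch with a hypothetical name (clean_ocr_text), and the sample string only imitates what raw output might look like.

```python
def clean_ocr_text(raw_text):
    """Strip each line and drop the empty ones from raw OCR output."""
    lines = (line.strip() for line in raw_text.splitlines())
    return "\n".join(line for line in lines if line)


# A made-up raw string with padding, a blank line, and a trailing form feed.
print(clean_ocr_text("  Hello \n\n World\n\x0c"))
```

This keeps the line order intact and only removes padding, so it is safe to apply before printing the result.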
So, there you have it. You've learned how to use Python for text extraction from an image using Tesseract and Google Colab.
2. Editpad: A Python-Powered Online Tool
There is another technique of using Python for text extraction. That is to use a Python-powered online tool like Editpad.
Editpad is a simple tool that uses Python in its backend to deploy OCR and extract text from images. Here’s how you can use this tool.
- Open a web browser and search for Editpad's image-to-text tool.
- Open the result that matches your query.
- The tool opens with a simple interface.
- Follow the on-screen instructions to input your image.
- Click the “Extract Text” button.
- You will get your output in a matter of seconds. Simply download or copy it to use it.
Overall, this is a much simpler way of extracting text from images. Another advantage is that you can input multiple images for extraction. There is also an API you can use if you want to integrate this functionality into your own programs or apps.
Conclusion
You have learned two Python techniques for text extraction from images. One method was to manually write a program in Python and use Tesseract OCR for text extraction. The other method was to use an online tool that utilizes Python in the back end for text extraction. Both approaches have their merits, and you should use them accordingly.