Getting Started With OCR docTR on Ubuntu
Explore docTR, an open-source Optical Character Recognition (OCR) solution. Walk through what is needed to install it on Ubuntu as well as a test example.
Join the DZone community and get the full member experience.
Join For FreeIn this tutorial, I'll explore how to set up and utilize docTR, the open-source OCR (Optical Character Recognition) solution of the document parsing API startup Mindee. I’ll go through what you need to install docTR on Ubuntu. It accepts PDFs, images, and even a website URL as an input. In this example, I will parse a grocery store receipt. Let’s get started.
Setting Up docTR on Ubuntu
docTR is compatible with any Linux distribution, macOS, and Windows. It is also available as a Docker image. I will use Ubuntu 22.04 LTS (Jammy Jellyfish) for this tutorial. Hardware-wise, you don’t need anything specific, but if you want to do extensive testing, I recommend using a GPU instance; OVHcloud offers affordable options, with servers starting at less than a dollar per hour.
Let’s start by installing Python. At the time of writing, docTR requires Python 3.8 (or higher).
sudo apt install -y python3
To avoid messing with system libraries, let’s use a virtual environment.
sudo apt install -y python3.10-venv
python3 -m venv testing-Mindee-docTR
Then we install the OpenGL Mesa 3D Graphics Library, used for the computer vision part of docTR.
sudo apt install -y libgl1-mesa-glx
We install pango
, which is a text layout engine library.
sudo apt-get install -y libpango-1.0-0 libpangoft2-1.0-0
Then, we install pip
so that we can install docTR.
sudo apt install -y python3-pip
Finally, we install docTR within our virtual environment. This version is specifically for PyTorch. If you choose to use TensorFlow, change the command accordingly.
testing-Mindee-docTR/bin/pip3 install "python-doctr[torch]"
Using docTR
Now that docTR is installed, let’s start playing with it. In this example, I will test it with a grocery store receipt. You can download the receipt using the command below.
wget "https://media.istockphoto.com/id/889405434/vector/realistic-paper-shop-receipt-vector-cashier-bill-on-white-background.jpg?s=612x612&w=0&k=20&c=M2GxEKh9YJX2W3q76ugKW23JRVrm0aZ5ZwCZwUMBgAg=" -O receipt.jpeg
Create a testing-docTR.py
file and insert the following code into it.
from doctr.io import DocumentFile
from doctr.models import ocr_predictor
# Load the grocery receipt
doc = DocumentFile.from_images("receipt.jpeg")
# Load the OCR model
model = ocr_predictor(pretrained=True)
# Perform OCR
result = model(doc)
# Display the OCR result
print(result.export())
Note that docTR uses a two-stage approach:
- First, it performs text detection to localize words.
- Then, it conducts text recognition to identify all characters in the word.
The ocr_predictor function accepts additional parameters to select the text detection and recognition architecture. For simplicity, I used the default ones in this example. You can find information about other models on the docTR documentation.
Reading a Receipt Using docTR
Now you just need to run your Python script:
testing-Mindee-docTR/bin/python3 testing-docTR.py
You will get an output such as the one below:
{"pages": [{"page_idx": 0, "dimensions": [612, 612], "orientation": {"value": null, "confidence": null}, "language": {"value": null, "confidence": null}, "blocks": [{"geometry": [[0.44140625, 0.1201171875], [0.548828125, 0.14453125]], "lines": [{"geometry": [[0.44140625, 0.1201171875], [0.548828125, 0.14453125]], "words": [{"value": "RECEIPT", "confidence": 0.9695481061935425, "geometry": [[0.44140625, 0.1201171875], [0.548828125, 0.14453125]]}]}], "artefacts": []}]}]}
Note that I drastically shortened the JSON output for readability and only kept the part showing the “RECEIPT
” word.
Here is the JSON structure you’d be looking at without truncating the result. I have expanded the part of the tree that I kept in the JSON output. docTR will provide a bunch of information about the document but the important part is about how it breaks down the document into lines, and for each line, provides an array containing the words it detected along with the degree of confidence. Here, we can see it spotted the word RECEIPT
with a confidence of 96%.
docTR offers an efficient OCR solution that simplifies text recognition processes. Depending on the document type, you may need to change the text detection and text recognition architectures to improve accuracy. Comprehensive docTR documentation is available here.
Considerations When Using docTR
Deploying docTR entails certain complexities. First, you must create a dataset and train docTR to achieve satisfactory accuracy. This means dealing with data annotation on many images. Since OCR systems typically serve as backend services for other apps, it may be necessary to integrate docTR via an API and scale it according to the app’s needs. docTR does not provide this out of the box, but there are many open-source technologies that can help facilitate this step.
Conclusion
Document processing technologies have come a long way since the advent of OCR tools, which are limited to character recognition. Intelligent Document Processing (IDP) platforms represent the next step; they utilize OCR (such as docTR) along with additional layers of intelligence like table reconstruction, document classification, and natural language understanding, to achieve better accuracy and precision.
Additionally, for those seeking a scalable IDP solution without the complexities of data collection and model training, I recommend trying out Mindee’s latest solution, docTI. This training-free IDP solution leverages Large Language Models (LLMs) to eliminate the need for data collection, annotation, and the model training process. You can use the free-tier plan, configure an instance, and start querying the API in minutes.
Opinions expressed by DZone contributors are their own.
Comments