How To Convert Scanned or Photographed Documents Into Text Using Java
Learn how you can harness two API solutions to convert scanned or photographed images into plain, machine-readable text.
Join the DZone community and get the full member experience.
Join For FreeAs many of our most pressing needs in the digital age are satisfied at the click of a button (or, perhaps more accurately, the touch of a screen), one might reasonably assume that the value of putting pen to paper has diminished to a level just shy of complete irrelevance. Yet, even as the business world continues its steady march towards complete digitization, we still somehow find ourselves writing out hand-written materials for many of our life’s most important transactions. For example, while many of us elect health insurance coverage through our employer’s online portal, we still typically need to write out various important details when we arrive at the doctor’s office – such as allergies, medications, recent disease exposures, etc. – and from there, we need to hand these filled-out forms into an office attendant for processing. Further, while (likely) most of us check finances and pay bills through secure banking applications, we still can’t seem to escape the use of in-person, hand-written materials in the process of acquiring loans for our future ventures, and we still often receive handwritten checks for payment. All things considered, the pertinence of hand-written materials may not be dying quite as quickly as anticipated.
However, regardless of how our medical or financial information is filled out, one thing is certain: that information, one way or another, is going to be stored in digital form. The reason for that is straightforward – relying on stacks of paper in cluttered filing cabinets to store all-important reference data is clearly irresponsible in the modern day. Even if mountainous filing cabinets still do exist in some over-burdened offices in the world, no one would willfully accept the storage of such files without some form of digital redundancy. The possibility of losing potentially critical data to a natural disaster – whether it be an age-old case of crippling human disorganization, or a truly unpredictable and tragic case like an earthquake or fire – is too devastating to fathom. After all, digitizing a physical form is a simple and cheap process. All you need is a camera or a scanning machine, and you’re immediately equipped to create digital copies of just about any paper-bound materials you can think of.
Once stored in digital form, the quality and usability of paper data become defined through a new metric: how well can that data now be queried, accessed, and manipulated? Once photographed, a document exists as a two-dimensional rasterized file composed entirely of pixels. As such, the data displayed within that picture isn’t machine-readable: it can’t be queried directly on its own. For that data to become digitally accessible, its pixelated text needs to be translated into a plain text string, which is a form of data computers are well equipped to read.
To accomplish this analog-to-digital transition, Optical Character Recognition (OCR) is the go-to solution. OCR is the simple process of converting an image of text into machine-readable text, and there are two major ways it can be accomplished – namely, via Pattern Matching or Feature Extraction – but I’ll defer a deeper discussion regarding those methodologies to a more broadly focused article on the subject. Once text becomes machine-readable, it can be stored independently of the two-dimensional file it arrived in. This means important text can now be queried, manipulated, and stored however we see fit.
Demonstration
The main purpose of this article is to demonstrate two OCR API solutions that you can use to convert your scanned and photographed images into plain, machine-readable text. Both services offer customizable fault-tolerance settings (basic, normal, and advanced; these respectively increase the number of API calls consumed per conversion) and offer excellent flexibility with support for dozens of common languages, including English, Chinese, Dutch, French, and more. While these APIs perform a very similar function from a birds-eye view, the primary difference between them is their degree of built-in fault tolerance. While any adequately tuned OCR service must first preprocess (i.e., remove imperfections from) an image prior to performing its text-recognition function, it’s important to acknowledge that scanned images are necessarily taken under much more favorable conditions than handheld photographs (which are prone to bad angles, prominent background details, etc.), and thus require less exhaustive preprocessing attention.
Each of the below API solutions can be used with a free Cloudmersive account (yields 800 API calls per month), and use the ready-to-run Java code examples provided in this article to structure your API calls.
Before calling either API, our first step is to install the Cloudmersive OCR API client. We can do so with Maven by first adding a Jitpack reference to the repository in pom.xml:
<repositories>
<repository>
<id>jitpack.io</id>
<url>https://jitpack.io</url>
</repository>
</repositories>
After that, we need to add a reference to the dependency in pom.xml:
<dependencies>
<dependency>
<groupId>com.github.Cloudmersive</groupId>
<artifactId>Cloudmersive.APIClient.Java</artifactId>
<version>v4.25</version>
</dependency>
</dependencies>
Alternatively, we can install with Gradle by first adding a reference in our root build.gradle at the end of repositories:
allprojects {
repositories {
...
maven { url 'https://jitpack.io' }
}
}
And then adding a dependency in build.gradle:
dependencies {
implementation 'com.github.Cloudmersive:Cloudmersive.APIClient.Java:v4.25'
}
With installation finished, we can turn our attention to the larger API call structure. To convert a scanned image of a document to text, we first need to copy and paste the imports at the top of our file, and then call the function, including our API key and scanned document image in the designated fields:
// Import classes:
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.ImageOcrApi;
ApiClient defaultClient = Configuration.getDefaultApiClient();
// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");
ImageOcrApi apiInstance = new ImageOcrApi();
File imageFile = new File("/path/to/inputfile"); // File | Image file to perform OCR on. Common file formats such as PNG, JPEG are supported.
String recognitionMode = "recognitionMode_example"; // String | Optional; possible values are 'Basic' which provides basic recognition and is not resillient to page rotation, skew or low quality images uses 1-2 API calls; 'Normal' which provides highly fault tolerant OCR recognition uses 26-30 API calls; and 'Advanced' which provides the highest quality and most fault-tolerant recognition uses 28-30 API calls. Default recognition mode is 'Advanced'
String language = "language_example"; // String | Optional, language of the input document, default is English (ENG). Possible values are ENG (English), ARA (Arabic), ZHO (Chinese - Simplified), ZHO-HANT (Chinese - Traditional), ASM (Assamese), AFR (Afrikaans), AMH (Amharic), AZE (Azerbaijani), AZE-CYRL (Azerbaijani - Cyrillic), BEL (Belarusian), BEN (Bengali), BOD (Tibetan), BOS (Bosnian), BUL (Bulgarian), CAT (Catalan; Valencian), CEB (Cebuano), CES (Czech), CHR (Cherokee), CYM (Welsh), DAN (Danish), DEU (German), DZO (Dzongkha), ELL (Greek), ENM (Archaic/Middle English), EPO (Esperanto), EST (Estonian), EUS (Basque), FAS (Persian), FIN (Finnish), FRA (French), FRK (Frankish), FRM (Middle-French), GLE (Irish), GLG (Galician), GRC (Ancient Greek), HAT (Hatian), HEB (Hebrew), HIN (Hindi), HRV (Croatian), HUN (Hungarian), IKU (Inuktitut), IND (Indonesian), ISL (Icelandic), ITA (Italian), ITA-OLD (Old - Italian), JAV (Javanese), JPN (Japanese), KAN (Kannada), KAT (Georgian), KAT-OLD (Old-Georgian), KAZ (Kazakh), KHM (Central Khmer), KIR (Kirghiz), KOR (Korean), KUR (Kurdish), LAO (Lao), LAT (Latin), LAV (Latvian), LIT (Lithuanian), MAL (Malayalam), MAR (Marathi), MKD (Macedonian), MLT (Maltese), MSA (Malay), MYA (Burmese), NEP (Nepali), NLD (Dutch), NOR (Norwegian), ORI (Oriya), PAN (Panjabi), POL (Polish), POR (Portuguese), PUS (Pushto), RON (Romanian), RUS (Russian), SAN (Sanskrit), SIN (Sinhala), SLK (Slovak), SLV (Slovenian), SPA (Spanish), SPA-OLD (Old Spanish), SQI (Albanian), SRP (Serbian), SRP-LAT (Latin Serbian), SWA (Swahili), SWE (Swedish), SYR (Syriac), TAM (Tamil), TEL (Telugu), TGK (Tajik), TGL (Tagalog), THA (Thai), TIR (Tigrinya), TUR (Turkish), UIG (Uighur), UKR (Ukrainian), URD (Urdu), UZB (Uzbek), UZB-CYR (Cyrillic Uzbek), VIE (Vietnamese), YID (Yiddish)
String preprocessing = "preprocessing_example"; // String | Optional, preprocessing mode, default is 'Auto'. Possible values are None (no preprocessing of the image), and Auto (automatic image enhancement of the image before OCR is applied; this is recommended).
try {
ImageToTextResponse result = apiInstance.imageOcrPost(imageFile, recognitionMode, language, preprocessing);
System.out.println(result);
} catch (ApiException e) {
System.err.println("Exception when calling ImageOcrApi#imageOcrPost");
e.printStackTrace();
}
To convert a photograph of a document to text, we’ll instead perform the identical process for a slightly different body of code. Again, we’ll first copy the imports to the top of our file, and then we’ll include the function, adding our API key and handheld document photograph into the same fields as before:
// Import classes:
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.ImageOcrApi;
ApiClient defaultClient = Configuration.getDefaultApiClient();
// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");
ImageOcrApi apiInstance = new ImageOcrApi();
File imageFile = new File("/path/to/inputfile"); // File | Image file to perform OCR on. Common file formats such as PNG, JPEG are supported.
String recognitionMode = "recognitionMode_example"; // String | Optional; possible values are 'Basic' which provides basic recognition and is not resillient to page rotation, skew or low quality images uses 1-2 API calls; 'Normal' which provides highly fault tolerant OCR recognition uses 26-30 API calls; and 'Advanced' which provides the highest quality and most fault-tolerant recognition uses 28-30 API calls. Default recognition mode is 'Advanced'
String language = "language_example"; // String | Optional, language of the input document, default is English (ENG). Possible values are ENG (English), ARA (Arabic), ZHO (Chinese - Simplified), ZHO-HANT (Chinese - Traditional), ASM (Assamese), AFR (Afrikaans), AMH (Amharic), AZE (Azerbaijani), AZE-CYRL (Azerbaijani - Cyrillic), BEL (Belarusian), BEN (Bengali), BOD (Tibetan), BOS (Bosnian), BUL (Bulgarian), CAT (Catalan; Valencian), CEB (Cebuano), CES (Czech), CHR (Cherokee), CYM (Welsh), DAN (Danish), DEU (German), DZO (Dzongkha), ELL (Greek), ENM (Archaic/Middle English), EPO (Esperanto), EST (Estonian), EUS (Basque), FAS (Persian), FIN (Finnish), FRA (French), FRK (Frankish), FRM (Middle-French), GLE (Irish), GLG (Galician), GRC (Ancient Greek), HAT (Hatian), HEB (Hebrew), HIN (Hindi), HRV (Croatian), HUN (Hungarian), IKU (Inuktitut), IND (Indonesian), ISL (Icelandic), ITA (Italian), ITA-OLD (Old - Italian), JAV (Javanese), JPN (Japanese), KAN (Kannada), KAT (Georgian), KAT-OLD (Old-Georgian), KAZ (Kazakh), KHM (Central Khmer), KIR (Kirghiz), KOR (Korean), KUR (Kurdish), LAO (Lao), LAT (Latin), LAV (Latvian), LIT (Lithuanian), MAL (Malayalam), MAR (Marathi), MKD (Macedonian), MLT (Maltese), MSA (Malay), MYA (Burmese), NEP (Nepali), NLD (Dutch), NOR (Norwegian), ORI (Oriya), PAN (Panjabi), POL (Polish), POR (Portuguese), PUS (Pushto), RON (Romanian), RUS (Russian), SAN (Sanskrit), SIN (Sinhala), SLK (Slovak), SLV (Slovenian), SPA (Spanish), SPA-OLD (Old Spanish), SQI (Albanian), SRP (Serbian), SRP-LAT (Latin Serbian), SWA (Swahili), SWE (Swedish), SYR (Syriac), TAM (Tamil), TEL (Telugu), TGK (Tajik), TGL (Tagalog), THA (Thai), TIR (Tigrinya), TUR (Turkish), UIG (Uighur), UKR (Ukrainian), URD (Urdu), UZB (Uzbek), UZB-CYR (Cyrillic Uzbek), VIE (Vietnamese), YID (Yiddish)
try {
ImageToTextResponse result = apiInstance.imageOcrPhotoToText(imageFile, recognitionMode, language);
System.out.println(result);
} catch (ApiException e) {
System.err.println("Exception when calling ImageOcrApi#imageOcrPhotoToText");
e.printStackTrace();
}
As a reminder, you can set either API to detect the language of your choice by including your preferred language’s three-letter abbreviation in the “string language” field (each applicable language is listed in the code comments adjacent to this field). Additionally, if you’d like to get a better sense of the difference between basic, normal, and advanced fault tolerance for each solution, I recommend testing the same document once with each different value. The default – and most resource-intensive – setting is ”Advanced” (this consumes about 28-30 API calls per document).
Please refer to the below for the comprehensive list of languages supported by both of the above APIs:
- ENG (English), ARA (Arabic), ZHO (Chinese - Simplified), ZHO-HANT (Chinese - Traditional), ASM (Assamese), AFR (Afrikaans), AMH (Amharic), AZE (Azerbaijani), AZE-CYRL (Azerbaijani - Cyrillic), BEL (Belarusian), BEN (Bengali), BOD (Tibetan), BOS (Bosnian), BUL (Bulgarian), CAT (Catalan; Valencian), CEB (Cebuano), CES (Czech), CHR (Cherokee), CYM (Welsh), DAN (Danish), DEU (German), DZO (Dzongkha), ELL (Greek), ENM (Archaic/Middle English), EPO (Esperanto), EST (Estonian), EUS (Basque), FAS (Persian), FIN (Finnish), FRA (French), FRK (Frankish), FRM (Middle-French), GLE (Irish), GLG (Galician), GRC (Ancient Greek), HAT (Hatian), HEB (Hebrew), HIN (Hindi), HRV (Croatian), HUN (Hungarian), IKU (Inuktitut), IND (Indonesian), ISL (Icelandic), ITA (Italian), ITA-OLD (Old - Italian), JAV (Javanese), JPN (Japanese), KAN (Kannada), KAT (Georgian), KAT-OLD (Old-Georgian), KAZ (Kazakh), KHM (Central Khmer), KIR (Kirghiz), KOR (Korean), KUR (Kurdish), LAO (Lao), LAT (Latin), LAV (Latvian), LIT (Lithuanian), MAL (Malayalam), MAR (Marathi), MKD (Macedonian), MLT (Maltese), MSA (Malay), MYA (Burmese), NEP (Nepali), NLD (Dutch), NOR (Norwegian), ORI (Oriya), PAN (Panjabi), POL (Polish), POR (Portuguese), PUS (Pushto), RON (Romanian), RUS (Russian), SAN (Sanskrit), SIN (Sinhala), SLK (Slovak), SLV (Slovenian), SPA (Spanish), SPA-OLD (Old Spanish), SQI (Albanian), SRP (Serbian), SRP-LAT (Latin Serbian), SWA (Swahili), SWE (Swedish), SYR (Syriac), TAM (Tamil), TEL (Telugu), TGK (Tajik), TGL (Tagalog), THA (Thai), TIR (Tigrinya), TUR (Turkish), UIG (Uighur), UKR (Ukrainian), URD (Urdu), UZB (Uzbek), UZB-CYR (Cyrillic Uzbek), VIE (Vietnamese), YID (Yiddish)
Opinions expressed by DZone contributors are their own.
Comments