How to Convert PDF to Text in Java
Utilize OCR technology to convert a PDF to text using an API in Java.
Without the ability to copy, paste, or edit within a PDF document, manually transcribing a PDF to text can be a frustrating task. Fortunately, Optical Character Recognition (OCR) technology can help. We have discussed this a bit in previous articles, but to clarify, optical character recognition (sometimes called optical character reader) is the electronic or mechanical conversion of images of typed, handwritten, or printed text into machine-encoded text.
OCR is most popular as a form of data entry for printed paper data records, but it is also frequently used to digitize printed texts so that they can be edited, stored compactly, or displayed online. This technology has been refined and trained to recognize patterns and, with the additional assistance of AI, can now provide a high degree of accuracy with little effort.
In the following tutorial, we will provide instructions on how to utilize an OCR API to scan a PDF document and convert it to text, automating what would normally be a long and drawn-out process. The operation supports various quality levels and a wide array of languages, so you can customize it to fit your project’s needs.
As usual, our first step is to install the SDK with Maven by adding a reference to the JitPack repository in pom.xml:
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
Next, we will add a reference to the dependency:
<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <version>v3.90</version>
    </dependency>
</dependencies>
Once the installation is complete, we can add the imports to the top of the controller and make the API call:
// Imports (place these at the top of the file):
import java.io.File;
import com.cloudmersive.client.PdfOcrApi;
import com.cloudmersive.client.invoker.ApiClient;
import com.cloudmersive.client.invoker.ApiException;
import com.cloudmersive.client.invoker.Configuration;
import com.cloudmersive.client.invoker.auth.*;
import com.cloudmersive.client.model.PdfToTextResponse;

ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth apiKeyAuth = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
apiKeyAuth.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//apiKeyAuth.setApiKeyPrefix("Token");

PdfOcrApi apiInstance = new PdfOcrApi();
File imageFile = new File("/path/to/inputfile"); // File | PDF file to perform OCR on.
String recognitionMode = "Basic"; // String | Optional; possible values are 'Basic', which provides basic recognition, is not resilient to page rotation, skew, or low-quality images, and uses 1-2 API calls per page; 'Normal', which provides highly fault-tolerant OCR recognition and uses 26-30 API calls per page; and 'Advanced', which provides the highest-quality and most fault-tolerant recognition and uses 28-30 API calls per page. Default recognition mode is 'Basic'.
String language = "ENG"; // String | Optional, language of the input document, default is English (ENG). Possible values are ENG (English), ARA (Arabic), ZHO (Chinese - Simplified), ZHO-HANT (Chinese - Traditional), ASM (Assamese), AFR (Afrikaans), AMH (Amharic), AZE (Azerbaijani), AZE-CYRL (Azerbaijani - Cyrillic), BEL (Belarusian), BEN (Bengali), BOD (Tibetan), BOS (Bosnian), BUL (Bulgarian), CAT (Catalan; Valencian), CEB (Cebuano), CES (Czech), CHR (Cherokee), CYM (Welsh), DAN (Danish), DEU (German), DZO (Dzongkha), ELL (Greek), ENM (Archaic/Middle English), EPO (Esperanto), EST (Estonian), EUS (Basque), FAS (Persian), FIN (Finnish), FRA (French), FRK (Frankish), FRM (Middle-French), GLE (Irish), GLG (Galician), GRC (Ancient Greek), HAT (Haitian), HEB (Hebrew), HIN (Hindi), HRV (Croatian), HUN (Hungarian), IKU (Inuktitut), IND (Indonesian), ISL (Icelandic), ITA (Italian), ITA-OLD (Old - Italian), JAV (Javanese), JPN (Japanese), KAN (Kannada), KAT (Georgian), KAT-OLD (Old-Georgian), KAZ (Kazakh), KHM (Central Khmer), KIR (Kirghiz), KOR (Korean), KUR (Kurdish), LAO (Lao), LAT (Latin), LAV (Latvian), LIT (Lithuanian), MAL (Malayalam), MAR (Marathi), MKD (Macedonian), MLT (Maltese), MSA (Malay), MYA (Burmese), NEP (Nepali), NLD (Dutch), NOR (Norwegian), ORI (Oriya), PAN (Panjabi), POL (Polish), POR (Portuguese), PUS (Pushto), RON (Romanian), RUS (Russian), SAN (Sanskrit), SIN (Sinhala), SLK (Slovak), SLV (Slovenian), SPA (Spanish), SPA-OLD (Old Spanish), SQI (Albanian), SRP (Serbian), SRP-LAT (Latin Serbian), SWA (Swahili), SWE (Swedish), SYR (Syriac), TAM (Tamil), TEL (Telugu), TGK (Tajik), TGL (Tagalog), THA (Thai), TIR (Tigrinya), TUR (Turkish), UIG (Uighur), UKR (Ukrainian), URD (Urdu), UZB (Uzbek), UZB-CYR (Cyrillic Uzbek), VIE (Vietnamese), YID (Yiddish)
String preprocessing = "Auto"; // String | Optional, preprocessing mode, default is 'Auto'. Possible values are None (no preprocessing of the image), and Auto (automatic image enhancement of the image before OCR is applied; this is recommended).
try {
    PdfToTextResponse result = apiInstance.pdfOcrPost(imageFile, recognitionMode, language, preprocessing);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling PdfOcrApi#pdfOcrPost");
    e.printStackTrace();
}
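Because the call expects a PDF as its input file, it can be worth checking the file locally before spending an API call on it. The following is a minimal sketch of such a check, not part of the Cloudmersive SDK: the class name PdfPreflight and the method looksLikePdf are hypothetical, and the check assumes Java 11 or later. It only verifies that the file is readable and begins with the standard "%PDF-" header bytes, so it is a strict sanity check rather than a full validation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of the Cloudmersive SDK. */
final class PdfPreflight {
    // Every PDF file starts with the ASCII marker "%PDF-" followed by a version number.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfPreflight() {}

    /**
     * Returns true if the path is a readable regular file whose first bytes
     * are the PDF header marker. Files with leading junk bytes are rejected.
     */
    static boolean looksLikePdf(Path path) throws IOException {
        if (!Files.isRegularFile(path) || !Files.isReadable(path)) {
            return false;
        }
        try (InputStream in = Files.newInputStream(path)) {
            // readNBytes returns fewer bytes if the file is shorter than the marker.
            byte[] head = in.readNBytes(PDF_MAGIC.length);
            return Arrays.equals(head, PDF_MAGIC);
        }
    }
}
```

A caller could run PdfPreflight.looksLikePdf(imageFile.toPath()) before the OCR call and skip the request when it returns false.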
To ensure the process runs smoothly, a few parameters need to be supplied:
- Image File – PDF file to perform OCR on.
- API Key – your personal API key; this can be obtained by registering for a free account on the Cloudmersive website.
- Recognition Mode (optional) – three settings are provided; the default is Basic.
  - Basic: base-level recognition that is not resilient to page rotation, skew, or low-quality images; uses 1-2 API calls per page.
  - Normal: provides highly fault-tolerant recognition; uses 26-30 API calls per page.
  - Advanced: provides the highest quality and most fault-tolerant recognition; uses 28-30 API calls per page.
- Language (optional) – the language of the input text; default is ENG (English).
- Preprocessing (optional) – two settings are available for preprocessing mode; the default is Auto.
  - None: no preprocessing of the image.
  - Auto: automatic image enhancement before OCR is applied.
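Since the recognition modes differ widely in how many API calls they consume, a quick estimate can help when choosing one. The sketch below is illustrative only: the RecognitionMode enum and its methods are hypothetical names, not SDK types, and the per-page figures are simply the ranges quoted in the list above. It assumes Java 11 or later.

```java
import java.util.Locale;

/** Hypothetical enum for illustration; per-page call ranges are the ones quoted in this article. */
enum RecognitionMode {
    BASIC("Basic", 1, 2),
    NORMAL("Normal", 26, 30),
    ADVANCED("Advanced", 28, 30);

    final String apiValue;      // string passed as the recognitionMode parameter
    final int minCallsPerPage;
    final int maxCallsPerPage;

    RecognitionMode(String apiValue, int minCallsPerPage, int maxCallsPerPage) {
        this.apiValue = apiValue;
        this.minCallsPerPage = minCallsPerPage;
        this.maxCallsPerPage = maxCallsPerPage;
    }

    /** Parses a mode name case-insensitively; null or blank falls back to the default, Basic. */
    static RecognitionMode parse(String value) {
        if (value == null || value.isBlank()) {
            return BASIC;
        }
        // valueOf throws IllegalArgumentException for an unknown mode name.
        return valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    /** Returns {min, max} estimated API calls for a document with the given page count. */
    long[] estimateCalls(int pageCount) {
        if (pageCount < 0) {
            throw new IllegalArgumentException("pageCount must not be negative");
        }
        return new long[] {(long) minCallsPerPage * pageCount, (long) maxCallsPerPage * pageCount};
    }
}
```

For example, RecognitionMode.NORMAL.estimateCalls(12) yields a range of 312 to 360 calls for a 12-page document, while the same document in Basic mode would use 12 to 24.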
The response lists the recognized text page by page. OCR has come a long way since its humble beginnings in the early 1900s, so your results should be accurate.
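Because the text comes back per page, a common next step is to stitch the pages into a single document. The following is a rough sketch of that step under stated assumptions: PageText is a hypothetical stand-in for one page of OCR output (the SDK's own result type may use different names and fields), and OcrTextJoiner is likewise an illustrative helper. It assumes Java 11 or later.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

/** Hypothetical stand-in for one page of OCR output; not an SDK type. */
final class PageText {
    final int pageNumber;
    final String text;

    PageText(int pageNumber, String text) {
        this.pageNumber = pageNumber;
        this.text = text;
    }
}

/** Hypothetical helper that merges per-page text into one string. */
final class OcrTextJoiner {
    private OcrTextJoiner() {}

    /** Sorts pages by page number and joins them, each under a small page header. */
    static String join(List<PageText> pages) {
        return pages.stream()
                .sorted(Comparator.comparingInt((PageText p) -> p.pageNumber))
                .map(p -> "--- Page " + p.pageNumber + " ---\n"
                        + (p.text == null ? "" : p.text.strip()))
                .collect(Collectors.joining("\n\n"));
    }
}
```

Sorting before joining keeps the output in reading order even if the pages arrive out of sequence, and the page headers make it easy to trace a passage back to its source page.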