How to Convert a PDF to Text (TXT) Using Java
This article outlines the difficulties in extracting plain text from regular PDF documents at scale and demonstrates two API solutions that efficiently perform that task.
There is perhaps no file type more ubiquitous (by design) than the Portable Document Format (PDF). Capable of holding an impressive variety of content and object types and working on virtually any operating system, PDFs dominate personal and professional project landscapes as a destination format for bulky or specially formatted files. Files in formats like PowerPoint’s PPTX, for example, are often large enough that exporting to PDF is the most practical way to make the project shareable; PDF’s vector and raster graphics capabilities maintain a faithful representation of the original document, often at a much smaller size. Formats like Microsoft Word DOCX may not open as intended on every operating system; the PDF version retains the same fonts and formatting as the original, allowing the end viewer to see the document as it was intended. The list of document-to-PDF conveniences goes on and on.
If there is one major drawback to PDF documents, it is that they are notoriously difficult to edit. In fact, almost everything that makes PDFs such an ideal solution for reformatting externally or manually generated material also makes them one of the more challenging formats to manipulate. Because PDFs handle so many different content types in one file, they go through extensive compression to achieve an easily portable size, which means opening a PDF document and changing its contents is never a straightforward task. It doesn’t help that they are designed to be difficult to edit; that is part of what makes PDFs a secure and reliable format in the first place.
So, what if you just want to extract plain, unformatted text from a PDF and nothing more? There are many reasons why getting pure text is useful, but extracting it in a convenient, scalable way isn’t as simple as it may seem. If you’ve ever attempted to extract text by hastily converting a PDF to an office document format (perhaps using one of the hundreds of free PDF conversion tools available online), especially without knowing what the original document format was, you’ve likely run into formatting inconsistencies, strange spacing, missing links or media files, and stray lines or tables floating where they shouldn’t be. When you only wanted the plain text, that clutter is a big distraction, and you’re still left with the task of separating the text from the new document and normalizing it by hand.
If you’ve tried to extract text from a scanned or rasterized PDF (one made up entirely of page images rather than text objects) using those same tools, you’ve probably noticed that it isn’t possible at all, at least not without a specialized Optical Character Recognition (OCR) service, which is a separate, albeit equally important, solution to the PDF-to-text problem.
When you attempt to get plain text from a regular PDF document, what you’re really trying to do is isolate one of a PDF’s many possible content types and retain only the text. Further, you’re asking for that text, which can carry complex formatting from the application that produced it (Microsoft Word, for example), to be normalized so that anyone on any platform can read it.
Because of the relative difficulty associated with performing even simple editing tasks on a PDF, it’s common practice to use third-party PDF editors (or premium Adobe tools) to achieve the desired results. These solutions, while effective on a file-by-file basis, don’t scale well: they still require manual navigation through an interface, which takes time most people can’t spare on high-volume conversion tasks.
To edit and process PDFs at scale, third-party API services are often the most efficient solution. That’s because a PDF editing API works on the file programmatically, without anyone having to open it in a viewer or editor; it can make meaningful edits (such as rotating pages or removing comments) and, on the opposite end of the spectrum, it can extract targeted content without having any impact on the original document at all.
Demonstration
In the demonstration portion of this article, I’ll walk you through two simple and easy-to-use API solutions that are designed to extract plain text from regular PDF documents without having to open or make any changes to the original file. These API solutions include the following:
- Convert PDF to Text (TXT)
- Convert PDF to Text (TXT) by page
The first solution listed above simply extracts plain text from a PDF document without performing any additional operations (by default); the API response contains a ‘TextResult’ string with the body of extracted text. Below, I’ve included the response model for reference:
{
  "Successful": true,
  "TextResult": "string"
}
The second solution, as the name suggests, extracts text while recording the page number that each portion of text came from. This adds a greater level of control over the conversion: the results can be processed in the same page order as the original PDF document, which makes it much easier to catalogue the converted information when we store it in TXT form. The model below shows how this response is formatted:
{
  "Successful": true,
  "Pages": [
    {
      "PageNumber": 0,
      "PageText": "string"
    }
  ]
}
Both solutions also provide an optional ‘textFormattingMode’ parameter that specifies how whitespace should be handled during the conversion. The possible values are ‘preserveWhitespace’, which attempts to retain whitespace in the document and the relative positioning of the text, and ‘minimizeWhitespace’, which will not insert additional spaces into the output in most cases. The default setting is ‘preserveWhitespace’.
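To make the difference between the two modes more concrete, here is a minimal sketch of a local post-processing step that collapses runs of spaces in text that has already been extracted. It only illustrates what "minimized" whitespace looks like; the names WhitespaceSketch and collapseSpaces are hypothetical, and this is not the algorithm the API applies internally.

```java
import java.util.Arrays;
import java.util.stream.Collectors;

// Hypothetical helper for illustration; not part of the Cloudmersive SDK.
class WhitespaceSketch {

    // Collapses runs of spaces and tabs inside each line to one space and trims
    // the ends of each line. Line breaks are kept so paragraph structure survives.
    static String collapseSpaces(String text) {
        return Arrays.stream(text.split("\\R", -1))
                .map(line -> line.replaceAll("[ \\t]+", " ").trim())
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        String preserved = "Invoice      Total\n   Widget        42.00";
        System.out.println(collapseSpaces(preserved));
    }
}
```

A step like this can be handy when text was extracted with ‘preserveWhitespace’ for layout-sensitive use but a compact copy is also needed, for example for search indexing.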
Below, I’ll walk you through how you can take advantage of either API using ready-to-run, complementary code examples in Java. Please note that to use either API for free, you’ll just need to register a free account on www.cloudmersive.com to get a secure API key (this account will yield a limit of 800 API calls per month).
Before calling either API, we'll begin with SDK installation as our first step. We can do this with Maven by first adding a reference to the repository in pom.xml:
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
And then adding the dependency to pom.xml:
<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <version>v4.25</version>
    </dependency>
</dependencies>
We can also do this with Gradle by adding the repository at the end of the repositories block in our root build.gradle:
allprojects {
    repositories {
        ...
        maven { url 'https://jitpack.io' }
    }
}
And then adding the dependency in build.gradle:
dependencies {
    implementation 'com.github.Cloudmersive:Cloudmersive.APIClient.Java:v4.25'
}
Now we can structure our API calls, beginning with the generic PDF to TXT conversion API. Within the code snippet below, include your API key where the documentation indicates (just below the imports), and then include your file path in the inputFile parameter below that:
// Import classes (place these at the top of your source file):
//import java.io.File;
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.ConvertDocumentApi;
//import com.cloudmersive.client.model.TextConversionResult; // model package assumed from the SDK's generated layout
ApiClient defaultClient = Configuration.getDefaultApiClient();
// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");
ConvertDocumentApi apiInstance = new ConvertDocumentApi();
File inputFile = new File("/path/to/inputfile"); // Input PDF to perform the operation on
// Optional whitespace handling: "preserveWhitespace" (the default) or "minimizeWhitespace"
String textFormattingMode = "preserveWhitespace";
try {
    TextConversionResult result = apiInstance.convertDocumentPdfToTxt(inputFile, textFormattingMode);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ConvertDocumentApi#convertDocumentPdfToTxt");
    e.printStackTrace();
}
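Once the call succeeds, the extracted text usually needs to land somewhere. The sketch below writes a string to a UTF-8 .txt file next to the source PDF, using only the standard library. It assumes you have already pulled the text out of the response object (the generated model presumably exposes it through a getter such as getTextResult(), but that name is an assumption, so check the SDK); TextFileWriter, txtPathFor, and saveAsTxt are hypothetical names.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Locale;

// Hypothetical helper; not part of the Cloudmersive SDK.
class TextFileWriter {

    // Maps "dir/report.pdf" to "dir/report.txt"; other names just get ".txt" appended.
    static Path txtPathFor(Path pdfPath) {
        String fileName = pdfPath.getFileName().toString();
        boolean hasPdfExtension = fileName.toLowerCase(Locale.ROOT).endsWith(".pdf");
        String baseName = hasPdfExtension
                ? fileName.substring(0, fileName.length() - 4)
                : fileName;
        return pdfPath.resolveSibling(baseName + ".txt");
    }

    // Writes the extracted text as UTF-8, replacing any existing file at the target path.
    static Path saveAsTxt(Path pdfPath, String extractedText) throws IOException {
        Path target = txtPathFor(pdfPath);
        Files.write(target, extractedText.getBytes(StandardCharsets.UTF_8));
        return target;
    }

    public static void main(String[] args) throws IOException {
        Path written = saveAsTxt(Paths.get("report.pdf"), "Sample extracted text");
        System.out.println("Wrote " + written.toAbsolutePath());
    }
}
```

Writing explicit UTF-8 bytes avoids depending on the platform's default charset, which matters when the PDF contains non-ASCII characters.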
For the solution which converts PDF to Text by page, use the following code snippet instead. This will capture your API key and file input parameters in the same way as the first:
// Import classes (place these at the top of your source file):
//import java.io.File;
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.EditPdfApi;
//import com.cloudmersive.client.model.PdfTextByPageResult; // model package assumed from the SDK's generated layout
ApiClient defaultClient = Configuration.getDefaultApiClient();
// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");
EditPdfApi apiInstance = new EditPdfApi();
File inputFile = new File("/path/to/inputfile"); // Input PDF to perform the operation on
// Optional whitespace handling: "preserveWhitespace" (the default) or "minimizeWhitespace"
String textFormattingMode = "preserveWhitespace";
try {
    PdfTextByPageResult result = apiInstance.editPdfGetPdfTextByPages(inputFile, textFormattingMode);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling EditPdfApi#editPdfGetPdfTextByPages");
    e.printStackTrace();
}
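With per-page results, a common next step is to stitch the pages back into one text document while keeping track of where each page starts. The sketch below is one way to do that with the standard library. The PageText class is a local stand-in that mirrors the PageNumber and PageText fields of the response model shown earlier; it is not the SDK's own model class, and the marker format is an arbitrary choice.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper; not part of the Cloudmersive SDK.
class PageTextFormatter {

    // Local stand-in for one entry of the "Pages" array in the response.
    static final class PageText {
        final int pageNumber;
        final String text;

        PageText(int pageNumber, String text) {
            this.pageNumber = pageNumber;
            this.text = text;
        }
    }

    // Sorts a copy of the pages by page number and joins them,
    // putting a marker line in front of each page's text.
    static String joinWithPageMarkers(List<PageText> pages) {
        List<PageText> sorted = new ArrayList<>(pages);
        sorted.sort(Comparator.comparingInt(p -> p.pageNumber));
        StringBuilder out = new StringBuilder();
        for (PageText page : sorted) {
            out.append("=== Page ").append(page.pageNumber).append(" ===\n");
            out.append(page.text == null ? "" : page.text.trim()).append("\n\n");
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        List<PageText> pages = Arrays.asList(
                new PageText(2, "Second page text"),
                new PageText(1, "First page text"));
        System.out.println(joinWithPageMarkers(pages));
    }
}
```

Sorting a copy rather than the input list leaves the caller's data untouched, which is convenient if the same page list is also stored or indexed elsewhere.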
Remember, the default textFormattingMode setting preserves whitespace from the input PDF in the output. If you’d prefer to avoid that, set textFormattingMode to ‘minimizeWhitespace’ in the example code instead. With either solution in hand, you’ll be able to redirect text from PDF documents to a variety of destinations without ever having to open the document in question.
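For high-volume work, the first step is usually collecting the PDFs to process. Here is a minimal sketch, assuming the files live under a single directory tree; PdfFinder, isPdf, and findPdfs are hypothetical names, and each returned path would then be wrapped in a File and passed to one of the API calls shown earlier.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper for batch jobs; not part of the Cloudmersive SDK.
class PdfFinder {

    // True when the file name ends in ".pdf", ignoring case.
    static boolean isPdf(Path path) {
        Path name = path.getFileName();
        return name != null && name.toString().toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    // Recursively lists the regular files under root that look like PDFs, sorted by path.
    static List<Path> findPdfs(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(PdfFinder::isPdf)
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        for (Path pdf : findPdfs(root)) {
            System.out.println(pdf);
        }
    }
}
```

Keep the monthly call limit of your account in mind when looping over a large directory, since each file costs at least one API call.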