How to Get Plain Text From Common Documents in Java
Image-based documents require OCR for text extraction, but text-based documents don't. Here, learn ways to extract plain text from text-based documents.
In this article, we’ll learn how to extract plain text strings from a few of the most common file types (PDF, DOCX, XLSX, PPTX) we can expect to deal with on a day-to-day basis as programmers in an enterprise environment.
We’ll briefly review when to use plain text extraction methods over Optical Character Recognition (OCR) text extraction methods, and we’ll discuss some use cases for retrieving plain text in a real-world scenario. We’ll then cover a few open-source APIs that are perfect for handling plain text extraction on a one-off basis, and at the end we’ll demonstrate a proprietary API that saves time by automatically detecting each file type before extracting its plain text content.
OCR vs Text Object Extraction
When text content in any given file is rasterized (i.e., displayed graphically as a pixel matrix in some image format), we need to use Optical Character Recognition (OCR) to extract that content from the file body. In this context, the text doesn’t physically exist in the document, and feature recognition is required to discern and copy text characters.
When text content is stored as an object within the file structure, however, we can extract that content in a different way: by following the correct internal file path to locate where text objects are stored. This is a much more efficient and less complicated process than OCR.
Thankfully, most of the file types enterprises use to organize and format text content carry plain text directly within the file structure. This includes file types like Word (DOCX), Excel (XLSX), PowerPoint (PPTX), and text-based (e.g., vector) PDFs. In all of these examples, the file structure stores plain text objects alongside style references; when someone opens the file in an appropriate reader (e.g., a proprietary application, web browser, or open-source reader), that reader applies the referenced style and theme values to render the plain text as rich text.
For a quick example of this concept, we might imagine a scenario where an official contract document of some sort is being written manually within the MS Word application. The MS Word application is a powerful text editor with complex rich text formatting features, which means the person writing the contract can apply relevant styles and features to their text, and view those changes in real time, as they edit the document. While the person writing the document sees rich text in real time, the DOCX document itself “sees” plain text with active references to specific style and theme objects. When the writer saves their work, the DOCX file stores all this reference information in separate XML files (e.g., document.xml, styles.xml, etc.), which are compressed and zipped together with a .docx extension.
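To see this for yourself, a minimal sketch like the one below (using only java.util.zip; the file path is a placeholder) will list the XML parts packed inside any DOCX archive:
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class DocxZipInspector {
    public static void main(String[] args) throws IOException {
        // Placeholder path; point this at any .docx file on disk
        try (ZipFile zip = new ZipFile("/path/to/contract.docx")) {
            // A DOCX is a ZIP archive; entries like word/document.xml and
            // word/styles.xml hold the plain text and its style references
            zip.stream()
               .map(ZipEntry::getName)
               .forEach(System.out::println);
        }
    }
}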
Why Extract Plain Text From Rich Formatted Documents?
Retrieving plain text from a rich text document makes that content immediately usable anywhere for any purpose.
When mining text from documents for analysis and keyword generation, for example, starting with plain text extraction makes the process simpler and faster. For centralized archiving aimed at regulatory compliance, stripping out plain text from relevant documents streamlines operations, especially when dealing with large volumes of files. If we’re tackling a more complex project, such as automating tasks based on specific content triggers within a document (e.g., a legal contract with the phrase “termination clause”), extracting plain text will be much quicker and more straightforward than accessing the file structure for each individual operation.
We might imagine, for example, a programmatic workflow that retrieves plain text content directly from email message attachments. After integrating the email application’s API into our own application to retrieve attachment file bytes from message files, we could implement one of the methods covered below to extract the plain text from each file’s structure before proceeding with any downstream text-processing steps.
Extracting Text With Popular Open-Source Libraries
Different file types store text content in slightly different ways, which means we’ll need a catch-all solution if we’re interested in extracting text from multiple different file types in a single programmatic workflow. Without that, we’ll need to implement unique text-extraction methods for each file type we expect to work with. Thankfully, we have a few good options at our disposal.
If we’re working exclusively with text-based (i.e., not rasterized) PDF documents, we could use the PDFTextStripper class from the Apache PDFBox library, an open-source library with dozens of useful content extraction tools. If we’re focused only on getting text from Office Open XML files (DOCX, XLSX, PPTX, etc.), we could leverage the reliable Apache POI library, which offers several classes (e.g., XWPFDocument for DOCX, XSSFWorkbook for XLSX, etc.) for extracting text from Office documents. In either case, we’ll be able to print plain text to our console before using that text in any subsequent operations.
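For reference, here’s a minimal sketch of what one-off extraction looks like with each library. It assumes PDFBox 2.x (where PDDocument.load is available) and POI’s poi-ooxml module are on the classpath; the file paths are placeholders:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class OpenSourceTextExtraction {

    // Extract plain text from a text-based PDF with Apache PDFBox (2.x API)
    static String extractPdfText(File pdf) throws IOException {
        try (PDDocument document = PDDocument.load(pdf)) {
            return new PDFTextStripper().getText(document);
        }
    }

    // Extract plain text from a DOCX with Apache POI
    static String extractDocxText(File docx) throws IOException {
        try (FileInputStream in = new FileInputStream(docx);
             XWPFDocument document = new XWPFDocument(in)) {
            return new XWPFWordExtractor(document).getText();
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder paths; swap in real input files
        System.out.println(extractPdfText(new File("/path/to/input.pdf")));
        System.out.println(extractDocxText(new File("/path/to/input.docx")));
    }
}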
If we’re dealing with a mix of PDF and Office Open XML file types, we’ll need to validate and sort our attachment content with our own logic before using the above libraries to process our documents. Alternatively, we can leverage a proprietary solution that automatically detects the file type and performs the correct text extraction method.
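As a rough sketch of what that sorting logic might look like, the helper below extends the hypothetical class from the previous sketch, dispatching on file extension to the extractPdfText and extractDocxText methods defined there; a more robust version might inspect the file’s leading bytes rather than trusting the extension:
// Rough dispatcher sketch: route each file to the matching open-source
// extractor (extractPdfText / extractDocxText from the previous sketch)
// based on its extension. Production code might check magic bytes instead.
static String extractText(File file) throws IOException {
    String name = file.getName().toLowerCase();
    if (name.endsWith(".pdf")) {
        return extractPdfText(file);   // Apache PDFBox route
    } else if (name.endsWith(".docx")) {
        return extractDocxText(file);  // Apache POI route
    }
    throw new IllegalArgumentException("Unsupported file type: " + name);
}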
Demonstration
The API demonstrated below automatically distinguishes between Office Open XML and PDF document file types using content validation logic before extracting plain text from the file structure. This API is proprietary, but it can be used for free with an API key, and it offers an optional developer-friendly parameter for preserving or minimizing whitespace in the output text string. We’ll walk through how to structure our API call below.
First, add the below snippet to your pom.xml file to pull dependencies from JitPack:
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
Then, add the below dependency block to your pom.xml file to include the API client:
<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <version>v4.25</version>
    </dependency>
</dependencies>
Now add the below import classes to the top of your file:
// Import classes:
import com.cloudmersive.client.invoker.ApiClient;
import com.cloudmersive.client.invoker.ApiException;
import com.cloudmersive.client.invoker.Configuration;
import com.cloudmersive.client.invoker.auth.*;
import com.cloudmersive.client.ConvertDocumentApi;
Next, initialize the API client and set your API key authorization:
ApiClient defaultClient = Configuration.getDefaultApiClient();
// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");
Finally, use the below example to convert a document to text. You can set the textFormattingMode parameter to 'minimizeWhitespace' to stop additional space from being added to the document post-conversion (in most cases). The default value for this parameter is 'preserveWhitespace', which attempts to keep the relative positioning of text from the original document structure.
ConvertDocumentApi apiInstance = new ConvertDocumentApi();
File inputFile = new File("/path/to/inputfile"); // File | Input file to perform the operation on.
String textFormattingMode = "preserveWhitespace"; // String | Optional; 'preserveWhitespace' (default) keeps the relative positioning of text, while 'minimizeWhitespace' avoids inserting additional spaces in most cases.
try {
    TextConversionResult result = apiInstance.convertDocumentAutodetectToTxt(inputFile, textFormattingMode);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ConvertDocumentApi#convertDocumentAutodetectToTxt");
    e.printStackTrace();
}
If any exceptions occur during your API call, the try-catch block will handle them and print error messages to the console. Otherwise, you can begin working with the text conversion result string from your original documents.
Conclusion
In this article, we learned how text-based files store and reference plain text within their file structures, and we reviewed some example scenarios where extracting plain text from a document could be useful. We then learned about a few open-source libraries designed to retrieve text from Office Open XML and PDF files, and finally, we walked through a proprietary solution that distinguishes between file types and extracts text on our behalf.