How To Compare DOCX Documents in Java
In this article, learn how to carry out DOCX comparisons programmatically by calling a specialized web API with Java code examples.
If you’ve spent a lot of time creating and editing documents in the MS Word application, there’s a good chance you’ve heard of (and maybe even used) the DOCX comparison feature. This simple, manual comparison tool produces a three-pane view displaying the differences between two versions of a file. It’s a useful tool for summarizing the journey legal contracts (or other, similar documents that tend to start as templates) take when they undergo multiple rounds of collaborative edits.
As useful as manual DOCX document comparisons are, they’re still manual, which makes them inefficient at scale. Thankfully, the open file format DOCX is based on (OpenXML) is designed to facilitate the automation of manual processes like this by making Office document file structure easily accessible to programmers. With the right developer tools, you can make programmatic DOCX comparisons at scale in your own applications.
In this article, you’ll learn how to carry out DOCX comparisons programmatically by calling a specialized web API with Java code examples. This will help you automate DOCX comparisons without the need to understand OpenXML formatting or write a ton of new code. Before we get to our demonstration, however, we'll first briefly review OpenXML formatting, and we'll also learn about an open-source library that can be used to read and write Office files in Java.
Understanding OpenXML
OpenXML formatting has been around for a long time now (since 2007), and it’s the standard all major Office documents are currently based on.
Thanks to OpenXML formatting, all Office files – including Word (DOCX), Excel (XLSX), PowerPoint (PPTX), and others – are structured as zip archives containing compressed document content, metadata, and file specifications in XML format.
We can easily review this file structure for ourselves by renaming an Office file as a .zip file. To do that on Windows, we can cd into the directory containing our DOCX file and rename the file using the command below (replacing the example file name with our own):
ren "hello world.docx" "hello world.zip"
We can then open the .zip version of our DOCX file and poke around in our file archive.
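The same inspection can be done in code instead of by renaming the file. The following is a minimal sketch using only the JDK's `java.util.zip` classes; the class name `DocxArchiveLister` and the stand-in archive it builds are assumptions made for illustration so the example runs without a real document on disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class DocxArchiveLister {

    // Returns the names of all entries in a zip archive (a DOCX file is one).
    static List<String> listEntries(byte[] archiveBytes) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(archiveBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Builds a tiny stand-in archive with two DOCX-style part names.
    // This is NOT a complete DOCX file; it only mimics the archive layout.
    static byte[] buildSampleArchive() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            for (String name : new String[] {"[Content_Types].xml", "word/document.xml"}) {
                zip.putNextEntry(new ZipEntry(name));
                zip.write("<placeholder/>".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        for (String name : listEntries(buildSampleArchive())) {
            System.out.println(name);
        }
    }
}
```

To list the parts of a real document, we could pass `Files.readAllBytes(Paths.get("hello world.docx"))` to `listEntries` instead of the stand-in archive.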
When we open DOCX files in our MS Word application, our files are unzipped, and we can then use various built-in application tools to manipulate our files’ contents.
This open file structure makes it relatively straightforward to build applications that read and write DOCX files. It is, to use a well-known example, the reason why services like Google Drive can accept DOCX uploads and open them in their own editors. With a good understanding of OpenXML structure, we could build our own text editor applications to manipulate DOCX files if we wanted – it would just be a LOT of work. It wouldn’t be especially worth our time, either, given the number of applications and programming libraries that already exist for exactly that purpose.
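To give a sense of what reading a DOCX file by hand involves, here is a rough sketch that pulls paragraph text out of the `word/document.xml` part using only JDK classes (Java 9 or later, for `readAllBytes`). The class name `DocxParagraphReader` and the stand-in archive are assumptions for illustration. The sketch ignores formatting, tables, headers, and nested paragraphs, which is part of why a full implementation is so much work.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class DocxParagraphReader {

    // Namespace of WordprocessingML elements such as w:p (paragraph) and w:t (text).
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Finds word/document.xml inside the archive and returns one string per paragraph.
    static List<String> readParagraphs(byte[] docxBytes) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docxBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    return parseParagraphs(zip.readAllBytes());
                }
            }
        }
        throw new IOException("No word/document.xml part found in archive");
    }

    // Concatenates the text of every w:t element found under each w:p element.
    static List<String> parseParagraphs(byte[] documentXml) throws IOException {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            Document dom = factory.newDocumentBuilder().parse(new ByteArrayInputStream(documentXml));
            NodeList paragraphs = dom.getElementsByTagNameNS(W_NS, "p");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                NodeList texts = ((Element) paragraphs.item(i)).getElementsByTagNameNS(W_NS, "t");
                StringBuilder text = new StringBuilder();
                for (int j = 0; j < texts.getLength(); j++) {
                    text.append(texts.item(j).getTextContent());
                }
                result.add(text.toString());
            }
            return result;
        } catch (ParserConfigurationException | SAXException e) {
            throw new IOException("Could not parse word/document.xml", e);
        }
    }

    // Builds a minimal stand-in archive (not a complete, Word-openable DOCX).
    // The paragraph text is inserted as-is, so it must not contain XML markup.
    static byte[] buildSampleDocx(String... paragraphs) throws IOException {
        StringBuilder xml = new StringBuilder("<w:document xmlns:w=\"" + W_NS + "\"><w:body>");
        for (String p : paragraphs) {
            xml.append("<w:p><w:r><w:t>").append(p).append("</w:t></w:r></w:p>");
        }
        xml.append("</w:body></w:document>");
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(buffer)) {
            zip.putNextEntry(new ZipEntry("word/document.xml"));
            zip.write(xml.toString().getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] docx = buildSampleDocx("Lorem ipsum", "dolor sit amet");
        readParagraphs(docx).forEach(System.out::println);
    }
}
```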
Writing DOCX Comparisons in Java
While the OpenXML SDK is open source (hosted on GitHub for anyone to use), it’s written to be used with .NET languages like C#. If we were looking to automate DOCX comparisons with an open-source library in Java, we would need to use something like the Apache POI library to build our application instead.
Our process would roughly entail:
- Adding Apache POI dependencies to our pom.xml
- Importing the XWPF classes (the part of Apache POI designed for Word OpenXML files)
- Writing some code to load and extract relevant content from our documents
The third step is where things would start to get complicated. We would need to write a bunch of code to retrieve and compare paragraph elements from each document, and if we wanted to ensure consistent formatting across both of our documents (important for our resulting comparison document), we would need to break down our paragraphs into runs. We would then, of course, need to implement our own robust error handling before writing our DOCX comparison result to a new file.
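To illustrate only the paragraph-comparison part of that work, here is a simplified sketch. It assumes the paragraph text has already been extracted from each document into a list of strings (with Apache POI, that would typically mean reading each paragraph's text from the loaded document). It uses a textbook longest-common-subsequence diff, which is a swapped-in technique chosen for brevity and not necessarily what MS Word or any comparison API uses. The class name `ParagraphDiff` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ParagraphDiff {

    // Labels each paragraph as unchanged ("  "), removed ("- "), or added ("+ ").
    static List<String> diff(List<String> original, List<String> revised) {
        int n = original.size();
        int m = revised.size();

        // lcs[i][j] = length of the longest common subsequence of
        // original[i..] and revised[j..], filled from the end backwards.
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                lcs[i][j] = original.get(i).equals(revised.get(j))
                        ? lcs[i + 1][j + 1] + 1
                        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
            }
        }

        // Walk both lists, emitting matches and the cheapest edits between them.
        List<String> out = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < n && j < m) {
            if (original.get(i).equals(revised.get(j))) {
                out.add("  " + original.get(i));
                i++;
                j++;
            } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
                out.add("- " + original.get(i++));
            } else {
                out.add("+ " + revised.get(j++));
            }
        }
        while (i < n) {
            out.add("- " + original.get(i++));
        }
        while (j < m) {
            out.add("+ " + revised.get(j++));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("Lorem ipsum", "dolor sit amet", "consectetur");
        List<String> revised = Arrays.asList("Lorem ipsum", "hello world", "consectetur");
        diff(original, revised).forEach(System.out::println);
    }
}
```

Even this sketch only compares plain text at the paragraph level. Preserving formatting, comparing individual runs, and writing the result back out as a valid DOCX file would all need to be built on top of it.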
Advantages of a Web API for DOCX Comparison
Writing our DOCX comparison from scratch would take time, and it would also put the burden of our file-processing operation squarely on our own server. That might not be a big deal for comparisons involving smaller-sized DOCX documents, but it would start to take a toll with larger-sized documents and larger-scale (higher volume) operations.
By calling a web API to handle our DOCX comparison instead, we’ll limit the amount of code we need to write, and we’ll offload the heavy lifting in our comparison workflow to an external server. That way, we can focus more of our hands-on coding efforts on building robust features in our application that handle the results of our DOCX comparisons in various ways.
Demonstration
Using the code examples below, we can call an API that simplifies the process of automating DOCX comparisons. Rather than writing a bunch of new code, we’ll just need to copy relevant examples, load our input files, and write the bytes returned by each comparison to a new DOCX file.
To help demonstrate what the output of our programmatic comparison looks like, I’ve included a screenshot from a simple DOCX comparison result below. This document shows the comparison between two versions of a classic Lorem Ipsum passage – one containing all of the original Latin text, and the other containing a few lines of English text:
To structure our API call, we can begin by installing the client SDK. Let’s add a reference to the repository in our pom.xml:
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
And let’s add a reference to the dependency in our pom.xml:
<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <version>v4.25</version>
    </dependency>
</dependencies>
After that, we can add the following imports to our controller:
// Import classes:
import com.cloudmersive.client.invoker.ApiClient;
import com.cloudmersive.client.invoker.ApiException;
import com.cloudmersive.client.invoker.Configuration;
import com.cloudmersive.client.invoker.auth.*;
import com.cloudmersive.client.CompareDocumentApi;
// Needed for the File objects passed to the API call
import java.io.File;
Now we can turn our attention to configuration. We’ll need to supply a free Cloudmersive API key (this allows 800 API calls/month with no commitments) in the following configuration snippet:
ApiClient defaultClient = Configuration.getDefaultApiClient();
// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");
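Rather than pasting the key into source code, we might prefer to read it from the environment. The following is one possible sketch; the variable name `CLOUDMERSIVE_API_KEY` and the class `ApiKeyResolver` are hypothetical names chosen for this example and are not defined by the SDK.

```java
import java.util.Map;

class ApiKeyResolver {

    // Hypothetical environment variable name; not part of the Cloudmersive SDK.
    static final String ENV_VAR = "CLOUDMERSIVE_API_KEY";

    // Returns the trimmed key, or fails fast with a clear message if it is missing.
    static String resolve(Map<String, String> environment) {
        String key = environment.get(ENV_VAR);
        if (key == null || key.trim().isEmpty()) {
            throw new IllegalStateException("Set the " + ENV_VAR + " environment variable");
        }
        return key.trim();
    }

    public static void main(String[] args) {
        try {
            String key = resolve(System.getenv());
            System.out.println("Loaded an API key of length " + key.length());
        } catch (IllegalStateException e) {
            System.err.println(e.getMessage());
        }
    }
}
```

The resolved value could then be passed to the configuration snippet above in place of the literal string, for example `Apikey.setApiKey(ApiKeyResolver.resolve(System.getenv()));`.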
Next, we can use our final code examples below to create an instance of the API and call the DOCX comparison function:
CompareDocumentApi apiInstance = new CompareDocumentApi();
File inputFile1 = new File("/path/to/inputfile"); // File | First input file to perform the operation on.
File inputFile2 = new File("/path/to/inputfile"); // File | Second input file to perform the operation on (more than 2 can be supplied).
try {
    byte[] result = apiInstance.compareDocumentDocx(inputFile1, inputFile2);
    // Printing a byte[] directly only shows an object reference, so report its size instead
    System.out.println("Received comparison result: " + result.length + " bytes");
} catch (ApiException e) {
    System.err.println("Exception when calling CompareDocumentApi#compareDocumentDocx");
    e.printStackTrace();
}
Now we can easily automate DOCX comparisons with a few lines of code. If our input DOCX files contain any errors, the endpoint will try to auto-repair the files before making the comparison.
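Since the API call returns the comparison document as a byte array, the last step is to save those bytes with a .docx extension. Below is a minimal sketch using `java.nio.file.Files`; the class name `ComparisonResultWriter`, the zip-signature check, and the temp-file target are assumptions for illustration, and the stand-in bytes in `main` take the place of a real API response.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class ComparisonResultWriter {

    // DOCX files are zip archives, and a non-empty zip archive starts
    // with the local file header signature 'P', 'K', 0x03, 0x04.
    static boolean looksLikeZip(byte[] data) {
        return data != null && data.length >= 4
                && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4;
    }

    // Writes the bytes to the target path after a basic sanity check.
    static Path save(byte[] result, Path target) throws IOException {
        if (!looksLikeZip(result)) {
            throw new IllegalArgumentException("Result does not look like a DOCX (zip) payload");
        }
        return Files.write(target, result);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the byte[] returned by the comparison call.
        byte[] standIn = {'P', 'K', 3, 4};
        Path target = Files.createTempFile("comparison-result", ".docx");
        save(standIn, target);
        System.out.println("Wrote " + Files.size(target) + " bytes to " + target);
    }
}
```

In the demonstration code, we would call `ComparisonResultWriter.save(result, Paths.get("/path/to/output.docx"))` inside the `try` block and handle `IOException` alongside `ApiException`.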
Conclusion
In this article, we learned about the MS Word DOCX Comparison tool and discussed how DOCX comparisons can be automated (thanks to OpenXML formatting). We then learned how to call a low-code DOCX comparison API with Java code examples.