How To Get the Comments From a DOCX Document in Java
In this article, learn how to extract comments from DOCX documents at scale and gather insights that can improve team collaboration.
Contemporary document collaboration tools make it possible to push projects from start to finish on tighter deadlines than ever before. Where pre-digital project collaboration relied on manual markups and annotations to revise critical reports and memos before distribution, contemporary teams across a variety of industries can accomplish the same essential goals, and much more, using the simple revision tools available to all users in DOCX files. Any team member can add suggestions, changes, and callouts to a DOCX document on a SharePoint site drive, drastically reducing the time it takes to publish and share final products with stakeholders.
Under the hood, it’s Microsoft’s Open XML document format that makes this team-oriented file manipulation possible. Because a DOCX file is structured as a zip archive composed of multiple XML-based files, comments and other revisions are physically separated from the document’s core content, and the data that defines the relationships between these separate files is stored in a folder of its own.
In other words, the comments and revisions we see when we open a collaborative DOCX document are part of an independent file communicating with the document’s body of text, which is stored in a file of its own. This file-structure compartmentalization ultimately creates the fluid, dynamic experience we’re accustomed to when we add, remove or resolve revisions, or even when we elect to turn collaboration features on and off entirely.
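To make that file structure concrete, here is a minimal sketch that opens a DOCX as a zip archive and reads the comments part directly using only the Java standard library. It assumes the comments live at the conventional location word/comments.xml (strictly speaking, the part is located through the package relationships), and the DocxCommentsReader and RawComment names are hypothetical, chosen only for this illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

final class DocxCommentsReader {
    // WordprocessingML namespace used by elements and attributes in the comments part.
    static final String W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Hypothetical holder for the handful of values this sketch reads. */
    static final class RawComment {
        final String id;
        final String author;
        final String date;
        final String text;

        RawComment(String id, String author, String date, String text) {
            this.id = id;
            this.author = author;
            this.date = date;
            this.text = text;
        }
    }

    /** Opens the DOCX as a zip archive and parses its comments part, if one exists. */
    static List<RawComment> readComments(Path docx) throws IOException {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            // Assumption: the comments part sits at its conventional location.
            ZipEntry entry = zip.getEntry("word/comments.xml");
            if (entry == null) {
                return Collections.emptyList(); // no comments part in this document
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return parseComments(in);
            }
        }
    }

    /** Parses the XML of a comments part into simple value objects. */
    static List<RawComment> parseComments(InputStream xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations so untrusted files cannot trigger entity expansion.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder().parse(xml);

            List<RawComment> comments = new ArrayList<>();
            NodeList nodes = doc.getElementsByTagNameNS(W_NS, "comment");
            for (int i = 0; i < nodes.getLength(); i++) {
                Element comment = (Element) nodes.item(i);
                // Concatenate every text run; paragraph breaks are not preserved here.
                StringBuilder text = new StringBuilder();
                NodeList runs = comment.getElementsByTagNameNS(W_NS, "t");
                for (int j = 0; j < runs.getLength(); j++) {
                    text.append(runs.item(j).getTextContent());
                }
                comments.add(new RawComment(
                        comment.getAttributeNS(W_NS, "id"),
                        comment.getAttributeNS(W_NS, "author"),
                        comment.getAttributeNS(W_NS, "date"),
                        text.toString()));
            }
            return comments;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse comments XML", e);
        }
    }
}
```

Calling DocxCommentsReader.readComments with the path to a local DOCX file returns one entry per comment, or an empty list when the document has no comments part. This sketch ignores reply threading and resolved status, which are stored in other parts of the package.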
Since comments in a DOCX document are stored in an XML-based file, they can be manually or programmatically accessed independently of the document’s other components. Once extracted, useful comment metadata (including the comment text, author names, dates, and more) can be analyzed independently of the original content it’s associated with. While this data extraction isn’t necessarily useful on a one-off basis, there’s a notable benefit to accumulating comments from multiple documents of the same type (e.g., cyclical reports and memos) over time and using that information to better understand the overall content collaboration process. With volumes of comment metadata readily available, it’s possible, for example, to apply NLP analysis to understand how a team tends to feel about specific sections of a biweekly memo. It’s also possible to get a sense of how often collaboration occurs on a particular topic, identify the most frequent contributors, and more.
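As one small illustration of that kind of analysis, the sketch below ranks contributors by how many comments they have left. It assumes the author names have already been collected from many documents into a plain list; the CommentStats class and its method name are hypothetical.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class CommentStats {
    /**
     * Counts comments per author and returns the result ordered from the most
     * to the least frequent contributor (ties are broken alphabetically).
     */
    static Map<String, Long> commentsPerAuthor(List<String> authors) {
        // Tally how many times each author name appears.
        Map<String, Long> counts = authors.stream()
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));

        // Sort by count (descending), then by name, keeping that order in a LinkedHashMap.
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.comparingByKey()))
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        (first, second) -> first,
                        LinkedHashMap::new));
    }
}
```

The same grouping approach could be keyed on a date bucket or a document section instead of the author name, depending on which question about the collaboration process is being asked.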
If insights like these are intriguing enough to pursue, the challenge becomes extracting that information in an organized and efficient manner across multiple documents in a reasonably short period of time. While Open XML files can be converted to .zip files and extracted independently (or accessed individually using documented code examples in C# or Visual Basic), these methods are largely too manual or too limited in scope to be practical across a larger array of files. It’s far more practical to rely on fully realized programmatic solutions that extract and return the data we need in a simple, organized, and human-readable format. This is a perfect role for a specialized document conversion API.
Demonstration
In the remainder of this article, I’ll demonstrate two APIs that are designed to retrieve comment text and comment metadata from a DOCX file. These two solutions can be utilized easily (and freely, using a free API key) by copying from the ready-to-run Java code examples provided further down the page, and they both perform slightly different variations of the same basic function. I’ll briefly outline both solutions below.
1. Get Comments From a DOCX Document as a Flat List
This API returns comments and review annotations without any hierarchy showing the reply-child comments attached to the original comments. In the response object, replies to original comments are distinguished by an IsReply Boolean. Refer to the example JSON response body below:
{
  "Successful": true,
  "Comments": [
    {
      "Path": "string",
      "Author": "string",
      "AuthorInitials": "string",
      "CommentText": "string",
      "CommentDate": "2023-07-27T15:15:44.278Z",
      "IsTopLevel": true,
      "IsReply": true,
      "ParentCommentPath": "string",
      "Done": true
    }
  ],
  "CommentCount": 0
}
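Because the flat list carries IsReply and ParentCommentPath on every entry, reply threads can be rebuilt on the client side when needed. The sketch below shows one way to do that. It assumes that a reply’s ParentCommentPath matches the Path of the comment it answers, which the field names suggest but which should be verified against real responses. The FlatComment class is a hypothetical holder mirroring a few of the JSON fields, not a class from the API client.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CommentThreads {
    /** Hypothetical holder mirroring a few fields of the flat JSON response. */
    static final class FlatComment {
        final String path;
        final String author;
        final String text;
        final boolean isReply;
        final String parentPath;

        FlatComment(String path, String author, String text, boolean isReply, String parentPath) {
            this.path = path;
            this.author = author;
            this.text = text;
            this.isReply = isReply;
            this.parentPath = parentPath;
        }
    }

    /** Maps each top-level comment path to the replies that point at it. */
    static Map<String, List<FlatComment>> groupReplies(List<FlatComment> comments) {
        Map<String, List<FlatComment>> threads = new LinkedHashMap<>();

        // First pass: register every top-level comment so threads keep list order.
        for (FlatComment comment : comments) {
            if (!comment.isReply) {
                threads.put(comment.path, new ArrayList<>());
            }
        }

        // Second pass: attach each reply to its parent. computeIfAbsent keeps
        // replies whose parent is missing from the list instead of dropping them.
        for (FlatComment comment : comments) {
            if (comment.isReply) {
                threads.computeIfAbsent(comment.parentPath, key -> new ArrayList<>()).add(comment);
            }
        }
        return threads;
    }
}
```

A top-level comment with no replies maps to an empty list, so the size of the returned map gives the number of threads.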
2. Get Comments From a DOCX Document Hierarchically
This API returns comments and review annotations in an object with reply-child comments nested beneath their associated comment. This makes the relationship between reply comments and original comments explicit in the API response body. Refer to the example JSON response body below:
{
  "Successful": true,
  "Comments": [
    {
      "Path": "string",
      "Author": "string",
      "AuthorInitials": "string",
      "CommentText": "string",
      "CommentDate": "2023-07-27T15:16:28.931Z",
      "ReplyChildComments": [
        {
          "Path": "string",
          "Author": "string",
          "AuthorInitials": "string",
          "CommentText": "string",
          "CommentDate": "2023-07-27T15:16:28.931Z",
          "IsTopLevel": true,
          "IsReply": true,
          "ParentCommentPath": "string",
          "Done": true
        }
      ],
      "Done": true
    }
  ],
  "TopLevelCommentCount": 0
}
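Since this response only reports TopLevelCommentCount, totals that include replies have to be derived by walking the nested structure. The sketch below shows that walk, along with a filter for threads that are still open. It assumes Done indicates a resolved comment, which is an interpretation of the field name rather than something stated in the response. ThreadedComment is a hypothetical holder mirroring a few of the JSON fields, not a class from the API client.

```java
import java.util.List;
import java.util.stream.Collectors;

final class CommentTree {
    /** Hypothetical holder mirroring a few fields of the hierarchical JSON response. */
    static final class ThreadedComment {
        final String author;
        final String text;
        final boolean done;
        final List<ThreadedComment> replies;

        ThreadedComment(String author, String text, boolean done, List<ThreadedComment> replies) {
            this.author = author;
            this.text = text;
            this.done = done;
            this.replies = replies;
        }
    }

    /** Counts every comment in the tree: each top-level comment plus all of its replies. */
    static int totalComments(List<ThreadedComment> comments) {
        int total = 0;
        for (ThreadedComment comment : comments) {
            total += 1 + totalComments(comment.replies);
        }
        return total;
    }

    /** Returns the top-level comments that are not marked as done. */
    static List<ThreadedComment> openThreads(List<ThreadedComment> topLevel) {
        return topLevel.stream()
                .filter(comment -> !comment.done)
                .collect(Collectors.toList());
    }
}
```

The recursion handles any nesting depth, although the example response above shows only a single level of replies.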
You can begin structuring either API call in Java by installing the client library with Maven. First, add the following reference to the repository in pom.xml:
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
Then, add a reference to the dependency in pom.xml:
<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <version>v4.25</version>
    </dependency>
</dependencies>
With installation complete, you can copy the example below (including the import classes) to retrieve DOCX comments as a flat list:
// Import classes:
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.EditDocumentApi;
//import com.cloudmersive.client.model.*; // request and response model classes

ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");

EditDocumentApi apiInstance = new EditDocumentApi();
// Document input request. It is empty as written: set the DOCX input on reqConfig
// before making the call (see the client's model documentation for the available fields).
GetDocxGetCommentsRequest reqConfig = new GetDocxGetCommentsRequest();
try {
    GetDocxCommentsResponse result = apiInstance.editDocumentDocxGetComments(reqConfig);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling EditDocumentApi#editDocumentDocxGetComments");
    e.printStackTrace();
}
And you can copy the example below (including the import classes) to retrieve DOCX comments hierarchically:
// Import classes:
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.EditDocumentApi;
//import com.cloudmersive.client.model.*; // request and response model classes

ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");

EditDocumentApi apiInstance = new EditDocumentApi();
// Document input request. It is empty as written: set the DOCX input on reqConfig
// before making the call (see the client's model documentation for the available fields).
GetDocxGetCommentsHierarchicalRequest reqConfig = new GetDocxGetCommentsHierarchicalRequest();
try {
    GetDocxCommentsHierarchicalResponse result = apiInstance.editDocumentDocxGetCommentsHierarchical(reqConfig);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling EditDocumentApi#editDocumentDocxGetCommentsHierarchical");
    e.printStackTrace();
}
Now you can easily automate the retrieval of DOCX comment/annotation metadata and parse that information seamlessly into other applications and workflows.