How To Validate Archives and Identify Invalid Documents in Java

Explore the prevalence of cyberattacks that rely on common archive formats and invalid document types, along with data validation API solutions that help detect them.

Brian O'Neill

Sep. 16, 23 · Tutorial

In today's cybersecurity landscape, custom content threats are slipping past our email security policies, firewalls, and virus-scanning network proxies with increasing regularity. Cleverly disguised files can easily find their way into our inboxes and our most sensitive file storage locations, where they can lurk for extended periods, waiting for unsuspecting victims to download them and execute their malicious payloads.

Seemingly, the faster we rush to understand and mitigate one iteration of a hidden content threat, the quicker that threat evolves into something entirely new, catching us by surprise again and again.

In recent years, Office file formats, URLs, and executables have stolen the spotlight as the most commonly pursued hosts for latent email and storage-based attack vectors alike. Links to compromised websites are frequently encountered in our email inboxes, as are malicious macros and various executables. Invalid files, password-protected files, or even OLE-enabled (object linking and embedding) files with malicious content can often be found scattered throughout our cloud storage instances.

Amid all of this, an even stealthier form of malware host has begun to gain ground over its contemporaries: archive file formats like ZIP and RAR. According to research conducted over a three-month period in 2022, more than 40% of malware attacks used ZIP and RAR formats to deliver malicious content to a client device. That exceeds the usage of many long-established Office formats over the same period, and while that might seem surprising at first, on closer inspection it's not hard to see why. File compression formats can harness powerful encryption algorithms to safeguard their contents, and there's not much a regular virus and malware scanning service can do when it can't decrypt the files it needs to scan.

An archive's encryption is already a difficult obstacle for virus and malware scanning solutions, and the problem is compounded by how easily these archive formats can be smuggled past security policies inside disguised, invalid file types. For example, some recent attacks have buried archives within HTML documents designed to convincingly mimic the online PDF viewers (complete with an apparent PDF file extension and a seemingly normal document thumbnail) that we're accustomed to opening in our browsers. If we let our eyes deceive us and download an HTML mimic file, we might unknowingly decrypt and drop the contents of an externally stored malicious ZIP or RAR archive directly onto our device, allowing an attacker to establish a direct link with our computer and initiate a full-fledged cyberattack.

As pure virus and malware detection policies become increasingly inadequate sentinels on their own, it's more important than ever that we also deploy content-validation-centric policies against inbound files. Detecting a stray ZIP, RAR, or invalid file type in a sensitive location can be the difference between the success and failure of a latent cyberattack. One way we can accomplish this is with the help of simple document validation APIs, and I've provided a few free-to-use options in the demonstration portion of this article.

Demonstration

The API solutions provided below are free to use (with a free-tier API key), and they’re easy to call via ready-to-run Java code examples supplied further down the page, beginning with Java SDK installation instructions. They’re designed to perform the following actions, respectively:

Validate if a file is a ZIP archive.
Validate if a file is a RAR archive.
Automatically detect the contents of a common file type (e.g., PDF, HTML, XLSX, etc.) and perform in-depth content verification against the file's extension.

After processing each file, these solutions return a "DocumentIsValid" Boolean response, making it straightforward to flag common content threat types or divert them away from sensitive locations within our system. All of these solutions also identify whether a file is password protected (often a further indication of malicious content, especially when the file originates from an untrustworthy source), and they report any overt errors or warnings associated with the document.
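
To make the flag-or-divert idea concrete, here is a minimal routing sketch. It does not call the API; it only assumes we have already pulled three values out of a response: the "DocumentIsValid" Boolean, a password-protection flag, and an error count. The class name, the enum, the parameter names, and the accept/quarantine/reject policy itself are hypothetical choices for illustration, not part of the SDK.

```java
/** Routing sketch; the inputs stand in for values read from a validation response. */
final class ValidationRouter {

    /** Hypothetical handling decisions for an inbound file. */
    enum Disposition { ACCEPT, QUARANTINE, REJECT }

    private ValidationRouter() {}

    /**
     * Maps validation findings to a handling decision.
     * errorCount is an assumed name for the number of errors reported for the file.
     */
    static Disposition route(boolean documentIsValid, boolean passwordProtected, int errorCount) {
        if (!documentIsValid || errorCount > 0) {
            return Disposition.REJECT;      // content does not match the claimed format
        }
        if (passwordProtected) {
            return Disposition.QUARANTINE;  // valid container, but contents cannot be inspected
        }
        return Disposition.ACCEPT;
    }

    public static void main(String[] args) {
        System.out.println(route(true, false, 0));  // clean, valid file
        System.out.println(route(true, true, 0));   // valid but encrypted archive
        System.out.println(route(false, false, 2)); // invalid document
    }
}
```

Whether a password-protected file should be quarantined or rejected outright is a policy decision; the sketch simply keeps that choice in one place.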

As a reminder, these APIs are NOT designed to detect or flag virus or malware signatures; their utility will depend on where you elect to deploy them. They can just as easily be deployed as simple data validation steps in the workflow of any regular file-processing application. Further down the page, I've linked a previous article that highlights an API solution that scans, validates, and verifies content all in one step.

To begin structuring our API calls, let’s install the SDK with Maven by first adding a reference to the repository in pom.xml:

     XML 
   
 
 
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
  

After that, let’s add a reference to the dependency in pom.xml:

     XML 
   
 
 
<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <version>v4.25</version>
    </dependency>
</dependencies>
  

We can then call the ZIP File Validation API using the code below:

     Java 
   
 
 
// Import classes:
//import java.io.File;
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.model.*; // result types such as DocumentValidationResult
//import com.cloudmersive.client.ValidateDocumentApi;

ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");

ValidateDocumentApi apiInstance = new ValidateDocumentApi();
File inputFile = new File("/path/to/inputfile"); // File | Input file to perform the operation on.
try {
    DocumentValidationResult result = apiInstance.validateDocumentZipValidation(inputFile);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ValidateDocumentApi#validateDocumentZipValidation");
    e.printStackTrace();
} 
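
As a cheap local complement to the API call, we can also look at a file's first few bytes before sending it anywhere. The sketch below checks for the three well-known ZIP signatures (PK followed by 03 04, 05 06, or 07 08). It is not how the API performs its validation, and the class and method names are made up for this example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Local pre-check sketch: does a file start with a ZIP signature? */
final class ZipSignatureCheck {

    private ZipSignatureCheck() {}

    /** True if the bytes begin with PK\3\4 (regular), PK\5\6 (empty archive) or PK\7\8 (spanned). */
    static boolean looksLikeZip(byte[] header) {
        if (header == null || header.length < 4) {
            return false;
        }
        if (header[0] != 'P' || header[1] != 'K') {
            return false;
        }
        return (header[2] == 3 && header[3] == 4)
            || (header[2] == 5 && header[3] == 6)
            || (header[2] == 7 && header[3] == 8);
    }

    /** Reads only the first four bytes of the file and applies the signature check. */
    static boolean looksLikeZip(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return looksLikeZip(in.readNBytes(4));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(looksLikeZip(Path.of(args[0])));
        } else {
            System.out.println(looksLikeZip(new byte[] {0x50, 0x4B, 0x03, 0x04}));
        }
    }
}
```

Keep in mind that DOCX, XLSX, and JAR files are ZIP containers too, so a positive result only tells us the container type, not whether the archive is well formed or safe.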
  

We can call the RAR File Validation API using the code below:

     Java 
   
 
 
// Import classes:
//import java.io.File;
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.model.*; // result types such as DocumentValidationResult
//import com.cloudmersive.client.ValidateDocumentApi;

ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");

ValidateDocumentApi apiInstance = new ValidateDocumentApi();
File inputFile = new File("/path/to/inputfile"); // File | Input file to perform the operation on.
try {
    DocumentValidationResult result = apiInstance.validateDocumentRarValidation(inputFile);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ValidateDocumentApi#validateDocumentRarValidation");
    e.printStackTrace();
} 
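
RAR archives can be recognized locally in a similar way. The sketch below compares a file's leading bytes against the two RAR signatures: "Rar!" followed by 1A 07 00 for the older format (RAR 1.5 through 4.x) and 1A 07 01 00 for RAR 5. Again, this only illustrates a signature check under my own naming, and it says nothing about how the API validates the archive.

```java
import java.util.Arrays;

/** Local pre-check sketch: does a byte sequence start with a RAR signature? */
final class RarSignatureCheck {

    // "Rar!" 1A 07 00 (RAR 1.5 to 4.x)
    private static final byte[] RAR4 = {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x00};
    // "Rar!" 1A 07 01 00 (RAR 5)
    private static final byte[] RAR5 = {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x01, 0x00};

    private RarSignatureCheck() {}

    /** True if data is at least as long as prefix and begins with it. */
    static boolean startsWith(byte[] data, byte[] prefix) {
        return data != null
            && data.length >= prefix.length
            && Arrays.equals(data, 0, prefix.length, prefix, 0, prefix.length);
    }

    /** True if the leading bytes match either RAR signature. */
    static boolean looksLikeRar(byte[] header) {
        return startsWith(header, RAR4) || startsWith(header, RAR5);
    }

    public static void main(String[] args) {
        byte[] rar5Header = {0x52, 0x61, 0x72, 0x21, 0x1A, 0x07, 0x01, 0x00};
        byte[] zipHeader = {0x50, 0x4B, 0x03, 0x04};
        System.out.println(looksLikeRar(rar5Header));
        System.out.println(looksLikeRar(zipHeader));
    }
}
```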
  

Lastly, we can call the Automatic Content Validation API using the final code example below:

     Java 
   
 
 
// Import classes:
//import java.io.File;
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.model.*; // result types such as AutodetectDocumentValidationResult
//import com.cloudmersive.client.ValidateDocumentApi;

ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");

ValidateDocumentApi apiInstance = new ValidateDocumentApi();
File inputFile = new File("/path/to/inputfile"); // File | Input file to perform the operation on.
try {
    AutodetectDocumentValidationResult result = apiInstance.validateDocumentAutodetectValidation(inputFile);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ValidateDocumentApi#validateDocumentAutodetectValidation");
    e.printStackTrace();
} 
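
To illustrate what checking content against a file extension means in the simplest possible terms, here is a small local sketch that sniffs a handful of formats from a file's leading bytes and compares the result with the extension in the file name. It would catch the "HTML page posing as a PDF" case described earlier, but it is far shallower than in-depth content verification, and the class, enum, and the short list of formats are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Sketch: flag files whose leading bytes disagree with their extension. */
final class ExtensionMismatchCheck {

    enum Kind { PDF, ZIP, RAR, HTML, UNKNOWN }

    private ExtensionMismatchCheck() {}

    /** Guesses the format from the first bytes of the file. */
    static Kind sniff(byte[] head) {
        // ISO-8859-1 maps each byte to one char, so binary signatures survive decoding.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        if (text.startsWith("%PDF-")) return Kind.PDF;
        if (text.startsWith("PK\003\004")) return Kind.ZIP;
        if (text.startsWith("Rar!\032\007")) return Kind.RAR;
        String lower = text.stripLeading().toLowerCase(Locale.ROOT);
        if (lower.startsWith("<!doctype html") || lower.startsWith("<html")) return Kind.HTML;
        return Kind.UNKNOWN;
    }

    /** Format implied by the file name; DOCX and XLSX are ZIP containers. */
    static Kind expectedFor(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf")) return Kind.PDF;
        if (lower.endsWith(".zip") || lower.endsWith(".docx") || lower.endsWith(".xlsx")) return Kind.ZIP;
        if (lower.endsWith(".rar")) return Kind.RAR;
        if (lower.endsWith(".html") || lower.endsWith(".htm")) return Kind.HTML;
        return Kind.UNKNOWN;
    }

    /** True when the extension is one we know and the content looks like something else. */
    static boolean isMismatch(String fileName, byte[] head) {
        Kind expected = expectedFor(fileName);
        return expected != Kind.UNKNOWN && expected != sniff(head);
    }

    public static void main(String[] args) {
        byte[] htmlBytes = "<!DOCTYPE html><html>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isMismatch("invoice.pdf", htmlBytes));
    }
}
```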
  

Hopefully, with a few additional content validation policies in place, we can rest assured that we’ll be aware when common threat vectors enter our system.

Scan, Verify, and Validate Content All at Once

To take advantage of an API solution designed to simultaneously identify viruses, malware, and custom content threats (with full content verification and custom content restriction policies), feel free to check out my previous article, "How to Protect .NET Web Applications from Viruses and Zero Day Threats."

Since that article applies to .NET application development, I've provided comparable Java code examples below for Java application development.

First, add the following reference to the repository in pom.xml:

     XML 
   
 
 
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
  

Then add the following reference to the dependency in pom.xml:

     XML 
   
 
 
<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <version>v4.25</version>
    </dependency>
</dependencies>
  

Finally, use the Java code example below to structure your API call, and once again, use a free-tier API key to authorize your requests. As outlined in the linked article, you can use Booleans to set custom restrictions against a variety of custom content threat types (macros, password-protected files, malicious archives, HTML, scripts, etc.), and you can restrict unwanted file types by supplying a comma-separated list of accepted file extensions (e.g., .docx,.pdf,.xlsx) in the string restrictFileTypes parameter.

     Java 
   
 
 
// Import classes:
//import java.io.File;
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.model.*; // result types such as VirusScanAdvancedResult
//import com.cloudmersive.client.ScanApi;

ApiClient defaultClient = Configuration.getDefaultApiClient();

// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");

ScanApi apiInstance = new ScanApi();
File inputFile = new File("/path/to/inputfile"); // File | Input file to perform the operation on.
// Each flag below is set to false, the recommended default, so that the corresponding threat type is blocked. Set a flag to true only if you need to allow that content.
Boolean allowExecutables = false; // Boolean | Set to false to block executable files (program code) from being allowed in the input file.  Default is false (recommended).
Boolean allowInvalidFiles = false; // Boolean | Set to false to block invalid files, such as a PDF file that is not really a valid PDF file, or a Word Document that is not a valid Word Document.  Default is false (recommended).
Boolean allowScripts = false; // Boolean | Set to false to block script files, such as PHP files, Python scripts, and other malicious content or security threats that can be embedded in the file.  Set to true to allow these file types.  Default is false (recommended).
Boolean allowPasswordProtectedFiles = false; // Boolean | Set to false to block password-protected and encrypted files, such as encrypted ZIP and RAR files, and other files that seek to circumvent scanning through passwords.  Set to true to allow these file types.  Default is false (recommended).
Boolean allowMacros = false; // Boolean | Set to false to block macros and other threats embedded in document files, such as Word, Excel, and PowerPoint embedded macros, and other files that contain embedded content threats.  Set to true to allow these file types.  Default is false (recommended).
Boolean allowXmlExternalEntities = false; // Boolean | Set to false to block XML External Entities and other threats embedded in XML files, and other files that contain embedded content threats.  Set to true to allow these file types.  Default is false (recommended).
Boolean allowInsecureDeserialization = false; // Boolean | Set to false to block Insecure Deserialization and other threats embedded in JSON and other object serialization files, and other files that contain embedded content threats.  Set to true to allow these file types.  Default is false (recommended).
Boolean allowHtml = false; // Boolean | Set to false to block HTML input in the top-level file; HTML can contain XSS, scripts, local file accesses, and other threats.  Set to true to allow these file types.  Default is false (recommended) [for API keys created prior to the release of this feature, default is true for backward compatibility].
String restrictFileTypes = ".docx,.pdf,.xlsx"; // String | Comma-separated list of file formats to allow as clean; for example, .pdf,.docx,.png would allow only PDF, Word, and PNG files.  All files must pass content verification against this list; if they do not, the result will be returned as CleanResult=false.  Set restrictFileTypes to null or an empty string to disable; default is disabled.
try {
    VirusScanAdvancedResult result = apiInstance.scanFileAdvanced(inputFile, allowExecutables, allowInvalidFiles, allowScripts, allowPasswordProtectedFiles, allowMacros, allowXmlExternalEntities, allowInsecureDeserialization, allowHtml, restrictFileTypes);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ScanApi#scanFileAdvanced");
    e.printStackTrace();
} 
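
Because the restrictFileTypes value is just a comma-separated string, it can be handy to build it from the same allow-list an application uses elsewhere. The sketch below parses such a string into a set and applies a quick name-based pre-check. The class and method names are hypothetical, and a name check like this is no substitute for the content verification the API performs against the list.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Sketch: parse a comma-separated extension allow-list and pre-check file names against it. */
final class AllowedExtensions {

    private AllowedExtensions() {}

    /** Splits a string such as ".docx,.pdf,.xlsx" into a lowercase set; null or blank yields an empty set. */
    static Set<String> parse(String restrictFileTypes) {
        if (restrictFileTypes == null || restrictFileTypes.isBlank()) {
            return Set.of();
        }
        return Arrays.stream(restrictFileTypes.split(","))
            .map(entry -> entry.trim().toLowerCase(Locale.ROOT))
            .filter(entry -> !entry.isEmpty())
            .collect(Collectors.toSet());
    }

    /** An empty allow-list means the restriction is disabled, mirroring the parameter's documented default. */
    static boolean isAllowed(String fileName, Set<String> allowed) {
        if (allowed.isEmpty()) {
            return true;
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return false; // no extension to compare
        }
        return allowed.contains(fileName.substring(dot).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> allowed = parse(".docx,.pdf,.xlsx");
        System.out.println(isAllowed("Report.PDF", allowed));
        System.out.println(isAllowed("payload.zip", allowed));
    }
}
```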
  
