Three Ways To Separate Plain Text From HTML Using Java
Learn about three API solutions that can be employed to convert an HTML document to text, convert an HTML string to text, and remove HTML from a text string.
On each webpage we visit, we are confronted with a huge variety of multimedia content, all of which is put together and presented using HyperText Markup Language (HTML). HTML is a markup language that many developers are familiar with, composed of elements that, when interpreted by a browser, typically form a coherent, organized, and intentional display. This code provides the framework for how we view images, videos, bodies of writing, hyperlinks, data entry fields, and anything else you can think of on a web page, and all that code is available for anyone to view with a simple right-click in any browser.
Given the immense volume of formatting elements present in any complex HTML string, the actual subject of the code - the text contents and file paths buried within those strings - can be a bit difficult to access independently of those formatting specifications. If, for example, we want to review web copy and subsequently edit or manipulate that text in a meaningful way, we’re going to have a difficult time copying and pasting that information from the displayed web page directly. We’ll just end up with a mess of inconsistently formatted text riddled with hyperlinks, logos, disjointed tabs and spaces, and more. This isn’t to say that it can’t be done. We can, of course, copy small snippets of text from any web page and reformat those snippets to resemble their original form in rich text editors like Microsoft Word. The issue is that this “point and click” approach chews up valuable time in our workday, and if we need to scale up our operation to include multiple websites and thousands of characters worth of text, we’ll be doing ourselves a big disservice in the long run by attempting to do so manually.
Rather than waste valuable time and energy attempting to snag the text we want with deft clicks and drags, we’re much better off extracting it from the HTML code with an API service that is specifically equipped to do so. We can accomplish this separation through a few methods which, while they appear identical on the surface, accommodate slightly different use cases. These methods include the following:
- Converting an HTML file to plain text
- Converting an HTML string to plain text
- Removing HTML from a text string
The first and second methods are essentially the same operation with two different scenarios in mind: in the former, our HTML code is readily available to us in file form (one which will open directly as a browser page when we click on it), and in the latter, our HTML is available to us as a text string (for example, HTML we copied via right-clicking on our web browser). The third method, while technically accomplishing the same goal as the first two methods, envisions a more security-focused use case, helping us to identify HTML and Cross-Site Scripting attacks (a form of cyber threat in which a malicious actor places executable scripts into trusted app/website code) in a given text string, without assuming we have a fully formed HTML string to work with.
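To make the goal concrete, here is a deliberately naive, local sketch of what "separating text from HTML" means. It is not how the Cloudmersive APIs work (this article does not describe their internals); it is a regex-based approximation that will mishandle scripts, comments, entities, and malformed markup, which is part of the reason a dedicated service is attractive. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

public class NaiveHtmlStripper {
    // Matches anything that looks like a tag, e.g. <p>, </div>, <a href="...">.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    // Matches runs of whitespace left behind once the tags are gone.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Replaces every tag-like span with a space, then collapses whitespace. */
    public static String stripTags(String html) {
        String withoutTags = TAG.matcher(html).replaceAll(" ");
        return WHITESPACE.matcher(withoutTags).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // prints: Hello world
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```

Even on simple input this approach leaves rough edges (for example, a stray space before punctuation that directly follows a closing tag), so treat it as an illustration of the problem rather than a solution.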
In the remainder of this article, I will demonstrate three simple API solutions which can be used to separate HTML code from plain text contents in each of the three slightly different scenarios listed above. These APIs are all free to use and are available through the Cloudmersive Document Conversion API, with a single free-tier Cloudmersive API key authenticating each service (the free tier allows up to 800 API calls per month with no additional commitments). Below, I’ve provided ready-to-run Java code examples to help you structure your call to each of these three APIs, with additional notes on input requests and output responses where they are needed.
Before we begin calling each individual API iteration, I will first provide instructions to help you install the API client with Maven or Gradle (these are the same for each API iteration). To install with Maven, our first task is to add a reference to our repository in pom.xml:
<repositories>
    <repository>
        <id>jitpack.io</id>
        <url>https://jitpack.io</url>
    </repository>
</repositories>
After that, we just need to add a reference to the dependency in pom.xml:
<dependencies>
    <dependency>
        <groupId>com.github.Cloudmersive</groupId>
        <artifactId>Cloudmersive.APIClient.Java</artifactId>
        <version>v4.25</version>
    </dependency>
</dependencies>
To install with Gradle instead, we’ll need to add the below snippet to our root build.gradle (at the end of repositories):
allprojects {
    repositories {
        ...
        maven { url 'https://jitpack.io' }
    }
}
After that, we’ll need to add the below dependency in build.gradle:
dependencies {
    implementation 'com.github.Cloudmersive:Cloudmersive.APIClient.Java:v4.25'
}
With installation complete, we can move on and begin structuring our API calls. The first API iteration I will demonstrate can be used to convert an HTML document (file) to plain text. This API requires a file path input included within the inputFile field (indicated by the code comments in the example provided below). We just need to copy the code below into our file and configure our inputs accordingly:
// Import classes:
//import java.io.File;
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.model.*; // assumed package for TextConversionResult
//import com.cloudmersive.client.ConvertDocumentApi;
ApiClient defaultClient = Configuration.getDefaultApiClient();
// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");
ConvertDocumentApi apiInstance = new ConvertDocumentApi();
File inputFile = new File("/path/to/inputfile"); // File | Input file to perform the operation on.
try {
    TextConversionResult result = apiInstance.convertDocumentHtmlToTxt(inputFile);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ConvertDocumentApi#convertDocumentHtmlToTxt");
    e.printStackTrace();
}
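Because the free tier caps us at 800 calls per month, it can be worth checking a file locally before spending a call on it. The sketch below is one possible pre-flight check and uses only the Java standard library; the class name, method names, and the accepted extensions (.html and .htm) are assumptions for illustration and are not part of the Cloudmersive client.

```java
import java.io.File;
import java.util.Locale;

public class HtmlInputCheck {
    /** Returns true if the name ends in .html or .htm (case-insensitive). */
    public static boolean looksLikeHtml(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return lower.endsWith(".html") || lower.endsWith(".htm");
    }

    /** Returns null if the file looks usable, otherwise a short reason it does not. */
    public static String problemWith(File file) {
        if (!file.isFile()) {
            return "not a regular file: " + file;
        }
        if (!file.canRead()) {
            return "not readable: " + file;
        }
        if (!looksLikeHtml(file.getName())) {
            return "unexpected extension: " + file.getName();
        }
        return null;
    }

    public static void main(String[] args) {
        File candidate = new File(args.length > 0 ? args[0] : "/path/to/inputfile");
        String problem = problemWith(candidate);
        System.out.println(problem == null
                ? "OK to upload: " + candidate
                : "Skipping, " + problem);
    }
}
```

A file that passes this check can then be handed to the API call in place of the hard-coded inputFile path.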
The second API iteration I will demonstrate can be used to convert an HTML string to text. This API’s request parameters require a simple HTML string input, which can be included in the following format:
<?xml version="1.0" encoding="UTF-8"?>
<HtmlToTextRequest>
    <Html>string</Html>
</HtmlToTextRequest>
With your HTML string properly formatted, you can pass your string through the below code examples, and you’re all done with this method:
// Import classes:
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.model.*; // assumed package for the request/response classes
//import com.cloudmersive.client.ConvertWebApi;
ApiClient defaultClient = Configuration.getDefaultApiClient();
// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");
ConvertWebApi apiInstance = new ConvertWebApi();
HtmlToTextRequest input = new HtmlToTextRequest(); // HtmlToTextRequest | HTML to Text request parameters
// Setter name assumed from the Html field in the request format above
input.setHtml("<p>Hello, world!</p>");
try {
    HtmlToTextResponse result = apiInstance.convertWebHtmlToTxt_0(input);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling ConvertWebApi#convertWebHtmlToTxt_0");
    e.printStackTrace();
}
The final API iteration I will demonstrate approaches the same goal from the other direction: rather than converting a complete HTML document, it removes any HTML found within a string of text. To call this API, we will need to format our input request like the example below:
<?xml version="1.0" encoding="UTF-8"?>
<RemoveHtmlFromTextRequest>
    <TextContainingHtml>string</TextContainingHtml>
</RemoveHtmlFromTextRequest>
Once our request is formatted, we can pass that request through the final code example below, which will return a plain TextContentResult string:
// Import classes:
//import com.cloudmersive.client.invoker.ApiClient;
//import com.cloudmersive.client.invoker.ApiException;
//import com.cloudmersive.client.invoker.Configuration;
//import com.cloudmersive.client.invoker.auth.*;
//import com.cloudmersive.client.model.*; // assumed package for the request/response classes
//import com.cloudmersive.client.EditTextApi;
ApiClient defaultClient = Configuration.getDefaultApiClient();
// Configure API key authorization: Apikey
ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey");
Apikey.setApiKey("YOUR API KEY");
// Uncomment the following line to set a prefix for the API key, e.g. "Token" (defaults to null)
//Apikey.setApiKeyPrefix("Token");
EditTextApi apiInstance = new EditTextApi();
RemoveHtmlFromTextRequest request = new RemoveHtmlFromTextRequest(); // RemoveHtmlFromTextRequest | Input request
// Setter name assumed from the TextContainingHtml field in the request format above
request.setTextContainingHtml("Hello, <b>world</b>!");
try {
    RemoveHtmlFromTextResponse result = apiInstance.editTextRemoveHtml(request);
    System.out.println(result);
} catch (ApiException e) {
    System.err.println("Exception when calling EditTextApi#editTextRemoveHtml");
    e.printStackTrace();
}
With these three solutions in your back pocket, you’ll be able to easily separate HTML from plain text at scale with three distinct use cases in mind. Each API solution returns a simple plain text string, which can be easily reviewed and edited within any rich or plain text editor.
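For anyone planning to run these conversions at scale on the free tier, the 800-call monthly limit is worth building into the loop itself. The following is a minimal sketch of one way to do that, assuming the conversion is wrapped in a plain Function; the class name, method name, and call budget parameter are hypothetical, and the converter used in main is a crude stand-in rather than a real API call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class BatchConversionSketch {
    /**
     * Converts inputs in order and stops once maxCalls conversions have been made.
     * The converter is a placeholder for a wrapper around one of the API calls.
     */
    public static List<String> convertAll(List<String> htmlInputs,
                                          Function<String, String> converter,
                                          int maxCalls) {
        List<String> results = new ArrayList<>();
        for (String html : htmlInputs) {
            if (results.size() >= maxCalls) {
                break; // call budget used up; remaining inputs are left unprocessed
            }
            results.add(converter.apply(html));
        }
        return results;
    }

    public static void main(String[] args) {
        // Stand-in converter; a real one would call the API and handle ApiException.
        Function<String, String> fakeConverter = html -> html.replaceAll("<[^>]*>", "");
        List<String> inputs = List.of("<p>one</p>", "<p>two</p>", "<p>three</p>");
        // prints: [one, two]
        System.out.println(convertAll(inputs, fakeConverter, 2));
    }
}
```

Keeping the converter behind a Function also makes it easy to swap between the three APIs, or to test the batching logic without using up any calls.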