Using PDFs With the Jamstack: Adding Search With Text Extraction
Adobe's PDF Extract API can help enable search features for your PDF library.
A few weeks ago, I shared a couple of blog posts describing how to add PDFs to your Jamstack site. The first talked about using the Adobe PDF Embed API to give you more control over viewing PDFs on an Eleventy site. The second example took it a bit further by using Adobe's PDF Services API to generate thumbnails of the PDFs when the Eleventy site was built. Since I wrote those posts, we released a new service, the PDF Extract API.
This API extracts information from your PDF, including:
- Text, of course, plus deep information about the text, such as its font, position, etc.
- Tables as Excel, CSV, or plain images
- Images
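You tell the API which of these you want when you build the extraction options. The code later in this post only asks for text, but as a sketch, asking for tables (and renditions of them) as well would look something like this. This is based on my reading of Adobe's PDF Services Node SDK samples; verify the exact option names against the current docs:

// A sketch of requesting tables in addition to text - method names
// here are assumptions based on the SDK samples, so double-check them.
const options = new PDFServicesSdk.ExtractPDF.options.ExtractPdfOptions.Builder()
    .addElementsToExtract(
        PDFServicesSdk.ExtractPDF.options.ExtractElementType.TEXT,
        PDFServicesSdk.ExtractPDF.options.ExtractElementType.TABLES
    )
    .addElementsToExtractRenditions(
        PDFServicesSdk.ExtractPDF.options.ExtractRenditionsElementType.TABLES
    )
    .build();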
There's a tremendous amount of detail in the output, but for our purposes, the interesting part is the text. If our Jamstack site is making use of PDFs, it would be helpful to provide a way for users to search for text contained inside those PDFs. Let's look at an update to the last demo that adds this feature.
First, I worked on the data file I used to get PDF information for the site. This file (I'll link to the repo at the end) is found in _data/pdfs.js. In the last post, the code did two things: scan for the PDFs and add them to an array, and generate a thumbnail for each one (if it didn't already exist). My first modification gets and saves the text content of each PDF. As with the thumbnails, we only do this one time.
let textFile = pdf.replace('.pdf', '.txt');
let text = '';
if(!fs.existsSync(textFile)) {
    console.log('need to generate text ' + textFile);
    text = await getPDFText(pdf, creds);
    console.log(`Got text back, length is ${text.length}`);
    fs.writeFileSync(textFile, text);
} else text = fs.readFileSync(textFile, 'utf-8');

result.push({
    path: files[i].replace('./', '/'),
    name,
    text,
    thumb
});
In the code above, pdf is one of the PDF file names and will be something.pdf. We want to cache the output, so I use the same name with a different extension. The hard work is done in getPDFText. Our docs go into detail about how this works, but the general flow is:
- Tell the API what you want (text, tables, images)
- Pass that and the PDF to the API
- Get a zip result back
- Extract the zip and do whatever
For the most part, it's pretty straightforward; the trickiest bit is dealing with the zip. Not hard per se, but you need to find a good zip library for your platform that's easy to use. I used node-stream-zip, which worked fine. Here's the entirety of the function:
async function getPDFText(path, creds) {
    return new Promise((resolve, reject) => {
        // Unique name for the result zip so parallel operations don't collide.
        const OUTPUT_ZIP = `./output${nanoid()}.zip`;

        const credentials = PDFServicesSdk.Credentials.serviceAccountCredentialsBuilder()
            .withClientId(creds.clientId)
            .withClientSecret(creds.clientSecret)
            .withPrivateKey(creds.privateKey)
            .withOrganizationId(creds.organizationId)
            .withAccountId(creds.accountId)
            .build();

        // Create an ExecutionContext using credentials
        const executionContext = PDFServicesSdk.ExecutionContext.create(credentials);

        // Build extractPDF options - we only want text
        const options = new PDFServicesSdk.ExtractPDF.options.ExtractPdfOptions.Builder()
            .addElementsToExtract(PDFServicesSdk.ExtractPDF.options.ExtractElementType.TEXT)
            .build();

        // Create a new operation instance.
        const extractPDFOperation = PDFServicesSdk.ExtractPDF.Operation.createNew(),
            input = PDFServicesSdk.FileRef.createFromLocalFile(
                path,
                PDFServicesSdk.ExtractPDF.SupportedSourceFormat.pdf
            );

        extractPDFOperation.setInput(input);
        extractPDFOperation.setOptions(options);

        // Execute the operation
        extractPDFOperation.execute(executionContext)
            .then(result => result.saveAsFile(OUTPUT_ZIP))
            .then(async () => {
                // The zip's first (and only) entry is the JSON result.
                let zipFile = new StreamZip.async({ file: OUTPUT_ZIP });
                let entries = await zipFile.entries();
                let first = Object.values(entries)[0];
                // outputPath is defined elsewhere in the data file.
                let dest = outputPath + nanoid() + '.' + first.name.split('.').pop();
                await zipFile.extract(first.name, dest);
                await zipFile.close();

                // Concatenate every element's Text value into one blob.
                let data = JSON.parse(fs.readFileSync(dest, 'utf-8'));
                let text = data.elements.filter(e => e.Text).reduce((result, e) => {
                    return result + e.Text + '\n';
                }, '');

                // Clean up the temporary files.
                fs.unlinkSync(dest);
                fs.unlinkSync(OUTPUT_ZIP);
                resolve(text);
            })
            .catch(err => reject(err));
    });
}
It creates a unique filename for the resulting zip so that multiple operations don't try to overwrite the same file. The code cleans up after itself, but I figured better safe than sorry. Next is the setup code for PDF Services, basically just loading the credentials. We then build an operation object, set the parameters (addElementsToExtract), and execute.
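If you're wondering where creds comes from, it's just an object with the five values the credentials builder needs. A minimal sketch of assembling it, where the file names and JSON keys are assumptions based on what an Adobe credentials download typically contains:

// A hypothetical loadCreds helper - file names and JSON keys are assumptions,
// so adjust for whatever your credentials download actually looks like.
const fs = require('fs');

function loadCreds() {
    const config = JSON.parse(fs.readFileSync('./pdftools-api-credentials.json', 'utf-8'));
    return {
        clientId: config.client_credentials.client_id,
        clientSecret: config.client_credentials.client_secret,
        privateKey: fs.readFileSync('./private.key', 'utf-8'),
        organizationId: config.service_account_credentials.organization_id,
        accountId: config.service_account_credentials.account_id
    };
}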
At this point, we'll have a zip file with our data. In that zip is a JSON file containing every bit of text data the call could extract. This is a huge JSON file, and you can grab a schema file to help make sense of it. Here's one snippet of some sample output:
{
    "Bounds": [
        72,
        603.0959930419922,
        523.8600311279297,
        743.4600067138672
    ],
    "Font": {
        "alt_family_name": "Titlingmes New Roman PSMT",
        "embedded": true,
        "encoding": "WinAnsiEncoding",
        "family_name": "Titlingmes New Roman PSMT",
        "font_type": "TrueType",
        "italic": false,
        "monospaced": false,
        "name": "RHNYAF+TimesNewRomanPSMT",
        "subset": true,
        "weight": 400
    },
    "HasClip": false,
    "Lang": "en",
    "Page": 1,
    "Path": "//Document/P[4]",
    "Text": "Performing architecture-level trade studies and mission concept point design studies was a major effort, presenting a number of challenges to the study centers, NASA HQ and the NRC PSDS team. As a result of this work, many lessons were learned along the way that can benefit future efforts that require similar design team study support. The PSDS Lessons Learned task was undertaken to identify those lessons learned and make them available for future similar efforts. The approach taken was to send questionnaires to the full range of study participants, including the PSDS science champions and panels, study center leads and team participants (JPL, GSFC, APL, GRC and MSFC), and NASA HQ POCs. The set of responses reflects a complete cross-section of study participants and provides excellent insight into the PSDS mission study process. This report documents the lessons learned. ",
    "TextSize": 12,
    "attributes": {
        "LineHeight": 13.75,
        "SpaceAfter": 18
    }
},
The important part is just Text, and I use that in my filter and reduce calls to generate one blob of text.
The result of all this is an array of PDF data that includes where the PDF is (I use this to render it with the Embed API), a path to the thumbnail (used on the home page for links), and the text content.
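To make that concrete, a single entry in the array would look something like this. The shape matches the result.push() call earlier; the values here are illustrative, not real output:

// A hypothetical entry in the PDF data array.
{
    path: '/pdfs/something.pdf',
    name: 'something',
    text: 'Performing architecture-level trade studies...',
    thumb: '/pdfs/something.jpg'
}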
Alright, so at this point, I can build a search engine. I've covered various types of Jamstack-friendly search engines before, but the simplest one would be Lunr. The quick and dirty solution is to expose my PDF data for indexing by Lunr and then add a page that uses JavaScript and the Lunr library to search it. First, here is my index:
---
permalink: /searchdata.json
---
[
{% for pdf in pdfs %}
    {
        "name": {{ pdf.name | jsonify }},
        "url": "/pdf/{{ pdf.name }}",
        "text": {{ pdf.text | jsonify }}
    }{% unless forloop.last %},{% endunless %}
{% endfor %}
]
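When Eleventy builds the site, that template renders something like this (abbreviated, with illustrative values):

[
    {
        "name": "something",
        "url": "/pdf/something",
        "text": "Performing architecture-level trade studies..."
    }
]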
The jsonify filter is one I wrote myself; it may be found in .eleventy.js:
eleventyConfig.addFilter('jsonify', function (variable) {
    return JSON.stringify(variable);
});
I forgot to use an arrow function for that. I feel so un-hipster. Sorry. The result of this is a JSON file containing each PDF's name, URL, and text. If your spidey sense is tingling, yes, this could be a huge JSON file. Earlier this month, I discussed using Lunr in a serverless fashion: Using Lunr with Eleventy via Netlify Serverless Functions - Part Two. Keep that in mind as a possible alteration to what I've done here. Now for the front end:
---
layout: main
---

<h2>Search</h2>

<p>
    <input type="search" id="term" placeholder="Enter your search here...">
    <button id="searchBtn">Search</button>
</p>

<div id="result"></div>

<script src="https://unpkg.com/lunr/lunr.js"></script>
<script>
document.addEventListener('DOMContentLoaded', init, false);

let idx, $term, pdfs, $result;

async function init() {
    console.log('load original data');
    let dataRequest = await fetch('/searchdata.json');
    pdfs = await dataRequest.json();

    // Build the Lunr index, using the array position as the document ref.
    idx = lunr(function () {
        this.ref('id');
        this.field('name');
        this.field('text');

        pdfs.forEach(function (doc, idx) {
            doc.id = idx;
            this.add(doc);
        }, this);
    });

    console.log('Search index created');

    document.querySelector('#searchBtn').addEventListener('click', search, false);
    $term = document.querySelector('#term');
    $result = document.querySelector('#result');
}

function search() {
    if($term.value === '') return;
    console.log(`search for ${$term.value}`);

    let results = idx.search($term.value);
    // Map Lunr's refs back to the original PDF data.
    results.forEach(r => {
        r.name = pdfs[r.ref].name;
        r.url = pdfs[r.ref].url;
    });

    if(results.length > 0) {
        let result = '<p>Search results:</p><ul>';
        results.forEach(r => {
            result += `<li><a href="${r.url}?term=${encodeURIComponent($term.value)}">${r.name}</a></li>`;
        });
        result += '</ul>';
        $result.innerHTML = result;
    } else {
        $result.innerHTML = '<p>Sorry, but there were no results.</p>';
    }
}
</script>
I'm just using some vanilla JS here to load the data, pass it to Lunr, and set up the form field and button to handle searching. If you want to give this a spin, head over to https://pdftest3.vercel.app/ and click the Search link on top. A good search term is "launch." To make it even fancier (I'm all about the fancy), I made it so that when you click through to the embedded view, I pass along the search term and use the Embed API to highlight it.
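That last bit isn't shown above, but the gist is to read the term query parameter on the PDF view page and hand it to the Embed API's search support. A rough sketch, where the client ID, div ID, and file URL are placeholders, and the enableSearchAPIs flag reflects my reading of the Embed API docs (double-check it there):

// Hypothetical highlight step on the PDF view page.
const term = new URLSearchParams(window.location.search).get('term');

const adobeDCView = new AdobeDC.View({ clientId: 'YOUR_CLIENT_ID', divId: 'pdfView' });
const previewFilePromise = adobeDCView.previewFile({
    content: { location: { url: '/pdfs/something.pdf' } },
    metaData: { fileName: 'something.pdf' }
}, { enableSearchAPIs: true });

if(term) {
    previewFilePromise
        .then(viewer => viewer.getAPIs())
        .then(apis => apis.search(term));
}

You can see the real version of this in action in the demo linked above.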