Searching and Indexing With Apache Lucene
Apache Lucene's indexing and searching capabilities make it attractive for a wide range of uses, from development projects to academic work. This article walks through an example of how the search engine works.
Apache Lucene is a high-performance, full-featured text search engine library from the Apache Software Foundation, written entirely in Java. It is suitable for nearly any application that requires full-text search, especially in a cross-platform environment. In this article, we will look at some of the most interesting features of Apache Lucene and then walk through a step-by-step example of indexing and searching documents.
Apache Lucene Features
Lucene offers powerful features such as scalable, high-performance indexing of documents and search capability through a simple API. It uses powerful, accurate, and efficient search algorithms written in Java. Most importantly, it is a cross-platform solution, and it is popular in both academic and commercial settings due to its performance, configurability, and generous licensing terms. The Lucene home page is http://lucene.apache.org.
Lucene provides search over documents, where a document is essentially a collection of fields. A field consists of a field name (a string) and one or more field values. Lucene does not constrain document structure in any way. Each field, however, stores only one kind of data: binary, numeric, or text. There are two ways to store text data: string fields store the entire item as a single string, while text fields store the data as a series of tokens. Lucene provides many ways to break a piece of text into tokens, as well as hooks that allow you to write custom tokenizers. Lucene also has a highly expressive search API that takes a search query and returns a set of documents ranked by relevancy, with the documents most similar to the query receiving the highest scores.
The Lucene API consists of a core library and many contributed libraries. The top-level package is org.apache.lucene. As of Lucene 6, the distribution contains approximately two dozen package-specific jars; this cuts down on the size of an application at a small cost to the complexity of the build file. In a nutshell, the features of Lucene can be described as follows:
Scalable and High-Performance Indexing
- Small RAM requirements — only 1MB heap.
- Incremental indexing as fast as batch indexing.
- Index size roughly 20-30% the size of text indexed.
Powerful, Accurate, and Efficient Search Algorithms
- Provides ranked searching — i.e. best results returned first.
- Supports many powerful query types: phrase queries, wildcard queries, proximity queries, range queries, and more (a short query-syntax sketch follows this feature list).
- Provides fielded searching (e.g. title, author, contents).
- Supports sorting by any field.
- Supports multiple-index searching with merged results.
- Allows simultaneous updating and searching.
- Has flexible faceting, highlighting, joins, and result grouping.
- Provides fast, memory-efficient, and typo-tolerant suggesters.
- Provides pluggable ranking models, including the Vector Space Model and Okapi BM25.
- Provides configurable storage engine (codecs).
Cross-platform solution
- Available as open-source software under the Apache License, which lets you use Lucene in both commercial and open-source programs.
- 100%-pure Java.
- Index-compatible implementations are available in other programming languages.
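To give a concrete feel for the query types listed above, here is a small sketch of the classic Lucene query parser syntax. The parser object and the field names (title, pubdate, and the default field contents) are illustrative only and are not part of the example application built later in this article:

// Assuming a QueryParser whose default field is "contents" (illustrative):
parser.parse("title:\"apache lucene\"");         // phrase query restricted to the title field
parser.parse("luc*");                            // wildcard query
parser.parse("\"apache lucene\"~5");             // proximity query: terms within 5 positions
parser.parse("pubdate:[20160101 TO 20161231]");  // range query on a pubdate field
parser.parse("+lucene -solr");                   // required and prohibited terms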
How Does Apache Lucene Work?
In this section, we will see how Apache Lucene approaches document indexing and searching.
A Lucene Index Is an Inverted Index
Lucene manages an index over a dynamic collection of documents and provides very rapid updates to the index as documents are added to and deleted from the collection. An index may store a heterogeneous set of documents, with any number of different fields that may vary from document to document in arbitrary ways. Lucene indexes terms, which means that Lucene searches over terms. A term combines a field name with a token. The terms created from non-text fields are pairs consisting of the field name and the field value. The terms created from text fields are pairs of field name and token.
The Lucene index provides a mapping from terms to documents. This is called an inverted index because it reverses the usual mapping of a document to the terms it contains. The inverted index provides the mechanism for scoring search results: if a number of search terms all map to the same document, then that document is likely to be relevant.
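As a rough mental picture only (a toy sketch, not Lucene's actual on-disk data structures), an inverted index can be thought of as a map from terms to the set of IDs of the documents containing them, using plain java.util collections:

// Toy inverted index: each term maps to the IDs of the documents containing it.
Map<String, Set<Integer>> invertedIndex = new HashMap<>();
invertedIndex.put("lucene", new HashSet<>(Arrays.asList(1, 7)));
invertedIndex.put("search", new HashSet<>(Arrays.asList(7)));
// A query for "lucene search" looks up both posting sets; document 7 matches
// both terms, so it would be ranked above document 1.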
Lucene Index Fields
Conceptually, Lucene provides indexing and search over documents, but implementation-wise, all indexing and search are carried out over fields. A document is a collection of fields. Each field has three parts: name, type, and value. At search time, the supplied field name restricts the search to particular fields. For example, a MEDLINE citation can be represented as a series of fields: one field for the title of the article, another field for the name of the journal in which it was published, another field for the authors of the article, a pub-date field for the date of publication, a field for the text of the article's abstract, and another field for the list of topic keywords drawn from Medical Subject Headings (MeSH). Each of these fields is given a different name, and at search time, the client could specify that it is searching for authors or titles or both, potentially restricting the search to a date range and set of journals by constructing search terms for the appropriate fields and values.
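As a sketch of such a citation, using the older 3.x-style Field API that the example code later in this article also uses (the field names and values are purely illustrative):

// A MEDLINE-like citation as a Lucene document.
Document citation = new Document();
citation.add(new Field("title", "Full-text search of clinical notes", Field.Store.YES, Field.Index.ANALYZED));
citation.add(new Field("journal", "Example Journal of Informatics", Field.Store.YES, Field.Index.NOT_ANALYZED));
citation.add(new Field("author", "Doe, J.", Field.Store.YES, Field.Index.ANALYZED));
citation.add(new Field("pub-date", "20161101", Field.Store.YES, Field.Index.NOT_ANALYZED));
citation.add(new Field("abstract", "We evaluate full-text search over ...", Field.Store.NO, Field.Index.ANALYZED));
citation.add(new Field("mesh", "Information Storage and Retrieval", Field.Store.YES, Field.Index.ANALYZED));
// At search time, a query such as author:doe is restricted to the author field.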
Indexing Documents
Document indexing consists of first constructing a document that contains the fields to be indexed or stored, then adding that document to the index. The key classes involved in indexing are oal.index.IndexWriter (where oal abbreviates org.apache.lucene), which is responsible for adding documents to an index, and oal.store.Directory, which is the storage abstraction used for the index itself. Directories provide an interface that's similar to an operating system's file system. A Directory contains any number of sub-indexes called segments. Maintaining the index as a set of segments allows Lucene to rapidly update and delete documents from the index.
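A minimal sketch of how these two classes fit together, again in the 3.x-style API used throughout this article (the path is illustrative, and RAMDirectory would need an additional import from org.apache.lucene.store):

// A Directory abstracts where the index lives: FSDirectory keeps it on disk,
// while RAMDirectory keeps it in memory (handy for tests).
Directory onDisk = FSDirectory.open(new File("C:/Exp/Index/"));
Directory inMemory = new RAMDirectory();
// The IndexWriter buffers added documents and flushes them to new segments.
IndexWriter writer = new IndexWriter(onDisk, new SimpleAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
// ... add documents here ...
writer.close();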
Document Search and Search Ranking
The Lucene search API takes a search query and returns a set of documents ranked by relevancy with documents most similar to the query having the highest score. Lucene provides a highly configurable hybrid form of search that combines exact boolean searches with softer, more relevance-ranking-oriented vector-space search methods. All searches are field specific because Lucene indexes terms and a term is composed of a field name and a token.
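As a hedged sketch of that hybrid behavior (3.x-style API; Term lives in org.apache.lucene.index, while TermQuery and BooleanClause live in org.apache.lucene.search, and the field and term values are illustrative):

// The MUST clause acts as an exact boolean filter, while the SHOULD clause
// only boosts the relevance score of documents that also match it.
BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("filename", "lucene")), BooleanClause.Occur.MUST);
query.add(new TermQuery(new Term("contents", "search")), BooleanClause.Occur.SHOULD);
// IndexSearcher.search(query, n) then returns the top n documents by score.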
An Example of Document Indexing and Searching
In this section, we will see a step-by-step example that shows document indexing and searching with Apache Lucene.
Step 1: Importing the Required APIs and Packages
package com.example.lucene;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
Step 2: File Indexing
First, select the index directory where the index will be saved and then select the data directory, as follows:
File indexDir = new File("C:/Exp/Index/");
File dataDir = new File("C:/Users/rezkar/Downloads/lucene-6.3.0/lucene-6.3.0/");
Now select the suffix of the files that you intend to index and later search:
String suffix = "jar";
Since we will be indexing the files with the extension "jar", call the file indexer (SimpleFileIndexer, a user-defined class) as follows:
SimpleFileIndexer indexer = new SimpleFileIndexer();
Now create the index and see how many files were indexed:
int numIndex = indexer.index(indexDir, dataDir, suffix);
System.out.println("Numer of total files got indexed: " + numIndex);
Here the index() method goes as follows:
private int index(File indexDir, File dataDir, String suffix) throws Exception {
    // Open the index directory and create a writer that tokenizes text with SimpleAnalyzer.
    IndexWriter indexWriter = new IndexWriter(FSDirectory.open(indexDir), new SimpleAnalyzer(), true, IndexWriter.MaxFieldLength.LIMITED);
    indexWriter.setUseCompoundFile(false);
    // Recursively index every matching file under the data directory.
    indexDirectory(indexWriter, dataDir, suffix);
    int numIndexed = indexWriter.maxDoc();
    indexWriter.optimize();
    indexWriter.close();
    return numIndexed;
}
The above code creates the index and writes it to the index directory we selected earlier, after analyzing the file contents with SimpleAnalyzer. Finally, the method returns the number of files that have been indexed. Note that the indexDirectory() method takes three parameters: the index writer, the data directory, and the suffix of the files (here, .jar) to be indexed. The indexDirectory() method goes as follows:
private void indexDirectory(IndexWriter indexWriter, File dataDir, String suffix) throws IOException {
    File[] files = dataDir.listFiles();
    for (int i = 0; i < files.length; i++) {
        File f = files[i];
        if (f.isDirectory()) {
            // Recurse into sub-directories.
            indexDirectory(indexWriter, f, suffix);
        } else {
            indexFileWithIndexWriter(indexWriter, f, suffix);
        }
    }
}
According to the above code segment, the indexer walks the data directory recursively: sub-directories are descended into, and regular files are handed to the indexFileWithIndexWriter() method, which goes as follows:
private void indexFileWithIndexWriter(IndexWriter indexWriter, File f, String suffix) throws IOException {
    // Skip files that cannot or should not be indexed.
    if (f.isHidden() || f.isDirectory() || !f.canRead() || !f.exists()) {
        return;
    }
    // Skip files that do not end with the requested suffix.
    if (suffix != null && !f.getName().endsWith(suffix)) {
        return;
    }
    System.out.println("Indexing file:... " + f.getCanonicalPath());
    Document doc = new Document();
    // The file contents are tokenized but not stored; the file name is stored and analyzed.
    doc.add(new Field("contents", new FileReader(f)));
    doc.add(new Field("filename", f.getCanonicalPath(), Field.Store.YES, Field.Index.ANALYZED));
    indexWriter.addDocument(doc);
}
After successful indexing, you should observe the following output:
Indexing file:... C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\analysis\common\lucene-analyzers-common-6.3.0.jar
Indexing file:... C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\analysis\icu\lib\icu4j-56.1.jar
Indexing file:... C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\analysis\morfologik\lucene-analyzers-morfologik-6.3.0.jar
.
.
.
Indexing file:... C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\test-framework\lucene-test-framework-6.3.0.jar
Number of total files indexed: 60
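One caveat: the indexing code above uses the older 3.x-style IndexWriter constructor (note the Version.LUCENE_30 constant in Step 3), even though the files being indexed come from the Lucene 6.3.0 distribution. If you compile against a Lucene 6 jar instead, the writer setup and field creation would look roughly like the following sketch (class names from the 6.x API; f stands for the file being indexed, as in indexFileWithIndexWriter() above; additional imports such as IndexWriterConfig, StandardAnalyzer, TextField, StringField, and Paths are needed):

// Rough Lucene 6.x equivalent of the 3.x-style setup above; a sketch, not a
// drop-in replacement for the rest of the example code.
Directory dir = FSDirectory.open(Paths.get("C:/Exp/Index/"));
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
IndexWriter writer = new IndexWriter(dir, config);

Document doc = new Document();
doc.add(new TextField("contents", new FileReader(f)));                        // tokenized, not stored
doc.add(new StringField("filename", f.getCanonicalPath(), Field.Store.YES));  // stored as a single token
writer.addDocument(doc);
writer.close();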
Step 3: Search the Files
In this step, we will search the files that we indexed in the previous step and print their names. The workflow for this step goes as follows:
1. Show the index directory.
2. Input the search query, e.g., "lucene".
3. Use the user-defined SimpleSearcher class.
4. Perform the search operation.
5. Print the result.
Technically, these five steps can be performed using the following code segment:
public static void main(String[] args) throws Exception {
    File indexDir = new File("C:/Exp/Index/");
    String query = "lucene";
    int hits = 100;
    SimpleSearcher searcher = new SimpleSearcher();
    searcher.searchIndex(indexDir, query, hits);
}
Here, searchIndex() is a user-defined method that performs the actual search. It goes as follows:
private void searchIndex(File indexDir, String queryStr, int maxHits) throws Exception {
    Directory directory = FSDirectory.open(indexDir);
    IndexSearcher searcher = new IndexSearcher(directory);
    // Parse the query string against the "contents" field using the same analyzer used at index time.
    QueryParser parser = new QueryParser(Version.LUCENE_30, "contents", new SimpleAnalyzer());
    Query query = parser.parse(queryStr);
    TopDocs topDocs = searcher.search(query, maxHits);
    ScoreDoc[] hits = topDocs.scoreDocs;
    for (int i = 0; i < hits.length; i++) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        // Print the stored file name of each matching document.
        System.out.println(d.get("filename"));
    }
    System.out.println("Found " + hits.length);
}
This method searches the index and prints the names of the matching files. For the sample data directory, you can download the Apache Lucene 6.3.0 distribution from the Lucene home page. On successful execution of the above method, you should observe output like the following:
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\analysis\uima\lib\WhitespaceTokenizer-2.3.1.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\benchmark\lib\xercesImpl-2.9.1.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\replicator\lib\jetty-continuation-9.3.8.v20160314.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\analysis\morfologik\lib\morfologik-fsa-2.1.1.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\queryparser\lucene-queryparser-6.3.0.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\analysis\uima\lucene-analyzers-uima-6.3.0.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\spatial-extras\lib\slf4j-api-1.7.7.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\analysis\morfologik\lib\morfologik-polish-2.1.1.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\analysis\morfologik\lib\morfologik-stemming-2.1.1.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\benchmark\lib\spatial4j-0.6.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\demo\lucene-demo-6.3.0.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\replicator\lib\commons-logging-1.1.3.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\spatial-extras\lib\spatial4j-0.6-tests.jar
C:\Users\rezkar\Downloads\lucene-6.3.0\lucene-6.3.0\spatial-extras\lib\spatial4j-0.6.jar
Found 14
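Because every indexed term carries its field name, the same parser and searcher objects from searchIndex() can also run a field-qualified query. For example, the sketch below restricts the match to the stored file names rather than the file contents (the query value is just an example):

// The "filename:" prefix overrides the parser's default field ("contents"),
// so only terms indexed from the file names are matched.
Query byName = parser.parse("filename:queryparser");
TopDocs byNameDocs = searcher.search(byName, maxHits);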
Conclusion
In this article, I tried to cover some essential features of Lucene. Putting the above code fragments together into a full application is left as an exercise for the reader.
Nevertheless, if this does not work, readers can download the source code, a sample data folder, and the Maven-friendly pom.xml file from my GitHub repository.
Any kind of feedback is welcome. Happy reading!