In this Lucene example, we will learn to search indexed documents and highlight the searched terms in the search results using SimpleHTMLFormatter and SimpleSpanFragmenter.
1. Maven
Start by adding these Lucene dependencies. We are using Lucene 9.10.0 and Java 21.
<properties>
  <maven.compiler.source>21</maven.compiler.source>
  <maven.compiler.target>21</maven.compiler.target>
  <lucene.version>9.10.0</lucene.version>
</properties>

<dependencies>
  <dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>${lucene.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analysis-common</artifactId>
    <version>${lucene.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>${lucene.version}</version>
  </dependency>
  <dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>${lucene.version}</version>
  </dependency>
</dependencies>
2. Indexing the Text File Contents
To create the index and write it to the index files, we reuse the code from the Lucene text files example. The code is given here as-is; you can check out the details in the linked post.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneWriteIndexFromFileExample {
    public static void main(String[] args) {
        //Input folder
        String docsPath = "c:/temp/lucene/inputFiles";
        //Output folder
        String indexPath = "c:/temp/lucene/indexedFiles";
        //Input Path Variable
        final Path docDir = Paths.get(docsPath);
        //StandardAnalyzer (the no-arg constructor uses no stop words since Lucene 8)
        Analyzer analyzer = new StandardAnalyzer();
        //IndexWriter Configuration
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
        //try-with-resources closes the writer and the directory even if indexing fails
        try (Directory dir = FSDirectory.open(Paths.get(indexPath));
                IndexWriter writer = new IndexWriter(dir, iwc)) {
            //Recursively index all files and directories under docDir
            indexDocs(writer, docDir);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    static void indexDocs(final IndexWriter writer, Path path) throws IOException {
        //Directory?
        if (Files.isDirectory(path)) {
            //Iterate directory
            Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    try {
                        //Index this file
                        writeToIndex(writer, file, attrs.lastModifiedTime().toMillis());
                    } catch (IOException ioe) {
                        //Report the failure and continue with the next file
                        ioe.printStackTrace();
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        } else {
            //Index this file
            writeToIndex(writer, path, Files.getLastModifiedTime(path).toMillis());
        }
    }

    static void writeToIndex(IndexWriter writer, Path file, long lastModified) throws IOException {
        //Create lucene Document
        Document doc = new Document();
        doc.add(new StringField("path", file.toString(), Field.Store.YES));
        doc.add(new LongPoint("modified", lastModified));
        //Store the contents so that the text can be read back for highlighting
        String contents = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        doc.add(new TextField("contents", contents, Field.Store.YES));
        //Updates a document by first deleting the document(s)
        //containing the term and then adding the new document.
        //The delete and then add are atomic as seen
        //by a reader on the same index
        System.out.println("Writing file : " + file);
        writer.updateDocument(new Term("path", file.toString()), doc);
    }
}
3. Searching and Highlighting the Search Terms
In this section, we will search the index created in the previous step and then highlight the searched terms in the results returned by the Lucene searcher.
The document field that is searched and highlighted must be stored in the index (Field.Store.YES), because the highlighter works on the field's original text. Everything else is optional.
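To see why the field has to be stored, here is a minimal sketch that indexes one document with a stored and an unstored copy of the same text and reads both back. The class name StoredFieldSketch, the field name "notes" and the use of an in-memory ByteBuffersDirectory are assumptions made for illustration only; they are not part of the original example.

```java
import java.io.IOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical demo class; the field name "notes" is made up for illustration.
final class StoredFieldSketch {

    // Indexes one document and returns what can be read back from it:
    // [0] = value of the stored field, [1] = value of the unstored field.
    static String[] readBack() throws IOException {
        try (Directory dir = new ByteBuffersDirectory()) {
            try (IndexWriter writer =
                    new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                // Indexed and stored: searchable, and the text can be retrieved
                doc.add(new TextField("contents", "to cordial cottage", Field.Store.YES));
                // Indexed only: searchable, but the text cannot be retrieved
                doc.add(new TextField("notes", "to cordial cottage", Field.Store.NO));
                writer.addDocument(doc);
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                // The only document in a fresh index has id 0
                Document hit = new IndexSearcher(reader).doc(0);
                return new String[] {hit.get("contents"), hit.get("notes")};
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String[] values = readBack();
        System.out.println("contents = " + values[0]);
        System.out.println("notes    = " + values[1]);
    }
}
```

The unstored field comes back as null, so there would be no text to hand to the highlighter, even though the field is still searchable.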
3.1. Lucene Search Highlight Steps
In short, this is what we need to do to highlight searched terms in text:
- Search index with Query.
- Retrieve document text using document id from above step.
- Create TokenStream by document id and document text for the field
- Use token stream and highlighter to get array of text fragments.
- Iterate the array and display it. It has highlighted search terms.
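Conceptually, the last two steps boil down to walking over the tokens of the stored text and wrapping those that match a query term in a pre and post tag. The plain Java sketch below illustrates only that idea; it does not use Lucene, the helper wrapTerms is hypothetical, and Lucene's real analysis, scoring and fragmenting are considerably richer than this.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, Lucene-free illustration of term highlighting.
final class HighlightIdeaSketch {

    // A "token" here is simply a run of letters.
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    // Wraps every word whose lower-cased form is in queryTerms with pre/post tags.
    static String wrapTerms(String text, Set<String> queryTerms, String pre, String post) {
        Matcher m = WORD.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the non-word characters between the previous word and this one
            out.append(text, last, m.start());
            String word = m.group();
            if (queryTerms.contains(word.toLowerCase(Locale.ROOT))) {
                out.append(pre).append(word).append(post);
            } else {
                out.append(word);
            }
            last = m.end();
        }
        // Copy whatever follows the last word
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Society excited by cottage private an it esteems.";
        // prints: Society excited by <B>cottage</B> <B>private</B> an it esteems.
        System.out.println(wrapTerms(text, Set.of("cottage", "private"), "<B>", "</B>"));
    }
}
```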
3.2. Java Program to Search and Highlight Lucene Matches
Let’s code the steps discussed above.
import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
import org.apache.lucene.search.highlight.TokenSources;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneSearchHighlighterExample {
    //This contains the lucene indexed documents
    private static final String INDEX_DIR = "c:/temp/lucene/indexedFiles";
    private static String searchQuery = "cottage private discovery concluded";

    public static void main(String[] args) throws Exception {
        //Open the index directory and an index reader, which gives a
        //point-in-time view of the index; both are closed automatically
        try (Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
                IndexReader reader = DirectoryReader.open(dir)) {
            //Create lucene searcher. It searches over a single IndexReader.
            IndexSearcher searcher = new IndexSearcher(reader);
            //StandardAnalyzer (the no-arg constructor uses no stop words since Lucene 8)
            Analyzer analyzer = new StandardAnalyzer();
            //Query parser that builds queries against the "contents" field
            QueryParser qp = new QueryParser("contents", analyzer);
            //Create the query
            Query query = qp.parse(searchQuery);
            //Search the lucene documents
            TopDocs hits = searcher.search(query, 10);

            /** Highlighter Code Start ****/
            //Uses HTML <B></B> tag to highlight the searched terms
            Formatter formatter = new SimpleHTMLFormatter();
            //It scores text fragments by the number of unique query terms found
            //Basically the matching score in layman's terms
            QueryScorer scorer = new QueryScorer(query);
            //Used to mark up highlighted terms found in the best sections of a text
            Highlighter highlighter = new Highlighter(formatter, scorer);
            //It breaks text up into same-size fragments but does not split up spans
            Fragmenter fragmenter = new SimpleSpanFragmenter(scorer, 10);
            //Alternative: breaks text up into same-size fragments with no concern for sentence boundaries
            //Fragmenter fragmenter = new SimpleFragmenter(10);
            //Set fragmenter to highlighter
            highlighter.setTextFragmenter(fragmenter);

            //Iterate over found results
            for (int i = 0; i < hits.scoreDocs.length; i++) {
                int docid = hits.scoreDocs[i].doc;
                Document doc = searcher.doc(docid);
                String title = doc.get("path");
                //Print the document to which the result belongs
                System.out.println("Path : " + title);
                //Get stored text from found document
                String text = doc.get("contents");
                //Create token stream
                TokenStream stream = TokenSources.getAnyTokenStream(reader, docid, "contents", analyzer);
                //Get highlighted text fragments
                String[] frags = highlighter.getBestFragments(stream, text, 10);
                for (String frag : frags) {
                    System.out.println("=======================");
                    System.out.println(frag);
                }
            }
        }
    }
}
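If you want to experiment without creating files under c:/temp, the following self-contained variant may help. It is only a sketch under a few assumptions that are not part of the original example: it keeps the index in memory with ByteBuffersDirectory, uses a hypothetical class name InMemoryHighlightSketch, replaces the default <B> tags with <mark> tags through the two-argument SimpleHTMLFormatter constructor, and lets the Highlighter analyze the stored text itself via getBestFragment(analyzer, field, text) instead of going through TokenSources.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical demo class: indexes one text in memory and highlights matches.
final class InMemoryHighlightSketch {

    // Returns the best highlighted fragment of each hit, one per line.
    static String highlight(String text, String queryText) throws Exception {
        Analyzer analyzer = new StandardAnalyzer();
        try (Directory dir = new ByteBuffersDirectory()) {
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                // The field must be stored so that its text can be read back
                doc.add(new TextField("contents", text, Field.Store.YES));
                writer.addDocument(doc);
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("contents", analyzer).parse(queryText);
                // Custom pre/post tags instead of the default <B></B>
                Highlighter highlighter = new Highlighter(
                        new SimpleHTMLFormatter("<mark>", "</mark>"), new QueryScorer(query));
                StringBuilder out = new StringBuilder();
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    String stored = searcher.doc(hit.doc).get("contents");
                    // Re-analyzes the stored text; returns null if nothing matches
                    String frag = highlighter.getBestFragment(analyzer, "contents", stored);
                    if (frag != null) {
                        out.append(frag).append(System.lineSeparator());
                    }
                }
                return out.toString();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.print(highlight(
                "Society excited by cottage private an it esteems.", "cottage private"));
    }
}
```

Re-analyzing the stored text is the simplest route for small documents; the same analyzer is used for indexing and highlighting here so that the tokens line up with the query terms.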
4. Demo
As a prerequisite, create the Lucene index for some text files as shown in the Lucene text files example.
Let’s say we want to search for documents containing the words “cottage private discovery concluded”. Run the class using its main() method and verify the output:
Path : c:\temp\lucene\inputFiles\data3.txt
=======================
Or neglected agreeable of <B>discovery</B> <B>concluded</B> oh it sportsman. Week to time in john. Son elegance
Path : c:\temp\lucene\inputFiles\data1.txt
=======================
Society excited by <B>cottage</B> <B>private</B> an it esteems. Fully begin on by wound an. Girl rich in do up or
Path : c:\temp\lucene\inputFiles\data2.txt
=======================
to cordial <B>cottage</B>.
Try searching for other terms and verify the results yourself.
Happy Learning !!