Lucene Search Highlight Example

In this Lucene example, we will learn to search indexed documents and highlight searched terms in search results using SimpleHTMLFormatter and SimpleSpanFragmenter.

1. Maven

Start by adding these Lucene dependencies to the pom.xml. We are using Lucene 9.10.0 and Java 21.

<properties> 
  <maven.compiler.source>21</maven.compiler.source>
  <maven.compiler.target>21</maven.compiler.target>
  <lucene.version>9.10.0</lucene.version>
</properties>

<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>${lucene.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analysis-common</artifactId>
  <version>${lucene.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <version>${lucene.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-highlighter</artifactId>
  <version>${lucene.version}</version>
</dependency>

2. Indexing the Text File Contents

To create the index and write it to the index directory, we reuse the code from the Lucene text files example. The code is given here directly; you can check out the details in the linked post.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneWriteIndexFromFileExample {
  public static void main(String[] args) {
    //Input folder
    String docsPath = "c:/temp/lucene/inputFiles";

    //Output folder
    String indexPath = "c:/temp/lucene/indexedFiles";

    //Input Path Variable
    final Path docDir = Paths.get(docsPath);

    //StandardAnalyzer lowercases tokens; its no-arg constructor uses an empty stop word set
    Analyzer analyzer = new StandardAnalyzer();

    //IndexWriter Configuration
    IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
    iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);

    //try-with-resources closes the writer and the directory even if indexing fails
    try (Directory dir = FSDirectory.open(Paths.get(indexPath));
         IndexWriter writer = new IndexWriter(dir, iwc)) {
      //Recursively index all files under the input folder
      indexDocs(writer, docDir);
    } catch (IOException e) {
      e.printStackTrace();
    }
  }

  static void indexDocs(final IndexWriter writer, Path path) throws IOException {
    //Directory?
    if (Files.isDirectory(path)) {
      //Iterate directory
      Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
          try {
            //Index this file
            writeToIndex(writer, file, attrs.lastModifiedTime().toMillis());
          } catch (IOException ioe) {
            //Log the failure and continue with the next file
            ioe.printStackTrace();
          }
          return FileVisitResult.CONTINUE;
        }
      });
    } else {
      //Index this file
      writeToIndex(writer, path, Files.getLastModifiedTime(path).toMillis());
    }
  }

  static void writeToIndex(IndexWriter writer, Path file, long lastModified) throws IOException {
    //Create lucene Document
    Document doc = new Document();

    doc.add(new StringField("path", file.toString(), Field.Store.YES));
    doc.add(new LongPoint("modified", lastModified));

    //"contents" is stored so that the highlighter can read the original text later.
    //Assumption: the input files are UTF-8 encoded.
    String contents = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    doc.add(new TextField("contents", contents, Field.Store.YES));

    //updateDocument first deletes any document(s) containing the
    //given term and then adds the new document. The delete and
    //the add are atomic as seen by a reader on the same index.
    System.out.println("Writing file : " + file.toString());
    writer.updateDocument(new Term("path", file.toString()), doc);
  }
}
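The effect of updateDocument() is easier to see on a tiny in-memory index. The following is only a sketch, not part of the original example: the class name UpdateDocumentSketch, the helper newDoc() and the file name "data1.txt" are made up for illustration, and ByteBuffersDirectory is used instead of FSDirectory so that nothing is written to disk. It indexes the same path twice and counts the live documents.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class UpdateDocumentSketch {

  //Hypothetical helper: builds a document shaped like the ones in the indexing example
  static Document newDoc(String path, String contents) {
    Document doc = new Document();
    doc.add(new StringField("path", path, Field.Store.YES));
    doc.add(new TextField("contents", contents, Field.Store.YES));
    return doc;
  }

  //Indexes the same path twice and returns the number of live documents
  static int indexSamePathTwice() throws IOException {
    try (Directory dir = new ByteBuffersDirectory();
         IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      Term key = new Term("path", "data1.txt");
      writer.updateDocument(key, newDoc("data1.txt", "first version"));
      //Second call deletes the document matching the term, then adds the new one
      writer.updateDocument(key, newDoc("data1.txt", "second version"));
      writer.commit();

      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        return reader.numDocs();
      }
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("Live documents : " + indexSamePathTwice());
  }
}
```

Because the "path" term identifies the document, running the indexer again over the same folder should replace the existing entries rather than duplicate them.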

3. Searching and Highlighting the Search Terms

In this section, we will search the index created in the previous step and then highlight the searched terms in the results returned by the Lucene searcher.

The document field that is searched and highlighted must be stored in the index (Field.Store.YES), because the highlighter marks up the original text of that field. Everything else is optional.
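To see why the field has to be stored, here is a minimal sketch under some assumptions: it uses an in-memory ByteBuffersDirectory, and the class name StoredFieldSketch and the extra "summary" field are invented for this illustration. Both fields are indexed and searchable, but only the stored one can be read back, and the highlighter needs that text.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class StoredFieldSketch {

  //Returns {value of stored field, value of unstored field} for the first hit
  static String[] readBack() throws Exception {
    try (Directory dir = new ByteBuffersDirectory()) {
      try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
        Document doc = new Document();
        doc.add(new TextField("contents", "a quiet cottage", Field.Store.YES));
        //Hypothetical second field: indexed, but its text is not kept
        doc.add(new TextField("summary", "a quiet cottage", Field.Store.NO));
        writer.addDocument(doc);
      }

      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        //The unstored field is still searchable
        TopDocs hits = searcher.search(new TermQuery(new Term("summary", "cottage")), 1);
        //storedFields() is available in recent Lucene 9.x releases
        Document found = searcher.storedFields().document(hits.scoreDocs[0].doc);
        return new String[] { found.get("contents"), found.get("summary") };
      }
    }
  }

  public static void main(String[] args) throws Exception {
    String[] values = readBack();
    System.out.println("contents : " + values[0]);
    System.out.println("summary  : " + values[1]);
  }
}
```

The unstored field comes back as null, so there would be no text to pass to the highlighter for it.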

3.1. Lucene Search Highlight Steps

In short, this is what we need to do to highlight searched terms in text:

Search the index with a Query.
Retrieve the stored document text using the document id from the previous step.
Create a TokenStream for the field from the document id and the document text.
Pass the token stream and the text to the highlighter to get an array of text fragments.
Iterate over the array and display the fragments. They contain the highlighted search terms.

3.2. Java Program to Search and Highlight Lucene Matches

Let’s write the code for the steps discussed above.

import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
import org.apache.lucene.search.highlight.TokenSources;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneSearchHighlighterExample {
  //This contains the lucene indexed documents
  private static final String INDEX_DIR = "c:/temp/lucene/indexedFiles";
  private static String searchQuery = "cottage private discovery concluded";

  public static void main(String[] args) throws Exception {
    //Get directory reference
    Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));

    //Index reader - an interface for accessing a point-in-time view of a lucene index
    IndexReader reader = DirectoryReader.open(dir);

    //Create lucene searcher. It searches over a single IndexReader.
    IndexSearcher searcher = new IndexSearcher(reader);

    //Same analyzer as used for indexing (the no-arg constructor uses an empty stop word set)
    Analyzer analyzer = new StandardAnalyzer();

    //Query parser; terms without an explicit field are searched in "contents"
    QueryParser qp = new QueryParser("contents", analyzer);

    //Create the query
    Query query = qp.parse(searchQuery);

    //Search the lucene documents
    TopDocs hits = searcher.search(query, 10);

    /** Highlighter Code Start ****/

    //Uses HTML <B></B> tag to highlight the searched terms
    Formatter formatter = new SimpleHTMLFormatter();

    //It scores text fragments by the number of unique query terms found
    //Basically the matching score in layman terms
    QueryScorer scorer = new QueryScorer(query);

    //used to markup highlighted terms found in the best sections of a text
    Highlighter highlighter = new Highlighter(formatter, scorer);

    //It breaks text up into fragments of roughly the given size in characters (10 here)
    //but does not split up matched spans
    Fragmenter fragmenter = new SimpleSpanFragmenter(scorer, 10);

    //breaks text up into same-size fragments with no concerns over spotting sentence boundaries.
    //Fragmenter fragmenter = new SimpleFragmenter(10);

    //set fragmenter to highlighter
    highlighter.setTextFragmenter(fragmenter);

    //Iterate over found results
    for (int i = 0; i < hits.scoreDocs.length; i++) {
      int docid = hits.scoreDocs[i].doc;
      Document doc = searcher.doc(docid);
      String title = doc.get("path");

      //Printing - to which document result belongs
      System.out.println("Path " + " : " + title);

      //Get stored text from found document
      String text = doc.get("contents");

      //Create token stream
      TokenStream stream = TokenSources.getAnyTokenStream(reader, docid, "contents", analyzer);

      //Get highlighted text fragments
      String[] frags = highlighter.getBestFragments(stream, text, 10);
      for (String frag : frags) {
        System.out.println("=======================");
        System.out.println(frag);
      }
    }
    //Release the reader before closing the directory
    reader.close();
    dir.close();
  }
}
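The markup and the fragment size are both adjustable. The sketch below is an illustration under assumptions rather than a replacement for the program above: the class name CustomTagHighlightSketch and the &lt;mark&gt; tags are arbitrary choices, and it highlights a plain string with Highlighter.getBestFragments(Analyzer, String, String, int), so no index or TokenSources call is involved.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;

public class CustomTagHighlightSketch {

  //Highlights one term in the given text using custom tags and fragment size
  static String[] highlight(String text, String term, int fragmentSize) throws Exception {
    try (Analyzer analyzer = new StandardAnalyzer()) {
      Query query = new TermQuery(new Term("contents", term));
      QueryScorer scorer = new QueryScorer(query, "contents");

      //Custom pre and post tags instead of the default <B></B>
      Highlighter highlighter =
          new Highlighter(new SimpleHTMLFormatter("<mark>", "</mark>"), scorer);

      //fragmentSize is the approximate fragment length in characters
      highlighter.setTextFragmenter(new SimpleSpanFragmenter(scorer, fragmentSize));

      //The highlighter analyzes the text itself; at most 3 fragments are returned
      return highlighter.getBestFragments(analyzer, "contents", text, 3);
    }
  }

  public static void main(String[] args) throws Exception {
    String text = "Society excited by cottage private an it esteems.";
    for (String frag : highlight(text, "cottage", 100)) {
      System.out.println(frag);
    }
  }
}
```

Passing a larger or smaller fragmentSize is one way to control how much surrounding context is shown around each match.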

4. Demo

As a prerequisite, create the Lucene index for some text files as shown in the Lucene text files example.

Let’s say I want to search for documents containing any of the words “cottage private discovery concluded”. Run the class through its main() method and verify the output:

Path  : c:\temp\lucene\inputFiles\data3.txt
=======================
Or neglected agreeable of <B>discovery</B> <B>concluded</B> oh it sportsman. Week to time in john. Son elegance

Path  : c:\temp\lucene\inputFiles\data1.txt
=======================
Society excited by <B>cottage</B> <B>private</B> an it esteems. Fully begin on by wound an. Girl rich in do up or

Path  : c:\temp\lucene\inputFiles\data2.txt
=======================
 to cordial <B>cottage</B>.

Try searching for other terms and verify the results yourself.
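Note that each result above contains only some of the four query terms. That is consistent with the classic QueryParser joining terms with OR unless told otherwise. The following sketch (the class name DefaultOperatorSketch is made up for this illustration) prints the parsed form of a two-word query under both operators, which is a quick way to check what a search string really means before running it.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;

public class DefaultOperatorSketch {

  //Parses the text against the "contents" field and returns the query's string form
  static String parse(String queryText, QueryParser.Operator operator) throws ParseException {
    QueryParser qp = new QueryParser("contents", new StandardAnalyzer());
    qp.setDefaultOperator(operator);
    return qp.parse(queryText).toString();
  }

  public static void main(String[] args) throws ParseException {
    //OR: a document matches if it contains either term
    System.out.println(parse("cottage private", QueryParser.Operator.OR));
    //AND: a document must contain both terms ("+" marks a required clause)
    System.out.println(parse("cottage private", QueryParser.Operator.AND));
  }
}
```

Calling qp.setDefaultOperator(QueryParser.Operator.AND) in the search program would restrict the results to documents that contain every term.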

Happy Learning !!


I am getting the following error while building the LuceneSearchHighlighterExample main. Any idea?

 
Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: 
     no segments* file found in MMapDirectory@PATH
lockFactory=org.apache.lucene.store.NativeFSLockFactory@5ce65a89: files: []
	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:687)
	at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:77)
	at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:63)
	at com.lucene.highlight.LuceneSearchHighlighterExample.main(LuceneSearchHighlighterExample.java:36)

Prashant Mahajan

August 7, 2018 at 7:42 pm

Currently using Lucene 7.4.0 and I have a regex to find the emails which are present in log files. How can I list the content matching the regex pattern?
If I try the above code I get the error: Cannot instantiate the type Highlighter
shekar

February 6, 2018 at 12:39 pm

What version of the JDK is needed? It throws an unsupported error for 1.8.
Andrea

October 19, 2017 at 7:31 pm

Hi Lokesh,
very nice tutorial!

I have a question. I don’t understand if, and possibly how, I can choose the dimension of the context of my query. For example, how can I print the first 10 text fragments with the occurrences of “highlighted_query_word” having x words before “highlighted_query_word” and x words after “highlighted_query_word” (i.e., concordances) ?

Andrea
WonBin Ahn

August 7, 2017 at 12:40 pm

I have a question about running your example

Build : Window 7, Eclipse Neon, JDK 1.8
Class : LuceneSearchHighlighterExample
Error Line : 88 Line
Console :
Path : d:\temp\inputFiles\data3.txt
Exception in thread “main” java.lang.NoClassDefFoundError: org/apache/lucene/index/memory/MemoryIndex
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getLeafContext(WeightedSpanTermExtractor.java:399)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedTerms(WeightedSpanTermExtractor.java:363)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:142)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:113)
at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:522)
at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:218)
at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:186)
at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:195)
at org.apache.lucene.search.highlight.Highlighter.getBestFragments(Highlighter.java:155)
at tutorial.LuceneSearchHighlighterExample.main(LuceneSearchHighlighterExample.java:97)
Caused by: java.lang.ClassNotFoundException: org.apache.lucene.index.memory.MemoryIndex
at java.net.URLClassLoader.findClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
at java.lang.ClassLoader.loadClass(Unknown Source)
… 10 more
- Lokesh Gupta
  
  August 7, 2017 at 1:02 pm
  
  Make sure you have lucene-core jar file in your classpath.
- WonBin Ahn
  
  August 7, 2017 at 1:25 pm
  
  I solved the problem.
  
  java project –> maven project changed.
  - Lokesh Gupta
    
    August 7, 2017 at 2:14 pm
    
    Great !!
