Lucene Search Highlight Example

In this Lucene 6 example, we will learn to search indexed documents and highlight the searched terms in the search results using SimpleHTMLFormatter and SimpleSpanFragmenter.

Table of Contents

Project Structure
Index Text Files Content
Search and Highlight searched terms
Demo
Sourcecode

Project Structure

I am creating a Maven project to run this example, with the following Lucene dependencies.

<properties>
	<lucene.version>6.6.0</lucene.version>
</properties>

<dependency>
	<groupId>org.apache.lucene</groupId>
	<artifactId>lucene-core</artifactId>
	<version>${lucene.version}</version>
</dependency>
<dependency>
	<groupId>org.apache.lucene</groupId>
	<artifactId>lucene-analyzers-common</artifactId>
	<version>${lucene.version}</version>
</dependency>
<dependency>
	<groupId>org.apache.lucene</groupId>
	<artifactId>lucene-queryparser</artifactId>
	<version>${lucene.version}</version>
</dependency>

<!-- To include highlight support-->
<dependency>
	<groupId>org.apache.lucene</groupId>
	<artifactId>lucene-highlighter</artifactId>
	<version>${lucene.version}</version>
</dependency>

The project structure now looks like this:

Lucene Index File – Project Structure

Please note that we will be using these two folders inside the project:

  • inputFiles – will contain all text files which we want to index.
  • indexedFiles – will contain lucene indexed documents. We will search the index inside it.

Index Text Files Content

I am iterating over all files in the inputFiles folder and indexing them. For each file, I am creating 3 fields:

  1. path : File path [Field.Store.YES]
  2. modified : File last modified timestamp
  3. contents : File content [Field.Store.YES]
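If a field is indexed but not stored, you can search on it, but its original value will not be returned with the search results. A YES value causes Lucene to store the original field value in the index so that it can be read back later. The modified field is a LongPoint, which is indexed for exact and range queries but is not stored.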
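
To make the difference between indexed and stored concrete, here is a minimal plain-Java sketch. It is a toy model and not Lucene code: the class ToyIndex and its methods are hypothetical names invented for this illustration. It keeps a searchable term map for every field, and keeps the original value only when the store flag is set.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only (hypothetical class, NOT the Lucene API): shows why a field
// can be searchable ("indexed") without being retrievable ("stored").
class ToyIndex {
    // field -> term -> ids of the documents containing that term
    private final Map<String, Map<String, Set<Integer>>> postings = new HashMap<>();
    // one map per document: field -> original value, filled only for stored fields
    private final List<Map<String, String>> storedFields = new ArrayList<>();

    int newDocument() {
        storedFields.add(new HashMap<>());
        return storedFields.size() - 1;
    }

    void addField(int docId, String field, String value, boolean store) {
        // Every field is indexed: split into lower-case terms and record the doc id
        for (String term : value.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) continue;
            postings.computeIfAbsent(field, f -> new HashMap<>())
                    .computeIfAbsent(term, t -> new TreeSet<>())
                    .add(docId);
        }
        // Only stored fields keep their original text
        if (store) storedFields.get(docId).put(field, value);
    }

    Set<Integer> search(String field, String term) {
        Map<String, Set<Integer>> terms = postings.getOrDefault(field, Collections.emptyMap());
        return terms.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    // Returns null when the field was indexed but not stored
    String get(int docId, String field) {
        return storedFields.get(docId).get(field);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        int doc = index.newDocument();
        index.addField(doc, "contents", "cordial cottage", true);  // comparable to Store.YES
        index.addField(doc, "notes", "private remark", false);     // comparable to Store.NO
        System.out.println(index.search("notes", "private"));  // the document is found
        System.out.println(index.get(doc, "notes"));            // but there is no text to return
        System.out.println(index.get(doc, "contents"));         // stored text is available
    }
}
```

In this sketch the notes field can be searched, but there is no text left to display or highlight for it. That is the situation a highlighter would face with a field that is not stored.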

LuceneWriteIndexFromFileExample.java

package com.howtodoinjava.demo.lucene.file;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneWriteIndexFromFileExample 
{
	public static void main(String[] args)
	{
		//Input folder
		String docsPath = "inputFiles";
		
		//Output folder
		String indexPath = "indexedFiles";

		//Input Path Variable
		final Path docDir = Paths.get(docsPath);

		//try-with-resources closes the Directory and the IndexWriter even if indexing fails
		try (Directory dir = FSDirectory.open( Paths.get(indexPath) )) 
		{
			//analyzer with the default stop words
			Analyzer analyzer = new StandardAnalyzer();
			
			//IndexWriter Configuration
			IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
			iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
			
			//IndexWriter writes new index files to the directory
			try (IndexWriter writer = new IndexWriter(dir, iwc)) 
			{
				//Recursively indexes all files and directories under docDir
				indexDocs(writer, docDir);
			}
		} 
		catch (IOException e) 
		{
			e.printStackTrace();
		}
	}
	
	static void indexDocs(final IndexWriter writer, Path path) throws IOException 
	{
		//Directory?
		if (Files.isDirectory(path)) 
		{
			//Iterate directory
			Files.walkFileTree(path, new SimpleFileVisitor<Path>() 
			{
				@Override
				public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException 
				{
					try 
					{
						//Index this file
						indexDoc(writer, file, attrs.lastModifiedTime().toMillis());
					} 
					catch (IOException ioe) 
					{
						ioe.printStackTrace();
					}
					return FileVisitResult.CONTINUE;
				}
			});
		} 
		else 
		{
			//Index this file
			indexDoc(writer, path, Files.getLastModifiedTime(path).toMillis());
		}
	}

	static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException 
	{
		//Read the whole file as text (assumption: the input files are UTF-8 encoded)
		String contents = new String(Files.readAllBytes(file), java.nio.charset.StandardCharsets.UTF_8);
		
		//Create lucene Document
		Document doc = new Document();
		
		doc.add(new StringField("path", file.toString(), Field.Store.YES));
		doc.add(new LongPoint("modified", lastModified));
		doc.add(new TextField("contents", contents, Store.YES));
		
		//Updates a document by first deleting the document(s) 
		//containing the given term and then adding the new
		//document. The delete and then add are atomic as seen
		//by a reader on the same index
		writer.updateDocument(new Term("path", file.toString()), doc);
	}
}

Search and Highlight searched terms

In this section, we will search the index created in the previous step and then highlight the searched terms in the results returned by the Lucene searcher.

The field in which the search terms are to be highlighted must be stored, because this example reads the original text back from the index before highlighting it. Everything else is optional.

Lucene Search Highlight Steps

In short, this is what we need to do to highlight the searched terms in the text:

  1. Search the index with a Query.
  2. Retrieve the stored document text using the document id returned by the search.
  3. Create a TokenStream for the field from the document id and the document text.
  4. Pass the token stream and the text to the highlighter to get an array of text fragments.
  5. Iterate over the array and print it. The fragments contain the highlighted search terms.
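
Before looking at the Lucene code, the idea behind steps 3 to 5 can be sketched in plain Java. This is only a simplified illustration under my own assumptions: ToyHighlighter is a hypothetical class, it uses a regular expression instead of a Lucene analyzer, and it does not reproduce what the Lucene Highlighter does internally. It walks over the tokens of a text and wraps every token that matches a query term in the same B tags that SimpleHTMLFormatter uses by default.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only (hypothetical class): mimics "token stream + formatter",
// it does not call the Lucene Highlighter.
class ToyHighlighter {
    // A "token" here is simply a run of word characters
    private static final Pattern WORD = Pattern.compile("\\w+");

    static String highlight(String text, String query) {
        // Query terms, lower-cased so that matching is case-insensitive
        Set<String> terms = new HashSet<>(
                Arrays.asList(query.toLowerCase(Locale.ROOT).trim().split("\\s+")));

        StringBuilder out = new StringBuilder();
        Matcher tokens = WORD.matcher(text);
        int last = 0;
        while (tokens.find()) {
            // Copy the text between two tokens unchanged
            out.append(text, last, tokens.start());
            String token = tokens.group();
            if (terms.contains(token.toLowerCase(Locale.ROOT))) {
                out.append("<B>").append(token).append("</B>");
            } else {
                out.append(token);
            }
            last = tokens.end();
        }
        // Copy whatever follows the last token
        out.append(text, last, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight(
                "Perhaps fertile brandon do imagine to cordial cottage.",
                "cottage private discovery concluded"));
        // prints: Perhaps fertile brandon do imagine to cordial <B>cottage</B>.
    }
}
```

The sketch needs the original text as input, which is the same reason the highlighted field has to be available as stored text in the Lucene example.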

LuceneSearchHighlighterExample.java

package com.howtodoinjava.demo.lucene.highlight;

import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
import org.apache.lucene.search.highlight.TokenSources;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneSearchHighlighterExample 
{
	//This contains the lucene indexed documents
	private static final String INDEX_DIR = "indexedFiles";

	public static void main(String[] args) throws Exception 
	{
		//Get directory reference
		Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
		
		//Index reader - an interface for accessing a point-in-time view of a lucene index
		IndexReader reader = DirectoryReader.open(dir);
		
		//Create lucene searcher. It searches over a single IndexReader.
		IndexSearcher searcher = new IndexSearcher(reader);
		
		//analyzer with the default stop words
		Analyzer analyzer = new StandardAnalyzer();
		
		//Query parser that builds the query against the "contents" field
		QueryParser qp = new QueryParser("contents", analyzer);
		
		//Create the query
		Query query = qp.parse("cottage private discovery concluded");
		
		//Search the lucene documents
		TopDocs hits = searcher.search(query, 10);
		
		/** Highlighter Code Start ****/
		
		//Uses HTML <B></B> tag to highlight the searched terms
		Formatter formatter = new SimpleHTMLFormatter();
		
		//It scores text fragments by the number of unique query terms found
		//Basically the matching score in layman terms
        QueryScorer scorer = new QueryScorer(query);
        
        //used to markup highlighted terms found in the best sections of a text
        Highlighter highlighter = new Highlighter(formatter, scorer);
        
        //It breaks text up into same-size texts but does not split up spans
        Fragmenter fragmenter = new SimpleSpanFragmenter(scorer, 10);
        
        //breaks text up into same-size fragments with no concerns over spotting sentence boundaries.
        //Fragmenter fragmenter = new SimpleFragmenter(10);
        
        //set fragmenter to highlighter
        highlighter.setTextFragmenter(fragmenter);
		
        //Iterate over found results
        for (int i = 0; i < hits.scoreDocs.length; i++) 
        {
            int docid = hits.scoreDocs[i].doc;
            Document doc = searcher.doc(docid);
            String title = doc.get("path");
            
            //Printing - to which document result belongs
            System.out.println("Path " + " : " + title);
            
            //Get stored text from found document
            String text = doc.get("contents");

            //Create token stream
			TokenStream stream = TokenSources.getAnyTokenStream(reader, docid, "contents", analyzer);
			
			//Get highlighted text fragments
            String[] frags = highlighter.getBestFragments(stream, text, 10);
            for (String frag : frags) 
            {
            	System.out.println("=======================");
                System.out.println(frag);
            }
        }
        //Release the index reader and the directory
        reader.close();
        dir.close();
	}
}
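
The comments in the listing above say that a fragmenter breaks the text into roughly same-size pieces and that the scorer ranks fragments by the number of unique query terms found. The following plain-Java sketch shows that idea in isolation. ToyFragmenter and its methods are hypothetical names, the size is counted in characters, and the logic is a simplification: it does not keep spans together the way SimpleSpanFragmenter does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy illustration only (hypothetical class): fragment a text and pick the
// fragment that contains the most distinct query terms.
class ToyFragmenter {

    // Greedily packs whole words into fragments of at most maxChars characters
    // (a single word longer than maxChars still becomes its own fragment).
    static List<String> fragments(String text, int maxChars) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (current.length() > 0 && current.length() + 1 + word.length() > maxChars) {
                result.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) current.append(' ');
            current.append(word);
        }
        if (current.length() > 0) result.add(current.toString());
        return result;
    }

    // Score = number of distinct query terms that occur in the fragment
    static int score(String fragment, Set<String> terms) {
        Set<String> seen = new HashSet<>();
        for (String word : fragment.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (terms.contains(word)) seen.add(word);
        }
        return seen.size();
    }

    // Returns the first fragment with the highest score
    static String best(String text, int maxChars, String query) {
        Set<String> terms = new HashSet<>(
                Arrays.asList(query.toLowerCase(Locale.ROOT).trim().split("\\s+")));
        String best = "";
        int bestScore = -1;
        for (String fragment : fragments(text, maxChars)) {
            int s = score(fragment, terms);
            if (s > bestScore) {
                bestScore = s;
                best = fragment;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String text = "Society excited by cottage private an it esteems. Fully begin on by wound an.";
        System.out.println(fragments(text, 40));
        System.out.println(best(text, 40, "cottage private discovery concluded"));
    }
}
```

A smaller fragment size gives shorter snippets around each hit, while a larger one gives more surrounding context.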

Demo

  1. Let’s create 3 files in the inputFiles folder with the following content.

    data1.txt

    Society excited by cottage private an it esteems. Fully begin on by wound an. Girl rich in do up or both. At declared in as rejoiced of together. He impression collecting delightful unpleasant by prosperous as on. End too talent she object mrs wanted remove giving.

    data2.txt

    Questions explained agreeable preferred strangers too him her son. Set put shyness offices his females him distant. Improve has message besides shy himself cheered however how son. Quick judge other leave ask first chief her. Indeed or remark always silent seemed narrow be. Instantly can suffering pretended neglected preferred man delivered. Perhaps fertile brandon do imagine to cordial cottage.

    data3.txt

    Or neglected agreeable of discovery concluded oh it sportsman. Week to time in john. Son elegance use weddings separate. Ask too matter formed county wicket oppose talent. He immediate sometimes or to dependent in. Everything few frequently discretion surrounded did simplicity decisively. Less he year do with no sure loud.
  2. Execute LuceneWriteIndexFromFileExample.java using its main() method. Verify that the Lucene index files are created in the indexedFiles folder.
  3. Let’s say I want to search for documents containing any of the words “cottage private discovery concluded”. Set this as the query string passed to qp.parse() in LuceneSearchHighlighterExample.java. Execute the class using its main() method. Verify the output:
    Path  : inputFiles\data3.txt
    =======================
    Or neglected agreeable of <B>discovery</B> <B>concluded</B> oh it sportsman. Week to time in john. Son elegance
    
    Path  : inputFiles\data1.txt
    =======================
    Society excited by <B>cottage</B> <B>private</B> an it esteems. Fully begin on by wound an. Girl rich in do up
    
    Path  : inputFiles\data2.txt
    =======================
     to cordial <B>cottage</B>.
    
  4. Search for more terms and verify the results yourself.
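
For the verification in step 2, a small check like the one below can help. It is a sketch under the assumption that a committed Lucene index contains at least one file whose name starts with "segments". The class IndexFolderCheck and its method are hypothetical helpers and not part of Lucene.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper (not part of Lucene): checks whether a folder looks like
// it holds a committed index before we try to search it.
class IndexFolderCheck {

    // Returns true if the folder exists and contains an entry named "segments..."
    static boolean looksLikeLuceneIndex(Path indexDir) throws IOException {
        if (!Files.isDirectory(indexDir)) {
            return false;
        }
        try (Stream<Path> files = Files.list(indexDir)) {
            return files.anyMatch(p -> p.getFileName().toString().startsWith("segments"));
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to the index folder used in this example
        Path indexDir = Paths.get(args.length > 0 ? args[0] : "indexedFiles");
        System.out.println(indexDir + " contains an index: " + looksLikeLuceneIndex(indexDir));
    }
}
```

If the check prints false, run LuceneWriteIndexFromFileExample first. Opening a DirectoryReader on an empty folder fails with an IndexNotFoundException that reports that no segments* file was found.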

Sourcecode

Download the source code using the link given below.

Happy Learning !!

8 thoughts on “Lucene Search Highlight Example”

  1. I am getting the following error while building LuceneSearchHighlighterExample main. Any idea?

     
    Exception in thread "main" org.apache.lucene.index.IndexNotFoundException: 
         no segments* file found in MMapDirectory@PATH
    lockFactory=org.apache.lucene.store.NativeFSLockFactory@5ce65a89: files: []
    	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:687)
    	at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:77)
    	at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:63)
    	at com.lucene.highlight.LuceneSearchHighlighterExample.main(LuceneSearchHighlighterExample.java:36)
    
  2. Currently using Lucene 7.4.0, and I have a regex to find the emails which are present in log files. How can I list the content matching the regex pattern?
    If I try the above code, I get the error: Cannot instantiate the type Highlighter

  3. Hi Lokesh,
    very nice tutorial!

    I have a question. I don’t understand if, and possibly how, I can choose the dimension of the context of my query. For example, how can I print the first 10 text fragments with the occurrences of “highlighted_query_word” having x words before “highlighted_query_word” and x words after “highlighted_query_word” (i.e., concordances) ?

    Andrea

  4. I have a question about running your example

    Build : Window 7, Eclipse Neon, JDK 1.8
    Class : LuceneSearchHighlighterExample
    Error Line : 88 Line
    Console :
    Path : d:\temp\inputFiles\data3.txt
    Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/lucene/index/memory/MemoryIndex
    at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getLeafContext(WeightedSpanTermExtractor.java:399)
    at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extractWeightedTerms(WeightedSpanTermExtractor.java:363)
    at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:142)
    at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.extract(WeightedSpanTermExtractor.java:113)
    at org.apache.lucene.search.highlight.WeightedSpanTermExtractor.getWeightedSpanTerms(WeightedSpanTermExtractor.java:522)
    at org.apache.lucene.search.highlight.QueryScorer.initExtractor(QueryScorer.java:218)
    at org.apache.lucene.search.highlight.QueryScorer.init(QueryScorer.java:186)
    at org.apache.lucene.search.highlight.Highlighter.getBestTextFragments(Highlighter.java:195)
    at org.apache.lucene.search.highlight.Highlighter.getBestFragments(Highlighter.java:155)
    at tutorial.LuceneSearchHighlighterExample.main(LuceneSearchHighlighterExample.java:97)
    Caused by: java.lang.ClassNotFoundException: org.apache.lucene.index.memory.MemoryIndex
    at java.net.URLClassLoader.findClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    at sun.misc.Launcher$AppClassLoader.loadClass(Unknown Source)
    at java.lang.ClassLoader.loadClass(Unknown Source)
    … 10 more

