In this Lucene 6 example, we will learn to search indexed documents and highlight the searched terms in the search results using SimpleHTMLFormatter and SimpleSpanFragmenter.
Table of Contents
- Project Structure
- Index Text Files Content
- Search and Highlight searched terms
- Demo
- Sourcecode
Project Structure
I am creating a Maven project to run this example, and I have added these Lucene dependencies.
<properties>
    <lucene.version>6.6.0</lucene.version>
</properties>

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>${lucene.version}</version>
</dependency>

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-analyzers-common</artifactId>
    <version>${lucene.version}</version>
</dependency>

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>${lucene.version}</version>
</dependency>

<!-- To include highlight support -->
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>${lucene.version}</version>
</dependency>
The project structure now looks like this:

Please note that we will be using these two folders inside the project:
- inputFiles : will contain all the text files that we want to index.
- indexedFiles : will contain the Lucene indexed documents. We will search the index inside it.
Index Text Files Content
I am iterating over all the files in the inputFiles folder and then indexing them. I am creating 3 fields:
- path : File path [Field.Store.YES]
- modified : File last modified timestamp (indexed as a LongPoint, not stored)
- contents : File content [Field.Store.YES]
The Field.Store.YES value causes Lucene to store the original field value in the index.
LuceneWriteIndexFromFileExample.java
package com.howtodoinjava.demo.lucene.file;

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneWriteIndexFromFileExample
{
    public static void main(String[] args)
    {
        //Input folder
        String docsPath = "inputFiles";

        //Output folder
        String indexPath = "indexedFiles";

        //Input Path Variable
        final Path docDir = Paths.get(docsPath);

        //analyzer with the default stop words
        Analyzer analyzer = new StandardAnalyzer();

        //IndexWriter Configuration
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);

        //try-with-resources closes the writer and the directory even if indexing fails
        try (Directory dir = FSDirectory.open(Paths.get(indexPath));
             //IndexWriter writes new index files to the directory
             IndexWriter writer = new IndexWriter(dir, iwc))
        {
            //Recursive method to iterate over all files and directories
            indexDocs(writer, docDir);
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    static void indexDocs(final IndexWriter writer, Path path) throws IOException
    {
        //Directory?
        if (Files.isDirectory(path))
        {
            //Iterate directory
            Files.walkFileTree(path, new SimpleFileVisitor<Path>()
            {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException
                {
                    try
                    {
                        //Index this file
                        indexDoc(writer, file, attrs.lastModifiedTime().toMillis());
                    }
                    catch (IOException ioe)
                    {
                        ioe.printStackTrace();
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        }
        else
        {
            //Index this file
            indexDoc(writer, path, Files.getLastModifiedTime(path).toMillis());
        }
    }

    static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException
    {
        //Create lucene Document
        Document doc = new Document();

        doc.add(new StringField("path", file.toString(), Field.Store.YES));
        doc.add(new LongPoint("modified", lastModified));
        doc.add(new TextField("contents", new String(Files.readAllBytes(file)), Field.Store.YES));

        //Updates a document by first deleting the document(s) containing the term
        //and then adding the new document. The delete and then add are atomic
        //as seen by a reader on the same index.
        writer.updateDocument(new Term("path", file.toString()), doc);
    }
}
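To see which values end up in the three fields before involving Lucene at all, the traversal can be tried on its own. The following is a minimal sketch using only the JDK; the class name FileFieldsSketch, the holder class FileFields and the method collect are hypothetical names introduced for illustration, and the sketch assumes the input files are UTF-8 text.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the original example: it only gathers the
// values that the indexing code would put into the path, modified and contents fields.
final class FileFieldsSketch {

    // Simple holder for the three values of one file.
    static final class FileFields {
        final String path;
        final long modified;
        final String contents;

        FileFields(String path, long modified, String contents) {
            this.path = path;
            this.modified = modified;
            this.contents = contents;
        }
    }

    // Walks the folder recursively and reads every regular file (assumed UTF-8).
    static List<FileFields> collect(Path root) throws IOException {
        List<Path> files;
        try (Stream<Path> walk = Files.walk(root)) {
            files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
        }
        List<FileFields> result = new ArrayList<>();
        for (Path file : files) {
            result.add(new FileFields(
                    file.toString(),
                    Files.getLastModifiedTime(file).toMillis(),
                    new String(Files.readAllBytes(file), StandardCharsets.UTF_8)));
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Prints one summary line per file found under inputFiles.
        for (FileFields f : collect(Paths.get("inputFiles"))) {
            System.out.println(f.path + " | " + f.modified + " | " + f.contents.length() + " chars");
        }
    }
}
```

Running main() from the project root lists every file under inputFiles together with its timestamp and content length, which is a quick way to confirm what will be indexed.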
Search and Highlight searched terms
In this section, we will search the index created in the previous step and then highlight the searched terms in the results returned by the Lucene searcher.
Lucene Search Highlight Steps
In short, this is what we need to do to highlight searched terms in text:
- Search the index with a Query.
- Retrieve the document text using the document id from the above step.
- Create a TokenStream from the document id and the document text for the field.
- Use the token stream and the highlighter to get an array of text fragments.
- Iterate over the array and display it. It contains the highlighted search terms.
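To make the effect of the formatter step concrete, here is a plain-Java sketch that wraps matching words in B tags, the same markup that SimpleHTMLFormatter uses by default. It is a conceptual illustration only and not how Lucene's Highlighter works internally: there is no analyzer, no scoring and no fragmenting, and the class name HighlightSketch and the method highlight are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Conceptual stand-in for the formatter step; this is NOT Lucene code.
final class HighlightSketch {

    private static final Pattern WORD = Pattern.compile("[A-Za-z0-9]+");

    // Wraps every whole word whose lower-cased form is in terms with <B></B>.
    // The terms are expected to be lower case.
    static String highlight(String text, Set<String> terms) {
        Matcher m = WORD.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the separators between the previous word and this one unchanged.
            out.append(text, last, m.start());
            String token = m.group();
            if (terms.contains(token.toLowerCase(Locale.ROOT))) {
                out.append("<B>").append(token).append("</B>");
            } else {
                out.append(token);
            }
            last = m.end();
        }
        // Copy whatever follows the last word (for example a trailing period).
        out.append(text.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> terms = new LinkedHashSet<>(Arrays.asList("cottage", "private"));
        System.out.println(highlight("Society excited by cottage private an it esteems.", terms));
    }
}
```

Running main() prints "Society excited by <B>cottage</B> <B>private</B> an it esteems.", which shows the kind of markup to expect around matched terms.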
LuceneSearchHighlighterExample.java
package com.howtodoinjava.demo.lucene.highlight;

import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.highlight.Formatter;
import org.apache.lucene.search.highlight.Fragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;
import org.apache.lucene.search.highlight.SimpleSpanFragmenter;
import org.apache.lucene.search.highlight.TokenSources;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneSearchHighlighterExample
{
    //This contains the lucene indexed documents
    private static final String INDEX_DIR = "indexedFiles";

    public static void main(String[] args) throws Exception
    {
        //Get directory reference
        Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));

        //Index reader - an interface for accessing a point-in-time view of a lucene index
        IndexReader reader = DirectoryReader.open(dir);

        //Create lucene searcher. It searches over a single IndexReader.
        IndexSearcher searcher = new IndexSearcher(reader);

        //analyzer with the default stop words
        Analyzer analyzer = new StandardAnalyzer();

        //Query parser to be used for creating TermQuery
        QueryParser qp = new QueryParser("contents", analyzer);

        //Create the query
        Query query = qp.parse("cottage private discovery concluded");

        //Search the lucene documents
        TopDocs hits = searcher.search(query, 10);

        /** Highlighter Code Start ****/

        //Uses HTML <B></B> tag to highlight the searched terms
        Formatter formatter = new SimpleHTMLFormatter();

        //It scores text fragments by the number of unique query terms found
        //Basically the matching score in layman terms
        QueryScorer scorer = new QueryScorer(query);

        //used to markup highlighted terms found in the best sections of a text
        Highlighter highlighter = new Highlighter(formatter, scorer);

        //It breaks text up into same-size texts but does not split up spans
        Fragmenter fragmenter = new SimpleSpanFragmenter(scorer, 10);

        //breaks text up into same-size fragments with no concerns over spotting sentence boundaries.
        //Fragmenter fragmenter = new SimpleFragmenter(10);

        //set fragmenter to highlighter
        highlighter.setTextFragmenter(fragmenter);

        //Iterate over found results
        for (int i = 0; i < hits.scoreDocs.length; i++)
        {
            int docid = hits.scoreDocs[i].doc;
            Document doc = searcher.doc(docid);
            String title = doc.get("path");

            //Printing - to which document result belongs
            System.out.println("Path " + " : " + title);

            //Get stored text from found document
            String text = doc.get("contents");

            //Create token stream
            TokenStream stream = TokenSources.getAnyTokenStream(reader, docid, "contents", analyzer);

            //Get highlighted text fragments
            String[] frags = highlighter.getBestFragments(stream, text, 10);
            for (String frag : frags)
            {
                System.out.println("=======================");
                System.out.println(frag);
            }
        }

        //Release the reader before closing the directory
        reader.close();
        dir.close();
    }
}
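The idea of fragmenting, that is, cutting the stored text into short pieces and showing only the pieces that contain a query term, can be sketched in plain Java. This is a loose approximation and an assumption-laden illustration: the Lucene fragmenters work on analyzer token offsets and the highlighter also scores fragments, whereas this sketch splits on whitespace and keeps any piece with a match. The names FragmentSketch, fragments and matching are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Conceptual stand-in for fragmenting; this is NOT how Lucene implements it.
final class FragmentSketch {

    // Groups whitespace-separated words into pieces of at most maxChars characters
    // where possible, never splitting a word (a longer word becomes its own piece).
    static List<String> fragments(String text, int maxChars) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String word : text.trim().split("\\s+")) {
            if (current.length() > 0 && current.length() + 1 + word.length() > maxChars) {
                result.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(word);
        }
        if (current.length() > 0) {
            result.add(current.toString());
        }
        return result;
    }

    // Keeps only the pieces that contain at least one of the lower-case terms.
    static List<String> matching(List<String> pieces, Set<String> terms) {
        List<String> hits = new ArrayList<>();
        for (String piece : pieces) {
            for (String word : piece.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
                if (terms.contains(word)) {
                    hits.add(piece);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Set<String> terms = new HashSet<>(Arrays.asList("cottage"));
        List<String> pieces = fragments("Perhaps fertile brandon do imagine to cordial cottage.", 20);
        System.out.println(matching(pieces, terms));
    }
}
```

A larger maxChars value gives longer, more readable pieces at the cost of showing more text that does not match, which is the same trade-off as the fragment size passed to the fragmenter in the listing above.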
Demo
- Let’s create 3 files in the inputFiles folder with the following content.
data1.txt
Society excited by cottage private an it esteems. Fully begin on by wound an. Girl rich in do up or both. At declared in as rejoiced of together. He impression collecting delightful unpleasant by prosperous as on. End too talent she object mrs wanted remove giving.
data2.txt
Questions explained agreeable preferred strangers too him her son. Set put shyness offices his females him distant. Improve has message besides shy himself cheered however how son. Quick judge other leave ask first chief her. Indeed or remark always silent seemed narrow be. Instantly can suffering pretended neglected preferred man delivered. Perhaps fertile brandon do imagine to cordial cottage.
data3.txt
Or neglected agreeable of discovery concluded oh it sportsman. Week to time in john. Son elegance use weddings separate. Ask too matter formed county wicket oppose talent. He immediate sometimes or to dependent in. Everything few frequently discretion surrounded did simplicity decisively. Less he year do with no sure loud.
- Execute LuceneWriteIndexFromFileExample.java using its main() method. Verify that the Lucene index is created in the indexedFiles folder.
- Let’s say I want to search for documents containing the words “cottage private discovery concluded”. Change the search term passed to qp.parse() in class LuceneSearchHighlighterExample.java. Execute the class using its main() method. Verify the output:
Path : inputFiles\data3.txt
=======================
Or neglected agreeable of discovery concluded oh it sportsman. Week to time in john. Son elegance
Path : inputFiles\data1.txt
=======================
Society excited by cottage private an it esteems. Fully begin on by wound an. Girl rich in do up
Path : inputFiles\data2.txt
=======================
to cordial cottage.
- Search for more terms and verify the results yourself.
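One way to try different terms without editing the source each time is to read them from the command line. The following is a small sketch of that idea; the class QueryArgsSketch and the method queryText are hypothetical, and the returned string is meant to be handed to qp.parse() in place of the hard-coded text.

```java
// Hypothetical helper, not part of the original example.
final class QueryArgsSketch {

    // Falls back to the terms used in this demo when no arguments are given.
    static final String DEFAULT_QUERY = "cottage private discovery concluded";

    // Joins the command-line arguments into a single query string.
    static String queryText(String[] args) {
        String joined = String.join(" ", args).trim();
        return joined.isEmpty() ? DEFAULT_QUERY : joined;
    }

    public static void main(String[] args) {
        System.out.println("Query text: " + queryText(args));
    }
}
```

Because the classic query parser treats some characters as syntax, terms containing such characters may need escaping before they are parsed.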
Sourcecode
Download the source code using the link given below.
Happy Learning !!