Lucene – Index and Search Text Files

In this Lucene 6 example, we will learn to create an index from text files and then search for tokens within the indexed documents. To learn about installing Lucene, please refer to the Lucene index and search example.

Table of Contents

Project Structure
Index Text Files Content
Search Indexed Files
Demo
Sourcecode

Project Structure

I am creating a Maven project to run this example, with the following Lucene dependencies added to pom.xml.

<properties>
	<lucene.version>6.6.0</lucene.version>
</properties>

<dependency>
	<groupId>org.apache.lucene</groupId>
	<artifactId>lucene-core</artifactId>
	<version>${lucene.version}</version>
</dependency>
<dependency>
	<groupId>org.apache.lucene</groupId>
	<artifactId>lucene-analyzers-common</artifactId>
	<version>${lucene.version}</version>
</dependency>
<dependency>
	<groupId>org.apache.lucene</groupId>
	<artifactId>lucene-queryparser</artifactId>
	<version>${lucene.version}</version>
</dependency>

The project structure now looks like this:

Lucene Index File – Project Structure

Please note that we will be using these two folders inside the project:

  • inputFiles – will contain all the text files that we want to index.
  • indexedFiles – will contain the Lucene index. We will run our searches against the index stored here.

Index Text Files Content

I iterate over all files in the inputFiles folder and index each of them. Each document gets 3 fields:

  1. path : File path [Field.Store.YES]
  2. modified : File last-modified timestamp [indexed as a LongPoint, not stored]
  3. contents : File content [Field.Store.YES]

If a field is indexed but not stored, you can search on it, but its value won’t be returned with the search results. Field.Store.YES causes Lucene to store the original field value in the index.
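
To make the indexed-versus-stored distinction concrete, here is a toy model in plain Java. It is only a sketch and not how Lucene is implemented internally: the class StoredVsIndexedSketch and all of its methods are hypothetical names made up for this illustration. The contents of every document are always searchable, but the original text can be read back only when it was stored.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: this is NOT Lucene's implementation, just the idea behind it.
public class StoredVsIndexedSketch {

    // "Indexed" side: token -> ids of the documents that contain it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // "Stored" side: per document, field name -> original value
    private final List<Map<String, String>> storedFields = new ArrayList<>();

    // Adds a document and returns its id. The contents are always indexed,
    // but the original text is kept only when storeContents is true.
    public int addDocument(String path, String contents, boolean storeContents) {
        int docId = storedFields.size();

        Map<String, String> stored = new HashMap<>();
        stored.put("path", path);
        if (storeContents) {
            stored.put("contents", contents);
        }
        storedFields.add(stored);

        // Very rough tokenization: lowercase, split on non-word characters
        for (String token : contents.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    // Returns the ids of all documents whose contents include the term.
    public Set<Integer> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    // Returns the stored value of a field, or null if it was not stored.
    public String getStored(int docId, String field) {
        return storedFields.get(docId).get(field);
    }

    public static void main(String[] args) {
        StoredVsIndexedSketch index = new StoredVsIndexedSketch();
        index.addDocument("data2.txt", "Questions explained agreeable preferred", true);
        index.addDocument("data3.txt", "Or neglected agreeable of discovery", false);

        for (int docId : index.search("agreeable")) {
            System.out.println(index.getStored(docId, "path")
                    + " -> contents: " + index.getStored(docId, "contents"));
        }
    }
}
```

Both documents match the search, but only the first one can return its original text. In the indexer used in this example, path and contents are stored, while modified is searchable only.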

LuceneWriteIndexFromFileExample.java

package com.howtodoinjava.demo.lucene.file;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneWriteIndexFromFileExample 
{
	public static void main(String[] args)
	{
		//Input folder
		String docsPath = "inputFiles";
		
		//Output folder
		String indexPath = "indexedFiles";

		//Input Path Variable
		final Path docDir = Paths.get(docsPath);

		//analyzer with the default stop words
		Analyzer analyzer = new StandardAnalyzer();

		//IndexWriter configuration: create a new index, or append to an existing one
		IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
		iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);

		//try-with-resources closes the writer and the directory even if indexing fails
		try (Directory dir = FSDirectory.open(Paths.get(indexPath));
			 IndexWriter writer = new IndexWriter(dir, iwc))
		{
			//Walks the input path and indexes every file found under it
			indexDocs(writer, docDir);
		}
		catch (IOException e)
		{
			e.printStackTrace();
		}
	}
	
	static void indexDocs(final IndexWriter writer, Path path) throws IOException 
	{
		//Directory?
		if (Files.isDirectory(path)) 
		{
			//Iterate directory
			Files.walkFileTree(path, new SimpleFileVisitor<Path>() 
			{
				@Override
				public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException 
				{
					try 
					{
						//Index this file
						indexDoc(writer, file, attrs.lastModifiedTime().toMillis());
					} 
					catch (IOException ioe) 
					{
						ioe.printStackTrace();
					}
					return FileVisitResult.CONTINUE;
				}
			});
		} 
		else 
		{
			//Index this file
			indexDoc(writer, path, Files.getLastModifiedTime(path).toMillis());
		}
	}

	static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException 
	{
		//Read the whole file as text (assumes the input files are UTF-8 encoded)
		String contents = new String(Files.readAllBytes(file), java.nio.charset.StandardCharsets.UTF_8);

		//Create lucene Document
		Document doc = new Document();

		doc.add(new StringField("path", file.toString(), Field.Store.YES));
		doc.add(new LongPoint("modified", lastModified));
		doc.add(new TextField("contents", contents, Store.YES));

		//Updates a document by first deleting the document(s)
		//containing the given term and then adding the new
		//document. The delete and then add are atomic as seen
		//by a reader on the same index
		writer.updateDocument(new Term("path", file.toString()), doc);
	}
}
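
The indexer above indexes every file it finds under inputFiles, whatever its type. If the folder may also hold non-text files, one option is to collect only the .txt files first. The following is a minimal sketch of that idea using the same Files.walkFileTree() approach; the class TextFileCollector and the method collectTextFiles() are hypothetical names, and the ".txt" extension check is an assumption about how text files are named.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the original example.
public class TextFileCollector {

    // Returns every regular file under root whose name ends in ".txt"
    // (case-insensitive), sorted by path.
    public static List<Path> collectTextFiles(Path root) throws IOException {
        final List<Path> result = new ArrayList<>();

        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
                String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
                if (attrs.isRegularFile() && name.endsWith(".txt")) {
                    result.add(file);
                }
                return FileVisitResult.CONTINUE;
            }
        });

        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Lists the text files that would be passed on to the indexer
        for (Path file : collectTextFiles(Paths.get("inputFiles"))) {
            System.out.println(file);
        }
    }
}
```

Each returned path could then be handed to indexDoc() together with its last-modified time, instead of indexing every visited file.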

Search Indexed Files

In this section, we will search the index created in the previous step, i.e. we will find the documents that contain our search query terms.

LuceneReadIndexFromFileExample.java

package com.howtodoinjava.demo.lucene.file;

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneReadIndexFromFileExample 
{
	//Directory that contains the Lucene index
	private static final String INDEX_DIR = "indexedFiles";

	public static void main(String[] args) throws Exception 
	{
		//Create Lucene searcher. It searches over a single IndexReader.
		IndexSearcher searcher = createSearcher();
		
		//Search indexed contents using search term
		TopDocs foundDocs = searchInContent("frequently", searcher);
		
		//Total found documents
		System.out.println("Total Results :: " + foundDocs.totalHits);
		
		//Let's print out the path of files which have searched term
		for (ScoreDoc sd : foundDocs.scoreDocs) 
		{
			Document d = searcher.doc(sd.doc);
			System.out.println("Path : "+ d.get("path") + ", Score : " + sd.score);
		}
	}
	
	private static TopDocs searchInContent(String textToFind, IndexSearcher searcher) throws Exception
	{
		//Create search query
		QueryParser qp = new QueryParser("contents", new StandardAnalyzer());
		Query query = qp.parse(textToFind);
		
		//search the index
		TopDocs hits = searcher.search(query, 10);
		return hits;
	}

	private static IndexSearcher createSearcher() throws IOException 
	{
		Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));
		
		//IndexReader provides a point-in-time view of a Lucene index
		IndexReader reader = DirectoryReader.open(dir);
		
		//Index searcher
		IndexSearcher searcher = new IndexSearcher(reader);
		return searcher;
	}
}

Demo

  1. Let’s create 3 files in the inputFiles folder with the following content.

    data1.txt

    Society excited by cottage private an it esteems. Fully begin on by wound an. Girl rich in do up or both. At declared in as rejoiced of together. He impression collecting delightful unpleasant by prosperous as on. End too talent she object mrs wanted remove giving.

    data2.txt

    Questions explained agreeable preferred strangers too him her son. Set put shyness offices his females him distant. Improve has message besides shy himself cheered however how son. Quick judge other leave ask first chief her. Indeed or remark always silent seemed narrow be. Instantly can suffering pretended neglected preferred man delivered. Perhaps fertile brandon do imagine to cordial cottage.

    data3.txt

    Or neglected agreeable of discovery concluded oh it sportsman. Week to time in john. Son elegance use weddings separate. Ask too matter formed county wicket oppose talent. He immediate sometimes or to dependent in. Everything few frequently discretion surrounded did simplicity decisively. Less he year do with no sure loud.
  2. Execute LuceneWriteIndexFromFileExample.java using its main() method. Verify that the Lucene index files are created in the indexedFiles folder.
  3. Let’s say I want to search for documents containing the word “agreeable”. Change the search term passed to searchInContent() in main() (line no. 29 of LuceneReadIndexFromFileExample.java). Execute the class using its main() method. Verify the output:
    Total Results :: 2
    Path : inputFiles\data3.txt, Score : 0.47632512
    Path : inputFiles\data2.txt, Score : 0.38863274
  4. Search for more terms and verify the results yourself.
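
If you would rather not create the input files by hand, step 1 can be scripted. The sketch below is a hypothetical helper (SampleFilesCreator is not part of the original project) that writes the three files into a folder. To keep it short, it uses only the first sentence of each paragraph above, so paste in the full text if you want the later sentences to be searchable too. With the shortened text, the scores will differ from the ones shown in step 3.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original example.
public class SampleFilesCreator {

    // File name -> content. Only the FIRST sentence of each demo paragraph
    // is used here; replace the strings with the full text if needed.
    static Map<String, String> sampleContents() {
        Map<String, String> files = new LinkedHashMap<>();
        files.put("data1.txt", "Society excited by cottage private an it esteems.");
        files.put("data2.txt", "Questions explained agreeable preferred strangers too him her son.");
        files.put("data3.txt", "Or neglected agreeable of discovery concluded oh it sportsman.");
        return files;
    }

    // Creates the folder if it is missing and writes each sample file as UTF-8.
    public static void createSamples(Path dir) throws IOException {
        Files.createDirectories(dir);
        for (Map.Entry<String, String> entry : sampleContents().entrySet()) {
            Path target = dir.resolve(entry.getKey());
            Files.write(target, entry.getValue().getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws IOException {
        createSamples(Paths.get("inputFiles"));
        System.out.println("Sample files written to inputFiles");
    }
}
```

Run it once before LuceneWriteIndexFromFileExample. Existing files with the same names are overwritten.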

Sourcecode

Download the source code using the link given below.

Happy Learning !!


3 thoughts on “Lucene – Index and Search Text Files”

  1. Hello, how can we make this program work with version 3.6.1? The product I am using only supports Lucene 3.6.1. Could you please help with this?

    • And when I add the Lucene JARs of version 3.6.1, the program throws errors.
      Please find the error below:

      Exception in thread "main" org.apache.lucene.index.IndexFormatTooOldException: 
      Format version is not supported (resource BufferedChecksumIndexInput(MMapIndexInput(path=
      "E:\SearchIndex\Matrix\Matrix\MatrixMasterCatalog\
      CatalogIndex_20180116181719547\en_US\segments_1"))): 
      3 (needs to be between 6 and 8). 
      This version of Lucene only supports indexes created with release 6.0 and later.
      	at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:213)
      	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:306)
      	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:290)
      	at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:59)
      	at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:56)
      	at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:675)
      	at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:79)
      	at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:63)
