Lucene: Index and Search Unstructured Text Files

In this Lucene tutorial, we will learn to create indexes from unstructured text files and then search tokens within the indexed documents. To learn about installing Lucene, please refer to the Lucene index and search example.

1. Maven

Start by adding these Lucene dependencies to the pom.xml. We are using Lucene 9.10.0 and Java 21.

<properties> 
  <maven.compiler.source>21</maven.compiler.source>
  <maven.compiler.target>21</maven.compiler.target>
  <lucene.version>9.10.0</lucene.version>
</properties>

<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-core</artifactId>
  <version>${lucene.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-analysis-common</artifactId>
  <version>${lucene.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-queryparser</artifactId>
  <version>${lucene.version}</version>
</dependency>

Additionally, we are using these two folders:

//Input folder where text files are present
String docsPath = "c:/temp/lucene/inputFiles";

//Index folder where the indexes will be created
String indexPath = "c:/temp/lucene/indexedFiles";

2. Indexing the Text File Contents

To index the file contents, we iterate over all the files in the inputFiles folder and index each one. Each Lucene document gets three fields:

path: File path [StringField, Field.Store.YES]
modified: File last-modified timestamp [LongPoint, indexed but not stored]
contents: File content [TextField, Field.Store.YES]

If a field is indexed but not stored, you can search on it, but its value won't be returned with the search results. Field.Store.YES tells Lucene to also keep the original field value in the index so that it can be read back from a matching document.
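To make the indexed-versus-stored distinction concrete, here is a minimal sketch in plain Java. It is a toy model only and does not use Lucene or mirror its internal data structures; the class StoredVsIndexedSketch and its methods addField(), search() and get() are hypothetical names invented for this illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model only: NOT Lucene's real data structures or API.
public class StoredVsIndexedSketch {

  // "Inverted index": "field:token" -> ids of the documents containing it.
  private final Map<String, Set<Integer>> postings = new HashMap<>();

  // "Stored fields": document id -> (field name -> original value).
  private final Map<Integer, Map<String, String>> stored = new HashMap<>();

  // Always indexes the value; keeps the original only when store is true.
  public void addField(int docId, String field, String value, boolean store) {
    for (String token : value.toLowerCase().split("\\W+")) {
      if (!token.isEmpty()) {
        postings.computeIfAbsent(field + ":" + token, k -> new TreeSet<>()).add(docId);
      }
    }
    if (store) {
      stored.computeIfAbsent(docId, k -> new HashMap<>()).put(field, value);
    }
  }

  // Ids of the documents whose field contains the token.
  public Set<Integer> search(String field, String token) {
    return postings.getOrDefault(field + ":" + token.toLowerCase(), Set.of());
  }

  // Original value of a field, or null if the field was not stored.
  public String get(int docId, String field) {
    return stored.getOrDefault(docId, Map.of()).get(field);
  }

  public static void main(String[] args) {
    StoredVsIndexedSketch index = new StoredVsIndexedSketch();
    index.addField(0, "path", "data2.txt", true);           // indexed and stored
    index.addField(0, "modified", "1700000000000", false);  // indexed only

    System.out.println(index.search("modified", "1700000000000")); // document 0 matches
    System.out.println(index.get(0, "path"));      // stored, so the value comes back
    System.out.println(index.get(0, "modified"));  // not stored, so null
  }
}
```

The document can be found through the modified field, yet only path can be read back from the hit, which is the same behaviour the tutorial relies on when it prints d.get("path") from the search results.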

The following Java program uses Files.walkFileTree() to visit every file under the provided directory, and uses org.apache.lucene.index.IndexWriter to write one document per file to the index.

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneWriteIndexFromFileExample {

  public static void main(String[] args) {

    //Input folder
    String docsPath = "c:/temp/lucene/inputFiles";

    //Output folder
    String indexPath = "c:/temp/lucene/indexedFiles";

    //Input Path Variable
    final Path docDir = Paths.get(docsPath);

    try {
      //org.apache.lucene.store.Directory instance
      Directory dir = FSDirectory.open(Paths.get(indexPath));

      //analyzer with the default stop words
      Analyzer analyzer = new StandardAnalyzer();

      //IndexWriter Configuration
      IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
      iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);

      //IndexWriter writes new index files to the directory
      IndexWriter writer = new IndexWriter(dir, iwc);

      //Recursively index all files under the input directory
      indexDocs(writer, docDir);

      writer.close();
    } catch (IOException e) {
      e.printStackTrace();
    }
  }

  static void indexDocs(final IndexWriter writer, Path path) throws IOException {
    //Directory?
    if (Files.isDirectory(path)) {
      //Iterate directory
      Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
          try {
            //Index this file
            writeToIndex(writer, file, attrs.lastModifiedTime().toMillis());
          } catch (IOException ioe) {
            ioe.printStackTrace();
          }
          return FileVisitResult.CONTINUE;
        }
      });
    } else {
      //Index this file
      writeToIndex(writer, path, Files.getLastModifiedTime(path).toMillis());
    }
  }

  static void writeToIndex(IndexWriter writer, Path file, long lastModified) throws IOException {
    try (InputStream stream = Files.newInputStream(file)) {
      //Create lucene Document
      Document doc = new Document();

      doc.add(new StringField("path", file.toString(), Field.Store.YES));
      doc.add(new LongPoint("modified", lastModified));
      //Read the content through the stream opened above; the bytes are
      //decoded with the default charset (UTF-8 on Java 18 and later)
      doc.add(new TextField("contents", new String(stream.readAllBytes()), Store.YES));

      //updateDocument() first deletes any document(s) containing
      //the given term and then adds the new document. The delete
      //and the add are atomic as seen by a reader on the same index.
      System.out.println("Writing file : " + file.toString());
      writer.updateDocument(new Term("path", file.toString()), doc);
    }
  }
}
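The program above hands every file it visits to the IndexWriter, whatever its extension. If the input folder may also contain non-text files, one option is to filter inside the visitor. The sketch below uses only the JDK; the class name TextFileWalkSketch, the helpers isTextFile() and findTextFiles(), and the rule that "*.txt means text" are assumptions made for this illustration and are not part of the original program.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TextFileWalkSketch {

  // Hypothetical rule: only *.txt (case-insensitive) counts as a text file.
  public static boolean isTextFile(Path file) {
    Path name = file.getFileName();
    return name != null && name.toString().toLowerCase().endsWith(".txt");
  }

  // Walks the tree under root and returns the text files found, sorted by path.
  public static List<Path> findTextFiles(Path root) throws IOException {
    List<Path> found = new ArrayList<>();
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
      @Override
      public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) {
        if (attrs.isRegularFile() && isTextFile(file)) {
          found.add(file);
        }
        return FileVisitResult.CONTINUE;
      }
    });
    Collections.sort(found);
    return found;
  }

  public static void main(String[] args) throws IOException {
    // Build a small temporary tree so the sketch runs anywhere.
    Path root = Files.createTempDirectory("inputFiles");
    Files.writeString(root.resolve("data1.txt"), "Society excited by cottage");
    Files.writeString(root.resolve("notes.md"), "skipped by the filter");
    Path sub = Files.createDirectory(root.resolve("sub"));
    Files.writeString(sub.resolve("data2.TXT"), "Questions explained agreeable");

    for (Path p : findTextFiles(root)) {
      System.out.println(root.relativize(p));
    }
  }
}
```

In the indexing program, the same isTextFile() check could guard the call to writeToIndex() inside visitFile(), so that only matching files reach the index.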

3. Searching in Lucene Indexes

To search for anything in the Lucene indexes, we use org.apache.lucene.search.IndexSearcher and its search() method. The QueryParser helps in creating the Query object from the input text to search.

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneReadIndexFromFileExample {
  //directory contains the lucene indexes
  private static final String INDEX_DIR = "c:/temp/lucene/indexedFiles";
  private static String textToSearch = "agreeable";

  public static void main(String[] args) throws Exception {
    //Create lucene searcher. It searches over a single IndexReader.
    IndexSearcher searcher = createSearcher();

    //Search indexed contents using search term
    TopDocs foundDocs = searchInContent(textToSearch, searcher);

    //Total found documents
    System.out.println("Total Results :: " + foundDocs.totalHits.value);

    //Let's print out the path of files which have searched term
    for (ScoreDoc sd : foundDocs.scoreDocs) {
      Document d = searcher.storedFields().document(sd.doc);
      System.out.println("Path : " + d.get("path") + ", Score : " + sd.score);
    }
  }

  private static TopDocs searchInContent(String textToFind, IndexSearcher searcher) throws Exception {
    //Create search query
    QueryParser qp = new QueryParser("contents", new StandardAnalyzer());
    Query query = qp.parse(textToFind);

    //search the index
    TopDocs hits = searcher.search(query, 10);
    return hits;
  }

  private static IndexSearcher createSearcher() throws IOException {
    Directory dir = FSDirectory.open(Paths.get(INDEX_DIR));

    //It is an interface for accessing a point-in-time view of a lucene index
    IndexReader reader = DirectoryReader.open(dir);

    //Index searcher
    IndexSearcher searcher = new IndexSearcher(reader);
    return searcher;
  }
}

4. Demo

Let’s create three files, data1.txt, data2.txt and data3.txt, in the inputFiles folder with the following content, one paragraph per file:

Society excited by cottage private an it esteems. Fully begin on by wound an. Girl rich in do up or both. At declared in as rejoiced of together. He impression collecting delightful unpleasant by prosperous as on. End too talent she object mrs wanted remove giving.

Questions explained agreeable preferred strangers too him her son. Set put shyness offices his females him distant. Improve has message besides shy himself cheered however how son. Quick judge other leave ask first chief her. Indeed or remark always silent seemed narrow be. Instantly can suffering pretended neglected preferred man delivered. Perhaps fertile brandon do imagine to cordial cottage.

Or neglected agreeable of discovery concluded oh it sportsman. Week to time in john. Son elegance use weddings separate. Ask too matter formed county wicket oppose talent. He immediate sometimes or to dependent in. Everything few frequently discretion surrounded did simplicity decisively. Less he year do with no sure loud.

Now, run LuceneWriteIndexFromFileExample using its main() method. Verify that the Lucene index files are created in the indexedFiles folder.

Next, let’s say we want to search for documents containing the word “agreeable”. Set the search term in the variable textToSearch of the class LuceneReadIndexFromFileExample, then run the class using its main() method. Verify the output:

Total Results :: 2
Path : inputFiles\data3.txt, Score : 0.47632512
Path : inputFiles\data2.txt, Score : 0.38863274

Search more terms and verify them yourselves.
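One way to cross-check a hit count without Lucene is a plain whole-word scan over the sample text. The sketch below is only a sanity check, not a substitute for Lucene's analysis or scoring; the class DemoTermCheck and the method filesContaining() are hypothetical names, and the mapping of the three sample paragraphs to data1.txt, data2.txt and data3.txt (in the order they are listed) is an assumption. Only the opening sentence of each paragraph is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DemoTermCheck {

  // Returns the names of the files whose text contains the term
  // as a whole word, ignoring case.
  public static List<String> filesContaining(Map<String, String> files, String term) {
    List<String> matches = new ArrayList<>();
    for (Map.Entry<String, String> entry : files.entrySet()) {
      List<String> tokens = Arrays.asList(entry.getValue().toLowerCase().split("\\W+"));
      if (tokens.contains(term.toLowerCase())) {
        matches.add(entry.getKey());
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    // Opening sentences of the sample files; the file-to-text mapping is assumed.
    Map<String, String> files = new LinkedHashMap<>();
    files.put("data1.txt", "Society excited by cottage private an it esteems.");
    files.put("data2.txt", "Questions explained agreeable preferred strangers too him her son.");
    files.put("data3.txt", "Or neglected agreeable of discovery concluded oh it sportsman.");

    System.out.println(filesContaining(files, "agreeable")); // [data2.txt, data3.txt]
  }
}
```

Under that assumed mapping, the two matching files agree with the two results reported by the Lucene search.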

Happy Learning !!
