Apache POI SAX Parser: Reading a Large Excel File

Learn to read a large Excel file in Java using Apache POI together with a SAX parser. A SAX parser is event-based: unlike a DOM parser, it does not build a parse tree in memory. Instead, it reads the document sequentially from top to bottom and sends an event notification each time a sheet, row or cell is encountered, which keeps memory usage low for large files.
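
To make the event-based model concrete, here is a minimal sketch that uses only the JDK's SAX parser (no POI) on a tiny XML string. The class name SaxEventsDemo and the sample XML are made up for illustration; the element names merely mimic the row, c and v elements found in a worksheet.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the tutorial's code
class SaxEventsDemo {

  // Parses the XML and returns the events in the order the parser reported them
  static List<String> collectEvents(String xml) throws Exception {
    List<String> events = new ArrayList<>();
    DefaultHandler handler = new DefaultHandler() {
      @Override
      public void startElement(String uri, String localName, String qName,
                               Attributes attributes) {
        events.add("start:" + qName);
      }

      @Override
      public void endElement(String uri, String localName, String qName) {
        events.add("end:" + qName);
      }
    };
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    parser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
    return events;
  }

  public static void main(String[] args) throws Exception {
    // prints [start:row, start:c, start:v, end:v, end:c, end:row]
    System.out.println(collectEvents("<row><c><v>1</v></c></row>"));
  }
}
```

No tree is ever built here: the handler sees each element once, in document order, and keeps only what it chooses to store.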

In this example, we will be able to:

  • Use custom logic to choose if we want to process a specific sheet (by sheet name).
  • Get notified when a new sheet starts or the current sheet ends.
  • Get the first row in the sheet as headers.
  • Get the other rows in the sheet as a Map of column name and cell value pairs.

1. Maven Dependencies

Add the latest version of org.apache.poi:poi and org.apache.poi:poi-ooxml to the application, if not added already.

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi</artifactId>
    <version>5.2.2</version>
</dependency>

<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>5.2.2</version>
</dependency>

2. Core Classes

  • OPCPackage: A .xlsx file is built on top of the OOXML package structure, and OPCPackage represents a container that can store multiple data objects.
  • XSSFReader: makes it easy to get at individual parts of an OOXML .xlsx file, and is suitable for low-memory SAX parsing.
  • DefaultHandler: provides default implementations for all callbacks in the other core SAX2 handler classes. We extend this class and override the necessary methods to handle event callbacks.
  • SAXParser: parses a document and sends notification of various parser events to a registered event handler.
  • SharedStringsTable: It stores a table of strings shared across all sheets in a workbook. It helps in improving performance when some strings are repeated across many rows or columns. The shared string table contains all the necessary information for displaying the string: the text, formatting properties, and phonetic properties.
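
The shared-strings idea can be sketched without POI. In the snippet below, a plain List<String> stands in for the SharedStringsTable, and the sheet XML is a hand-written fragment; both are assumptions made for illustration. A cell with t="s" stores only an index in its v element, and the text is looked up in the shared list.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; the List<String> is a stand-in for POI's SharedStringsTable
class SharedStringLookupDemo {

  // Returns cell reference -> resolved value for every cell that has a <v> element
  static Map<String, String> readCells(String sheetXml, List<String> sharedStrings)
      throws Exception {
    Map<String, String> cells = new LinkedHashMap<>();
    DefaultHandler handler = new DefaultHandler() {
      private String ref;
      private boolean shared;
      private boolean inValue;
      private final StringBuilder text = new StringBuilder();

      @Override
      public void startElement(String uri, String localName, String qName,
                               Attributes attributes) {
        if ("c".equals(qName)) {
          ref = attributes.getValue("r");
          // t="s" marks a cell whose value is an index into the shared strings
          shared = "s".equals(attributes.getValue("t"));
        } else if ("v".equals(qName)) {
          inValue = true;
          text.setLength(0);
        }
      }

      @Override
      public void characters(char[] ch, int start, int length) {
        if (inValue) {
          text.append(ch, start, length);
        }
      }

      @Override
      public void endElement(String uri, String localName, String qName) {
        if ("v".equals(qName)) {
          inValue = false;
          String raw = text.toString();
          cells.put(ref, shared ? sharedStrings.get(Integer.parseInt(raw)) : raw);
        }
      }
    };
    SAXParserFactory.newInstance().newSAXParser()
        .parse(new ByteArrayInputStream(sheetXml.getBytes(StandardCharsets.UTF_8)), handler);
    return cells;
  }

  public static void main(String[] args) throws Exception {
    String xml = "<row r=\"1\"><c r=\"A1\" t=\"s\"><v>1</v></c>"
        + "<c r=\"B1\"><v>42</v></c><c r=\"C1\" t=\"s\"><v>0</v></c></row>";
    // prints {A1=ID, B1=42, C1=NAME}
    System.out.println(readCells(xml, List.of("NAME", "ID")));
  }
}
```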

See Also: DOM vs SAX Parser

3. Reading Excel with SAX Parser

3.1. Overriding DefaultHandler

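Let us start by creating the handler for the parsing events. The following SheetHandler extends DefaultHandler and provides the methods listed below. Note that the code uses arrow-style switch cases and pattern matching for instanceof, so it needs Java 16 or later.

  • startElement(): is called when a new row, cell or cell value element begins.
  • characters(): collects the text content of a cell value element.
  • endElement(): is called when the current row or cell ends.
  • readExcelFile(): takes an Excel file and uses SAXParser and XSSFReader to parse the file, sheet by sheet.
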
import org.apache.poi.openxml4j.opc.OPCPackage;
import org.apache.poi.xssf.eventusermodel.XSSFReader;
import org.apache.poi.xssf.model.SharedStringsTable;
import org.apache.poi.xssf.usermodel.XSSFRichTextString;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import java.io.File;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ExecutionException;

public class SheetHandler extends DefaultHandler
{
  protected Map<String, String> header = new HashMap<>();
  protected Map<String, String> rowValues = new HashMap<>();
  private SharedStringsTable sharedStringsTable;

  protected long rowNumber = 0;
  protected String cellId;
  private String contents;
  private boolean isCellValue;
  private boolean fromSST;

  protected static String getColumnId(String attribute) throws SAXException {
    for (int i = 0; i < attribute.length(); i++) {
      if (!Character.isAlphabetic(attribute.charAt(i))) {
        return attribute.substring(0, i);
      }
    }
    throw new SAXException("Invalid format " + attribute);
  }

  @Override
  public void startElement(String uri, String localName, String name,
                           Attributes attributes) throws SAXException {
    // Clear contents cache
    contents = "";
    // element row represents Row
    switch (name) {
      case "row" -> {
        String rowNumStr = attributes.getValue("r");
        rowNumber = Long.parseLong(rowNumStr);
      }
      // element c represents Cell
      case "c" -> {
        cellId = getColumnId(attributes.getValue("r"));
        // attribute t represents the cell type; type s means the value
        // is an index into the SharedStringsTable. Assigning the flag on every
        // cell stops it from leaking from a shared-string cell that has no value.
        String cellType = attributes.getValue("t");
        fromSST = "s".equals(cellType);
      }
      // element v represents value of Cell
      case "v" -> isCellValue = true;
    }
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    if (isCellValue) {
      contents += new String(ch, start, length);
    }
  }

  @Override
  public void endElement(String uri, String localName, String name) {
    if (isCellValue && fromSST) {
      int index = Integer.parseInt(contents);
      contents = new XSSFRichTextString(sharedStringsTable.getItemAt(index).getString()).toString();
      rowValues.put(cellId, contents);
      cellId = null;
      isCellValue = false;
      fromSST = false;
    } else if (isCellValue) {
      rowValues.put(cellId, contents);
      isCellValue = false;
    } else if (name.equals("row")) {
      // Capture the first row as the header and keep it for the rest of the sheet
      if (rowNumber == 1) {
        header.clear();
        header.putAll(rowValues);
      }
      try {
        processRow();
      } catch (ExecutionException | InterruptedException e) {
        e.printStackTrace();
      }
      rowValues.clear();
    }
  }

  protected boolean processSheet(String sheetName) {
    return true;
  }

  protected void startSheet() {
  }

  protected void endSheet() {
  }

  protected void processRow() throws ExecutionException, InterruptedException {
  }

  public void readExcelFile(File file) throws Exception {

    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser saxParser = factory.newSAXParser();

    try (OPCPackage opcPackage = OPCPackage.open(file)) {
      XSSFReader xssfReader = new XSSFReader(opcPackage);
      sharedStringsTable = (SharedStringsTable) xssfReader.getSharedStringsTable();

      Iterator<InputStream> sheets = xssfReader.getSheetsData();

      if (sheets instanceof XSSFReader.SheetIterator sheetIterator) {
        while (sheetIterator.hasNext()) {
          try (InputStream sheet = sheetIterator.next()) {
            String sheetName = sheetIterator.getSheetName();
            if(!processSheet(sheetName)) {
              continue;
            }
            startSheet();
            saxParser.parse(sheet, this);
            endSheet();
          }
        }
      }
    }
  }
}
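
SheetHandler keys each cell by its column letters (A, B, ..., AA), and a HashMap does not guarantee any iteration order. If column order matters, one option is to convert the letters to a numeric index and sort by it. The following is a sketch under that assumption; ColumnIndexDemo and its methods are hypothetical helpers, not part of the handler above.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class for ordering row values by column position
class ColumnIndexDemo {

  // Converts column letters to a zero-based index: A -> 0, Z -> 25, AA -> 26
  static int toColumnIndex(String letters) {
    if (letters == null || letters.isEmpty()) {
      throw new IllegalArgumentException("Column letters must not be empty");
    }
    int index = 0;
    for (int i = 0; i < letters.length(); i++) {
      char ch = Character.toUpperCase(letters.charAt(i));
      if (ch < 'A' || ch > 'Z') {
        throw new IllegalArgumentException("Invalid column letters: " + letters);
      }
      // Treat the letters as a base-26 number whose digits run from 1 to 26
      index = index * 26 + (ch - 'A' + 1);
    }
    return index - 1;
  }

  // Returns a copy of the row whose iteration order follows the column position
  static Map<String, String> sortByColumn(Map<String, String> rowValues) {
    Comparator<String> byColumn = Comparator.comparingInt(ColumnIndexDemo::toColumnIndex);
    Map<String, String> sorted = new TreeMap<>(byColumn);
    sorted.putAll(rowValues);
    return sorted;
  }

  public static void main(String[] args) {
    // prints {A=z, B=y, AA=x}
    System.out.println(sortByColumn(Map.of("AA", "x", "B", "y", "A", "z")));
  }
}
```

A plain alphabetical sort would place AA before B, which is why the comparison goes through the numeric index.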

3.2. Creating Row Handler

The following ExcelReaderHandler class extends the SheetHandler class given in the previous section. It overrides the following methods so we can write our custom logic for processing the data read from each sheet in the Excel file.

  • processSheet(): determines whether we want to read a sheet or not. It takes the sheet name as a parameter that we can use to make the decision.
  • startSheet(): is invoked every time a new sheet starts.
  • endSheet(): is invoked every time the current sheet ends.
  • processRow(): is invoked once for each row, and provides the cell values in that row.

public class ExcelReaderHandler extends SheetHandler {

  @Override
  protected boolean processSheet(String sheetName) {
    //Decide which sheets to read; Return true for all sheets
    //return "Sheet 1".equals(sheetName);
    System.out.println("Processing start for sheet : " + sheetName);
    return true;
  }

  @Override
  protected void startSheet() {
    //Any custom logic when a new sheet starts
    System.out.println("Sheet starts");
  }

  @Override
  protected void endSheet() {
    //Any custom logic when sheet ends
    System.out.println("Sheet ends");
  }

  @Override
  protected void processRow() {
    if(rowNumber == 1 && !header.isEmpty()) {
      System.out.println("The header values are at line no. " + rowNumber + " " +
          "are :" + header);
    }
    else if (rowNumber > 1 && !rowValues.isEmpty()) {

      //Get specific values here
      /*String a = rowValues.get("A");
      String b = rowValues.get("B");*/

      //Print whole row
      System.out.println("The row values are at line no. " + rowNumber + " are :" + rowValues);
    }
  }
}
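
The row values above are keyed by column letter. To get them keyed by column name instead, the header row can be used as a lookup. The sketch below assumes the header map captured from row 1 is still available when later rows arrive (for example, a copy saved in processRow() when rowNumber is 1); HeaderMappingDemo and toNamedRow are hypothetical names used only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class for turning {A=1, B=Amit} into {ID=1, NAME=Amit}
class HeaderMappingDemo {

  // header: column letter -> column name; rowValues: column letter -> cell value
  static Map<String, String> toNamedRow(Map<String, String> header,
                                        Map<String, String> rowValues) {
    Map<String, String> named = new LinkedHashMap<>();
    for (Map.Entry<String, String> column : header.entrySet()) {
      // Cells missing from the row map to null
      named.put(column.getValue(), rowValues.get(column.getKey()));
    }
    return named;
  }

  public static void main(String[] args) {
    Map<String, String> header = new LinkedHashMap<>();
    header.put("A", "ID");
    header.put("B", "NAME");
    header.put("C", "LASTNAME");

    Map<String, String> row = new LinkedHashMap<>();
    row.put("A", "1");
    row.put("C", "Shukla");

    // prints {ID=1, NAME=null, LASTNAME=Shukla}
    System.out.println(toNamedRow(header, row));
  }
}
```

Iterating over the header rather than the row means every named column appears in the result, even when the sheet XML omits an empty cell.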

4. Demo

Let us see how to read an Excel file with a demo program. The file we are reading has 2 sheets with a few values in each.

Let us use ExcelReaderHandler to read the Excel file and print the values read in the process.

import java.io.File;
import java.net.URL;

public class ReadExcelUsingSaxParserExample {
  public static void main(String[] args) throws Exception {

    URL url = ReadExcelUsingSaxParserExample.class
        .getClassLoader()
        .getResource("howtodoinjava_demo.xlsx");

    // toURI() handles escaped characters (such as spaces) in the resource path
    new ExcelReaderHandler().readExcelFile(new File(url.toURI()));
  }
}

The program prints the cell values read from the Excel file.

Processing start for sheet : Employee Data
Sheet starts
The header values are at line no. 1 are :{A=ID, B=NAME, C=LASTNAME}
The row values are at line no. 2 are :{A=1, B=Amit, C=Shukla}
The row values are at line no. 3 are :{A=2, B=Lokesh, C=Gupta}
The row values are at line no. 4 are :{A=3, B=John, C=Adwards}
The row values are at line no. 5 are :{A=4, B=Brian, C=Schultz}
Sheet ends

Processing start for sheet : Random Data
Sheet starts
The header values are at line no. 1 are :{A=Key, B=Value}
The row values are at line no. 2 are :{A=1, B=a}
The row values are at line no. 3 are :{A=2, B=b}
The row values are at line no. 4 are :{A=3, B=c}
Sheet ends

5. Conclusion

In this Apache POI tutorial, we learned to read an Excel file using the SAX parser. Because the file is processed as a stream of events, the same approach also works for huge Excel files. Try modifying the code to get a better understanding of how it works.

Happy Learning !!

Source Code on Github
