Introduction to the Apache POI Library
A good thing to know for any Java dev working with Excel spreadsheets, let's see how the Apache POI library can help read and write Excel files with little overhead.
Join the DZone community and get the full member experience.
Join For FreeWe use Office applications frequently to serve many purposes. Have you ever thought of doing the task of reading or writing an Office file (like Excel spreadsheet) programmatically? Sometimes, there will be situations where we will have to handle it by writing some custom program. The Apache POI library is a very helpful tool to accomplish such a task. The POI library can be used for reading or writing all kinds of Office documents. This article, however, will focus on working with spreadsheet files — more specifically, Excel files.
Excel File Formats
Microsoft Excel has two major file formats:
- Excel 97-2003 workbook: This is a legacy Excel file that follows a binary file format. The file extension of the format is .xls.
- Excel 2007+ workbook: This is the default XML-based file format for Excel 2007 and later versions. It follows the Office Open XML (OOXML) format, which is a zipped, XML-based file format developed by Microsoft for representing office documents. The file extension of the format is .xlsx.
Apache POI has options for handling both file formats. For the Excel 97-2003 file format, POI has implemented HSSF. For Excel's 2007 OOXML file format, POI has implemented XSSF. Since the Excel 97-2003 (.xls) format has become almost obsolete now, the rest of our discussions and examples will follow the Excel 2007+(.xlsx) file format and its corresponding POI implementation, XSSF. In addition, POI has developed a streaming extension of XSSF to work with very large spreadsheets. This streaming API is SXSSF, which can write very large files using a low memory footprint.
Project Setup
For Maven projects, we should include the following dependencies:
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi</artifactId>
<version>3.17</version>
</dependency>
<dependency>
<groupId>org.apache.poi</groupId>
<artifactId>poi-ooxml</artifactId>
<version>3.17</version>
</dependency>
For non-Maven projects, we have to download the binary distribution of the POI library from the Apache POI’s download page. After downloading, we have to extract the compressed archive file and include the necessary JAR files in the project’s Java classpath. Here is a sample list of JAR files:
- poi-3.17.jar
- poi-ooxml-3.17.jar
- poi-ooxml-schemas-3.17.jar
- curvesapi-1.04.jar
- xmlbeans-2.6.0.jar
- commons-codec-1.10.jar
- commons-collections4-4.1.jar
Basic Interfaces
To start working with the POI API, we should learn about the following interfaces from package org.apache.poi.ss.usermodel:
- Workbook: This interface is the high-level representation of an Excel workbook. XSSFWorkbook and SXSSFWorkbook are the concrete implementations we will work with.
- Sheet: This interface is the high-level representation of an Excel worksheet. XSSFSheet and SXSSFSheet are the concrete implementations we will work with.
- Row: This interface is the high-level representation of a row of an Excel worksheet. XSSFRow and SXSSFRow are the concrete implementations we will work with.
- Cell: This interface is the high-level representation of a cell in a row of an Excel worksheet. XSSFCell and SXSSFCell are the concrete implementations we will work with.
- CellStyle: This interface is the high-level representation of the styling or formatting information for the cells on an Excel worksheet. XSSFCellStyle is the concrete implementation we will work with.
Various APIs
We can use the POI library for reading and writing Excel files in several ways:
- User API (XSSF)
- Streaming User API (SXSSF)
- Event API (XSSF with SAX)
Let’s learn about the above APIs with some small examples.
User API (XSSF)
The User API or usermodel API is available in packages org.apache.poi.ss.usermodel and org.apache.poi.xssf.usermodel. This is a very simple API. One drawback is that it has a higher memory footprint. Here is an example of writing an Excel file using the User API:
public class XSSFExample1 {
public static void main(String[] args) throws Exception {
String[] books = {
"The Tempest",
"Gitanjali",
"Harry Potter"
};
String[] authors = {
"William Shakespeare",
"Rabindranath Tagore",
"J. K. Rowling"
};
Workbook workbook = new XSSFWorkbook();
Sheet sheet = workbook.createSheet();
sheet.setColumnWidth((short) 0, (short)((50 * 8) / ((double) 1 / 20)));
sheet.setColumnWidth((short) 1, (short)((50 * 8) / ((double) 1 / 20)));
workbook.setSheetName(0, "XSSFWorkbook example");
Font font1 = workbook.createFont();
font1.setFontHeightInPoints((short) 10);
font1.setColor((short) 0xc); // make it blue
font1.setBold(true);
XSSFCellStyle cellStyle1 = (XSSFCellStyle) workbook.createCellStyle();
cellStyle1.setFont(font1);
Font font2 = workbook.createFont();
font2.setFontHeightInPoints((short) 10);
font2.setColor((short) Font.COLOR_NORMAL);
XSSFCellStyle cellStyle2 = (XSSFCellStyle) workbook.createCellStyle();
cellStyle2.setFont(font2);
Row headerRow = sheet.createRow(0);
Cell cell1 = headerRow.createCell(0);
cell1.setCellValue("Book");
cell1.setCellStyle(cellStyle1);
Cell cell2 = headerRow.createCell(1);
cell2.setCellValue("Author");
cell2.setCellStyle(cellStyle1);
int rownum;
Row row = null;
Cell cell = null;
for (rownum = (short) 1; rownum <= books.length; rownum++) {
row = sheet.createRow(rownum);
cell = row.createCell(0);
cell.setCellValue(books[rownum - 1]);
cell.setCellStyle(cellStyle2);
cell = row.createCell(1);
cell.setCellValue(authors[rownum - 1]);
cell.setCellStyle(cellStyle2);
}
final String FILE_NAME = "./xssf_example.xlsx";
FileOutputStream outputStream = new FileOutputStream(FILE_NAME);
workbook.write(outputStream);
outputStream.close();
workbook.close();
}
}
In the above example, we have created an instance of XSSFWorkbook first. A sheet is then created from the workbook instance. We have created two instances of XSSFCellStyle from the workbook instance. We have created a couple of rows from the sheet instance. Some cell instances have been created from each row instance. Finally, we have written the workbook in a file using the file output stream.
Reading from an Excel file using the User API is also very simple. Here is a small example:
public class XSSFExample2 {
public static void main(String[] args) throws Exception {
final String FILE_NAME = "./xssf_example.xlsx";
FileInputStream excelInputStream = new FileInputStream(new File(FILE_NAME));
Workbook workbook = new XSSFWorkbook(excelInputStream);
Sheet sheet = workbook.getSheetAt(0);
Iterator < Row > rowItr = sheet.iterator();
int rowNum = 0;
while (rowItr.hasNext()) {
Row row = rowItr.next();
Iterator < Cell > cellItr = row.iterator();
System.out.print(rowNum + ". ");
while (cellItr.hasNext()) {
Cell cell = cellItr.next();
if (cell.getCellTypeEnum() == CellType.STRING) {
System.out.print(cell.getStringCellValue() + "\t\t");
} else if (cell.getCellTypeEnum() == CellType.NUMERIC) {
System.out.print(cell.getNumericCellValue() + "\t\t");
}
}
System.out.println();
rowNum++;
}
workbook.close();
excelInputStream.close();
}
}
We have tried to read the Excel file that was written in the first example. At first, we created a file input stream using the Excel file. TheWorkbook instance is then created from the input stream. We get the sheet instance from the workbook instance. A row iterator can be found via the sheet instance. Similarly, the cell iterator can be found via each row instance. Based on the type of the cell, the value of the cell is finally printed in the console.
Streaming User API (SXSSF)
The SXSSF API is available in the package org.apache.poi.xssf.streaming. This one is suitable for writing very large spreadsheets using low heap memory. SXSSF keeps a portion of the workbook in memory, instead of keeping the whole workbook. This portion is specified by a window size that indicates the number of rows accessible from memory. The other rows are gradually flushed into temporary files. In this way, SXSSF achieves its low memory footprint.
We can specify the window size in the constructor of SXSSFWorkbook. The default window size is 100. When a new row is created using the sheet’s createRow() method, the unflashed row with the lowest index value will be flushed if the total number of unflushed rows exceeds the window size. The flushed row is no longer accessible by using the getRow() method. Let’s see a small example of the use of SXSSF:
public class SXSSFExample {
public static void main(String[] args) throws Exception {
SXSSFWorkbook workbook = new SXSSFWorkbook(100);
Sheet sheet = workbook.createSheet();
for (int rownum = 0; rownum < 500; rownum++) {
Row row = sheet.createRow(rownum);
for (int cellnum = 0; cellnum < 10; cellnum++) {
Cell cell = row.createCell(cellnum);
cell.setCellValue((rownum + 1) + ", " + (cellnum + 1));
}
}
System.out.println(sheet.getRow(0));
System.out.println(sheet.getRow(200));
System.out.println(sheet.getRow(400));
final String FILE_NAME = "./sxssf_example.xlsx";
FileOutputStream outputStream = new FileOutputStream(FILE_NAME);
workbook.write(outputStream);
outputStream.close();
workbook.dispose();
workbook.close();
}
}
In the above example, a workbook of 500 rows has been created. Here, the window size is set to 100. We are creating the rows sequentially starting from index 0. When the 101st row is created, the 1st row (with index 0) is removed from memory and flushed to a temporary file. Thus the 1st row will not be available to access any longer. After the creation of 500 rows is finished, only last 100 rows will be available to getRow(). So, a call of getRow(0) and getRow(200) will return null. Meanwhile, a call of getRow(400) will return a valid reference.
SXSSF uses temporary files to flush the sheet data. These files are not cleaned up automatically. The workbook’s dispose() method should be used to do the cleanup. The size of these temporary files can become very large if the spreadsheet is very big. The workbook’s setCompressTempFiles() method should be used in such cases. Thus temporary files will be created using GZIP compression and disk storage will be saved.
Event API (XSSF with SAX)
If we want to read large XLSX file using a small memory footprint, the Event API is a suitable solution. This one is available in the package org.apache.poi.xssf.eventusermodel. The core class is XSSFReader. We have to use it along with the SAX XML parser. The details of the SAX XML parser is beyond the scope of this article, but you can get a good overview here.
The Event API is easy to use, but we have to understand the underlying XML file structure of XLSX files. To understand the XML file structure, let's copy the file that was created in our first example (xssf_example.xlsx) to any other suitable directory, then rename the file by changing the extension from .xlsx to .zip. If we extract that ZIP file using any archive utility, we will get a bunch of XML files with the following directory structure:
Shared strings, styles, workbook, sheets — all these are represented in separate XML files. For convenience, here is the formatted version of sharedStrings.xml:
<?xml version="1.0" encoding="UTF-8"?>
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" count="8" uniqueCount="8">
<si>
<t>Book</t>
</si>
<si>
<t>Author</t>
</si>
<si>
<t>The Tempest</t>
</si>
<si>
<t>William Shakespeare</t>
</si>
<si>
<t>Gitanjali</t>
</si>
<si>
<t>Rabindranath Tagore</t>
</si>
<si>
<t>Harry Potter</t>
</si>
<si>
<t>J. K. Rowling</t>
</si>
</sst>
Here is the formatted version of sheet1.xml:
<?xml version="1.0" encoding="UTF-8"?>
<worksheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
<dimension ref="A1:C4" />
<sheetViews>
<sheetView workbookViewId="0" tabSelected="true" />
</sheetViews>
<sheetFormatPr defaultRowHeight="15.0" />
<cols>
<col min="1" max="1" width="31.25" customWidth="true" />
<col min="2" max="2" width="31.25" customWidth="true" />
</cols>
<sheetData>
<row r="1">
<c r="A1" t="s" s="1">
<v>0</v>
</c>
<c r="B1" t="s" s="1">
<v>1</v>
</c>
</row>
<row r="2">
<c r="A2" t="s" s="2">
<v>2</v>
</c>
<c r="B2" t="s" s="2">
<v>3</v>
</c>
</row>
<row r="3">
<c r="A3" t="s" s="2">
<v>4</v>
</c>
<c r="B3" t="s" s="2">
<v>5</v>
</c>
</row>
<row r="4">
<c r="A4" t="s" s="2">
<v>6</v>
</c>
<c r="B4" t="s" s="2">
<v>7</v>
</c>
</row>
</sheetData>
<pageMargins bottom="0.75" footer="0.3" header="0.3" left="0.7" right="0.7" top="0.75" />
</worksheet>
We now have an overview of the underlying XML file structure. We can start working with the Event API. The XLSX file that we want to read should be opened as an OPCPackage object at first. We have to create an instance of XSSFReader with the OPCPackage object. XSSFReader provides us methods to get the shared strings table, the styles table, and the sheets as raw XML data. We have to parse the sheet’s raw XML data using the SAX parser. Here is a sample example:
public class XSSFReaderExample {
public static void main(String[] args) throws Exception {
final String FILE_NAME = "./xssf_example.xlsx";
XSSFReaderExample example = new XSSFReaderExample();
example.readExcelFile(FILE_NAME);
}
public void readExcelFile(String filename) throws Exception {
OPCPackage opcPackage = OPCPackage.open(filename);
XSSFReader xssfReader = new XSSFReader(opcPackage);
SharedStringsTable sharedStringsTable = xssfReader.getSharedStringsTable();
XMLReader parser = getSheetParser(sharedStringsTable);
Iterator < InputStream > sheets = xssfReader.getSheetsData();
while (sheets.hasNext()) {
System.out.println("Processing sheet:");
InputStream sheet = sheets.next();
InputSource sheetSource = new InputSource(sheet);
parser.parse(sheetSource);
sheet.close();
System.out.println();
}
}
public XMLReader getSheetParser(SharedStringsTable sharedStringsTable) throws SAXException {
XMLReader parser = XMLReaderFactory.createXMLReader();
ContentHandler handler = new SheetHandler(sharedStringsTable);
parser.setContentHandler(handler);
return parser;
}
/** sheet handler class for SAX2 events */
private static class SheetHandler extends DefaultHandler {
private SharedStringsTable sharedStringsTable;
private String contents;
private boolean isCellValue;
private boolean fromSST;
private SheetHandler(SharedStringsTable sharedStringsTable) {
this.sharedStringsTable = sharedStringsTable;
}
@Override
public void startElement(String uri, String localName, String name, Attributes attributes) throws SAXException {
// Clear contents cache
contents = "";
// element row represents Row
if (name.equals("row")) {
String rowNumStr = attributes.getValue("r");
System.out.println("Row# " + rowNumStr);
}
// element c represents Cell
else if (name.equals("c")) {
// attribute r represents the cell reference
System.out.print(attributes.getValue("r") + " - ");
// attribute t represents the cell type
String cellType = attributes.getValue("t");
if (cellType != null && cellType.equals("s")) {
// cell type s means value will be extracted from SharedStringsTable
fromSST = true;
}
// element v represents value of Cell
} else if (name.equals("v")) {
isCellValue = true;
}
}
@Override
public void characters(char[] ch, int start, int length) throws SAXException {
if (isCellValue) {
contents += new String(ch, start, length);
}
}
@Override
public void endElement(String uri, String localName, String name) throws SAXException {
if (isCellValue && fromSST) {
int index = Integer.parseInt(contents);
contents = new XSSFRichTextString(sharedStringsTable.getEntryAt(index)).toString();
System.out.println(contents);
isCellValue = false;
fromSST = false;
}
}
}
}
In the above example, we have read one XLSX file using XSSFReader. As our SAX XML parser, an instance of org.xml.sax.XMLReader (which has been created from org.xml.sax.helpers.XMLReaderFactory) is used.
XMLReader needs to use a ContentHandler implementation. We have extended the org.xml.sax.helpers.DefaultHandler class and used the instance of it as the ContentHandler. We have overridden the methods startElement(), characters(), and endElement(), which make the basic building blocks of the SAX XML parser.
All the example codes of this article are available in this repository.
Conclusion
In this article, we learned the basic usage of the Apache POI library. We have explored various approaches of reading and writing Excel files. We have also learned how to read and write Excel files using a low memory footprint.
Opinions expressed by DZone contributors are their own.
Comments