SAX event driven analysis of large amount of Excel files (personally test 30000 lines of Excel data and read without pressure)

SAX event driven parsing of Excel files

Recently, when doing project functions, I found that when the API provided by traditional Apache POI parses a large number of data Excel files, it is easy to throw an exception of memory overflow, which makes it difficult to read Excel data into memory. Therefore, I found a lot of information on the Internet. Finally, I found the SAX event driven method to analyze Excel files, which perfectly solved the problem of reading large amounts of data. Here, I'll simply sort it out and share it with you!

1. Two analytical methods

There are two modes for Apache POI to read Excel files: user mode and SAX event driven mode. The two methods are not good or bad, but they are suitable for different application scenarios.

1. User mode: it provides rich API interfaces, such as Workbook, Sheet, Row and Cell, which can easily parse Excel. It is also a way used more in practical development.

Workbook wb = new XSSFWorkbook(inputStream); // Analysis xlsx
Workbook wb = new HSSFWorkbook(inputStream); // Analysis xls

Sheet sheet = wb.getSheetAt(index); // Get sheet
Row row = sheet.getRow(index); // Get row
Cell cell = row.getCell(index); // Get cell

However, this mode will load all the contents of the file into memory, which consumes a lot of memory! When encountering large sheet s (such as more than 100000), it is easy to throw OOM and other exceptions, resulting in file parsing failure. Although batch or line by line processing can also be solved, it has certain limitations and is not easy to process data flexibly.

2. Sax event driven mode: sax is called Simple API for XML, which is both an interface and a software package. It is an alternative to XML parsing. Convert excel into XML, and then use SAXParser parser to parse the XML file to get Excel data (SAX is different from DOM parsing. It scans the document line by line and parses while scanning. Since the application only checks the data when reading the data, it does not need to store the data in memory, which is a great advantage for the parsing of large documents), it can effectively solve the memory overflow and is suitable for large file processing, This is also the way to solve memory overflow recommended by POI officials!

Coordinate dependent access address

<!-- https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>4.1.2</version>
</dependency>

2. Quick start SAX event driven parsing

In order to operate data faster, consume less memory and extract cell data more conveniently, we finally choose to save Excel data in the form of row and column coordinates (key) - value into a LinkedHashMap.

First, import the class sheethandler Java, this class can be placed on the tool class package path. This class rewrites the methods of startElement(), endElement(), and characters(), and encapsulates the data into LinkedHashMap through these methods!

SheetHandler.java

/**
 * @description: Rewrite startElement() endElement() characters() to encapsulate Excel data into LinkedHashMap < string, string >
 * @date: 2021/12/16 15:10
 */
public class SheetHandler extends DefaultHandler {
    private SharedStringsTable sst;
    private String lastContents;
    private boolean nextIsString;
    private String cellPosition;
    private LinkedHashMap<String, String> rowContents = new LinkedHashMap<String, String>();

    public LinkedHashMap<String, String> getRowContents() {
        return rowContents;
    }

    public void setRowContents(LinkedHashMap<String, String> rowContents) {
        this.rowContents = rowContents;
    }

    public SheetHandler(SharedStringsTable sst) {
        this.sst = sst;
    }

    public void startElement(String uri, String localName, String name, Attributes attributes) throws SAXException {
        if (name.equals("c")) {
            cellPosition = attributes.getValue("r");
            String cellType = attributes.getValue("t");
            if (cellType != null && cellType.equals("s")) {
                nextIsString = true;
            } else {
                nextIsString = false;
            }
        }
        // Clear cache content
        lastContents = "";
    }

    public void endElement(String uri, String localName, String name)
            throws SAXException {
        if (nextIsString) {
            int idx = Integer.parseInt(lastContents);
            lastContents = new XSSFRichTextString(sst.getEntryAt(idx)).toString();
            nextIsString = false;
        }
        if (name.equals("v")) {
            // After reading the data, save the cell coordinates and contents into the map
            if (!(cellPosition.length() == 2) || (cellPosition.length() == 2 && !"1".equals(cellPosition.substring(1)))) { //Do not save the first row of data
                rowContents.put(cellPosition, lastContents);
            }
        }
    }

    public void characters(char[] ch, int start, int length) throws SAXException {
        lastContents += new String(ch, start, length);
    }
}

Write the test code in the main method to test the reading of Excel data

/**
 * @description:
 * @author: laizhenghua
 * @date: 2022/5/17 20:03
 */
public class SAXTest {
    public static void main(String[] args) throws IOException {
        String filePath = "C:\\Users\\laizhenghua\\Desktop\\test.xlsx";
        InputStream sheet = null;
        OPCPackage pkg = null;
        SheetHandler sheetHandler = null;
        try {
            pkg = OPCPackage.open(filePath); // You can specify a File path / File class / input stream, etc
            XSSFReader reader = new XSSFReader(pkg);
            SharedStringsTable table = reader.getSharedStringsTable(); // Shared string table
            sheetHandler = new SheetHandler(table);
            XMLReader parser = XMLReaderFactory.createXMLReader("com.sun.org.apache.xerces.internal.parsers.SAX Parser");
            parser.setContentHandler(sheetHandler);
            sheet = reader.getSheet("rId1"); // rId1 is Sheet1, rid2 is sheet2, and so on
            InputSource sheetSource = new InputSource(sheet);
            parser.parse(sheetSource); // In the process of parsing each record in excel, the three functions of startElement() characters() endElement() will be executed in turn
            LinkedHashMap<String, String> rowContents = sheetHandler.getRowContents();
            System.out.println(rowContents);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (sheet != null) {
                sheet.close();
            }
            if (pkg != null) {
                pkg.close();
            }
        }
    }
}

The implementation results are as follows

3. Tool class encapsulation

Through the above methods, we can read the data into the memory. It is actually very simple to calculate the data or save it to the database~

In order to facilitate code reuse and customization, the code for parsing Excel data is encapsulated into a tool class excelutil Java is equivalent to introducing these two encapsulated tool classes to read Excel data in this way in the project.

SheetHandler.java

ExcelUtil.java

/**
 * For the analysis of Excel files with a large amount of data (more than 80000 pieces), the excel file is parsed into the form of row and column coordinates value and stored in the map. This method has high speed and small memory consumption, but it can only read Excel files
 * It provides the method processonesheet (string filename) for processing a single sheet and the method processallsheets (string filename) for processing multiple sheets
 * After the file name is passed in, you only need to call the file name + excel.util method to finish processing The parsed data can be obtained by the return value of getrowcontents()
 */
public class ExcelUtil {
    private static LinkedHashMap<String, String> rowContents = new LinkedHashMap<String, String>();
    private static SheetHandler sheetHandler;

    public LinkedHashMap<String, String> getRowContents() {
        return rowContents;
    }

    public static void setRowContents(LinkedHashMap<String, String> rc) {
        rowContents = rc;
    }

    public SheetHandler getSheetHandler() {
        return sheetHandler;
    }

    public static void setSheetHandler(SheetHandler sh) {
        sheetHandler = sh;
    }

    public static List<LinkedHashMap<String, String>> processSheetByRId(InputStream in, Integer count) throws Exception {
        OPCPackage pkg = null;
        InputStream sheet = null;
        List<LinkedHashMap<String, String>> results = new ArrayList<LinkedHashMap<String, String>>();
        try {
            pkg = OPCPackage.open(in);
            XSSFReader r = new XSSFReader(pkg);
            SharedStringsTable sst = r.getSharedStringsTable();

            for (int i = 0; i < count; i++) {
                sheet = r.getSheet("rId" + (i + 1));
                InputSource sheetSource = new InputSource(sheet);
                XMLReader parser = fetchSheetParser(sst);
                parser.parse(sheetSource);
                results.add(sheetHandler.getRowContents());
            }
            return results;
        } catch (Exception e) {
            e.printStackTrace();
            throw e;
        } finally {
            if (pkg != null) {
                pkg.close();
            }
            if (sheet != null) {
                sheet.close();
            }
        }
    }

    // Process a sheet
    public static void processOneSheet(String filename) throws Exception {
        InputStream sheet2 = null;
        OPCPackage pkg = null;
        try {
            pkg = OPCPackage.open(filename);
            XSSFReader r = new XSSFReader(pkg);
            SharedStringsTable sst = r.getSharedStringsTable();
            XMLReader parser = fetchSheetParser(sst);
            sheet2 = r.getSheet("rId1");
            InputSource sheetSource = new InputSource(sheet2);
            parser.parse(sheetSource);
            setRowContents(sheetHandler.getRowContents());
        } catch (Exception e) {
            e.printStackTrace();
            throw e;
        } finally {
            if (pkg != null) {
                pkg.close();
            }
            if (sheet2 != null) {
                sheet2.close();
            }
        }
    }

    // Process multiple sheet s
    public static void processAllSheets(String filename) throws Exception {
        OPCPackage pkg = null;
        InputStream sheet = null;
        try {
            pkg = OPCPackage.open(filename);
            XSSFReader r = new XSSFReader(pkg);
            SharedStringsTable sst = r.getSharedStringsTable();
            XMLReader parser = fetchSheetParser(sst);
            Iterator<InputStream> sheets = r.getSheetsData();
            while (sheets.hasNext()) {
                System.out.println("Processing new sheet:\n");
                sheet = sheets.next();
                InputSource sheetSource = new InputSource(sheet);
                parser.parse(sheetSource);
            }
        } catch (Exception e) {
            e.printStackTrace();
            throw e;
        } finally {
            if (pkg != null) {
                pkg.close();
            }
            if (sheet != null) {
                sheet.close();
            }
        }
    }

    public static XMLReader fetchSheetParser(SharedStringsTable sst) throws SAXException {
        XMLReader parser = XMLReaderFactory.createXMLReader("com.sun.org.apache.xerces.internal.parsers.SAXParser");
        setSheetHandler(new SheetHandler(sst));
        ContentHandler handler = (ContentHandler) sheetHandler;
        parser.setContentHandler(handler);
        return parser;
    }

    /**
     * See org.xml.sax.helpers.DefaultHandler
     */
    // test
    public void test() throws Exception {
        Long time = System.currentTimeMillis();
        ExcelUtil example = new ExcelUtil();
        example.processOneSheet("C:/Users/Desktop/2018041310024112.xlsx");
        Long endtime = System.currentTimeMillis();
        LinkedHashMap<String, String> map = example.getRowContents();
        Iterator<Entry<String, String>> it = map.entrySet().iterator();
        int count = 0;
        String prePos = "";
        while (it.hasNext()) {
            Map.Entry<String, String> entry = (Map.Entry<String, String>) it.next();
            String pos = entry.getKey();
            if (!pos.substring(1).equals(prePos)) {
                prePos = pos.substring(1);
                count++;
            }
            System.out.println(pos + ";" + entry.getValue());
        }
        System.out.println("Parse data" + count + "strip;time consuming" + (endtime - time) / 1000 + "second");
    }
}


END

Tags: Java

Posted by mars16 on Wed, 18 May 2022 03:32:18 +0300