Sorting Large CSV Files Using SQLite
Sorting large CSV files can be a pain. In this post, we take a look at how to ease that pain with some help from SQLite and JDBC.
“When you’re at the end of your rope, tie a knot and hold on.”
― Theodore Roosevelt
Review
We are trying to sort a large CSV file. The file contains a couple of million rows – not large by “big-data” standards, but large enough to run into problems working with it.
In a previous article, we saw how Excel failed with a memory error, unable to load the CSV file.
An attempt to load the data into an ArrayList failed with an OutOfMemoryError.
Using SQLite
The first database option we tried for loading the CSV was H2.
Another option for a “desktop” database is SQLite — a very popular embedded database in the C/C++ world that is reputed to be very fast. Let’s use that to load our data and manipulate it in our application.
SQLite’s command-line shell supports loading a CSV file directly, using the “.mode csv” and “.import” commands. However, it did not quite work right for me, and it stumbled on quoted fields, multi-line fields, and other idiosyncrasies found in some CSV files. So I am going to write code using JDBC to import and sort the data.
SQLite With JDBC
SQLite is a database library suitable for embedding within C/C++ programs. There is no “official” JDBC driver for it; however, some nice folks have developed one, which we are going to use. It appears to be under active development (as of March 2017).
Download the JDBC driver from the xerial sqlite-jdbc project, or use this POM entry:
<dependency>
<groupId>org.xerial</groupId>
<artifactId>sqlite-jdbc</artifactId>
<version>3.16.1</version>
</dependency>
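With the driver on the classpath, opening a connection takes a single call. Here is a minimal sketch; the database file name sample.db is just an example, and recent versions of the xerial driver register themselves with DriverManager, so no explicit Class.forName is needed:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class OpenDb {
    // "jdbc:sqlite:<path>" opens (or creates) that database file;
    // "jdbc:sqlite::memory:" gives a throwaway in-memory database.
    static Connection open(String url) throws SQLException {
        return DriverManager.getConnection(url);
    }

    public static void main(String[] args) throws SQLException {
        try (Connection con = open("jdbc:sqlite:sample.db")) {
            System.out.println("SQLite ready: " + !con.isClosed());
        }
    }
}
```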
Create Table From Header
Generally, the first line of a CSV file contains column headers. We are going to use this header line to create a table.
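One minimal way to obtain the list of field names used in the code below is to read the first line and split it on commas. This is a sketch that assumes plain, unquoted header fields (headers containing quoted or embedded commas would need real CSV parsing):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;

public class HeaderReader {
    // Reads the first line of the CSV and splits it into column names.
    // Assumes the header contains no quoted or embedded commas.
    static List<String> readHeader(BufferedReader in) throws IOException {
        String line = in.readLine();
        return Arrays.asList(line.split(","));
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader(
            "yearID,teamID,lgID,playerID,salary\n2016,LAD,NL,kershcl01,33000000"));
        System.out.println(readHeader(in)); // [yearID, teamID, lgID, playerID, salary]
    }
}
```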
One nice thing about SQLite is that you can skip the column types when creating a table: SQLite uses dynamic typing. The following is a valid SQLite table definition:
create table sample(
yearID,
teamID,
lgID,
playerID,
salary
)
With this advantage, the code for creating a table is simple.
// Build "create table sample(col1, col2, ...)" from the header fields.
// Note: column names are taken verbatim from the header; quote them
// if they may contain spaces or SQL keywords.
StringBuilder sbuf = new StringBuilder();
sbuf.append("create table sample(\n");
for (int i = 0, n = fieldNames.size(); i < n; i++) {
    sbuf.append("  ").append(fieldNames.get(i));
    sbuf.append(i < n - 1 ? ",\n" : "\n");
}
sbuf.append(")");
con.createStatement().execute(sbuf.toString());
Transactions
In SQLite, by default, each statement is executed within the scope of its own transaction. That is, each statement can be assumed to be enclosed within a begin and commit block as follows:
begin transaction
sql statement ...
commit transaction
While this increases the reliability of the data, it makes bulk loading very, very slow for us. So what we are going to do is commit only once every N statements.
if (COMMIT_EVERY > 0 && uncommitted >= COMMIT_EVERY) {
    con.createStatement().execute("commit transaction");
    con.createStatement().execute("begin transaction");
    uncommitted = 0;
}
With this change, I saw a 10x improvement in performance.
Inserting the Data
We use a PreparedStatement to insert the data into the table.
// Build "insert into sample(col1, ...) values(?, ...)" with one
// placeholder per column.
StringBuilder sbuf = new StringBuilder("insert into sample(");
StringBuilder vbuf = new StringBuilder("values(");
for (int i = 0, n = fieldNames.size(); i < n; i++) {
    sbuf.append(fieldNames.get(i));
    if (i < n - 1) {
        sbuf.append(", ");
        vbuf.append("?, ");
    } else {
        sbuf.append(")\n");
        vbuf.append("?)\n");
    }
}
sbuf.append(vbuf);
return con.prepareStatement(sbuf.toString());
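Putting the prepared statement and the batched commits together, the load loop might look like the following. This is a self-contained sketch using an in-memory database; the sample rows, the COMMIT_EVERY value, and the load helper are illustrative, not names from the original program:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.List;

public class InsertLoop {
    static final int COMMIT_EVERY = 10000; // batch size; an example value

    // Inserts every row, committing after each COMMIT_EVERY statements.
    // Returns the number of rows inserted.
    static int load(Connection con, PreparedStatement stmt, List<List<String>> rows)
            throws SQLException {
        int uncommitted = 0, total = 0;
        con.createStatement().execute("begin transaction");
        for (List<String> row : rows) {
            for (int i = 0; i < row.size(); i++)
                stmt.setString(i + 1, row.get(i)); // JDBC parameters are 1-based
            stmt.executeUpdate();
            total++;
            if (++uncommitted >= COMMIT_EVERY) {
                con.createStatement().execute("commit transaction");
                con.createStatement().execute("begin transaction");
                uncommitted = 0;
            }
        }
        con.createStatement().execute("commit transaction"); // flush the tail
        return total;
    }

    public static void main(String[] args) throws SQLException {
        // An in-memory database keeps the example self-contained.
        try (Connection con = DriverManager.getConnection("jdbc:sqlite::memory:")) {
            con.createStatement().execute(
                "create table sample(yearID, teamID, lgID, playerID, salary)");
            PreparedStatement stmt = con.prepareStatement(
                "insert into sample(yearID, teamID, lgID, playerID, salary) "
                + "values(?, ?, ?, ?, ?)");
            List<List<String>> rows = Arrays.asList(
                Arrays.asList("2016", "LAD", "NL", "kershcl01", "33000000"),
                Arrays.asList("2016", "ARI", "NL", "greinza01", "31799030"));
            System.out.println(load(con, stmt, rows)); // 2
        }
    }
}
```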
Sort and Extract to CSV
After the data is inserted, we can sort the data with the required clause(s) and extract the data to CSV using a ResultSet.
String sortClause = "order by CAST(yearID as Integer) DESC, CAST(Salary as Integer) DESC, teamID";
String selectStmt = "select * from sample\n" + sortClause;
ResultSet rset = con.createStatement().executeQuery(selectStmt);
Main loop for extracting and exporting the data:
while (rset.next()) {
rowNum++;
for (int i = 0, n = fieldNames.size() ; i < n ; i++) {
out.print(encodeValue(rset.getString(i+1)));
if ( i < n - 1 ) out.print(',');
else out.println();
}
}
Encoding a field value to CSV:
static private Pattern rxquote = Pattern.compile("\"");
static private String encodeValue(String value) {
boolean needQuotes = false;
if ( value.indexOf(',') != -1 || value.indexOf('"') != -1 ||
value.indexOf('\n') != -1 || value.indexOf('\r') != -1 )
needQuotes = true;
Matcher m = rxquote.matcher(value);
if ( m.find() ) { needQuotes = true; value = m.replaceAll("\"\""); }
if ( needQuotes ) return "\"" + value + "\"";
else return value;
}
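The quoting rules above follow the common RFC 4180 convention: wrap a field in double quotes when it contains a comma, quote, or line break, and double any embedded quotes. A quick self-contained check of that behavior (the class name and the repetition of the logic are just for the demo):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CsvEncodeDemo {
    private static final Pattern rxquote = Pattern.compile("\"");

    // Same logic as encodeValue above: quote fields containing commas,
    // quotes, or line breaks, doubling any embedded quotes.
    static String encodeValue(String value) {
        boolean needQuotes = value.indexOf(',') != -1 || value.indexOf('"') != -1
            || value.indexOf('\n') != -1 || value.indexOf('\r') != -1;
        Matcher m = rxquote.matcher(value);
        if (m.find()) { needQuotes = true; value = m.replaceAll("\"\""); }
        return needQuotes ? "\"" + value + "\"" : value;
    }

    public static void main(String[] args) {
        System.out.println(encodeValue("plain"));      // plain
        System.out.println(encodeValue("a,b"));        // "a,b"
        System.out.println(encodeValue("say \"hi\"")); // "say ""hi"""
    }
}
```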
And here is the nicely sorted output data, ordered by yearID and salary (both descending), then by teamID.
yearID,teamID,lgID,playerID,salary
2016,LAD,NL,kershcl01,33000000
2016,ARI,NL,greinza01,31799030
2016,BOS,AL,priceda01,30000000
2016,DET,AL,cabremi01,28000000
...
For comparison purposes, the same data file that took 11 minutes with the H2 database took about 5 minutes with SQLite. Draw from that what you will, but keep in mind that these are too few data points to support a firm conclusion. The only thing worse than generalizing from one case is generalizing from no cases at all.
Summary
And that completes our exercise in sorting a large CSV file. Using a database seems to be the easiest way out. We have a choice of using the H2 database or SQLite, and there is not much to choose between the two.
Published at DZone with permission of Jay Sridhar, DZone MVB. See the original article here.