Extracting Data From Very Large XML Files With X-definition
X-definition is an open-source Java API that can be used to extract data from XML files regardless of their size. In this tutorial, see X-definition in action.
X-definition is an open-source Java API that can be used to extract data from XML files regardless of their size. It will not cause the Java Virtual Machine to run out of heap memory, nor does it require that your Java code step through the parts of your XML in order of occurrence until the data you need is reached. It requires little more than a markup model of your XML document and about 90 to 120 seconds of processing time for each gigabyte of XML data.
In this article, we'll download a modest (2.5 GB) file from data.discogs.com and extract data from it using a minimum of code. Our X-definition instructions will amount to the following:
<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='onTrue out(getText() + "\t");' xd:script='options ignoreOther; occurs *; finally outln();forget;'>
<main_release>
onTrue out(getText());
</main_release>
<artists>
<artist xd:script="options ignoreOther;occurs *;">
<id>
onTrue out("\t" + getText());
</id>
<name>
onTrue out("\t" + getText());
</name>
</artist>
</artists>
</master>
</masters>
</xd:def>
In order to call the X-definition API, all we'll need is this code:
import org.xdef.sys.NullReportWriter;
import org.xdef.XDDocument;
import org.xdef.XDFactory;
import org.xdef.XDPool;
public class Xdefinition {
public static void main(String[] argv) throws Exception
{
Xdefinition xdefinition = new Xdefinition();
}
{
XDPool xpool = XDFactory.compileXD(null,"markup.xdef");
XDDocument xdoc = xpool.createXDDocument();
xdoc.xparse("discogs_20211201_masters.xml",new NullReportWriter(false));
}
}
There will be nothing else to write, compile, or run.
Downloading the XML Document
From the download site, I used a file from the 2021 directory: the archive is named discogs_20211201_masters.xml.gz. It's the second-largest file for the month in question (2.5 GB, as I mentioned earlier). The largest (discogs_20211201_releases.xml.gz) exceeds 60 GB when decompressed.
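The archive has to be decompressed before the file name used later in this article (discogs_20211201_masters.xml) exists. On most systems gunzip will do; the following is only a minimal sketch of the same step in plain Java, using the JDK's GZIPInputStream so that the file is streamed rather than loaded into memory. The class and method names (GunzipSketch, gunzip) are made up for illustration and have nothing to do with X-definition.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.zip.GZIPInputStream;

// Hypothetical helper class: JDK only, not part of the X-definition API.
final class GunzipSketch {

    /** Streams the gzip archive at source into target; returns the bytes written. */
    static long gunzip(Path source, Path target) throws IOException {
        // 64 KB buffer for the inflater; the archive is never held in memory whole.
        try (InputStream in = new GZIPInputStream(Files.newInputStream(source), 1 << 16)) {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // File names as used in this article; both are assumed to be in the working directory.
        long bytes = gunzip(Paths.get("discogs_20211201_masters.xml.gz"),
                            Paths.get("discogs_20211201_masters.xml"));
        System.out.println(bytes + " bytes written");
    }
}
```

Because the copy is streamed, the JVM's default heap should be sufficient whatever the size of the archive.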
Downloading X-definition
You can download X-definition here. The only file that you'll need in your classpath is xdef-41.0.2.jar, located in the archive's xdef directory.
The archive also contains API documentation, source code (including the code for both the API's interfaces and Syntea's implementations of them; the JavaDoc furnished currently covers only the former), and various user manuals in PDF format. The user manual to which I'll refer throughout this article is xdef-4.1.pdf (Language Description), which can also be found on GitHub (and will be linked to throughout when opportunity knocks).
Preparing the .xdef File
We saw the .xdef file in its entirety a moment ago: it bore an XML declaration and consisted of markup written in XML. In the manual, either it or an xd:def element is frequently referred to as an "X-definition".
What that more or less indicates is that it is a [perhaps abbreviated] model of an XML document. "Model" is a term the manual uses at least once. The general idea is to offer a manner of representing an existing XML document that parallels the human-readable document to hand a little more obviously than an XML schema definition might.
Indeed, as we begin writing our .xdef file, we can proceed much as we would if we were making an empty copy of the document we downloaded. That's not readily done when the document is far too large to open in a text editor, but it's not impossible. On a Linux system, we can open the XML file with the less command and discover the document element, which is called masters. Once we have done this, we have enough information to write our .xdef file's enclosing element:
<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
</xd:def>
We're not required to do more than identify the document root by means of the root attribute (and also declare the X-definition namespace in order to use the prefix xd throughout). There is an optional name attribute that we can use to identify the xd:def element itself, but we would only need it if we were incorporating more than one in a project. All we'll need when we call the API later will be the path to our .xdef file.
Appendix A of the manual contains a description in the standard (Backus-Naur) notation of the xd:def element and the other elements and attributes that are permitted in an .xdef file.
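If less isn't available, the name of the document element can also be discovered programmatically. What follows is a minimal sketch that uses the JDK's own StAX parser (javax.xml.stream), not X-definition, and stops at the first start tag, so only the opening bytes of the file are ever read. RootElementSketch and rootElementName are names invented for this illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class: plain JDK StAX, not the X-definition API.
final class RootElementSketch {

    /** Returns the local name of the first element in the stream, or null if there is none. */
    static String rootElementName(InputStream in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE); // no DTD processing needed
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                // Comments and processing instructions may precede the root; skip them.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    return reader.getLocalName();
                }
            }
            return null;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException, XMLStreamException {
        // Defaults to the file name used in this article when no argument is given.
        String file = args.length > 0 ? args[0] : "discogs_20211201_masters.xml";
        try (InputStream in = Files.newInputStream(Paths.get(file))) {
            System.out.println(rootElementName(in));
        }
    }
}
```

The value printed is what belongs in the root attribute of the xd:def element.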
When we glance at our XML document, it becomes apparent that the masters element encloses master elements. The following is to some extent representative of a master element:
<master id="2407459">
<main_release>6201234</main_release>
<images>
<image type="primary" uri="" uri150="" width="600" height="600"/>
<image type="secondary" uri="" uri150="" width="600" height="597"/>
</images>
<artists>
<artist>
<id>4054418</id>
<name>Lee Caron</name>
<anv></anv>
<join></join>
<role></role>
<tracks></tracks>
</artist>
</artists>
<genres>
<genre>Rock</genre>
<genre>Pop</genre>
</genres>
<styles>
<style>Rhythm &amp; Blues</style>
<style>Rock &amp; Roll</style>
</styles>
<year>1955</year>
<title>Back To An Empty Room</title>
<data_quality>Correct</data_quality>
<videos>
<video src="https://www.youtube.com/watch?v=2h-xb5bUhTE" duration="185" embed="true">
<title>Back To An Empty Room by Lee Caron</title>
<description>Please leave comment.</description>
</video>
</videos>
</master>
It isn't completely representative: for instance, if we look further at the document, we'll learn that an artists element sometimes encloses more than one artist element. Also, not every master element contains a videos element. As long as we appreciate that there can be consecutive artist elements, however, we'll find that what we know thus far about our XML document is sufficient.
We'll focus on our master, main_release, and artists elements. Our objective will amount to nothing more than a simple text table. Each row will contain the master element's id attribute, the main_release element's content, and the content of each id and name element that is a child of an artist element.
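To make the target row format concrete, here is a hedged cross-check written with the JDK's StAX parser rather than with X-definition. It walks a cut-down copy of the sample master element above and emits the row described: the id attribute, the main_release content, then the id and name of every artist, all tab-separated and ended by a line break. The class name MasterRowsSketch and the abbreviated sample string are my own; this is the step-by-step style of processing that the .xdef file will spare us, shown only so that there is something to compare output against.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical cross-check class: plain JDK StAX, not the X-definition API.
final class MasterRowsSketch {

    /** Builds one tab-separated row per master element found in the XML. */
    static String rows(Reader xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        XMLStreamReader reader = factory.createXMLStreamReader(xml);
        StringBuilder out = new StringBuilder();
        Deque<String> path = new ArrayDeque<>(); // names of the currently open elements
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                path.addLast(reader.getLocalName());
                String where = String.join("/", path);
                if (where.equals("masters/master")) {
                    out.append(reader.getAttributeValue(null, "id")).append('\t');
                } else if (where.equals("masters/master/main_release")) {
                    out.append(reader.getElementText());
                    path.removeLast(); // getElementText() already consumed the end tag
                } else if (where.equals("masters/master/artists/artist/id")
                        || where.equals("masters/master/artists/artist/name")) {
                    out.append('\t').append(reader.getElementText());
                    path.removeLast(); // same reason as above
                }
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                if (String.join("/", path).equals("masters/master")) {
                    out.append('\n'); // the row ends when its master element closes
                }
                path.removeLast();
            }
        }
        reader.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        // Abbreviated version of the sample master element shown above.
        String sample = "<masters><master id=\"2407459\">"
                + "<main_release>6201234</main_release>"
                + "<artists><artist><id>4054418</id><name>Lee Caron</name><anv></anv></artist></artists>"
                + "<year>1955</year></master></masters>";
        System.out.print(rows(new StringReader(sample)));
    }
}
```

For the sample, the row consists of 2407459, 6201234, 4054418, and Lee Caron, separated by tabs. A master with several artist elements simply yields a longer row, with one id and name pair per artist.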
Producing that table is not a challenge as long as our .xdef file can identify the elements or attributes in which we're interested. The first of these is the element that our xd:def element's root attribute designated earlier on:
<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
</masters>
</xd:def>
We've incorporated masters in our .xdef file in such a way that it will enclose the other elements, as it does in our XML document. However, what we've typed is less simple than it appears, because it is equivalent to:
<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters xd:script="required">
</masters>
</xd:def>
We'll use the auxiliary xd:script attribute elsewhere, but where it isn't used, or doesn't contain what we will soon learn is a quantifier (or, as above, a keyword that is equivalent to a quantifier, in this case to occurs 1), X-definition will default to required, which translates to "occurs once and only once".
Because the master element not only contains an attribute (id) that we will need, but also encloses other elements whose content we will need, it will demand more of our attention than the other elements. Uniquely in our .xdef file, it will have one attribute corresponding to an attribute in the XML document, as well as one auxiliary xd:script attribute:
<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='' xd:script=''>
</master>
</masters>
</xd:def>
The values will be single-quoted in order to accommodate double quotes and will consist of what X-definition's developers call X-script.
The first of the two attributes to think about is xd:script. Please bear in mind: once both have been filled in, much of what we really need to do where the .xdef file is concerned will be finished.
I noted earlier that the master element encloses other elements (main_release, artists) that we will need. Not all of its children are needed, however, and if we include an options section in our xd:script attribute's value, we can tell X-definition to take account of only those we mention in our .xdef file:
<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='' xd:script='options ignoreOther;'>
</master>
</masters>
</xd:def>
We'll need to do that any time the element we are on contains child elements we wish to ignore. Otherwise, X-definition will generate error output (in our case invisibly, as we'll see when calling the API in our Java code) whenever elements that we omit from our .xdef file are detected in our XML document. The options section is discussed in section 4.1.22 of the manual.
What we need to consider next is the number of times the master element potentially occurs. Where masters was concerned, we declined to do so because X-definition's default (understood as the required keyword, equivalent to occurs 1) was already acceptable. If we do the same here, just one master element will be processed (and an error message will be generated for each of the rest). We can indicate that master occurs zero or more times:
<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='' xd:script='options ignoreOther;occurs *;'>
</master>
</masters>
</xd:def>
A quantifier like the above (occurs *) will be needed every time we know that an element (e.g., artist) can occur more than once. Section 4.1.10 of the manual discusses quantifiers.
The last thing that we need to consider (for the moment) about our master element's xd:script attribute has to do with the very large number of master elements (more than 1.5 million) at our disposal. In order for each master element to be removed from memory once X-definition is ready to move along to the following master element, the forget keyword needs to be placed at the very tail of the xd:script attribute's value:
<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='' xd:script='options ignoreOther;occurs *;forget;'>
</master>
</masters>
</xd:def>
Had our XML document contained any other elements on the same level of the tree as master, and had they been referred to as well in our .xdef file, the forget keyword would have been needed in their context as well as in that of master. Otherwise, the heap memory available to the JVM could ultimately be exhausted. The forget keyword is discussed in section 4.1.9 of the manual.
Our X-script thus far has been assigned to a part of a document (an element) by means of an auxiliary attribute (xd:script). We can do similarly for an attribute or a text node known to exist in our XML document by simply entering the X-script in the corresponding location in our .xdef file: for our master element's id attribute, that location is within the single quotation marks. The id attribute's value as found in our XML document is the source of the initial cell of each row of our text table:
<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='onTrue out(getText() + "\t");' xd:script='options ignoreOther;occurs *;forget;'>
</master>
</masters>
</xd:def>
The onTrue keyword refers to an event. When X-definition alights on a master element, its id attribute becomes available. In the present case, the onTrue event occurs if the id attribute bears a string of any length as its value. Were it indicated elsewhere in the X-script that the value of id must be an integer, and were X-definition unable to interpret that value as such, the onFalse event would occur instead. Events in X-script have corresponding actions. Here, the X-script defines the action as writing the id attribute's value, followed by a tab character, to standard output. Events are discussed in sections 2.10 and 4.1.9 of the manual; the getText() and out() functions are in section 4.1.19. Section 7.1 (6.1 in the version on GitHub) discusses the order in which X-definition processes the parts of an XML document.
Before leaving the master element, X-definition will extract the rest of the data for our table row from its child elements. We can terminate our table row once it has finished with the main_release and artists elements (which is to say, once the finally event has occurred in the master element's context) by revisiting our xd:script auxiliary attribute and adding code to it:
<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='onTrue out(getText() + "\t");' xd:script='options ignoreOther;occurs *;finally outln();forget;'>
</master>
</masters>
</xd:def>
A line break will terminate each row.
Partly because our text table presents its data in the order in which it occurs in the XML document — we therefore don't need to define variables for storing any — remarkably little code is left to write. We won't need to refer to any more attributes in the XML document, so adding our remaining elements to what we have got so far is really easy:
<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='onTrue out(getText() + "\t");' xd:script='options ignoreOther;occurs *;finally outln();forget;'>
<main_release>
</main_release>
<artists>
<artist>
<id>
</id>
<name>
</name>
</artist>
</artists>
</master>
</masters>
</xd:def>
The text content of the main_release, id, and name elements will furnish the rest of the data for our table row.
The X-script we'll add to main_release is identical to that used earlier for the master element's id attribute, except for the tab character (which at this point would prove redundant):
<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='onTrue out(getText() + "\t");' xd:script='options ignoreOther;occurs *;finally outln();forget;'>
<main_release>
onTrue out(getText());
</main_release>
<artists>
<artist>
<id>
</id>
<name>
</name>
</artist>
</artists>
</master>
</masters>
</xd:def>
Whereas before we added it where the master element's id attribute's value would have appeared in the XML document, we've added it above where the main_release element's text content would go. The getText() function will obtain the main_release element's text content, and the out() function will send it to standard output.
The way we deal with the artist element parallels what we did earlier in the master element's xd:script attribute:
<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='onTrue out(getText() + "\t");' xd:script='options ignoreOther;occurs *;finally outln();forget;'>
<main_release>
onTrue out(getText());
</main_release>
<artists>
<artist xd:script="options ignoreOther;occurs *;">
<id>
</id>
<name>
</name>
</artist>
</artists>
</master>
</masters>
</xd:def>
As before, leaving out the ignoreOther option will ensure copious error output, because we don't plan on taking account of every last one of the artist element's children, and omitting the quantifier (occurs *) will cause still more error output and also result in only the first artist element being visited.
Finishing up is as easy as adding X-script to the id and name elements where the text content would occur in the original XML document:
<?xml version="1.0" encoding="UTF-8"?>
<xd:def xmlns:xd="http://www.xdef.org/xdef/4.1" root="masters">
<masters>
<master id='onTrue out(getText() + "\t");' xd:script='options ignoreOther;occurs *;finally outln();forget;'>
<main_release>
onTrue out(getText());
</main_release>
<artists>
<artist xd:script="options ignoreOther;occurs *;">
<id>
onTrue out("\t" + getText());
</id>
<name>
onTrue out("\t" + getText());
</name>
</artist>
</artists>
</master>
</masters>
</xd:def>
When X-definition detects text in either element, the element's content (preceded by a tab character) is written to standard output. As noted earlier, the finally clause in the master element's xd:script attribute will terminate the table row at last.
With our .xdef file complete (just as we previewed it at the start of this article), we're ready to run X-definition!
Running X-definition
Instructions for calling the X-definition API are contained in sections 9 and 9.1 of the manual (8.1 in the version on GitHub). Much of the work that will go into our Java source file is to do with importing the handful of interfaces and classes that our code will require:
import org.xdef.sys.NullReportWriter;
import org.xdef.XDDocument;
import org.xdef.XDFactory;
import org.xdef.XDPool;
The main method really needn't do more than instantiate our driver class, which I've named Xdefinition:
import org.xdef.sys.NullReportWriter;
import org.xdef.XDDocument;
import org.xdef.XDFactory;
import org.xdef.XDPool;
public class Xdefinition {
public static void main(String[] argv) throws Exception
{
Xdefinition xdefinition = new Xdefinition();
}
As was previewed at the beginning, the initializer block will consist of just three lines of code.
Our first line needs to compile our .xdef file, which I named "markup.xdef", into an XDPool:
XDPool xpool = XDFactory.compileXD(null,"markup.xdef");
XDPool is defined in the API as an interface: here, you obtain an instance by passing the name of the .xdef file to the XDFactory class's compileXD() method. Were the first argument to compileXD() not null, it would have supplied configuration properties (details are in the API documentation for org.xdef.XDFactory).
The next line of the initializer block is simple:
XDDocument xdoc = xpool.createXDDocument();
We just call our XDPool instance's createXDDocument() method in order to obtain an XDDocument instance. XDDocument is defined in the API as an interface. The API JavaDoc implies that instances thereof do the actual processing of our XML document.
Indeed, the final line is what triggers the processing:
xdoc.xparse("discogs_20211201_masters.xml",new NullReportWriter(false));
The XML document (discogs_20211201_masters.xml) has been presumed to be located in the same directory as the .xdef file passed earlier to compileXD(). Our XDDocument instance's xparse() method effectively starts X-definition. X-definition more or less requires that any error output go somewhere: accordingly, an instance of a dedicated class defined in the API, NullReportWriter, is passed as the second argument to xparse(), after the location of the source XML. The API documentation for org.xdef.sys.NullReportWriter suggests the potential consequences of passing true rather than false as a parameter when making a new instance.
The code we will compile, including the initializer block, looks like this:
import org.xdef.sys.NullReportWriter;
import org.xdef.XDDocument;
import org.xdef.XDFactory;
import org.xdef.XDPool;
public class Xdefinition {
public static void main(String[] argv) throws Exception
{
Xdefinition xdefinition = new Xdefinition();
}
{
XDPool xpool = XDFactory.compileXD(null,"markup.xdef");
XDDocument xdoc = xpool.createXDDocument();
xdoc.xparse("discogs_20211201_masters.xml",new NullReportWriter(false));
}
}
In order for it to compile by means of the javac command, we need to have the xdef-41.0.2.jar file on our classpath. We hard-coded the locations of our .xdef file and our large XML document, so all we need in order to finally run X-definition is a command like:
java -cp .:xdef-41.0.2.jar Xdefinition
When X-definition runs, the text table will spill out onto the terminal unless your command's output was redirected. What you won't see in any case are error messages, because the NullReportWriter class has the distinction of making the error output disappear. You don't even learn whether or not there was any.
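Redirecting the command's output in the shell (appending something like > masters.tsv) is the simplest way to keep the table. If you'd rather capture it from within Java, the sketch below swaps System.out for a file-backed stream while a task runs and restores it afterward. It assumes that X-script's out() and outln() write through System.out when they are called, which I haven't verified against the implementation; the class StdoutToFileSketch, its method, and the file name masters.tsv are all invented for this example.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class: JDK only, not part of the X-definition API.
final class StdoutToFileSketch {

    /** Runs task with System.out pointed at target, then restores the original stream. */
    static void runWithStdoutTo(Path target, Runnable task) throws IOException {
        PrintStream original = System.out;
        try (PrintStream file = new PrintStream(
                new BufferedOutputStream(Files.newOutputStream(target)), false, "UTF-8")) {
            System.setOut(file);
            task.run();
        } finally {
            // Runs after the file stream has been flushed and closed.
            System.setOut(original);
        }
    }

    public static void main(String[] args) throws IOException {
        // Placeholder task: in this article's program it would be () -> new Xdefinition().
        runWithStdoutTo(Paths.get("masters.tsv"),
                () -> System.out.println("2407459\t6201234\t4054418\tLee Caron"));
    }
}
```

The buffered stream matters for a table of more than a million rows, since unbuffered writes to a file would slow the run considerably.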
The alternative to passing a NullReportWriter instance to xparse() is to pass an instance of org.xdef.sys.FileReportWriter. Had we chosen to do so, our code could then have been as follows:
import org.xdef.sys.FileReportWriter;
import org.xdef.XDDocument;
import org.xdef.XDFactory;
import org.xdef.XDPool;
public class Xdefinition {
public static void main(String[] argv) throws Exception
{
Xdefinition xdefinition = new Xdefinition();
}
{
XDPool xpool = XDFactory.compileXD(null,"markup.xdef");
XDDocument xdoc = xpool.createXDDocument();
xdoc.xparse("discogs_20211201_masters.xml",new FileReportWriter("/dev/stderr"));
}
}
The outcome will be identical if the above code is compiled and run, simply because I made sure earlier that our .xdef file wouldn't cause X-definition to detect any errors. If you use the FileReportWriter class and really do have errors, you'll likely wind up with a terminal (or a file: above, I used the location of my system's standard error stream as the FileReportWriter constructor's argument rather than store potential error messages on disk) full of verbose output. Even so, your code will still run, and you might not have to take account of the errors provided the output is all that you anticipated. Strictly avoiding errors as I've done here is, I'm afraid, less a question of convenience than of using X-definition consistently with its overall design.
You see, X-definition is designed principally for checking whether a given XML document is valid while you process it. We didn't purposefully attempt it here, but you can, if you wish, set the criteria much as you could were you using a schema-aware XSLT processor instead. Quite as iText's developer built on his knowledge of the PDF specification, the developers of X-definition familiarized themselves with XSD and devised an alternative means of realizing its possibilities.