XML Processing Made Easy with Ballerina

Let's take a look at a modern approach to handling XML as built-in functionality in a programming language.

Anjana Fernando

Oct. 26, 20 · Tutorial

Introduction

The Ballerina programming language contains built-in support for XML data. It supports defining, validating, and manipulating XML directly from the language syntax itself. In this article, we will go through these features and how to use them effectively.

Creating and Manipulating XML

The first approach when defining an XML value in Ballerina is to use direct literals.

xml movie = xml `<Movie>
                   <Name>Jurassic Park</Name>
                   <Year>1993</Year>
                   <Director>Steven Spielberg</Director>
                 </Movie>`;

The XML value above is created using the XML literal syntax. In this way, the compiler identifies it specifically as an XML value and validates the literal. So if the XML value contains mistakes, such as mismatched start/end tags, we immediately get an error at compile time, and the IDE highlights the value as invalid.

An XML value in Ballerina is structured as a sequence of singleton XML values. These singletons are XML elements, processing instructions, comments, and text. The following example shows how to create a single XML value by combining two XML elements.

xml movie1 = xml `<Movie year="1993">
                    <Name>Jurassic Park</Name>
                    <Director>Steven Spielberg</Director>
                  </Movie>`;
 
xml movie2 = xml `<Movie year="1997">
                    <Name>Titanic</Name>
                    <Director>James Cameron</Director>
                  </Movie>`;
 
xml movieList = movie1 + movie2;

The above “movieList” contains an XML sequence of two XML element values. We can access individual items in the sequence similar to arrays by using the subscript operator in the following manner.

xml m1 = movieList[0];
xml m2 = movieList[1];

The built-in function “length” can be used to get the number of items in the sequence.

int n = movieList.length();

Other functions generally available for lists, such as “filter”, “forEach”, and “map”, can also be used for functional iteration operations.
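
As a minimal sketch of iterating over a sequence, the program below uses the “foreach” statement, which visits the same items that the library functions mentioned above operate on. The helper name “describeItems” is hypothetical and introduced only for this illustration.

```ballerina
import ballerina/io;

// Hypothetical helper: visits each item of an XML sequence in order and
// collects its string form.
function describeItems(xml seq) returns string[] {
    string[] lines = [];
    foreach var item in seq {
        lines.push(item.toString());
    }
    return lines;
}

public function main() {
    xml movie1 = xml `<Movie year="1993"><Name>Jurassic Park</Name></Movie>`;
    xml movie2 = xml `<Movie year="1997"><Name>Titanic</Name></Movie>`;
    xml movieList = movie1 + movie2;

    // One line is printed per item in the sequence.
    foreach var line in describeItems(movieList) {
        io:println(line);
    }
}
```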

XML literals can also be interpolated with expressions that provide parts of their content. This is done with the syntax “${expr}”. The expression can be an in-scope variable, a function call, or any other expression that returns a value supported in that position. In the example below, variables provide an integer and a string value for the movie year and director respectively.

int titanicYear = 1997;
string titanicDirector = "James Cameron";
 
xml movie2 = xml `<Movie year="${titanicYear}">
                    <Name>Titanic</Name>
                    <Director>${titanicDirector}</Director>
                  </Movie>`;
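
Interpolation also makes it easy to build elements from function parameters. The following is a sketch under that assumption; “toMovie” is a hypothetical helper, not part of the language library.

```ballerina
import ballerina/io;

// Hypothetical helper: builds one <Movie> element from plain values.
// The int is interpolated into an attribute, the strings into element content.
function toMovie(string name, int year, string director) returns xml {
    return xml `<Movie year="${year}"><Name>${name}</Name><Director>${director}</Director></Movie>`;
}

public function main() {
    // A function call result can be concatenated like any other XML value.
    xml movieList = toMovie("Jurassic Park", 1993, "Steven Spielberg")
        + toMovie("Titanic", 1997, "James Cameron");

    io:println(movieList.length()); // 2
    io:println(movieList);
}
```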

The language library also provides XML subtypes: “xml:Element”, “xml:ProcessingInstruction”, “xml:Comment”, and “xml:Text”. These types can be used when we need subtype-specific operations. Let’s create an XML element and set its child elements using this functionality.

import ballerina/lang.'xml;
...
'xml:Element movies = <'xml:Element> xml `<Movies/>`;
movies.setChildren(movieList);

The XML values can also be namespace qualified. The following code snippet shows how an XML namespace is defined for a given namespace prefix.

xmlns "http://example.com/ns1" as ns1;
xml movies = xml `<ns1:Movies>${movieList}</ns1:Movies>`;

Here, we’ve defined the namespace prefix “ns1” and associated it with the namespace URI “http://example.com/ns1”. We then created a new XML element qualified with the namespace bound to the “ns1” prefix, and simultaneously set the children of the element by using interpolation.

Similarly, we can set the default namespace of the XML values in the scope following its declaration by simply not setting a namespace prefix.

xmlns "http://example.com/ns1";
xml movies = xml `<Movies>${movieList}</Movies>`;

In the example above, the “movies” XML element and its children will inherit the namespace defined above it since it has been declared as the default namespace.

Accessing XML

Now that we have XML values in our code, let’s see how we can access and query the structure they represent.

Let’s start with attribute access of an XML element value. This is done in the format “xml_value.attr_name”. The following code snippet shows how we can extract the “year” attribute from the movie element we created earlier.

string|error year = movie1.year;

Here, the attribute access expression returns a union of “string” and “error”. This is because, at runtime, if the attribute does not exist in the given XML element value, the expression returns an “error” value.
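
One way to consume this union is a type test that falls back to a default when the attribute is missing. This is a minimal sketch; “yearOrDefault” is a hypothetical helper name.

```ballerina
import ballerina/io;

// Hypothetical helper: returns the "year" attribute, or the given fallback
// when attribute access yields an error.
function yearOrDefault(xml movie, string fallback) returns string {
    string|error year = movie.year;
    if (year is string) {
        return year;
    }
    return fallback;
}

public function main() {
    xml dated = xml `<Movie year="1993"><Name>Jurassic Park</Name></Movie>`;
    xml undated = xml `<Movie><Name>Titanic</Name></Movie>`;

    io:println(yearOrDefault(dated, "unknown"));   // 1993
    io:println(yearOrDefault(undated, "unknown")); // unknown
}
```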

In the case of an XML attribute having a specific namespace, we can prefix the attribute name with the namespace prefix in the following manner.

xmlns "http://example.com/ns1" as ns1;
 
xml movie2 = xml `<ns1:Movie ns1:year="1997">
                    <Name>Titanic</Name>
                    <Director>James Cameron</Director>
                  </ns1:Movie>`;
string|error year = movie2.ns1:year;

Now let’s see how we access elements in an XML sequence value. As we saw earlier, an XML sequence can have multiple single XML values at the same level. Let’s see how we can extract specific elements in such a sequence using filter expressions. First, let’s define a set of XML values.

xmlns "http://example.com/ns1" as ns1;
 
xml movie1 = xml `<Movie year="1993">
                   <Name>Jurassic Park</Name>
                   <Director>Steven Spielberg</Director>
                   </Movie>`;
xml movie2 = xml `<ns1:Movie ns1:year="1997">
                   <Name>Titanic</Name>
                   <Director>James Cameron</Director>
                   </ns1:Movie>`;
xml book1 = xml `<Book>
                   <Name>Harry Potter</Name>
                   <Author>J.K. Rowling</Author>
                   </Book>`;
 
xml person1 = xml `<Person>
                       <Name>Jack Smith</Name>
                       <BirthYear>1990</BirthYear>
                   </Person>`;
 
xml entries = movie1 + movie2 + book1 + person1;

Here, we have created an XML value “entries”, which contains a sequence of XML elements. Now let’s select all the elements that have the element name “Movie”. The syntax for this is “xml_val.<xml_name_pattern>”.

xml<'xml:Element> movieElements = entries.<Movie>;

Here, we have directly given “Movie” as the element name. The resulting “movieElements” XML sequence contains a single element, which is “movie1”. The XML element in “movie2” is not included because it is namespace qualified. Notice that we can also use the more constrained “xml<'xml:Element>” type for “movieElements”, because filter expressions specifically return XML elements.

Now if we want to specifically access the movie element with a given namespace, we can use the following syntax.

xml<'xml:Element> movieElements = entries.<ns1:Movie>;

In this case, “movieElements” will contain only the XML element in “movie2”. If we want to extract multiple elements with different names, we can separate the names using “|” in the filter expression.

xml<'xml:Element> moviesAndBooks = entries.<ns1:Movie|Movie|Book>;

The statement above extracts all the XML elements in “entries” named “Movie” (with or without the “ns1” namespace) or “Book”.

The XML name pattern can also be “*”, which is used to select all the XML elements in a sequence.

xml<'xml:Element> allElements = entries.<*>;

The code above selects all the XML elements in the XML sequence “entries” and returns them as a new XML sequence that contains only XML element values.
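
The effect of the “*” pattern is easiest to see on a sequence that mixes elements with other item kinds. The sketch below uses made-up sample values, and “countElements” is a hypothetical helper.

```ballerina
import ballerina/io;

// Hypothetical helper: counts only the element items of a sequence.
function countElements(xml seq) returns int {
    xml elements = seq.<*>;
    return elements.length();
}

public function main() {
    // A sequence mixing two elements, a comment, and a text item.
    xml mixed = xml `<Movie/>` + xml `<!--reviewed-->` + xml `plain text` + xml `<Book/>`;
    xml onlyBooks = mixed.<Book>;

    io:println(mixed.length());       // 4
    io:println(countElements(mixed)); // 2
    io:println(onlyBooks.length());   // 1
}
```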

Next, let’s see how we can query child items in an XML value. This can be done with XML step expressions. This has a syntax and functionality that would be familiar if you have used XPath before.

xml allChildItems = movie1/*;

Here, we read in all the child items in the XML value “movie1”. If we want to restrict this to only XML child elements, we will use the following syntax.

xml allChildElements = movie1/<*>;

Because each step returns an XML value, we can chain step expressions to drill down to any level.

xml doc = xml `<Doc>${entries}</Doc>`;
xml allMoviesNames = doc/<Movie|ns1:Movie>/<Name>;

Here, we read in all the movie names. The first step selects every child element named “Movie”, with or without the “ns1” namespace, and the second step selects the “Name” children of those elements.

Also, we can search through all the descendants of an XML value to access the required items. The example below shows how this is done.

xml allNames = doc/**/<Name>;

With the “/**” syntax, the execution will search through all the descendants of the “doc” XML value and find all the elements with the name “Name”.
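
To compare the step expressions side by side, here is a small sketch on a made-up element that contains a comment and a nested “Name”. The helper names are hypothetical.

```ballerina
import ballerina/io;

// Hypothetical helper: counts <Name> elements that are direct children.
function countChildNames(xml root) returns int {
    xml names = root/<Name>;
    return names.length();
}

// Hypothetical helper: counts <Name> elements at any depth below the root.
function countDescendantNames(xml root) returns int {
    xml names = root/**/<Name>;
    return names.length();
}

public function main() {
    xml movie = xml `<Movie><Name>Jurassic Park</Name><!--rating pending--><Cast><Name>Sam Neill</Name></Cast></Movie>`;

    xml children = movie/*;        // Name, the comment, and Cast
    xml childElements = movie/<*>; // Name and Cast only

    io:println(children.length());           // 3
    io:println(childElements.length());      // 2
    io:println(countChildNames(movie));      // 1
    io:println(countDescendantNames(movie)); // 2
}
```

Note that “/*” counts every child item, so whitespace between tags in a multi-line literal would also show up as text items.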

Operations that are supported in XML filter expressions and step expressions can also be implemented using the functions available in the XML language library.

XML and Language Integrated Queries

In Ballerina, we can combine the language-integrated query features with XML processing to perform advanced processing and transformation operations. Let’s take a look at a sample dataset and see how we can transform it into a better representation. Here, we will be using a publicly available XML dataset, which contains the annual CO2 emissions of each country. A sample snippet of this data is shown below.

<Root>
   <data>
       <record>
           <field name="Country or Area" key="ABW">Aruba</field>
           <field name="Item" key="EN.ATM.CO2E.PC">CO2 emissions (metric tons per capita)</field>
           <field name="Year">1960</field>
           <field name="Value">204.620372249175</field>
       </record>
       <record>
           <field name="Country or Area" key="AFG">Afghanistan</field>
           <field name="Item" key="EN.ATM.CO2E.PC">CO2 emissions (metric tons per capita)</field>
           <field name="Year">1964</field>
           <field name="Value">0.0861736143685528</field>
       </record>
   </data>
</Root>

We want to transform the above dataset so that the XML element names themselves represent the meaning of their text values. Also, the source dataset contains some records with empty value fields, which we would like to skip. The final result should be similar to the dataset below.

<records>
   <record>
       <country>Aruba</country>
       <year>1960</year>
       <value>204.620372249175</value>
   </record>
   <record>
       <country>Afghanistan</country>
       <year>1964</year>
       <value>0.0861736143685528</value>
   </record>
</records>

The transformation above can be done with a single statement in Ballerina using its integrated query functionality. The full Ballerina source code implementing the required transformation is shown below.

import ballerina/io;
 
public function main() returns @tainted error? {
 
   io:ReadableByteChannel rbc = check io:openReadableFile("/home/laf/Downloads/API_EN.ATM.CO2E.PC_DS2_en_xml_v2_1500418.xml");
   io:ReadableCharacterChannel rch = new (rbc, "UTF8");
 
   xml payload = check rch.readXml();
 
   xml transformedData = xml `<records>
                               ${from var x in payload/<data>/<*>
                                 let var country = <xml> x/<'field>[0]/*
                                 let var year = <xml> x/<'field>[2]/*
                                 let var value = <xml> x/<'field>[3]/*
                                 where value.length() > 0
                                 select xml `<record>
                                                 <country>${country}</country>
                                                 <year>${year}</year>
                                                 <value>${value}</value>
                                             </record>`
                                }
                              </records>`;
 
   io:WritableByteChannel wbc = check io:openWritableFile("/home/laf/Downloads/transformed.xml");
   io:WritableCharacterChannel wch = new (wbc, "UTF8");   
   check wch.writeXml(transformedData);
   check wch.close();
   check rch.close();
}

As shown in the code above, we can mix and match various aspects of the language to create more powerful functionality.
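
To try the query without downloading the dataset, the sketch below reuses the same query on an inline payload. The function names “samplePayload” and “transform” are hypothetical, and the third record (“Exampleland”) is made up solely to exercise the “where” clause.

```ballerina
import ballerina/io;

// Inline stand-in for the downloaded file. The first two records come from
// the sample snippet; the third is made up and has an empty Value field.
function samplePayload() returns xml {
    return xml `<Root>
   <data>
       <record>
           <field name="Country or Area" key="ABW">Aruba</field>
           <field name="Item" key="EN.ATM.CO2E.PC">CO2 emissions (metric tons per capita)</field>
           <field name="Year">1960</field>
           <field name="Value">204.620372249175</field>
       </record>
       <record>
           <field name="Country or Area" key="AFG">Afghanistan</field>
           <field name="Item" key="EN.ATM.CO2E.PC">CO2 emissions (metric tons per capita)</field>
           <field name="Year">1964</field>
           <field name="Value">0.0861736143685528</field>
       </record>
       <record>
           <field name="Country or Area" key="XXX">Exampleland</field>
           <field name="Item" key="EN.ATM.CO2E.PC">CO2 emissions (metric tons per capita)</field>
           <field name="Year">1960</field>
           <field name="Value"></field>
       </record>
   </data>
</Root>`;
}

// The same query as in the full program, applied to any payload.
function transform(xml payload) returns xml {
    return xml `<records>${from var x in payload/<data>/<*>
                           let var country = <xml> x/<'field>[0]/*
                           let var year = <xml> x/<'field>[2]/*
                           let var value = <xml> x/<'field>[3]/*
                           where value.length() > 0
                           select xml `<record><country>${country}</country><year>${year}</year><value>${value}</value></record>`}</records>`;
}

public function main() {
    xml result = transform(samplePayload());
    xml kept = result/<*>;
    io:println(kept.length()); // 2: the record with the empty Value is skipped
    io:println(result);
}
```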

Summary

In this article, we have gone through the main aspects of XML handling in Ballerina. We provided an overview of how to create XML values, manipulate them, and access XML using the various features available in the language.

For more information on Ballerina and XML handling, refer to the following resources:
