XML


There is a good tutorial on XML at the Sun site:
XML Tutorial

There is a web site devoted to XML which also contains useful tutorials:
www.xml.org Tutorials

IBM also has a good set of tutorials about XML:
IBM Tutorials on XML.

XML is a set of standards written by the W3C (World Wide Web Consortium):
W3C and XML


In traditional IT applications, the format of a file is determined by the application that uses it.
The information about how to interpret the different fields in a file is coded in the programs that use it.

This makes it difficult to exchange files among different applications. If application X is going to use a file generated by application Y, the developers need access to the description of the file, which is not available in the file itself and usually is not available to the general public.

An additional difficulty may arise because of differences in the internal representation of certain data types. For instance, there are two main architectures for representing integers: big endian and little endian. If a file generated on a machine of one architecture is read on a machine with the other architecture, compatibility problems arise.
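The byte-order problem can be seen in a small Java sketch. The class name EndianDemo and the value 258 are made up for illustration; only the standard java.nio API is assumed.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndianDemo {
    // Returns the four bytes of value, laid out in the given byte order,
    // as a hex string such as "00 00 01 02"
    static String toHex(int value, ByteOrder order) {
        byte[] bytes = ByteBuffer.allocate(4).order(order).putInt(value).array();
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b));
        }
        return sb.toString().trim();
    }

    // Writes value as big endian and reads the same bytes back as little endian
    static int misread(int value) {
        byte[] big = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(value).array();
        return ByteBuffer.wrap(big).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    public static void main(String[] args) {
        int qty = 258; // 0x00000102
        System.out.println(toHex(qty, ByteOrder.BIG_ENDIAN));    // 00 00 01 02
        System.out.println(toHex(qty, ByteOrder.LITTLE_ENDIAN)); // 02 01 00 00
        System.out.println(misread(qty));                        // 33619968
    }
}
```

The same number written as the text "258" would be read identically on both kinds of machine.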

These problems have been addressed before using a very general language called SGML which was too complex to be widely used. XML is a descendant of SGML which is easier to use and has become quite popular.

Key aspects of XML


To solve the problems described previously, XML offers two key characteristics:
  1. Every document that is going to be encoded using XML must have a definition.
  2. Documents in XML contain plain text only.
This makes documents very portable among platforms. The definition is not tied to a particular program that creates or reads the document; instead, it is available to anybody interested in reading or creating this type of document. Text is text, no matter which platform is manipulating it; therefore an XML document will be readable by any kind of computer, no matter how numbers are represented internally.

Defining a document: DTD and Schemas


There are several alternatives to describe a type of document. The one that seems to be most widely used now is called DTD - Document Type Definition.

There are other alternatives which will not be covered here: Schemas and DSD. To learn more about those alternatives, check this link: DTD, Schemas and DSD.

The problem of describing precisely the valid content of a file has been solved previously in Computer Science in the realm of compilers. Modern programming languages are defined using a formal grammar which specifies what constitutes a valid program in that programming language. This topic is covered in depth in compiler courses.

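A grammar, as used in compilers, contains four elements:
  1. A set of terminal symbols (the tokens that appear in the text).
  2. A set of non-terminal symbols.
  3. A set of productions (rules) that describe how each non-terminal is composed.
  4. A starting symbol.
The approach used in DTD is very similar to the one used in grammars for programming languages.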

A compiler reads source code and applies the productions to the text that it reads to check whether the source complies with the grammar of the programming language. It creates in memory a structure called a derivation tree, which is used to generate the executable code. If the source code does not follow the rules of the grammar, the compiler generates error messages.

An XML system will behave in a way very similar to a compiler.

Let's start with the grammar used to define a valid document.

Take as an example the code included on the CD that accompanies van der Linde's book:

<?xml version="1.0"?>
<!DOCTYPE inventory  [
<!ELEMENT inventory (cd)* >
    <!ELEMENT cd (title, artist, price, qty)>
        <!ELEMENT title (#PCDATA)>
        <!ELEMENT artist (#PCDATA)>
        <!ELEMENT price (#PCDATA)>
        <!ELEMENT qty (#PCDATA)>
]>

<inventory>
  <cd> <title>Some Girls</title>   <artist>Rolling Stones</artist>
       <price>25</price>  <qty>5</qty>  </cd>
</inventory>


In this example, the type of document being processed is inventory.

XML files normally start with the XML declaration, <?xml version="1.0"?>, which indicates that this is an XML document and which version of XML is being used.

Comments may be included using the following format:
<!-- Comment  -->

The definition of the document may also be kept in an external file and referenced with a statement like:
<!DOCTYPE PLAY SYSTEM "play.dtd">
This avoids having to include the DTD in every document of this particular kind.

In the example we are examining, the DTD is included along with the document.

The next element found in the example is the DTD for inventory. The part of interest is:
<!DOCTYPE inventory  [
<!ELEMENT inventory (cd)* >
    <!ELEMENT cd (title, artist, price, qty)>
        <!ELEMENT title (#PCDATA)>
        <!ELEMENT artist (#PCDATA)>
        <!ELEMENT price (#PCDATA)>
        <!ELEMENT qty (#PCDATA)>
]>

The starting symbol of this grammar is inventory and is specified by the statement:
 <!DOCTYPE inventory

A document should contain only one DOCTYPE declaration.

This grammar states that a document of type inventory consists of an inventory element that contains 0 (zero) or more cd elements. This is stated by the line:
 <!ELEMENT inventory (cd)* >
The asterisk is a short way of stating 0 (zero) or more instances (this is covered in discrete math or compiler courses).
The keyword ELEMENT in this line declares inventory as an element, so the document must contain a matching pair of inventory tags.
Every time a tag is opened, it must be closed later by the corresponding closing tag.

The following file, with 0 CDs,  would be a valid instance of a document of the type inventory:
...
<inventory>
</inventory>

The two lines are the two tags that identify the beginning and the end of the inventory. They must appear because inventory is an element and hence the two tags must appear to mark the start and the end of this element.

The definition of the element CD is contained in the following line:
<!ELEMENT cd (title, artist, price, qty)>
This states that a cd is an element, and therefore the corresponding tag should be included in the documents, and that every CD will contain a title, an artist, a price and a quantity. In terms of traditional compiler grammars, this is also a non-terminal because it is composed of other elements.

It is time now to define what a title looks like. The line
<!ELEMENT title (#PCDATA)>
states that title is an element, and again there should be a tag associated with title, but the only thing that will appear between the tags will be "parsed character data", in other words, plain text.

The declarations for artist, price and qty are similar. Notice that qty contains a number (probably an integer) but all that is stated here is that it will be treated as text.
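The content models in these declarations behave like regular expressions over the names of the child elements. The following Java sketch illustrates the analogy only; it is not how an XML parser is implemented. The class name ContentModelDemo and the encoding of children as space-terminated names are assumptions made for this example.

```java
import java.util.regex.Pattern;

class ContentModelDemo {
    // Hypothetical encoding: each child element name is followed by one space,
    // so a complete cd reads "title artist price qty "
    static final Pattern CD = Pattern.compile("title artist price qty ");
    // (cd)* in the DTD: zero or more cd children
    static final Pattern INVENTORY = Pattern.compile("(cd )*");

    // Checks whether the sequence of child names satisfies the content model
    static boolean matches(Pattern model, String... children) {
        StringBuilder sb = new StringBuilder();
        for (String c : children) {
            sb.append(c).append(' ');
        }
        return model.matcher(sb).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(INVENTORY));                                  // true
        System.out.println(matches(INVENTORY, "cd", "cd"));                      // true
        System.out.println(matches(CD, "title", "artist", "price", "qty"));      // true
        System.out.println(matches(CD, "title", "artist", "qty"));               // false
    }
}
```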

The following example contains two CDs instead of just one:
<?xml version="1.0" ?>
 ...
<inventory>
 <cd>
      <title>Some Girls</title>
      <artist>Rolling Stones</artist>
      <price>25</price>
      <qty>5</qty>
  </cd>
  <cd>
      <title>Aniversario</title>
      <artist>Pablo Milanes</artist>
      <price>20</price>
      <qty>5</qty>
  </cd>
 </inventory>

There are many other options when defining DTDs. In particular, it is possible to state that an element carries certain attributes. There are also rules to specify alternatives, and other regular-expression operators to specify the number of times that an element may appear.

How are XML documents processed?


In compilers, once a grammar has been defined, one creates a compiler for that grammar that checks whether source code is correct and translates it into binary code.

Something similar will happen with XML documents. It is possible to:
  1. Validate the document being read, that is, check whether it conforms to the grammar for the kind of document being processed.
  2. Manipulate the content.

Validating

Validating is optional; the decision is left to the user. If the user is absolutely sure that the document conforms to its DTD, it might not be necessary to validate it.

The following file is very similar to the previous example, but it has been modified so that it does not comply with the DTD.
<?xml version="1.0"?>
<!DOCTYPE inventory  [
<!ELEMENT inventory (cd)* >
    <!ELEMENT cd (title, artist, price, qty)>
        <!ELEMENT title (#PCDATA)>
        <!ELEMENT artist (#PCDATA)>
        <!ELEMENT price (#PCDATA)>
        <!ELEMENT qty (#PCDATA)>
]>

<inventory>
  <cd> <title>Some Girls</title>   <artist>Rolling Stones</artist>
<!--  Comment out price to cause an error:        <price>25</price>  -->
       <qty>5</qty>  </cd>
</inventory>

If one runs the program DOMEcho (included in the CD that comes with the book) with the -v option:
  java DOMEcho -v cdInvalid.xml

The following output is produced:
 Error: URI=file:/home/trefftzc/cs621/XML/cdInvalid.xml Line=14: Element "cd" does not allow "qty" here.

The -v option tells the program to validate the input file against the DTD.

The error is produced because the next expected field is price.  It has been commented out, hence the next element found is qty.
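The same check can be sketched with the standard JAXP classes. This is a minimal illustration, not the DOMEcho program from the book: the class name ValidateDemo, the isValid method and the inline test documents are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ValidateDemo {
    // XML declaration plus the inventory DTD used in the examples above
    static final String PROLOG =
        "<?xml version=\"1.0\"?>\n" +
        "<!DOCTYPE inventory [\n" +
        "<!ELEMENT inventory (cd)*>\n" +
        "<!ELEMENT cd (title, artist, price, qty)>\n" +
        "<!ELEMENT title (#PCDATA)>\n" +
        "<!ELEMENT artist (#PCDATA)>\n" +
        "<!ELEMENT price (#PCDATA)>\n" +
        "<!ELEMENT qty (#PCDATA)>\n" +
        "]>\n";

    // Returns true if the document parses and conforms to its DTD
    static boolean isValid(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(true); // the equivalent of the -v option
        DocumentBuilder db = dbf.newDocumentBuilder();
        // Validity errors are only reported to the handler; rethrow them
        db.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) { }
            public void error(SAXParseException e) throws SAXException { throw e; }
            public void fatalError(SAXParseException e) throws SAXException { throw e; }
        });
        try {
            db.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            System.out.println("Invalid: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String good = PROLOG + "<inventory><cd><title>Some Girls</title>"
            + "<artist>Rolling Stones</artist><price>25</price><qty>5</qty></cd></inventory>";
        String bad = PROLOG + "<inventory><cd><title>Some Girls</title>"
            + "<artist>Rolling Stones</artist><qty>5</qty></cd></inventory>";
        System.out.println(isValid(good)); // true
        System.out.println(isValid(bad));  // false, price is missing
    }
}
```

The exact wording of the error message depends on the parser implementation, so the sketch only reports whether validation succeeded.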

Parsers

If one knows in advance the grammar of the documents that will be used, it is possible to build a parser to process precisely documents of this type. This is the approach taken in, for instance, VoiceXML (more information at VoiceXML Forum; Professor Jon Engelsma is a member of the VoiceXML Forum and of the W3C committee in charge of defining the standard, see VoiceXML Review). Fast parsers have been built specifically to parse XML documents that follow the DTD defined for VoiceXML documents.

Surprisingly, writing a dedicated parser is not always necessary. It is possible to use libraries that take any valid DTD and create, "on the fly", a (validating) parser for that particular DTD (grammar).

These parsers are available in Java, C++ and other programming languages (for example, Python). C++ parsers tend to be faster than Java parsers.

There are two main kinds of parsers:
  1. DOM: Document Object Model
  2. SAX: Simple API for XML

DOM Parsers

A DOM Parser reads the entire document into main memory, parses it and builds a tree that can be traversed using methods provided for this purpose in the API.

The following program is based on the example from the book. It will print the titles of the CDs in the inventory:

import javax.xml.parsers.*;
import org.xml.sax.*;
import org.xml.sax.helpers.*;
import org.w3c.dom.*;
import java.io.*;

public class MyDOM {
    /** All output will use this encoding */
    static final String outputEncoding = "UTF-8";

    public static void main(String[] args) throws Exception {
        String filename = null;
        boolean validation = false;
        boolean ignoreWhitespace = true;
        boolean ignoreComments = false;
        boolean putCDATAIntoText = false;
        boolean createEntityRefs = false;

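        // The name of the XML file is the only expected argument
        if (args.length != 1) {
            System.err.println("Usage: java MyDOM <file.xml>");
            System.exit(1);
        }
        filename = args[0];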
        // Step 1: create a DocumentBuilderFactory and configure it
        DocumentBuilderFactory dbf =
            DocumentBuilderFactory.newInstance();

        // Optional: set various configuration options
        dbf.setValidating(validation);
        dbf.setIgnoringComments(ignoreComments);
        dbf.setIgnoringElementContentWhitespace(ignoreWhitespace);
        dbf.setCoalescing(putCDATAIntoText);
        // Step 2: create a DocumentBuilder (an example of the Factory pattern)
        DocumentBuilder db = null;
        try {
            db = dbf.newDocumentBuilder();
        } catch (ParserConfigurationException pce) {
            System.err.println(pce);
            System.exit(1);
        }

        // Set an ErrorHandler before parsing
        OutputStreamWriter errorWriter =
            new OutputStreamWriter(System.err, outputEncoding);
        db.setErrorHandler(
            new MyErrorHandler(new PrintWriter(errorWriter, true)));

        // Step 3: parse the input file
        Document doc = null;
        try {
            doc = db.parse(new File(filename));
        } catch (SAXException se) {
            System.err.println(se.getMessage());
            System.exit(1);
        } catch (IOException ioe) {
            System.err.println(ioe);
            System.exit(1);
        }
        Element root = doc.getDocumentElement();
        MyDOM.processInventory(root);

    } // end main

    // Sample methods
    // Process the inventory: every element child is expected to be a cd
    private static void processInventory(Element n) {
        for (Node child = n.getFirstChild(); child != null;
             child = child.getNextSibling()) {
            // Skip whitespace text or comment nodes the parser may have kept
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                processCD(child);
            }
        }
    } // end processInventory

    // Return n if it is an element, otherwise the next sibling
    // that is an element (or null if there is none)
    private static Node firstElementFrom(Node n) {
        while (n != null && n.getNodeType() != Node.ELEMENT_NODE) {
            n = n.getNextSibling();
        }
        return n;
    }

    // Process the CD
    // Every CD is expected to contain:
    // the title, the artist, the price and the quantity
    private static void processCD(Node n) {
        Node child = firstElementFrom(n.getFirstChild()); // Child is the title now
        processTitle(child);
        child = firstElementFrom(child.getNextSibling()); // Child is the artist now
        child = firstElementFrom(child.getNextSibling()); // Child is the price now
        child = firstElementFrom(child.getNextSibling()); // Child is the quantity now
    } // end processCD

    private static void processTitle(Node n) {
        // An element that just has PCDATA will have a child node
        // of type text whose node value is the element's PCDATA
        Node child = n.getFirstChild(); // Child is the text for the title now
        System.out.println("The title of this CD is: " + child.getNodeValue());
    }
    // Error handler to report errors and warnings
    private static class MyErrorHandler implements ErrorHandler {
        /** Error handler output goes here */
        private PrintWriter out;

        MyErrorHandler(PrintWriter out) {
            this.out = out;
        }

        /**
         * Returns a string describing parse exception details
         */
        private String getParseExceptionInfo(SAXParseException spe) {
            String systemId = spe.getSystemId();
            if (systemId == null) {
                systemId = "null";
            }
            String info = "URI=" + systemId +
                " Line=" + spe.getLineNumber() +
                ": " + spe.getMessage();
            return info;
        }

        // The following methods are standard SAX ErrorHandler methods.
        // See SAX documentation for more info.

        public void warning(SAXParseException spe) throws SAXException {
            out.println("Warning: " + getParseExceptionInfo(spe));
        }
       
        public void error(SAXParseException spe) throws SAXException {
            String message = "Error: " + getParseExceptionInfo(spe);
            throw new SAXException(message);
        }

        public void fatalError(SAXParseException spe) throws SAXException {
            String message = "Fatal Error: " + getParseExceptionInfo(spe);
            throw new SAXException(message);
        }
    }

} // end MyDOM
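Walking the siblings by position depends on the exact order of the child nodes. As an alternative sketch, assuming only the standard org.w3c.dom API, the tree can be asked for all title elements by name. The class name TitlesByName and the inline document are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TitlesByName {
    // Parses the document into a DOM tree and collects the text of every title
    static List<String> titles(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
        NodeList nodes = doc.getElementsByTagName("title");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<inventory>"
            + "<cd><title>Some Girls</title><artist>Rolling Stones</artist>"
            + "<price>25</price><qty>5</qty></cd>"
            + "<cd><title>Aniversario</title><artist>Pablo Milanes</artist>"
            + "<price>20</price><qty>5</qty></cd>"
            + "</inventory>";
        System.out.println(titles(xml)); // [Some Girls, Aniversario]
    }
}
```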

SAX Parsers

SAX parsers read the input file piece by piece. As a particular element is read, an action is taken. The style of programming is based on callbacks that are invoked as elements are read. SAX processing is more involved, but it does not require reading the entire document at once, and hence it is more appropriate for very large documents that may not fit in main memory.
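The callback style can be sketched as follows, using the standard javax.xml.parsers and org.xml.sax classes. The class name TitleHandler and the inline document are made up for the example. Like the DOM program above, it collects the titles of the CDs, but without building a tree.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class TitleHandler extends DefaultHandler {
    final List<String> titles = new ArrayList<>();
    private StringBuilder text; // non-null only while inside a title element

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("title")) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver the text of one element in several chunks
        if (text != null) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("title")) {
            titles.add(text.toString());
            text = null;
        }
    }

    // Runs a SAX parse over the document and returns the collected titles
    static List<String> titles(String xml) throws Exception {
        TitleHandler handler = new TitleHandler();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<inventory>"
            + "<cd><title>Some Girls</title><artist>Rolling Stones</artist>"
            + "<price>25</price><qty>5</qty></cd>"
            + "</inventory>";
        System.out.println(titles(xml)); // [Some Girls]
    }
}
```

Only the text of the title currently being read is held in memory, which is what makes this style suitable for very large documents.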

Other uses of XML

There are many other possible uses of XML.

For instance an entire framework for publishing documents based on XML has been created. It is called Cocoon (more information at Apache Cocoon).

There is a language called XSLT (XSL Transformations) which allows the transformation of an XML document into another XML document (more information at W3C information about XSLT).
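A transformation can be run from Java through the standard javax.xml.transform API. The sketch below is only an illustration: the class name XsltDemo, the stylesheet and the output element names titles and t are hypothetical and were invented for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class XsltDemo {
    // Hypothetical stylesheet: turns an inventory into a list of titles
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\""
        + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"/inventory\">"
        + "<titles><xsl:for-each select=\"cd\">"
        + "<t><xsl:value-of select=\"title\"/></t>"
        + "</xsl:for-each></titles>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Applies the stylesheet to the given document and returns the result as text
    static String transform(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<inventory>"
            + "<cd><title>Some Girls</title><artist>Rolling Stones</artist>"
            + "<price>25</price><qty>5</qty></cd>"
            + "</inventory>";
        // The result is a new XML document containing only the titles
        System.out.println(transform(xml));
    }
}
```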

XML is being used as the basis for Web services (more information on Web services).  These are emerging technologies which might be very useful in the future.