XML
There is a good tutorial on XML at the Sun site: XML Tutorial.
There is a web site devoted to XML which also contains useful tutorials: www.xml.org Tutorials.
IBM also has a good set of tutorials about XML: IBM Tutorials on XML.
XML is a set of standards written by the W3C: W3C and XML.
In traditional IT applications, the format of a file is determined by the
application that uses it.
The information about how to interpret the different fields in a file is
coded in the programs that use it.
This makes it difficult to exchange files among different applications.
If application X is going to use a file generated by application Y,
the developers need access to the description of the file which is not available
in the file itself and usually is not available to the general public.
An additional difficulty may arise because of differences in the internal
representation of certain data types. For instance, there are two main conventions
for the byte order of integers: big endian and little endian. If a file written
on a machine of one architecture is read on a machine of the other architecture,
compatibility problems arise.
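The byte-order problem can be illustrated with a short Java sketch. The class name EndianDemo and the value 25 are chosen here only for illustration; the sketch uses java.nio.ByteBuffer to lay out the same integer in both byte orders and then reads big-endian bytes as if they were little endian.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndianDemo {
    // Returns the four bytes of value, in the given byte order, as hex text.
    static String hex(int value, ByteOrder order) {
        byte[] bytes = ByteBuffer.allocate(4).order(order).putInt(value).array();
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b));
        }
        return sb.toString().trim();
    }

    // Interprets four bytes as an int using the given byte order.
    static int readAs(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getInt();
    }

    public static void main(String[] args) {
        int value = 25;
        System.out.println("big endian:    " + hex(value, ByteOrder.BIG_ENDIAN));    // 00 00 00 19
        System.out.println("little endian: " + hex(value, ByteOrder.LITTLE_ENDIAN)); // 19 00 00 00

        // Bytes written by a big-endian machine, read by a little-endian one.
        byte[] written = ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putInt(value).array();
        System.out.println("misread value: " + readAs(written, ByteOrder.LITTLE_ENDIAN)); // 419430400
    }
}
```

The same number written as the text "25" would not have this problem, which is the motivation for the plain-text approach described later.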
These problems have been addressed before using a very general language
called SGML which was too complex to be widely used. XML is a descendant
of SGML which is easier to use and has become quite popular.
Key aspects of XML
To solve the problems described previously, XML offers two key characteristics:
- Every document that is going to be encoded using XML must have a definition.
- Documents in XML contain plain text only.
This makes documents very portable among platforms. The definitions
are not tied to a particular program that creates or reads the document;
instead, the definition of the document is available to anybody
interested in reading or creating this type of document. Text is text, no
matter which platform is manipulating it; therefore an XML document is
readable by any kind of computer, no matter how numbers are represented internally.
Defining a document: DTD and Schemas
There are several alternatives to describe a type of document. The one that
seems to be most widely used now is called DTD - Document Type Definition.
There are other alternatives which will not be covered here: Schemas and
DSD. To learn more about those alternatives, check this link: DTD, Schemas and DSD.
The problem of describing precisely the valid content of a file has been
solved previously in Computer Science in the realm of compilers. Modern
programming languages are defined using a formal grammar which specifies what
constitutes a valid program in this particular programming language. This
topic is covered in depth in compiler courses.
A grammar, as used in compilers, contains four elements:
- A set of terminals: Words that can appear in the sentences in the
language being defined.
- A set of non-terminals: Categories of words that are used to
specify the structure of the valid sentences.
- A starting symbol: A particular symbol chosen from the set of non-terminals.
- A set of productions: Rules that specify how to derive (rewrite)
a non-terminal as a sequence of terminals and other non-terminals.
The approach used in DTD is very similar to the one used in grammars for
programming languages.
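The four elements of a grammar can be made concrete with a toy example. The sketch below is not taken from the book; the grammar (Sentence, Noun, Verb and the five words) and the class name ToyGrammar are invented for illustration, and it assumes Java 9 or later for Set.of, List.of and Map.of. It checks whether a list of words can be derived from the starting symbol by trying the productions one alternative at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ToyGrammar {
    // Terminals: the words that may appear in a sentence.
    static final Set<String> TERMINALS = Set.of("the", "dog", "cat", "sleeps", "runs");
    // Non-terminals: categories used to describe the structure of a sentence.
    static final Set<String> NON_TERMINALS = Set.of("Sentence", "Noun", "Verb");
    // Starting symbol: one of the non-terminals.
    static final String START = "Sentence";
    // Productions: each non-terminal maps to its alternative right-hand sides.
    static final Map<String, List<List<String>>> PRODUCTIONS = Map.of(
            "Sentence", List.of(List.of("the", "Noun", "Verb")),
            "Noun", List.of(List.of("dog"), List.of("cat")),
            "Verb", List.of(List.of("sleeps"), List.of("runs")));

    // True if the sequence of symbols can be rewritten into exactly these words.
    static boolean derives(List<String> symbols, List<String> words) {
        if (symbols.isEmpty()) {
            return words.isEmpty();
        }
        String first = symbols.get(0);
        List<String> rest = symbols.subList(1, symbols.size());
        if (TERMINALS.contains(first)) {
            // A terminal must match the next word literally.
            return !words.isEmpty() && words.get(0).equals(first)
                    && derives(rest, words.subList(1, words.size()));
        }
        if (!NON_TERMINALS.contains(first)) {
            throw new IllegalArgumentException("Unknown symbol: " + first);
        }
        // A non-terminal is rewritten using each of its productions in turn.
        for (List<String> alternative : PRODUCTIONS.get(first)) {
            List<String> expanded = new ArrayList<>(alternative);
            expanded.addAll(rest);
            if (derives(expanded, words)) {
                return true;
            }
        }
        return false;
    }

    static boolean isSentence(String text) {
        return derives(List.of(START), List.of(text.split(" ")));
    }

    public static void main(String[] args) {
        System.out.println(isSentence("the dog runs")); // true
        System.out.println(isSentence("dog the runs")); // false
    }
}
```

This simple search only terminates for grammars without left recursion; real compilers use more elaborate parsing techniques.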
A compiler reads the source code and applies the productions to the text that
it reads to check whether the source complies with the grammar of the programming
language. It creates in memory something called a derivation tree. The derivation
tree is used to generate the executable code. If the source code does not
follow the rules of the grammar, the compiler generates error messages.
An XML system will behave in a way very similar to a compiler.
Let's start with the grammar used to define a valid document.
Take as an example the code included on the CD that comes with van der Linde's book:
<?xml version="1.0"?>
<!DOCTYPE inventory [
<!ELEMENT inventory (cd)* >
<!ELEMENT cd (title, artist, price, qty)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT artist (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT qty (#PCDATA)>
]>
<inventory>
<cd> <title>Some Girls</title> <artist>Rolling
Stones</artist>
<price>25</price>
<qty>5</qty> </cd>
</inventory>
In this example, the type of document being processed is inventory.
XML files normally start with the XML declaration <?xml version="1.0"?>, which indicates that
this is an XML document and which version of XML is being used.
Comments may be included using the following format:
<!-- Comment -->
The definition of the document may be kept in a separate file and referenced using
a statement like:
<!DOCTYPE PLAY SYSTEM "play.dtd">
This is a way to avoid having to include the DTD along with every document
of this particular kind.
In the example we are examining, the DTD is included along with the document.
The next element found in the example is the DTD for inventory. The part
of interest is:
<!DOCTYPE inventory [
<!ELEMENT inventory (cd)* >
<!ELEMENT cd (title, artist, price, qty)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT artist (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT qty (#PCDATA)>
]>
The starting symbol of this grammar is inventory and is specified by the
statement:
<!DOCTYPE inventory
A document may contain only one DOCTYPE declaration.
This grammar states that documents of type inventory contain one inventory
element, which in turn contains 0 (zero) or more cd elements.
This is stated by the line:
<!ELEMENT inventory (cd)* >
The asterisk is a short way of stating 0 (zero) or more instances (this
is covered in discrete math or compiler courses).
The presence of the keyword ELEMENT in this line indicates that inventory is
an element, so a pair of inventory tags should appear in the document.
Every time a tag is opened, it must be closed later.
The following file, with 0 CDs, would be a valid instance of a document
of the type inventory:
...
<inventory>
</inventory>
These two lines are the tags that mark the start and the end of
the inventory. They must appear because inventory is an element.
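The rule that every opened tag must be closed can be checked without any DTD. The following is a minimal sketch, assuming the parser bundled with the JDK; the class name WellFormedCheck is made up for this example. It passes an org.xml.sax.helpers.DefaultHandler as the error handler so that a badly nested document makes parse throw a SAXException instead of printing a message to the console.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedCheck {
    // Returns true if every tag in the text is properly opened and closed.
    static boolean isWellFormed(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler ignores warnings and rethrows fatal errors.
        db.setErrorHandler(new DefaultHandler());
        try {
            db.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(isWellFormed("<inventory>\n</inventory>"));    // true
        System.out.println(isWellFormed("<inventory>"));                  // false: never closed
        System.out.println(isWellFormed("<inventory><cd></inventory>"));  // false: cd never closed
    }
}
```

Note that this only checks the tag structure. Whether the document also follows the inventory DTD is a separate question, handled by validation.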
The definition of the element CD is contained in the following line:
<!ELEMENT cd (title, artist, price, qty)>
This states that cd is an element, and therefore the corresponding tags
should be included in the documents, and that every cd will contain a title,
an artist, a price and a quantity, in that order. In terms of traditional compiler grammars,
cd is also a non-terminal because it is composed of other elements.
It is time now to define what a title looks like. The line
<!ELEMENT title (#PCDATA)>
states that title is an element, and again there should be a tag associated
with title, but the only thing that will appear between the tags will be "parsed
character data", in other words, plain text.
The declarations for artist, price and qty are similar. Notice that qty
contains a number (probably an integer) but all that is stated here is that
it will be treated as text.
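Because the DTD only says that qty is text, a program that needs the number has to convert it itself. The sketch below is one possible way to do that with the DOM API; the class name QtyAsNumber and the method totalQuantity are hypothetical names introduced for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class QtyAsNumber {
    // Adds up the contents of all qty elements in the document.
    static int totalQuantity(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = db.parse(new InputSource(new StringReader(xml)));
        NodeList quantities = doc.getElementsByTagName("qty");
        int total = 0;
        for (int i = 0; i < quantities.getLength(); i++) {
            // The parser hands back text; the conversion to int is up to the program.
            String text = quantities.item(i).getTextContent().trim();
            total += Integer.parseInt(text);
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<inventory>"
                + "<cd><title>Some Girls</title><artist>Rolling Stones</artist>"
                + "<price>25</price><qty>5</qty></cd>"
                + "</inventory>";
        System.out.println("Total quantity: " + totalQuantity(xml));
    }
}
```

If a qty element contained something other than digits, Integer.parseInt would throw a NumberFormatException; a DTD cannot rule that out.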
The following example contains two CDs instead of just one:
<?xml version="1.0" ?>
...
<inventory>
<cd>
<title>Some Girls</title>
<artist>Rolling Stones</artist>
<price>25</price>
<qty>5</qty>
</cd>
<cd>
<title>Aniversario</title>
<artist>Pablo Milanes</artist>
<price>20</price>
<qty>5</qty>
</cd>
</inventory>
There are many other options when defining DTDs. In particular, it is possible
to state that a particular element has certain attributes. There
are also rules to specify alternatives, and operators borrowed from regular expressions to
specify the number of times that a particular element may appear.
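As an illustration of these options, the sketch below embeds a variation of the inventory DTD in a Java string and validates two small documents against it. The variation is invented for this example and is not from the book: + requires at least one cd, ? makes artist optional, | offers a choice between price and a hypothetical empty element free, and ATTLIST declares a hypothetical optional attribute genre. The class name DtdOptions is also made up.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class DtdOptions {
    static final String DTD =
            "<!DOCTYPE inventory [\n"
            + "<!ELEMENT inventory (cd)+ >\n"                                 // one or more cd
            + "<!ELEMENT cd (title, artist?, (price | free), qty)>\n"        // optional artist, price or free
            + "<!ATTLIST cd genre CDATA #IMPLIED>\n"                          // optional attribute
            + "<!ELEMENT title (#PCDATA)>\n"
            + "<!ELEMENT artist (#PCDATA)>\n"
            + "<!ELEMENT price (#PCDATA)>\n"
            + "<!ELEMENT free EMPTY>\n"
            + "<!ELEMENT qty (#PCDATA)>\n"
            + "]>\n";

    // Returns true if the body, preceded by the DTD above, is a valid document.
    static boolean isValid(String body) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(true);
        DocumentBuilder db = dbf.newDocumentBuilder();
        db.setErrorHandler(new DefaultHandler() {
            @Override
            public void error(SAXParseException e) throws SAXException {
                throw e; // treat validity errors as failures
            }
        });
        try {
            db.parse(new InputSource(new StringReader(DTD + body)));
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String ok = "<inventory><cd genre=\"rock\"><title>Some Girls</title>"
                + "<free/><qty>5</qty></cd></inventory>";
        System.out.println(isValid(ok));                          // true
        System.out.println(isValid("<inventory></inventory>"));   // false: + needs at least one cd
    }
}
```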
How are XML documents processed?
In compilers, once a grammar has been defined, one creates a compiler for
that grammar that checks whether the source code is correct and translates
it into binary code.
Something similar will happen with XML documents. It is possible to:
- Validate the document being read, that is, check that it conforms
to the grammar for the kind of document being processed.
- Manipulate the content.
Validating
Validating is optional. It is a decision taken by the user. If the user is
absolutely sure that the document is valid, that is, that it conforms to its DTD,
then it might not be necessary to validate the document.
The following file is very similar to the previous example, but it has been
modified so that it does not comply with the DTD.
<?xml version="1.0"?>
<!DOCTYPE inventory [
<!ELEMENT inventory (cd)* >
<!ELEMENT cd (title, artist, price, qty)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT artist (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT qty (#PCDATA)>
]>
<inventory>
<cd> <title>Some Girls</title> <artist>Rolling
Stones</artist>
<!-- Comment out price to cause an error:
<price>25</price> -->
<qty>5</qty> </cd>
</inventory>
If one runs the program DOMEcho (included in the CD that comes with the book)
with the -v option:
java DOMEcho -v cdInvalid.xml
The following output is produced:
Error: URI=file:/home/trefftzc/cs621/XML/cdInvalid.xml Line=14: Element
"cd" does not allow "qty" here.
The -v option tells the program to validate the input file against the DTD.
The error is produced because the next expected field is price. It
has been commented out, hence the next element found is qty.
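What a validating run does can be sketched in a few lines of Java. This is not the DOMEcho source; it is a minimal sketch, with the made-up class name ValidateInventory, that turns validation on with setValidating(true) and collects the problems reported to an ErrorHandler. The exact wording of the messages depends on the parser in use, so it may differ from the output shown above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

class ValidateInventory {
    static final String DTD =
            "<?xml version=\"1.0\"?>\n"
            + "<!DOCTYPE inventory [\n"
            + "<!ELEMENT inventory (cd)* >\n"
            + "<!ELEMENT cd (title, artist, price, qty)>\n"
            + "<!ELEMENT title (#PCDATA)>\n"
            + "<!ELEMENT artist (#PCDATA)>\n"
            + "<!ELEMENT price (#PCDATA)>\n"
            + "<!ELEMENT qty (#PCDATA)>\n"
            + "]>\n";
    static final String VALID_BODY =
            "<inventory>\n<cd> <title>Some Girls</title> <artist>Rolling Stones</artist>\n"
            + "<price>25</price>\n<qty>5</qty> </cd>\n</inventory>\n";
    static final String INVALID_BODY =
            "<inventory>\n<cd> <title>Some Girls</title> <artist>Rolling Stones</artist>\n"
            + "<!-- Comment out price to cause an error:\n<price>25</price> -->\n"
            + "<qty>5</qty> </cd>\n</inventory>\n";

    // Parses the document with validation on and returns the reported problems.
    static List<String> validate(String xml) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setValidating(true); // check the document against its DTD
        DocumentBuilder db = dbf.newDocumentBuilder();
        final List<String> problems = new ArrayList<>();
        db.setErrorHandler(new ErrorHandler() {
            public void warning(SAXParseException e) {
                problems.add("Warning: " + describe(e));
            }
            public void error(SAXParseException e) {
                problems.add("Error: " + describe(e)); // validity errors land here
            }
            public void fatalError(SAXParseException e) throws SAXException {
                throw e; // the document is not even well formed
            }
        });
        db.parse(new InputSource(new StringReader(xml)));
        return problems;
    }

    static String describe(SAXParseException e) {
        return "Line=" + e.getLineNumber() + ": " + e.getMessage();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("valid document:   " + validate(DTD + VALID_BODY));
        System.out.println("invalid document: " + validate(DTD + INVALID_BODY));
    }
}
```

With setValidating(false) the second document would be accepted without complaint, because it is still well formed.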
Parsers
If one knows in advance the grammar of the documents that will be used,
it is possible to build a parser that processes precisely documents of this type.
This is the approach taken in, for instance, VoiceXML (more information at the
VoiceXML Forum; professor Jon Engelsma is a member of the VoiceXML Forum and of the
W3C committee in charge of defining the standard, see VoiceXML Review).
Fast parsers have been built specifically to parse XML documents that follow
the DTD defined for VoiceXML documents.
Perhaps surprisingly, a dedicated parser is not always needed. It is possible to use libraries
that will take any valid DTD and create, "on the fly", a (validating) parser
for that particular DTD (grammar).
These parsers are available in Java, C++ and other programming languages
(such as Python). C++ parsers are often faster than Java parsers.
There are two main kinds of parsers:
- DOM: Document Object Model
- SAX: Simple API for XML
DOM Parsers
A DOM Parser reads the entire document into main memory, parses it and builds
a tree that can be traversed using methods provided for this purpose in the
API.
The following program is based on the example from the book. It will print
the titles of the CDs in the inventory:
import javax.xml.parsers.*;
import org.xml.sax.*;
import org.w3c.dom.*;
import java.io.*;

public class MyDOM {
    /** All output will use this encoding */
    static final String outputEncoding = "UTF-8";

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("Usage: java MyDOM <file.xml>");
            System.exit(1);
        }
        String filename = args[0];
        boolean validation = false;
        boolean ignoreWhitespace = true;
        boolean ignoreComments = false;
        boolean putCDATAIntoText = false;
        boolean createEntityRefs = false;

        // Step 1: create a DocumentBuilderFactory and configure it
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();

        // Optional: set various configuration options
        dbf.setValidating(validation);
        dbf.setIgnoringComments(ignoreComments);
        // This setting only has an effect when the parser is validating
        dbf.setIgnoringElementContentWhitespace(ignoreWhitespace);
        dbf.setCoalescing(putCDATAIntoText);
        dbf.setExpandEntityReferences(!createEntityRefs);

        // Step 2: create a DocumentBuilder
        // This is an example of a pattern called Factory
        DocumentBuilder db = null;
        try {
            db = dbf.newDocumentBuilder();
        } catch (ParserConfigurationException pce) {
            System.err.println(pce);
            System.exit(1);
        }

        // Set an ErrorHandler before parsing
        OutputStreamWriter errorWriter =
            new OutputStreamWriter(System.err, outputEncoding);
        db.setErrorHandler(
            new MyErrorHandler(new PrintWriter(errorWriter, true)));

        // Step 3: parse the input file
        Document doc = null;
        try {
            doc = db.parse(new File(filename));
        } catch (SAXException se) {
            System.err.println(se.getMessage());
            System.exit(1);
        } catch (IOException ioe) {
            System.err.println(ioe);
            System.exit(1);
        }

        Element root = doc.getDocumentElement();
        MyDOM.processInventory(root);
    } // end main

    // Sample methods
    private static void processInventory(Element n) {
        for (Node child = n.getFirstChild(); child != null;
             child = child.getNextSibling()) {
            // Skip whitespace text nodes and comments between the CDs
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                processCD(child);
            }
        }
    } // end processInventory

    // Process the CD
    // Every CD is expected to contain:
    // the title, the artist, the price and the quantity
    private static void processCD(Node n) {
        Node child = skipToElement(n.getFirstChild()); // Child is the title now
        if (child != null) {
            processTitle(child);
        }
        child = nextElement(child); // Child is the artist now
        child = nextElement(child); // Child is the price now
        child = nextElement(child); // Child is the quantity now
    } // end processCD

    // Returns n if it is an element, otherwise the next sibling that is one (or null)
    private static Node skipToElement(Node n) {
        while (n != null && n.getNodeType() != Node.ELEMENT_NODE) {
            n = n.getNextSibling();
        }
        return n;
    }

    // Returns the next element sibling after n, or null if there is none
    private static Node nextElement(Node n) {
        return (n == null) ? null : skipToElement(n.getNextSibling());
    }

    private static void processTitle(Node n) {
        // An element that just has PCData will have a child node
        // of type text and the node value that is the element's PCData
        Node child = n.getFirstChild(); // Child is the text for the title now
        String title = (child == null) ? "" : child.getNodeValue();
        System.out.println("The title of this CD is: " + title);
    }

    // Error handler to report errors and warnings
    private static class MyErrorHandler implements ErrorHandler {
        /** Error handler output goes here */
        private PrintWriter out;

        MyErrorHandler(PrintWriter out) {
            this.out = out;
        }

        /**
         * Returns a string describing parse exception details
         */
        private String getParseExceptionInfo(SAXParseException spe) {
            String systemId = spe.getSystemId();
            if (systemId == null) {
                systemId = "null";
            }
            String info = "URI=" + systemId +
                " Line=" + spe.getLineNumber() +
                ": " + spe.getMessage();
            return info;
        }

        // The following methods are standard SAX ErrorHandler methods.
        // See SAX documentation for more info.
        public void warning(SAXParseException spe) throws SAXException {
            out.println("Warning: " + getParseExceptionInfo(spe));
        }

        public void error(SAXParseException spe) throws SAXException {
            String message = "Error: " + getParseExceptionInfo(spe);
            throw new SAXException(message);
        }

        public void fatalError(SAXParseException spe) throws SAXException {
            String message = "Fatal Error: " + getParseExceptionInfo(spe);
            throw new SAXException(message);
        }
    }
} // end MyDOM
SAX Parsers
SAX parsers read the input file piece by piece. As each element is
read, an action is taken. The style of programming is based on callbacks
that are invoked as the elements are read. SAX processing is more involved
than DOM processing, but it does not require reading the entire document into
memory at once and hence it is more appropriate for very large documents that
may not fit in main memory.
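For comparison with the DOM program, here is a minimal sketch of the callback style, assuming the SAX parser bundled with the JDK. It is not from the book; the class names MySAX and TitleHandler are made up. The handler collects the text of each title element as the parser reports the start tag, the characters and the end tag.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class MySAX {
    // Callbacks invoked by the parser as it reads the document.
    static class TitleHandler extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        private StringBuilder current; // non-null only while inside a title element

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("title".equals(qName)) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so it is accumulated.
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                titles.add(current.toString());
                current = null;
            }
        }
    }

    static List<String> titles(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TitleHandler handler = new TitleHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<inventory>"
                + "<cd><title>Some Girls</title><artist>Rolling Stones</artist>"
                + "<price>25</price><qty>5</qty></cd>"
                + "<cd><title>Aniversario</title><artist>Pablo Milanes</artist>"
                + "<price>20</price><qty>5</qty></cd>"
                + "</inventory>";
        for (String title : titles(xml)) {
            System.out.println("The title of this CD is: " + title);
        }
    }
}
```

No tree is built: once a title has been recorded, the parser moves on, so memory use does not grow with the size of the rest of the document.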
Other uses of XML
There are many other possible uses of XML.
For instance, an entire framework for publishing documents based on XML has
been created. It is called Cocoon (more information at Apache Cocoon).
There is a language called XSLT (XSL Transformations) which allows the transformation
of an XML document into another XML document (more information at W3C information about XSLT).
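As a small illustration of such a transformation, the sketch below uses the javax.xml.transform API that ships with the JDK to turn an inventory document into a document that lists only the titles. The stylesheet, the output element name titles and the class name TitlesStylesheet are assumptions made for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class TitlesStylesheet {
    // An XSLT 1.0 stylesheet: for each cd, copy only its title into the result.
    static final String XSLT =
            "<xsl:stylesheet version=\"1.0\""
            + " xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
            + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
            + "<xsl:template match=\"/inventory\">"
            + "<titles>"
            + "<xsl:for-each select=\"cd\">"
            + "<title><xsl:value-of select=\"title\"/></title>"
            + "</xsl:for-each>"
            + "</titles>"
            + "</xsl:template>"
            + "</xsl:stylesheet>";

    // Applies the stylesheet to the given XML text and returns the result as text.
    static String transform(String xml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<inventory>"
                + "<cd><title>Some Girls</title><artist>Rolling Stones</artist>"
                + "<price>25</price><qty>5</qty></cd>"
                + "<cd><title>Aniversario</title><artist>Pablo Milanes</artist>"
                + "<price>20</price><qty>5</qty></cd>"
                + "</inventory>";
        System.out.println(transform(xml));
    }
}
```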
XML is being used as the basis for Web services (more information on Web services).
These are emerging technologies which might be very useful in the future.