(Retired) Section 09: XML

What is XML?

XML (eXtensible Markup Language) was designed to provide a standard and powerful way of describing any kind of data. It offers a widely adopted way of representing text and data in a format that can be processed without much human or machine intelligence. Information formatted in XML can be exchanged across platforms, languages, and applications, and can be used with a wide range of development tools and utilities.

HTML and XML - why two markup languages? The two languages were designed with different purposes in mind. XML and HTML look similar because both are closely related to SGML (Standard Generalized Markup Language), a markup definition language that has been an ISO standard since 1986. SGML was an early attempt to combine the metadata (data about the data!) with the data and was used primarily in large document management systems. Because SGML is a very complex language, it has limited mass appeal and we will have no further need for it in this module!

So let's deal with HTML first. HTML has been covered somewhat in earlier chapters, so you should already have a decent understanding of its structure. HTML (Hypertext Markup Language) is the most recognized application of SGML, and was devised to allow any Web browser or application which understands HTML to display information in a consistent form. An HTML document is effective when it comes to laying out and displaying data, but it uses a fixed set of tags and does not have the flexibility to describe different document and data types. HTML, in conjunction with Cascading Style Sheets (CSS), is reasonably good at displaying data, but it is not as good as XML at transporting data that is meant to be viewed or parsed in dozens of different ways by a variety of devices. In essence, HTML is a presentation language, whereas we require a richer means of exchanging information from one computer to another.

The need to extract data and put structure around information led to the creation of XML. Since XML 1.0 became a W3C Recommendation in 1998, its use has grown rapidly. There are two fundamental differences between HTML and XML:

  • Separation of form and content - HTML mostly consists of tags defining the appearance of text; in XML the tags generally define the structure and content of the data, with actual appearance specified by a specific application or associated stylesheet.

  • XML is extensible - tags can be defined by individuals or organisations for some specific application, whereas the HTML standard tagset is defined by the World Wide Web Consortium (W3C).

XML is not intended as a replacement for HTML; the two are complementary technologies. For the problem of sharing data on the Web, XML is a more general and better solution than extending HTML.

What about relational databases?   Traditional databases may be well suited for data that fits into rows and columns, but cannot adequately handle rich data such as audio, video, nested data structures or complex documents, which are characteristic of typical Web content. In order to deal with XML, many traditional databases are typically retrofitted with external conversion layers that mimic XML storage by translating it between XML and some other data format. This conversion is error-prone and results in a great deal of overhead, particularly with increasing transaction rates and document complexity.

XML databases, on the other hand, store XML data natively in its structured, hierarchical form. Queries can be resolved much faster because there is no need to map the XML data tree structure to tables. This preserves the hierarchy of the data and increases performance.

However, the large majority of web applications use relational database management systems (RDBMS) for their persistence layer. XML, on the other hand, has found its niche in a number of different areas:

  1. As a storage mechanism in certain circumstances when structures are complex
  2. As an information transfer mechanism in web services (more on this later)
  3. As a format for configuration files, since any data structure can be customised to a particular application


The following is a sample section from a possible XML document, to provide an example for you at this point. It is *not* a full XML document - we will discuss the structure of XML documents shortly and you will notice that we need a few extra lines to consider it to be a full document.

<employee>
  <ident>3348498</ident>
  <name>
    <lastname>Peterson</lastname>
    <firstname>Sam</firstname>
    <title>Dr.</title>
  </name>
  <phonedetails>
    <extension>8221</extension>
    <companyprefix>700</companyprefix>
    <regionprefix>1</regionprefix>
    <intprefix>+353</intprefix>
  </phonedetails>
  <department>
    <title>Software Development</title>
    <depid>8</depid>
  </department>
  <location>
    <building>Aston Quay</building>
    <room>A142</room>
  </location>
</employee>

While not necessarily the optimum structure for information such as above, it illustrates a major point of XML. The tags are defined by individuals, rather than some predefined standard structure - so we can do what we want!

There are two different kinds of information in the above example:

  • markup - such as <department> and <firstname>

  • text/character data - such as 'Peterson' and '+353'

XML documents mix markup and text together into a single file: the markup describes the structure of the document, while the text is the document's content.
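To make the distinction between markup and text concrete, here is a minimal sketch in Python (chosen purely for illustration; it uses only the standard library's xml.etree.ElementTree module). It parses a shortened copy of the employee sample and collects the element names (markup) separately from the character data (text). The function name split_markup_and_text is our own invention for this example.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the employee sample from the text.
EMPLOYEE_XML = """
<employee>
  <ident>3348498</ident>
  <name>
    <lastname>Peterson</lastname>
    <firstname>Sam</firstname>
    <title>Dr.</title>
  </name>
  <phonedetails>
    <extension>8221</extension>
    <intprefix>+353</intprefix>
  </phonedetails>
</employee>
"""


def split_markup_and_text(xml_source):
    """Return (element names, non-blank text values) in document order."""
    root = ET.fromstring(xml_source)
    tags, texts = [], []
    for element in root.iter():
        tags.append(element.tag)             # markup: the element name
        text = (element.text or "").strip()  # character data, if any
        if text:
            texts.append(text)
    return tags, texts


tags, texts = split_markup_and_text(EMPLOYEE_XML)
print("markup:", tags)
print("text:  ", texts)
```

Elements such as <name> contribute only markup here, because their content consists of other elements rather than text.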


Why XML? Key features!

There are a range of benefits associated with XML:

  • Simplicity - Information coded in XML is easy to read and understand, plus it can be processed easily by computers.

  • Self-Describing - unlike records in traditional database systems, XML data does not require relational schemata, file description tables, external data type definitions etc. because the data itself contains this information. XML also keeps data usable beyond a single application, which is imperative for business applications whose tasks extend beyond the mere presentation of content.

  • Open and Extensible - XML allows you to add other elements when needed. This means you can always adapt your system to address specification modifications.

  • Application Independence - Using XML, data is no longer dependent on a specific application for creation, viewing or editing. In this sense, XML is to data what Java is to applications. Java allows programs to run anywhere - XML allows data to be used by any application.

  • Data Format Integration - XML documents can contain any imaginable data type, from classical data like text and numbers to multimedia objects such as sounds and video, or active components like Applets.

  • One Data Source, multiple views - By formatting our data in a markup language, we allow computer applications to process and present this data to us in different ways. In contrast, HTML presents data in one fixed way.

  • Data Presentation Modification - You can change the look and feel of documents or even entire websites with XSL Style Sheets without manipulating the data itself.

  • Internationalization - Internationalization is important for electronic worldwide business applications. XML supports multilingual documents and the Unicode standard.

  • Future-Oriented - XML is the endorsed industry standard of the World Wide Web Consortium (W3C) and is supported by all leading software providers. Furthermore, XML is also the standard today in an increasing number of other industries, for example health care.

  • Improved Data Searches - Tags, attributes and element structure provide context information that can be used to interpret the meaning of content, opening up new possibilities for highly efficient search engines, intelligent data mining etc. An intelligent search engine against a body of XML-compliant markup languages would search both the content and the meta-data, which would drastically improve the accuracy of searches. This will obviously cause an increase in relevant and accessible data on a world-wide basis. Why do you think HTML searches are so basic?

  • Enables e-Commerce Transactions - An e-Commerce transaction requires instant cooperation between a host of agents involved in a single purchase. For example, a customer ordering an item from a supplier involves a number of transactions including those with the customer ("B2C e-Commerce"), businesses in a supply chain ("B2B e-Commerce"), bank ("B2B") and between systems ("enterprise integration"). The initial reaction of most companies was to integrate these diverse operations by building or buying software that employed protocols such as DCOM or CORBA to perform such integration. However, more recently, XML offers the option of performing the necessary integration by exchanging standardized data.


XML Document Structure

XML documents are intended to store data, not necessarily to be viewed. They follow a layout very similar to HTML. In HTML there are two main sections in a document defined by the HEAD and BODY tags. An XML document also contains two sections: the document prolog at the head of the document and the instance, or the body.


Prolog Section

The document prolog must be the first thing in an XML document - it is the introduction to the document. Here is a sample prolog of an XML document:

<?xml version="1.0"?>
<!DOCTYPE book SYSTEM "DTD/book.dtd">

The specification states that both parts of the prolog are optional. The first part is called the XML declaration and the second part is the document type declaration, which points to or contains the Document Type Definition (DTD). A DTD sets all the rules for the document regarding elements, attributes, and other components. The DTD may be either external or internal.

  • Internal DTD - An internal DTD document is contained completely within the XML document.

  • External DTD - An external DTD document is a separate document, referenced from within the XML document.

The example prolog above refers to an external DTD, which can be found at the local path 'DTD/book.dtd'. Any time you use a relative or absolute file path or a URL, you must use the SYSTEM keyword. The other option is to use the PUBLIC keyword, followed by a public identifier. This means that the W3C or another consortium has defined a standard DTD that is associated with that public identifier. For example:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

We will discuss DTDs in more detail shortly in this Chapter.
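As a small illustration of the SYSTEM and PUBLIC keywords, the following Python sketch (standard library only) reads the document type declaration back out of the two prologs shown above. The helper name describe_doctype is hypothetical. The parser used here only records the identifiers; it does not fetch or apply the DTD itself.

```python
from xml.dom import minidom

# The two prologs from the text, each followed by a minimal root element.
SYSTEM_DOC = '<?xml version="1.0"?><!DOCTYPE book SYSTEM "DTD/book.dtd"><book/>'
PUBLIC_DOC = (
    '<?xml version="1.0"?>'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
    '<html/>'
)


def describe_doctype(xml_source):
    """Return (root element name, public identifier, system identifier)."""
    doctype = minidom.parseString(xml_source).doctype
    return doctype.name, doctype.publicId, doctype.systemId


for source in (SYSTEM_DOC, PUBLIC_DOC):
    name, public_id, system_id = describe_doctype(source)
    print(name, "| public:", public_id, "| system:", system_id)
```

Note that the name after <!DOCTYPE matches the root element of the document in both cases.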


Instance Section

The instance contains the remaining parts of the XML document, including the actual contents of the document, such as characters, paragraphs, pages and graphics.

Elements

Elements are the most important part of an XML document. An element consists of content enclosed in an opening tag and a closing tag. An element can contain several different types of content:

  • Element Content - Contains only other elements. Example: the <name> element in <name><firstname>Tom</firstname><lastname>Smith</lastname></name>

  • Mixed Content - Contains both text and other elements. Example: the <para> element in <para>This point is a <emphasis>very important</emphasis> point.</para>

  • Simple Content - Contains only text. Example: <lastname>Molloy</lastname>

  • Empty Content - Does not contain information. Example: <image src="test.jpg"></image>

XML element names are case-sensitive, meaning that opening and closing tags must be written in the same case. XML elements require both a start tag and an end tag. Although with some elements in HTML (such as <BR>) you can omit the closing tag, all XML elements must include an end tag. Otherwise the XML would not be properly structured and would result in an error. For example, the following is malformed XML:

<title>Introduction to XML

The correct format would be:

<title>Introduction to XML</title>

Empty elements can also be written using a shorthand form, in which a single tag ends with />. The following two lines are equivalent:

<image src="test.jpg"></image>
<image src="test.jpg" />

XML Documents must be well-formed. Firstly, this means that you must follow the rules regarding case-sensitivity and always including closing tags. Additionally, you cannot mix the order of the nested tags: the first opened element must always be the last closed element. If any of the rules for XML syntax are not followed in an XML document, the document is not well-formed. The following is an example of an XML fragment, which is not well-formed:

<tag1> <tag2> </tag1> </tag2>

And now time to get a little confusing! A well-formed document is not necessarily valid. Valid XML must additionally follow the constraints set upon an XML document by its Document Type Definition or schema.
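The well-formedness rules above can be tried out with any XML parser. Here is a minimal sketch in Python, using the standard library's xml.etree.ElementTree module; the helper name is_well_formed and the sample labels are our own. Note that this parser checks well-formedness only: it does not validate a document against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_source):
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(xml_source)
        return True
    except ET.ParseError:
        return False


samples = {
    "closed element": "<title>Introduction to XML</title>",
    "missing end tag": "<title>Introduction to XML",
    "case mismatch": "<title>Introduction to XML</Title>",
    "bad nesting": "<tag1> <tag2> </tag1> </tag2>",
    "empty shorthand": '<image src="test.jpg" />',
}

for label, source in samples.items():
    print(label, "->", is_well_formed(source))
```

Only the first and last samples are accepted; the other three each break one of the rules described above.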

In XML, you can only have a single root element. That root element has subelements which may also further have subelements. The structure of an XML document is a tree of elements. So if you think of an element as a container, an XML document becomes a container of containers. Containers have a name associated with them (the element name) and possible additional characteristics (called attributes). The containers hold the content (or data) of the document. The start and end tags define the boundaries of the container.
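The "container of containers" idea can be sketched in a few lines of Python (standard library only). The recursive helper outline is a hypothetical name; it prints each element indented beneath its parent, which makes the tree structure visible. The last line also shows that a second root element is rejected by the parser.

```python
import xml.etree.ElementTree as ET

DOC = """
<employee>
  <name>
    <lastname>Peterson</lastname>
    <firstname>Sam</firstname>
  </name>
  <location>
    <building>Aston Quay</building>
    <room>A142</room>
  </location>
</employee>
"""


def outline(element, depth=0):
    """Return one indented line per element, children nested under parents."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


lines = outline(ET.fromstring(DOC))
print("\n".join(lines))


def has_single_root(xml_source):
    """Return False if the parser rejects the text (e.g. two root elements)."""
    try:
        ET.fromstring(xml_source)
        return True
    except ET.ParseError:
        return False


print(has_single_root("<employee/><employee/>"))
```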


Figure 9.1. XML Element Structure



Attributes

In addition to content, elements may have attributes. XML attributes work in much the same way as HTML attributes, allowing you to attach characteristics to an element. For example, in HTML:

<IMG SRC="images/test.jpg">

and in XML:

<image src="images/test.jpg" />

Attributes have a name and a value and are placed within the start tag. In the document type definition (DTD), you define the legal attributes for an element and what values are legal for that attribute. Again, we will cover creating DTDs shortly, so for now just consider that they define the structure of our XML document.

An element can have multiple attributes. While you can get away with omitting quotes for attributes in HTML, in XML the value must be surrounded by single or double quotes. When you use one type of quotes, the other type is legal within the quotes - for example:

<topic name="Brian O'Sullivan">

or

<topic name='The Use of "s in Popular Literature'>

In addition to learning how to use attributes, there is an issue of when to use attributes. Because XML allows such a variety of data formatting, it is rare that an attribute cannot be represented by an element, or that an element could not be easily converted to an attribute. Although there's no specification or widely accepted standard for determining when to use an attribute and when to use an element, there is a good rule of thumb: use elements for multiple-valued data and attributes for single-valued data. If data can have multiple values, or is very lengthy, the data most likely belongs in an element. To understand this, let us consider two formats for storing phone information:

<phone number="+35318008583" />

and

<phone>
  <intcode>+353</intcode>
  <localcode>1</localcode>
  <prefix>800</prefix>
  <extension>8583</extension>
</phone>

Using attributes in this case is obviously far simpler to write and less verbose. However, it would make searching our data for all phone numbers with an 800 prefix quite difficult. Equally, the multiple element format would make it easy to generate an internal phone book, only showing the local extensions.

Both formats are correct data formats which can be used. Essentially, which you use comes down to your own decision.
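The trade-off can be illustrated with a short Python sketch (standard library only). The <phonebook> wrapper and the second entry are hypothetical additions for the example. With the multi-element format, the search for an 800 prefix is a simple path expression; with the single attribute, we have to slice the string and hope every number has the same layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical phone book using the multi-element format from the text.
PHONEBOOK = """
<phonebook>
  <phone>
    <intcode>+353</intcode><localcode>1</localcode>
    <prefix>800</prefix><extension>8583</extension>
  </phone>
  <phone>
    <intcode>+353</intcode><localcode>1</localcode>
    <prefix>700</prefix><extension>8221</extension>
  </phone>
</phonebook>
"""

root = ET.fromstring(PHONEBOOK)

# Element format: select phones whose <prefix> child has the text 800.
extensions_800 = [
    phone.findtext("extension")
    for phone in root.findall("phone[prefix='800']")
]
print(extensions_800)

# Attribute format: the prefix is buried inside one string.
flat = ET.fromstring('<phone number="+35318008583" />')
number = flat.get("number")
prefix_guess = number[5:8]  # only correct if every number has this exact layout
print(prefix_guess)
```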


Entity References and Constants

Let us consider an XML file where we wish to include the data <HTML>. This is bound to happen when we are writing notes in XML that describe XML or HTML (like I am right now!). For example:

<chapter>
  <sect1>
    <title>Using HTML</title>
    <para>
      HTML is defined using tags, such as <HTML> and <BODY> .....
    </para>
  </sect1>
</chapter>

So what's the problem here? The problem here is that XML parsers will attempt to handle this data as an XML tag, and then generate an error because there is no closing tag! This is a common problem, as any use of angle brackets results in this behaviour. Entity References provide a way to overcome this problem. An entity reference is a special data type in XML used to refer to another piece of data. The entity reference consists of a unique name, preceded by an ampersand and followed by a semicolon: &[entityname];. When an XML parser sees an entity reference, the specified substitution value is inserted and no processing of that value occurs. XML defines five special entities to address this problem: &lt; for <, &gt; for >, &amp; for &, &quot; for " and &apos; for '. Using these entities it is possible to define the above example, in the following way:

<chapter>
  <sect1>
    <title>Using HTML</title>
    <para>
      HTML is defined using tags, such as &lt;HTML&gt; and &lt;BODY&gt; .....
    </para>
  </sect1>
</chapter>

Once this document is parsed, the data is interpreted as "<HTML> and <BODY>" and the document is still considered well-formed.
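The round trip described above can be sketched in Python (standard library only). The escape function from xml.sax.saxutils replaces &, < and > with the corresponding entity references, and the parser turns them back into the original characters. The <para> wrapper is simply taken from the example above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_text = "HTML is defined using tags, such as <HTML> and <BODY>"

# escape() replaces &, < and > with &amp;, &lt; and &gt;.
escaped = escape(raw_text)
document = "<para>" + escaped + "</para>"
print(document)

# After parsing, the entity references are back to the plain characters.
para = ET.fromstring(document)
print(para.text)
```

Without the call to escape, the parser would treat <HTML> as an unclosed element and reject the document.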

Using entities is not restricted to simply handling difficult escape characters within data. It is possible to use entities to effectively define 'variables' or 'constants' within your XML data. Consider the case where we repeatedly use the data 'Royal Society for the Prevention of Cruelty to Animals' in our XML document. Rather than type this every time, we declare the following in the DTD of our XML document (or of the root XML document if we use multiple subdocuments):

<!ENTITY rspca "Royal Society for the Prevention of Cruelty to Animals">

Then, when we wish to use this text within our XML document at any subsequent stage, we simply use the entity reference &rspca; to represent our constant. Likewise, the 'variable' representing the author's current email could be defined as an entity and referenced throughout the rest of the document. If the author's email address changes at a later date, then a simple change to the entity would modify the data throughout the rest of the document.
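Here is a minimal sketch of an entity used as a constant, written in Python with the standard library's ElementTree parser. The <report> document and its sentences are hypothetical; the entity is declared in an internal DTD so that the example is self-contained. We assume here that the parser expands internally declared entities, which the standard expat-based parser does.

```python
import xml.etree.ElementTree as ET

# The entity is declared once, inside an internal DTD, and used twice.
DOC = """<?xml version="1.0"?>
<!DOCTYPE report [
  <!ENTITY rspca "Royal Society for the Prevention of Cruelty to Animals">
]>
<report>
  <para>The &rspca; is a charity.</para>
  <para>Contact the &rspca; for details.</para>
</report>
"""

root = ET.fromstring(DOC)
paragraphs = [para.text for para in root.findall("para")]
for text in paragraphs:
    print(text)
```

Changing the single <!ENTITY ...> line changes the text in every paragraph that references it.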


Unparsed Data

In XML, there are three kinds of data that the parser does not treat as ordinary document content: comments, processing instructions (PIs) and character data (CDATA) sections. When the parser encounters the start of one of these, it stops interpreting markup until it finds the matching end marker.

Comments in XML are exactly like comments in HTML. Typically, they are ignored by most XML parsers.

<!-- this is a comment -->

Character Data (CDATA) sections allow you to put information that might be recognised as markup anywhere characters may occur. CDATA sections begin with <![CDATA[ and end with ]]>. The parser will ignore everything within the CDATA section. CDATA is also used when a significant amount of data should be passed to the calling application without any XML parsing, or when spacing must be preserved. Throughout these notes, CDATA is almost always used when it comes to displaying listings of programs or samples of XML and HTML where brackets and ampersands etc. are frequently used. It would not be practical in these situations to also use entity references repeatedly (although it could be done!) so we just use CDATA to display the block of unparsed code. Additionally, when it comes to program listings, it allows us to preserve spacing and layout of our sample code. So for example:

<para>
<![CDATA[
<HTML>
  <HEAD>
    <TITLE>Test HTML Page</TITLE>
  </HEAD>
  <BODY>
    <H1>Hello World!</H1>
  </BODY>
</HTML>
]]>
</para>
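To see what the parser does with a CDATA section, here is a minimal Python sketch (standard library only) based on the <para> example above. The angle brackets inside the section are delivered to the application as plain text, no child elements are created, and the spacing is preserved.

```python
import xml.etree.ElementTree as ET

DOC = """<para><![CDATA[
<HTML>
  <BODY>
    <H1>Hello World!</H1>
  </BODY>
</HTML>
]]></para>"""

para = ET.fromstring(DOC)
child_count = len(list(para))  # elements found inside <para>
print("child elements:", child_count)
print(para.text)
```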

Processing Instructions (PIs) allow XML documents to contain instructions for applications. Like comments, they are not part of the document's character data, so they are of little interest to the XML processor. However, they must be passed through to the proper application. A PI begins with <? and ends with ?>. The only construct of this form we have encountered so far is the XML declaration in the prolog (strictly speaking, the XML specification defines the declaration separately from PIs, but it uses the same syntax):

<?xml version="1.0"?>

We will come across PIs later in the notes, when we introduce stylesheets to our XML documents, but for now you should not be too concerned!



Document Type Definitions

Why do we need constraints on our XML documents? Because XML is extensible and can represent data in hundreds and thousands of ways, constraints on a document provide meaning to those various formats. Without document constraints, it is typically impossible to tell what the data in a document means.

A DTD is used to define the structure of an XML document as well as what content is allowed. An XML document is not very usable without an accompanying DTD (or schema). Just as XML can effectively describe data, the DTD makes this data usable for many different programs in a variety of ways by defining the structure of the data. A DTD declares all the legal elements in a document, the legal attributes those elements can have, and the hierarchy, nesting and occurrence indicators for all elements. A well-written DTD gives a consistent structure to the documents that use it, and helps the browser and XML parser process each document properly.

A DTD is a text file. As mentioned previously, the DTD information may either be embedded directly in the XML document (internal DTD) or linked to an external file containing the DTD (external DTD). Since an external DTD may be used by any document conforming to its definition, this method is generally preferred. This means that if you define a DTD, then you can write multiple XML documents which can use the same DTD. If you were to use the internal approach in this situation, then you would need to replicate the DTD information in each separate XML document.

We will concentrate on the two main types of markup declarations found in the DTD:

  • Element Declarations

  • Attribute List Declarations


Element Declarations

Element declarations identify the names of elements as well as the nature of their content. An element declaration begins with the standard <! opening of a DTD tag, followed by the ELEMENT keyword and then the name of the element. The following element declaration defines the name of the tag (Author) and the content model for the tag. The + notation means the <Author> tag must contain one or more <Name> tags.

<!ELEMENT Author (Name)+>

This means that if the Author tag is used within our document, the following are both valid:

<Author>
  <Name>Joe Smith</Name>
</Author>

or

<Author>
  <Name>Joe Smith</Name>
  <Name>Mary Jones</Name>
</Author>

However, if there was not at least one <Name> entry, then our XML document would not be valid!

There are four different DTD recurrence modifiers, with the following descriptions:

  • [Default] - Must appear once and only once (1)

  • ? - May appear once or not at all (0,1)

  • + - Must appear at least once, up to an infinite number (1...N)

  • * - May appear any number of times, including not at all (0...N)

To help with further understanding these modifiers, let us consider another example; that of a more detailed <Author> situation. Let us assume that we require one or more <Name> elements as before. Now, the <Name> element must have one <Firstname> element, one <Lastname> element and an optional list of <Qualification>s. We could define these DTD declarations as:

<!ELEMENT Author (Name+)>
<!ELEMENT Name (Firstname, Lastname, Qualification*)>
<!ELEMENT Qualification (#PCDATA)>
<!ELEMENT Firstname (#PCDATA)>
<!ELEMENT Lastname (#PCDATA)>

Using this DTD, the following XML snippet would be valid XML:

<Author>
  <Name>
    <Firstname>Joe</Firstname>
    <Lastname>Smith</Lastname>
    <Qualification>B.Eng</Qualification>
    <Qualification>PhD</Qualification>
    <Qualification>MIEEE</Qualification>
  </Name>
  <Name>
    <Firstname>Mary</Firstname>
    <Lastname>Jones</Lastname>
  </Name>
</Author>

The #PCDATA keyword signifies that the element contains parsed character data: text only, with no child elements. The parser still processes this text, so entity references such as &amp; are expanded. Besides content models built from child elements (as above), the other content descriptions are (#PCDATA), EMPTY, and ANY. EMPTY means that the element must have no content. ANY means that the element may contain text and any elements declared in the DTD, in any order.

Finally, it is possible to use the '|' symbol as an OR operation.

<!ELEMENT Figure (Graphic | Table | Screen-shot)>

which means that a Figure element will have either a graphic, table or screen-shot sub-element.
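The recurrence modifiers and the '|' operator behave very much like the same symbols in regular expressions. The following Python sketch (standard library only) exploits that similarity: it converts a simple content model into a regular expression over the names of an element's direct children. This is only a teaching aid under stated assumptions, not a DTD validator: it ignores #PCDATA, EMPTY, ANY and attributes, and the function names model_to_regex and children_match are our own.

```python
import re
import xml.etree.ElementTree as ET


def model_to_regex(model):
    """Turn a simple DTD content model into a regex over 'Tag;Tag;' strings."""
    pattern = model.replace(" ", "")
    # Each element name becomes a group matching that name plus a ';' marker.
    pattern = re.sub(
        r"[A-Za-z_][\w.\-]*",
        lambda match: "(?:" + re.escape(match.group(0)) + ";)",
        pattern,
    )
    # A DTD sequence (a, b) is plain concatenation in a regex.
    pattern = pattern.replace(",", "")
    # ?, + and * and | already mean the same thing in both notations.
    return re.compile(pattern)


def children_match(element, model):
    """Check an element's direct children against a content model."""
    child_names = "".join(child.tag + ";" for child in element)
    return model_to_regex(model).fullmatch(child_names) is not None


NAME_MODEL = "(Firstname, Lastname, Qualification*)"

joe = ET.fromstring(
    "<Name><Firstname>Joe</Firstname><Lastname>Smith</Lastname>"
    "<Qualification>PhD</Qualification></Name>"
)
no_lastname = ET.fromstring("<Name><Firstname>Joe</Firstname></Name>")

print(children_match(joe, NAME_MODEL))
print(children_match(no_lastname, NAME_MODEL))
print(children_match(ET.fromstring("<Author/>"), "(Name)+"))
```

The first check succeeds, while the other two fail because a required <Lastname> or <Name> element is missing.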


Attribute Declarations

Attributes supply additional information about elements. They are declared in the DTD through an attribute list, alongside the element declarations. Attribute declarations define which attributes can appear inside a tag, as well as the kinds of data the attributes can contain.

In a DTD, an attribute list declaration begins with the string literal <!ATTLIST, followed by the name of the element these attributes are for. After the name, you add one or more attribute declarations. An attribute declaration consists of three parts: the attribute name, its type, and a default declaration. The general form of an attribute declaration is:

<!ATTLIST elemName attName attType default-decl>

Most attributes with textual values will simply be of the type CDATA, as shown in the example below:

<!ATTLIST chapter
    title  CDATA #REQUIRED
    number CDATA #REQUIRED >

Remember, that we are using attributes now, as opposed to elements, so in this example DTD entry we are representing a structure within the XML document, which could appear as follows:

<chapter title="Servlets and their Application" number="4" />

You can also specify a set of values that an attribute must take on for the document to be considered valid. Each value must be a name token, so characters such as + and spaces are not permitted:

<!ATTLIST code type (Java | C | Cpp) #REQUIRED >

There are four attribute default declaration options:

  • #REQUIRED - A REQUIRED declaration means the attribute MUST be present with the element. If the XML data doesn't have a value, an error will occur.

  • #IMPLIED - An IMPLIED declaration is used if the attribute value is not required and a default value is not provided. The document writer can optionally include the attribute. The value does not have to be supplied in the XML data.

  • #FIXED - The FIXED keyword provides a default value that cannot be modified by the document author. The attribute is not necessarily required, but if it occurs it must have a specified value. If an alternate value exists in the XML data, then an error will occur.

  • Default Value - You can provide a default value for an attribute that will be used if the user does not override it. As an example consider the following:

    <!ATTLIST country name (UnitedStates|Canada|Other) "Other">


Now as a further example, consider the following five alternative DTD declarations for the same attribute, which we will refer to as rules 1 to 5:

<!ATTLIST person gender CDATA "male">
<!ATTLIST person gender CDATA #FIXED "male">
<!ATTLIST person gender CDATA #REQUIRED>
<!ATTLIST person gender CDATA #IMPLIED>
<!ATTLIST person gender (male|female) "male">

Here is an XML statement that will satisfy all of the above DTD statements:

<person gender="male">

This XML statement does not satisfy DTD rule 2 which requires a value of "male".

<person gender="female">

This XML statement fails DTD rule 5 (and rule 2) because "unknown" is not an acceptable value.

<person gender="unknown">
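The reasoning behind these three cases can be sketched in Python (standard library only). The AttributeRule class below is a simplified, hypothetical model of the five declarations, not a DTD validator; rule 1 is modelled as a plain default value of "male". The function failing_rules reports which rule numbers reject a given attribute value, where None stands for an omitted attribute.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class AttributeRule:
    """Simplified model of one ATTLIST declaration for a single attribute."""
    number: int
    required: bool = False                     # #REQUIRED
    fixed: Optional[str] = None                # value imposed by #FIXED
    allowed: Optional[Tuple[str, ...]] = None  # enumerated values, if any
    default: Optional[str] = None              # value used when omitted

    def accepts(self, value):
        """Return True if this attribute value (None = omitted) is valid."""
        if value is None:
            return not self.required
        if self.fixed is not None and value != self.fixed:
            return False
        if self.allowed is not None and value not in self.allowed:
            return False
        return True


RULES = [
    AttributeRule(1, default="male"),                            # CDATA + default
    AttributeRule(2, fixed="male"),                              # #FIXED
    AttributeRule(3, required=True),                             # #REQUIRED
    AttributeRule(4),                                            # #IMPLIED
    AttributeRule(5, allowed=("male", "female"), default="male"),
]


def failing_rules(value):
    """Return the numbers of the rules that reject this attribute value."""
    return [rule.number for rule in RULES if not rule.accepts(value)]


for value in ("male", "female", "unknown", None):
    print(repr(value), "fails rules", failing_rules(value))
```

The sketch also covers a case not listed above: omitting the attribute altogether fails only rule 3, since #REQUIRED is the only declaration that insists on the attribute being present.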

XML Schemas

A schema is a model that describes the structure of information. The term originated in the database field, describing the structure of data in relational tables (database schema). XML Schema is a W3C Recommendation, first published in 2001. It seeks to improve on DTDs by adding more types and quite a few more constructs, and by being written in XML itself. For example, DTDs have no facility for describing data such as numbers, dates, and currency values, nor is there the ability to express the data type of character data in elements.

However, in this incarnation of the module, we will not be covering XML Schemas in further detail. They are largely replacing DTDs due to their greater range of functionality. However, this greater functionality brings greater complexity, and covering it in sufficient detail would take up too much of the module. The concepts remain the same as those provided by DTDs.



Parsing and Validating XML

An XML parser is used to read XML documents, providing access to their content and structure. The parser handles the important task of taking a raw XML document as input and making sense of the document; it will ensure that the document is well-formed, and if a DTD or schema is referenced, it may be able to ensure that the document is valid.


Essential XML Editor (Trial Version)

The Essential XML Editor is a trial tool for XML document editing. It includes a built-in XML well-formedness tester and a DTD validator. The home page for the Essential XML Editor can be found at http://www.philo.de/xmledit/, where students can download and install the application.

For now, we are principally interested in using the Essential XML Editor to test for well-formedness and validation.

We will explain how to use the XML Editor by working through the next sample question.


Examples

We have been introducing large amounts of rules, suggestions and information regarding the creation of XML and corresponding DTD files. It is time to create some data structures of our own. Let us consider the following problems:

Sample Question 1

We wish to store data on customers of a bank. Each customer has the following structure:

  • Customer Name (Firstname, Lastname, Title) (required)

  • Account Type (Cashsave, SSIA, Current, Savings) (1 or more accounts)

  • The balance within the account (required)

  • Date of Account Creation (optional)

Generate an XML document which could represent this information (make two sample customers) and a corresponding DTD against which you should validate the document.

Sample Question 1: Answer

Let's first attempt to draw a diagram to show the structure we require. We can call the 'root element' <customers> and work from there.

Figure 3.2. Choosing a Document Structure (Q1)


Using this diagram, we can easily see the structure and nesting of our data. Now, we need to decide whether we should write the sample XML document first or the DTD. Either is suitable, but as a preference I tend to write some data first and will follow that preference here. You should decide for yourself which you prefer: try both and make a decision.

The first stage is obviously to open up our XML Editor application. This has been installed on the machines in the Masters laboratories or can be downloaded and easily installed as detailed above.

  1. Run the XML Editor Application and select new.

  2. Type in your XML Document

  3. Select File->Save and save it with a suitable name. In this example, I called it customers.xml

  4. Select New, which should open up a new tab called 'Untitled1'

  5. Write your DTD and select File->Save. You should store it in the relative location specified in the customers.xml file. In this case, that is the same directory, in a file called customers.dtd.

  6. Go to your XML document window and select Tools->Check Validity to test your document against the DTD you have created.

  7. If your document is validated and meets the problem parameters, then you are complete!

  8. Note: When opening DTD files you may get the following '[Fatal Error] users.dtd: Start of comment expected'. Please simply ignore this as we will only be performing validation on the XML file (using the DTD files of course).

Below is my attempt at how the data should be structured within the XML document:

<?xml version="1.0" encoding="UTF-16"?>
<!DOCTYPE customers SYSTEM "customers.dtd">
<customers>
  <customer>
    <name>
      <firstname>David</firstname>
      <lastname>Molloy</lastname>
      <title>Mr.</title>
    </name>
    <account type="Cashsave">
      <date>02/02/89</date>
      <balance>1202.80</balance>
    </account>
    <account type="Current">
      <date>13/09/98</date>
      <balance>505.60</balance>
    </account>
  </customer>
  <customer>
    <name>
      <firstname>Maeve</firstname>
      <lastname>O'Reilly</lastname>
      <title>Ms.</title>
    </name>
    <account type="SSIA">
      <date>01/03/02</date>
      <balance>5500.34</balance>
    </account>
  </customer>
</customers>

Source: customers.xml

The corresponding DTD can be created by using either the XML document or the figure as a guideline. The following is my suggested DTD structure for this problem:

<!ELEMENT customers (customer*) >
<!ELEMENT customer (name, account+) >
<!ELEMENT name (firstname, lastname, title) >
<!ELEMENT firstname (#PCDATA)>
<!ELEMENT lastname (#PCDATA)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT account (date?, balance) >
<!ATTLIST account type (Cashsave|Current|Savings|SSIA) #REQUIRED >
<!ELEMENT date (#PCDATA) >
<!ELEMENT balance (#PCDATA) >

Source: customers.dtd

After your DTD and XML document are both written, you should attempt to validate the XML document against the DTD. This process should highlight problems in either file. Once the document validates, you can add further XML data to test the various constraints set out in the question. Once you are happy that the objectives are met, you are finished!
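
One way to get a feel for what the validator checks is to treat each content model in customers.dtd as a regular expression over child element names. The following Python sketch does this by hand. The CONTENT_MODELS table and the find_violations function are illustrative names and not part of any XML tool, and the check covers only element order and the account type attribute, so it is a learning aid rather than a substitute for validating in the XML Editor.

```python
import re
import xml.etree.ElementTree as ET

# Content models from customers.dtd, hand-translated into regular expressions
# over child element names, each name followed by a single space.
# This table is an illustration only; a real validator reads the DTD itself.
CONTENT_MODELS = {
    "customers": r"(customer )*",         # (customer*)
    "customer": r"name (account )+",      # (name, account+)
    "name": r"firstname lastname title ", # (firstname, lastname, title)
    "account": r"(date )?balance ",       # (date?, balance)
}
ACCOUNT_TYPES = {"Cashsave", "Current", "Savings", "SSIA"}


def find_violations(xml_text):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    root = ET.fromstring(xml_text)
    for element in root.iter():
        model = CONTENT_MODELS.get(element.tag)
        if model is not None:
            # Build e.g. "name account account " from the child elements.
            children = "".join(child.tag + " " for child in element)
            if not re.fullmatch(model, children):
                problems.append(
                    f"<{element.tag}> has unexpected children: {children.strip()!r}"
                )
        if element.tag == "account" and element.get("type") not in ACCOUNT_TYPES:
            problems.append(f"<account> has type {element.get('type')!r}")
    return problems


NAME = ("<name><firstname>Maeve</firstname><lastname>O'Reilly</lastname>"
        "<title>Ms.</title></name>")

# One account with no <date>: allowed, because the DTD says date?.
GOOD = ("<customers><customer>" + NAME +
        "<account type='SSIA'><balance>5500.34</balance></account>"
        "</customer></customers>")

# No account at all: not allowed, because the DTD says account+.
NO_ACCOUNT = "<customers><customer>" + NAME + "</customer></customers>"

# Account type outside the enumerated list in the ATTLIST.
BAD_TYPE = GOOD.replace("SSIA", "Deposit")

for label, text in [("GOOD", GOOD), ("NO_ACCOUNT", NO_ACCOUNT), ("BAD_TYPE", BAD_TYPE)]:
    print(label, find_violations(text))
```

Test documents like NO_ACCOUNT and BAD_TYPE are exactly the kind of "further XML data" worth trying in the XML Editor: each one breaks a single constraint, so you can confirm that the DTD rejects it for the reason you expect.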


WARNING: Make sure there is no whitespace in front of your first <?xml ...> line. Leading whitespace will cause both validity errors and an inability to later do transformations. This is the cause of most of the problems you are likely to encounter.
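
The effect of leading whitespace is easy to reproduce outside the XML Editor. The short Python sketch below uses the standard library parser (not the course's XML Editor, so the exact error text will differ) to parse the same document with and without whitespace before the declaration. The parse_outcome helper is a name made up for this illustration.

```python
import xml.etree.ElementTree as ET

DOCUMENT = '<?xml version="1.0"?><customers></customers>'


def parse_outcome(xml_text):
    """Return 'ok' if the text parses, otherwise the parser's error message."""
    try:
        ET.fromstring(xml_text)
        return "ok"
    except ET.ParseError as error:
        return str(error)


# Declaration is the very first thing in the document.
clean = parse_outcome(DOCUMENT)

# Same document, but with two spaces and a newline in front of <?xml ...>.
padded = parse_outcome("  \n" + DOCUMENT)

print("clean: ", clean)
print("padded:", padded)
```

The first document parses, while the second is rejected before any of its content is examined, which is why the problem shows up as a failure of the whole file rather than of one element.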


Sample Question 2

Your boss, within the University in which you work, has informed you that he wishes you to manage authentication for the new Virtual Learning System, which is being developed by a team of programmers. He tells you that he wants all user details to be held in one XML file called users.xml, which should be validated by a corresponding DTD file called users.dtd. The following data structures need to be followed:

  • Users of the system can either be students or lecturers

  • Data on either type to be maintained: firstname, surname, title(optional), username, email address, full address details

  • If a student exists, they must be registered on either one programme (store the programme name, a four-digit code, a one-digit year and whether the programme is active, i.e. not deferred) or no programme at all

  • Both students and lecturers can be registered for 0 or more modules (store semester 1 or 2 and module id as a five-digit code)

  • Lecturers must have a staff id

Generate an XML file containing data for at least two users (one lecturer, one student) and create a corresponding and 'clever' document type definition (DTD) file. Use both elements and attributes to represent the data.


Sample Question 2: Answer

There are a few different approaches which could be taken on this question. Depending on how comfortable you are with XML structures, you may decide to write the Document Type Definition (DTD) first and then write the XML document. Alternatively, you can write the XML document first. My own approach would be to decide upon the XML document structure first, filling in some sample data. In order to make this easier, you may decide that you want to lay out a diagram first, such as that below.

In the diagram, you can see that we first choose the root element, which we call <user>. This in turn is split into either <lecturer> or <student>, both marked with * to indicate that there can be 0 or more occurrences of either user type. The two user types share some common data, such as firstname, surname, title and username (which we group under an element <name>). You can see how at certain points in the diagram we create elements, such as <name>, <contact> and <address>, to 'cleverly' structure our XML data. We colour some elements/attributes in blue to indicate that they have no sub-elements and are typically PCDATA type.

Figure 3.3. Choosing a Document Structure (Q2)


So let us write the XML structure for three users. Essentially at this stage, we just use the diagram (if we created one, or else simply use the problem specification if you are able) to create our XML document structure, inserting sample data as we go. The following XML document represents the structure as seen in the diagram:

<?xml version="1.0" encoding="UTF-16"?>
<!DOCTYPE user SYSTEM "users.dtd">
<user>
  <student>
    <name>
      <firstname>Joe</firstname>
      <surname>Smith</surname>
      <title>Mr.</title>
      <username>smithj</username>
    </name>
    <contact>
      <address>
        <street>54 Maple Rise, Santry</street>
        <county>Dublin</county>
        <country>Ireland</country>
      </address>
      <email>smithj@dcu.ie</email>
    </contact>
    <programme active="true">
      <progname>M.Eng in Electronic Systems</progname>
      <code>MEN</code>
      <year>1</year>
    </programme>
    <module semester="2">
      <modid>EE557</modid>
    </module>
    <module semester="1">
      <modid>EE553</modid>
    </module>
  </student>
  <student>
    <name>
      <firstname>Ann</firstname>
      <surname>Ryan</surname>
      <title>Ms.</title>
      <username>ryana</username>
    </name>
    <contact>
      <address>
        <street>13 The Heath, Artane,</street>
        <county>Dublin</county>
        <country>Ireland</country>
      </address>
      <email>ryana@ireland.com</email>
    </contact>
  </student>
  <lecturer>
    <name>
      <firstname>David</firstname>
      <surname>Molloy</surname>
      <title>Mr.</title>
      <username>molloyda</username>
    </name>
    <contact>
      <address>
        <street>123 Fake Street</street>
        <county>Dublin</county>
        <country>Ireland</country>
      </address>
      <email>David.Molloy@dcu.ie</email>
    </contact>
    <staffid>4853212</staffid>
    <module semester="2">
      <modid>EE557</modid>
    </module>
  </lecturer>
</user>

Source: users.xml

The XML document above follows the project data specification. You will notice that we grouped the physical and email address details into an element called <contact>. While this was not specified, it does not violate the specification and makes structural sense. Note: There is more than one possible set of answers to this question. Now that we have our sample XML document, we should construct the DTD. When you are working on the DTD, you should start with the root element (user in our case), and work from element to element down the tree. Naturally, the student element and lecturer elements come next in this tree.

<!ELEMENT user ( student*, lecturer* ) >
<!ELEMENT student ( name, contact, programme?, module* ) >
<!ELEMENT lecturer ( name, contact, staffid, module* ) >
<!ELEMENT name ( firstname, surname, title?, username ) >
<!ELEMENT contact ( address, email ) >
<!ELEMENT programme ( progname, code, year ) >
<!ATTLIST programme active CDATA "true" >
<!ELEMENT module ( modid ) >
<!ATTLIST module semester (1|2) #REQUIRED >
<!ELEMENT staffid ( #PCDATA ) >
<!ELEMENT firstname ( #PCDATA ) >
<!ELEMENT surname ( #PCDATA ) >
<!ELEMENT title ( #PCDATA ) >
<!ELEMENT username ( #PCDATA ) >
<!ELEMENT address ( street, county, country ) >
<!ELEMENT street ( #PCDATA ) >
<!ELEMENT county ( #PCDATA ) >
<!ELEMENT country ( #PCDATA ) >
<!ELEMENT email ( #PCDATA ) >
<!ELEMENT progname (#PCDATA ) >
<!ELEMENT code ( #PCDATA ) >
<!ELEMENT year ( #PCDATA ) >
<!ELEMENT modid ( #PCDATA ) >

Source: users.dtd

You should notice that the DTD is a direct textual description of the diagram we have seen above. When you are designing your XML structures, spend a little time thinking about which structure would best represent your data before proceeding to create your files.

Once both the XML document and the DTD have been created, you should validate the XML document against the DTD. If you have made any syntax errors and the document is not well-formed, these will be picked up immediately. In addition, the parser will inform you whether your document is 'valid', that is, whether it conforms to the specifications within the DTD. Once your XML document is determined to be 'valid' and you are happy that you have met the requirements of the problem specification, you are finished!
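
The difference between 'well-formed' and 'valid' can be seen with a small Python sketch. It uses the standard library parser, which checks well-formedness only and does not validate against a DTD, so the second document below is accepted even though users.dtd would reject it. The is_well_formed helper and the two sample strings are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text obeys XML syntax rules; the DTD is not consulted."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Not well-formed: <name> is opened but never closed before </lecturer>.
BROKEN_SYNTAX = "<user><lecturer><name>David</lecturer></user>"

# Well-formed, but it would not be valid against users.dtd, which requires
# a lecturer to contain name, contact and staffid in that order.
WRONG_STRUCTURE = "<user><lecturer><staffid>4853212</staffid></lecturer></user>"

print("BROKEN_SYNTAX well-formed?  ", is_well_formed(BROKEN_SYNTAX))
print("WRONG_STRUCTURE well-formed?", is_well_formed(WRONG_STRUCTURE))
```

A syntax error of the first kind stops parsing altogether, whereas a structural error of the second kind is only reported by a validating parser such as the one behind Tools->Check Validity in the XML Editor.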


Self-Learning

And now some problems for you! You should try the following two examples and, for each, create an XML document and corresponding DTD. Test these in the XML Editor and experiment until you are happy that your structures and constraints satisfy the goals of the problems!

Problem 1

You are responsible for creating an XML database structure for a typical library system. Your XML document will be used to store information on all books within the library and they must adhere to certain specifications as follows:

  • Required Information: One or more authors (firstname, lastname, title(optional)), publisher, year, book title, ISBN Number

  • Optional Information: Revision Number, First print date

  • Optional zero or more keywords describing the book content

Using both elements and attributes, create an XML document and DTD constraint to represent this data model. Provide examples of data for three books to illustrate the required specifications.


Problem 2

You must create an XML data model which represents sales within a Motor Sales Company. Taking real-life factors into account, decide upon a sensible structure for representing such data and impose your own restrictions. Use at least 20 elements or attributes to represent a possible sales database. Then create your XML document and DTD. Provide examples of data for three sales to illustrate the required specifications. Note: If you are completely lost on this, some sample data might include vehicle type, model etc. and details on the customer.




