What is XML?

XML (eXtensible Markup Language) was invented to provide a standard and powerful way of describing any kind of data. It offers a widely adopted standard for representing text and data in a format that can be processed without much human or machine intelligence. Information formatted in XML can be exchanged across platforms, languages, and applications, and can be used with a wide range of development tools and utilities.

HTML and XML - why two markup languages?

The two markup languages were designed with different purposes in mind. XML is actually quite similar to HTML - both are closely related to SGML (Standard Generalized Markup Language), a markup definition language that has been an ISO standard since 1986. SGML is an early attempt to combine the metadata (data about the data!) with the data and was used primarily in large document management systems. Because SGML is a very complex language, it has limited mass appeal and we will have no further need for it in this module!

So let's deal with HTML first. HTML has been covered somewhat in earlier chapters, so you should already have a decent understanding of its structure. HTML (Hypertext Markup Language) is the most recognized application of SGML, and was devised to allow any Web browser or application which understands HTML to display information in a consistent form. An HTML document is effective when it comes to laying out and displaying data, but HTML is a fixed set of tags and does not have the flexibility to describe different document and data types. HTML, in conjunction with Cascading Style Sheets (CSS), is reasonably good at displaying data, but it is not as good as XML at transporting data that is meant to be viewed or parsed in dozens of different ways by a variety of devices. In essence, HTML is a presentation language, whereas we require a richer means of communication for exchanging information from one computer to another.
The need to extract data and put structure around information led to the creation of XML. Since XML 1.0 became a W3C Recommendation in 1998, its use has grown rapidly. There are two fundamental differences between HTML and XML:
XML is not intended as a replacement for HTML; the two are complementary technologies. XML is a more general and better solution to the problem of sharing data on the Web than extending HTML.

What about relational databases?

Traditional databases may be well suited for data that fits into rows and columns, but cannot adequately handle rich data such as audio, video, nested data structures or complex documents, which are characteristic of typical Web content. In order to deal with XML, many traditional databases are retrofitted with external conversion layers that mimic XML storage by translating between XML and some other data format. This conversion is error-prone and results in a great deal of overhead, particularly with increasing transaction rates and document complexity. XML databases, on the other hand, store XML data natively in its structured, hierarchical form. Queries can be resolved much faster because there is no need to map the XML data tree structure to tables. This preserves the hierarchy of the data and increases performance. However, over 90% of all web applications use an RDBMS for their persistence layer. XML, on the other hand, has found its niche in a number of different areas:
The following is a sample section from a possible XML document, to provide an example for you at this point. It is *not* a full XML document - we will discuss the structure of XML documents shortly and you will notice that we need a few extra lines to consider it to be a full document.
While not necessarily the optimum structure for information such as above, it illustrates a major point of XML. The tags are defined by individuals, rather than some predefined standard structure - so we can do what we want! There are two different kinds of information in the above example:
XML documents mix markup and text together into a single file: the markup describes the structure of the document, while the text is the document's content.

Why XML? Key features!

There is a range of benefits associated with XML:
XML Document Structure

XML documents are intended to store data, not necessarily to be viewed. They follow a layout very similar to HTML. In HTML there are two main sections in a document, defined by the HEAD and BODY tags. An XML document also contains two sections: the document prolog at the head of the document and the instance, or the body.

Prolog Section

The document prolog must be the first thing in an XML document - it is the introduction to the document. Here is a sample prolog of an XML document:
The specification states that both parts of the prolog are optional. The first part is called the XML declaration; the second part is the document type declaration, which points to or contains the Document Type Definition. A Document Type Definition (DTD) sets all the rules for the document regarding elements, attributes, and other components. The DTD may be either an external DTD or an internal DTD.
The example prolog above refers to an external DTD, which can be found at the local system path 'DTD/book.dtd'. Any time you use a relative or absolute file path or a URL, you must use the SYSTEM keyword. The other option is to use the PUBLIC keyword, followed by a public identifier. This means that the W3C or another consortium has defined a standard DTD that is associated with that public identifier. For example:
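As a rough illustration of the two forms, the Python sketch below parses one prolog that uses SYSTEM and one that uses PUBLIC, then reads back the identifiers. The root element names are assumptions made for this sketch (only the path 'DTD/book.dtd' comes from the text above), and the PUBLIC example uses the well-known XHTML 1.0 Strict identifier. Note that the standard-library parser only records the identifiers here; it does not fetch the DTD or validate against it.

```python
from xml.dom import minidom

# SYSTEM form: the DTD is located by a file path or URL.
# The root element name "book" is an assumption for this sketch.
system_doc = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE book SYSTEM "DTD/book.dtd">\n'
    '<book/>'
)

# PUBLIC form: a public identifier, followed by a system identifier
# that acts as a fallback location.
public_doc = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"\n'
    '  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html/>'
)


def doctype_info(text):
    """Return (root name, public identifier, system identifier)."""
    doctype = minidom.parseString(text).doctype
    return doctype.name, doctype.publicId, doctype.systemId


for sample in (system_doc, public_doc):
    print(doctype_info(sample))
```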
We will discuss DTDs in more detail shortly in this chapter.

Instance Section

The instance contains the remaining parts of the XML document, including the actual contents of the document, such as characters, paragraphs, pages and graphics.

Elements

Elements are the most important part of an XML document. An element consists of content enclosed in an opening tag and a closing tag. An element can contain several different types of content:
XML element names are case-sensitive, meaning that opening and closing tags must be written in the same case. XML documents require both a begin and an end tag. Although in HTML you can frequently omit the closing tag for some elements (such as <BR>), all XML elements must include an end tag. Otherwise the XML would not be properly structured and would result in an error. For example, the following is incorrect XML:
The correct format would be:
Empty elements can be written using the following shorthand:
XML documents must be well-formed. Firstly, this means that you must follow the rules regarding case-sensitivity and always include closing tags. Additionally, you cannot mix the order of nested tags: the first opened element must always be the last closed element. If any of the rules for XML syntax are not followed in an XML document, the document is not well-formed. The following is an example of an XML fragment which is not well-formed:
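The well-formedness rules above can be tried out with any XML parser. The following is a minimal Python sketch using the standard library; the element names are made up for illustration, and the helper `is_well_formed` is a hypothetical name introduced here. It only tests well-formedness, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


samples = [
    "<note><to>Ann</to></note>",   # properly nested
    "<note><to>Ann</note></to>",   # nesting order mixed up
    "<note><To>Ann</to></note>",   # case of the tags differs
    "<note><br></note>",           # missing end tag
    "<note><br/></note>",          # empty-element shorthand
    "<a/><b/>",                    # more than one root element
]

for sample in samples:
    print(sample, "->", is_well_formed(sample))
```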
And now time to get a little confusing! A well-formed document is not necessarily valid. Valid XML must additionally follow the constraints set upon an XML document by its Document Type Definition or schema.

In XML, you can only have a single root element. That root element has subelements, which may in turn have further subelements. The structure of an XML document is a tree of elements. So if you think of an element as a container, an XML document becomes a container of containers. Containers have a name associated with them (the element name) and possibly additional characteristics (called attributes). The containers hold the content (or data) of the document. The start and end tags define the boundaries of the container.

Figure 9.1. XML Element Structure

Attributes

In addition to content, elements may have attributes. XML attributes work much like HTML attributes, allowing you to attach characteristics to an element. For example in HTML:
Attributes have a name and a value and are placed within the start tag. In the document type definition (DTD), you define the legal attributes for an element and what values are legal for each attribute. Again, we will cover creating DTDs shortly, so for now just consider that they define the structure of our XML document. An element can have multiple attributes. While you can get away with omitting quotes for attributes in HTML, in XML the value must be surrounded by single or double quotes. When you use one type of quote to delimit the value, the other type may appear inside the value - for example:
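A small Python sketch of the quoting rule, using the standard-library parser. The element and attribute names are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Double quotes delimit the value, so a single quote may appear inside it.
double_quoted = ET.fromstring("""<quote text="It's easy"/>""")

# Single quotes delimit the value, so double quotes may appear inside it.
single_quoted = ET.fromstring("""<quote text='She said "hi"'/>""")

print(double_quoted.get("text"))
print(single_quoted.get("text"))

# Unlike in HTML, leaving the quotes out is an error in XML.
try:
    ET.fromstring("<quote text=hi/>")
    unquoted_accepted = True
except ET.ParseError:
    unquoted_accepted = False

print("unquoted value accepted:", unquoted_accepted)
```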
In addition to learning how to use attributes, there is an issue of when to use attributes. Because XML allows such a variety of data formatting, it is rare that an attribute cannot be represented by an element, or that an element could not be easily converted to an attribute. Although there's no specification or widely accepted standard for determining when to use an attribute and when to use an element, there is a good rule of thumb: use elements for multiple-valued data and attributes for single-valued data. If data can have multiple values, or is very lengthy, the data most likely belongs in an element. To understand this, let us consider two formats for storing phone information:
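The Python sketch below contrasts the two styles. The element names, attribute name and phone numbers are all hypothetical and are not the formats from the original listing. It shows that the attribute style needs string handling to pick out a prefix, while the element style exposes each part of the number directly.

```python
import xml.etree.ElementTree as ET

# Attribute style: the whole number is a single value.
attribute_style = ET.fromstring("""
<contacts>
  <phone number="800-555-0100 x12"/>
  <phone number="212-555-0199 x47"/>
</contacts>
""")

# Element style: each part of the number has its own element.
element_style = ET.fromstring("""
<contacts>
  <phone><prefix>800</prefix><number>555-0100</number><extension>12</extension></phone>
  <phone><prefix>212</prefix><number>555-0199</number><extension>47</extension></phone>
</contacts>
""")

# With attributes we must know how the string is laid out.
toll_free_attr = [
    phone.get("number")
    for phone in attribute_style.iter("phone")
    if phone.get("number").startswith("800-")
]

# With elements we can ask for the prefix or the extension by name.
toll_free_elem = [
    phone.findtext("number")
    for phone in element_style.iter("phone")
    if phone.findtext("prefix") == "800"
]
extensions = [phone.findtext("extension") for phone in element_style.iter("phone")]

print(toll_free_attr, toll_free_elem, extensions)
```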
Using attributes in this case is obviously far simpler to write and less verbose. However, it would make searching our data for all phone numbers with an 800 prefix quite difficult. Equally, the multiple element format would make it easy to generate an internal phone book, only showing the local extensions. Both formats are correct data formats which can be used. Essentially, which you use comes down to your own decision.

Entity References and Constants

Let us consider an XML file where we wish to include the data <HTML>. This would certainly arise if we were writing notes in XML describing XML or HTML (like I am right now!). For example:
So what's the problem here? The problem is that XML parsers will attempt to handle this data as an XML tag, and then generate an error because there is no closing tag! This is a common problem, as any use of angle brackets results in this behaviour. Entity references provide a way to overcome this problem. An entity reference is a special data type in XML used to refer to another piece of data. The entity reference consists of a unique name, preceded by an ampersand and followed by a semicolon: &[entityname];. When an XML parser sees an entity reference, the specified substitution value is inserted and no processing of that value occurs. XML defines five special entities to address this problem: &lt; for <, &gt; for >, &amp; for &, &quot; for " and &apos; for '. Using these entities it is possible to define the above example, in the following way:
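A minimal Python sketch of the same idea: the raw text is rejected by the parser, while the escaped form parses and yields the original characters. The <note> element name is an assumption for this example; `escape` from the standard library replaces &, < and > with their entity references.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "<HTML> and <BODY>"

# Placing the raw text inside an element makes the parser read
# <HTML> and <BODY> as unclosed tags.
try:
    ET.fromstring(f"<note>{raw}</note>")
    raw_accepted = True
except ET.ParseError:
    raw_accepted = False

# escape() turns the angle brackets into entity references.
escaped = escape(raw)
note = ET.fromstring(f"<note>{escaped}</note>")

print(escaped)
print(note.text)
```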
Once this document is parsed, the data is interpreted as "<HTML> and <BODY>" and the document is still considered well-formed. Using entities is not restricted to simply handling difficult escape characters within data. It is possible to use entities to effectively define 'variables' or 'constants' within your XML data. Consider the case where we repeatedly use the data 'Royal Society for the Prevention of Cruelty to Animals (RSPCA)' in our XML document. Rather than typing this every time, at the top of our XML document (or of the root XML document if we use multiple subdocuments) we define the following:
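As a hedged sketch of what such a definition can look like, the Python example below declares an entity named rspca in an internal DTD subset and lets the standard-library parser expand it. The root element <notes>, the <para> element and the sample sentence are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

document = """<?xml version="1.0"?>
<!DOCTYPE notes [
  <!ENTITY rspca "Royal Society for the Prevention of Cruelty to Animals (RSPCA)">
]>
<notes>
  <para>Donations to the &rspca; are welcome.</para>
</notes>
"""

root = ET.fromstring(document)

# The parser has replaced &rspca; with the declared text.
print(root.findtext("para"))
```

Changing the entity declaration once changes every place where the reference is used.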
Then, when we wish to use this text within our XML document at any subsequent stage, we simply use the entity reference &rspca; to represent our constant. Likewise, the 'variable' representing the author's current email could be defined as an entity and referenced throughout the rest of the document. If the author's email address changes at a later date, then a simple change to the entity would modify the data throughout the rest of the document.

Unparsed Data

In XML, there are three kinds of data that will be ignored by the parser: comments, processing instructions (PIs) and character data (CDATA). When the parser encounters one of these, normal operation is suspended while the parser looks for the end marker. Comments in XML are exactly like comments in HTML. Typically, they are ignored by most XML parsers.
Character Data (CDATA) sections allow you to put information that might be recognised as markup anywhere characters may occur. CDATA sections begin with <![CDATA[ and end with ]]>. The parser will ignore everything within the CDATA section. CDATA is also used when a significant amount of data should be passed to the calling application without any XML parsing, or when spacing must be preserved. Throughout these notes, CDATA is almost always used when it comes to displaying listings of programs or samples of XML and HTML where brackets and ampersands etc. are frequently used. It would not be practical in these situations to also use entity references repeatedly (although it could be done!) so we just use CDATA to display the block of unparsed code. Additionally, when it comes to program listings, it allows us to preserve spacing and layout of our sample code. So for example:
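A short Python sketch of a CDATA section holding a program listing. The <listing> element name and the code inside it are invented for this example. The angle brackets and ampersands arrive untouched, and the line breaks and indentation are preserved.

```python
import xml.etree.ElementTree as ET

source = """<listing><![CDATA[
if (a < b && b > c) {
    print("<ok>");
}
]]></listing>"""

listing = ET.fromstring(source)

# The content is delivered as plain text; nothing inside was treated as markup.
print(listing.text)
print("child elements:", len(listing))
```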
Processing Instructions (PIs) allow XML documents to contain instructions for applications. Like comments, they are not part of the document's character data, so they are of little interest to the XML processor. However, they must be passed through to the proper application. The PI begins with <? and ends with ?>. The only PI we have encountered so far has been in the prolog:
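The Python sketch below shows how a parser hands a PI through to the application. The stylesheet PI and the file name notes.xsl are assumptions for illustration. Note that Python's minidom reports the stylesheet instruction as a processing-instruction node, whereas the XML declaration, although it shares the <? ... ?> syntax, is not reported as one.

```python
from xml.dom import minidom

document = minidom.parseString(
    '<?xml version="1.0"?>'
    '<?xml-stylesheet type="text/xsl" href="notes.xsl"?>'
    '<notes/>'
)

# Collect (target, data) for every PI found at the top level.
instructions = [
    (node.target, node.data)
    for node in document.childNodes
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE
]

print(instructions)
```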
We will come across PIs later in the notes, when we introduce stylesheets to our XML documents, but for now you should not be too concerned!

Document Type Definitions

Why do we need constraints on our XML documents? Because XML is extensible and can represent data in hundreds and thousands of ways, constraints on a document provide meaning to those various formats. Without document constraints, it is typically impossible to tell what the data in a document means. A DTD is used to define the structure of an XML document as well as what content is allowed. An XML document is not very usable without an accompanying DTD (or schema). Just as XML can effectively describe data, the DTD makes this data usable for many different programs in a variety of ways by defining the structure of the data. A DTD declares all the legal elements in a document, the legal attributes those elements can have, and the hierarchy, nesting and occurrence indicators for all elements. A well-written DTD will give the same look to documents that use that DTD, as well as help the browser and XML parser display the document properly.

A DTD is a text file. As mentioned previously, the DTD information may either be embedded directly in the XML document (internal DTD) or linked to an external file containing the DTD (external DTD). Since an external DTD may be used by any document conforming to its definition, this method is generally preferred. This means that if you define a DTD, then you can write multiple XML documents which use the same DTD. If you were to use the internal approach in this situation, then you would need to replicate the DTD information in each separate XML document. We will concentrate on the two main types of markup declarations found in the DTD:
Element Declarations

Element declarations identify the names of elements as well as the nature of their content. An element declaration begins with the standard <! opening of a DTD tag, followed by the ELEMENT keyword and then the name of the element. The following element declaration defines the name of the tag (Author) and the content model for the tag. The + notation means the <Author> tag must contain one or more <Name> tags.
This means that if the Author tag is used within our document, the following are both valid:
However, if there were not at least one <Name> entry, then our XML document would not be valid! There are four different DTD recurrence modifiers, with the following descriptions:
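The occurrence indicators usually listed for DTD content models are: no modifier (exactly once), ? (zero or one), * (zero or more) and + (one or more). The Python sketch below is a hand-rolled illustration of how they behave, not a real DTD validator: it relies on the observation that these modifiers coincide with regular-expression quantifiers. The helper name `children_match` and the sample data are assumptions introduced for this example.

```python
import re
import xml.etree.ElementTree as ET

# The four modifiers and their meaning.
MODIFIERS = {
    "": "exactly once",
    "?": "zero or one time",
    "*": "zero or more times",
    "+": "one or more times",
}


def children_match(element, model):
    """Check child element names against an ordered list of (name, modifier)."""
    pattern = "".join(
        f"(?:{re.escape(name)} ){modifier}" for name, modifier in model
    )
    names = "".join(f"{child.tag} " for child in element)
    return re.fullmatch(pattern, names) is not None


# Mirrors the declaration described above: Author contains Name+.
author_model = [("Name", "+")]

one_name = ET.fromstring("<Author><Name>A</Name></Author>")
two_names = ET.fromstring("<Author><Name>A</Name><Name>B</Name></Author>")
no_name = ET.fromstring("<Author></Author>")

for author in (one_name, two_names, no_name):
    print(len(author), "Name element(s):", children_match(author, author_model))
```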
To help with further understanding these modifiers, let us consider another example; that of a more detailed <Author> situation. Let us assume that we require one or more <Name> elements as before. Now, the <Name> element must have one <Firstname> element, one <Lastname> element and an optional list of <Qualification>s. We could define these DTD declarations as:
Using this DTD, the following XML snippet would be valid XML:
The #PCDATA keyword signifies that the tag contains parsed character data: the element holds only text, not child elements, and any entity references in that text are expanded by the parser. Besides lists of child elements, the descriptions of allowed content between tags are (#PCDATA), EMPTY, and ANY. The EMPTY description means that there must not be content after the opening tag. ANY means that the element may hold any content, as long as it is well-formed XML. Finally, it is possible to use the '|' symbol as an OR operation.
which means that a Figure element will have either a graphic, table or screen-shot sub-element.

Attribute Declarations

Attributes supply additional information about elements. They are created in the DTD when the elements are specified, and are specified through an attribute list. Attribute declarations define which attributes can appear inside a tag, as well as the kinds of data the attributes can contain. In a DTD, an attribute list declaration begins with the string literal <!ATTLIST and is followed by the name of the element these attributes are for. After the name, you add one or more attribute declarations. An attribute declaration consists of three parts: the attribute name, its type, and a default declaration. The general form of an attribute declaration is:
Most attributes with textual values will simply be of the type CDATA, as shown in the example below:
Remember that we are using attributes now, as opposed to elements, so this example DTD entry represents a structure within the XML document which could appear as follows:
You can also specify a set of values that an attribute must take on for the document to be considered valid:
There are four attribute default declaration options:
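The four default declarations in DTD syntax are #REQUIRED, #IMPLIED, #FIXED followed by a value, and a plain default value. The Python sketch below declares one attribute of each kind in an internal DTD subset and asks the expat parser to report how it read each declaration. The element name, the attribute names and their values are hypothetical. expat only reports the declarations here; it does not validate the document against them.

```python
import xml.parsers.expat

document = """<?xml version="1.0"?>
<!DOCTYPE person [
  <!ATTLIST person
    name     CDATA          #REQUIRED
    nickname CDATA          #IMPLIED
    species  CDATA          #FIXED "human"
    gender   (male|female)  "male">
]>
<person name="Ann"/>
"""

# attribute name -> (default value or None, required flag)
declarations = {}


def on_attlist(element_name, attribute_name, attribute_type, default, required):
    """Called by expat once for every attribute declared in an ATTLIST."""
    declarations[attribute_name] = (default, bool(required))


parser = xml.parsers.expat.ParserCreate()
parser.AttlistDeclHandler = on_attlist
parser.Parse(document, True)

for name, (default, required) in declarations.items():
    print(name, "default:", default, "required:", required)
```

In expat's reporting, #REQUIRED appears as no default with the required flag set, #IMPLIED as no default and no flag, and #FIXED as a default value with the flag set.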
Now as a further example, consider the following DTD statements:
Here is an XML statement that will satisfy all of the above DTD statements:
This XML statement does not satisfy DTD rule 2 which requires a value of "male".
This XML statement fails DTD rule 5 (and rule 2) because "unknown" is not an acceptable value.
XML Schemas

A schema is a model that describes the structure of information. The term originated in the database field, describing the structure of data in relational tables (database schema). XML Schema is a W3C Recommendation (finalized in 2001). It seeks to improve on DTDs by adding more types and quite a few more constructs than DTDs, as well as following the XML format. For example, in DTDs there is no facility for describing data such as numbers, dates, and currency values, nor is there the ability to express the data type of character data in elements. However, at this incarnation of the module, we will not be covering XML Schemas in further detail. They are largely replacing DTDs due to their greater range of functionality. However, with this greater range of functionality comes greater complexity, which would require too much module time to cover in sufficient detail. The concepts remain the same as those provided by DTDs.

Parsing and Validating XML

An XML parser is used to read XML documents, providing access to their content and structure. The parser handles the important task of taking a raw XML document as input and making sense of the document; it will ensure that the document is well-formed, and if a DTD or schema is referenced, it may be able to ensure that the document is valid.

Essential XML Editor (Trial Version)

The Essential XML Editor is a trial tool for XML document editing. It includes a built-in XML well-formedness tester and a DTD validator. The home page for the Essential XML Editor can be found at http://www.philo.de/xmledit/, where students can download and install the application. For now, we are principally interested in using the Essential XML Editor to test for well-formedness and validity. We will explain how to use the XML Editor through the next question example.
Examples

We have been introducing a large number of rules, suggestions and pieces of information regarding the creation of XML and corresponding DTD files. It is time to create some data structures of our own. Let us consider the following problems:

Sample Question 1

We wish to store data on customers of a bank. Each customer has the following structure:
Generate an XML document which could represent this information (make two sample customers) and a corresponding DTD against which you should validate the document.

Sample Question 1: Answer

Let's first attempt to draw a diagram to show the structure we require. We can call the 'root element' <CUSTOMERS> and work from there.

Figure 3.2. Choosing a Document Structure (Q1)

Using this diagram, we can easily see the structure and hierarchy of our data. Now, we need to decide whether we should write the sample XML document first or the DTD. Either is suitable, but as a preference I tend to write some data first and will follow that preference here. You should decide for yourself which you prefer to do - try both and make a decision. The first stage is obviously to open up our XML Editor application. This has been installed on the machines in the Masters laboratories or can be downloaded and easily installed as detailed above.
Below is my attempt at how the data should be structured within the XML document:
Source: customers.xml

The corresponding DTD can be created by using either the XML document or the figure as a guideline. The following is my suggested DTD structure for this problem:
Source: customers.dtd

After your DTD and XML document are both written, you should attempt to validate the XML document against the DTD. This process should highlight problems associated with either file. Once you get the document to validate, you can add in further XML data to test the various specifications made in the question. Once you are happy that the objectives are met, you are finished!
Sample Question 2

Your boss, within the University in which you work, has informed you that he wishes you to manage authentication for the new Virtual Learning System, which is being developed by a team of programmers. He tells you that he wants all user details to be held in one XML file called users.xml, which should be validated by a corresponding DTD file called users.dtd. The following data structures need to be followed:
Generate an XML file containing data for at least two users (one lecturer, one student) and create a corresponding and 'clever' document type definition (DTD) file. Use both elements and attributes to represent the data.

Sample Question 2: Answer

There are a few different approaches which could be taken on this question. Depending on how comfortable you are with XML structures, you may decide to write the Document Type Definition (DTD) first and then write the XML document. Alternatively, you can write the XML document first. My own approach would be to decide upon the XML document structure first, filling in some sample data. In order to make this easier, you may decide that you want to lay out a diagram first, such as that below. In the diagram, you can see that we first choose the root element, which we call <user>. This in turn is split into either <lecturer> or <student>, both marked with * to indicate that there can be 0 or more occurrences of either user type. The two user types share some common data, such as firstname, surname, title and username (which we group under an element <name>). You can see how at certain points in the diagram we create elements, such as <name>, <contact> and <address>, to 'cleverly' structure our XML data. We colour some elements/attributes in blue to indicate that they have no sub-elements and are typically PCDATA type.

Figure 3.3. Choosing a Document Structure (Q2)

So let us write the XML structure for three users. Essentially at this stage, we just use the diagram (if we created one, or else simply the problem specification if you are able) to create our XML document structure, inserting sample data as we go. The following XML document represents the structure as seen in the diagram:
Source: users.xml

The XML document above follows the project data specification. You will notice that we grouped the physical and email address details into an element called <contact>. While this was not specified, it does not violate the specification and makes structural sense. Note: There is more than one possible set of answers to this question.

Now that we have our sample XML document, we should construct the DTD. When you are working on the DTD, you should start with the root element (user in our case), and work from element to element down the tree. Naturally, the student and lecturer elements come next in this tree.
Source: users.dtd

You should notice that the DTD is a direct textual description of the diagram we have seen above. When you are designing your XML structures, spend a little time thinking about what structure would be best to represent your data before proceeding to create your files. Once both the XML document and the DTD have been created, you should validate the XML document against the DTD. If you have made any syntax errors and the document is not well-formed, these will be immediately picked up. In addition, the parser will inform you whether your document is 'valid', that is, whether it conforms to the specifications within the DTD. Once your XML document is determined to be 'valid' and you are happy that you have met the requirements of the problem specification, you are finished!

Self-Learning

And now some problems for you! You should try the following two examples and for each create an XML document and corresponding DTD. Test these in the XML Editor and experiment until you are happy that your structures and constraints satisfy the goals of the problems!

Problem 1

You are responsible for creating an XML database structure for a typical library system. Your XML document will be used to store information on all books within the library, and it must adhere to certain specifications as follows:
Using both elements and attributes, create an XML document and DTD constraint to represent this data model. Provide examples of data for three books to illustrate the required specifications.

Problem 2

You must create an XML data model which represents sales within a Motor Sales Company. Taking real-life factors into account, decide upon a sensible structure for representing such data and impose your own restrictions. Use at least 20 elements or attributes to represent a possible sales database. Then create your XML document and DTD. Provide examples of data for three sales to illustrate the required specifications. Note: If you are completely lost on this, some sample data might include vehicle type, model etc. and details on the customer.