Travis Vandersypen



Related Topics: XML Magazine

Managing Your XML Documents with Schemas

The XML Schema Definition Language solves a number of problems posed by Document Type Definitions. Because DTDs prompted much confusion and complaining among XML developers, the W3C set about creating a new standard for defining a document's structure.

What the W3C created is something even more complex and flexible than DTDs: the XML Schema Definition Language. In this article we'll look at many aspects of schemas and how you can build and use them.

A Little Background
Schemas, while more complex than DTDs, give an individual much more power and control over how XML documents are validated. For instance, with the new W3C standard a document definition can specify the data type of an element's contents, the range of values for elements, the minimum as well as maximum number of times an element may occur, annotations to schemas, and much more.

In May of 2001 the W3C finalized its recommendation for the XML Schema Definition Language. This standard allows an author to define simple and complex elements as well as the rules governing how those elements and their attributes may show up within an instance document. The author has a large amount of control over how the structure of a conforming XML document must be created. The author can apply various restrictions to the elements and attributes within the document, from specifying the length to specifying an enumerated set of acceptable values for the element or attribute. With the XML Schema Definition Language, an XML schema author possesses an incredible amount of control over the conformance of an associated XML document to the specified schema.

Sample XML Document
The remainder of this article is devoted to creating and understanding the XML schema for the XML document shown in Listing 1, which details a purchase order for various items that can commonly be found in a grocery store. This document allows one individual to receive the shipment of the goods and an entirely different individual to pay for the purchase. This document also contains specific information about the products ordered, such as how much each product cost, how many were ordered, and so on.

As you can see, the listing represents a fairly small and simple order that could be placed online. It contains the necessary information regarding how payment is to be made, how the order is to be shipped, and what day delivery should be. The listing should by no means be construed as an all-inclusive document for an online grocery store order; it has been constructed only for use as an example.

For the listing, an author might construct a DTD to describe the XML document. While such a DTD might require only 30 lines or so, it would provide a relatively inflexible definition of the XML document.

A Sample Schema
Creating an XML schema to describe this document is somewhat more complex than building a DTD. However, in exchange for the extra complexity, the schema gives the author virtually limitless control over how an XML document can be validated against it.

Authoring an XML schema consists of declaring elements and attributes as well as the "properties" of those elements and attributes. We will begin our look at authoring XML schemas by working our way from the least complex to the most complex example. Because attributes may not contain other attributes or elements, we will start there.

Declaring attributes
Attributes in an XML document are contained by elements. To indicate that a complex element has an attribute, use the <attribute> element of the XML Schema Definition Language. For instance, Listing 2 is from a hypothetical PurchaseOrder schema based on the XML document shown in Listing 1. You can see the basics for declaring an attribute.

From this you can see that, when declaring an attribute, you must specify a type. This type must be a simple type: either one you define yourself or one of the built-in simple types, which are anyURI, base64Binary, boolean, byte, date, dateTime, decimal, double, duration, ENTITIES, ENTITY, float, gDay, gMonth, gMonthDay, gYear, gYearMonth, hexBinary, ID, IDREF, IDREFS, int, integer, language, long, Name, NCName, negativeInteger, NMTOKEN, NMTOKENS, nonNegativeInteger, nonPositiveInteger, normalizedString, NOTATION, positiveInteger, QName, short, string, time, token, unsignedByte, unsignedInt, unsignedLong, unsignedShort. Each type can be further categorized as a "primitive" data type or a "derived" data type. The derived data types, such as integer, positiveInteger, and byte, are "primitive" or other "derived" data types with restrictions placed on them.

From the simple types you may notice what appears to be a group of duplicate or unnecessary types, such as nonNegativeInteger and positiveInteger. If you look closely, you'll see that nonNegativeInteger is an integer whose value is greater than or equal to zero, whereas the positiveInteger type is an integer whose value is greater than zero, which means a positiveInteger type cannot be zero. Keep this in mind when deciding on the base data type for your elements and attributes - these small details can greatly influence their acceptable value ranges.
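
As a small sketch of that difference, the two declarations below sit on a hypothetical Product element. The names Product, quantityOnHand, and quantityOrdered are assumptions made for illustration and are not taken from the listings.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Hypothetical element used only to host the two attributes -->
  <xsd:element name="Product">
    <xsd:complexType>
      <!-- 0, 1, 2, ... are all valid -->
      <xsd:attribute name="quantityOnHand" type="xsd:nonNegativeInteger"/>
      <!-- 1, 2, 3, ... are valid; 0 is rejected -->
      <xsd:attribute name="quantityOrdered" type="xsd:positiveInteger"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

With these declarations, quantityOnHand="0" validates, while quantityOrdered="0" does not.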

Aside from defining the type of an attribute, the <attribute> element within the XML Schema Definition Language contains attributes to assist in defining when an attribute is optional, whether its value is fixed, what its default value is, and so on. Here's the basic syntax for the <attribute> element:

<attribute name="" type="" [use=""] [fixed=""] [default=""] [ref=""]/>

The use attribute can contain one of the following values:

  • optional
  • prohibited
  • required

If the use attribute is set to required, the parent element must have the attribute; otherwise the document will be considered invalid. A value of optional, which is also the default when use is omitted, indicates the attribute may or may not occur in the document; when it does occur, its value must still be valid for the declared type. By assigning a value of prohibited to the use attribute, you can indicate that the attribute may not appear at all within the parent element.

Specifying a value for the default attribute indicates that if the attribute does not appear within the specified element of the XML document, it is assumed to have that default value. A value within the fixed attribute indicates the attribute has a constant value.

It's important to remember that if you specify a value for the fixed attribute of the <attribute> element, the resulting attribute must have the value specified for the attribute to be valid. If you mean to indicate that the attribute should have a default value of some sort, use the default attribute instead. It should be noted that the default and fixed attributes are mutually exclusive.
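
The following sketch shows use, default, and fixed side by side. The Payment element and its attribute names are hypothetical and chosen only for illustration.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Payment">
    <xsd:complexType>
      <!-- Must appear on every Payment element -->
      <xsd:attribute name="method" type="xsd:string" use="required"/>
      <!-- May be omitted; a validating processor then assumes "USD" -->
      <xsd:attribute name="currency" type="xsd:string" default="USD"/>
      <!-- May be omitted, but if present its value must be "1.0" -->
      <xsd:attribute name="version" type="xsd:string" fixed="1.0"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

Under these assumptions, `<Payment method="Check"/>` is valid, while `<Payment method="Check" version="2.0"/>` is not, because version is fixed at 1.0.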

The ref attribute for the <attribute> element indicates that the attribute declaration exists somewhere else within the schema. This allows complex attribute declarations to be defined once and referenced when necessary. For instance, let's say you've "inherited" elements and attributes from another schema and would simply like to reuse one of the attribute declarations within the current schema; this would provide the perfect opportunity to take advantage of the ref attribute.

Just as attributes can be defined based on the simple data types included in the XML Schema Definition Language, they can also be defined based on <simpleType> elements. This can easily be accomplished by declaring an attribute that contains a <simpleType> element, as the following example demonstrates:

<!-- Declared as a child of <xsd:schema> so that it can be referenced with ref -->
<xsd:attribute name="exampleattribute">
  <xsd:simpleType>
    <xsd:restriction base="xsd:string">
      <!-- The value must be exactly two characters long -->
      <xsd:length value="2"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:attribute>

<xsd:complexType name="exampleelement">
  <xsd:attribute ref="exampleattribute"/>
</xsd:complexType>

From this example you can see that the XML Schema Definition Language gives the schema author a great deal of control over how attributes are validated. One of the wonderful side effects of the XML Schema Definition Language is its similarity to object-oriented programming. Consider each attribute definition and element definition to be a class definition. These class definitions describe complex structures and behaviors among various classes, so each individual class definition, whether it's a simple class or a complex class, encapsulates everything necessary to perform its job. The same holds true for the declaration of attributes and elements within an XML document. Each item completely describes itself.

Declaring elements
Elements within an XML schema can be declared using the <element> element from the XML Schema Definition Language. The example in Listing 3 shows a simple element declaration using the XML Schema Definition Language.

From the example you can see that an element's type may be defined elsewhere within the schema. Where a type is defined determines where it can be used. For instance, a named type defined as a child of the <schema> element can be referenced anywhere within the schema document, whereas an anonymous type defined inline within an element declaration can be used only by that one element.

An element's type can be defined with a <complexType> element or a <simpleType> element. Within a <complexType>, a <complexContent> or <simpleContent> element can be used to derive the type from an existing one. The validation requirements for the document will influence the choice of an element's type. For instance, going back to our object-oriented analogy, let's say you define a high-level abstract class and then need to refine its definition for certain situations. In that case you would create a new class based on the existing one and change its definition as needed. The <complexContent> and <simpleContent> elements work much the same way: they provide a way to extend or restrict an existing simple or complex type definition as needed by the specific element declaration.
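
To make the class analogy concrete, here is a minimal sketch of derivation by extension. AddressType, ShippingAddressType, and their child elements are hypothetical names, not taken from the listings.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- The "base class": a named, reusable complex type -->
  <xsd:complexType name="AddressType">
    <xsd:sequence>
      <xsd:element name="Street" type="xsd:string"/>
      <xsd:element name="City" type="xsd:string"/>
    </xsd:sequence>
  </xsd:complexType>

  <!-- The "subclass": everything in AddressType, followed by one more element -->
  <xsd:complexType name="ShippingAddressType">
    <xsd:complexContent>
      <xsd:extension base="AddressType">
        <xsd:sequence>
          <xsd:element name="DeliveryInstructions" type="xsd:string"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
</xsd:schema>
```

An element of type ShippingAddressType would contain Street, City, and DeliveryInstructions, in that order.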

The basic construction of an element declaration using the <element> element within the XML Schema Definition Language is as follows:

<element name="" [type=""] [abstract=""] [block=""]
[default=""] [final=""] [fixed=""] [minOccurs=""]
[maxOccurs=""] [nillable=""] [ref=""] [substitutionGroup=""]/>

From this you can see that element declarations offer a myriad of possibilities to the author. For instance, the abstract attribute indicates whether the element being declared may show up directly within the XML document. If this attribute is true, the declared element may not show up directly. Instead, this element must be referenced by another element using the substitutionGroup attribute. This substitution works only if the element utilizing the substitutionGroup attribute occurs directly beneath the <schema> element.

In other words, for one element declaration to be substituted for another, the element using the substitutionGroup attribute must be a top-level element. Why would anyone in his right mind declare an element as abstract? The answer is really quite simple. Let's say you need to have multiple elements that have the same basic values specified for the attributes on the <element> element. A <complexType> element definition does not allow for those attributes. So, rather than define and set those attribute values for each element, you could make an "abstract" element declaration, set the values once, and substitute the abstract element definition as needed.
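
A minimal sketch of an abstract element and its substitutes might look like this. The element names Payment, CheckPayment, CardPayment, and Invoice are hypothetical.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Abstract head element: may not appear in an instance document itself -->
  <xsd:element name="Payment" type="xsd:string" abstract="true"/>

  <!-- Top-level elements that may stand in wherever Payment is referenced -->
  <xsd:element name="CheckPayment" type="xsd:string" substitutionGroup="Payment"/>
  <xsd:element name="CardPayment" type="xsd:string" substitutionGroup="Payment"/>

  <xsd:element name="Invoice">
    <xsd:complexType>
      <xsd:sequence>
        <!-- An instance must supply CheckPayment or CardPayment here -->
        <xsd:element ref="Payment"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

Under this schema, `<Invoice><CheckPayment>1042</CheckPayment></Invoice>` is valid, while an Invoice containing a literal Payment element is not.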

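You may omit the type attribute from the <element> element. In that case the element typically either refers to another declaration through the ref attribute, defines its type inline with a <simpleType> or <complexType> child, or takes the type of the element named in its substitutionGroup attribute. If none of these applies, the element defaults to anyType and accepts any content.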

The type attribute indicates that the element should be based on a named complexType or simpleType definition; a complexType may in turn use complexContent or simpleContent to derive from another type. By defining an element's structure using one of these definitions, the author can gain an incredible amount of control over the element's definition. We will cover these various definitions in the "Declaring Complex Elements" and "Declaring Simple Types" sections later in this article.

The block attribute prevents any element with the specified derivation type from being used in place of the element. The block attribute may contain any of the following values:

#all
extension
restriction
substitution

If the value #all is specified within the block attribute, no elements derived from this element declaration may appear in place of this element. A value of extension prevents any element whose definition has been derived by extension from appearing in place of this element. If a value of restriction is assigned, an element derived by restriction from this element declaration is prevented from appearing in place of this element. Finally, a value of substitution indicates that an element derived through substitution cannot be used in place of this element.

The default attribute may be specified only for an element based on a simpleType or whose content is text only. This attribute assigns a default value to an element.

You cannot specify a value for both a default attribute and a fixed attribute; they are mutually exclusive. Also, if the element definition is based on a simpleType, the value must be a valid value of that data type.

The minOccurs and maxOccurs attributes specify the minimum and maximum number of times this element may appear within a valid XML document. Although you may explicitly set these attributes, they are not required; each defaults to 1 when omitted. To indicate that an element's appearance within the parent element is optional, set the minOccurs attribute to 0. To indicate that the element may occur an unlimited number of times within the parent element, set the maxOccurs attribute to the string "unbounded". However, you may not specify the minOccurs or maxOccurs attribute for an element whose parent element is the <schema> element.
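
The occurrence rules above can be sketched as follows. The Order, Note, and Product names are hypothetical and used only for illustration.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Top-level declaration: no minOccurs or maxOccurs allowed here -->
  <xsd:element name="Order">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Optional: zero or one occurrence -->
        <xsd:element name="Note" type="xsd:string" minOccurs="0"/>
        <!-- At least one occurrence, with no upper limit -->
        <xsd:element name="Product" type="xsd:string" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```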

The nillable attribute indicates whether an explicit null value can be assigned to the element. If this particular attribute is omitted, it is assumed to be false. If this attribute has a value of true, the element may carry the nil attribute from the XML Schema instance namespace (usually written xsi:nil) in an instance document, and an element marked xsi:nil="true" must be empty. So what exactly does this do for you, this nillable attribute? Well, let's say you are writing an application that uses a database that supports NULL values for fields and you are representing your data as XML. Now let's say you request the data from your database and convert it into some XML grammar. How do you tell the difference between those elements that are empty and those elements that are NULL? That's where the nillable attribute comes into play. By appending an attribute of nil to the element, you can tell whether it is empty or is actually NULL. Remember, the nillable attribute applies only to an element's contents and not the attributes of the element.
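
As a sketch of the database scenario, assume a hypothetical Customer record in which MiddleName may be NULL while Nickname may only be an empty string:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Customer">
    <xsd:complexType>
      <xsd:sequence>
        <!-- May be marked as NULL in an instance document -->
        <xsd:element name="MiddleName" type="xsd:string" nillable="true"/>
        <!-- May be empty, but may not be marked as NULL -->
        <xsd:element name="Nickname" type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

An instance document would then distinguish the two cases like this:

```xml
<Customer xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <!-- NULL in the database -->
  <MiddleName xsi:nil="true"/>
  <!-- An empty string in the database -->
  <Nickname></Nickname>
</Customer>
```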

The fixed attribute specifies that the element has a constant, predetermined value. This attribute applies only to those elements whose type definitions are based on simpleType or whose content is text only.
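
The sketch below shows default and fixed on element declarations, using hypothetical Shipment, Carrier, and Country elements. Note that an element default differs slightly from an attribute default: it applies when the element is present but empty, not when the element is missing.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Shipment">
    <xsd:complexType>
      <xsd:sequence>
        <!-- An empty <Carrier/> is treated as if it contained "Ground" -->
        <xsd:element name="Carrier" type="xsd:string" default="Ground"/>
        <!-- If Country has content, that content must be exactly "US" -->
        <xsd:element name="Country" type="xsd:string" fixed="US"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```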

Declaring complex elements
Many times within an XML document an element may contain child elements and/or attributes. To indicate this within the XML Schema Definition Language, you'll use the <complexType> element. If you examine the sample section from Listing 4, you'll see the basics used to define a complex element within an XML schema.

The sample section specifies the definition of PurchaseOrderType. This particular type contains three child elements - ShippingInformation, BillingInformation, and Order - as well as two attributes - Tax and Total. You should also notice the use of the maxOccurs and minOccurs attributes on the element declarations. With a value of 1 indicated for both attributes, the element declarations specify that each element must occur exactly one time within the PurchaseOrderType element.
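
A type definition matching that description might look like the following sketch. The use of a sequence, the xsd:anyType placeholders for the child elements, and the xsd:decimal type on Tax and Total are all assumptions made for illustration; the actual listing may differ.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="PurchaseOrderType">
    <xsd:sequence>
      <!-- xsd:anyType is a placeholder; the real child types are assumed unknown -->
      <xsd:element name="ShippingInformation" type="xsd:anyType" minOccurs="1" maxOccurs="1"/>
      <xsd:element name="BillingInformation" type="xsd:anyType" minOccurs="1" maxOccurs="1"/>
      <xsd:element name="Order" type="xsd:anyType" minOccurs="1" maxOccurs="1"/>
    </xsd:sequence>
    <!-- Attribute declarations follow the content model -->
    <xsd:attribute name="Tax" type="xsd:decimal"/>
    <xsd:attribute name="Total" type="xsd:decimal"/>
  </xsd:complexType>
</xsd:schema>
```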

The basic syntax for the <complexType> element is as follows:

<xsd:complexType name='' [abstract='']
[block=''] [final=''] [mixed='']/>

The abstract attribute indicates whether an element may define its content directly from this type definition or from a type derived from this type definition. If this attribute is true, an element must define its content from a derived type definition. If this attribute is omitted or its value is false, an element may define its content directly based on this type definition.

A base type is not specified on the <complexType> element itself. Instead, the base attribute appears on an <extension> or <restriction> element inside a <simpleContent> or <complexContent> child, where it names the simple or complex type from which this type is derived.

The block attribute indicates what types of derivation are prevented for this element definition. This attribute can contain any of the following values:

#all
extension
restriction

A value of #all prevents all complex types derived from this type definition from being used in place of this type definition. A value of extension prevents complex type definitions derived through extension from being used in place of this type definition. Assigning a value of restriction prevents a complex type definition derived through restriction from being used in place of this type definition. If this attribute is omitted, any type definition derived from this type definition may be used in place of this type definition.

The mixed attribute indicates whether character data is permitted to appear between the child elements of this type definition. If this attribute is false or is omitted, no character data may appear between them. If the type definition contains a simpleContent type element, this value must be false. If the complexContent element appears as a child element, the mixed attribute on the complexContent element can override the value specified in the current type definition.
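
A minimal sketch of mixed content, using hypothetical Comment and ProductName elements:

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="Comment">
    <!-- mixed="true" allows text before, between, and after the child elements -->
    <xsd:complexType mixed="true">
      <xsd:sequence>
        <xsd:element name="ProductName" type="xsd:string"
                     minOccurs="0" maxOccurs="unbounded"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
  <!-- A valid instance:
       <Comment>Please substitute <ProductName>2% milk</ProductName> if whole milk is out.</Comment> -->
</xsd:schema>
```

Without mixed="true", the text surrounding ProductName would make the Comment element invalid.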

A <complexType> element in the XML Schema Definition Language may contain at most one of the following elements to define its content; attribute declarations, if any, follow it:

all
choice
complexContent
group
sequence
simpleContent
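
As one example of these children, the sketch below uses simpleContent to build a type whose content is a decimal number but which also carries an attribute. PriceType, Price, and currency are hypothetical names.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="PriceType">
    <xsd:simpleContent>
      <!-- The base attribute names the type being extended -->
      <xsd:extension base="xsd:decimal">
        <xsd:attribute name="currency" type="xsd:string" use="required"/>
      </xsd:extension>
    </xsd:simpleContent>
  </xsd:complexType>

  <xsd:element name="Price" type="PriceType"/>
</xsd:schema>
```

With this definition, `<Price currency="USD">3.49</Price>` is valid: the content is checked as a decimal and the attribute is required.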

Declaring simple types
Sometimes it's not necessary to declare a complex element type within an XML schema. In these cases you can use the <simpleType> element of the XML Schema Definition Language. These type definitions support an element based on the simple XML data types or any simpleType declaration within the current schema. For example, consider the following type definition:

<xsd:simpleType name="PaymentMethodType">
  <xsd:restriction base="xsd:string">
    <xsd:enumeration value="Check"/>
    <xsd:enumeration value="Cash"/>
    <xsd:enumeration value="Credit Card"/>
    <xsd:enumeration value="Debit Card"/>
    <xsd:enumeration value="Other"/>
  </xsd:restriction>
</xsd:simpleType>

This defines the PaymentMethodType type, which is based on the string data type included in the XML Schema Definition Language. You may notice the use of the <enumeration> element. This particular element is referred to as a facet, which we'll cover in the next section.

The basic syntax for defining a simpleType element definition is as follows:

<xsd:simpleType name=''>
  <xsd:restriction base=''/>
</xsd:simpleType>

The base attribute may name any built-in simple data type or any simpleType declared within the schema. The value of this attribute determines the kind of data the new type may contain. A simpleType may contain only a value, not other elements or attributes.

You may also notice the inclusion of the <restriction> element. This is probably the most common method in which to declare types, and it helps to set more stringent boundaries on the values an element or attribute based on this type definition may hold. So, to indicate that a type definition's value may hold only string values, you would declare a type definition as follows:

<xsd:simpleType name='mySimpleType'>
  <xsd:restriction base='xsd:string'/>
</xsd:simpleType>

Two other methods are available to an XML schema author to "refine" a simple type definition: <list> and <union>. The <list> element allows an element or attribute based on the type definition to contain a whitespace-separated list of values of a specified simple data type. The <union> element combines two or more simple type definitions so that a value is valid if it matches any one of them.

Putting It All Together
Now let's look at Listing 5, a complete schema for the document shown in Listing 1. You may notice the use of the <xsd:choice> element. This element can be used to indicate when one of a group of elements may show up, but not all, as is the case with DeliveryDate and BillingDate. Also, notice the use of the xsd namespace prefix. This prefix can be anything, as long as it is bound to the XML Schema namespace, http://www.w3.org/2001/XMLSchema, but we'll use xsd to indicate an XML Schema Definition Language element.
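
In isolation, a choice between those two names might be sketched as follows. The enclosing Schedule element and the xsd:date types are assumptions made for illustration, and the declarations in the full listing may differ.

```xml
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <!-- Schedule is a hypothetical wrapper element -->
  <xsd:element name="Schedule">
    <xsd:complexType>
      <!-- Exactly one of the two child elements must appear -->
      <xsd:choice>
        <xsd:element name="DeliveryDate" type="xsd:date"/>
        <xsd:element name="BillingDate" type="xsd:date"/>
      </xsd:choice>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```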

As we indicated earlier, the listing is substantially more complex than a DTD would be, but it provides much better control over your XML document. There are many additional facets to an XML schema, but the information and examples here should be enough to get your feet wet.

Summary
The XML Schema Definition Language provides a very powerful and flexible way in which to validate XML documents. It includes everything from declaring elements and attributes to "inheriting" elements from other schemas, from defining complex element definitions to defining restrictions for even the simplest of data types. This gives the XML schema author such control over specifying a valid construction for an XML document that there is almost nothing that cannot be defined with an XML schema.

    Ron Schmelzer is founder and senior analyst of ZapThink. A well-known expert in the field of XML and XML-based standards and initiatives, Ron has been featured in and written for periodicals and has spoken on the subject of XML at numerous industry conferences.

    Travis Vandersypen, a programmer with EPS Software Corporation, has five years' development experience in XML, UML, XSLT, FoxPro, HTML, and other tools. He has authored a number of articles and is a frequent speaker at conferences.
