Evolution of XML
The history of Extensible Markup Language (XML) begins with the development of Standard Generalized Markup Language (SGML) by Charles Goldfarb, along with Ed Mosher and Ray Lorie, in the 1970s while working at IBM (Anderson, 2004). Despite its name, SGML is not a mark-up language in its own right, but a language used to specify mark-up languages. Its purpose was to create vocabularies which could be used to mark up documents with structural tags. It was envisaged at the time that certain machine-readable documents should remain machine readable for perhaps decades.
One of the most popular applications of SGML came with the development of HyperText Markup Language (HTML) by Tim Berners-Lee in the late 1980s (Raggett, Lam, Alexander & Kmiec, 1998). Since then, HTML has become something of a victim of its own popularity: it was rapidly adopted and extended in many ways beyond its original vision. It remains popular today as a presentation technology, but is considered unsuitable as a general-purpose data storage format.
HTML is a poor fit for data storage and interchange because it was originally intended as a presentation technology, while SGML is considered too complex for general use. XML bridges this gap by being both human and machine readable, while remaining flexible enough to support platform- and architecture-independent data interchange.
Applications of XML
At its core, XML gives a software engineer the ability to create a vocabulary and use it to describe data. For example, when exchanging data between computers, the number 42 is meaningless unless you also exchange its meaning, for instance that it is the CPU temperature expressed in degrees Celsius.
Only when the sender and receiver of data have an agreed understanding of the data’s meaning can they begin to do something useful with it. Before the development of XML, a certain amount of a priori agreement on data and its meaning was required between systems. With XML, data can be exchanged between systems without agreeing a bespoke format in advance, so long as both systems understand the same vocabulary, that is, speak the same language.
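The CPU temperature example can be made concrete with a minimal sketch using Python's standard `xml.etree.ElementTree` module. The element and attribute names (`reading`, `cpuTemperature`, `unit`, `host`) are a hypothetical vocabulary invented for illustration; any real exchange would use whatever vocabulary the two systems share.

```python
import xml.etree.ElementTree as ET

# Hypothetical vocabulary: these element and attribute names are illustrative only.
MESSAGE = """<reading host="server01">
  <cpuTemperature unit="celsius">42</cpuTemperature>
</reading>"""

def read_cpu_temperature(xml_text):
    """Return (value, unit) from a <reading> document."""
    root = ET.fromstring(xml_text)
    temp = root.find("cpuTemperature")
    if temp is None:
        # The receiver only understands documents that use the shared vocabulary.
        raise ValueError("document does not use the expected vocabulary")
    return int(temp.text), temp.get("unit")

value, unit = read_cpu_temperature(MESSAGE)
print(value, unit)
```

The number 42 now travels together with its meaning: the element name says what was measured and the attribute says in which unit.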
Since the development of XML, several applications have arisen in the areas of (IBM, 2005):
- Web publishing: XML allows you to create interactive pages, allows the customer to customize those pages, and makes creating e-commerce applications more intuitive.
- Web searching and automating Web tasks: XML defines the type of information contained in a document, making it easier to return useful results when searching the Web.
- General applications: XML provides a standard method to access information, making it easier for applications and devices of all kinds to use, store, transmit, and display data.
- e-business applications: XML implementations make electronic data interchange (EDI) more accessible for information interchange, business-to-business transactions, and business-to-consumer transactions.
- Metadata applications: XML makes it easier to express metadata in a portable, reusable format.
- Pervasive computing: XML provides portable and structured information types for display on pervasive (wireless) computing devices such as personal digital assistants (PDAs), cellular phones, and others.
There are two ways to define the structure of an XML document: Document Type Definition (DTD) documents and XML Schema documents. DTDs were inherited from SGML and use their own declaration syntax, which resembles Extended Backus-Naur Form (EBNF). XML Schema documents, on the other hand, are themselves written in XML syntax.
Both DTD and XML schema allow for the specification of constraint rules which are applied to the contents of XML instance documents. These take the form of rules for validating the structure of XML elements.
XML elements can contain data, other elements, a combination of both, or neither, as in an empty element. All XML documents have a single root element which contains sub-elements, which in turn contain their own sub-elements, and so on. This results in a hierarchical tree structure within the XML document.
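The four kinds of element content and the resulting tree can be illustrated with a short sketch, again using Python's standard `xml.etree.ElementTree`. The sample document and the `describe` helper are hypothetical and exist only to show the hierarchy.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document with one element of each kind.
DOC = """<library>
  <book id="1">
    <title>Introducing XML</title>
    <note>Read <em>this</em> first</note>
    <outOfPrint/>
  </book>
</library>"""

def describe(element, depth=0, lines=None):
    """Collect one indented line per element, walking the tree depth-first."""
    if lines is None:
        lines = []
    # Character data may sit before the first child or after any child (its tail).
    has_text = bool((element.text or "").strip()) or any(
        (child.tail or "").strip() for child in element
    )
    has_children = len(element) > 0
    if has_text and has_children:
        kind = "mixed"
    elif has_children:
        kind = "elements"
    elif has_text:
        kind = "data"
    else:
        kind = "empty"
    lines.append("  " * depth + f"{element.tag} ({kind})")
    for child in element:
        describe(child, depth + 1, lines)
    return lines

for line in describe(ET.fromstring(DOC)):
    print(line)
```

Here `library` is the single root element, and the indentation of the printed lines mirrors the depth of each element in the tree.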
Because they were developed through SGML, Document Type Definitions (W3C, 1998) are better suited to document-centric applications such as HTML, which is itself specified using a DTD. While a DTD can define the structure of a document, it cannot specify rules which should apply to the data: a DTD treats essentially all data contained within the XML document as a string. This suits document mark-up languages, but is not suitable when an application needs to control the contained data.
The XML Schema language was developed to overcome this shortfall (W3C, 2000). XML Schema defines many more data types, and allows the specification of rules which apply not only to the structure of the XML document but also to the contained data. In this way an XML element can be declared with a type of nonNegativeInteger, and a validating XML parser would reject the document if the data were anything other than an integer greater than or equal to zero.
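To show what a data-type rule adds, the sketch below hand-checks a single `xs:nonNegativeInteger` declaration using only the Python standard library. It is not a real XML Schema validator: it assumes the conventional `xs` prefix, looks only at top-level element declarations, and supports only this one type. The `quantity` element is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema: one <quantity> element typed as xs:nonNegativeInteger.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="quantity" type="xs:nonNegativeInteger"/>
</xs:schema>"""

def declared_type(schema_text, element_name):
    """Return the type attribute of a top-level element declaration, if any."""
    schema = ET.fromstring(schema_text)
    for decl in schema.findall(XS + "element"):
        if decl.get("name") == element_name:
            return decl.get("type")
    return None

def is_valid(schema_text, instance_text):
    """Tiny subset of validation: only xs:nonNegativeInteger is understood."""
    root = ET.fromstring(instance_text)
    if declared_type(schema_text, root.tag) != "xs:nonNegativeInteger":
        return False  # undeclared element, or a type this sketch does not handle
    text = (root.text or "").strip()
    # Digits with an optional '+', or a signed zero such as "-0".
    return re.fullmatch(r"\+?[0-9]+|-0+", text) is not None

for sample in ("<quantity>0</quantity>", "<quantity>42</quantity>",
               "<quantity>-1</quantity>", "<quantity>4.2</quantity>"):
    print(sample, is_valid(SCHEMA, sample))
```

A DTD could only state that `quantity` contains character data, so all four samples would pass; with the typed declaration the last two are rejected.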
Storing XML Data
Scardina, Chang and Wang (2004) offer three options for storing XML data: store the XML document as is, extract the data from the document and store it as relational data, or use XML database types to store the XML.
Consider a content management system which manages XML documents: one could simply store each XML file as it is, without any processing. Oracle and other databases offer character large object (CLOB) types which can hold the XML file as a large string. Alternatively, one could store the XML files on the file system. The result is the same: all the application can do with the documents is read and write them as a whole.
If this approach suits an application, the designers should consider compressing the XML files before storage. XML has a verbose syntax, and can significantly expand the storage required over the actual data being stored. The W3C Efficient XML Interchange Working Group is responsible for finding ways to overcome XML’s verbosity and for providing an efficient means of compressing XML for transport and storage (Cover, 2007). Harrusi, Averbuch & Yehudai (2006) have also published an XML-aware compression technique.
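A minimal sketch of the store-as-is option with compression follows. It uses general-purpose `zlib` compression and an in-memory SQLite table as stand-ins; it is neither the Efficient XML Interchange format nor the XML-aware technique cited above, and SQLite is assumed here only because it ships with Python. The table and document names are hypothetical.

```python
import sqlite3
import zlib

# Hypothetical, deliberately repetitive document to show XML's verbosity.
XML_DOC = "<readings>" + "".join(
    f'<reading host="server{i:02d}">'
    '<cpuTemperature unit="celsius">42</cpuTemperature></reading>'
    for i in range(50)
) + "</readings>"

def store_document(conn, name, xml_text):
    """Compress the whole document and store it as an opaque blob."""
    blob = zlib.compress(xml_text.encode("utf-8"))
    conn.execute("INSERT INTO documents (name, body) VALUES (?, ?)", (name, blob))
    return len(blob)

def load_document(conn, name):
    """Read the blob back and decompress it to the original text."""
    row = conn.execute(
        "SELECT body FROM documents WHERE name = ?", (name,)
    ).fetchone()
    return zlib.decompress(row[0]).decode("utf-8")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (name TEXT PRIMARY KEY, body BLOB)")
stored = store_document(conn, "readings.xml", XML_DOC)
print(len(XML_DOC.encode("utf-8")), "bytes raw,", stored, "bytes compressed")
```

As the text notes, the database can only hand the document back whole: no query can look inside the compressed blob.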
Since many applications are data-centric and are interested in the contents of the XML document, the first approach is often not suitable. The second option is to parse the data from the XML document and store it as regular relational data. Since XML should be self-contained, in the sense that it contains both the data and the meaning of the data, it should be possible to parse the XML file to find the data points of interest to the application, and store these in relational or object-relational tables.
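The second option can be sketched as follows: parse the document, pick out the data points of interest and insert them into an ordinary table. The vocabulary (`readings`, `reading`, `cpuTemperature`) and the `cpu_temperature` table are hypothetical, and SQLite stands in for whichever relational database is actually used.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input document.
XML_DOC = """<readings>
  <reading host="server01"><cpuTemperature unit="celsius">42</cpuTemperature></reading>
  <reading host="server02"><cpuTemperature unit="celsius">57</cpuTemperature></reading>
</readings>"""

def shred(conn, xml_text):
    """Extract (host, temperature) pairs from the XML and store them as rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cpu_temperature (host TEXT, celsius INTEGER)"
    )
    rows = []
    for reading in ET.fromstring(xml_text).iter("reading"):
        temp = reading.find("cpuTemperature")
        rows.append((reading.get("host"), int(temp.text)))
    conn.executemany("INSERT INTO cpu_temperature VALUES (?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
shred(conn, XML_DOC)
hottest = conn.execute(
    "SELECT host, celsius FROM cpu_temperature ORDER BY celsius DESC LIMIT 1"
).fetchone()
print(hottest)
```

Once the data is in a table, ordinary SQL can query it, which was not possible when the document was stored as a single string.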
A consequence of this approach is that a significant amount of processing is required to parse the XML data. This is especially pronounced when large datasets are transported as XML. Lu, Chiu & Pan (2006) propose a parallel processing approach for such datasets, where the processing effort is distributed across several CPUs.
Finally, Scardina, Chang and Wang (2004) recommend storing the XML in native database XML types. All that is required in this approach is to register the XML schema with the Oracle database. This registration process allows the database to ‘know’ what types of data are being stored, and to create appropriate storage. Oracle’s XML Developer’s Kit (XDK) provides a rich set of application programming interfaces (APIs) for dealing with data in such storage.
This approach provides the best of both previous approaches: ease of storage, and relational-style access to the contained data through SQL queries. It is only appropriate when a schema is defined for the stored XML, which makes it unsuitable where no schema exists or where the schema is frequently modified.
Mapping Relational Data
Getting XML data into a database is only half of the problem. As older legacy systems are upgraded for service-oriented architecture and web services, making the data they contain available in XML formats is becoming more important for engineers. This requires approaches for mapping existing relational data to XML formats.
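In its simplest form, mapping relational data to XML means turning each row into an element and each column into a sub-element. The sketch below does this for one table with Python's standard library; the `employees` table, its columns and the `table_to_xml` helper are hypothetical, and no keys or constraints are carried over.

```python
import sqlite3
import xml.etree.ElementTree as ET

def table_to_xml(conn, table, row_tag):
    """Serialise every row of `table` as a <row_tag> child of a <table> root."""
    # The table name is interpolated directly: acceptable only for trusted names.
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    root = ET.Element(table)
    for row in cursor:
        item = ET.SubElement(root, row_tag)
        for column, value in zip(columns, row):
            child = ET.SubElement(item, column)
            if value is not None:  # a NULL column becomes an empty element
                child.text = str(value)
    return ET.tostring(root, encoding="unicode")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Ada", "R&D"), (2, "Grace", None)],
)
xml_text = table_to_xml(conn, "employees", "employee")
print(xml_text)
```

This naive export loses the fact that `id` is a primary key, which is exactly the kind of relational semantics that schema-level mappings try to preserve.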
Lv & Yan (2006) approach this problem by translating relational database schemas to DTD documents. They present a method which generates DTDs from relational schemas in the presence of keys, foreign keys and functional dependencies, preserves the semantics these constraints imply, and can convert multiple tables to XML DTDs. While this is a step towards full semantic conversion of relational schemas to XML DTDs, they note that work remains in converting further relational semantics such as multi-valued dependencies.
DTDs are the most commonly used XML schema definition documents but, as Lim, Joo, Kim & Choi (2007) note, the DTD is a simple format which lacks the resolution to capture some of the finer points of relational data, such as primary keys, foreign keys and other constraints.
A DTD defines the structure of a valid XML document using simple content-model expressions. These do not allow a sufficient level of detail for XML-to-relational mapping. For example, a DTD can declare that a list contains zero or one, zero or more, or one or more elements, but it cannot directly express other numeric limits.
Lim, Joo, Kim & Choi (2007) base their XML mapping algorithm on the newer XML Schema Definition (XSD) documents. An XSD can specify that a list contains 2 to 5 elements, for example, and can be used to ensure that an XML document is not only well formed but also valid against the schema. XML Schema allows finer-grained control of an XML format than DTD, and is better suited to the automated mapping of an XML Schema to SQL Data Definition Language (DDL) for relational databases.
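To give a flavour of why typed schemas lend themselves to automated mapping, the sketch below turns a very small XSD into a `CREATE TABLE` statement. It is not the algorithm of Lim, Joo, Kim & Choi: the type mapping, the `employee` schema and the `schema_to_ddl` helper are hypothetical, only flat sequences of simply typed elements are handled, and repeated elements (a `maxOccurs` above 1), which would need a child table, are ignored.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical type mapping; a real generator would cover far more of XML Schema.
SQL_TYPES = {
    "xs:string": "VARCHAR(255)",
    "xs:integer": "INTEGER",
    "xs:nonNegativeInteger": "INTEGER",
    "xs:date": "DATE",
}

# Hypothetical schema using the conventional xs prefix.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="employee">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="id" type="xs:nonNegativeInteger"/>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="hired" type="xs:date" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

def schema_to_ddl(schema_text):
    """Map each top-level complex element to one CREATE TABLE statement."""
    schema = ET.fromstring(schema_text)
    statements = []
    for top in schema.findall(XS + "element"):
        columns = []
        for field in top.findall(f"{XS}complexType/{XS}sequence/{XS}element"):
            name, xsd_type = field.get("name"), field.get("type")
            parts = [name, SQL_TYPES[xsd_type]]
            if field.get("minOccurs", "1") != "0":
                parts.append("NOT NULL")  # mandatory element, mandatory column
            if xsd_type == "xs:nonNegativeInteger":
                parts.append(f"CHECK ({name} >= 0)")  # keep the value constraint
            columns.append(" ".join(parts))
        statements.append(
            f"CREATE TABLE {top.get('name')} (\n  " + ",\n  ".join(columns) + "\n);"
        )
    return statements

for statement in schema_to_ddl(SCHEMA):
    print(statement)
```

The occurrence and type information that drives the `NOT NULL` and `CHECK` clauses is precisely what a DTD cannot express.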
XML has proved hugely successful in the areas of document mark-up, data and metadata sharing, enabling interoperability, and transparently transporting and storing data. With the current level of interest in the next generation of enterprise systems, the use of XML is set to grow, as it is a core technology for web services, portal development and service-oriented architectures. XML is here to stay, and its future looks bright.
References
- Anderson, T. (2004) Introducing XML. Retrieved on February 29, 2008 from http://www.itwriting.com/xmlintro.php
- Cover, R. (2007) XML and Compression. Retrieved on February 29, 2008 from http://xml.coverpages.org/xmlAndCompression.html
- Harrusi, S., Averbuch, A. & Yehudai, A. (2006) XML Syntax Conscious Compression. Proceedings of the 2006 Data Compression Conference. 10 – 19.
- IBM Corp. (2005) Uses of XML. Retrieved on February 29, 2008 from http://publib.boulder.ibm.com/infocenter/iseries/v5r3/index.jsp?topic=/rzamj/rzamjintrouses.htm
- Lim, J., Joo, K., Kim, K. & Choi, M. (2007) Design of Automatic Database Schema Generator Based on XML Schema. Proceedings of the 2007 International Conference on Computational Intelligence and Security. 1039 – 1043.
- Lu, W., Chiu, K. & Pan, Y. (2006) A Parallel Approach to XML Parsing. Proceedings of the 7th IEEE/ACM International Conference on Grid Computing. 223 – 230.
- Lv, T. & Yan, P. (2006) Mapping Relational Schemas to XML DTDs with Constraints. Proceedings of the 2006 IMCCS First International Multi-Symposiums on Computer and Computational Sciences. Vol. 2. 528 – 533.
- Raggett, D., Lam, J., Alexander, I. & Kmiec, M. (1998) A History of HTML. Retrieved on February 29, 2008 from http://www.w3.org/People/Raggett/book4/ch02.html
- Scardina, M., Chang, B. & Wang, J. (2004) Oracle Database 10g XML & SQL: Design, Build & Manage XML Applications in Java, C, C++ & PL/SQL. McGraw-Hill: Emeryville, CA.
- W3C (1998) W3C XML Specification DTD. Retrieved on February 29, 2008 from http://www.w3.org/XML/1998/06/xmlspec-report-19980910.htm
- W3C (2000) XML Schema. Retrieved on February 29, 2008 from http://www.w3.org/XML/Schema