XML For N00bs, AJAX for Gurus
Filed: Wed, Jan 17 2007 under Programming || Tags: ajax xml parse crawl traverse tutorial beginner

If you look at the source code of any web page you'll see that it begins with <html> and ends with </html>. Everything in between is simply the content of the <html> element. One day, in a galaxy far, far away, the people who brought us HTML decided that what was good for the web was good for people's data. And they saw that it was good, and called it XML.

The articles on this site have introduced AJAX, covered the basics, explored the caveats, and provided a simple framework for concurrent calls. But like most of the rest of the Internet they use AJAX to receive plain-text data from the server. People tend to forget that the X in AJAX stands for "XML" because responseText is easy, responseXML not so much. With this article the last elements of AJAX (synchronous and responseXML) will be covered and we'll begin exploring XML -- what it is, how to load it, how to parse it, and how to use it in JavaScript applications.

The Extensible Markup Language (XML) is a W3C-recommended general-purpose markup language that supports a wide variety of applications. XML languages or 'dialects' are easy to design and to process. XML is also designed to be reasonably human-legible, and to this end, terseness was not considered essential in its structure. XML is a simplified subset of Standard Generalized Markup Language (SGML). Its primary purpose is to facilitate the sharing of data across different information systems, particularly systems connected via the Internet. Formally defined languages based on XML (such as RSS, MathML, XHTML, Scalable Vector Graphics, MusicXML and thousands of other examples) allow diverse software to reliably understand information formatted and passed in these languages.
-- Wikipedia

Introduction

Before the Internet came along, most data was described in a program's source code and then saved in a machine-readable binary format. When someone needed to work with that data, they had to call up the source code of the program to figure out how the data was stored on the hard drive. The Internet, however, allowed data to be transferred between people who had no access to the original source code, so there was a need to send the description of the data along with the data itself. Since the web was already doing this with HTML, it wasn't a great leap of human ingenuity to create XML from HTML. Indeed, XML is so similar to HTML that the web is now slowly transitioning to what is known as XHTML.

Additionally, XML has taken root in several places behind the scenes on the Internet. All RSS feeds, for instance, are XML. A site offers its search engine to your browser with XML. Even Blizzard uses it as a framework for the Lua scripting language in its wildly popular World of Warcraft.

For the purpose of this article we'll be using a snapshot of this site's RSS feed.
If you browse the RSS file you'll see that, aside from the tag names, it looks like any HTML source, except that extra care has been taken to close every element (</title>). The element names are descriptive, and even if you don't understand any programming languages you can follow along with the data and see what it does. Every element can have sub-elements, and those can have sub-elements of their own, and so on. But they are all part of the original element, which was the very first tag, named <rss>.

The vocabulary

This method of displaying data is called a "tree" or "document tree". The root is the first tag (here, <rss> -- and in a web page <html>), and from there the various elements branch off. If you've ever used a threaded message board or email application where the replies are collapsible entries under the original message, you should be very comfortable with this structure. The root branches off to its various children, just like a genealogy chart. <rss> is the parent of <channel> and likewise <channel> is a child of <rss>.

An individual element is referred to as a "node". <rss> is a node which contains the entire document. <channel> is a node which contains everything in the document except the <rss> tags themselves.

And that's pretty much all the vocabulary you need to understand how we deal with an XML file.

Loading the XML File

To begin working with XML data we'll first need a way to get it. Fortunately, another web 2.0 buzzword called AJAX is designed to do just that. Unfortunately, Microsoft -- who made AJAX possible -- seems to have a problem using AJAX to read XML files through responseXML, but of course there's a workaround.
function loadXML(url) {
  // This function accepts a file name and returns an XML document.
  // Note we're doing no error checking.
  if (window.ActiveXObject) {
    // Since IE has so much trouble with responseXML
    // we'll load the XML with the XMLDOM ActiveX object.
    var xmlDoc = new ActiveXObject("Microsoft.XMLDOM");
    // This is a synchronous call! The script will stall until the document
    // has been loaded, so no onreadystatechange handler is needed.
    xmlDoc.async = false;
    xmlDoc.load(url);
    return xmlDoc;
  }
  // Initialize the AJAX object.
  var AJAX = new XMLHttpRequest();
  // We're passing false so this is a synchronous request.
  // The script will stall until the document has been loaded.
  AJAX.open("GET", url, false);
  AJAX.send(null);
  return AJAX.responseXML;
}
This very simplified function accepts a URL. If the browser supports ActiveX (Internet Explorer), it loads the XML with the XMLDOM object, which seems to have far fewer problems returning XML that is compatible with Firefox's. If there's no ActiveX, the document is loaded through the XMLHttpRequest object. Note that both methods set up a synchronous call, which means JavaScript will stall at xmlDoc.load or AJAX.send until the requested document has been loaded. This is done by setting xmlDoc.async to false and by passing false as the third parameter of AJAX.open.

Crawl / Traverse the file

Now that the document has been loaded it's time to do something with it. You might be surprised at how easy it is to crawl the document and output something readable.
function crawlXML(doc) { // Crawls an XML document
  if (doc.hasChildNodes()) { // If present element has children
    _xmlStr += '<ul><li>' + doc.tagName + '> '; // Display current tag name
    for (var i = 0; i < doc.childNodes.length; i++) { // For each child node on current level
      crawlXML(doc.childNodes[i]); // Call this function recursively
    } // End for loop
    _xmlStr += '</li></ul>'; // Close the list item and the list
  } else { // Current node has no children
    _xmlStr += (doc.nodeValue || ''); // So display its value (empty elements have none)
  } // End childNode check
} // End crawlXML
This function recursively crawls the XML document. Recursion is simply a technique in which a function calls itself, passing different data each time, and it is especially well suited to tree-like data structures. The output is placed in a global variable called _xmlStr, which can be displayed after crawlXML has finished running. As you can see, it would be fairly trivial to remove _xmlStr and replace it with code to initialize your own variables and arrays: doc.tagName becomes the name of the variable and doc.nodeValue becomes the value. Here's the completed code.
function loadXML(url) {
  // This function accepts a file name and returns an XML document.
  // Note we're doing no error checking.
  if (window.ActiveXObject) {
    // Since IE has so much trouble with responseXML
    // we'll load the XML with the XMLDOM ActiveX object.
    var xmlDoc = new ActiveXObject("Microsoft.XMLDOM");
    // This is a synchronous call! The script will stall until the document
    // has been loaded, so no onreadystatechange handler is needed.
    xmlDoc.async = false;
    xmlDoc.load(url);
    return xmlDoc;
  }
  // Without XMLHttpRequest there is nothing left to try.
  if (!window.XMLHttpRequest) { return null; }
  // Initialize the AJAX object.
  var AJAX = new XMLHttpRequest();
  // We're passing false so this is a synchronous request.
  // The script will stall until the document has been loaded.
  AJAX.open("GET", url, false);
  AJAX.send(null);
  return AJAX.responseXML;
}

function crawlXML(doc) { // Crawls an XML document
  if (doc.hasChildNodes()) { // If present element has children
    _xmlStr += '<ul><li>' + doc.tagName + '> '; // Display current tag name
    for (var i = 0; i < doc.childNodes.length; i++) { // For each child node on current level
      crawlXML(doc.childNodes[i]); // Call this function recursively
    } // End for loop
    _xmlStr += '</li></ul>'; // Close the list item and the list
  } else { // Current node has no children
    _xmlStr += (doc.nodeValue || ''); // So display its value (empty elements have none)
  } // End childNode check
} // End crawlXML

// Initialize a global variable
var _xmlStr = '';
// Load the document
var xmlDoc = loadXML('http://www.hunlock.com/feed.php');
// Crawl the document -- passing it the top element
crawlXML(xmlDoc.documentElement);
document.getElementById("someDivision").innerHTML = _xmlStr;
Running the script populates the "someDivision" element with the results of crawling this site's current RSS feed.
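The earlier remark about swapping _xmlStr for your own variables and arrays can be sketched roughly as follows. This is only an illustration, not code from the original demo: nodeToObject is a hypothetical helper name, and the sketch assumes text nodes report nodeType 3, CDATA sections nodeType 4, and elements nodeType 1, as the DOM defines them. Repeated tag names (such as several item elements) are gathered into an array.

```javascript
// Hypothetical helper: turns an element node into a plain JavaScript value.
// An element holding only text becomes a string; an element with child
// elements becomes an object keyed by tag name.
function nodeToObject(node) {
  var result = {};
  var text = '';
  var hasElements = false;
  for (var i = 0; i < node.childNodes.length; i++) {
    var child = node.childNodes[i];
    if (child.nodeType === 3 || child.nodeType === 4) { // text or CDATA
      text += child.nodeValue;
    } else if (child.nodeType === 1) { // element
      hasElements = true;
      var name = child.tagName;
      var value = nodeToObject(child); // recurse, just like crawlXML
      if (!Object.prototype.hasOwnProperty.call(result, name)) {
        result[name] = value; // first time this tag name is seen
      } else if (Array.isArray(result[name])) {
        result[name].push(value); // third or later occurrence
      } else {
        result[name] = [result[name], value]; // second occurrence
      }
    }
  }
  // Whitespace between child elements is dropped; text-only elements keep it.
  return hasElements ? result : text;
}
```

Assuming the usual RSS layout of a <channel> holding a <title>, nodeToObject(xmlDoc.documentElement).channel.title would then hold the feed's title as an ordinary string.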
Crawling the XML document is a great way to extract the data if you're not sure of the structure or element names, but if you know exactly what you want there is another way. Because XML and HTML are two sides of the same coin, the functions which allow you to navigate HTML are also available for navigating XML.

Parse the XML

In HTML you can extract all the IMG elements by calling document.getElementsByTagName("IMG"). Likewise, you can call xmlDoc.getElementsByTagName("item") and get back an array-like list of all the item elements. For example...
var items = xmlDoc.getElementsByTagName("item");
alert(items.length);
This will pop up an alert box showing the number of "item" elements that were found and put into "items". Each entry in the list is an XML element that effectively looks like this:

<item>
  <title>Working around IE7s prompt bug, er feature</title>
  <link>http://www.hunlock.com/blogs/Working_around_IE7s_prompt_bug,_er_feature</link>
  <description><![CDATA[One of Internet Explorer's many gotcha's is the fact that Microsoft decided to create a security wall around the javascript "prompt" command. When a script tries to call a prompt, Internet Explorer will drop a security warning line at the top of the window -- hilarity ensues from there.]]></description>
  <author>Patrick</author>
  <guid isPermaLink='true'>http://www.hunlock.com/blogs/Working_around_IE7s_prompt_bug,_er_feature</guid>
</item>

Taking the concept a step further, we can extract the title from the first item with the following code.
tagTitle=items[0].getElementsByTagName("title")[0];
alert(tagTitle.tagName+'='+tagTitle.firstChild.nodeValue);
The above code displays the tag name and the text of the first item's title. This leads us into some of the idiosyncrasies of traversing the XML object model. Even though items[0] has only one title, getElementsByTagName still returns a list, which we work around by asking for its first entry, [0]. A simple check of tagTitle.tagName shows we correctly recovered the title element, but the only way to access the VALUE of the title is through the firstChild property (tagTitle.firstChild.nodeValue), because the text is stored in a child text node rather than on the element itself. Using this knowledge, we can extract the link with the following code:
tagLink=items[0].getElementsByTagName("link")[0];
alert(tagLink.tagName+'='+tagLink.firstChild.nodeValue);
This displays the link's tag name and its URL. The guid tag has an attribute called "isPermaLink". To retrieve an element's attributes we can explore the .attributes property, which returns an array-like collection of the element's attributes.
guidAttr=items[0].getElementsByTagName("guid")[0].attributes;
alert(guidAttr[0].name+'='+guidAttr[0].value);
This displays the attribute's name and value (isPermaLink=true for the item shown earlier). And of course guidAttr.length gives the number of attributes, with 0 meaning none.

Navigation Reference

A few navigation methods and properties of the object model are as follows...
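As a rough illustration, the sketch below exercises the most common navigation properties. It relies only on standard DOM members (parentNode, firstChild, nextSibling, nodeType, nodeName). The helper names firstElement, nextElement and pathTo are hypothetical and are not part of the DOM. The helpers skip non-element nodes because browsers differ on whether the whitespace between tags shows up as text nodes.

```javascript
// Returns the first child that is an element (nodeType 1), skipping text,
// comment, and whitespace nodes. Returns null when there is none.
function firstElement(node) {
  var child = node.firstChild;
  while (child && child.nodeType !== 1) {
    child = child.nextSibling;
  }
  return child || null;
}

// Returns the next sibling that is an element, or null at the end of the level.
function nextElement(node) {
  var sibling = node.nextSibling;
  while (sibling && sibling.nodeType !== 1) {
    sibling = sibling.nextSibling;
  }
  return sibling || null;
}

// Follows parentNode links upward and returns a path such as "rss/channel/item".
// The walk stops at the document node, whose nodeType is 9 rather than 1.
function pathTo(node) {
  var names = [];
  while (node && node.nodeType === 1) {
    names.unshift(node.nodeName);
    node = node.parentNode;
  }
  return names.join('/');
}
```

With the feed loaded earlier, and assuming the items sit directly under <channel>, pathTo(items[0]) should give rss/channel/item, and nextElement(firstElement(items[0])) should be the <link> element that follows <title>.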
More Reading

For a complete reference, Microsoft has a list of all the DOM methods and properties, and the W3C has the official reference.

In conclusion

In IE and Firefox an XML document is treated as DOM compatible, so the core DOM properties and methods that work on HTML will also work on XML. This is important because as you become familiar with the document object model, not only will your understanding of web pages themselves increase, but you'll also find the transition to XHTML a natural and fluid evolution.

And there you have it -- a way to read XML, to crawl it, and to extract data from the XML directly.
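As a closing sketch, the extraction steps from the parsing section can be wrapped in two small helpers. The names textOf and attributesOf are hypothetical. The code assumes each value sits in the element's first child node, as in the title and link examples, and that the attribute collection can be indexed by number, as in the guid example.

```javascript
// Returns the text of the first <tagName> element inside parent,
// or an empty string when the element is missing or has no children.
function textOf(parent, tagName) {
  var found = parent.getElementsByTagName(tagName);
  if (found.length === 0 || !found[0].firstChild) {
    return '';
  }
  return found[0].firstChild.nodeValue;
}

// Copies an element's attributes into a plain object of name/value pairs.
function attributesOf(element) {
  var result = {};
  var attrs = element.attributes;
  for (var i = 0; i < attrs.length; i++) {
    result[attrs[i].name] = attrs[i].value;
  }
  return result;
}
```

For example, textOf(items[0], "title") replaces the two-step lookup shown earlier, and attributesOf(items[0].getElementsByTagName("guid")[0]).isPermaLink reads the guid attribute by name instead of by position.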