Writing dynamic XML sitemaps using PHP

by Richard Bradshaw


Since Google introduced sitemaps in 2005, they have grown to be accepted by the four main search engines: Google, Live Search, Yahoo and Ask.

As the official sitemaps page describes:

Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.

So, basically it’s an XML file that simply describes what pages you have, when they were modified and how important you think they are.

An example would be:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
                            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
        <url>
                <loc>http://bradshawenterprises.com/blog</loc>
                <lastmod>2008-09-07T10:21:52+00:00</lastmod>
                <changefreq>weekly</changefreq>
                <priority>1.0</priority>
        </url>
</urlset>

In this example, all fields are used, but you can get away with just the loc information. lastmod is a W3C Datetime (the same format as an ATOM date), changefreq can be always, hourly, daily, weekly, monthly, yearly or never, and priority goes from 0.0 to 1.0. This and the rest of the protocol are described on the official webpage.
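The field rules above can be captured in a small helper. This is only a sketch: the function name `sitemap_url_entry` and its parameters are hypothetical, not part of the sitemap protocol or of the script later in this post. It emits just the loc when nothing else is given, and rejects a changefreq or priority outside the allowed values.

```php
<?php
// Hypothetical helper (not part of the sitemap protocol): builds one <url> entry.
function sitemap_url_entry($loc, $lastmod = null, $changefreq = null, $priority = null)
{
    $allowed = array('always', 'hourly', 'daily', 'weekly', 'monthly', 'yearly', 'never');

    if ($changefreq !== null && !in_array($changefreq, $allowed, true)) {
        throw new InvalidArgumentException("Invalid changefreq: $changefreq");
    }
    if ($priority !== null && ($priority < 0.0 || $priority > 1.0)) {
        throw new InvalidArgumentException('Priority must be between 0.0 and 1.0');
    }

    // Escape the URL so characters such as & stay valid inside XML.
    $xml = "<url>\n\t<loc>" . htmlspecialchars($loc, ENT_QUOTES | ENT_XML1, 'UTF-8') . "</loc>\n";
    if ($lastmod !== null) {
        $xml .= "\t<lastmod>" . $lastmod . "</lastmod>\n";
    }
    if ($changefreq !== null) {
        $xml .= "\t<changefreq>" . $changefreq . "</changefreq>\n";
    }
    if ($priority !== null) {
        // Always print one decimal place, e.g. 1.0 rather than 1.
        $xml .= "\t<priority>" . number_format($priority, 1, '.', '') . "</priority>\n";
    }
    return $xml . "</url>\n";
}

// Only loc and priority are supplied here; the other fields are simply left out.
echo sitemap_url_entry('http://YOURDOMAIN/about.php', null, null, 0.5);
```

Leaving the optional arguments as null keeps the output as short as the protocol allows, which is handy for static pages where you have no meaningful lastmod.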

Publishing a sitemap lets the search engines examine deeper parts of your site that may not be well linked, and tells them what’s new without them having to crawl the whole site.

Each search engine provides an interface to register your sitemap and check its status. The best of these in my experience is Google Webmaster Tools, though the others have something equivalent as well.

Dynamically generating a sitemap

This tutorial will go through reading URLs from a database rather than from the file system. This is because the key point here is describing things that have changed or are new. In my sitemaps I manually type in the static pages, and then dynamically write in the rest for simplicity and speed. Why read through numerous directories if we know that things haven’t changed?

So, we start off with a connection to a database:

include("assets/dbconnect.php");
$blogs = mysql_query("SELECT * FROM blog_posts ORDER BY timestamp DESC");

Here, I’m just using a stock database connection script and then really simply querying for the blog posts. I’m ordering them by timestamp so it’s easy to check it’s working, as the newest post will be first.
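The `mysql_*` functions used above were removed in PHP 7, so on a newer server the same query needs a different extension. Here is a minimal sketch using PDO instead. The function name `fetch_blog_posts` is hypothetical, and it assumes the same blog_posts table with url and timestamp columns. The demonstration at the bottom uses an in-memory SQLite database purely so the snippet runs on its own; you would pass in a PDO connection to your own database instead.

```php
<?php
// Hypothetical helper: returns all blog posts, newest first, as associative arrays.
// Assumes a blog_posts table with url and timestamp columns, as in the query above.
function fetch_blog_posts(PDO $pdo)
{
    $stmt = $pdo->query('SELECT url, timestamp FROM blog_posts ORDER BY timestamp DESC');
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}

// Stand-in database so the sketch is self-contained (SQLite, in memory).
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE blog_posts (url TEXT, timestamp INTEGER)');
$pdo->exec("INSERT INTO blog_posts VALUES ('http://YOURDOMAIN/blog/older', 1220000000)");
$pdo->exec("INSERT INTO blog_posts VALUES ('http://YOURDOMAIN/blog/newer', 1220782912)");

$posts = fetch_blog_posts($pdo);
echo $posts[0]['url'], "\n"; // the newest post comes first
```

With this approach the loop further down would iterate over `$posts` with foreach rather than calling mysql_fetch_array.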

Next, we sort out a content type header and the XML prologue:

header ("Content-type: text/xml");
echo ("<?xml version=\"1.0\" encoding=\"utf-8\"?>\n");

The header just tells the user agent that this is XML so that it knows how to process it. If your browser gets confused and wants to download it, just comment this out whilst testing. Most modern browsers won’t do this anyway.

I’m echoing out the prologue because, with short tags enabled, PHP treats the <? at the start of <?xml as the start of a PHP block and gets confused.

Next, we set up the XML file and its namespaces:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">

This describes the XML to the user agent so that it knows how to interpret the various fields in the file.

Now we get to the bulk of the file:

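<?php while ($current_post = mysql_fetch_array($blogs)) { ?>
<url>
	<?php // Escape the URL so that & and similar characters stay valid XML ?>
	<loc><?php echo htmlspecialchars($current_post['url']); ?></loc>
	<lastmod><?php echo gmdate(DATE_ATOM, $current_post['timestamp']); ?></lastmod>
</url>
<?php } ?>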

This just loops through my blog posts and spits out the URL and a nicely formatted timestamp. I’m using gmdate here because my server is in a different timezone.
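To see what that timestamp formatting produces, here is a small sketch. The helper name `sitemap_lastmod` is hypothetical. It assumes the column holds either a Unix timestamp (as in my loop above) or a YYYY-MM-DD HH:MM:SS string stored in UTC; if your table stores local times you would need to adjust the conversion.

```php
<?php
// Hypothetical helper: formats a post's timestamp for the <lastmod> element.
function sitemap_lastmod($value)
{
    if (!is_numeric($value)) {
        // Assumption: non-numeric values are UTC date strings such as "2008-09-07 10:21:52".
        $value = strtotime($value . ' UTC');
    }
    // gmdate() always formats in UTC, whatever the server's timezone is.
    return gmdate(DATE_ATOM, (int) $value);
}

echo sitemap_lastmod(1220782912), "\n";            // 2008-09-07T10:21:52+00:00
echo sitemap_lastmod('2008-09-07 10:21:52'), "\n"; // 2008-09-07T10:21:52+00:00
```

Both calls print the same value as the lastmod in the example sitemap at the top of this post.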

Underneath this, I just hand-type the remaining pages:

<url>
	<loc>http://YOURDOMAIN/about.php</loc>
	<priority>0.5</priority>
</url>
<url>
	<loc>http://YOURDOMAIN/contact.php</loc>
	<priority>0.5</priority>
</url>

Right at the bottom, just place a closing tag

</urlset>

to signify the end of the file.

That’s all you need for your sitemap file. Place it in a file in the root of your domain and call it sitemap.php.

Let search engines know it exists

Either create a file in the root of your domain called robots.txt, or open the existing one. At the bottom just add a line that says:

Sitemap: http://YOURDOMAIN/sitemap.php

and save it. This lets search engines find the file.

This is all good so far: we now have a map that updates as the site updates without any real hassle. The next step is to make sure that search engines are told every time a new post is added. For this, you need to find the code where you save new posts to the database. I’m using curl here because it seems to be available everywhere.

Add this code as soon as you’ve checked that the entry has been saved properly.

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "THEURLYOUNEEDTOPING");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$output = curl_exec($ch);
curl_close($ch);

Each search engine mentioned above has an address you can use here. Here’s a quick summary:

Google: http://www.google.com/webmasters/tools/ping?sitemap=
Yahoo: http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=
Ask: http://submissions.ask.com/ping?sitemap=
Live Search: http://webmaster.live.com/ping.aspx?siteMap=

All you do is add the full URL of your sitemap at the end, and use it in the code above. This will ensure that whenever you post anything, all the search engines are notified immediately.

I’d loop through these to do them all in one go like this:

$sitemap = "http://YOURDOMAIN/sitemap.php";

$pingurls = array(
	"http://www.google.com/webmasters/tools/ping?sitemap=",
	"http://search.yahooapis.com/SiteExplorerService/V1/ping?sitemap=",
	"http://submissions.ask.com/ping?sitemap=",
	"http://webmaster.live.com/ping.aspx?siteMap="
);

foreach ($pingurls as $pingurl) {
	$ch = curl_init();
	// URL-encode the sitemap address, since it is passed as a query parameter
	curl_setopt($ch, CURLOPT_URL, $pingurl . urlencode($sitemap));
	curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
	$output = curl_exec($ch);
	curl_close($ch);
}
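The loop above throws away the response, so a failed ping goes unnoticed. As a rough sketch, the version below URL-encodes the sitemap address before appending it and records the HTTP status for each ping. The function names `build_ping_urls` and `ping_sitemap` are hypothetical, and treating a status of 200 as success is an assumption on my part; check what each search engine actually returns.

```php
<?php
// Hypothetical helper: appends the URL-encoded sitemap address to each ping endpoint.
function build_ping_urls($sitemap, array $endpoints)
{
    $urls = array();
    foreach ($endpoints as $endpoint) {
        $urls[] = $endpoint . urlencode($sitemap);
    }
    return $urls;
}

// Hypothetical helper: requests each ping URL and returns the HTTP status per URL
// (0 means the request itself failed, e.g. the host could not be reached).
function ping_sitemap(array $urls)
{
    $results = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 10); // don't let a slow ping hold up saving a post
        curl_exec($ch);
        $results[$url] = (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
    }
    return $results;
}

$urls = build_ping_urls(
    'http://YOURDOMAIN/sitemap.php',
    array('http://www.google.com/webmasters/tools/ping?sitemap=')
);
echo $urls[0], "\n";
```

Calling `ping_sitemap($urls)` after a successful save would then give you an array of status codes that you can log or check.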

An alternative is to use a site such as PingMyMap, which provides a URL for you to call and then pings the same search engines on your behalf. The benefit here is that if the addresses change, it will still work. Your call really!

Any more ideas on how to implement this? Let me know below!
