Using XML Sitemaps
By Bruce Clay
White Paper

Synopsis:
An XML Sitemap provides a bulk list of Web page URLs to
the search engines, and is a file created according to a standard protocol.
Webmasters can use XML Sitemaps to proactively alert search engines to the
full depth of their Web sites. The site stands to gain by increasing the
number of Web pages indexed, which in turn can lead to improved rankings for
more keywords in the search engine results pages.
This paper covers the purpose and definition of XML
Sitemaps, detailed instructions for creating and submitting them to the
search engines, and information on the various types of specialized Sitemap
files.
Purpose of an XML Sitemap
Making sure the search engines have a Web site in their
indexes is the first essential step of any search engine optimization
project. If the search engines don't even know a Web site exists, then it's
futile to try to optimize it. That would be like trying to paint a picture
without paint, or throw a dinner party without food. Job number one is to
make sure the search engine spiders come to the Web site, crawl its pages
and include them in the search engine's index. And not just a few pages:
having all of a site's indexable pages included in the search engine's index
positively affects search engine rankings. Not only does the increased
content allow for more potential matches to search queries, but also the
search engines generally see more depth of content as a sign of greater
expertise, and award higher search engine placement accordingly.
The usual way that search engine spiders find a Web site
is by following an inbound link coming from another site. Once the spider
arrives at the new Web page, it may continue exploring the other pages in
the site by following the site's internal navigation links. (This assumes
that nothing blocks the search engines from crawling the site, such as an
incorrect disallow command in the site's robots.txt file.) However, waiting
passively for the search engines to find the site is not the fastest or most
reliable method to get indexed. What's needed is a better way, a way that
proactively feeds the information to the search engine spiders and invites
them to come crawling.
The XML Sitemap was developed to answer this need. XML
Sitemaps give webmasters an easy way to inform the search engines about
their Web site pages. Google, Yahoo! and Bing, among other search engines,
have all agreed to a common protocol for building an XML Sitemap, which is a
convenience for webmasters. The official site containing the full protocol
is http://sitemaps.org/.
By enabling webmasters to alert search engines to their
Web pages, XML Sitemaps can help those pages be indexed sooner and more
completely.
The search engines do not guarantee they will index every
link, but providing this information to them greatly improves a site's
chances. For SEO purposes, it is essential that site owners build a Sitemap
and keep it updated. XML Sitemaps are especially valuable for:
- New Web sites
- Sites with few or no inbound links
- Sites that launch a redesign or large-scale update
- Sites that have only a small percentage of their pages indexed
- Sites that have a large number of pages archived or not well linked together
- Sites that cannot be easily crawled because they have dynamic content, a heavy use of Flash or AJAX, or poor internal navigation
XML Sitemaps Defined
The following excerpt from http://sitemaps.org/ explains what an XML
Sitemap is in a nutshell: "In its simplest form, a Sitemap is an XML file
that lists URLs for a site along with additional metadata about each URL
(when it was last updated, how often it usually changes, and how important
it is, relative to other URLs in the site) so that search engines can more
intelligently crawl the site."
An XML Sitemap is merely a list of the pages on a Web
site, with some optional metadata about them that helps spiders crawl the
site more intelligently. It is in XML (Extensible Markup Language) format,
which is code rather than text: unreadable by most humans, but very
efficient for search engine spiders.
An XML Sitemap can contain literally thousands of page
URLs (up to 50,000 according to Google's guidelines). Its purpose is to give
the search engine a complete picture of the Web site so that it can be
crawled more fully. The best use of an XML Sitemap is to include all of the
pages the webmaster wants to be indexed on the site.
Note: In addition to the general Sitemap discussed so
far, Google has defined some specialized Sitemaps for mapping links to
specific types of content such as news, video, and geospatial content.
These will be covered later under the heading "Building
Specialized Sitemaps for Google."
XML Sitemap vs. HTML Site Map
XML Sitemaps should not be confused with traditional HTML
site maps. Often sites have a "Site map" link to a page that looks like a
Web site table of contents. People can use this kind of site map to see how
a site is organized and locate hard-to-find pages, which is a function XML
Sitemaps cannot offer. (For an example, see http://www.apple.com/sitemap/.)
An HTML site map:
- Is primarily intended for the site's human visitors, not spiders
- Organizes the site's contents into categories, rather than just as an unstructured list
- Contains no more than 100 links (per Google's "Design and Content Guidelines" at http://www.google.com/support/webmasters/bin/answer.py?hl=en&answer=35769)
- Contains links to important site pages only, due to the limited number allowed
- Uses anchor text for links (as opposed to straight URLs)
- Can pass PageRank (i.e., link popularity) through the links, as opposed to an XML Sitemap, which passes none
- Reinforces site themes by the way links are categorized, or for larger sites, by the way links are divided between multiple HTML site map pages
Because of these differences, a Web site ideally should
have an HTML site map as well as an XML Sitemap. Though an HTML site map is
primarily for users, it also benefits the site's search engine optimization
effort when search engine spiders crawl a site's HTML site map page. Spiders
can follow the links and index the important site pages, if any have been
missed. But that�s not where the real SEO value lies.
The real SEO advantage of having an HTML site map is in
the link anchor text and organization. An HTML site map shows a search
engine spider what a site is really about. A search engine tries to discover
every Web site's themes and keywords so it can deliver relevant results to
the people searching for those things. Links on a site map can contain the
keywords that identify exactly what each page is about. A link such as "Hot
Air Balloon Supplies" communicates a lot better than a URL path alone could.
So to summarize, a Web site should have both an XML
Sitemap for full site coverage and an HTML site map for establishing themes,
to provide search engines with the strongest, most complete understanding
of the site.
Building an XML Sitemap
Since only search engine spiders see XML Sitemaps, the
code can be extremely efficient. Formatting issues like font size and color
are not a concern. The point is to feed the spiders a batch of page links as
smoothly as possible, so they will crawl as much as possible through the
site.
The way the XML code is written matters. To ensure that
the Sitemap can be easily read by the majority of search engines, it should
conform to the accepted Sitemap protocol published at http://sitemaps.org/.
Google provides easy instructions on how to create a
Sitemap based on the Sitemap protocol. There are two options: create it
manually (explained at
https://www.google.com/support/webmasters/bin/answer.py?answer=34657)
or use an automatic Sitemap generator.
Creating a Sitemap Manually
The format for creating a Sitemap file is not
complicated. It is a text file saved with a .xml extension. After that's
done, filling in the list of URLs and optional metadata follows a
repetitive structure. Below is a simple example that has only one page
URL:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>http://www.example.com/</loc>
<lastmod>2009-03-21</lastmod>
<changefreq>monthly</changefreq>
<priority>0.8</priority>
</url>
</urlset>
Two lines of code must appear once at the top of the file.
They are:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
One line of code must appear at the very bottom of the
file. This code is as follows:
</urlset>
Between the top and bottom code are all of the URL
entries. Each URL must contain, at a minimum, the following three lines:
<url>
<loc>http://www.yourdomain.com/</loc>
</url>
The <loc> tag identifies the URL where the page is
located. In addition, for every URL the XML Sitemap can provide three
optional tags as follows:
<lastmod> The last modified date for this page
<changefreq> How often the page changes
(e.g., hourly, daily, monthly, never)
<priority> How important the page is from 0 (lowest) to 1
(highest)
The example above shows a URL entry with all required and optional tags
included. The search engines may consult the additional tags when deciding
how often to spider a site, so using them provides another way to
potentially improve its spiderability.
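The repetitive structure described above lends itself to automation. The following is a minimal sketch, assuming a hypothetical list of page entries, of generating a Sitemap with Python's standard-library xml.etree.ElementTree; the function name and page list are illustrative, not part of the protocol.

```python
# Sketch: build a Sitemap string from a list of page entries.
# Each entry is a dict with a required "loc" and optional
# "lastmod", "changefreq", and "priority" keys, mirroring the tags
# described in the text above.
import xml.etree.ElementTree as ET

def build_sitemap(pages):
    urlset = ET.Element(
        "urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for page in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page["loc"]  # required tag
        for tag in ("lastmod", "changefreq", "priority"):  # optional tags
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])
    return ET.tostring(urlset, encoding="unicode")

sitemap = build_sitemap([
    {"loc": "http://www.example.com/", "lastmod": "2009-03-21",
     "changefreq": "monthly", "priority": "0.8"},
])
print(sitemap)
```

To save the result with the required XML declaration at the top, write it with `ET.ElementTree(urlset).write("sitemap.xml", encoding="UTF-8", xml_declaration=True)` instead of printing the string.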
Auto-Generating an XML Sitemap
For large sites or for site owners who don't want to type
out all of their page entries manually, there is an easier way. Many Sitemap
generators are available that will "spider" a Web site and then build the
XML file automatically. Two services that are well respected are:
GSiteCrawler is available for free download from
http://gsitecrawler.com/ and is
widely used (for sites operating on Windows servers only).
Google Sitemap Generator is provided by Google.
Officially still in beta, this script requires that the Web server have
Python installed.
For more details, see
https://www.google.com/support/webmasters/bin/answer.py?answer=34634&cbid=110jzemo1voyn&src=cb&lev=answer
Many additional third-party Sitemap generators can be
found at
http://code.google.com/p/sitemap-generators/wiki/SitemapGenerators.
No matter which Sitemap generator is selected, it's
important to set up the tool carefully. Proper settings will keep any pages
that the site owner does not want indexed out of the Sitemap, and
communicate the appropriate information in the optional tags regarding how
frequently the page changes (<changefreq>) and how important the page is
(<priority>).
For an explanation of these tags, see the previous
section, "Creating a Sitemap Manually."
Guidelines for Building an XML Sitemap
Certain guidelines must be followed when building an XML
Sitemap. The list below is not exhaustive; see Google's helpful page at
http://www.google.com/support/webmasters/bin/answer.py?answer=34654&cbid=-591fmsq0cxzj&src=cb&lev=answer
for additional information.
Size restrictions: A Sitemap file cannot have more than
50,000 URLs and cannot be larger than 10MB in size when uncompressed. Very
large Web sites that would exceed these restrictions should divide their
pages into smaller Sitemap files. The smaller files should be linked to from
a single Sitemap that serves as a Sitemap index. Be aware that the maximum
number of Sitemap files that any one Web site can have is 1,000.
Recommendations for submitting either all of the Sitemap files or just the
index file to the search engines are explained in the "Submitting an XML
Sitemap" section.
For more information on multiple Sitemaps, see
http://www.google.com/support/webmasters/bin/answer.py?answer=35654
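The Sitemap index format is defined in the sitemaps.org protocol. The following sketch shows the general shape; the file names sitemap1.xml and sitemap2.xml are hypothetical examples:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://www.yourdomain.com/sitemap1.xml</loc>
    <lastmod>2009-03-21</lastmod>
  </sitemap>
  <sitemap>
    <loc>http://www.yourdomain.com/sitemap2.xml</loc>
  </sitemap>
</sitemapindex>
```

Note that the index lists Sitemap files rather than page URLs, and uses <sitemapindex> in place of <urlset>.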
URL syntax: Each URL must be a fully qualified link,
which means it must spell out the complete Web site address that should be
crawled. In addition, URLs listed in the Sitemap must be consistent in how
they refer to the site path. For example, there should not be URLs starting
with "http://www.yourdomain.com/"
and others beginning with "http://yourdomain.com/".
No image URLs: Direct image URLs should not be
included in a Sitemap, since the search engines index the HTML page that the
image appears on, not the image itself.
No session IDs: If the site's URLs
include session IDs (a type of parameter that a content management system
may append to a URL when delivering a requested page), they should be
stripped out for the XML Sitemap.
Must be readable: The Sitemap must be readable by
the Web server where the Sitemap is located, and may contain only ASCII
characters. If the XML Sitemap contains upper ASCII characters, certain
control codes or special characters (such as * or {}), then the file will
generate an error and not be read successfully.
Submitting an XML Sitemap
Once the Sitemap has been created, it should be saved in
the root directory of the Web site (for example:
http://www.yourdomain.com/sitemap.xml). The next step is to invite the
search engines to come and spider it. Since most Web sites are updated
frequently with new pages and/or changed content, it is advisable to submit
an updated XML Sitemap file regularly, as often as necessary to keep up
with the frequency of updates.
There are three ways to point search engines to an XML
Sitemap: robots.txt, HTTP request, or direct submission to the search
engines. These may be used concurrently, if desired, to increase the chances
that the search engine spiders will find the Sitemap promptly. The
following sections explain each method in detail.
Method 1: Robots.txt
Every Web site should place a text file named
"robots.txt" in its root directory. Search engines begin crawling a Web site
by reading the robots.txt file, because it gives search engines specific
instructions regarding which sections of the site they should disallow, or
not include, in their index. Webmasters can also direct spiders to the
site's XML Sitemap within this file.
A robots.txt file must follow rigid syntax rules in
order to be compliant with the search engine spiders trying to read it. The
command to specify an XML Sitemap is simply Sitemap: followed by the URL,
which would look like this:
Sitemap:
http://www.yourdomain.com/sitemap.xml
Note that the search engine Ask.com requires the
robots.txt method for locating an XML Sitemap file. The other major search
engines (Google, Yahoo! and Microsoft Bing) all support this method in
addition to the ones listed below.
To read the complete instructions for writing robots.txt
files, go to http://www.robotstxt.org.
Method 2: HTTP Request
Another way to submit an XML Sitemap to a search engine
is through an HTTP request. This is a technical solution that uses wget,
curl, or another mechanism, and can be set up as an automated job that
generates and submits an updated Sitemap file on a regular basis.
The HTTP request would be issued in the following formats
for each of the major search engines.
Google:
http://www.google.com/webmasters/tools/ping?sitemap=http://www.yourdomain.com/sitemap.xml
Yahoo!:
http://siteexplorer.search.yahoo.com/webmasters/tools/ping?sitemap=http://www.yourdomain.com/sitemap.xml
Microsoft Bing:
http://www.bing.com/webmaster/webmasters/tools/ping?sitemap=http://www.yourdomain.com/sitemap.xml
The URL following the equal sign (=) in each request
should identify the Sitemap file location. For sites that have multiple
Sitemaps connected with a Sitemap index file (as discussed in the
"Guidelines for Building an XML Sitemap" section), only the Sitemap index
file should be sent in the request.
A successful request returns an HTTP 200 response code,
which indicates only that the Sitemap information was received, not that it
was validated in any way.
For more information, see
http://www.sitemaps.org/protocol.php#submit_ping
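Because the Sitemap URL is passed as a query parameter, it should itself be URL-encoded in the ping request. The sketch below, assuming Python's standard library, builds the Google ping URL; actually issuing the request (for instance with urllib.request.urlopen) requires network access and is omitted here.

```python
# Sketch: construct the HTTP ping request URL for submitting a Sitemap
# to Google. The Sitemap URL is percent-encoded so it travels safely
# as a query-string value.
from urllib.parse import quote

def google_ping_url(sitemap_url):
    return ("http://www.google.com/webmasters/tools/ping?sitemap="
            + quote(sitemap_url, safe=""))

ping = google_ping_url("http://www.yourdomain.com/sitemap.xml")
print(ping)
```

The same pattern applies to the Yahoo! and Bing ping endpoints listed above; only the base URL changes.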
Method 3: Direct Submission
Google, Yahoo! and Bing offer ways for a site owner to
submit an XML Sitemap to them directly. This is a proactive approach that
can be used in addition to putting it in a robots.txt file or an HTTP
request.
Google: Submit a Sitemap to Google using Google Webmaster Tools at
http://www.google.com/webmasters/tools/. Google Webmaster Tools is an
incredibly valuable source of information that can help diagnose potential
problems and provide a glimpse into the way Google views a Web site. The
Google interface shows when Google last downloaded the Sitemap and any
errors that may have occurred. Webmasters can validate their site, and also
view information such as Web crawl errors (including pages that were not
found or timed out), statistics (crawl rate, top search queries, etc.), and
external and internal links. There are also other useful tools like the
robots.txt analyzer.
Yahoo!: Submitting an XML Sitemap feed to Yahoo! simply
requires entering the Sitemap's URL through Yahoo! Site Explorer at
http://siteexplorer.search.yahoo.com/submit. Yahoo! made Sitemaps a
little more confusing by introducing its own version that uses text files
named urllist.txt. Many of the Sitemap generators also build a urllist.txt
file simultaneously with the XML Sitemap feed. However, since Yahoo! also
recognizes the Sitemaps protocol, it's enough to just provide a standard XML
Sitemap and avoid having to update two files.
Bing: Microsoft has launched its own set of webmaster
tools called Bing Webmaster Center at
http://www.bing.com/webmaster. This site is similar to Google's,
but not as robust. It allows webmasters to add an XML Sitemap feed and,
after their site is validated, view information about their site. The
information is currently limited to showing any blocked pages, top links,
and robots.txt validation.
Specialized Sitemaps
Google now offers specialized XML Sitemaps for Video,
Mobile, News, and Code Search (and more may be added in the future).
These specialized XML Sitemaps allow a site owner to tell Google about
particular types of content on their site: news articles, videos, pages
designed for mobile devices, and publicly accessible source code on their
Web site.
In turn, these special content pages may have a better
chance of inclusion in Google�s specialized, vertical search engines.
Building a Video Sitemap
A Video Sitemap is useful for Web sites that contain
videos. Submitting a Video Sitemap encourages Google to spider the site's
video content and hopefully rank it in video search results.
This format adheres to the standard Sitemap protocol, but
includes additional video-specific tags. Once created, a Video Sitemap
should be submitted to Google directly (see instructions under "Method 3:
Direct Submission" above).
For an explanation of the special video tags and syntax
required for a Video Sitemap, see Google's article "Creating a Video
Sitemap" at
http://www.google.com/support/webmasters/bin/answer.py?answer=80472.
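As a rough illustration of the format, a Video Sitemap entry adds a video namespace and a block of video tags inside each <url> element. The sketch below uses hypothetical URLs and titles, and the exact tag names and namespace version should be confirmed against Google's article referenced above:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:video="http://www.google.com/schemas/sitemap-video/1.1">
  <url>
    <loc>http://www.yourdomain.com/videos/balloon-tour.html</loc>
    <video:video>
      <video:content_loc>http://www.yourdomain.com/videos/balloon-tour.flv</video:content_loc>
      <video:title>Hot Air Balloon Tour</video:title>
      <video:description>A narrated tour by hot air balloon.</video:description>
      <video:thumbnail_loc>http://www.yourdomain.com/images/balloon-thumb.jpg</video:thumbnail_loc>
    </video:video>
  </url>
</urlset>
```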
Building a Mobile Sitemap
Web sites that have pages developed specifically for
mobile users can benefit from submitting a Mobile Sitemap to Google. This
type of Sitemap uses the standard Sitemap protocol, but includes a specific
<mobile> tag and an additional namespace requirement.
Google's article "Creating Mobile Sitemaps" at
http://www.google.com/support/webmasters/bin/answer.py?answer=34648
gives all of the particulars and technical details related to building a
Mobile Sitemap.
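For orientation, a Mobile Sitemap entry generally looks like the sketch below, with the mobile namespace declared on <urlset> and an empty mobile tag inside each <url>. The URL shown is a hypothetical example, and the exact namespace and tag syntax should be verified against Google's article referenced above:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:mobile="http://www.google.com/schemas/sitemap-mobile/1.0">
  <url>
    <loc>http://mobile.yourdomain.com/article.html</loc>
    <mobile:mobile/>
  </url>
</urlset>
```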
Building a News Sitemap
By definition, "news" requires immediate distribution in
order to be effective. Any Web site that develops a large amount of
news-type content needs search engines to index fresh content as quickly as
possible. A News Sitemap provides a proactive way for such a site to have
more control over what is submitted to Google News, since it can hasten the
search engine's discovery of new pages.
A News Sitemap communicates the individual URLs of news
articles together with their publication date and keywords. It requires a
second namespace in addition to the schema requirements of the standard
Sitemaps protocol.
To read full details on creating and submitting a News
Sitemap to Google, start with the "News Sitemaps: Overview" article at
http://www.google.com/support/news_pub/bin/answer.py?answer=75717 and
follow the links from there.
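In outline, a News Sitemap entry declares a news namespace and carries the publication date and keywords described above. The sketch below uses a hypothetical article URL and keywords, and the exact tag set should be confirmed against Google's overview article referenced above:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>http://www.yourdomain.com/business/article55.html</loc>
    <news:news>
      <news:publication_date>2009-03-21</news:publication_date>
      <news:keywords>business, mergers, acquisitions</news:keywords>
    </news:news>
  </url>
</urlset>
```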
Building a Code Search Sitemap
The last type of specialized Google Sitemap is called a
Code Search Sitemap. People sometimes search for publicly accessible source
code using Google Code Search at
http://www.google.com/codesearch. Submitting a Sitemap geared
for this vertical engine may be appropriate for sites that want to be found
for that type of content.
Bruce M. Clay is a Founding and Charter Member of The Business Forum. He has
operated as an executive with several high-technology businesses and comes
from a long career as a technical executive with leading Silicon Valley
firms, working since 1996 in the Internet business consulting arena. Bruce
holds a BS in Math and Computer Science and an MBA from Pepperdine
University. He has had many articles published, has been a speaker at over
100 sessions including Search Engine Strategies, WebmasterWorld, ad:Tech,
Search Marketing Expo, and many more, and has been quoted in the Wall Street
Journal, USA Today, PC Week, Wired Magazine, Smart Money, several books, and
many other publications. He has also been featured on many podcasts and
WebmasterRadio shows, and appeared on the NHK one-hour TV special "Google's
Deep Impact". He has personally authored many advanced search engine
optimization tools that are available from the company Web sites. Bruce Clay
is on the Board of Directors of SEMPO (Search Engine Marketing Professionals
Organization). In 1996 he founded Bruce Clay, Inc., which today is a leading
provider of Internet marketing solutions around the world, with offices in
Los Angeles (headquarters), New York, Milan, Tokyo, New Delhi, Sao Paulo and
Sydney.
Visit the author's Web site: http://www.bruceclay.com
Copyright © The Business Forum Institute 1982-2010. All rights reserved.