Making your Web pages appeal to search engines

Ok, so you've got the best Web page in the world, but just how do you go about making it easily accessible to the Web public?

High profile advertising campaigns, press releases and sponsorship deals are great, but they come at a high premium. Search engines and directory sites are still the best way of driving traffic to your site, and are generally free.

You might be wondering just what separates a search engine from a directory. Basically, directories are structured by category (Yahoo being a good example) and, importantly, require you to select a category for your submitted site. Search engine sites, like Google, are less structured and rely instead on the similarity between search terms and the information in their databases. The key to success is knowing how best to exploit these marvellous opportunities, and the best place to start is the construction of your HTML documents.

Many search engines use software agents known as spiders to examine submitted sites. Spiders act like a browser, visiting a page, reviewing its content, following links and recording information they consider relevant. However, in many cases they are unlike modern browsers, and therefore do not understand some tags. Other technologies are deliberately ignored. Let's look at some of the things search engines don't particularly like:

Frames

Many spiders are not frames-capable, so the use of frames can hinder your chances of success. It is essential, if you are using frames, to make full use of the NOFRAMES area of your FRAMESET document. Even if no links are followed, this document will be processed and indexed. Other spiders will recognise frames, but cannot tell which frames are content and which are menus or buffers. If you do not want a particular document, say your navigation frame, to be listed in its own right, you should omit the TITLE and the META keywords and description tags from that document, but do make use of the META robots tag (see below for details on META tag usage). You should also carefully consider whether you want individual content pages listed without their menu.

Dynamic content

By dynamic content I mean pages built on the fly, where URLs are generated in the form of:

http://www.yourdomain.com/page.php?id=foo

There is a risk of spiders getting trapped in an endless loop of dynamically generated URLs, so most will not follow this type of link.
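
To illustrate the caution described above, here is a minimal Python sketch of how a spider might decide which links to skip. The function name looks_dynamic and the rule itself (any URL carrying a query string is skipped) are assumptions made for illustration; real spiders apply their own rules.

```python
from urllib.parse import urlsplit


def looks_dynamic(url: str) -> bool:
    """Return True if the URL carries a query string, e.g. page.php?id=foo."""
    return urlsplit(url).query != ""


# Hypothetical links found on a page.
links = [
    "http://www.yourdomain.com/page.php?id=foo",
    "http://www.yourdomain.com/about.html",
]

# A cautious spider would only queue the links that do not look dynamic.
to_follow = [url for url in links if not looks_dynamic(url)]
```

If your important pages are only reachable through links of the first kind, a spider behaving like this sketch would never reach them.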

Scripts

As with frames, spiders often will not understand client-side scripting such as JavaScript. If your links or content require scripting to function, you should provide an alternative route using NOSCRIPT tags. It is also best to comment out code within a SCRIPT area, to avoid this being read by the spider as page content.

Usage example:

<SCRIPT type="text/javascript">
<!-- HIDE FROM SPIDERS & OLDER BROWSERS

your script

// END HIDING -->
</SCRIPT>
<NOSCRIPT>
alternative content
</NOSCRIPT>

Lack of content

Spiders will seek content, and many will ignore pages that contain little or no content. This is primarily to counter abuse, where unscrupulous Webmasters generate many pages that do little more than lead to their already-submitted pages, some using refresh techniques. If you have a page that genuinely has little textual content (a primarily graphical page, perhaps), make maximum use of image ALT attributes.

Ok, so we've looked at some of the things you shouldn't do. What about those you definitely should?

Robots.txt

Before trawling through your site, many spiders will first look for a file called robots.txt. This file tells spiders (also known as robots) what they can and can't do. There can be only one robots.txt for each Web site, and it should be placed in the document root folder (i.e. http://www.yourdomain.com/robots.txt). Also, be certain to keep the file name all lower-case.

The robots.txt file consists of one or more records of the following format:

User-agent: *
Disallow: /test

The above would tell all spiders that consult the robots.txt file that they should not enter the test directory. All other directories are available. Using / as the Disallow value will restrict the entire site, whilst entering no value makes the whole site available (as will an empty robots.txt file). The * can only be used as a value for the User-agent entry, and refers to all spiders. If you wish to restrict access to more than one directory, give each directory its own Disallow line:

User-agent: *
Disallow: /test
Disallow: /secret
Disallow: /foo

You can have multiple records in a robots.txt file, giving instructions to specific spiders. For example:

User-agent: webcrawler
Disallow: /test

User-agent: lycos
Disallow: /secret
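
One way to check that a robots.txt file says what you intend is to run it through a parser before uploading it. The sketch below uses Python's standard urllib.robotparser module on the two-record example above. The spider name otherbot is made up for illustration, and how any real spider interprets the file is up to that spider.

```python
from urllib.robotparser import RobotFileParser

# The two-record example from the text.
ROBOTS_TXT = """\
User-agent: webcrawler
Disallow: /test

User-agent: lycos
Disallow: /secret
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

site = "http://www.yourdomain.com"

# webcrawler is kept out of /test but may still visit /secret.
webcrawler_test = parser.can_fetch("webcrawler", site + "/test/page.html")
webcrawler_secret = parser.can_fetch("webcrawler", site + "/secret/page.html")

# lycos is kept out of /secret.
lycos_secret = parser.can_fetch("lycos", site + "/secret/page.html")

# A spider with no matching record (and no * record) is not restricted.
other_test = parser.can_fetch("otherbot", site + "/test/page.html")

print(webcrawler_test, webcrawler_secret, lycos_secret, other_test)
```

Note that this parser treats each Disallow value as a path prefix, so /test would also cover a file such as /test.html.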

META tags

There are 3 META tags that should be used in the HEAD section of any document you wish listed by search engines. These are:

<META name="Keywords" content="blah">

The META Keywords content should be a series of comma-separated words and phrases that are relevant to the site content and, more importantly, match the kind of search terms that you would like to return your site (e.g. if you want folks searching for architectural photography to get your site, use that as a key phrase). It is usually the case that the constituents of a key phrase are also listed separately as key words (using the previous example: architectural, photography, architectural photography) so long as no single word is repeated more than 5 times. For us, web, web design, web sites etc. are important, as are slight variations on the theme. So, internet, web and net would be used, as would website, which is not considered a repeat of the word web. Try to keep to around 25 words or phrases. More are OK, but don't go too far! Also, do not abuse the tag by including words or phrases that do not relate to your site. Using terms such as sex, coke or competitors' details may lead to legal action, and could see your site blacklisted by search and directory sites.
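
The rules of thumb above (around 25 phrases, no single word more than 5 times) are easy to check mechanically. The following Python sketch is one possible way to do so. The function name check_keywords and its parameters are invented for this example, and the limits are this article's guidelines rather than anything a particular search engine publishes.

```python
from collections import Counter


def check_keywords(content: str, max_phrases: int = 25, max_repeats: int = 5) -> dict:
    """Summarise a META Keywords value against the article's rules of thumb."""
    # Split the comma-separated value into individual words and phrases.
    phrases = [p.strip().lower() for p in content.split(",") if p.strip()]
    # Count whole words only, so "website" is not counted as a repeat of "web".
    words = Counter(word for phrase in phrases for word in phrase.split())
    overused = sorted(word for word, count in words.items() if count > max_repeats)
    return {
        "phrases": len(phrases),
        "too_many": len(phrases) > max_phrases,
        "overused": overused,
    }


report = check_keywords("architectural, photography, architectural photography")
```

Here report shows three phrases and no over-used words, since architectural and photography each appear only twice.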

<META name="Description" content="blah">

The META Description content should be no more than 20-or-so words that describe the site content. Traditionally used by search engines to display a site description next to the URL / link, the content is also used in conjunction with the keywords and other information to build the site profile. As such, it is best to keep this concise, and include as many of the keywords and phrases as possible.

<META name="robots" content="index, follow">

The META robots tag, like the robots.txt file, tells search engine indexing software what it can and can't do, although it is less explicit and less widely recognised. For example, index tells the software to register that page, whilst follow tells it to follow any local links. noindex and nofollow do the opposite. Any combination of index/noindex and follow/nofollow is allowed.
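
To see roughly what indexing software might pick up from these three tags, the sketch below reads them from a sample HEAD section using Python's standard html.parser module. The class name MetaReader and the sample page are invented for illustration; actual spiders use their own parsers and may behave differently.

```python
from html.parser import HTMLParser


class MetaReader(HTMLParser):
    """Collect the name/content pairs of every META tag in a document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            self.meta[name.lower()] = attributes.get("content") or ""


# Sample page, made up for this example.
PAGE = """<HTML><HEAD>
<TITLE>Architectural photography</TITLE>
<META name="Keywords" content="architectural, photography, architectural photography">
<META name="Description" content="Architectural photography of historic and modern buildings.">
<META name="robots" content="index, follow">
</HEAD><BODY></BODY></HTML>"""

reader = MetaReader()
reader.feed(PAGE)

# Interpret the robots value: anything not forbidden is treated as allowed.
directives = {d.strip().lower() for d in reader.meta.get("robots", "").split(",")}
may_index = "noindex" not in directives
may_follow = "nofollow" not in directives
```

Running the same reader over your own pages is a quick way to spot a document that is missing one of the three tags.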

TITLE

The other important tag, which is perhaps as important as the META Keywords or more so, is TITLE. The TITLE should be more than a standard "Home page" or "My site". Steer clear of the prefixes "The" and "Welcome", make it as descriptive as possible whilst remaining concise, and include as many key words and phrases as possible.

Content

Page content, and the position of that content, also have a major bearing on search engine compliance. In many cases, the first piece of actual text in a Web document will be parsed by a spider. This could be the content of an image's ALT attribute, some un-commented script, or hopefully some well-formatted, concise and descriptive text. If making use of NOFRAMES, avoid the mistake that leads to search engines listing sites with descriptions like "This is a frames only site". Ouch.
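
As a rough way of finding out what the first piece of text in a document is, the sketch below walks a page and records the first non-empty text it meets in the BODY, whether that is an image's ALT value or ordinary text. The class name FirstText and the sample markup are assumptions for illustration only, and a real spider may choose its text differently.

```python
from html.parser import HTMLParser


class FirstText(HTMLParser):
    """Record the first non-empty ALT value or text found inside BODY."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.first = None

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag == "img" and self.in_body and self.first is None:
            alt = dict(attrs).get("alt")
            if alt and alt.strip():
                self.first = alt.strip()

    def handle_data(self, data):
        # Un-commented script content also arrives here as plain text.
        if self.in_body and self.first is None and data.strip():
            self.first = data.strip()


# Sample markup, made up for this example.
finder = FirstText()
finder.feed(
    '<HTML><HEAD><TITLE>Sample</TITLE></HEAD><BODY>'
    '<IMG src="logo.gif" alt="Architectural photography studio">'
    '<P>Photographs of historic and modern buildings.</P>'
    '</BODY></HTML>'
)
first_text = finder.first
```

In this sample the ALT text comes first, so it is the ALT text, not the paragraph, that ends up in first_text.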

Realnames

Many search engines, including Google and Alta Vista, and even browsers including Internet Explorer, use the RealNames Internet keywords database to facilitate searching. RealNames are not free, but can have a noticeable effect on your site's success in search result ranking, and make access to your site easier in any case. They work by associating a name, such as egovision, with a URL, such as http://www.egovision.co.uk/. It should be stated that opinions on the value of RealNames are split. Some have stated that they are a waste of money, whilst others swear by them. In my experience they have performed well and are a useful addition to any site awareness campaign. Try the RealNames Web site for more details.

Directories

Directory sites, such as Yahoo, require you to submit your site to a specific category. The more accurate your choice of category is, the more likely you are to be listed. If your site isn't related to the category you choose, or even if the category simply isn't as close a match as it could be, it probably won't be listed. Take the time required to get this right!

Submission regularity

One final piece of advice, this time regarding search engines themselves. The vast majority of search or directory sites have strict rules relating to the regularity of resubmission. Breaking those rules could lead to your site being blacklisted, removed from the index and ignored when later submitted.

And finally

Once your site is prepared and ready, it's time to alert the world to its existence. There are three basic methods available to you. The first would be to go to each search engine in turn and manually submit your site (Search Engine Watch contains lists of search sites, and other useful information besides). The advantage of this method is that you are able to see exactly what is submitted to a site, rather than clicking a button and hoping for the best.

The second option would be to use an online multiple submission service, such as Add me. This will save some time, and should reap rewards, but doesn't provide you with the comfort of knowing exactly what information is submitted.

Option three is to purchase a multiple submission application and let it do the work. The Exploit submission wizard is one I have had success with in the past. The advantages and disadvantages are much the same as the previous method, although the number of sites submitted to is generally significantly greater. If you are submitting to hundreds or thousands of sites, you might want to consider not providing your email address, as you are likely to be deluged with a combination of spam, newsletters and thank yous.

Now, go forth and multiply (your site traffic).



John Lyons BA(Hons). egovision limited.

   