How do you stop the major search engines from indexing web pages on your site that you don't want made available to the public?
One easy method is to create a robots.txt file that resides on your web server in the root directory.
It is actually quite easy to do, and all you need is a text editor.
The basic syntax of a robots.txt file looks like the following:
User-Agent: [Spider Name]
Disallow: [File Name]
For example, Google's spider is named googlebot. So if you didn't want googlebot to index your thankyou.html file, your robots.txt file would look like this:
User-Agent: googlebot
Disallow: /thankyou.html
If you want to prevent all robots from spidering the file named thankyou.html, you can use "*", the wildcard character, in the User-Agent line. It would be written like this:
User-Agent: *
Disallow: /thankyou.html
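To check rules like these before uploading them, one option is Python's standard urllib.robotparser module. The following is a minimal sketch, not a definitive test of how any real spider behaves. The host www.example.com, the spider name "otherbot", and the helper may_fetch are placeholders introduced here for illustration.

```python
from urllib.robotparser import RobotFileParser

# The two example files from the text, held as strings.
GOOGLEBOT_ONLY = """\
User-Agent: googlebot
Disallow: /thankyou.html
"""

ALL_ROBOTS = """\
User-Agent: *
Disallow: /thankyou.html
"""


def may_fetch(robots_txt, spider, url):
    """Return True if `spider` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(spider, url)


# Placeholder URL (assumption): substitute a page on your own site.
PAGE = "http://www.example.com/thankyou.html"

if __name__ == "__main__":
    for label, rules in (("googlebot only", GOOGLEBOT_ONLY),
                         ("all robots", ALL_ROBOTS)):
        for spider in ("googlebot", "otherbot"):
            print(label, spider, may_fetch(rules, spider, PAGE))
```

With the first file only googlebot is turned away from thankyou.html; with the second, every spider is.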
You may also specify directories:
Disallow: /cgi-bin/
This one bans googlebot from all files on the server:
User-Agent: googlebot
Disallow: /
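Unfortunately, the original robots.txt standard does not let you use the wildcard character for a file in the "Disallow" statement. Some spiders, including googlebot, accept "*" there as an extension, but you should not count on every robot honoring it.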
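A "Disallow" value is matched against the start of the URL path, so a shorter prefix can often cover a family of files without any wildcard. The sketch below illustrates this with Python's standard urllib.robotparser, which follows that prefix-matching behavior. The host www.example.com, the helper blocked_urls, and every file name other than thankyou.html are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# "/thankyou" has no extension, so it acts as a prefix for several files.
RULES = """\
User-Agent: *
Disallow: /cgi-bin/
Disallow: /thankyou
"""

# Placeholder host and paths (assumptions, not from the article).
BASE = "http://www.example.com"
PATHS = [
    "/cgi-bin/order.pl",
    "/thankyou.html",
    "/thankyou-2.html",
    "/index.html",
]


def blocked_urls(robots_txt, spider, paths):
    """Return the paths from `paths` that `spider` is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [p for p in paths if not parser.can_fetch(spider, BASE + p)]


if __name__ == "__main__":
    for path in blocked_urls(RULES, "googlebot", PATHS):
        print("blocked:", path)
```

The same prefix rule is why "Disallow: /" covers the whole site: every path begins with "/".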
The robots.txt file is also useful if you are creating multiple web pages, each meant to be indexed by a particular search engine (e.g. Google or Lycos). You could be penalized if one searchbot indexes all the pages.
These multiple web pages tend to be similar, and the major search engines have the ability to detect when a site is doing this.
The searchbot might label your web site as spam, and you could be permanently banned from that search engine.
By using a robots.txt file, you can tell googlebot to avoid indexing a web page that you created especially for Lycos.com.
When you put one such record for each spider in the robots.txt file, you instruct each search engine not to spider the files meant for the other search engines.
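As a rough illustration of that idea, the sketch below generates one record per spider, each disallowing the pages written for the other engines. Everything specific in it is an assumption: the page names /for-google.html and /for-lycos.html, the spider name "lycos-spider" (look up the real name each engine publishes), and the helper build_robots_txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical mapping: spider name -> the page written for that engine.
ENGINE_PAGES = {
    "googlebot": "/for-google.html",
    "lycos-spider": "/for-lycos.html",
}


def build_robots_txt(engine_pages):
    """Build robots.txt text that hides each engine's page from the others."""
    records = []
    for spider in engine_pages:
        lines = [f"User-Agent: {spider}"]
        for other, page in engine_pages.items():
            if other != spider:
                lines.append(f"Disallow: {page}")
        records.append("\n".join(lines))
    # Records are separated by a blank line.
    return "\n\n".join(records) + "\n"


def may_fetch(robots_txt, spider, path):
    """Check one path against the generated rules (placeholder host)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(spider, "http://www.example.com" + path)


if __name__ == "__main__":
    print(build_robots_txt(ENGINE_PAGES))
```

Generating the file from a single mapping keeps the records consistent when you add a page for another engine.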
For more information on robots.txt files and more complicated examples, I suggest going to:
http://www.searchengineworld.com/robots/robots_tutorial.htm