Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a robot applies some heuristic to the selection and order of documents to visit, and spaces out its requests over a long period of time, it is still a robot.
Normal Web browsers are not robots, because they are operated by a human and don't automatically retrieve referenced documents (other than inline images).
Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are a bit misleading, as they give the impression that the software itself moves between sites like a virus; this is not the case: a robot simply visits sites by requesting documents from them.
So no, robots aren't inherently bad, nor inherently brilliant; they just need careful attention.
Its coverage of HTTP, HTML, and Web libraries is a bit too thin to be a "how to write a web robot" book, but it provides useful background reading and a good overview of the state-of-the-art, especially if you haven't got the time to find all the info yourself on the Web.
Published by New Riders, ISBN 1-56205-463-5.
Williams' book 'Bots and Other Internet Beasties' was quite disappointing. It claims to be a 'how to' book on writing robots, but my impression is that it is nothing more than a collection of chapters, written by various people involved in this area and subsequently bound together.
Published by Sams.net, ISBN 1-57521-016-9.
While this is hosted at one of the major robots' sites, it is an unbiased and reasonably comprehensive collection of information which is maintained by Martijn Koster <m.koster@webcrawler.com>.
Of course the latest version of this FAQ is there.
You'll also find details and an archive of the robots mailing list, which is intended for technical discussions about robots.
Most indexing services also allow you to submit URLs manually, which will then be queued and visited by the robot.
Sometimes other sources of URLs are used, such as scanning USENET postings, published mailing list archives, etc.
Given those starting points a robot can select URLs to visit and index, and to parse and use as a source for new URLs.
We hope that as the Web evolves, more facilities will become available to efficiently associate meta data such as indexing information with a document. This is being worked on...
Fortunately you don't have to submit your URL to every service by hand: Submit-it <URL: http://www.submit-it.com/> will do it for you.
If your server supports User-agent logging, you can check for retrievals with unusual User-agent header values.
Finally, if you notice a site repeatedly checking for the file '/robots.txt' chances are that is a robot too.
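The log check above can be automated. Below is a minimal sketch in Python that flags hosts requesting '/robots.txt'; it assumes your server writes the Common Log Format, and the sample log lines are made up for illustration.

```python
# Sketch: spot likely robots in an access log by looking for '/robots.txt'
# fetches. Assumes Common Log Format; the sample lines are hypothetical.
import re

LOG_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*"')

def robot_candidates(log_lines):
    """Return hosts that requested /robots.txt (a strong robot hint)."""
    hosts = set()
    for line in log_lines:
        m = LOG_RE.match(line)
        if m and m.group(2) == "GET" and m.group(3) == "/robots.txt":
            hosts.add(m.group(1))
    return hosts

sample = [
    'spider.example.com - - [01/Mar/1997:21:27:32 -0500] "GET /robots.txt HTTP/1.0" 404 -',
    'dialup7.example.net - - [01/Mar/1997:21:30:02 -0500] "GET /index.html HTTP/1.0" 200 5120',
]
print(robot_candidates(sample))  # {'spider.example.com'}
```

Checking the resulting hosts against the list of active robots is then a quick manual step.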
If you think you have discovered a new robot (i.e. one that is not listed on the list of active robots) and it does more than sporadic visits, drop me a line so I can make a note of it for future reference. But please don't tell me about every robot that happens to drop by!
First of all, check if it is a problem by checking the load of your server, monitoring your server's error log, and concurrent connections if you can. If you have a medium or high performance server, it is quite likely to be able to cope with a high load of even several requests per second, especially if the visits are quick.
However you may have problems if you have a low performance site, such as your own desktop PC or Mac you're working on, or you run low performance server software, or if you have many long retrievals (such as CGI scripts or large documents). These problems manifest themselves in refused connections, a high load, performance slowdowns, or in extreme cases a system crash.
If this happens, there are a few things you should do. Most importantly, start logging information: when did you notice, what happened, what do your logs say, what are you doing in response, etc.; this helps when investigating the problem later. Secondly, try to find out where the robot came from, what IP addresses or DNS domains, and see if they are mentioned in the list of active robots. If you can identify a site this way, you can email the person responsible, and ask them what's up. If this doesn't help, try their own site for telephone numbers, or mail postmaster at their domain.
If the robot is not on the list, mail me with all the information you have collected, including actions on your part. If I can't help, at least I can make a note of it for others.
If you don't care about robots and want to prevent the messages in your error logs, simply create an empty file called robots.txt in the root level of your server.
Don't put any HTML or English language "Who the hell are you?" text in it -- it will probably never get read by anyone :-)
User-agent: *
Disallow: /

but it's easy to be more selective than that.
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism

User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs

The first two lines, starting with '#', specify a comment.
The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go anywhere.
The second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/' disallowed. Because all relative URL's on a server start with '/', this means the entire site is closed off.
The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs. Note the '*' is a special token; it's not a regular expression.
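The record above can be checked mechanically. Here is a sketch using Python's standard robots.txt parser, urllib.robotparser; the agent name 'SomeBot' and the URLs are illustrative.

```python
# Sketch: verify the example record with Python's urllib.robotparser.
from urllib.robotparser import RobotFileParser

rules = """\
# /robots.txt file for http://webcrawler.com/
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("webcrawler", "http://webcrawler.com/tmp/x"))  # True: nothing disallowed
print(rp.can_fetch("lycra", "http://webcrawler.com/index.html"))  # False: whole site closed
print(rp.can_fetch("SomeBot", "http://webcrawler.com/tmp/x"))     # False: /tmp disallowed
print(rp.can_fetch("SomeBot", "http://webcrawler.com/docs/a"))    # True
```

The same parser can be pointed at a live site with set_url() and read() instead of parse().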
Two common errors:
The basic idea is that if you include a tag like:

<META NAME="ROBOTS" CONTENT="NOINDEX">

in your HTML document, that document won't be indexed.
If you do:

<META NAME="ROBOTS" CONTENT="NOFOLLOW">

the links in that document will not be parsed by the robot.
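A robot (or a curious webmaster) can extract such tags with the standard html.parser module. This is a minimal sketch; the sample page is made up.

```python
# Sketch: pull ROBOTS meta directives out of a page with html.parser.
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # attribute names arrive lowercased
        if tag == "meta" and a.get("name", "").upper() == "ROBOTS":
            self.directives += [t.strip().upper()
                                for t in a.get("content", "").split(",")]

page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX"></head><body>hi</body></html>'
finder = RobotsMetaFinder()
finder.feed(page)
print(finder.directives)  # ['NOINDEX']
```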
In the meantime, two indexing robots that you should be able to get hold of are Harvest (free), and Verity's.
Alternatively, check out the libwww-perl5 package, which has a simple example.
Introduction
This article is by no means an attempt to explain how search engines work in general (that is the know-how of their makers). However, in my opinion, it will help you understand how to control the behavior of search robots (wanderers, spiders, robots: the programs a search engine uses to crawl the Web and index the documents it encounters) and how to structure your server and the documents on it so that your server is indexed easily and well.
The first thing that prompted me to write this article was an occasion when I was examining the access log of my server and found the following two lines:
lycosidae.lycos.com - - [01/Mar/1997:21:27:32 -0500] "GET /robots.txt HTTP/1.0" 404 -
lycosidae.lycos.com - - [01/Mar/1997:21:27:39 -0500] "GET / HTTP/1.0" 200 3270
That is, Lycos contacted my server, learned from its first request that there was no /robots.txt file, sniffed the first page, and left. Naturally, I didn't like that, so I started finding out what was what.
It turns out that all "smart" search engines first request this file, which should be present on every server. The file describes the access rights of search robots, and it is possible to give different robots different rights. There is a standard for it, called the Standard for Robot Exclusion.
According to Louis Monier (AltaVista), only 5% of all sites currently have non-empty /robots.txt files, if those files exist there at all. This is confirmed by data collected in a recent study of the logs of the Lycos robot. Charles P. Kollar (Lycos) writes that only 6% of all requests for /robots.txt get a result code of 200. Here are a few reasons why this happens:
The /robots.txt file is meant to tell all search robots (spiders) to index the server as defined in the file, i.e. only those directories and files of the server that are NOT described in /robots.txt. The file should contain zero or more records, each tied to one robot or another (as determined by the value of the agent_id field) and specifying, for each robot or for all of them at once, exactly what they must NOT index. Whoever writes /robots.txt must give the Product Token substring of the User-Agent field that each robot sends in its HTTP request to the server being indexed. For example, the current Lycos robot sends the following as its User-Agent field:
Lycos_Spider_(Rex)/1.0 libwww/3.1
If the Lycos robot finds no description of itself in /robots.txt, it acts as it sees fit. As soon as the Lycos robot "sees" a description for itself in /robots.txt, it does what that description prescribes.
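As a small illustration, the product token that a /robots.txt author matches on can be peeled off a User-Agent value like the one above. This is a sketch of the idea, not the actual matching logic of any particular robot.

```python
# Sketch: the product token is the part of the first product in the
# User-Agent value, before its "/version" suffix.
def product_token(user_agent):
    """Return the leading product token, e.g. 'Lycos_Spider_(Rex)'."""
    first_product = user_agent.split()[0]  # drop trailing products like libwww/3.1
    return first_product.split("/")[0]     # drop the version

print(product_token("Lycos_Spider_(Rex)/1.0 libwww/3.1"))  # Lycos_Spider_(Rex)
```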
When creating /robots.txt you should take one more factor into account: file size. Since every file that must not be indexed is listed, often separately for many types of robots, a large number of excluded files makes /robots.txt too big. In that case you should apply one or more of the following ways to reduce the size of /robots.txt:
Records of the /robots.txt file
General description of the record format.
[ # comment string NL ]*
User-Agent: [ [ WS ]+ agent_id ]+ [ [ WS ]* # comment string ]? NL
[ # comment string NL ]*
Disallow: [ [ WS ]+ path_root ]* [ [ WS ]* # comment string ]? NL
[
# comment string NL
|
Disallow: [ [ WS ]+ path_root ]* [ [ WS ]* # comment string ]? NL
]*
[ NL ]+
Description of the parameters used in /robots.txt records
[...]+ Square brackets followed by a + sign mean that one or more terms must be given as parameters.
For example, one or more agent_id values, separated by spaces, may follow "User-Agent:".
[...]* Square brackets followed by a * sign mean that zero or more terms may be given as parameters.
For example, you may write comments or not.
[...]? Square brackets followed by a ? sign mean that zero or one term may be given as a parameter.
For example, a comment may follow "User-Agent: agent_id".
..|.. means either what comes before the bar or what comes after it.
WS is one of the characters space (040) or tab (011).
NL is one of the characters line feed (012) or carriage return (015), or both of them together (Enter).
User-Agent: a keyword (case does not matter).
Its parameters are the agent_id values of search robots.
Disallow: a keyword (case does not matter).
Its parameters are the full paths of files or directories that are not to be indexed.
# marks the start of a comment line; comment string is the body of the comment itself.
agent_id is any number of characters, not including WS or NL, identifying the agent_id of a particular search robot. The * character stands for all robots at once.
path_root is any number of characters, not including WS or NL, identifying files and directories that are not to be indexed.
Further notes on the format.
Each record begins with a User-Agent line, which states which search robot or robots the record is meant for. The next line is Disallow, which lists the paths and files that are not to be indexed. EVERY record MUST have at least these two lines; all other lines are optional.

A record may contain any number of comment lines. Every comment line must begin with the # character. Comments may also be placed at the end of User-Agent and Disallow lines; a # at the end of such lines is sometimes added to tell the robot that a long agent_id or path_root string has ended.

If several agent_id values are given on a User-Agent line, the path_root condition in the Disallow line applies to all of them equally. There is no limit on the length of User-Agent and Disallow lines. If a search robot does not find its agent_id in /robots.txt, it ignores /robots.txt.
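The grammar above can be boiled down to a short parser. This is a minimal sketch following this article's dialect (several agent_ids per User-Agent line, several path_roots per Disallow line); it is not a full or authoritative implementation.

```python
# Minimal sketch of a parser for the record grammar described above:
# comments start at '#', a record is one or more User-Agent lines
# followed by Disallow lines, and field names are case-insensitive.
def parse_robots_txt(text):
    records, agents, paths = [], [], []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # strip comments and WS
        if not line:
            continue
        field, _, value = line.partition(":")
        field = field.strip().lower()
        if field == "user-agent":
            if paths:                         # previous record finished
                records.append((agents, paths))
                agents, paths = [], []
            agents += value.split()
        elif field == "disallow":
            paths += value.split() or [""]    # empty Disallow = allow all
    if agents:
        records.append((agents, paths))
    return records

sample = """User-Agent: Copernicus Fred
Disallow:

User-Agent: Lycos
Disallow: /cgi-bin/ /tmp/
"""
print(parse_robots_txt(sample))
# [(['Copernicus', 'Fred'], ['']), (['Lycos'], ['/cgi-bin/', '/tmp/'])]
```

Note that the modern standard uses one value per line; a parser for real-world files should follow that convention instead.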
If you do not need to account for the specifics of each individual robot, you can state exclusions for all robots at once. This is done with the line
User-Agent: *
If a search robot finds several records in /robots.txt with agent_id values matching it, the robot is free to pick any one of them.
Each search robot determines the absolute URLs it may read from the server using the records of /robots.txt. Upper- and lowercase characters in path_root DO matter.
Example 1:
User-Agent: *
Disallow: /
User-Agent: Lycos
Disallow: /cgi-bin/ /tmp/
In example 1 the /robots.txt file contains two records. The first applies to all search robots and forbids indexing any files. The second applies to the Lycos robot: when it indexes the server, the directories /cgi-bin/ and /tmp/ are forbidden and everything else is allowed. Thus the server will be indexed only by Lycos.
Example 2:
User-Agent: Copernicus Fred
Disallow:
User-Agent: * Rex
Disallow: /t
In example 2 the /robots.txt file contains two records. The first allows the robots Copernicus and Fred to index the whole server. The second forbids all robots, and in particular the robot Rex, from indexing directories and files such as /tmp/, /tea-time/, /top-cat.txt, /traverse.this, and so on. This is precisely the case of specifying a prefix mask for directories and files.
Example 3:
# This is for every spider!
User-Agent: *
# stay away from this
Disallow: /spiders/not/here/ #and everything in it
Disallow: # a little nothing
Disallow: #This could be habit forming!
# Don't comments make code much more readable!!!
Example 3 contains a single record. Here all robots are forbidden from indexing the directory /spiders/not/here/, including such paths and files as /spiders/not/here/really/ and /spiders/not/here/yes/even/me.html. However, this does not cover /spiders/not/ or /spiders/not/her (in the directory '/spiders/not/').
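The matching described above is plain character-by-character prefix matching. A minimal sketch:

```python
# Sketch of the prefix matching described above: a path is excluded if
# it begins, character for character, with a disallowed path_root.
def excluded(path, path_roots):
    return any(path.startswith(root) for root in path_roots)

roots = ["/spiders/not/here/"]
print(excluded("/spiders/not/here/really/", roots))           # True
print(excluded("/spiders/not/here/yes/even/me.html", roots))  # True
print(excluded("/spiders/not/her", roots))                    # False
```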
Some problems connected with search robots.
Incompleteness of the standard (Standard for Robot Exclusion).
Unfortunately, since search engines appeared only recently, the robot standard is still being developed, refined, and so on. This means that search engines will not necessarily be guided by it in the future.
Increased traffic.
This problem is not very pressing for the Russian sector of the Internet, since there are not that many servers in Russia with traffic so heavy that a search robot's visits would get in the way of ordinary users. In fact, the /robots.txt file exists precisely to restrict the actions of robots.
Not all search robots use /robots.txt.
As of today, this file is requested without fail only by the robots of such systems as AltaVista, Excite, Infoseek, Lycos, OpenText, and WebCrawler.
Use of HTML meta tags.
An initial draft, produced by agreement among the programmers of a number of commercial indexing organizations (Excite, Infoseek, Lycos, OpenText, and WebCrawler) at the recent Distributed Indexing Workshop (W3C), is given below.
That meeting discussed the use of HTML meta tags to control the behavior of search robots, but no final agreement was reached. The following issues were identified for future discussion:
This tag is intended for users who cannot control the /robots.txt file on their web sites. It lets you define a search robot's behavior for each individual HTML page, though it cannot completely prevent the robot from fetching the page (as can be specified in /robots.txt).
<META NAME="ROBOTS" CONTENT="robot_terms">
robot_terms is a comma-separated list of the following keywords (upper- or lowercase makes no difference): ALL, NONE, INDEX, NOINDEX, FOLLOW, NOFOLLOW.
NONE tells all robots to ignore this page when indexing (equivalent to using the keywords NOINDEX and NOFOLLOW together).
ALL allows this page and all links from it to be indexed (equivalent to using the keywords INDEX and FOLLOW together).
INDEX allows this page to be indexed.
NOINDEX forbids indexing of this page.
FOLLOW allows all links from this page to be indexed.
NOFOLLOW forbids indexing of links from this page.
If this meta tag is omitted, or no robot_terms are given, the search robot behaves by default as if robot_terms=INDEX, FOLLOW (i.e. ALL) had been specified. If the keyword ALL appears in CONTENT, the robot acts accordingly and ignores any other keywords that may be given. If CONTENT contains keywords that contradict each other, e.g. FOLLOW, NOFOLLOW, the robot resolves them at its own discretion (in this case it chooses FOLLOW).
If robot_terms contains only NOINDEX, the page itself is not indexed, but its links may still be followed. If robot_terms contains only NOFOLLOW, the page is indexed, but its links are ignored.
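The defaults and the conflict rule above can be written down as a small function. This is a sketch of this article's reading of the draft (including the FOLLOW-wins rule for contradictions), not a firm standard.

```python
# Sketch of the robot_terms semantics described above; the FOLLOW-wins
# conflict rule is this article's reading, not settled behavior.
def interpret_robot_terms(content=""):
    terms = {t.strip().upper() for t in content.split(",") if t.strip()}
    if "ALL" in terms:                   # ALL overrides everything else
        return {"index": True, "follow": True}
    if "NONE" in terms:                  # NONE == NOINDEX + NOFOLLOW
        return {"index": False, "follow": False}
    index = "NOINDEX" not in terms       # default: INDEX
    follow = "NOFOLLOW" not in terms     # default: FOLLOW
    if {"FOLLOW", "NOFOLLOW"} <= terms:  # contradictory: take FOLLOW
        follow = True
    return {"index": index, "follow": follow}

print(interpret_robot_terms(""))                  # {'index': True, 'follow': True}
print(interpret_robot_terms("NOINDEX"))           # {'index': False, 'follow': True}
print(interpret_robot_terms("FOLLOW, NOFOLLOW"))  # {'index': True, 'follow': True}
```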
<META NAME="KEYWORDS" CONTENT="phrases">
phrases is a comma-separated list of words or phrases (upper- or lowercase makes no difference) that help index the page, i.e. that reflect its content. Roughly speaking, these are the words in response to which the search engine will return this document.
<META NAME="DESCRIPTION" CONTENT="text">
text is the text that will be shown in the summary the search engine returns for a user's query. It must not contain markup tags, and the most sensible thing is to fit the gist of the document into two or three lines of it.
Proposed ways of avoiding repeated visits with the help of HTML meta tags
Some commercial search robots already use meta tags that provide a "link" between the robot and the webmaster. AltaVista uses the KEYWORDS meta tag, and Infoseek uses the KEYWORDS and DESCRIPTION meta tags.
Index a document once, or do it regularly?
A webmaster can "tell" a search robot, or a user's bookmark file, that the contents of a given file will change. In that case the robot will not keep the URL, and the user's browser will or will not put the file in its bookmarks. As long as this information is described only in the /robots.txt file, the user will not know that the page is going to change.
The DOCUMENT-STATE meta tag can be useful for this. By default this meta tag is taken to have CONTENT=STATIC.
<META NAME="DOCUMENT-STATE" CONTENT="STATIC">
<META NAME="DOCUMENT-STATE" CONTENT="DYNAMIC">
How can one exclude the indexing of generated pages, or the duplication of documents when there are server mirrors?
Generated pages are pages produced by the action of CGI scripts. They almost certainly should not be indexed, because trying to drop into them from a search engine produces an error. As for mirrors, it is no good when two different links to different servers, but with one and the same content, are returned. To avoid this, use the URL meta tag giving the absolute URL of the document (in the case of mirrors, pointing to the corresponding page of the main server).
<META NAME="URL" CONTENT="absolute_url">
Martijn Koster, translated by A. Alikberov
This document was drawn up on June 30, 1994, from the discussions in the mailing list robots-request@nexor.co.uk (the list has since moved to WebCrawler; for details see the Robots pages at WebCrawler, info.webcrawler.com/mak/projects/robots/) among most producers of search robots and other interested people. The topic is also open for discussion on the Technical World Wide Web mailing list, www-talk@info.cern.ch. This document is based on a previous working draft of the same name.
This document is not an official standard, nor anyone's corporate standard, and it does not guarantee that all current and future search robots will follow it. In keeping with it, most robot makers offer a way to protect Web servers from unwanted visits by their search robots.
The latest version of this document can be found at info.webcrawler.com/mak/projects/robots/robots.html
Search robots (wanderers, spiders) are programs that index web pages on the Internet.
In 1993 and 1994 it became clear that robots sometimes index servers against the wishes of the servers' owners. In particular, the work of robots sometimes makes it harder for ordinary users to work with the server, and sometimes the same files are indexed several times. In other cases robots index the wrong things, for example very "deep" virtual directories, temporary information, or CGI scripts. This standard is meant to solve such problems.
To keep a robot away from a server or from parts of it, you need to create on the server a file containing information for controlling the search robot's behavior. This file must be accessible over HTTP at the local URL /robots.txt. Its contents are described below.
This solution was chosen so that a search robot can find the rules describing what is required of it with nothing more than a simple request for a single file. Besides, a /robots.txt file is easy to create on any existing Web server.
The choice of this particular URL was motivated by several criteria:
The format and semantics of the /robots.txt file are as follows:
The file must contain one or more records, separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record must contain lines of the form:
"<field>:<optional_space><value><optional_space>".
The <field> field is case-insensitive.
Comments may be included in the file in the usual UNIX fashion: the # character marks the start of a comment, and the end of the line marks its end.
A record must begin with one or more User-Agent lines, followed by one or more Disallow lines, in the format given below. Unrecognized lines are ignored.
User-Agent
Disallow
Every record must consist of at least one User-Agent line and one Disallow line.
If the /robots.txt file is empty, does not conform to the given format and semantics, or does not exist, a search robot will act according to its own algorithm.
Example 1:
# robots.txt for http://www.site.com

User-Agent: *
Disallow: /cyberworld/map/ # this is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
In example 1, the contents of the directories /cyberworld/map/ and /tmp/ are closed to indexing.
Example 2:
# robots.txt for http://www.site.com

User-Agent: *
Disallow: /cyberworld/map/ # this is an infinite virtual URL space

# Cybermapper knows where to go
User-Agent: cybermapper
Disallow:
In example 2, the contents of the directory /cyberworld/map/ are closed to indexing, but the robot cybermapper is allowed everything.
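Example 2 can be verified with Python's standard urllib.robotparser; the agent name 'OtherBot' and the URLs are illustrative.

```python
# Sketch: checking example 2 with Python's standard robots.txt parser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse("""\
User-Agent: *
Disallow: /cyberworld/map/

User-Agent: cybermapper
Disallow:
""".splitlines())

print(rp.can_fetch("cybermapper", "http://www.site.com/cyberworld/map/x"))  # True
print(rp.can_fetch("OtherBot", "http://www.site.com/cyberworld/map/x"))     # False
print(rp.can_fetch("OtherBot", "http://www.site.com/docs/"))                # True
```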
Example 3:
# robots.txt for http://www.site.com

User-Agent: *
Disallow: /
In example 3, every search robot is forbidden from indexing the server.
The standard has since changed somewhat; for example, several robot names, separated by spaces or tabs, may now be written on a User-Agent line.
Martijn Koster, m.koster@webcrawler.com
Translation: Andrey Alikberov, info@citmgu.ru
Last-modified: Thu, 28 May 1998 14:14:26 GMT