Chrysophylax was his name, for he was of ancient and imperial lineage, and very rich. He was cunning, inquisitive, greedy, well-armoured, but not over bold. And he was mortally hungry.
— J. R. R. Tolkien, Farmer Giles of Ham
Chrysophylax-Search is an experimental web robot (experimental in the sense that it isn't expected to work perfectly). It was written by Edmund Horner in 2002, and follows numerous earlier attempts in the same vein.
This page is maintained in the spirit of the "Share results" section of Martijn Koster's Guidelines for Robot Writers. The robot itself is also meant to be a good citizen: it traverses the web slowly, keeps full results, obeys robots.txt, and provides contact information in the HTTP headers. In addition, its operator has paid close attention to it, has fixed numerous bugs, and has observed proper operation for thousands of pages since the last bug fix.
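The good-citizen behaviours listed above can be sketched in a few lines. This is only an illustration in Python, not the robot's own PHP code: the robot name, the contact address, the use of the `From` header to carry contact information, and the 60-second delay are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robot name and contact address, for illustration only.
USER_AGENT = "example-robot/0.1"
CONTACT = "operator@example.org"


def request_headers():
    """Headers that identify the robot and tell site owners whom to contact."""
    return {"User-Agent": USER_AGENT, "From": CONTACT}


def may_fetch(robots_txt, url):
    """Return True if the given robots.txt text lets this robot fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(USER_AGENT, url)


def seconds_to_wait(last_fetch, now, min_delay=60.0):
    """Seconds to pause before contacting the same host again.

    min_delay is an assumed value; the page only says the robot is slow.
    """
    return max(0.0, min_delay - (now - last_fetch))
```

A crawler loop would call `may_fetch` before every request, send `request_headers()` with it, and sleep for `seconds_to_wait(...)` between requests to the same host.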
The robot is written in PHP and uses a MySQL database. It is typically run from the command line, since the script's output is plain text without even a Content-Type header. A variety of scripts for querying the database are also in development.
Download: search-refresh.php, search-create.sql. These scripts have only been tested on the author's system, and in all likelihood will not work on yours without alteration.
(This snapshot was taken 6 March 2003.) The main area of focus at the moment is the distribution of server software in various domains, particularly in New Zealand. This table shows the number of sites running the major web server software in chosen domains. Progress has become slow in the .nz domains, and I believe that Chrysophylax-Search now knows about the majority of New Zealand web servers.
| New Zealand | Generic Top-Level Domains | All Domains | ||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| .ac.nz | .co.nz | .cri.nz | .gen.nz | .govt.nz | .iwi.nz | .mil.nz | .net.nz | .org.nz | .school.nz | Total | .biz | .com | .edu | .gov | .info | .int | .mil | .net | .org | |||
| Apache | 1 (Unix) | 194 | 894 | 1 | 13 | 73 | 1 | 98 | 189 | 7 | 1470 | 8 | 1371 | 83 | 175 | 30 | 3 | 5 | 337 | 463 | 5978 | |
| 1 (Win32) | 2 | 17 | 1 | 20 | 9 | 1 | 3 | 2 | 58 | |||||||||||||
| 2 (Unix) | 3 | 2 | 5 | 20 | 3 | 4 | 1 | 7 | 25 | 97 | ||||||||||||
| 2 (Win32) | 1 | 1 | 2 | 1 | 8 | 1 | 1 | 15 | ||||||||||||||
| Total | 237 | 1058 | 1 | 15 | 94 | 1 | 102 | 229 | 7 | 1744 | 10 | 1969 | 105 | 204 | 32 | 5 | 5 | 382 | 537 | 7383 | ||
| Microsoft | IIS/3 | 1 | 1 | 1 | 3 | |||||||||||||||||
| IIS/4 | 64 | 263 | 1 | 2 | 37 | 1 | 10 | 53 | 431 | 72 | 10 | 72 | 6 | 2 | 14 | 1141 | ||||||
| IIS/5 | 71 | 615 | 4 | 1 | 126 | 1 | 3 | 18 | 82 | 1 | 922 | 3 | 378 | 23 | 132 | 2 | 8 | 37 | 58 | 2620 | ||
| IIS/6 | 5 | 5 | ||||||||||||||||||||
| Total | 136 | 879 | 5 | 3 | 163 | 1 | 4 | 28 | 135 | 1 | 1355 | 3 | 455 | 34 | 204 | 2 | 14 | 39 | 72 | 3771 | ||
| Netscape | Enterprise/3 | 1 | 48 | 4 | 2 | 5 | 60 | 32 | 2 | 41 | 10 | 1 | 6 | 216 | ||||||||
| Enterprise/4 | 16 | 1 | 17 | 44 | 3 | 69 | 1 | 7 | 1 | 1 | 169 | |||||||||||
| Enterprise/6 | 4 | 4 | 11 | 2 | 14 | 5 | 1 | 43 | ||||||||||||||
| Total | 1 | 68 | 4 | 2 | 6 | 81 | 88 | 9 | 127 | 1 | 22 | 2 | 8 | 436 | ||||||||
| Others | 15 | 187 | 2 | 26 | 13 | 23 | 6 | 272 | 186 | 3 | 41 | 5 | 2 | 23 | 29 | 899 | ||||||
| All Servers | 389 | 2192 | 8 | 18 | 287 | 4 | 4 | 143 | 393 | 14 | 3452 | 13 | 2698 | 151 | 576 | 39 | 6 | 43 | 446 | 646 | 12489 | |
The author is actively working on scripts to summarise the database. He also intends to make some raw data available. (As the search engine's behaviour is relatively customisable, and it has mostly been used for indexing a non-public site, some data cannot be published without first considering privacy.)
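A summary like the table above amounts to assigning each site a row (server family and version) and a column (domain suffix), then counting. The sketch below shows one way this could be done in Python; it is not the author's summary script. The function names, the regular expressions, and the assumption that the classification is based on the HTTP `Server` header are all illustrative.

```python
import re
from collections import Counter

# Column suffixes as they appear in the table.
SUFFIXES = [
    ".ac.nz", ".co.nz", ".cri.nz", ".gen.nz", ".govt.nz", ".iwi.nz",
    ".mil.nz", ".net.nz", ".org.nz", ".school.nz",
    ".biz", ".com", ".edu", ".gov", ".info", ".int", ".mil", ".net", ".org",
]


def domain_column(host):
    """Return the table column for a host name, or None if it has no column."""
    host = host.lower().rstrip(".")
    for suffix in SUFFIXES:
        if host.endswith(suffix):
            return suffix
    return None


def server_row(header):
    """Map a Server header value to a (family, version) row label."""
    # The lookahead keeps names such as "Apache-Coyote" out of the Apache rows.
    m = re.match(r"Apache(?![\w-])(?:/(\d+)[\d.]*)?(?: \((Unix|Win32)\))?", header)
    if m:
        major, platform = m.groups()
        if major and platform:
            return ("Apache", f"{major} ({platform})")
        return ("Apache", major)  # version or platform not disclosed
    m = re.match(r"Microsoft-IIS/(\d+)", header)
    if m:
        return ("Microsoft", "IIS/" + m.group(1))
    m = re.match(r"Netscape-Enterprise/(\d+)", header)
    if m:
        return ("Netscape", "Enterprise/" + m.group(1))
    return ("Others", None)


def tally(sites):
    """Count sites per (row, column) from an iterable of (host, header) pairs."""
    counts = Counter()
    for host, header in sites:
        counts[(server_row(header), domain_column(host))] += 1
    return counts
```

Sites whose host matches none of the listed suffixes are counted under a `None` column, so they can still contribute to an "All Domains" total.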
The TODO list for Chrysophylax-Search is roughly as follows: