For several months I used search engines to look for scientific information. I tried different search engines and found that the volume of information they gather may differ by several orders of magnitude.
Unfortunately, most documents about search engines that I saw on the Internet were either lists of hyperlinks without any comments, or discussions of how many documents are in the database of ... (here is the name of a search engine) and what method was used to count them. There was not a word about efficiency, i.e. how many documents on a given subject I can find using this search engine, especially in comparison with the others.
As I couldn't find such a comparison, I decided to create my own and publish it in the hope that it will be useful to others.
After writing the first version of this document, I found a paper on the Lycos search engine in the 1/1996 issue of PCWorld. The authors used a method similar to mine to evaluate the ability of several search engines to find information on WWW servers, but... forgot to indicate what kind of information they were looking for (I remember only a beer recipe in one of their examples). I understand that beer is an important detail of our life, but evidently it is too far from science.
Remark: There is an interesting link that points to four other WWW pages which compare different search engines. These documents mainly concern the technical details of some search engines.
The following are the results of experiments performed on May 14, 1996, and on August 3.
Eleven search engines were queried for eight terms (used as keywords) taken from physics and chemistry, and the number of documents found was recorded. I think that the number of documents is the primary parameter when you're looking for scientific information. My estimates show that the maximum number of documents that can be found on the Net is less than 10% of the number that can be found using a good scientific database like INSPEC or CAS. Thus, the bigger the database of a search engine, the higher the chances that you won't miss something important.
The results are shown in the table below. The numbers in parentheses correspond to the previous experiment (May 14).
                S E A R C H   E N G I N E S
Keyword         AltaVista     Inktomi  NlightN  OpenText    Lycos       Excite      Infoseek  Magellan    WebCrawler  Galaxy  Yahoo
crystallography 31186(31827)  3562     3472     3927(1718)  3945(4301)  24975(83)   1464(*)   2771(2768)  659(632)    20(20)  22(19)
catalysis       21841(22373)  1963     1086     2351(831)   2802(3178)  18061(37)   550(*)    2083(2082)  307(283)    5(7)    17(10)
benzene         24764(22591)  1169     845      1404(525)   2191(2569)  17304(11)   374(*)    1005(1005)  178(166)    11(11)  0(0)
luminescence    7597(8038)    662      452      689(318)    1553(1886)  7231(15)    206(*)    558(557)    111(106)    3(3)    3(3)
ferroelectric   5622(6075)    419      362      409(171)    869(1072)   4362(30)    166(*)    388(388)    77(75)      3(3)    3(3)
Ising           4872(5164)    3281     226      612(361)    744(784)    6505(12)    144(*)    427(427)    57(66)      0(28)   0(0)
EXAFS           2677(3118)    158      115      228(99)     397(428)    2167(24)    64(*)     293(293)    41(37)      1(1)    0(0)
ferromagnetism  892(889)      320      42       119(49)     225(254)    1328(4)     26(26)    92(92)      19(17)      0(0)    0(0)
The unexpectedly high number of occurrences of the keyword "Ising" for Inktomi is probably due to this combination of letters being found inside words like "advertising". The asterisks for Infoseek are there because a few months ago Infoseek limited the maximum number of reported occurrences to 50, so its results for the previous experiment are unknown.
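The suspected "Ising" effect can be illustrated with a minimal Python sketch. It only shows the difference between substring matching and whole-word matching; it is not Inktomi's actual algorithm, which is unknown to me, and the sample texts and function names are invented for the illustration.

```python
import re

def substring_hits(keyword, texts):
    """Count texts containing the keyword anywhere, even inside a longer word."""
    needle = keyword.lower()
    return sum(1 for text in texts if needle in text.lower())

def whole_word_hits(keyword, texts):
    """Count texts containing the keyword as a separate word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return sum(1 for text in texts if pattern.search(text))

# Hypothetical sample documents, invented for this illustration.
sample_texts = [
    "Monte Carlo simulation of the Ising model",
    "Advertising rates for our magazine",
    "Rising costs of raising capital",
    "Phase transitions in the 2D Ising lattice",
]

print("substring matches: ", substring_hits("Ising", sample_texts))
print("whole-word matches:", whole_word_hits("Ising", sample_texts))
```

With substring matching, every text that merely contains the letters "ising" is counted, which would inflate the result in the way the table suggests.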
The change in the number of documents returned over time allows some conclusions about the dynamics of the search engines.
For the period under test (since May 14) the databases of two search engines, Inktomi and NlightN, have not been updated at all. The Inktomi database was last updated in December 1995, and that of NlightN 6 months ago. The number of documents returned by AltaVista and Lycos decreased slightly (AltaVista has stopped its robot). Magellan stays at the same level (they created a new search mode in May, and the number of documents then increased fantastically, for some keywords by a factor of 10!). The number of documents returned by WebCrawler, and especially by OpenText, increased noticeably, but the most exciting result is for Excite: it now returns a number of documents comparable to that of AltaVista!
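The changes described above can be checked with a short Python sketch that divides the August 3 count by the May 14 count for each keyword. The numbers are copied from the results table for three of the engines; the variable and function names are my own.

```python
# Counts copied from the results table, in the table's keyword order:
# crystallography, catalysis, benzene, luminescence,
# ferroelectric, Ising, EXAFS, ferromagnetism.
AUGUST = {
    "AltaVista": [31186, 21841, 24764, 7597, 5622, 4872, 2677, 892],
    "OpenText":  [3927, 2351, 1404, 689, 409, 612, 228, 119],
    "Excite":    [24975, 18061, 17304, 7231, 4362, 6505, 2167, 1328],
}
MAY = {
    "AltaVista": [31827, 22373, 22591, 8038, 6075, 5164, 3118, 889],
    "OpenText":  [1718, 831, 525, 318, 171, 361, 99, 49],
    "Excite":    [83, 37, 11, 15, 30, 12, 24, 4],
}

def growth_factors(engine):
    """August count divided by May count, one value per keyword."""
    return [aug / may for aug, may in zip(AUGUST[engine], MAY[engine])]

for engine in AUGUST:
    factors = growth_factors(engine)
    print(f"{engine:<10} min x{min(factors):.2f}   max x{max(factors):.2f}")
```

A factor near 1 means the database hardly changed, while a large factor points to a substantial update between the two experiments.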
The results shown in the table are very clear. The most efficient search engine is evidently Alta Vista.
To estimate the completeness of the information found by the other search engines, I calculated an index for each search engine: the ratio of the number of documents found by that search engine to the number found by AltaVista, averaged over all keywords under test.
The indexes are presented in the following table. The numbers in parentheses are those estimated in the previous experiments (on May 14 and on February 17):
Note. Please keep in mind that search engines with a low index are not necessarily bad; they simply aren't good for finding scientific information.
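The index defined above can be sketched in Python as follows. The counts are the August 3 values copied from the results table; the names are my own, and the printed values may differ slightly from the published indexes, since I am assuming a plain arithmetic mean with no special treatment of zero counts.

```python
from statistics import mean

KEYWORDS = ["crystallography", "catalysis", "benzene", "luminescence",
            "ferroelectric", "Ising", "EXAFS", "ferromagnetism"]

# August 3 counts copied from the results table, in KEYWORDS order.
COUNTS = {
    "AltaVista":  [31186, 21841, 24764, 7597, 5622, 4872, 2677, 892],
    "Inktomi":    [3562, 1963, 1169, 662, 419, 3281, 158, 320],
    "NlightN":    [3472, 1086, 845, 452, 362, 226, 115, 42],
    "OpenText":   [3927, 2351, 1404, 689, 409, 612, 228, 119],
    "Lycos":      [3945, 2802, 2191, 1553, 869, 744, 397, 225],
    "Excite":     [24975, 18061, 17304, 7231, 4362, 6505, 2167, 1328],
    "Infoseek":   [1464, 550, 374, 206, 166, 144, 64, 26],
    "Magellan":   [2771, 2083, 1005, 558, 388, 427, 293, 92],
    "WebCrawler": [659, 307, 178, 111, 77, 57, 41, 19],
    "Galaxy":     [20, 5, 11, 3, 3, 0, 1, 0],
    "Yahoo":      [22, 17, 0, 3, 3, 0, 0, 0],
}

def search_engine_index(engine, reference="AltaVista"):
    """Mean over all keywords of (engine count / reference count)."""
    ratios = [found / ref
              for found, ref in zip(COUNTS[engine], COUNTS[reference])]
    return mean(ratios)

indexes = {engine: search_engine_index(engine) for engine in COUNTS}
ranking = sorted(indexes, key=indexes.get, reverse=True)

for engine in ranking:
    print(f"{engine:<10} {indexes[engine]:.3f}")
```

By construction the index of AltaVista itself is 1, and an engine that returns more documents than AltaVista for some keywords (as Excite does for "Ising" and "ferromagnetism") can contribute ratios above 1.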
After publishing this document I received a few e-mails pointing me to MetaCrawler as a tool that lets you start a search on several search engines simultaneously and obtain a more complete list of hyperlinks. I have tested several such multi-threaded engines and found that MetaCrawler is indeed the most powerful of them for finding scientific information. To obtain the maximum number of documents you have to choose the comprehensive search mode on the beta page. In this mode MetaCrawler sends your request to AltaVista, OpenText, Infoseek, Excite and WebCrawler. You can look for a single word, a group of words joined by the AND or OR boolean operators, or a phrase.
However, the number of references I found in my test for the keyword "EXAFS" (62) was smaller than that found using the conventional technique! After analyzing the output I realized that the origin of this problem is the limited number of references each search engine returns in its default mode (which doesn't depend on the maximum number of documents you have set as an option for MetaCrawler). Look at these numbers:
AltaVista: found 2656 occurrences for the keyword "EXAFS", returned 10 of them
OpenText: found 99, returned 10
Infoseek: found 50 or more, returned 10
WebCrawler: found 37, returned 25
Excite: found 24, returned 10
After removing 3 duplicate references from the 65 returned references we get 62, the result of my test.
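The arithmetic behind this result can be written out as a small Python sketch. The numbers are those listed above; the variable names are my own, and the Infoseek figure of 50 is only a lower bound ("50 or more").

```python
# Numbers for the keyword "EXAFS" as listed above.
found = {
    "AltaVista": 2656,
    "OpenText": 99,
    "Infoseek": 50,      # reported as "50 or more", so a lower bound
    "WebCrawler": 37,
    "Excite": 24,
}
returned = {
    "AltaVista": 10,
    "OpenText": 10,
    "Infoseek": 10,
    "WebCrawler": 25,
    "Excite": 10,
}
duplicates = 3  # references returned by more than one engine

total_returned = sum(returned.values())
unique_references = total_returned - duplicates

# Fraction of each engine's matches that actually reached the merged list.
coverage = {engine: returned[engine] / found[engine] for engine in found}

print("returned in total:", total_returned)
print("unique references:", unique_references)
for engine, share in coverage.items():
    print(f"{engine:<10} {share:.1%} of found documents returned")
```

The merged list can never be longer than the sum of what each engine returns by default, no matter how many documents each engine has actually found.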
This result brings me to the conclusion that using a multi-threaded search engine to find scientific information is not the best idea. Search engines are now very different and use different search algorithms, so it is very difficult to tune optimally each search engine used in a multi-threaded search. I suspect that MetaCrawler doesn't use the Lycos database because Lycos cannot search for a phrase.
Before you start your search, you should be aware of the completeness and the kind of information you can find on the Net. The following is the result of an analysis of the contents of 176 URLs found by the Lycos and Infoseek search engines for the keyword "EXAFS". The documents can be classified as follows:
I believe the situation will improve soon because many editors of scientific journals are starting to publish the contents of their issues, and sometimes full papers, on the Internet. The low frequency of database updates and the imperfect algorithms for locating and adding new URLs to these databases are now becoming a problem. In my analysis of the contents of URLs I saw many links to interesting scientific documents that were not indexed by most search engines.
You may ask why I continue to search for information on the Net instead of using INSPEC, CAS and other professional databases. The answer is very simple: on the Net I can find a lot of interesting supplementary information: on authors, their works and research projects, on the foundations supporting these works and so on. You can't find this information in professional databases.
Alta Vista is a creature of Digital Equipment Corporation.
In "Advanced Query" you can search for documents containing different
words joined by the AND, OR, NOT and NEAR operators. There is a limit of 200
on the number of returned references. This search engine has a few interesting
options: for example, you can find all documents that link to your site
or page. The result of the search is a list of hyperlinks with a short abstract
for each document, so it's quite easy to estimate which of the documents found
may be useful to you. The database of this search engine is now updated after
a long period of beta testing. The latest news: Scooter, the robot of
AltaVista, is now temporarily stopped.
Excite Netsearch
This search engine uses a so-called fuzzy AND strategy, which combines
both the AND and OR options when searching for a group of words. It does not
support "pure" boolean options. The authors expect to add
more options in the future. The result of a search is a list of references
with a short summary for each document. The database is updated once a week,
but can lose data.
After updating their database, Excite moved to second place among the
most powerful search engines for finding scientific information, but in my
opinion it is not as fast as it needs to be (network problems?).
Lycos, Inc., the search engine developed at Carnegie Mellon University.
It claims to be the biggest on the Net. I'm not sure of it.
Remark: If your browser crashes after connecting to the Lycos home
page (indicated as a hyperlink in this document), don't blame me. They
have a lot of bugs in their starting page as well as in other pages (75 bugs
in their starting page according to the weblint 1.016 checker program). I wrote
a letter about it to the webmaster of Lycos.
The Lycos search engine lets you look for a single word or for words joined by
the boolean operators AND and OR, and also has a NOT option (invoked with a minus
sign), but cannot search for a phrase. It returns a list of hyperlinks to
the documents found with a short abstract for each document (the first lines
taken from the document), thus simplifying the analysis of results. The database
(catalog) of this search engine is updated weekly, but can lose information.
Inktomi Web Services is a search engine at the University of California at
Berkeley. This search engine enables you to use OR, AND and NOT boolean logic
with words. The search algorithm deletes common endings (such as -ing or -ed)
from the search words, thus expanding the search area. As a result you
get a list of hyperlinks without abstracts. The lack of abstracts makes
working with this list too difficult. The drawbacks of Inktomi are: (i) its
database is not updated often (the last time was in December 1995!); (ii)
it contains mostly documents coming from the USA. The Inktomi robot
started to collect new information on the Net at the beginning of April,
but we can't see the updated database yet. Instead, they have created a new,
very promising commercial search engine, HotBot. Try it!
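The idea of deleting common endings can be illustrated with a minimal Python sketch. It is only a guess at the general technique, not Inktomi's actual algorithm: the list of endings contains just the two named above, and the minimum stem length is my own assumption.

```python
# The two endings named in the text; a real engine presumably knows more.
COMMON_ENDINGS = ("ing", "ed")

def strip_ending(word, min_stem=3):
    """Remove one common ending, keeping at least min_stem characters.

    min_stem is an assumption of this sketch, not a documented setting.
    """
    lowered = word.lower()
    for ending in COMMON_ENDINGS:
        if lowered.endswith(ending) and len(lowered) - len(ending) >= min_stem:
            return lowered[: -len(ending)]
    return lowered

def matches(query_word, text):
    """True if any word of the text shares a stem with the query word."""
    stem = strip_ending(query_word)
    return any(strip_ending(word) == stem for word in text.split())

# Hypothetical document titles, invented for this illustration.
print(matches("scattered", "X-ray scattering experiments"))
print(matches("scattered", "benzene rings"))
```

Because both the query and the document words are reduced to the same stem, a search for "scattered" also finds documents that mention "scattering", which is how the search area is expanded.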
Open Text Index, the search engine of the Open Text Corporation.
This search engine offers a variety of search strategies: from a simple
search for a single word, a group of words (using the AND, OR boolean operators)
or an exact phrase of any length, to the power search with the operators AND, OR,
BUT NOT, NEAR, FOLLOWED BY. As with AltaVista and Lycos, the result of
the search is a list of hyperlinks with a short abstract for each document
found. It is unknown how often the database is updated. It looks like
the OpenText database can lose data.
Magellan, The McKinley Group's Internet Guide.
First of all, Magellan (like Yahoo) is a directory of Internet
sites. Magellan's search engine enables you to search the Net. To get a
number of documents different from 0 you should go directly to Magellan's
web-spider option. You can use the OR, AND, NOT, NEAR operators in your
query, search for phrases and use wildcards (* and ?). You'll
get a list of hyperlinks with a short abstract for each document found.
Recently Magellan created a new search mode (Search Magellan), and the
efficiency of this search engine has greatly improved.
NlightN
This search engine enables you to look for a single word, a group of words or
a phrase. You can use the AND, OR, NOT boolean operators and parentheses for
complex searches. The search engine returns a list of hyperlinks without
abstracts. Moreover, if you haven't registered, you'll see no more than 10-20
references. To see all of them you have to register first. NlightN is a
commercial site, but provides Internet search free of
charge. The Internet index for this search engine is presently
provided by Lycos.
NlightN has a few hundred public domain and proprietary databases containing
a huge collection of documents, the number of which is comparable with that
found by AltaVista. But using them is not free.
Infoseek Net Search
Infoseek can look for a word, a group of words or a phrase. When performing
the search, it takes into account different word forms (such as singular and
plural forms). Infoseek returns a list of hyperlinks with a short abstract
for each document found. Infoseek no longer limits the number of references
found (as it did a few months ago), and you can see all of them.
If you want to look for information in public and commercial databases,
you should use Infoseek Professional, which is not free. The database is
updated every 1-2 weeks.
WebCrawler is a search engine operated by America Online, Inc.
The database of this search engine is not large. You can search for a single
word, a group of words joined by the AND, OR, NOT operators, or a phrase.
The result of the search is a list of hyperlinks. The output of WebCrawler
was recently changed, and now you can choose whether you want to see an abstract
for each document. The maintainers of the database strive to update it
every day.
Galaxy
The number of documents returned by this search engine is slightly above that
from Yahoo, i.e. less than 0.1% of the number of documents returned by
AltaVista. I don't think it's worth describing all the details of the search
strategy for this search engine.
Yahoo
This search engine is not good for finding scientific information. Instead,
it is the best place if you're looking for general information on the Internet.
Yahoo can serve as a hierarchically organized directory of Web
documents.
In addition to the search engines listed above, I tested several more. Among them:
The Electric Library
The Electric Library is part of the activities of the Americans for Smarter Kids
Fund, created by Infonautics Corp. The search engine of the Electric Library
can help you find review articles on the subject you're looking for in many
popular (not necessarily scientific) publications.