Best search engines for finding scientific information on the Web

Alexander Lebedev, Moscow State University

Version: August 9, 1996

The latest version can be found at: http://scon155.phys.msu.su/~swan/comparison.html

The main purpose of this document is to inform you about the most effective search engines for finding scientific information on the Net.

History

For several months I used search engines to look for scientific information. I tried different search engines and found that the volume of information they gather may differ by several orders of magnitude.

Unfortunately, most documents about search engines that I saw on the Internet were either lists of hyperlinks without any comments, or discussions of how many documents are in the database of ... (here is the name of a search engine) and what method was used to count them. There was not a word about efficiency, i.e. how many documents on a given subject I can find using this search engine, especially in comparison with the others.

As I couldn't find such a comparison, I decided to create my own and publish it in the hope that it will be useful to others.

After writing the first version of this document, I found an article on the Lycos search engine in the 1/1996 issue of PCWorld. The authors used a method similar to mine to evaluate the ability of several search engines to find information on WWW servers, but... forgot to indicate what kind of information they were looking for (I remember only a beer recipe in one of their examples). I understand that beer is an important detail of our life, but evidently it is too far from science.

Remark: There is an interesting link that points to four other WWW pages which compare different search engines. These documents mainly concern the technical details of some search engines.

Experiment

The following are the results of experiments performed on May 14, 1996, and on August 3, 1996.

Eleven different search engines were queried for eight different terms (used as keywords) taken from physics and chemistry. The number of documents found was recorded. I think that the number of documents is the primary parameter when you're looking for scientific information. My estimates show that the maximum number of documents which can be found on the Net is less than 10% of the number that can be found using a good scientific database like INSPEC or CAS. Thus, the bigger the database of a search engine, the higher the chances that you won't miss something important.

The results are shown in the table below. The numbers outside parentheses are from the latest experiment (August 3); the numbers in parentheses are from the previous experiment (May 14).

Keyword          AltaVista     Inktomi  NlightN  OpenText    Lycos       Excite     Infoseek  Magellan    WebCrawler  Galaxy  Yahoo
crystallography  31186(31827)  3562     3472     3927(1718)  3945(4301)  24975(83)  1464(*)   2771(2768)  659(632)    20(20)  22(19)
catalysis        21841(22373)  1963     1086     2351(831)   2802(3178)  18061(37)  550(*)    2083(2082)  307(283)    5(7)    17(10)
benzene          24764(22591)  1169     845      1404(525)   2191(2569)  17304(11)  374(*)    1005(1005)  178(166)    11(11)  0(0)
luminescence     7597(8038)    662      452      689(318)    1553(1886)  7231(15)   206(*)    558(557)    111(106)    3(3)    3(3)
ferroelectric    5622(6075)    419      362      409(171)    869(1072)   4362(30)   166(*)    388(388)    77(75)      3(3)    3(3)
Ising            4872(5164)    3281     226      612(361)    744(784)    6505(12)   144(*)    427(427)    57(66)      0(28)   0(0)
EXAFS            2677(3118)    158      115      228(99)     397(428)    2167(24)   64(*)     293(293)    41(37)      1(1)    0(0)
ferromagnetism   892(889)      320      42       119(49)     225(254)    1328(4)    26(26)    92(92)      19(17)      0(0)    0(0)

The unexpectedly high number of occurrences of the keyword "Ising" for Inktomi is probably due to this combination of letters being found in words like "advertising". The asterisks for Infoseek appear because a few months ago Infoseek limited the maximum number of reported occurrences to 50, so the corresponding results of the previous experiment are unknown.

The change in the number of documents returned by the search engines allows some conclusions to be drawn about their dynamics.

For the period under test (since May 14) the databases of two search engines, Inktomi and NlightN, have not been updated at all. The database of Inktomi was last updated in December 1995, and that of NlightN 6 months ago. The number of documents returned by AltaVista and Lycos slightly decreased (AltaVista has stopped its robot). Magellan stays at the same level (they created a new search mode in May, and the number of documents then increased fantastically, for some keywords by a factor of 10!). The number of documents returned by WebCrawler and especially by OpenText increased noticeably, but the most exciting result is for Excite: it now returns a number of documents comparable to that of AltaVista!
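The suspected effect can be illustrated with a short Python sketch. It is only an illustration of substring matching versus whole-word matching; how Inktomi actually matched keywords is not known, and the function names and sample titles below are hypothetical.

```python
import re

def substring_match(term, text):
    """True if the letters of `term` occur anywhere in `text`, even inside a longer word."""
    return term.lower() in text.lower()

def whole_word_match(term, text):
    """True only if `term` occurs in `text` as a separate word."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

# Hypothetical document titles, used only for illustration.
titles = [
    "The Ising model of ferromagnetism",
    "Rates for advertising on this site",
]

for title in titles:
    print(title, substring_match("Ising", title), whole_word_match("Ising", title))
```

With substring matching both titles count as hits for "Ising"; with whole-word matching only the first one does.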

Results and discussion

The results shown in the table are very clear. The most efficient search engine is evidently AltaVista.

To estimate the completeness of information found by the other search engines, I calculated an index for each search engine. It was taken as the ratio of the number of documents found by a particular search engine to that found by AltaVista, averaged over all keywords under test.

The indexes are presented in the following table. The numbers in parentheses are those estimated in the previous experiments (on May 14 and on February 17):

Note. Please keep in mind that search engines with a low index are not necessarily bad; they simply aren't good for searching for scientific information.
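The index defined above can be recomputed from the keyword table with a few lines of Python. This is only a sketch: it uses the August counts from the results table, and the names ENGINES, COUNTS and engine_index are chosen here for illustration. Values computed this way may differ from the author's own because of rounding or a different choice of data.

```python
# August counts from the results table, in the same column order as the table.
ENGINES = ["AltaVista", "Inktomi", "NlightN", "OpenText", "Lycos", "Excite",
           "Infoseek", "Magellan", "WebCrawler", "Galaxy", "Yahoo"]
COUNTS = {
    "crystallography": [31186, 3562, 3472, 3927, 3945, 24975, 1464, 2771, 659, 20, 22],
    "catalysis":       [21841, 1963, 1086, 2351, 2802, 18061, 550, 2083, 307, 5, 17],
    "benzene":         [24764, 1169, 845, 1404, 2191, 17304, 374, 1005, 178, 11, 0],
    "luminescence":    [7597, 662, 452, 689, 1553, 7231, 206, 558, 111, 3, 3],
    "ferroelectric":   [5622, 419, 362, 409, 869, 4362, 166, 388, 77, 3, 3],
    "Ising":           [4872, 3281, 226, 612, 744, 6505, 144, 427, 57, 0, 0],
    "EXAFS":           [2677, 158, 115, 228, 397, 2167, 64, 293, 41, 1, 0],
    "ferromagnetism":  [892, 320, 42, 119, 225, 1328, 26, 92, 19, 0, 0],
}

def engine_index(counts, engines, reference="AltaVista"):
    """Average, over all keywords, of (engine count) / (reference engine count)."""
    ref = engines.index(reference)
    index = {}
    for col, name in enumerate(engines):
        ratios = [row[col] / row[ref] for row in counts.values()]
        index[name] = sum(ratios) / len(ratios)
    return index

# Print the engines from the highest index to the lowest.
for name, value in sorted(engine_index(COUNTS, ENGINES).items(),
                          key=lambda item: item[1], reverse=True):
    print(f"{name:<11} {value:.4f}")
```

By construction the index of AltaVista is 1. An engine can exceed 1 for an individual keyword (Excite does so for "Ising" and "ferromagnetism") and still have an averaged index below 1.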

Multi-threaded search engines

After publishing this document I received a few e-mails pointing me to MetaCrawler as a tool that makes it possible to run a search on several search engines simultaneously and obtain a more complete list of hyperlinks. I have tested several such multi-threaded engines and found that MetaCrawler is indeed the most powerful of them for finding scientific information. To obtain the maximum number of documents you have to choose the comprehensive search mode on the beta page. In this mode MetaCrawler will send your request to AltaVista, OpenText, Infoseek, Excite and WebCrawler. You can look for a single word, a group of words joined by the AND or OR boolean operators, or a phrase.

However, the number of references I found in my test for the keyword "EXAFS" (62) was smaller than that found using the conventional technique! After analyzing the output I realized that the origin of this problem is the limited number of references returned by each search engine in its default mode (which doesn't depend on the maximum number of documents you have indicated as an option for MetaCrawler). Look at these numbers:

AltaVista: found 2656 occurrences for the keyword "EXAFS", returned 10 of them
OpenText: found 99, returned 10
Infoseek: found 50 or more, returned 10
WebCrawler: found 37, returned 25
Excite: found 24, returned 10

After removing 3 coinciding references from the 65 returned references we get 62, the result of my test.
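The arithmetic behind this result can be written out as a small Python sketch. The default page sizes are an assumption inferred from the "returned" numbers quoted above, and Infoseek's "50 or more" is taken as 50; the variable names are illustrative only.

```python
# Documents found by each engine for "EXAFS", as quoted in the text.
found = {"AltaVista": 2656, "OpenText": 99, "Infoseek": 50,
         "WebCrawler": 37, "Excite": 24}

# Assumed default number of references per result page (inferred, not documented).
default_page_size = {"AltaVista": 10, "OpenText": 10, "Infoseek": 10,
                     "WebCrawler": 25, "Excite": 10}

DUPLICATES = 3  # coinciding references reported in the text

# Each engine hands back at most one default page of results.
returned = {name: min(found[name], default_page_size[name]) for name in found}

total_returned = sum(returned.values())
unique_references = total_returned - DUPLICATES

print("returned by all engines:", total_returned)
print("after removing duplicates:", unique_references)
print("share of AltaVista's own hits:", unique_references / found["AltaVista"])
```

The merged list is bounded by the sum of the default page sizes, not by the size of the underlying databases, which is why it falls so far short of what AltaVista alone reports.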

This result brings me to the conclusion that using a multi-threaded search engine for finding scientific information is not the best idea. Search engines are very different now and use different search algorithms, and therefore it is very difficult to tune optimally each search engine used in a multi-threaded search. I suspect that MetaCrawler doesn't use the Lycos database because Lycos cannot search for a phrase.

Search engines and professional databases

Before you start your search, you should be aware of the completeness and the kind of information you can find on the Net. The following is the result of an analysis of the contents of 176 URLs found by the Lycos and Infoseek search engines for the keyword "EXAFS". The documents can be classified as follows:

One can see that scientific publications make up about 10-20% of the total number of documents found by search engines. At the same time, a search for the "EXAFS" keyword in the INSPEC database found 493 publications in 1995 and 245 publications in 1994. So, my optimistic estimate for the number of scientific publications which can be found on the Net is 5-10% of their full number. That is why I think that the number of documents is the primary parameter when you're looking for scientific information on the Net: the bigger the database, the higher the chances that you won't miss something interesting.

I believe the situation will improve soon because many editors of scientific journals are starting to publish the contents of their issues, and sometimes the full papers, on the Internet. The low frequency of updating of search engine databases and the imperfect algorithms for locating and adding new URLs to these databases are now becoming a problem. In my analysis of the contents of URLs I saw many links to interesting scientific documents that were not indexed by most search engines.

You may ask why I continue to search for information on the Net instead of using INSPEC, CAS and other professional databases. The answer is very simple: on the Net I can find a lot of interesting supplementary information: on authors, their works and research projects, on the foundations supporting these works and so on. You can't find this information in professional databases.

Information about different search engines

AltaVista is a creation of Digital Equipment Corporation.
In "Advanced Query" you can search for documents containing different words joined by the AND, OR, NOT and NEAR boolean operators. There is a limit of 200 on the number of returned references. This search engine has a few interesting options: for example, you can find all documents that link to your site or page. The result of the search is a list of hyperlinks with a short abstract for each document, so it's quite easy to estimate which of the documents found may be useful to you. The database of this search engine is now updated after a long period of beta testing. The latest news: Scooter, the robot of AltaVista, is now temporarily stopped.

Excite Netsearch
This search engine uses a so-called fuzzy AND strategy, which combines both AND and OR options when searching for a group of words. It does not support "pure" boolean options. The authors expect to add more options in the future. The result of a search is a list of references with a short summary for each document. The database is updated once a week, but can lose data.
After updating its database, Excite moved to second place among the most powerful search engines for finding scientific information, but in my opinion it is not as fast as it needs to be (network problems?).

Lycos, Inc., the search engine developed at Carnegie Mellon University. It claims to be the biggest on the Net. I'm not sure of that.
Remark: If your browser crashes after connecting to the Lycos home page (given as a hyperlink in this document), don't blame me. They have a lot of bugs in their starting page as well as in other pages (75 bugs in the starting page according to the weblint 1.016 checker program). I wrote a letter about it to the Lycos webmaster.
The Lycos search engine lets you look for a single word or for words joined by the boolean operators AND and OR, with a NOT option (indicated by a minus sign), but it cannot search for a phrase. It returns a list of hyperlinks to the documents found with a short abstract for each document (the first lines taken from the document), thus simplifying the analysis of results. The database (catalog) of this search engine is updated weekly, but can lose information.

Inktomi Web Services is a search engine at the University of California at Berkeley.
This search engine enables you to use OR, AND and NOT boolean logic with words. The search algorithm deletes common endings (such as -ing or -ed) from the searched words, thus enabling you to expand the search area. As a result you get a list of hyperlinks without abstracts. The lack of abstracts makes working with this list too difficult. The drawbacks of Inktomi are: (i) its database is not updated often (the last time was in December 1995!); (ii) it contains mostly documents coming from the USA. The Inktomi robot started to collect new information on the Net at the beginning of April, but we can't see the updated database yet. Instead, they have created a new, very promising commercial search engine, HotBot. Try it!
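The deletion of common endings can be pictured with a naive Python sketch. This is not Inktomi's actual algorithm, which is not described here; the suffix list, the minimum stem length and the function name strip_ending are assumptions made only for illustration.

```python
# Endings mentioned in the text; a real engine would presumably use a longer list.
SUFFIXES = ("ing", "ed")

def strip_ending(word, min_stem=3):
    """Remove one common ending so that different word forms share a single key."""
    lower = word.lower()
    for suffix in SUFFIXES:
        # Keep very short words intact so that, e.g., "ring" is not cut down to "r".
        if lower.endswith(suffix) and len(lower) - len(suffix) >= min_stem:
            return lower[:-len(suffix)]
    return lower

for word in ["scattering", "scattered", "scatter", "ring"]:
    print(word, "->", strip_ending(word))
```

A query for "scattering" would then also match documents containing "scattered" or "scatter", which is how such stripping expands the search area.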

Open Text Index, the search engine of the Open Text Corporation.
This search engine offers you a variety of search strategies: from a simple search for a single word, a group of words (using the AND, OR boolean operators) or an exact phrase of any length, to the power search with the operators AND, OR, BUT NOT, NEAR, FOLLOWED BY. As with AltaVista and Lycos, the result of the search is a list of hyperlinks with a short abstract for each document found. It is unknown how often the database is updated. It looks like the database of OpenText can lose data.

Magellan, The McKinley Group's Internet Guide.
First of all, Magellan (like Yahoo) is a directory of Internet sites. Magellan's search engine enables you to search the Net. To get a non-zero number of documents you should go directly to Magellan's web-spider option. You can use the OR, AND, NOT and NEAR boolean operators in your query, and you can search for phrases and use wildcards (* and ?). You'll get a list of hyperlinks with a short abstract for each document found. Recently Magellan created a new search mode (Search Magellan), and the efficiency of this search engine improved greatly.

NlightN
This search engine enables you to look for a single word, a group of words or a phrase. You can use the AND, OR, NOT boolean operators and parentheses for complex searches. The search engine returns a list of hyperlinks without abstracts. Moreover, if you haven't registered, you'll see no more than 10-20 references. To see all of them you have to register first. NlightN is a commercial site, but it provides Internet search free of charge. The Internet index for this search engine is presently provided by Lycos.
NlightN has a few hundred public domain and proprietary databases containing a huge collection of documents, the number of which is comparable with that found by AltaVista. But using them is not free.

Infoseek Net Search
Infoseek can look for a word, a group of words or a phrase. When performing the search, it takes into account different word forms (such as singular and plural forms). Infoseek returns a list of hyperlinks with a short abstract for each document found. Infoseek no longer limits the number of references found (as it did a few months ago), and you can see all of them. If you want to look for information in public and commercial databases, you should use Infoseek Professional, which is not free. The database is updated every 1-2 weeks.

WebCrawler is a search engine operated by America Online, Inc.
The database of this search engine is not large. You can search for a single word or a group of words joined by AND, OR, NOT operators, and for a phrase. The result of the search is a list of hyperlinks. The output of WebCrawler was recently changed, and now you can choose whether you want to see abstracts for each document. The maintainers of the database are striving to update it every day.

Galaxy
The number of documents returned by this search engine is slightly above that from Yahoo, i.e. less than 0.1% of the number of documents returned by AltaVista. I don't think it's worth describing all the details of the search strategy for this search engine.

Yahoo
This search engine is not good for finding scientific information. Instead, it is the best place if you're looking for general information on the Internet. Yahoo can serve as a hierarchically organized directory of Web documents.

In addition to the search engines listed above, I tested several more. Among them:

The Electric Library
The Electric Library is part of the activity of the Americans for Smarter Kids Fund, created by Infonautics Corp. The search engine of the Electric Library can help you find review articles on the subject you're looking for in many popular (not necessarily scientific) publications.


Hope you have found this information useful,
Alexander Lebedev