Key Words: Invisible Web, Internet, search engines
Just when we already feel overwhelmed by the number of results that Web search engines return, evidence is mounting that a great deal is not being searched at all--namely the "Invisible" or "Deep" Web. In fact, the argument is made that standard search engines do not index most of the information on the Web. Not only is a majority of Web content submerged within Invisible Web sites, but it is also some of the best information on the Internet.
Two publications in particular have explored what the Invisible Web is and its implications for us as Web searchers. Michael K. Bergman (2001) published a white paper, "The Deep Web: Surfacing Hidden Value." Although the paper has a commercial slant because the research he presents showcases technology available from his employer, BrightPlanet, the issues raised have attracted considerable attention. The second publication of note is a book by Chris Sherman and Gary Price (2001), The Invisible Web: Uncovering Information Sources Search Engines Can't See.
This column will explore some of the issues raised and how to plumb the depths of the Invisible Web for health and other needed information. Bergman prefers the terminology of the "Deep Web." However, this column will use the more pervasive phrase of the "Invisible Web."
Just what is it and why don't conventional search engines reach it?
The Invisible Web is that portion of the Web that is not reached by standard search engines such as AltaVista or Google. It includes specialized databases and search engines, archives of documents, directories and locators, dictionaries, library catalogs, and gated resources requiring a password or login. Some sites have a hybrid status, with some content visible and some not.
- Much of the content on the Invisible Web is in databases that have their own search interface and retrieve customized results dynamically. Because of these "on the fly" responses, the resulting pages are not static Web pages and are not indexed by the search engines.
- Search engines run computer-driven "spiders" to find Web pages and make them available for indexing. Spiders can find a page only if a link to it exists on a page already identified for that search engine. Otherwise, a search engine indexes a Web site only if its producer submits it for indexing. If information about a Web page is not obtained in either of these ways, the search engine will not index it.
- Aside from any technological limitations they face, search engines also make conscious decisions not to index certain material. This includes pages whose content is predominantly something other than HTML text, which is the standard format for static Web pages. These could be pages that are composed largely of images, those that are in PDF or a word processing format, or those that have been written with specialized software such as Flash or Shockwave.
- A Web producer may also block all or part of a site from being retrieved by a spider. This is particularly common for sites that offer timely content (newspapers, stock tickers, flight trackers).
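The discovery rules described above can be illustrated with a small sketch. It is a toy model only, not how any particular search engine is implemented: the page names, the LINKS table, and the BLOCKED set are hypothetical, and a real spider would fetch pages over the network rather than read them from a dictionary.

```python
from collections import deque

# Hypothetical toy "Web": each static page maps to the pages it links to.
LINKS = {
    "home.html": ["about.html", "catalog-search.html"],
    "about.html": ["home.html"],
    "catalog-search.html": [],     # a search form; its results exist only on request
    "orphan.html": ["home.html"],  # a real page, but nothing links to it
}

# Pages the site producer has asked spiders not to retrieve (illustrative).
BLOCKED = {"about.html"}


def crawl(seeds):
    """Return the set of pages a link-following spider reaches from the seeds."""
    found = set()
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        # Skip pages already seen, blocked by the producer, or not static pages.
        if page in found or page in BLOCKED or page not in LINKS:
            continue
        found.add(page)
        queue.extend(LINKS[page])
    return found


indexed = crawl(["home.html"])
print(sorted(indexed))               # pages the spider can index
print(sorted(set(LINKS) - indexed))  # pages that stay invisible
```

Starting from home.html alone, the spider never reaches orphan.html (no inbound link) or about.html (blocked), and the results produced by the catalog search form never appear because they are not static pages at all. Passing orphan.html as an additional seed models a producer submitting the page for indexing.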
Why does this matter?
Content accessible on the Web continues to grow as government agencies, organizations, and corporations commit to doing business through this medium and are aided by advances in Web technology and the increasingly favorable economics of computer storage. Bergman's paper on the Deep Web was based on research done in March 2000, so one can imagine that the findings would be even more dramatic today. His findings on the Deep Web included:
- The publicly available information found there is 400 to 550 times greater than what is on the surface Web. More than half of this information is found in topic-specific databases. It exceeds the total volume of printed works by an estimated factor of seven.
- The number of sites exceeds 200,000. Of these, 60 Web sites alone were found to be nearly 40 times the size of the entire surface Web.
- This is the fastest-growing category of new information on the Web.
- Subject content extends across the entire spectrum of human enterprise with an estimated 5.5% of the Deep Web devoted to health.
- Some Deep Web sites are very popular and have many links to them (e.g., Amazon.com), but most are not well known. An estimated 97.4% of these sites are available without restriction.
- Quality is more pervasive on the Deep Web than on the surface Web. Also, quality results are possible here that cannot be obtained elsewhere.
What are examples of Invisible Web sources?
American Medical Association Physician Select
Library Online Catalog
Library of Congress Catalog
Combined Health Information Database (CHID)
CNN News Search
How do I find Invisible Web sites?
Specialized finding aids have been created to help searchers identify sources such as these that will not turn up using standard search engines. What characterizes these tools is that they are not just automated retrievers, but the product of human effort and judgment.
BrightPlanet Corporation provides this directory of 103,000 searchable databases and engines on the Internet, organized by categories.
Infomine: Scholarly Internet Resource Collections
Over 20,000 sites selected by librarians as "significant, core and/or reference level resources of a scholarly or educational nature on the Internet." It is possible to search or browse any of the 10 broad categories.
InvisibleWeb.com: The Search Engine of Search Engines
"... our subject matter experts (human editors) have discovered indexed, described, and categorized thousands of invisible sources on the Web in a directory (taxonomy) of 800+ categories."
Resource Discovery Network (RDN)
Web directory of over 32,000 sites, compiled by subject experts in the United Kingdom and organized by independent hubs that are searchable independently or by using a unified interface to search all.
One hub of particular interest is:
BIOME - Health and Life Sciences
www.Invisible-web.net -- The Invisible Web Directory
Online Web directory by Sherman and Price that complements their book cited earlier. The site is updated regularly and is designed to serve as a starting point for learning more about resources on the Invisible Web. It is organized into 18 categories.
Tools available from standard search engines
AltaVista has created "shortcuts" that direct a user's search to selected types of Invisible Web resources that would not normally be retrieved. To make this happen, AltaVista has created specialized indices for commonly sought resources such as local information, maps, news, recipes, stocks, white pages, and yellow pages. A retrieved shortcut shows on the first page of AltaVista search results, below "products and services," but before other search results. The shortcut is marked with a small blue arrow icon.
For example, a search on "Agent Orange news" turned up a shortcut that led to 35 headline articles from sources such as the BBC, MSNBC, Miami Herald, ABC Online, and USA Today. For more information on AltaVista shortcuts and how they work, go to www.altavista.com/sites/search/shortcuts_overview
Google now makes it possible to search twelve different file types in addition to the standard HTML-formatted Web page. These include PDF files as well as Adobe PostScript, Microsoft Word, Excel, PowerPoint, and Rich Text files. When such a file matches a search, it appears in the results with the file format given in blue text in brackets (e.g., [PDF]). The name of the file format also appears below the title (e.g., "PDF/Adobe Portable Document Format"). Google offers the choice to open one of these documents as a Web page by clicking on "View as HTML." This avoids opening a file that may contain a virus. If you elect to open the file in its original format, it is important to check it with virus-scanning software.
If you want to search only for a particular file type, use Google's Advanced Search page. A drop-down menu is provided to allow you to restrict your search to a specific file type. For more information on searching by file type, go to www.google.com/help/faq_filetypes.html.
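For readers who build search links programmatically, the following is a minimal sketch of restricting a query to one file type. It relies on Google's `filetype:` query operator and the public search URL, neither of which is described in this column, so treat both as assumptions; the function name `filetype_search_url` is purely illustrative.

```python
from urllib.parse import urlencode


def filetype_search_url(terms, filetype):
    """Build a search URL that limits results to a single file type.

    Assumes the search engine understands a `filetype:` operator in the query.
    """
    query = f"{terms} filetype:{filetype}"
    # urlencode escapes spaces and punctuation so the query is safe in a URL.
    return "https://www.google.com/search?" + urlencode({"q": query})


print(filetype_search_url("evidence-based nursing", "pdf"))
# https://www.google.com/search?q=evidence-based+nursing+filetype%3Apdf
```

The same restriction can be applied by hand through the drop-down menu on the Advanced Search page.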
Finding it once, retrieving it again
Web users are often confounded by URLs that are uncomfortably lengthy, have embedded punctuation, and are impossible to remember. Such URLs typically reflect pages that have been retrieved from a search of a database on the Invisible Web. The URL is unique to the search that was done on that database and to the item retrieved. Reusing this URL would retrieve the same document.
For example, a search on PubMed for "Nightingale and evidence-based nursing" retrieved the following reference. Citing or bookmarking this URL will refer the user to this same item.
Evid Based Nurs 2001 Jul;4(3):68-9
Florence Nightingale and the early origins of evidence-based nursing. McDonald L.
Department of Sociology and Anthropology, University of Guelph, Guelph, Ontario, Canada.
PMID: 11708232 [PubMed - in process]
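To see why such URLs look the way they do, it can help to take one apart. The sketch below uses a hypothetical address on example.org, not the actual PubMed URL, and its parameter names are illustrative assumptions; only the PMID comes from the reference above.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical database URL; host and parameter names are assumptions.
EXAMPLE_URL = (
    "http://db.example.org/query.fcgi"
    "?cmd=Retrieve&db=PubMed&list_uids=11708232&dopt=Abstract"
)


def describe_url(url):
    """Split a database URL into the script path and its query parameters."""
    parts = urlsplit(url)
    # parse_qs returns each parameter as a list because a name may repeat.
    return parts.path, parse_qs(parts.query)


path, params = describe_url(EXAMPLE_URL)
print(path)
for name, values in params.items():
    print(name, "=", values[0])
```

Everything after the question mark is a set of instructions to the database: which command to run, which collection to search, and which record to return. Because the record identifier is carried in the URL itself, bookmarking the address is enough to retrieve the same item again.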
The library's gated resources on the Invisible Web
Libraries have traditionally served as the repository of our cultural heritage and of the scholarly record. These archives in the past were in print and sometimes in other media such as microform. Now the move, of course, is toward providing digitized resources available on the Web. However, many of these resources--such as the majority of electronically available journals and research databases (CINAHL, PsycINFO)--can be obtained only through subscription or licensing agreements. These contractual arrangements include restrictions from the publishers as to how these electronic resources can be used and by whom. To ensure compliance with these agreements, libraries limit access to these resources through some validation mechanism.
The bottom line is that some of the most valuable and timely information that libraries have to offer is on the Invisible Web and in fact is "gated." Users will not gain access to this material by searching on the Web at large, nor by using the tools mentioned above. It is incumbent upon users with privileges in a given library to seek out what that library has to offer electronically and what paths to that material are available.
Barbara F. Schloman, PhD, AHIP
Assistant Dean, Library Information Services
Libraries & Media Services
Kent State University
Kent, OH 44242
Email Address: firstname.lastname@example.org
Disclaimer: Mention of a Web site does not imply endorsement by the author, OJIN, or NursingWorld.
Bergman, M. K. (2001). The deep Web: Surfacing hidden value. Retrieved March 11, 2002, from http://beta.brightplanet.com/deepcontent/tutorials/DeepWeb/index.asp
Sherman, C., & Price, G. (2001). The invisible Web: Uncovering information sources search engines can't see. Medford, NJ: Information Today.