Key Words: World Wide Web, Internet, digital information, preservation, archiving
How often have you sought to return to a previously visited Web site, only to discover it is no longer there and to receive a 404 error instead? Similarly, do you have document files from a decade ago that you created on an earlier version of a word processor, or on one that no longer exists? Perhaps that document is on an obsolete storage medium, such as a 5.25-inch diskette. If so, can you access that document today? Do you still have a copy of the first e-mail message that you ever sent?
The reality is that the digital world is far more fragile than the print one it replaces. While we can still read an original copy of the Declaration of Independence or Gutenberg’s Bible, digital information created in the last forty years has already been lost. This is all the more sobering given a 1998 estimate that half to three-quarters of all data produced is "born digital," that is, it never existed on paper or in an analog form (Stepanek, 1998).
Laura Tangley (1998) relates data from the National Media Lab on the life expectancy of certain media under optimum conditions: magnetic tapes (including VHS) up to 20 years and optical disks (such as CD-ROMs) 50 years or less. Average quality CD-ROMs may become unreadable after just five years. In contrast, microfilm can last over 100 years and paper for 500 years or more. In addition to the limited lifespan of the storage medium, there is the obsolescence of the software and hardware that created the document and that are needed for the content to be read. Tangley relates the following as examples of problems to date:
- Magnetic tapes from the 1976 Viking mission to Mars were missing 10 to 20 percent of their data when attempts were made to read the data in the past few years.
- A Yale University project to transfer 2,000 books from microfilm to optical disk stopped midway when the software being used became obsolete, making the disks difficult to read.
- Mapping data of land use in the State of New York recorded on magnetic tapes in the 1960s are no longer readable.
- Data needed to track deforestation of the Brazilian Amazon cannot be read from magnetic tapes storing the satellite photos taken in the 1970s.
These problems continue to confront us. A problem may surface when a favorite software program is no longer "backward compatible" in reading older files. Or the difficulty may be due to ongoing changes in hardware. For example, in February 2003, Dell Computer announced that by the end of the year it would no longer equip desktop or laptop models with floppy drives, although the drives would continue to be available as an option (Minkel, 2003). Instead of saving to a floppy disk, users will need to burn files onto CDs and DVDs or save them on reusable Flash memory units. Other Windows PC manufacturers are expected to follow suit. Apple stopped putting floppy drives in its Macs in 1998.
Proposed solutions have included refreshing the raw data, developing emulators of the software and operating systems to enable the data to be carried forward and still be used in its original format, or migrating to a new generation of hardware/software. Each of these approaches poses difficulties (Rothenberg, 1995). Refreshed data can become increasingly difficult to interpret. Developing and sustaining emulators over time requires preserving the specifications of the outdated hardware in a format that can itself be carried forward. Migration is time-consuming, costly, and prone to loss of data in the transfer. For example, pharmaceutical companies found errors were introduced when migrating drug-testing data that support claims of drug safety and effectiveness (Stepanek, 1998).
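One way migration can lose data is when the target format cannot represent everything in the source. The following is a minimal sketch of that idea, not a description of any real migration project: the function names and the sample record are invented for illustration, and character encoding stands in for the far more complex file formats involved in practice.

```python
def migrate(text: str, target_encoding: str) -> bytes:
    """Convert text to the target encoding, substituting '?' for any
    character the target cannot represent (a lossy migration)."""
    return text.encode(target_encoding, errors="replace")


def lost_in_migration(text: str, target_encoding: str) -> bool:
    """Return True if reading the migrated bytes back does not
    reproduce the original text exactly."""
    migrated = migrate(text, target_encoding)
    return migrated.decode(target_encoding) != text


if __name__ == "__main__":
    # Invented sample record; the micro sign has no ASCII equivalent.
    record = "dose: 5 µg"
    print(lost_in_migration(record, "ascii"))  # the unit symbol is replaced
    print(lost_in_migration(record, "utf-8"))  # the text survives intact
```

Comparing the migrated copy against the original, as the second function does, is one simple way to notice such a loss before the original is discarded.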
What about the Web?
We know the Web to be huge and growing. Lyman (2002) reports more than 7 million pages are added daily, with many others disappearing. By his figures, the average life span of a Web page is only 44 days, and 44 percent of the Web that existed in 1998 was gone in 1999. Because the Web represents the culture of our time, its ephemeral nature raises serious issues of preserving this digital record for the future.
Lyman (2002) describes the preservation problem as being cultural, technical, economic, and legal. The cultural questions are what and how much of this vast reservoir to save. What will ultimately have historic value? The technical problems of hardware and software facing the preservation of other digital media exist for the Web also. In addition, there is the particular need to develop means to collect continuously due to the changing nature of the content and to define content given that it lacks discrete boundaries due to its embedded linkages. Economic issues are real here also—who has the responsibility and who will pay? Intellectual property rights present another issue for archiving given that copyright protection extends to the Web.
A number of initiatives are underway to preserve our digital heritage as reflected on the World Wide Web. The following are particularly worth noting.
The Internet Archive: Wayback Machine (www.archive.org/web/web.php)
In 1996 Brewster Kahle founded the Internet Archive project (www.archive.org). Since then, robots have archived the entire publicly available Internet every two months. The resulting archive of more than 10 billion pages is available at the project site through the Wayback Machine. It is possible to search on a specific URL or by a specific collection. For example, a search on www.whitehouse.gov provides links for pages from 1996 to the present. It is also possible to search on a URL for a personal Web page.
Three of the collections are collaborative efforts with the Library of Congress and include Web pages preserved from the U.S. elections of 2000 and 2002, as well as a collection of archived Web sites covering September 11, 2001, and the period immediately following. There is also a collection of "Web Pioneers": 11 sites that are credited with shaping the character of the Internet. Included is a Web snapshot from 1996 of leaders such as Yahoo!, the Internet Movie Database, Amazon.com, and NASA.
Library of Congress: Minerva (www.loc.gov/minerva/)
The Library of Congress (LC) has the responsibility to collect and preserve our cultural and intellectual artifacts. In December 2000, Congress authorized LC to develop and execute a National Digital Information Infrastructure and Preservation Program (NDIIPP). This included collaborating with other Federal and non-Federal entities to collect and provide access to digital materials and developing a strategy for the policies and technological infrastructure needed to ensure long-term preservation. On February 14, 2003, the Librarian of Congress announced that Congress had approved the NDIIPP plan. Already developed is a prototype system to collect and preserve materials from the Web. Known as "Minerva" (Mapping the Internet: the Electronic Resources Virtual Archive), it is concerned with Web materials that have been made publicly available without restriction. News services, such as CNN, are examples of sites that, although publicly available, do request that Internet robots exclude them from their information harvesting sweeps.
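Sites typically make such a request through a robots.txt file, which a well-behaved robot consults before harvesting. The sketch below shows how that check might look using Python's standard urllib.robotparser module. The robots.txt rules, the site address, and the robot names are all hypothetical; they do not reproduce the rules of CNN or of any real archive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary news site. The robot name
# "example-archiver" is invented for illustration.
ROBOTS_TXT = """\
User-agent: example-archiver
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_harvest(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "http://news.example.com/story.html"
    # The archiving robot is excluded from the whole site.
    print(may_harvest(ROBOTS_TXT, "example-archiver", page))
    # Other robots are excluded only from /private/.
    print(may_harvest(ROBOTS_TXT, "other-bot", page))
```

The exclusion is a request rather than a technical barrier: it works only because harvesting robots choose to honor it.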
Arms (2001), in his description of the Minerva prototype, highlights the need for both bulk and selective collecting. With bulk collecting, a snapshot is taken of every site with a set frequency. Selective collecting, however, makes it possible to vary the collection frequency based on the events taking place. His example is the daily snapshot record of the 2000 Presidential candidates’ tactics as revealed on their Web sites in the days leading up to the election and during the Florida recount. Similarly, the selective recording of Web sites on September 11, 2001, and in the weeks following provides a dramatic account of that historic time. Arms offers three preservation objectives: 1) preserve the bits (the exact bit sequence of the original); 2) preserve the content (the text or image, but not the full interactive nature of the Web site); and 3) preserve the experience (the entire experience of interacting with the digital material and its dynamic elements).
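The first of these objectives, preserving the bits, can be checked mechanically. A common approach, assumed here for illustration and not described by Arms, is to record a checksum of the original and compare it with a checksum of each later copy. The function names and sample page in this sketch are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 checksum of a byte sequence as hex text."""
    return hashlib.sha256(data).hexdigest()


def bits_preserved(recorded_checksum: str, copy: bytes) -> bool:
    """Return True if the copy has the same checksum as the original."""
    return fingerprint(copy) == recorded_checksum


if __name__ == "__main__":
    # Invented sample content standing in for an archived Web page.
    original = b"<html><body>Archived page</body></html>"
    recorded = fingerprint(original)  # stored when the page is collected

    exact_copy = bytes(original)
    damaged_copy = original.replace(b"Archived", b"Archivad")

    print(bits_preserved(recorded, exact_copy))    # identical bit sequence
    print(bits_preserved(recorded, damaged_copy))  # one byte has changed
```

A matching checksum speaks only to the first objective. It says nothing about whether the content can still be displayed or the original experience recreated.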
At present, the Minerva project includes the three projects done in collaboration with the Internet Archive and mentioned above. Coming soon are collections for the 2002 Winter Olympics, the 2002 September 11th Remembrance, and the 107th Congress.
The CyberCemetery began in 1997 to provide permanent public access to the electronic Web sites and publications of defunct U.S. government agencies and commissions. It is a joint project of the University of North Texas Libraries and the U.S. Government Printing Office, as part of its Federal Depository Library Program. The site provides for searching by site name and by category. Among the sites archived here are the National Partnership for Reinventing Government and the White House Commission on Complementary and Alternative Medicine Policy.
U.S. Government Printing Office--Permanent Public Access/PPA (www.gpo.gov/ppa/)
Some background is needed on other issues facing the regular dissemination of and access to information from the Federal Government. The founding fathers believed an informed citizenry was central to the checks and balances needed to sustain a democratic form of government. Therefore, the U.S. Government Printing Office (GPO) was created to produce and distribute Federal information. A key mechanism for ensuring that information flowed from government to the individual was GPO’s creation of a system of federal depository libraries located in each state (Federal Depository Library Program or FDLP). For example, in Ohio 58 libraries are so designated. Their function is to commit to acquiring information in areas appropriate to their communities and to provide open and free access to that information. Each library establishes a profile of materials it wants to receive from any or all agencies of the Federal Government on an ongoing basis. This worked well in the print environment. However, online electronic information is now the fastest growing component of the information disseminated, accounting for over 53% of the titles profiled by the depository libraries in fiscal year 2000 (U.S. Government Printing Office, 2001).
These online products are, of course, not physically distributed to the depository libraries for retention. The U.S. Code (44 U.S.C. Section 1911) mandates that public access to documents disseminated through the FDLP must be maintained permanently. To deal with this permanent access requirement for digital information, GPO created its Permanent Public Access (PPA) Working Group. As a result, GPO has assumed responsibility for permanent access to Federal information residing on its Web servers. Additionally, for agency information not on its servers, GPO copies the Internet resources and places those copies in a digital archive. Persistent URLs (or PURLs) are assigned to those resources so that users will be redirected automatically to the archived products if the information becomes unavailable on the agency Web site.
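The redirection behind a persistent URL can be pictured as a lookup table that maps one stable address to the current location of the resource. The sketch below is a simplified illustration of that idea, not GPO's actual system: every address, the table layout, and the function name are hypothetical, and a set of "live" addresses stands in for a real availability check.

```python
from typing import Dict, Set

# Hypothetical table: each persistent address records both the agency's
# copy and the archived copy of the same resource.
PURL_TABLE: Dict[str, Dict[str, str]] = {
    "http://purl.example.gov/doc/1001": {
        "agency": "http://agency.example.gov/reports/1001.html",
        "archive": "http://archive.example.gov/copies/1001.html",
    },
}


def resolve(purl: str, live_urls: Set[str]) -> str:
    """Return the address a user should be redirected to.

    The agency copy is preferred while it is still available; otherwise
    the user is sent to the archived copy.
    """
    entry = PURL_TABLE[purl]
    if entry["agency"] in live_urls:
        return entry["agency"]
    return entry["archive"]


if __name__ == "__main__":
    purl = "http://purl.example.gov/doc/1001"
    # While the agency page exists, the PURL leads there.
    print(resolve(purl, {"http://agency.example.gov/reports/1001.html"}))
    # After the agency removes the page, the same PURL leads to the archive.
    print(resolve(purl, set()))
```

The point of the design is that citations and bookmarks use only the persistent address, so they keep working when the underlying location changes.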
Removal of Health-Related Information on Federal Web Sites
Since the early 1980s, various government agencies have on occasion asked depository libraries to destroy or return certain documents because of military security, administrative and operation security, falsified data, outright censorship, and environmental security (Lynch, 1995). By 2001, much of the available Federal information was no longer in print, but had moved to the Web. Following the September 11 attacks, a number of Federal agencies removed documents from their Web sites that might aid terrorist groups.
More recently, removals of a different sort have been reported. The New York Times (Clymer, 2002) and other news sources reported three removals dealing with condom use, the relationship between abortion and breast cancer, and ways to reduce sex among teenagers. Critics accused the Department of Health and Human Services of censoring information to advance the philosophy of sexual abstinence. Rep. Henry Waxman (D-CA) and a group of House Democrats sent a letter to Tommy Thompson, Secretary of Health and Human Services, protesting the removal of the health information from Federal Web sites based on "ideology rather than science." A Department spokesperson responded that the removals were done to update the information.
In the case of the breast cancer/abortion relationship, a scientific panel convened by the National Cancer Institute (NCI) reported unanimous agreement on March 4, 2003, that there is no evidence suggesting that having had an abortion increases the risk of breast cancer (U.S. National Cancer Institute, 2003). This would be in keeping with the patient information that was pulled in October 2002. In its place was a revision dated November 25, 2002, that suggests that the scientific data were inconclusive. Critics are watching to see if the official fact sheet will be updated to reflect the recent determination by the scientific panel.
The information relating to condom use and sex among teenagers was on the Centers for Disease Control and Prevention site. A fact sheet had stated that condoms were "highly effective" in preventing HIV and other sexually transmitted diseases. That has now been revised to provide a more neutral summary of the pros and cons of condom use. Regarding the information about intervention programs that included condoms in addition to a discussion of abstinence, a CDC spokesperson indicated that was pulled because some communities did not like the combined message. The debate continues.
In a world where digital information plays an ever-greater role, we lack an adequate appreciation for how fleeting it might be. Until preservation issues are resolved, it is possible that at least part of the record of this period in time will be lost. How have you stored your digital information? What are your hopes for being able to access it in the future?
Barbara F. Schloman, PhD, AHIP
Assistant Dean, Library Information Services
Libraries & Media Services
Kent State University
Kent, OH 44242
Email Address: email@example.com
Disclaimer: Mention of a Web site does not imply endorsement by the author, OJIN, or NursingWorld.
Arms, W. Y. (2001). Collecting and preserving the Web: The Minerva prototype. RLG DigiNews, 5(2). Retrieved March 15, 2003, from www.rlg.org/preserv/diginews/diginews5-2.html#feature1
Clymer, A. (2002, November 25). Critics say government deleted sexual material from Web sites to push abstinence [Electronic version]. New York Times, p. A18.
Lyman, P. (2002). Archiving the World Wide Web. In Building a national strategy for digital preservation: Issues in digital media archiving (pp. 38-51). Washington, DC: Council on Library and Information Resources. Also available at: www.clir.org/pubs/reports/pub106/web.html
Lynch, S. R. (1995). GPO recalls of depository documents: A review [Electronic version]. Journal of Government Information, 22, 23-31.
Minkel, W. (2003). Floppy drives soon to be extinct [Electronic version]. School Library Journal, 49(3), 32.
Rothenberg, J. (1995). Ensuring the longevity of digital documents [Electronic version]. Scientific American, 272(1), 42-47.
Stepanek, M. (1998, April 20). From digits to dust [Electronic version]. Business Week, 3574, 128-129.
Tangley, L. (1998, February 16). Whoops, there goes another CD-ROM [Electronic version]. U.S. News & World Report, 124(6), 67-68.
U.S. Government Printing Office. (2001). A report on meetings hosted by the U.S. Government Printing Office, 1999-2000. Retrieved March 15, 2003, from www.gpo.gov/ppa/report.html
U.S. National Cancer Institute. (2003, March 4). Summary report: Early reproductive events and breast cancer workshop. Retrieved March 17, 2003, from http://cancer.gov/cancerinfo/ere-workshop-report