*** Qualitative Assessment of the Internet Archive's Wayback Machine***
 
 


Background

Executive Summary

As described above, the Internet Archive (IA) is a comprehensive web-based digital library that continually crawls and archives all public internet pages, which are then publicly accessible through the Wayback Machine search service. Some of the common principles of digital archives are also defining motivations for IA, while in other respects IA differs from them in important ways. The project team recognized a need to identify, understand and satisfy the requirements of the Internet Archive's growing user base, so that improvements to the archive's usefulness could be identified and its collective benefit to users maximized. We implemented a qualitative data-gathering process, which produced a list of recommendations that provide a platform for future work related to this information service. We feel that addressing these concerns will enhance the user's experience with the Wayback Machine.


Research Team

The research team for this project consisted of three graduate students - Pallavi Aravind, Vanessa Arce, and Peter Roessler - of the School of Information Management and Systems at the University of California, Berkeley. This research was carried out under the direction of Professor Peter Lyman, whose focus is on the ethnographic study of communication and social formation in digital and networked environments.


Introduction to Digital Archives

The role of archives has traditionally been the preservation of cultural heritage through artifacts, for example in libraries (focused on books and allied media) and museums (focused on carefully selected textual documents, graphical objects like paintings, and structures like sculpture). A binding characteristic of these artifacts is that they "move through life cycles. They are created, edited, described and indexed, disseminated, acquired, used, annotated, revised, re-created, modified and retained for future use or destroyed by a complex, interwoven community of creators and other owners, disseminators, value-added services, and institutional and individual users." Without some form of static repository, these and other factors clearly limit what can be understood, at least from an anthropological perspective, about the cultural context of the archive in question. Society therefore "has a vital interest in preserving materials that document issues, concerns, ideas, discourse and events" within such contexts.

A digital archive, then, also preserves cultural and historical information (artifacts in digital formats) and similarly unites "communit[ies] of actors in their various information-based activities [and] their common purpose." Today, for example, we rely on digital archives to "track our genealogies, to understand what science has discovered, to appreciate the stories people told a hundred years ago, and to know how we educated our children during the Depression." The value added by each digital archive generally reflects its "individual purpose, … tailored to the necessities of different user groups." In addition to aggregating resources for a specific purpose, digital archiving also helps alleviate the common problem of accessibility, that is, of locating relevant items in large collections. Digital archives therefore serve the same humanistic functions as traditional archives, while their technical characteristics provide a novel way for users to access the information contained within them.

In order to corroborate these traits, we felt it was necessary to conduct a broad survey of existing digital archives. The survey primarily looked for commonalities in the motivations for aggregating their components. We reasoned that the patterns might or might not be obvious, but that careful thought on a representative sampling should suffice for our purposes. We performed some simple searches on popular Web search engines to gather our sample. At first glance, every major digital archive we explored maintained, without exception, some content specificity. Representative examples were as follows:

· NASA's digital image collection3
· The Digital National Security Archive4 - the most comprehensive digital collection of declassified primary documents defining U.S. government policy
· USGenWeb Project5 - offers transcriptions of public domain records for genealogical research
· Swaziland Digital Archive6 - focuses on the country's historical photographs
· Japanese American Relocation Digital Archives (JARDA)7 - a "thematic collection" documenting the experience of Japanese Americans in World War II internment camps
· UCSF Tobacco Control Archives8 - provides papers, unpublished documents and electronic resources relevant to tobacco control
· The Pandora Archive of the National Library of Australia9

It seemed each archive was created purposefully to support specific tasks and, in many cases, to provide topic-focused content to its audience: "rubrics for coordinating a user's group of common activities". Each had specific users in mind, whose needs could be met by a set of closely related usage scenarios. This has been called an 'actor-network' scenario, "linking people and things in the environment"10. There was an implicit principle of historic preservation illustrated in each example. All were deliberate collections of specialized digital artifacts created to ensure their availability.


Introduction to Internet Archive

It seems most appropriate to begin describing the motivation for the Internet Archive, the Wayback Machine service, and ultimately the research described here, by presenting the statement offered by the service itself.

"The Internet Archive Wayback Machine is a service that allows people to visit archived versions of stored Web sites. Visitors to the Wayback Machine can type in a URL, select a date range, and then begin surfing on an archived version of the Web. Imagine surfing circa 1999 and looking at all the Y2K hype, or revisiting an older copy of your favorite Web site. The Internet Archive Wayback Machine can make all of this possible."11

As straightforward as this statement might seem, it leaves open many questions about what the Archive actually is. The Web, we know, is "the largest document ever written" and "ninety-five percent of Web pages are publicly-accessible."12 This makes the Internet Archive absolutely unique in content and scope. One might wonder how a comprehensive archive such as this is possible or, more importantly, what the motivations behind such an endeavor might be in the first place, in order to better understand its utility. We learned that this "Internet equivalent of the Library of Congress has been capturing and archiving every public Web page since 1996" and that there are two clearly stated motivations: its use for documenting the provenance of the Internet, as "a historical record of cyberspace… [and] as part of an innovative search tool that lets users call up 'out-of-print' Web pages." This is coupled with grander plans to then "make [the Archive] part of the infrastructure of the Internet."13 Given this high-level context and purpose, there naturally seems to be more to the story of the Archive under the surface. A collection and service of this magnitude could not possibly be summed up in a few sentences. The motivations described here are certainly plausible, even noble. Yet one is left to wonder what the Archive is actually making possible. One now knows that it is possible to go back and look at an old version of a Web site, but one might wonder, as we at the School of Information Management and Systems did, whether that is realistically the Wayback Machine's sole use, or whether there are also unexplored, undocumented, or unrealized uses beyond what is touted.

The sheer scope of this digital collection raises many new questions. What exactly has been archived in the public domain? Is everything that was ever out there really available to view? What is the ultimate purpose for collecting all of this information? Who is using this on a regular basis? Why? What for? Before we could determine (or at least better define) the Archive's relevance and usefulness to anyone and focus the scope of our research pursuits, we felt it was imperative to critique the Archive in terms of what we learned about other digital collections. It seemed an ambitious and misdirected task to assess any of the described motivations without taking a look at similar digital archives.

Given that archives serve as historical artifacts within the context of a specific topic, we recognized that, despite the unprecedented scale of its collection, the Internet Archive had no specific topic of focus to speak of. One could say that the Archive uniquely attempts to capture all possible topics at once. Should the focus for this unique archive be, then, simply to continue to preserve valuable social and cultural artifacts, or to provide a variety of topic-specific content for academic, research or other purposes like other digital archives, or is it intended to be all-encompassing? There is also the possibility of future benefits, not yet known, that would result from its usefulness in conjunction with technological innovation or some future social context.14

Initial Evaluation by Research Team

We therefore observed, to a large extent, examples of digital archives that collectively encompassed many varied topics and that were each content-centric. The implicit motivations behind many of them were quite similar to those stated for the Archive. Yet all were also largely defined by their specialized contents, a characteristic missing from the Internet Archive. What we were seeing was a major departure from the common threads of most other digital archives: a vast collection mirroring the Internet itself across time, something wholly unique. If users of the other archives come to them, by default, with some specific interest or need, then what of the users of this 'Wayback Machine'? Perhaps there are useful general trends in its use that are simply not as obvious as they are for topic-specific archives.

Perhaps less obvious was how we might juxtapose the idea of 'users with a common purpose' from the survey with the users of the Internet Archive. The literature makes the point that, in archiving, "the intellectual integrity of [the artifacts of the archives] is maintained and [the] individual [artifacts] are always contextualized."10 This assumed contextualization was missing from the Internet Archive altogether, as its terabytes of Web pages and associated metadata span any number of possible categories. We then wondered what the most common contexts might be for current users of the Archive and in what ways their needs might be overlooked. Due to the breadth of the collection itself, we thought some sort of user study would have to be carried out to define any 'users with a common purpose'. Work needed to be done to collect the user information that is normally obvious for other digital collections but missing here.

We wanted to map out a process for identifying these user communities in order to examine whether the tool supports them and their usage patterns. Gilliland-Swetland commented on this approach, stating that "it is important to understand the societal roles of archives because it is in the fulfillment of these roles that archivists provide the necessary skills and knowledge to contribute to the [current] paradigm."15 Might we identify and define ways, beyond what is described here at the surface, in which the Archive could be better utilized? Are there demands for potential use that are not currently satisfied? Would we be improving the current and future usefulness of such a unique information service if we addressed its pitfalls and potential alike with a user-centered approach?

We therefore made some initial decisions about the necessity of qualitative data collection. These were grounded not only in our recognition of the inherent qualities of other existing digital archives but also in our resulting suspicion that there must exist some common threads worth documenting among the total population of users. It seemed logical to carry out some foundational research here to support future related projects, including larger-scale user population sampling and data collection that could corroborate our findings.

 

 


   

***

Pallavi Aravind, Vanessa Arce, Peter Roessler
Copyright © 2002 Last Modified: May 31, 2002