Websites offer an unobtrusive data source for developing and analyzing information about various types of social science phenomena. In this paper, we provide a methodological resource for social scientists looking to expand their toolkit with unstructured web-based text and, in particular, to use the Wayback Machine to access historical website data. After providing a literature review of existing research that uses the Wayback Machine, we put forward a step-by-step description of how the analyst can design a research project using archived websites. We draw on the example of a project that analyzes indicators of innovation activities and strategies in 300 U.S. small- and medium-sized enterprises in green goods industries. We present six steps to access historical Wayback website data: (a) sampling, (b) organizing and defining the boundaries of the web crawl, (c) crawling, (d) website variable operationalization, (e) integration with other data sources, and (f) analysis. Although our examples draw on specific types of firms in green goods industries, the method can be generalized to other areas of research. In discussing the limitations and benefits of using the Wayback Machine, we note that both machine and human effort are essential to developing a high-quality data set from archived web information.
Author(s): Sanjay K. Arora, Yin Li, Jan Youtie, and Philip Shapira
Organization(s): Georgia Institute of Technology and University of Manchester
Source: Journal of the Association for Information Science and Technology
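As a rough illustration of the crawl-planning step described in the abstract above, the following minimal sketch looks up the archived snapshot closest to a chosen year through the Internet Archive's public availability endpoint. It is not the authors' pipeline: the function names, the mid-year target date, and the example domain are assumptions made for illustration only.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Wayback Machine availability endpoint (returns JSON).
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(site, year):
    """Build a request URL for the snapshot closest to 1 July of `year`.

    The mid-year target date is an assumption of this sketch.
    """
    params = {"url": site, "timestamp": f"{year}0701"}
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def parse_closest(payload):
    """Extract the closest snapshot from a decoded JSON response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return {"snapshot_url": closest["url"], "timestamp": closest["timestamp"]}


def fetch_closest(site, year, timeout=30):
    """Query the endpoint over the network and parse the reply."""
    with urlopen(build_query(site, year), timeout=timeout) as response:
        return parse_closest(json.load(response))
```

Calling `fetch_closest("example.com", 2008)` (a hypothetical firm domain) for each sampled firm and year would give a list of snapshot URLs to crawl. Because the archive returns the nearest capture rather than an exact date, the returned timestamp should be checked by hand or by rule before a snapshot is assigned to a given year.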
As enterprises expand and post increasing information about their business activities on their websites, website data promises to be a valuable source for investigating innovation. This article examines the practicalities and effectiveness of web mining as a research method for innovation studies. We use web mining to explore the R&D activities of 296 UK-based green goods small and mid-size enterprises. We find that website data offers additional insights when compared with other traditional unobtrusive research methods, such as patent and publication analysis. We examine the strengths and limitations of enterprise innovation web mining in terms of a wide range of data quality dimensions, including accuracy, completeness, currency, quantity, flexibility and accessibility. We observe that far more companies in our sample report undertaking R&D activities on their websites than would be suggested by looking only at conventional data sources. While traditional methods offer information about the early phases of R&D and invention through publications and patents, web mining offers insights that are more downstream in the innovation process. Handling website data is not as easy as handling data from alternative sources, and care needs to be taken in executing search strategies. Website information is also self-reported, and companies may vary in their motivations for posting (or not posting) information about their activities on websites. Nonetheless, we find that web mining is a significant and useful complement to current methods and offers novel insights not easily obtained from other unobtrusive sources.
Open Access doi:10.1007/s11192-014-1434-0
Author(s): Abdullah Gök, Alec Waterworth, Philip Shapira
Organization(s): MIoIR-University of Manchester