Extended Abstract – NEW S,T&I VISUALIZATIONS session at “1st Global TechMining Conference” 2011
Author(s): Scott W. Cunningham and J. H. Kwakkel (Delft University of Technology)
This paper provides a range of alternatives for analysts when dealing with geo-spatial data addressing activities in science, technology and innovation (STI). We address the theoretical role of regions and districts in innovation policy. We then discuss an evolving body of analytical methods for addressing theory and delivering useful policy advice. These analytical methods may be implemented using a range of tools; in this paper we discuss open source scripting languages and libraries. First, however, we assess the geographic information in available science and technology databases.
The volume of science and technology data is increasing, and most recently this data is being accurately structured with geographic information. With the rise of the internet, new data sources have emerged, such as Yahoo Finance and Google Trends. Yahoo Finance provides data such as stock prices and can easily be queried through an application programming interface (API). Google Trends offers insight into a wide variety of phenomena and how they are expressed in Google search queries. Other, upcoming social media, such as Twitter and LinkedIn, provide similar data and pay increasing attention to its geographical character. For example, Twitter can be monitored to reveal trending topics in particular parts of the world. Well-known science and technology databases, such as Thomson Reuters' Web of Science and Scopus, are also expanding the data fields they offer.
For instance, over the last two years Thomson Reuters has begun to tag each author accurately with a complete institutional address, affording analysts new opportunities to track knowledge flows and collaboration. Previously the data was incompletely attributed, requiring analysts to impute the correct address for a given author or to work with incomplete information. Still, the Thomson Reuters data must be supplemented with a range of geographic information, including territorial units and coordinates. This new source of STI data presents challenges both new and old for the analyst wishing to develop policy-relevant advice.
The rapid rise of new data sources brings the challenge of assessing the quality of the data. For example, Google Scholar is a good search engine for finding publications, but the results of a query also contain substantial noise. Publications are counted more than once, many irrelevant results are returned when an author's name is common across the world, and Google Scholar offers limited capability for reducing this noise compared to established databases such as Web of Science or Scopus. Similarly, relying on internet-related data sources can introduce unwanted bias in the data. For example, in monitoring Twitter feeds, one is looking at the opinions of one particular subset of the general population. It is not evident that one can generalize from this subset to the whole population. Another challenge of the new data sources is that sources have to be combined in order to extract useful information from them. For instance, by combining LinkedIn profiles with Twitter feeds, a significantly more accurate geographic picture of trending topics in the world can emerge. The increasing volume of available data also brings the problem of information overload to the fore again. There is now such a variety of data sources available that questions such as which source(s) should be used, how to deal with conflicting sources, and how to extract policy-relevant advice from the data require careful consideration on the part of the analyst.
Theoretically, the role of geography and innovation districts has long been important in innovation policy. Firms co-locate and mutually adapt to one another, gaining the benefits of comparative specialization and pools of skilled labor. On the other hand, firms also exist in vast networks designed to tap into the global marketplace. There has therefore been a theoretical tension in finding the appropriate balance between global forces and local assets. Contact networks are a broadly accepted resource for firms, and an important instrument for innovation policy.
Furthermore, there is an evolving body of methods for analyzing social and economic networks. Or, to be more precise, various computer science fields are developing methods and techniques for analyzing network data that could be adapted to analyzing social and economic networks. For example, Cunningham (2010) used random hierarchical graphs to anticipate technological innovation. Similarly, Kwakkel and Cunningham (2009) used mixtures of factor analyzers, a machine learning technique used for speech and face recognition, to identify semantically distinct discourses in the scientific literature. Xu et al (2011) show that PageRank, the algorithm used by Google for ordering web pages (essentially a calculation of eigenvector centrality), provides a potentially more insightful ranking of journal importance than the more traditional citation metrics. Jorge-Botana et al (2010) use the PathFinder algorithm in combination with Latent Semantic Indexing to visualize semantic networks. A wide variety of other techniques might similarly be adapted to analyzing social and economic networks, including block-modeling, which is a form of clustering of nodes in a network, Gaussian mixture models, and signal processing techniques such as Kalman filters for handling the temporal character of network formation.
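To make the PageRank idea concrete, the following is a minimal sketch of a power-iteration ranking over a small citation graph, using only the Python standard library. It is not the procedure of Xu et al (2011); the journal names, the citation links, the function name `pagerank`, and the damping value of 0.85 are illustrative assumptions.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Rank nodes of a directed graph by power iteration.

    links maps each node to the list of nodes it cites (hypothetical input format).
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        # Every node passes an equal share of its damped rank to each node it cites.
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


# Illustrative citation links between four made-up journals.
citations = {
    "Journal A": ["Journal B", "Journal C"],
    "Journal B": ["Journal C"],
    "Journal C": ["Journal A"],
    "Journal D": ["Journal C"],
}
scores = pagerank(citations)
ranking = sorted(scores, key=scores.get, reverse=True)
for journal in ranking:
    print(f"{journal}: {scores[journal]:.3f}")
```

Unlike a raw citation count, a citation in this scheme is weighted by the rank of the citing journal, so a journal cited by highly ranked journals rises in the ordering.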
In practice, however, decision-makers have been offered very limited capabilities for assessing the geographic impacts of STI policies. The data now allow us to evaluate extended networks of contact and collaboration. We also have the analytical tools to evaluate the role of distance and industrial districts in the formation and maintenance of networks. Added insight is needed regarding the formulation and development of the best networks to meet specific regional or innovation goals. New visualization capabilities for spatial and network data are key to meeting this challenge. The power of good visualizations in providing insight to decision-makers is well known; visualizations complete a full cycle of decision-making involving analysis, design, action and further monitoring.
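As one simple way to bring distance into such an analysis, the sketch below computes great-circle distances between collaborating organizations with the haversine formula. The organization labels and the approximate city coordinates are illustrative assumptions, not data from this study, and the function name `great_circle_km` is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative organizations with approximate (latitude, longitude) of their cities.
locations = {
    "Org in Delft": (52.01, 4.36),
    "Org in Eindhoven": (51.44, 5.48),
    "Org in Zurich": (47.38, 8.54),
}
collaborations = [
    ("Org in Delft", "Org in Eindhoven"),
    ("Org in Delft", "Org in Zurich"),
]
distances = {
    (a, b): great_circle_km(*locations[a], *locations[b]) for a, b in collaborations
}
for (a, b), km in distances.items():
    print(f"{a} - {b}: {km:.0f} km")
```

Distances computed this way could then be binned or used as a covariate when asking how strongly collaboration depends on proximity.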
Many well-established displays, including tables, graphs, and histograms, are established techniques because they can powerfully communicate a wealth of information. For spatial and network data, such established visualizations have not yet emerged. There is a wide variety of network visualizations available, such as ‘the map of science’ (http://mapofscience.com/index.html), various maps of the internet, and many maps provided by social media such as LinkedIn. These maps are first and foremost aesthetically pleasing, but it is less clear how they can be used to inform decision-making. Moreover, many of the visualizations are data driven; that is, they start from the question of how the data can best be displayed. There is limited attention to using theoretical ideas about networks, network dynamics, and network formation, such as those found in geographical economics or innovation policy, in designing displays. Using these theories could provide healthy cross-fertilization by connecting real-world data with existing theories, helping to assess the adequacy of the theories, and helping to provide displays that can be used to support decision-making.
This paper systematically outlines the current best practices and alternatives for visualizing geographic data. Data visualization seems to be a missing piece of any software architecture for analysis and policy support. Part of the problem may be the often very customized requirements for visualization which exist across domains. This results in few ready-made solutions from the private sector, in part perhaps because the visualization market is very fragmented. Geographic data is broadly handled within geographical information systems (GIS), although the use of such systems has tended to focus on land-use and planning, rather than corporate networks and innovation policy. As a result we look to open source solutions, and loose frameworks from which a complete solution for analysis and visualization can be developed.
There is a wide variety of open source tools available for the analysis and visualization of networks. For Java, there is the well-known Java Universal Network/Graph (JUNG) framework, InfoViz, and GUESS, among others. For R, there is the Social Network Analysis package. For Python, the combination of NetworkX and matplotlib offers a powerful network analysis and visualization solution, which can be further extended through Orange, a data mining suite with Python bindings. However, these solutions focus mainly on network layout and analysis, ignoring the geographical aspects. We have experimented with NetworkX and the matplotlib basemap toolkit. The basemap toolkit extends matplotlib by offering functionality for plotting data on 2D maps. It offers 23 different projections and comes with the GEOS library for data pertaining to country borders, shorelines, and rivers. Moreover, it can translate GPS coordinates to x-y coordinates in the figure, depending on the projection used. Another option is to use Google Maps and transform the networks into Keyhole Markup Language (KML) files. KML is an XML grammar and file format for displaying geographic data in Google Earth and Google Maps, and it can be used to store points, lines, images, and similar elements. KML files can be generated directly from Python by using pyKML, or through the Geospatial Data Abstraction Library (GDAL/OGR) and its Python bindings. This last option can transform geospatial network data not only to the KML format but also to files that can be used by a wide variety of vector and raster GIS. Three examples, drawn from European nanotechnology and world science, are provided for illustrative purposes.
The examples differ according to their range of focus. The first example, world science, draws upon the country as the unit of analysis. The second example, European nanotechnology districts, looks less to networks and more towards regional density and knowledge agglomeration. The third and final example involves European collaboration networks, where organizations are the key nodes in the geographical network.
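As an illustration of the KML route for an organization-level network such as the third example, the sketch below writes nodes as KML points and collaboration links as KML lines. It uses only the Python standard library rather than pyKML or GDAL/OGR. The organization labels, the approximate coordinates, and the function name `network_to_kml` are illustrative assumptions, not the data or code behind the examples in this paper.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def _tag(name):
    """Qualify an element name with the KML namespace."""
    return f"{{{KML_NS}}}{name}"


def network_to_kml(nodes, edges):
    """Serialize a geographic network as a KML string.

    nodes maps a name to (latitude, longitude); edges is a list of name pairs.
    """
    ET.register_namespace("", KML_NS)  # emit KML as the default namespace
    kml = ET.Element(_tag("kml"))
    doc = ET.SubElement(kml, _tag("Document"))
    for name, (lat, lon) in nodes.items():
        placemark = ET.SubElement(doc, _tag("Placemark"))
        ET.SubElement(placemark, _tag("name")).text = name
        point = ET.SubElement(placemark, _tag("Point"))
        # KML writes coordinates as longitude,latitude,altitude.
        ET.SubElement(point, _tag("coordinates")).text = f"{lon},{lat},0"
    for a, b in edges:
        placemark = ET.SubElement(doc, _tag("Placemark"))
        ET.SubElement(placemark, _tag("name")).text = f"{a} - {b}"
        line = ET.SubElement(placemark, _tag("LineString"))
        coords = " ".join(f"{nodes[n][1]},{nodes[n][0]},0" for n in (a, b))
        ET.SubElement(line, _tag("coordinates")).text = coords
    body = ET.tostring(kml, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


# Illustrative organizations with approximate (latitude, longitude) of their cities.
organizations = {
    "Org in Delft": (52.01, 4.36),
    "Org in Grenoble": (45.19, 5.72),
    "Org in Dresden": (51.05, 13.74),
}
links = [("Org in Delft", "Org in Grenoble"), ("Org in Delft", "Org in Dresden")]
kml_text = network_to_kml(organizations, links)
print(kml_text)
```

Saving `kml_text` to a file with a `.kml` extension, encoded as UTF-8, should give a file that can be opened in Google Earth. Line widths or colors reflecting collaboration strength would require additional KML style elements, which are omitted here.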