The Access and Benefit-Sharing Patent Index: Large Scale Text Mining for Biodiversity using High End Computing

Extended Abstract – NEW INDICATORS session at “1st Global TechMining Conference” 2011

Author(s): P. Oldham and S. Hall (CESAGEN, Faculty of Arts and Social Science, Lancaster University)

We have developed an index of biological species names appearing in the USPTO and PCT patent collections, which together comprise 9 million patent documents. The Access and Benefit-Sharing Patent Index (ABSPAT) is intended to assist countries with monitoring trends in research and development involving genetic resources under the Nagoya Protocol to the United Nations Convention on Biological Diversity. The ABSPAT index will also provide important insights into the role of biodiversity in developments in science and technology.

The ABSPAT index was developed using a multi-phased approach. The generation phase involved creating a set of search patterns from the 1.9 million Latin species names derived from the Species 2000 & ITIS Catalogue of Life Annual Check-list 2011. These patterns were captured in regular expressions (RE) created using atomic grouping to minimise the number of character matches required of the RE engine. The regular expressions were delineated using an expression grouping technique based on the Levenshtein distance to promote adjacent term clustering. Finally, we optimised the search patterns by adopting a two-phase matcher to cluster expressions on both the genus and species terms.
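The two-phase matching idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the four-name list, the function names and the pattern layout are assumptions made for illustration, and atomic grouping and Levenshtein-based grouping are left out to keep the example short. Phase one uses a single regular expression to find any known genus followed by a candidate epithet; phase two accepts the hit only if that epithet is recorded under that genus.

```python
import re
from collections import defaultdict

# Hypothetical miniature check-list; the real list holds about 1.9 million names.
SPECIES_NAMES = [
    "Escherichia coli",
    "Escherichia albertii",
    "Zea mays",
    "Oryza sativa",
]


def build_genus_table(names):
    """Group species epithets under their genus name."""
    table = defaultdict(set)
    for name in names:
        genus, epithet = name.split(" ", 1)
        table[genus].add(epithet)
    return table


def build_genus_pattern(table):
    """Phase 1 pattern: any known genus followed by a lower-case word."""
    genera = "|".join(re.escape(genus) for genus in sorted(table))
    return re.compile(r"\b(" + genera + r")\s+([a-z-]+)\b")


def find_species(text, table, pattern):
    """Phase 2: keep a genus hit only if the epithet belongs to that genus."""
    hits = []
    for match in pattern.finditer(text):
        genus, epithet = match.group(1), match.group(2)
        if epithet in table[genus]:
            hits.append((genus + " " + epithet, match.start()))
    return hits
```

Calling find_species on a sentence that mentions Escherichia coli and Zea mays returns one (name, offset) pair per mention, while a phrase such as "Escherichia species" passes phase one but is rejected in phase two. Splitting the work this way keeps the regular expression proportional to the number of genera rather than the number of species.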

The indexing phase required the use of the Lancaster University High End Computing facility (HEC) to enable the timely scanning of 9 million documents. A map-reduce technique was used: the document files were first mapped into partitions so that the indexing task could be divided among the 400 parallel cores. Each core produced an individual index, and these were in turn reduced to a master index. Each entry in a master index mapped a species name to a patent identifier, an occurrence segment indicator (title, abstract, description or claims) and an optional match-context string.
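The map and reduce steps can be sketched in a few lines of Python. This is an illustrative sketch only: the documents, identifiers and function names are hypothetical, a thread pool stands in for the HEC cores, and a plain substring search stands in for the regular-expression indexer.

```python
from concurrent.futures import ThreadPoolExecutor

SPECIES = ("Escherichia coli", "Zea mays")  # hypothetical tiny name list

# Hypothetical documents: patent identifier -> {segment: text}
DOCS = {
    "US0000001": {"title": "Zea mays hybrid", "claims": "A Zea mays plant."},
    "US0000002": {"abstract": "Expression in Escherichia coli."},
    "US0000003": {"title": "A fastening device"},
}


def partition(items, n_workers):
    """Map: split the document list into one chunk per worker."""
    return [items[i::n_workers] for i in range(n_workers)]


def index_chunk(doc_ids):
    """Build the partial index for one chunk of documents."""
    records = []
    for doc_id in doc_ids:
        for segment, text in DOCS[doc_id].items():
            for name in SPECIES:
                pos = text.find(name)
                if pos != -1:
                    # Keep a short window of surrounding text as match context.
                    context = text[max(0, pos - 20):pos + len(name) + 20]
                    records.append((name, doc_id, segment, context))
    return records


def reduce_indexes(partials):
    """Reduce: merge the per-worker indexes into one sorted master index."""
    return sorted(record for part in partials for record in part)


def build_master_index(n_workers=2):
    chunks = partition(sorted(DOCS), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(index_chunk, chunks))
    return reduce_indexes(partials)
```

Each record has the shape (species name, patent identifier, segment, context). Because the chunks share no state, the same functions could be dispatched to separate processes or cluster jobs without changing the reduce step.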

A cleaning phase was required for two reasons. Firstly, the Catalogue of Life source data was insufficient to capture all known Latin species names. To expand the scope of species capture we performed a second indexing phase to capture occurrences based on the genus name only (e.g. Escherichia rather than Escherichia coli). Secondly, species are often referred to by abbreviations; for example, Escherichia coli is most commonly known as E. coli. During generation we included abbreviations in the search patterns, hence the master indexes also included abbreviations that needed to be resolved into actual species. Both the genus-based index and the abbreviations within the species index contained unwanted noise such as common English words. An anti-thesaurus was applied to both indexes using a grep-like tool based on the same algorithms as the indexer. Non-abbreviations were checked against the online Global Names Index. The majority of the abbreviations were resolved by looking for a related full species binomial in the same patent document, then in a list of top known terms and finally in the Catalogue of Life check-list. Some abbreviations were resolved manually and a small proportion have yet to be resolved.
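The order in which abbreviations were resolved can be sketched as a simple cascade. The following Python sketch is an assumption-laden illustration, not the tool used in the project: the function names are hypothetical, and the rule that a source is only trusted when it yields exactly one candidate is our own simplification.

```python
def abbreviate(binomial):
    """Turn 'Escherichia coli' into 'E. coli'."""
    genus, epithet = binomial.split(" ", 1)
    return genus[0] + ". " + epithet


def resolve(abbrev, same_doc_binomials, top_known, checklist):
    """Try each source in turn; accept a source only if it is unambiguous.

    Sources are consulted in the order described above: full binomials in
    the same patent document, then a list of top known terms, then the
    full check-list. Returns None when the abbreviation stays unresolved.
    """
    for source in (same_doc_binomials, top_known, checklist):
        candidates = sorted({b for b in source if abbreviate(b) == abbrev})
        if len(candidates) == 1:
            return candidates[0]
    return None
```

For example, "E. coli" resolves to Escherichia coli when that binomial occurs in the same document, but remains unresolved (None) if the only available source lists both Escherichia coli and Entamoeba coli; such cases would be left for manual review.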

The federation phase consisted of loading the master indexes into a relational database to be cross-referenced with the EPO World Patent Statistical Database (PATSTAT). This allowed us to analyse the relationships between species, publications, patent families, assignees, citations and technology areas. Using web services provided by the Global Biodiversity Information Facility (GBIF), itself a federation of biological databases, we were able to enrich the ABSPAT index with taxonomic and geographic attributes such as biological kingdoms, distribution and actual species occurrences.
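The kind of cross-referencing this enables can be illustrated with an in-memory SQLite database. The sketch below is hypothetical throughout: the two tables, their columns and the sample rows are simplified stand-ins invented for illustration and do not reproduce the PATSTAT schema or the ABSPAT database.

```python
import sqlite3


def families_per_species():
    """Count distinct patent families per species in a toy database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE species_index (species TEXT, publication_id TEXT, segment TEXT);
        CREATE TABLE publication (
            publication_id TEXT PRIMARY KEY, family_id INTEGER, assignee TEXT
        );
        """
    )
    # Hypothetical master-index rows: species, publication, segment.
    conn.executemany(
        "INSERT INTO species_index VALUES (?, ?, ?)",
        [
            ("Zea mays", "US1", "title"),
            ("Zea mays", "US1", "claims"),
            ("Zea mays", "EP2", "claims"),
            ("Zea mays", "US3", "abstract"),
            ("Escherichia coli", "US3", "abstract"),
        ],
    )
    # Hypothetical bibliographic rows: US1 and EP2 belong to the same family.
    conn.executemany(
        "INSERT INTO publication VALUES (?, ?, ?)",
        [("US1", 10, "A Corp"), ("EP2", 10, "A Corp"), ("US3", 11, "B Ltd")],
    )
    rows = conn.execute(
        """
        SELECT s.species, COUNT(DISTINCT p.family_id) AS families
        FROM species_index AS s
        JOIN publication AS p ON p.publication_id = s.publication_id
        GROUP BY s.species
        ORDER BY families DESC, s.species
        """
    ).fetchall()
    conn.close()
    return rows
```

Counting distinct families rather than publications avoids counting the same invention once per jurisdiction; in this toy data Zea mays appears in four index rows but only two patent families.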

To ensure the validity of our work we regularly checked the results. To verify the correctness of the indexing phase we performed sample checks of referenced publications using the Thomson Innovation online patent database. Using VantagePoint we manually reviewed the species list obtained from the Catalogue of Life and the non-resolved abbreviations from the master index. We also verified that the GBIF distribution data obtained through web services was consistent with the content of the GBIF data portal.

Our primary results are a set of interactive workbooks published in Tableau that allow the exploration and analysis of biodiversity in the patent system. The data are directly linked to a report version of the ABSPAT index, allowing the information to be filtered. Additional outputs from our work include network graphs, manipulated with the Gephi tool to provide further analysis of the results.

Our research is directed at public policy analysis for the global issue of access to genetic resources and benefit-sharing. As large scale datasets and high end computing become increasingly accessible to researchers, we also hope to contribute to best practice in text mining and visualization of patent data.