2006 DISSERTATION: Scaling the Technology Opportunity Analysis Text Data Mining Methodology: Data Extraction, Cleaning, Online Analytical Processing Analysis, and Reporting of Large Multi-source Datasets

Because existing applications of the Technology Opportunity Analysis (TOA) text data mining framework developed by Alan Porter and other researchers used small datasets, previous research never pushed the limits of the methodology and failed to identify areas for future research associated with using larger datasets. This research developed extensions to the TOA framework to improve its performance and scalability and proved that the framework could be successfully scaled to analyze large datasets. The work included the development of a comprehensive set of new or significantly improved data extraction filters and data cleaning thesauruses; a data model and architecture based on relational database and online analytical processing (OLAP) technologies that provides an open platform with easy, standards-compliant access for browsing, reporting, and data mining software that supports either SQL or MDX queries; and a report distribution framework that does not require end users of TOA output to use any specialized or prohibitively expensive client applications beyond the standard Microsoft Office applications and Adobe Acrobat Reader. In addition, it demonstrated that the time necessary to complete the data acquisition, cleaning, and transformation tasks can be reduced by at least 75% by creating libraries of import filters for commonly used data sources, eliminating unnecessary steps, using 64-bit native databases and extraction filters, improving the data model and architecture, and using significantly better data cleaning thesauruses. This work is significant because it enables a variety of research paths applying alternative statistical or data mining algorithms that previously would have been impossible to undertake.
Thesauruses and fuzzy logic routines to clean and group the data are presented, and their accuracy is tested on gene expression, energy storage, photovoltaics, smart materials, bioinformatics, quantum computing, wind turbine, nanotube, global warming, and data fusion datasets and benchmarked against existing thesauruses and fuzzy logic routines. A database on photovoltaic solar cell research that integrates 116,240 records from thirteen bibliographic, patent, and funding abstract databases was used to illustrate the concepts developed and tested in this dissertation.
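
As a rough illustration of what thesaurus-based cleaning combined with fuzzy matching can look like, here is a minimal Python sketch. It is not the dissertation's implementation: the `THESAURUS` entries, the `clean_term` and `group_terms` helpers, and the 0.85 similarity threshold are all assumptions made up for this example, and the similarity measure is the standard library's `difflib.SequenceMatcher`, not the routines benchmarked in the work.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical thesaurus: known variant -> preferred term (illustrative only).
THESAURUS = {
    "pv": "photovoltaics",
    "photovoltaic": "photovoltaics",
    "photo-voltaic": "photovoltaics",
    "solar cells": "solar cell",
}


def normalize(term):
    """Lowercase the term and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def clean_term(term, preferred, threshold=0.85):
    """Map a raw term to a preferred term.

    An exact thesaurus hit wins. Otherwise the most similar preferred term
    is used if its similarity ratio meets the (assumed) threshold; if none
    does, the normalized term is returned unchanged.
    """
    key = normalize(term)
    if key in THESAURUS:
        return THESAURUS[key]
    best, best_score = key, 0.0
    for candidate in preferred:
        score = SequenceMatcher(None, key, candidate).ratio()
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else key


def group_terms(raw_terms, preferred):
    """Count records per cleaned term."""
    return Counter(clean_term(t, preferred) for t in raw_terms)


if __name__ == "__main__":
    preferred_terms = ["photovoltaics", "solar cell", "nanotube"]
    raw = ["PV", "Photo-voltaic", "nanotubes", "Solar  Cells", "quantum computing"]
    for term, count in group_terms(raw, preferred_terms).items():
        print(term, count)
```

In this sketch the thesaurus handles known variants cheaply, and the fuzzy step only catches near misses such as plurals. A threshold that is too low would merge unrelated terms, which is presumably why accuracy has to be tested per dataset as the abstract describes.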

Doctoral candidate: Richard Peyton George
University: Capella University
Degree program: Doctor of Philosophy – Due Diligence / Data Mining
Year: 2006

