Topic modeling is a statistical machine-learning technique for discovering the latent “topics” that occur in a collection of documents. Latent Dirichlet allocation (LDA) is currently a widely used approach. In this paper, we investigate methods, including LDA and its extensions, for separating a set of scientific publications into several clusters. To evaluate the results, we assemble a collection of academic papers from several different fields and check whether papers from the same field are clustered together. We also explore potential scientometric applications of such text analysis capabilities.
Author(s): Chyi-Kwei Yau, Alan Porter, Nils Newman, and Arho Suominen
Organization(s): Georgia Institute of Technology, Search Technology, Inc., VTT Technical Research Centre of Finland
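
The evaluation idea in the abstract, checking whether papers from the same field end up in the same cluster, can be illustrated with a small sketch. The abstract does not say which measure the authors use, so cluster purity is assumed here purely for illustration. The function name `cluster_purity` and the sample field labels and cluster assignments are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(fields, clusters):
    """Return the fraction of documents that share the majority field of their cluster.

    fields   -- known field label for each document (the ground truth)
    clusters -- cluster id assigned to each document by a topic model
    """
    if len(fields) != len(clusters):
        raise ValueError("fields and clusters must have the same length")
    if not fields:
        raise ValueError("at least one document is required")

    # Group the known field labels by the cluster each document was assigned to.
    by_cluster = defaultdict(Counter)
    for field, cluster in zip(fields, clusters):
        by_cluster[cluster][field] += 1

    # For each cluster, count the documents belonging to its most common field.
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(fields)


# Hypothetical data: six papers from three fields, assigned to three clusters.
fields = ["physics", "physics", "biology", "biology", "economics", "economics"]
clusters = [0, 0, 1, 1, 2, 1]
purity = cluster_purity(fields, clusters)
```

A purity of 1.0 would mean that every cluster contains papers from a single field. Purity rises trivially as the number of clusters grows, so it is most informative when the number of clusters is fixed close to the number of fields.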