Journal of Data and Information Science ›› 2021, Vol. 6 ›› Issue (3): 99-122. doi: 10.2478/jdis-2021-0024

• Research Papers

Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering

Sahand Vahidnia, Alireza Abbasi, Hussein A. Abbass

  1. University of New South Wales, Canberra, 2612, ACT, Australia
  • Received: 2020-11-30 Revised: 2021-04-26 Accepted: 2021-04-26 Online: 2021-08-20 Published: 2021-06-09
  • Contact: Sahand Vahidnia


Purpose: Detecting research fields or topics and understanding their dynamics helps the scientific community make decisions about establishing scientific fields. It also supports better collaboration with governments and businesses. This study investigates the development of research fields over time by framing it as a topic detection problem.

Design/methodology/approach: We propose a modified deep clustering method to detect research trends from the abstracts and titles of academic documents. Document embedding approaches are used to transform documents into vector-based representations. The proposed method is evaluated on a benchmark dataset against combinations of different embedding and clustering approaches and against classical topic modeling algorithms (i.e., LDA). A case study is also conducted that explores the evolution of Artificial Intelligence (AI) by detecting the research topics or sub-fields in AI-related publications.

Findings: Clustering performance indicators show that the proposed method outperforms similar approaches on the benchmark dataset. Using the proposed method, we also show how the topics have evolved over the past 30 years, using a keyword extraction method for cluster tagging and labeling to convey the context of the topics.

Research limitations: We found that no single solution generalizes to all downstream tasks. Hence, the solutions must be fine-tuned or optimized for each task and even each dataset. In addition, the interpretation of cluster labels can be subjective and vary with the reader's opinion. It is also very difficult to evaluate the labeling techniques, which further limits the explanation of the clusters.

Practical implications: The case study shows, in a real-world example, how the proposed method enables researchers and reviewers of academic research to detect, summarize, analyze, and visualize research topics from decades of academic documents. This helps the scientific community and all related organizations analyze the fields quickly and effectively by establishing and explaining the topics.

Originality/value: We introduce a modified and tuned deep embedding clustering method coupled with Doc2Vec representations for topic extraction. We also use a concept extraction method as a labeling approach. The effectiveness of the method is evaluated in a case study of AI publications, in which we analyze AI topics over the past three decades.
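
As a rough illustration of the clustering step that deep embedding clustering builds on, the sketch below computes the soft cluster assignments and sharpened target distribution of the standard (unmodified) formulation. It is not the modified and tuned variant introduced in the paper, which the abstract does not specify. The function names, the toy 2-D vectors standing in for Doc2Vec document embeddings, and the fixed centroids are all assumptions made for this example.

```python
import math


def soft_assign(embeddings, centroids, alpha=1.0):
    """Student's t soft assignment q[i][j] of embedding i to centroid j."""
    q = []
    for z in embeddings:
        row = []
        for mu in centroids:
            dist_sq = sum((a - b) ** 2 for a, b in zip(z, mu))
            row.append((1.0 + dist_sq / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])  # normalize each row to sum to 1
    return q


def target_distribution(q):
    """Sharpened targets p[i][j] proportional to q[i][j]^2 / f[j],
    where f[j] = sum_i q[i][j] is the soft cluster frequency."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_divergence(p, q):
    """KL(P || Q) summed over all documents and clusters (the clustering loss)."""
    return sum(
        pij * math.log(pij / qij)
        for p_row, q_row in zip(p, q)
        for pij, qij in zip(p_row, q_row)
        if pij > 0
    )


# Hypothetical toy data: 2-D vectors in place of real Doc2Vec embeddings.
embeddings = [[0.0, 0.1], [0.2, 0.0], [2.0, 2.1], [2.2, 1.9]]
centroids = [[0.0, 0.0], [2.0, 2.0]]  # assumed initial centroids

q = soft_assign(embeddings, centroids)
p = target_distribution(q)
loss = kl_divergence(p, q)
# Hard topic label per document: the cluster with the highest soft assignment.
labels = [max(range(len(row)), key=row.__getitem__) for row in q]
```

In a full pipeline the loss would be minimized by gradient descent over both the encoder weights and the centroids. Here the centroids are fixed, so the sketch shows only a single evaluation of the assignments and the loss.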

Key words: Dynamics of science, Science mapping, Document clustering, Artificial intelligence, Deep learning