Journal of Data and Information Science ›› 2022, Vol. 7 ›› Issue (2): 6-30. doi: 10.2478/jdis-2022-0008

• Research Papers •

Extracting and Measuring Uncertain Biomedical Knowledge from Scientific Statements

Xin Guo1,2,3, Yuming Chen5, Jian Du3,7, Erdan Dong1,2,3,4,8

    1Department of Cardiology and Institute of Vascular Medicine, Peking University Third Hospital, Beijing, China
    2NHC Key Laboratory of Cardiovascular Molecular Biology and Regulatory Peptides, Beijing, China
    3Key Laboratory of Molecular Cardiovascular Science, Ministry of Education, Beijing, China
    4Beijing Key Laboratory of Cardiovascular Receptors Research, Beijing, China
    5Department of Epidemiology and Biostatistics, School of Public Health, Peking University, Beijing, China
    6Medical Informatics Center, Peking University, Beijing, China
    7National Institute of Health Data Science, Peking University, Beijing, China
    8Institute of Cardiovascular Sciences, Peking University, Beijing, China
  • Received: 2021-10-25 Revised: 2022-02-28 Accepted: 2022-03-05 Online: 2022-05-20 Published: 2022-04-19
  • Contact: Jian Du


Purpose: Given the information overload of scientific literature, there is an increasing need for computable biomedical knowledge buried in free text. This study aimed to develop a novel approach to extracting and measuring uncertain biomedical knowledge from scientific statements.

Design/methodology/approach: Taking cardiovascular research publications in China as a sample, we extracted subject-predicate-object triples (SPO triples) as knowledge units and unknown/hedging/conflicting uncertainties as the knowledge context. We introduced information entropy (IE) as a potential metric to quantify the uncertainty of the epistemic status of scientific knowledge represented at the subject-object pair (SO pair) level.
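As an illustration of how IE could be computed at the SO pair level, the following is a minimal sketch, not the authors' implementation. It assumes that each statement mentioning an SO pair has already been assigned one epistemic status label, and that IE is the Shannon entropy (base 2) of the label distribution for that pair. The label set ("certain", "unknown", "hedging", "conflicting"), the function names, and the example entities are hypothetical; the abstract does not specify them.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    # H = sum over labels of p * log2(1 / p), with p the share of each label
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def entropy_by_so_pair(statements):
    """Group (subject, object, status) records by SO pair and score each pair."""
    grouped = defaultdict(list)
    for subject, obj, status in statements:
        grouped[(subject, obj)].append(status)
    return {pair: shannon_entropy(labels) for pair, labels in grouped.items()}


# Hypothetical toy data: one record per statement mentioning the SO pair.
statements = [
    ("gene_A", "disease_X", "certain"),
    ("gene_A", "disease_X", "hedging"),
    ("gene_A", "disease_X", "conflicting"),
    ("gene_A", "disease_X", "unknown"),
    ("drug_B", "disease_Y", "certain"),
    ("drug_B", "disease_Y", "certain"),
]

ie = entropy_by_so_pair(statements)
for pair, value in sorted(ie.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(value, 3))
```

Under these assumptions, a pair whose statements are spread evenly across several epistemic statuses scores high, while a pair described with a single status scores zero. Ranking pairs by this score is one way the highest-IE SO pairs could be surfaced.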

Findings: The results indicated an extraordinary growth of cardiovascular publications in China but only a modest growth of novel SPO triples. After evaluating the uncertainty of biomedical knowledge with IE, we identified the top 10 SO pairs with the highest IE, which implied pluralism of epistemic status. Visual presentation of the SO pairs overlaid with uncertainty provided a comprehensive overview of clusters of biomedical knowledge and contending topics in cardiovascular research.

Research limitations: The current methods did not distinguish the specificity and probabilities of uncertainty cue words. The number of sentences surrounding a given triple may also influence the value of IE.

Practical implications: Our approach identified major uncertain knowledge areas such as diagnostic biomarkers, genetic polymorphism, and co-existing risk factors related to cardiovascular diseases in China. We suggest that these areas be prioritized: new hypotheses need to be verified, while disputes, conflicts, and contradictions need to be settled.

Originality/value: We provided a novel approach by combining natural language processing and computational linguistics with informetric methods to extract and measure uncertain knowledge from scientific statements.

Key words: Uncertain knowledge, Information entropy, Natural language processing, Cardiovascular diseases, China