Research Paper published in JDIS
Progress and Knowledge Transfer from Science to Technology in the Research Frontier of CRISPR Based on the LDA Model
Yushuang Lyu, Muqi Yin, Fangjie Xi, Xiaojun Hu
Journal of Data and Information Science    2022, 7 (1): 1-19.   doi:10.2478/jdis-2022-0004

Purpose: This study explores the underlying research topics regarding CRISPR based on the LDA model and identifies trends in knowledge transfer from science to technology in this area over the last 10 years.
Design/methodology/approach: We collected publications on CRISPR between 2011 and 2020 from the Web of Science and traced all the patents citing them; 15,904 articles and 18,985 patents in total were downloaded and analyzed. The LDA model was applied to identify underlying research topics in related research. In addition, several indicators were introduced to measure the knowledge transfer from research topics of scientific publications to IPC-4 classes of patents.
Findings: The emerging research topics on CRISPR were identified and their evolution over time was displayed. Furthermore, a big picture of knowledge transition from research topics to technological classes of patents was presented. We found that, across all topics on CRISPR, the average first transition year, the ratio of articles cited by patents, and the NPR transition rate are 1.08 years, 15.57%, and 1.19, respectively, markedly shorter and more intensive than those of general fields. Moreover, the transition patterns differ among research topics.
Research limitations: Our research is limited to publications retrieved from the Web of Science and the patents citing them. A limitation inherent in LDA analysis lies in the manual interpretation and labeling of “topics”.
Practical implications: Our study provides good references for policy-makers on allocating scientific resources and regulating financial budgets to face challenges related to the transformative technology of CRISPR.
Originality/value: The LDA model is applied here to topic identification in the area of transformative research for the first time, as exemplified by CRISPR. Additionally, the dataset of all citing patents in this area helps to provide a full picture for detecting the knowledge transition between science and technology.

Table and Figures | Reference | Related Articles | Metrics
The Three-Step Workflow: A Pragmatic Approach to Allocating Academic Hospitals’ Affiliations for Bibliometric Purposes
Andrea Reyes Elizondo, Clara Calero-Medina, Martijn S. Visser
Journal of Data and Information Science    2022, 7 (1): 20-36.   doi:10.2478/jdis-2022-0006

Purpose: A key question when ranking universities is whether or not to allocate the publication output of affiliated hospitals to universities. This paper presents a method for classifying the varying degrees of interdependency between academic hospitals and universities in the context of the Leiden Ranking.
Design/methodology/approach: Hospital nomenclatures vary worldwide to denote some form of collaboration with a university; however, they do not correspond to universally standard definitions. Thus, rather than seeking a normative definition of academic hospitals, we propose a three-step workflow that aligns the university-hospital relationship with one of three general models: full integration of the hospital and the medical faculty into a single organization; health science centres in which hospitals and medical faculty remain separate entities albeit within the same governance structure; and structures in which universities and hospitals are separate entities that collaborate with one another. This classification system provides a standard through which publications that mention affiliations with academic hospitals can be better allocated.
Findings: In the paper we illustrate how the three-step workflow effectively translates the three above-mentioned models into two types of instrumental relationships for the assignation of publications: “associate” and “component”. When a hospital and a medical faculty are fully integrated or when a hospital is part of a health science centre, the relationship is classified as component. When a hospital follows the model of collaboration and support, the relationship is classified as associate. The compilation of data following these standards allows for a more uniform comparison between worldwide educational and research systems.
Research limitations: The workflow is resource intensive, depends heavily on the information provided by universities and hospitals, and is more challenging for languages that use non-Latin characters. Further, the application of the workflow demands a careful evaluation of different types of input, which can result in ambiguity and makes it difficult to automate.
Practical implications: Determining the type of affiliation an academic hospital has with a university can have a substantial impact on the publication counts for universities. This workflow can also aid in analysing collaborations between the two types of organizations.
Originality/value: The three-step workflow is a unique way to establish the type of relationship an academic hospital has with a university while accounting for national and regional differences in nomenclature.

Academic Collaborator Recommendation Based on Attributed Network Embedding
Ouxia Du, Ya Li
Journal of Data and Information Science    2022, 7 (1): 37-56.   doi:10.2478/jdis-2022-0005

Purpose: Based on real-world academic data, this study aims to use network embedding technology to mine academic relationships and to investigate the effectiveness of the proposed embedding model on academic collaborator recommendation tasks.
Design/methodology/approach: We propose an academic collaborator recommendation model based on attributed network embedding (ACR-ANE), which obtains enhanced scholar embeddings by taking full advantage of the topological structure of the network and multi-type scholar attributes. Non-local neighbors are defined for scholars to capture strong relationships among them. A deep auto-encoder is adopted to encode the academic collaboration network structure and scholar attributes into a low-dimensional representation space.
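The encoding idea can be sketched with a plain linear auto-encoder over adjacency rows (a minimal illustration on made-up data; the actual ACR-ANE architecture, attribute handling, and training procedure are the authors'):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical symmetric co-authorship counts for 6 scholars
A = np.array([
    [0, 2, 1, 0, 0, 0],
    [2, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 2, 1],
    [0, 0, 0, 2, 0, 2],
    [0, 0, 0, 1, 2, 0],
], dtype=float)
# Scholar attributes could be concatenated column-wise: np.hstack([A, attrs])

d = 2                                    # embedding dimension
W_enc = rng.normal(0, 0.1, (6, d))
W_dec = rng.normal(0, 0.1, (d, 6))
lr = 0.01

for _ in range(2000):                    # gradient descent on squared reconstruction error
    H = A @ W_enc                        # low-dimensional codes
    E = H @ W_dec - A                    # reconstruction error
    g_dec = H.T @ E                      # d(loss)/d(W_dec)
    g_enc = A.T @ (E @ W_dec.T)          # d(loss)/d(W_enc)
    W_dec -= lr * g_dec
    W_enc -= lr * g_enc

embedding = A @ W_enc                    # final scholar embeddings
print(embedding.shape)                   # → (6, 2)
```

Candidate collaborators can then be ranked by similarity between embedding rows.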
Findings: 1. The proposed non-local neighbors describe real-world relationships among scholars better than first-order neighbors. 2. It is important to simultaneously consider the structure of the academic collaboration network and scholar attributes when recommending collaborators for scholars.
Research limitations: The designed method works for static networks, without taking account of network dynamics.
Practical implications: The designed model embeds the academic collaboration network structure and scholar attributes, and can be used to recommend potential collaborators to scholars.
Originality/value: Experiments on two real-world scholarly datasets, Aminer and APS, show that our proposed method performs better than other baselines.

Parameterless Pruning Algorithms for Similarity-Weight Network and Its Application in Extracting the Backbone of Global Value Chain
Lizhi Xing, Yu Han
Journal of Data and Information Science    2022, 7 (1): 57-75.   doi:10.2478/jdis-2022-0002
Accepted: 06 December 2021


Purpose: With the availability and utilization of Inter-Country Input-Output (ICIO) tables, it is possible to construct quantitative indices to assess impacts on the Global Value Chain (GVC). For the sake of visualization, however, ICIO networks contain an enormous number of low-weight edges and are too dense to show their substantial structure. These redundant edges inevitably fill the network data with noise and eventually exert negative effects on Social Network Analysis (SNA). In this case, we need a method to filter such edges and obtain a sparser network with only the meaningful connections.
Design/methodology/approach: In this paper, we propose two parameterless pruning algorithms, from the global and local perspectives respectively, and then examine their performance using ICIO tables from different databases.
Findings: The Searching Paths (SP) method extracts the strongest association paths from the global perspective, while the Filtering Edges (FE) method captures the key links according to the local weight ratio. The results show that the FE method largely subsumes the SP method and is the best solution for ICIO networks.
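To give a flavor of local, weight-ratio-based filtering: the SP and FE algorithms themselves are parameterless and defined in the paper, so the sketch below uses an explicit threshold and made-up weights purely for illustration.

```python
# Hypothetical weighted links (ICIO-like edge list)
edges = {
    ("A", "B"): 10.0, ("A", "C"): 0.2, ("B", "C"): 5.0,
    ("C", "D"): 0.1, ("B", "D"): 4.0,
}

# Node strength: total weight incident to each node
strength = {}
for (u, v), w in edges.items():
    strength[u] = strength.get(u, 0.0) + w
    strength[v] = strength.get(v, 0.0) + w

def prune(edges, strength, ratio=0.1):
    """Keep an edge only if its weight is a meaningful share of an endpoint's strength."""
    return {
        (u, v): w for (u, v), w in edges.items()
        if w >= ratio * min(strength[u], strength[v])
    }

backbone = prune(edges, strength)
print(sorted(backbone))   # low-weight edges (A, C) and (C, D) are filtered out
```

The pruned edge set is the kind of sparse backbone on which community detection or link prediction then becomes tractable.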
Research limitations: There are still two limitations in this research. One is that the computational complexity may increase rapidly when processing large-scale networks, so the proposed method should be further improved. The other is that more empirical networks should be introduced to verify the scientific validity and practicability of our methodology.
Practical implications: The network pruning methods we proposed will promote the analysis of the ICIO network, in terms of community detection, link prediction, and spatial econometrics, etc. Also, they can be applied to many other complex networks with similar characteristics.
Originality/value: This paper improves the existing research from two aspects, namely, considering the heterogeneity of weights and avoiding the interference of parameters. Therefore, it provides a new idea for the research of network backbone extraction.

The Roles of Female Involvement and Risk Aversion in Open Access Publishing Patterns in Vietnamese Social Sciences and Humanities
Minh-Hoang Nguyen, Huyen Thanh Thanh Nguyen, Manh-Toan Ho, Tam-Tri Le, Quan-Hoang Vuong
Journal of Data and Information Science    2022, 7 (1): 76-96.   doi:10.2478/jdis-2022-0001
Accepted: 06 December 2021


Purpose: The open-access (OA) publishing model can help improve researchers’ outreach, thanks to its accessibility and visibility to the public. Therefore, the representation of female researchers can benefit from the OA publishing model. Despite that, little is known about how gender affects OA practices. Thus, the current study explores the effects of female involvement and risk aversion on OA publishing patterns in the Vietnamese social sciences and humanities.
Design/methodology/approach: The study employed Bayesian Mindsponge Framework (BMF) on a dataset of 3,122 Vietnamese social sciences and humanities (SS&H) publications during 2008-2019. The Mindsponge mechanism was specifically used to construct theoretical models, while Bayesian inference was utilized for fitting models.
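The BMF models themselves are the authors'; purely to illustrate the Bayesian updating involved in fitting a publishing probability, here is a conjugate Beta-Binomial posterior on hypothetical counts:

```python
# Hypothetical data: k OA publications observed out of n publications
k, n = 40, 100
a0, b0 = 1.0, 1.0                 # flat Beta(1, 1) prior on the OA probability

# Conjugate update: posterior is Beta(a0 + k, b0 + n - k)
a_post, b_post = a0 + k, b0 + (n - k)
posterior_mean = a_post / (a_post + b_post)
print(round(posterior_mean, 3))   # → 0.402
```

The study's actual models regress OA status on female involvement and journal metrics rather than estimating a single proportion.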
Findings: The result showed a positive association between female participation and OA publishing probability. However, the positive effect of female involvement on OA publishing probability was negated by a high ratio of female researchers in a publication. OA status was negatively associated with the JIF of the journal in which the publication was published, but the relationship was moderated by the involvement of female researcher(s). The findings suggest that Vietnamese female researchers might be more likely to publish under the OA model in journals with high JIF to avoid the risk of public criticism.
Research limitations: The study could only provide evidence on the association between female involvement and OA publishing probability. Whether to publish under OA terms, however, is often determined by the first or corresponding authors, and is not necessarily gender-based.
Practical implications: Systematically coordinated actions are suggested to better support women and promote the OA movement in Vietnam.
Originality/value: The findings show the OA publishing patterns of female researchers in Vietnamese SS&H.

Public Reaction to Scientific Research via Twitter Sentiment Prediction
Murtuza Shahzad, Hamed Alhoori
Journal of Data and Information Science    2022, 7 (1): 97-124.   doi:10.2478/jdis-2022-0003
Accepted: 06 December 2021


Purpose: Social media users share their ideas, thoughts, and emotions with other users. However, it is not clear how online users would respond to new research outcomes. This study aims to predict the nature of the emotions expressed by Twitter users toward scientific publications. Additionally, we investigate what features of the research articles help in such prediction. Identifying the sentiments toward research articles on social media will help scientists gauge the societal impact of their research.
Design/methodology/approach: We applied five sentiment analysis tools to determine which are suitable for capturing a tweet’s sentiment value and selected NLTK VADER and TextBlob. We segregated the sentiment values into negative, positive, and neutral, and measured the mean and median of tweets’ sentiment values for research articles with more than one tweet. We then built machine learning models to predict the sentiments of tweets related to scientific publications and investigated the essential features that controlled the prediction models.
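For the segregation step, VADER-style compound scores lie in [-1, 1], and cutoffs of ±0.05 are the commonly used convention; the per-article aggregation can be sketched as follows (scores here are made up, not taken from the study's tweets):

```python
from statistics import mean, median

def label(compound, eps=0.05):
    """Map a VADER-style compound score to a sentiment label."""
    if compound >= eps:
        return "positive"
    if compound <= -eps:
        return "negative"
    return "neutral"

# Hypothetical compound scores for tweets mentioning one article
scores = [0.6, 0.1, -0.2, 0.0, 0.45]
labels = [label(s) for s in scores]
print(labels)   # → ['positive', 'positive', 'negative', 'neutral', 'positive']
print(round(mean(scores), 3), median(scores))   # → 0.19 0.1
```

The mean or median score per article is then the target the classification models try to predict from article features.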
Findings: We found that the most important feature in all the models was the sentiment of the research article title followed by the author count. We observed that the tree-based models performed better than other classification models, with Random Forest achieving 89% accuracy for binary classification and 73% accuracy for three-label classification.
Research limitations: In this research, we used state-of-the-art sentiment analysis libraries. However, these libraries might vary at times in their sentiment prediction behavior. Tweet sentiment may be influenced by a multitude of circumstances and is not always immediately tied to the paper’s details. In the future, we intend to broaden the scope of our research by employing word2vec models.
Practical implications: Many studies have focused on understanding the impact of science on scientists or how science communicators can improve their outcomes. Research in this area has relied on fewer and more limited measures, such as citations and user studies with small datasets. There is currently a critical need to find novel methods to quantify and evaluate the broader impact of research. This study will help scientists better comprehend the emotional impact of their work. Additionally, the value of understanding the public’s interest and reactions helps science communicators identify effective ways to engage with the public and build positive connections between scientific communities and the public.
Originality/value: This study will extend work on public engagement with science, sociology of science, and computational social science. It will enable researchers to identify areas in which there is a gap between public and expert understanding and provide strategies by which this gap can be bridged.

A Discrimination Index Based on Jain’s Fairness Index to Differentiate Researchers with Identical H-index Values
Adian Fatchur Rochim, Abdul Muis, Riri Fitri Sari
Journal of Data and Information Science    2020, 5 (4): 5-18.   doi:10.2478/jdis-2020-0026

Purpose: This paper proposes a discrimination index method based on Jain’s fairness index to distinguish researchers with the same H-index.

Design/methodology/approach: A validity test is used to measure the correlation of D-offset with the parameters, i.e. H-index, the number of cited papers, the total number of citations, the number of indexed papers, and the number of uncited papers. The correlation test is based on the Shapiro-Wilk method and Pearson’s product-moment correlation.

Findings: The result of the discrimination index calculation is a two-digit decimal value called the discrimination-offset (D-offset), which ranges from 0.00 to 0.99. The correlation between D-offset and the number of uncited papers is 0.35; with the number of indexed papers, 0.24; and with the number of cited papers, 0.27. The tests indicate that it is very unlikely that no relationship exists between the parameters.
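Jain's fairness index is J(x) = (Σxᵢ)² / (n·Σxᵢ²); how the paper maps it onto the two-digit D-offset is the authors' construction, so the sketch below only computes the underlying index on hypothetical citation counts:

```python
def jain_index(xs):
    """Jain's fairness index: 1.0 for a perfectly even distribution, down to 1/n at the extreme."""
    n = len(xs)
    s = sum(xs)
    return s * s / (n * sum(x * x for x in xs))

# Two hypothetical researchers with the same H-index but different citation spreads
cites_a = [10, 10, 10, 10]   # evenly cited papers
cites_b = [37, 1, 1, 1]      # one dominant paper

print(round(jain_index(cites_a), 2))   # → 1.0
print(round(jain_index(cites_b), 2))   # → 0.29
```

A two-decimal value of this kind is what allows two researchers with identical H-index values to be ranked differently.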

Practical implications: For this reason, D-offset is proposed as an additional parameter for H-index to differentiate researchers with the same H-index. The H-index for researchers can be written with the format of “H-index: D-offset”.

Originality/value: D-offset is worth considering as a complement to the H-index. If the D-offset is appended to the H-index, the H-index will have more discriminating power to differentiate the ranks of researchers who have the same H-index.

A Micro Perspective of Research Dynamics Through “Citations of Citations” Topic Analysis
Xiaoli Chen, Tao Han
Journal of Data and Information Science    2020, 5 (4): 19-34.   doi:10.2478/jdis-2020-0034

Purpose: Research dynamics have long been a research interest. It is a macro perspective tool for discovering temporal research trends of a certain discipline or subject. A micro perspective of research dynamics, however, concerning a single researcher or a highly cited paper in terms of their citations and “citations of citations” (forward chaining) remains unexplored.

Design/methodology/approach: In this paper, we use a cross-collection topic model to reveal the research dynamics of topic disappearance, topic inheritance, and topic innovation in each generation of forward chaining.

Findings: For highly cited work, scientific influence exists in indirect citations. Topic modeling can reveal how long this influence persists in forward chaining, as well as its extent.

Research limitations: This paper measures scientific influence and indirect scientific influence only if the relevant words or phrases are borrowed or used in direct or indirect citations. Paraphrasing or semantically similar concepts may be neglected in this research.

Practical implications: This paper demonstrates that a scientific influence exists in indirect citations through its analysis of forward chaining. This can serve as an inspiration on how to adequately evaluate research influence.

Originality: The main contributions of this paper are the following three aspects. First, besides research dynamics of topic inheritance and topic innovation, we model topic disappearance by using a cross-collection topic model. Second, we explore the length and character of the research impact through “citations of citations” content analysis. Finally, we analyze the research dynamics of artificial intelligence researcher Geoffrey Hinton’s publications and the topic dynamics of forward chaining.

Can Crossref Citations Replace Web of Science for Research Evaluation? The Share of Open Citations
Tomáš Chudlarský, Jan Dvořák
Journal of Data and Information Science    2020, 5 (4): 35-42.   doi:10.2478/jdis-2020-0037

Purpose: We study the proportion of Web of Science (WoS) citation links that are represented in the Crossref Open Citation Index (COCI), with the possible aim of using COCI in research evaluation instead of the WoS if the level of coverage were sufficient.

Design/methodology/approach: We calculate the proportion over citation links in which both publications have a WoS accession number and a DOI simultaneously, and in which the cited publication has at least one author from our institution, the Czech Technical University in Prague. We attempt to look up each such citation link in COCI.
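The coverage computation itself reduces to a set intersection over (citing DOI, cited DOI) pairs; COCI can be queried through the OpenCitations REST API, but the offline sketch below uses made-up link sets:

```python
# Hypothetical citation links as (citing DOI, cited DOI) pairs
wos_links = {
    ("10.1/a", "10.1/x"), ("10.1/b", "10.1/x"),
    ("10.1/c", "10.1/y"), ("10.1/d", "10.1/z"),
}
coci_links = {
    ("10.1/a", "10.1/x"), ("10.1/c", "10.1/y"), ("10.1/e", "10.1/w"),
}

covered = wos_links & coci_links           # WoS links also present in COCI
coverage = len(covered) / len(wos_links)
print(f"{coverage:.1%}")                   # → 50.0%
```

Grouping the links by discipline before taking the ratio yields the per-discipline coverage figures the study reports.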

Findings: We find that 53.7% of WoS citation links are present in COCI. The proportion varies largely by discipline. The total figure differs significantly from the 40% found in the large-scale study by Van Eck, Waltman, Larivière, and Sugimoto (2018, blog post).

Research limitations: The sample does not cover all science areas uniformly; it is heavily focused on Engineering and Technology, and only some disciplines of Natural Sciences are present. However, this reflects the real scientific orientation and publication profile of our institution.

Practical implications: The current level of coverage is not sufficient for the WoS to be replaced by COCI for research evaluation.

Originality/value: The present study illustrates a COCI vs WoS comparison on the scale of a larger technical university in Central Europe.

Exploring the Potentialities of Automatic Extraction of University Webometric Information
Gianpiero Bianchi, Renato Bruni, Cinzia Daraio, Antonio Laureti Palma, Giulio Perani, Francesco Scalfati
Journal of Data and Information Science    2020, 5 (4): 43-55.   doi:10.2478/jdis-2020-0040

Purpose: The main objective of this work is to show the potentialities of recently developed approaches for automatic knowledge extraction directly from the universities’ websites. The information automatically extracted can be potentially updated with a frequency higher than once per year, and be safe from manipulations or misinterpretations. Moreover, this approach allows us flexibility in collecting indicators about the efficiency of universities’ websites and their effectiveness in disseminating key contents. These new indicators can complement traditional indicators of scientific research (e.g. number of articles and number of citations) and teaching (e.g. number of students and graduates) by introducing further dimensions to allow new insights for “profiling” the analyzed universities.

Design/methodology/approach: Webometrics relies on web mining methods and techniques to perform quantitative analyses of the web. This study implements an advanced application of the webometric approach, exploiting all three categories of web mining: web content mining, web structure mining, and web usage mining. The information to compute our indicators has been extracted from the universities’ websites by using web scraping and text mining techniques. The scraped information has been stored in a NoSQL DB in a semi-structured form to allow information to be retrieved efficiently by text mining techniques. This provides increased flexibility in the design of new indicators, opening the door to new types of analyses. Some data have also been collected by means of batch interrogations of search engines (Bing) or from a leading provider of web analytics (SimilarWeb). The information extracted from the web has been combined with university structural information taken from the European Tertiary Education Register (ETER), a database collecting information on Higher Education Institutions (HEIs) at the European level. All the above was used to perform a clustering of 79 Italian universities based on structural and digital indicators.
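As a minimal flavor of the content-mining step (using only the Python standard library; the study's actual scraping stack, NoSQL schema, and indicators are not reproduced here), visible text can be pulled out of a fetched page like this:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping script/style content."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

# Inline stand-in for a fetched university page
html = "<html><head><script>var x=1;</script></head>" \
       "<body><h1>Research</h1><p>PhD programmes and labs.</p></body></html>"
p = TextExtractor()
p.feed(html)
print(" ".join(p.chunks))   # → Research PhD programmes and labs.
```

Text mining over such extracted strings is what feeds the website-quality and dissemination indicators.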

Findings: The main findings of this study concern the evaluation of the potential in digitalization of universities, in particular by presenting techniques for the automatic extraction of information from the web to build indicators of quality and impact of universities’ websites. These indicators can complement traditional indicators and can be used to identify groups of universities with common features using clustering techniques working with the above indicators.

Research limitations: The results reported in this study refer to Italian universities only, but the approach could be extended to other university systems abroad.

Practical implications: The approach proposed in this study, and its illustration on Italian universities, shows the usefulness of recently introduced automatic data extraction and web scraping approaches and their practical relevance for characterizing and profiling the activities of universities on the basis of their websites. The approach could be applied to other university systems.

Originality/value: This work applies for the first time to university websites some recently introduced techniques for automatic knowledge extraction based on web scraping, optical character recognition and nontrivial text mining operations (Bruni & Bianchi, 2020).

The Association between Researchers’ Conceptions of Research and Their Strategic Research Agendas
João M. Santos, Hugo Horta
Journal of Data and Information Science    2020, 5 (4): 56-74.   doi:10.2478/jdis-2020-0032

Purpose: In studies of the research process, the association between how researchers conceptualize research and their strategic research agendas has been largely overlooked. This study aims to address this gap.

Design/methodology/approach: This study analyzes this relationship using a dataset of more than 8,500 researchers across all scientific fields and the globe. It studies the associations between the dimensions of two inventories: the Conceptions of Research Inventory (CoRI) and the Multi-Dimensional Research Agenda Inventory—Revised (MDRAI-R).

Findings: The findings show a relatively strong association between researchers’ conceptions of research and their research agendas. While all conceptions of research are positively related to scientific ambition, the findings are mixed regarding how the dimensions of the two inventories relate to one another, which is significant for those seeking to understand the knowledge production process better.

Research limitations: The study relies on self-reported data, which always carries a risk of response bias.

Practical implications: The findings provide a greater understanding of the inner workings of knowledge processes and indicate that the two inventories, whether used individually or in combination, may provide complementary analytical perspectives to research performance indicators. They may thus offer important insights for managers of research environments regarding how to assess the research culture, beliefs, and conceptualizations of individual researchers and research teams when designing strategies to promote specific institutional research focuses and strategies.

Originality/value: To the best of the authors’ knowledge, this is the first study to associate research agendas and conceptions of research. It is based on a large sample of researchers working worldwide and in all fields of knowledge, which ensures that the findings have a reasonable degree of generalizability to the global population of researchers.

Current Status and Enhancement of Collaborative Research in the World: A Case Study of Osaka University
Shino Iwami, Toshihiko Shimizu, Melvin John F. Empizo, Jacque Lynn F. Gabayno, Nobuhiko Sarukura, Shota Fujii, Yoshinari Sumimura
Journal of Data and Information Science    2020, 5 (4): 75-85.   doi:10.2478/jdis-2020-0035

Purpose: The purpose of this research is to provide evidence for decision-makers to realize the potentials of collaborations between countries/regions via the scientometric analysis of co-authoring in academic publications.

Design/methodology/approach: Osaka University, which has set a strategy to become a global campus, is positioned to take a leading role in enhancing such collaborations. This research measures co-authoring relations between Osaka University and other countries/regions to identify networks for fostering strong research collaborations.

Findings: Five countries are identified as candidates for the future global campuses of Osaka University based on three factors: co-authoring relations, GDP growth, and population growth.

Research limitations: The main limitation of this study is that it cannot capture relations formed through authors’ former positions at Osaka University, because the data retrieval was limited to querying the organization name in the first step.

Practical implications: The significance of this work is to provide evidence for the university strategy to expand abroad based on the quantity and visualization of trends.

Originality/value: With wider practical implementations, the approach of this research is useful in making a strategic roadmap for scientific organizations that intend to collaborate internationally.

Global Collaboration in Artificial Intelligence: Bibliometrics and Network Analysis from 1985 to 2019
Haotian Hu, Dongbo Wang, Sanhong Deng
Journal of Data and Information Science    2020, 5 (4): 86-115.   doi:10.2478/jdis-2020-0027

Purpose: This study aims to explore the trend and status of international collaboration in the field of artificial intelligence (AI) and to understand the hot topics, core groups, and major collaboration patterns in global AI research.

Design/methodology/approach: We selected 38,224 papers in the field of AI from 1985 to 2019 in the core collection database of Web of Science (WoS) and studied international collaboration from the perspectives of authors, institutions, and countries through bibliometric analysis and social network analysis.

Findings: The bibliometric results show that in the field of AI, the number of published papers increases every year, and 84.8% of them are cooperative papers. Collaboration among more than three authors, collaboration between two countries, and collaboration within institutions are the three main collaboration patterns. Through social network analysis, this study found that the US, the UK, France, and Spain led global collaborative research in the field of AI at the country level, while Vietnam, Saudi Arabia, and the United Arab Emirates had a high degree of international participation. Collaboration at the institution level reflects obvious regional and economic characteristics. There is the Developing Countries Institution Collaboration Group led by Iran, China, and Vietnam, as well as the Developed Countries Institution Collaboration Group led by the US, Canada, and the UK. Also, the Chinese Academy of Sciences (China) plays a pivotal role in connecting these institutional collaboration groups.
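The country-level network construction can be sketched as follows: each paper contributes one edge per pair of author countries, and degree then indicates collaboration breadth (the paper list is hypothetical, not the study's 38,224-paper dataset):

```python
from itertools import combinations
from collections import Counter

# Hypothetical papers, each represented by its set of author countries
papers = [
    {"US", "UK"}, {"US", "CN"}, {"US", "UK", "FR"}, {"CN", "VN"},
]

edge_weight = Counter()
for countries in papers:
    for pair in combinations(sorted(countries), 2):
        edge_weight[pair] += 1          # co-authorship count per country pair

degree = Counter()
for (u, v), w in edge_weight.items():   # unweighted degree: number of partner countries
    degree[u] += 1
    degree[v] += 1

print(edge_weight[("UK", "US")])        # → 2
print(degree.most_common(1))            # → [('US', 3)]
```

Centrality measures over this network are what identify leading countries and pivotal institutions such as the Chinese Academy of Sciences.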

Research limitations: First, participant contributions in international collaboration may have varied, but in our research they are viewed equally when building collaboration networks. Second, although the edge weight in the collaboration network is considered, it is only used to help reduce the network and does not reflect the strength of collaboration.

Practical implications: The findings fill the current shortage of research on international collaboration in AI. They will help inform scientists and policy makers about the future of AI research.

Originality/value: This work covers the longest time span to date for research on international collaboration in the field of AI. It explores the evolution, future trends, and major collaboration patterns of international collaboration in the field of AI over the past 35 years. It also reveals the leading countries, core groups, and characteristics of collaboration in the field of AI.

Priorities for Social and Humanities Projects Based on Text Analysis
Ülle Must
Journal of Data and Information Science    2020, 5 (4): 116-125.   doi:10.2478/jdis-2020-0036

Purpose: Changes in the world show that the role, importance, and coherence of SSH (social sciences and the humanities) will increase significantly in the coming years. This paper aims to monitor and analyze the evolution (or overlapping) of the SSH thematic pattern through three funding instruments since 2007.

Design/methodology/approach: The goal of the paper is to check to what extent the EU Framework Programme (FP) affects (or does not affect) research at the national level, and to highlight hot topics from a given period with the help of text analysis. Funded project titles and abstracts derived from the EU FP and the Slovenian and Estonian research information systems were used. The final analysis and comparisons between the different datasets were made on the basis of the 200 most frequent words. After removing punctuation marks, numeric values, articles, prepositions, conjunctions, and auxiliary verbs, 4,854 unique words were identified in ETIS, 4,421 in the Slovenian Research Information System (SICRIS), and 3,950 in the FP.
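The frequency analysis described above can be sketched as follows; the stop-word set is a tiny illustrative stand-in for the full clean-up of articles, prepositions, conjunctions, and auxiliary verbs:

```python
import re
from collections import Counter

# Illustrative function-word list; the study's clean-up is broader.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "for", "to", "is"}

def top_words(texts, n=200):
    """Tokenize titles/abstracts, drop punctuation, numbers and
    function words, and return the n most frequent remaining words."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOPWORDS:
                counts[token] += 1
    return counts.most_common(n)

# Toy project titles standing in for FP / ETIS / SICRIS records.
top = top_words(["Social innovation in education",
                 "Education and culture"], n=3)
```

Comparing the resulting top-200 lists across funding instruments (set intersection of the word lists) yields the overlap figures reported in the Findings.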

Findings: Across all funding instruments, about a quarter of the top words constitute half of the word occurrences. The text analysis results show that in the majority of cases words do not overlap between FP and nationally funded projects. In some cases, it may be due to using different vocabulary. There is more overlapping between words in the case of Slovenia (SL) and Estonia (EE) and less in the case of Estonia and EU Framework Programmes (FP). At the same time, overlapping words indicate a wider reach (culture, education, social, history, human, innovation, etc.). In nationally funded projects (bottom-up), it was relatively difficult to observe the change in thematic trends over time. More specific results emerged from the comparison of the different programs throughout FP (top-down).

Research limitations: Only projects with English titles and abstracts were analyzed.

Practical implications: The specifics of SSH have to be taken into account: the one-to-one meaning of terms/words is not as important as it is, for example, in the exact sciences. Thus, even in co-word analysis, the final content may go unnoticed.

Originality/value: This was the first attempt to monitor the trends of SSH projects using text analysis. The text analysis of the SSH projects of the two new EU Member States used in the study showed that SSH’s thematic coverage is not much affected by the EU Framework Program. Whether this result is field-specific or country-specific should be shown in the following study, which targets SSH projects in the so-called old Member States.

Topic Evolution and Emerging Topic Analysis Based on Open Source Software
Xiang Shen, Li Wang
Journal of Data and Information Science    2020, 5 (4): 126-136.   doi:10.2478/jdis-2020-0033

Purpose: We present an analytical, open source and flexible natural language processing and text mining method for topic evolution, emerging topic detection and research trend forecasting for all kinds of data-tagged text.

Design/methodology/approach: We make full use of the functions provided by the open source VOSviewer and Microsoft Office, including a thesaurus for data clean-up and a LOOKUP function for comparative analysis.

Findings: Through application and verification in the domain of perovskite solar cells research, this method proves to be effective.

Research limitations: A certain amount of manual data processing and a specific research domain background are required for better, more illustrative analysis results. Adequate time for analysis is also necessary.

Practical implications: We try to set up an easy, useful, and flexible interdisciplinary text analyzing procedure for researchers, especially those without solid computer programming skills or who cannot easily access complex software. This procedure can also serve as a wonderful example for teaching information literacy.

Originality/value: This text analysis approach has not been reported before.

Scientometric Analysis of Research Output from Brazil in Response to the Zika Crisis Using e-Lattes
Ricardo Barros Sampaio, Antônio de Abreu Batista-Jr, Bruno Santos Ferreira, Mauricio L. Barreto, Jesús P. Mena-Chalco
Journal of Data and Information Science    2020, 5 (4): 137-146.   doi:10.2478/jdis-2020-0038

Purpose: This paper aims to test the use of e-Lattes to map Brazilian scientific output on a recent health research subject: the Zika virus.

Design/methodology/approach: From a set of Lattes CVs of Zika researchers registered on the Lattes Platform, we used the e-Lattes to map the Brazilian scientific response to the Zika crisis.

Findings: Brazilian science articulated quickly during the public health emergency of international concern (PHEIC) due to the creation of mechanisms to streamline funding of scientific research.

Research limitations: We did not assess any dimension of research quality, including the scientific impact and societal value.

Practical implications: e-Lattes can provide useful insights about research groups to different stakeholders, based on the Lattes CVs of group members.

Originality/value: The information included in Lattes CVs permits us to assess science from a broader perspective taking into account not only scientific research production but also the training of human resources and scientific collaboration.

Detection of Malignant and Benign Breast Cancer Using the ANOVA-BOOTSTRAP-SVM
Borislava Petrova Vrigazova
Journal of Data and Information Science    2020, 5 (2): 62-75.   doi:10.2478/jdis-2020-0012

Purpose: The aim of this research is to propose a modification of the ANOVA-SVM method that can increase accuracy when detecting benign and malignant breast cancer.

Methodology: We propose a new method, ANOVA-BOOTSTRAP-SVM. It involves applying analysis of variance (ANOVA) to support vector machines (SVM), but uses the bootstrap instead of cross-validation as the train/test splitting procedure. We tuned the kernel and the C parameter and tested our algorithm on a set of breast cancer datasets.
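A rough scikit-learn sketch of the idea, reading the ANOVA step as univariate F-test feature selection; the kernel, C value, and k=10 features are illustrative placeholders, whereas the paper tunes these:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)

# ANOVA step: keep the k features with the highest F-scores.
X = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Bootstrap step: train on a sample drawn with replacement,
# evaluate on the out-of-bag observations.
n = len(y)
boot = rng.integers(0, n, n)
oob = np.setdiff1d(np.arange(n), boot)

clf = SVC(kernel="rbf", C=1.0).fit(X[boot], y[boot])
acc = clf.score(X[oob], y[oob])
```

Repeating the bootstrap draw and averaging `acc` across resamples is the natural way to obtain the stable accuracy estimates the paper compares against cross-validation.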

Findings: By using the newly proposed method, we succeeded in improving accuracy by 4.5 to 8 percentage points, depending on the dataset.

Research limitations: The algorithm is sensitive to the type of kernel and value of the optimization parameter C.

Practical implications: We believe that the ANOVA-BOOTSTRAP-SVM can be used not only to recognize the type of breast cancer but also for broader research in all types of cancer.

Originality/value: Our findings are important as the algorithm can detect various types of cancer with higher accuracy compared to standard versions of the Support Vector Machines.

FAIR + FIT: Guiding Principles and Functional Metrics for Linked Open Data (LOD) KOS Products
Marcia Lei Zeng, Julaine Clunis
Journal of Data and Information Science    2020, 5 (1): 93-118.   doi:10.2478/jdis-2020-0008
Accepted: 17 April 2020


Purpose: To develop a set of metrics and identify criteria for assessing the functionality of LOD KOS products, while providing common guiding principles that can be used by LOD KOS producers and users to maximize the functions and uses of LOD KOS products.

Design/methodology/approach: Data collection and analysis were conducted at three time periods in 2015-16, 2017 and 2019. The sample data used in the comprehensive data analysis comprises all datasets tagged as types of KOS in the Datahub and extracted through their respective SPARQL endpoints. A comparative study of the LOD KOS collected from terminology services Linked Open Vocabularies (LOV) and BioPortal was also performed.
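Extracting KOS datasets through their SPARQL endpoints typically starts from a probe query like the one built below; the SKOS concept count is an illustrative example of such a probe, not the study's actual query:

```python
def kos_count_query(graph_uri=None):
    """Build a SPARQL query counting SKOS concepts in a dataset,
    the kind of probe one could send to each endpoint. The optional
    graph URI restricts the query to one named graph."""
    body = (
        "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
        "SELECT (COUNT(DISTINCT ?c) AS ?concepts) WHERE {\n"
        "  ?c a skos:Concept .\n"
        "}"
    )
    if graph_uri:
        # FROM names the dataset between SELECT and WHERE.
        body = body.replace("WHERE {", "FROM <%s> WHERE {" % graph_uri)
    return body

q = kos_count_query("http://example.org/g")
```

In practice the query string would be posted to each endpoint with an HTTP client or a wrapper library, and the counts aggregated across the Datahub, LOV, and BioPortal samples.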

Findings: The study proposes a set of Functional, Impactful and Transformable (FIT) metrics for LOD KOS as value vocabularies. The FAIR principles, with additional recommendations, are presented for LOD KOS as open data.

Research limitations: The metrics need to be further tested and aligned with the best practices and international standards of both open data and various types of KOS.

Practical implications: Assessments performed with the FAIR and FIT metrics support the creation and delivery of user-friendly, discoverable, and interoperable LOD KOS datasets, which can be used for innovative applications, act as knowledge bases, serve as a foundation for semantic analysis and entity extraction, and enhance research in the sciences and humanities.

Originality/value: Our research provides best practice guidelines for LOD KOS as value vocabularies.

Improving Archival Records and Service of Traditional Korean Performing Arts in a Semantic Web Environment
Ziyoung Park, Hosin Lee, Seungchon Kim, Sungjae Park
Journal of Data and Information Science    2020, 5 (1): 68-80.   doi:10.2478/jdis-2020-0006
Accepted: 17 April 2020


Purpose: This research project aims to organize the archival information of traditional Korean performing arts in a semantic web environment. Key requirements that archival records managers should consider when publishing and distributing gugak performing arts archival information in a semantic web environment are presented from the perspective of linked data.

Design/methodology/approach: This study analyzes the metadata provided by the National Gugak Center’s Gugak Archive, the search and browse menus of Gugak Archive’s website and K-PAAN, the performing arts portal site.

Findings: The importance of consistency, continuity, and systematicity—crucial qualities in traditional record management practices—is undiminished in a semantic web environment. However, a semantic web environment also requires new tools such as web identifiers (URIs), data models (RDF), and link information (interlinking).

Research limitations: The scope of this study does not include practical implementation strategies for the archival records management system and website services. The suggestions also do not discuss issues related to copyright or policy coordination between related organizations.

Practical implications: The findings of this study can assist records managers in converting a traditional performing arts information archive into a semantic web environment-based online archival service and system. This can also be useful for collaboration with records managers who are unfamiliar with relational or triple-store database systems.

Originality/value: This study analyzed the metadata of the Gugak Archive and its online services to present practical requirements for managing and disseminating gugak performing arts information in a semantic web environment. In applying the principles and methods of semantic web services to the Gugak Archive, this study can contribute to the improvement of information organization and services in the field of Korean traditional music.

The ARQUIGRAFIA project: A Web Collaborative Environment for Architecture and Urban Heritage Image
Vânia Mara Alves Lima, Cibele Araújo Camargo Marques dos Santos, Artur Simões Rozestraten
Journal of Data and Information Science    2020, 5 (1): 51-67.   doi:10.2478/jdis-2020-0005
Accepted: 22 June 2012


Purpose: This paper presents the ARQUIGRAFIA project, an open, public, nonprofit, continuously growing collaborative web environment dedicated to Brazilian architectural photographic images.

Design/methodology/approach: The ARQUIGRAFIA project promotes active and collaborative participation among its institutional users (GLAMs, NGOs, laboratories, and research groups) and private users (students, professionals, professors, researchers). Both can create an account and share their digitized iconographic collections in the same web environment by uploading their files, indexing, georeferencing, and assigning a Creative Commons license.

Findings: The system supports user interaction through the recording of semantic-differential impressions of the visible plastic-spatial aspects of the architecture, presented in synthetic infographics, as well as the retrieval of images through an advanced search system based on those impression parameters. Through gamification, the system regularly invites users to review images in order to improve the accuracy of image data. A pilot project named Open Air Museum allows users to add audio descriptions to images in situ. An interface for users' digital curatorship will soon be available.

Research limitations: The ARQUIGRAFIA project's multidisciplinary team, which gathers professors-researchers and graduate and undergraduate students from the Architecture and Urbanism, Design, Information Science, and Computer Science faculties of the University of São Paulo, demands continuous financial resources for grants, for contracting third-party services, for participation in scientific events in Brazil and abroad, and for equipment. Since 2016, significant budget cuts in the University of São Paulo's own research funds and in Brazilian federal scientific agencies may compromise the continuity of this project.

Practical implications: An open source template called +GRAFIA can freely help other areas of knowledge build their own visual web collaborative environments.

Originality/value: The collaborative nature of the ARQUIGRAFIA distinguishes it from institutional image databases on the internet, precisely because it involves a heterogeneous network of collaborators.

Automatic Classification of Swedish Metadata Using Dewey Decimal Classification: A Comparison of Approaches
Koraljka Golub, Johan Hagelbäck, Anders Ardö
Journal of Data and Information Science    2020, 5 (1): 18-38.   doi:10.2478/jdis-2020-0003
Accepted: 17 April 2020


Purpose: With more and more digital collections of various information resources becoming available, the challenge of assigning subject index terms and classes from quality knowledge organization systems is also increasing. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of six machine learning algorithms as well as a string-matching algorithm based on characteristics of DDC.

Design/methodology/approach: State-of-the-art machine learning algorithms require at least 1,000 training examples per class. The complete data set at the time of research involved 143,838 records, which had to be reduced to the top three hierarchical levels of DDC in order to provide sufficient training data (totaling 802 classes in the training and testing sample, out of 14,413 classes at all levels).

Findings: Evaluation shows that the Support Vector Machine with a linear kernel outperforms the other machine learning algorithms as well as the string-matching algorithm on average; the string-matching algorithm outperforms machine learning for specific classes where the characteristics of DDC are most suitable for the task. Word embeddings combined with different types of neural networks (simple linear network, standard neural network, 1D convolutional neural network, and recurrent neural network) produced worse results than the Support Vector Machine, but come close, with the benefit of a smaller representation size. Analysis of feature impact in machine learning shows that using keywords or combining titles and keywords gives better results than using only titles as input. Stemming only marginally improves the results. Removing stop-words reduced accuracy in most cases, while removing less frequent words increased it marginally. The greatest impact is produced by the number of training examples: 81.90% accuracy is achieved when at least 1,000 records per class are available in the training set, and 66.13% when too few records (often fewer than 100 per class) are available; these figures hold only for the top three hierarchical levels (803 instead of 14,413 classes).
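The best-performing configuration described above (a linear-kernel SVM over titles combined with keywords) can be sketched with scikit-learn; the records and DDC labels below are invented toy data, not the Swedish collection:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy records: title plus keywords as one input string, labelled
# with a top-level DDC class (labels here are illustrative only).
records = [
    ("photosynthesis plant cell biology", "570"),
    ("chlorophyll leaf botany", "570"),
    ("medieval kings battle chronicle", "940"),
    ("war empire european history", "940"),
]
texts, labels = zip(*records)

# TF-IDF features feeding a linear SVM, one model for all classes.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
pred = clf.predict(["plant biology of the leaf"])[0]
```

With 802 classes instead of two, the same pipeline trains one-vs-rest linear classifiers per class, which is why the number of training examples per class dominates accuracy.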

Research limitations: Having to reduce the number of hierarchical levels to the top three levels of DDC, because of the lack of training data for all classes, skews the results so that they hold in experimental conditions but barely for end users in operational retrieval systems.

Practical implications: In conclusion, purely automatic DDC classification does not work for operational information retrieval systems, whether using machine learning (because of the lack of training data for the large number of DDC classes) or the string-matching algorithm (because DDC characteristics support automatic classification well in only a small number of classes). Over time, more training examples may become available, and DDC may be enriched with synonyms in order to enhance the accuracy of automatic classification, which may also benefit information retrieval performance based on DDC. In order for quality information services to reach the objective of the highest possible precision and recall, automatic classification should never be implemented on its own; instead, machine-aided indexing that combines the efficiency of automatic suggestions with the quality of human decisions at the final stage should be the way forward.

Originality/value: The study explored machine learning on a large classification system of over 14,000 classes which is used in operational information retrieval systems. Due to lack of sufficient training data across the entire set of classes, an approach complementing machine learning, that of string matching, was applied. This combination should be explored further since it provides the potential for real-life applications with large target classification systems.

Knowledge Organization and Representation under the AI Lens
Jian Qin
Journal of Data and Information Science    2020, 5 (1): 3-17.   doi:10.2478/jdis-2020-0002
Accepted: 17 April 2020


Purpose: This paper compares the paradigmatic differences between knowledge organization (KO) in library and information science and knowledge representation (KR) in AI to show the convergence in KO and KR methods and applications.

Methodology: The literature review and comparative analysis of KO and KR paradigms is the primary method used in this paper.

Findings: A key difference between KO and KR lies in their purpose: KO organizes knowledge into structures that standardize and/or normalize the vocabulary of concepts and relations, while KR is problem-solving oriented. Differences between KO and KR are discussed with respect to their goals, methods, and functions.

Research limitations: This is only preliminary research, with a case study as proof of concept.

Practical implications: The paper articulates the opportunities in applying KR and other AI methods and techniques to enhance the functions of KO.

Originality/value: Ontologies and linked data as the evidence of the convergence of KO and KR paradigms provide theoretical and methodological support to innovate KO in the AI era.

The Second Edition of the Integrative Levels Classification: Evolution of a KOS
Ziyoung Park, Claudio Gnoli, Daniele P. Morelli
Journal of Data and Information Science    2020, 5 (1): 39-50.   doi:10.2478/jdis-2020-0004
Accepted: 17 April 2020


Purpose: This paper informs about the publication of the second edition of the Integrative Levels Classification (ILC2), a freely-faceted knowledge organization system (KOS), and reviews the main changes that have been introduced as compared to its first edition (ILC1).

Design/methodology/approach: The most relevant changes are illustrated, with special reference to those of interest to general classification theory, by means of examples of notation for individual classes and combinations of them.

Findings: Changes introduced in ILC2 include: the names and order of some main classes; the development of subclasses for various phenomena, especially quantities and algebraic structures; the order of facet categories and the new category of Disorder; notation for special facets; and the distinction of the semantic function of facets (attributes) from their syntactic function. The system can be freely accessed online through a PHP browser as well as in SKOS format.

Research limitations: Only a selection of changed classes is discussed for space reasons.

Practical implications: ILC1 has been previously applied to the BARTOC directory of KOSs. Update of BARTOC data to ILC2 and application of ILC2 to further information systems are envisaged. Possible methods for reclassifying BARTOC with ILC2 are discussed.

Originality: ILC is a newly developed classification system, based on phenomena instead of traditional disciplines and featuring various innovative devices. This paper is an original account of its most recent evolution.

“SEMANTIC” in a Digital Curation Model
Hyewon Lee, Soyoung Yoon, Ziyoung Park
Journal of Data and Information Science    2020, 5 (1): 81-92.   doi:10.2478/jdis-2020-0007
Accepted: 17 April 2020


Purpose: This study proposes an abstract model that gathers concepts focused on resource representation and description in a digital curation model, and suggests a conceptual model that emphasizes semantic enrichment in digital curation.

Design/methodology/approach: This study conducts a literature review to analyze the preceding curation models, DCC CLM, DCC&U, UC3, and DCN.

Findings: The concept of semantic enrichment is expressed in a single word, SEMANTIC, in this study. The Semantic Enrichment Model, SEMANTIC, has the elements Subject, Extraction, Multi-language, Authority, Network, Thing, Identity, and Connect.

Research limitations: This study does not reflect the actual information environment because it focuses on the concepts of the representation of digital objects.

Practical implications: This study presents the main considerations for creating and reinforcing the description and representation of digital objects when building and developing digital curation models in specific institutions.

Originality/value: This study summarizes the elements that should be emphasized in the representation of digital objects in terms of information organization.

A Metric Approach to Hot Topics in Biomedicine via Keyword Co-occurrence
Jane H. Qin, Jean J. Wang, Fred Y. Ye
Journal of Data and Information Science    2019, 4 (4): 13-25.   doi:10.2478/jdis-2019-0018
Accepted: 19 December 2019


Purpose: To reveal the research hotspots in, and the relationships among, three hot research topics in biomedicine, namely CRISPR, iPS (induced pluripotent stem) cells, and synthetic biology.

Design/methodology/approach: We set up keyword co-occurrence networks for the three topics and apply three indicators and information visualization for metric analysis.
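The core of a keyword co-occurrence network is a weighted edge list counting how often two keywords appear in the same paper; overlap between topic-level keyword sets then signals integration. A minimal sketch with invented keywords (not the study's data):

```python
from collections import Counter
from itertools import combinations

def cooccurrence(papers_keywords):
    """Count how often two keywords appear together in one paper."""
    pairs = Counter()
    for kws in papers_keywords:
        for a, b in combinations(sorted(set(kws)), 2):
            pairs[(a, b)] += 1
    return pairs

def overlap(topic_keyword_sets):
    """Keywords shared by at least two topics."""
    shared, topics = set(), list(topic_keyword_sets)
    for i in range(len(topics)):
        for j in range(i + 1, len(topics)):
            shared |= topics[i] & topics[j]
    return shared

# Toy papers and toy per-topic keyword sets.
papers = [
    {"CRISPR", "Cas9", "gene editing"},
    {"CRISPR", "gene editing", "iPS cell"},
    {"iPS cell", "reprogramming"},
]
net = cooccurrence(papers)
shared = overlap([{"CRISPR", "Cas9"},
                  {"CRISPR", "iPS cell"},
                  {"synthetic biology", "iPS cell"}])
```

The edge weights feed the network indicators and visualization; the shared set is the kind of evidence behind the mutual-integration finding.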

Findings: The results reveal that the main research hotspots of the three topics differ, but the overlapping keywords among them indicate that the topics are mutually integrated and interact with each other.

Research limitations: All analyses use keywords, without any other forms.

Practical implications: We try to find the information distribution and structure of these three hot topics in order to reveal their research status and interactions, and to promote biomedical development.

Originality/value: We chose the core keywords in three research hot topics in biomedicine by using h-index.

CiteOpinion: Evidence-based Evaluation Tool for Academic Contributions of Research Papers Based on Citing Sentences
Xiaoqiu Le, Jingdan Chu, Siyi Deng, Qihang Jiao, Jingjing Pei, Liya Zhu, Junliang Yao
Journal of Data and Information Science    2019, 4 (4): 26-41.   doi:10.2478/jdis-2019-0019
Accepted: 19 December 2019


Purpose: To uncover the evaluation information on the academic contribution of research papers cited by peers based on the content cited by citing papers, and to provide an evidence-based tool for evaluating the academic value of cited papers.

Design/methodology/approach: CiteOpinion uses a deep learning model to automatically extract citing sentences from representative citing papers. Starting from an analysis of the citing sentences, it identifies the major academic contribution points of the cited paper, positive/negative evaluations from citing authors, and changes in the subjects of subsequent citing authors, by means of recognizing categories of moves (problems, methods, conclusions, etc.), sentiment analysis, and topic clustering.

Findings: Citing sentences in a citing paper contain substantial evidence useful for academic evaluation. They can also be used to objectively and authentically reveal the nature and degree of the contribution of the cited paper, beyond simple citation statistics.

Practical implications: The evidence-based evaluation tool CiteOpinion can provide an objective and in-depth academic value evaluation basis for the representative papers of scientific researchers, research teams, and institutions.

Originality/value: No other similar practical tool is found in papers retrieved.

Research limitations: There are difficulties in acquiring full text of citing papers. There is a need to refine the calculation based on the sentiment scores of citing sentences. Currently, the tool is only used for academic contribution evaluation, while its value in policy studies, technical application, and promotion of science is not yet tested.

Masked Sentence Model Based on BERT for Move Recognition in Medical Scientific Abstracts
Gaihong Yu, Zhixiong Zhang, Huan Liu, Liangping Ding
Journal of Data and Information Science    2019, 4 (4): 42-55.   doi:10.2478/jdis-2019-0020
Accepted: 19 December 2019


Purpose: Move recognition in scientific abstracts is an NLP task of classifying sentences of the abstracts into different types of language units. To improve the performance of move recognition in scientific abstracts, a novel model of move recognition is proposed that outperforms the BERT-based method.

Design/methodology/approach: Prevalent models based on BERT for sentence classification often classify sentences without considering the context of the sentences. In this paper, inspired by the BERT masked language model (MLM), we propose a novel model called the masked sentence model that integrates the content and contextual information of the sentences in move recognition. Experiments are conducted on the benchmark dataset PubMed 20K RCT in three steps. Then, we compare our model with HSLN-RNN, BERT-based and SciBERT using the same dataset.
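One plausible reading of the masked sentence idea, sketched below, pairs each target sentence with its abstract-level context in which that sentence is replaced by a mask token; this is an illustration of the concept, not the paper's exact input recipe:

```python
def masked_example(sentences, i):
    """Build one input example for sentence i of an abstract:
    the target sentence, then the whole abstract with the target
    position masked, so the classifier sees content and context."""
    context = " ".join(
        "[MASK]" if j == i else s for j, s in enumerate(sentences)
    )
    return "[CLS] " + sentences[i] + " [SEP] " + context + " [SEP]"

ex = masked_example(
    ["Aims are stated.", "Methods follow.", "Results end."], 1
)
```

Each sentence of a PubMed 20K RCT abstract would yield one such example, fed to BERT for move classification (background, objective, method, result, conclusion).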

Findings: Our model outperforms the BERT-based and SciBERT models in F1 score by 4.96% and 4.34%, respectively, which shows the feasibility and effectiveness of the novel model; its results come closest to the current state-of-the-art results of HSLN-RNN.

Research limitations: The sequential features of move labels are not considered, which might be one of the reasons why HSLN-RNN has better performance. Our model is restricted to dealing with biomedical English literature because we use a dataset from PubMed, which is a typical biomedical database, to fine-tune our model.

Practical implications: The proposed model is better and simpler in identifying move structures in scientific abstracts and is worthy of text classification experiments for capturing contextual features of sentences.

Originality/value: The study proposes a masked sentence model based on BERT that considers the contextual features of the sentences in abstracts in a new way. The performance of this classification model is significantly improved by rebuilding the input layer without changing the structure of neural networks.

Identification of Sarcasm in Textual Data: A Comparative Study
Pulkit Mehndiratta, Devpriya Soni
Journal of Data and Information Science    2019, 4 (4): 56-83.   doi:10.2478/jdis-2019-0021
Accepted: 19 December 2019


Purpose: The ever-increasing penetration of the Internet into our lives has led to an enormous amount of multimedia content being generated online. Textual data contributes a major share of the data generated on the World Wide Web. Understanding people's sentiment is an important aspect of natural language processing, but opinions can be biased and misread if people use sarcasm while commenting, posting status updates, or reviewing a product or a movie. Thus, it is of utmost importance to detect sarcasm correctly and make correct predictions about people's intentions.

Design/methodology/approach: This study evaluates various machine learning models along with standard and hybrid deep learning models across several standardized datasets. We performed vectorization of text using word embedding techniques, converting the textual data into vectors for analysis. We used three standardized datasets available in the public domain and three word embeddings, i.e. Word2Vec, GloVe, and fastText, to validate the hypothesis.
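Before any embedding lookup, each text is mapped to a fixed-length sequence of token ids; a minimal sketch of that vectorization step is below (a pre-trained Word2Vec, GloVe, or fastText matrix would then be indexed by these ids, which is an assumption about the pipeline, not the paper's code):

```python
def build_vocab(texts):
    """Assign an integer id to every token seen in the corpus,
    reserving 0 for padding and 1 for unknown words."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for tok in text.lower().split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def vectorize(text, vocab, maxlen=8):
    """Map a text to a fixed-length list of token ids,
    truncating or zero-padding to maxlen."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in text.lower().split()]
    ids = ids[:maxlen]
    return ids + [vocab["<pad>"]] * (maxlen - len(ids))

vocab = build_vocab(["great movie", "oh great"])
vec = vectorize("totally great movie", vocab, maxlen=4)
```

These padded id sequences are what an embedding layer, and then the Bi-LSTM/CNN hybrids compared in the study, consume as input.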

Findings: The results were analyzed and conclusions drawn. The key finding is that the hybrid models combining Bidirectional Long Short-Term Memory (Bi-LSTM) and Convolutional Neural Network (CNN) layers outperform both conventional machine learning and the other deep learning models across all the datasets considered in this study, validating our hypothesis.

Research limitations: Using data from different sources and customizing the models for each dataset slightly decreases the generality of the technique. Overall, however, this methodology provides effective measures to identify the presence of sarcasm, with a minimum average accuracy of 80% or above for one dataset and results better than the current baselines for the other datasets.

Practical implications: The results provide solid insights for the system developers to integrate this model into real-time analysis of any review or comment posted in the public domain. This study has various other practical implications for businesses that depend on user ratings and public opinions. This study also provides a launching platform for various researchers to work on the problem of sarcasm identification in textual data.

Originality/value: This is a first of its kind study, to provide us the difference between conventional and the hybrid methods of prediction of sarcasm in textual data. The study also provides possible indicators that hybrid models are better when applied to textual data for analysis of sarcasm.

Are Contributions from Chinese Physicists Undercited?
Jinzhong Guo, Xiaoling Liu, Liying Yang, Jinshan Wu
Journal of Data and Information Science    2019, 4 (4): 84-95.   doi:10.2478/jdis-2019-0022
Accepted: 19 December 2019


Purpose: In this work, we examine whether there are scientific fields in which contributions from Chinese scholars have been under- or over-cited.

Design/methodology/approach: We do so by comparing the number of received citations and the IOF of publications in each scientific field from each country. The IOF is calculated by applying the modified closed-system input-output analysis (MCSIOA) to the citation network. MCSIOA is a PageRank-like algorithm, meaning here that citations from more influential subfields are weighted more heavily in the IOF.
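For readers unfamiliar with PageRank-like weighting, a plain PageRank iteration on a toy three-node citation network is sketched below. This is not the authors' MCSIOA, only an illustration of the family of algorithms it belongs to.

```python
# Plain PageRank on a tiny citation network: a node's rank depends on
# the ranks of the nodes citing it, so citations from influential
# nodes count for more. Illustrative only; NOT the authors' MCSIOA.

def pagerank(links, damping=0.85, iterations=100):
    """links[i] = list of nodes that node i cites (links to)."""
    n = len(links)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new = [(1.0 - damping) / n] * n
        for i, targets in enumerate(links):
            if targets:
                share = damping * rank[i] / len(targets)
                for j in targets:
                    new[j] += share
            else:  # dangling node: spread its rank uniformly
                for j in range(n):
                    new[j] += damping * rank[i] / n
        rank = new
    return rank

# 0 cites 1 and 2; 1 cites 2; 2 cites 0
ranks = pagerank([[1, 2], [2], [0]])
```

Node 2, which is cited by both other nodes, ends up with the highest rank.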

Findings: About 40% of subfields in physics in China are undercited, meaning that their net influence ranks are higher (better) than their direct ranks, while about 75% of subfields in the USA and Germany are undercited.

Research limitations: Only APS data is analyzed in this work. The expected citation influence is assumed to be represented by the IOF, and this can be wrong.

Practical implications: MCSIOA provides a measure of net influence, and according to that measure, Chinese physicists&#8217; publications are more likely to be overcited than undercited.

Originality/value: The issue of under- and over-citation is analyzed in this work using MCSIOA.

Measuring Societal Impact Is as Complex as ABC
Ed Noyons
Journal of Data and Information Science    2019, 4 (3): 6-21.   doi:10.2478/jdis-2019-0012
Accepted: 02 September 2019


Purpose: This paper describes an alternative way of assessing journals that considers a broader perspective on their impact. The area-based connectedness (ABC) to society of journals applied here contributes to the assessment of journals' dissemination task, and with more data it may also contribute to the assessment of other missions.

Design/methodology/approach: The ABC approach assesses the performance of research actors, in this case journals, considering the characteristics of the research areas in which they are active. Each paper in a journal inherits the characteristics of its area. These areas are defined by a publication-based classification. The characteristics of areas relate to five dimensions of connectedness to society (news, policy, industrial R&D, technology, and local interest) and are calculated from bibliometric indicators and social media metrics.

Findings: In the paper, I illustrate the approach by showing the results for a few journals, which illustrate the diverse profiles journals may have. We are able to provide a profile for each journal in the Web of Science database. The profiles we present offer an informative view of journals' societal connectedness.

Research limitations: The classification I apply to perform the analyses is a CWTS in-house classification based on Web of Science data. As such, the application depends on (updates of) that system. The classification is available online.

Practical implications: The dimensions of connectedness discussed in this paper relate to the dissemination task of journals but further development of this method may provide more options to monitor the tasks/mission of journals.

Originality/value: The ABC approach is a unique way to assess the performance or impact of research actors considering the characteristics of the areas in which their output is published, and as such it is less prone to manipulation or gaming.

Practice and Challenge of International Peer Review: A Case Study of Research Evaluation of CAS Centers for Excellence
Fang Xu, Xiaoxuan Li
Journal of Data and Information Science    2019, 4 (3): 22-34.   doi:10.2478/jdis-2019-0013
Accepted: 02 September 2019


Purpose: The main goal of this paper is to show that international peer review can work in China's context with satisfactory outcomes. Moreover, this paper provides a reference for the practice of science and technology management.

Design/methodology/approach: This paper starts with a discussion of two critical questions about the significance and design of international peer review. A case study of international peer review of the CAS Centers for Excellence is then analyzed.

Findings: International peer review may provide a solution to the problem of quantitatively oriented research evaluation in China. The case study of research evaluation of the CAS Centers for Excellence shows that it is possible and feasible to conduct international peer review in China's context. When applying this approach to other scenarios, many issues remain to be considered, including the individualized design of international peer review in line with practical demands, and the further improvement of theories and methods of international peer review.

Research limitations: 1) Only the case of international peer review of the CAS Centers for Excellence is analyzed; 2) a relatively small number of respondents were surveyed in the questionnaire.

Practical implications: The work presented in this study can be used as a reference for future studies.

Originality/value: Currently, there are no similarly detailed studies exploring the significance and methodology of international peer review in China.

Disclosing and Evaluating Artistic Research
Florian Vanlee, Walter Ysebaert
Journal of Data and Information Science    2019, 4 (3): 35-54.   doi:10.2478/jdis-2019-0014
Accepted: 02 September 2019


Purpose: This study expands on the results of a stakeholder-driven research project on quality indicators and output assessment of art and design research in Flanders, the northern, Dutch-speaking region of Belgium. It emphasizes the value of arts & design output registration as a way to articulate the disciplinary demarcations of art and design research.

Design/methodology/approach: The particularity of art and design research in Flanders is first analyzed and compared to international examples. Thereafter, the results of the stakeholder-driven project on the creation of indicators for arts & design research output assessment are discussed.

Findings: The findings accentuate the importance of allowing an assessment culture to emerge from practitioners themselves, instead of imposing ill-suited methods borrowed from established scientific evaluation models (Biggs & Karlsson, 2011), notwithstanding the practical difficulties this generates. They point to the potential of stakeholder-driven approaches for artistic research, which benefits from constructing a shared metadiscourse among its practitioners regarding the continuities and discontinuities between “artistic” and “traditional” research, and the communal goals and values that guide its knowledge production (Biggs & Karlsson, 2011; Hellström, 2010; Ysebaert & Martens, 2018).

Research limitations: The central limitation of the study is that it focuses exclusively on the “Architecture & Design” panel of the project, and does not account for intra-disciplinary complexities in output assessment.

Practical implications: The goal of the research project is to create a robust assessment system for arts & design research in Flanders, which may later guide similar international projects.

Originality/value: This study is currently the only one to consider the productive potential of (collaborative) PRFSs for artistic research.

Methods and Practices for Institutional Benchmarking based on Research Impact and Competitiveness: A Case Study of ShanghaiTech University
Jiang Chang, Jianhua Liu
Journal of Data and Information Science    2019, 4 (3): 55-72.   doi:10.2478/jdis-2019-0015
Accepted: 02 September 2019


Purpose: To develop and test a mission-oriented and multi-dimensional benchmarking method for a small-scale university aiming for internationally first-class basic research.

Design/methodology/approach: An individualized, evidence-based assessment scheme was employed to benchmark ShanghaiTech University against selected top research institutions, focusing on research impact and competitiveness at the institutional and disciplinary levels. Topic maps comparing ShanghaiTech with the corresponding top institutions were produced for ShanghaiTech's main research disciplines, providing opportunities for further exploration of strengths and weaknesses.

Findings: This study establishes a preliminary framework for assessing the mission of the university. It further provides assessment principles, assessment questions, and indicators. Analytical methods and data sources were tested and proved to be applicable and efficient.

Research limitations: To better fit this university's selective research foci, its schema of research disciplines needs to be reorganized, and benchmarking targets should include top disciplinary institutions rather than only universities leading overall rankings. The current reliance on research articles and certain databases may neglect important types of research output.

Practical implications: This study provides a working framework and practical methods for mission-oriented, individual, and multi-dimensional benchmarking that ShanghaiTech decided to use for periodical assessments. It also offers a working reference for other institutions to adapt. Further needs are identified so that ShanghaiTech can tackle them for future benchmarking.

Originality/value: This is an effort to develop a mission-oriented, individually designed, systematically structured, and multi-dimensional assessment methodology, which differs from the often-used composite indices.

An Automatic Method to Identify Citations to Journals in News Stories: A Case Study of UK Newspapers Citing Web of Science Journals
Kayvan Kousha, Mike Thelwall
Journal of Data and Information Science    2019, 4 (3): 73-95.   doi:10.2478/jdis-2019-0016
Accepted: 02 September 2019


Purpose: Communicating scientific results to the public is essential to inspire future researchers and ensure that discoveries are exploited. News stories about research are a key communication pathway for this and have been manually monitored to assess the extent of press coverage of scholarship.

Design/methodology/Approach: To make larger scale studies practical, this paper introduces an automatic method to extract citations from newspaper stories to large sets of academic journals. Curated ProQuest queries were used to search for citations to 9,639 Science and 3,412 Social Science Web of Science (WoS) journals from eight UK daily newspapers during 2006-2015. False matches were automatically filtered out by a new program, with 94% of the remaining stories meaningfully citing research.

Findings: Most Science (95%) and Social Science (94%) journals were never cited by these newspapers. Half of the cited Science journals covered medical or health-related topics, whereas 43% of the Social Sciences journals were related to psychiatry or psychology. From the citing news stories, 60% described research extensively and 53% used multiple sources, but few commented on research quality.

Research limitations: The method has only been tested in English and from the ProQuest Newspapers database.

Practical implications: Others can use the new method to systematically harvest press coverage of research.

Originality/value: An automatic method was introduced and tested to extract citations from newspaper stories to large sets of academic journals.

A Multi-match Approach to the Author Uncertainty Problem
Stephen F. Carley, Alan L. Porter, Jan L. Youtie
Journal of Data and Information Science    2019, 4 (2): 1-18.   doi:10.2478/jdis-2019-0006
Accepted: 12 October 2011


Purpose: The ability to identify the scholarship of individual authors is essential for performance evaluation. A number of factors hinder this endeavor. Common and similarly spelled surnames make it difficult to isolate the scholarship of individual authors indexed in large databases. Variations in the name spelling of individual scholars further complicate matters. Common family names in scientific powerhouses like China make it problematic to distinguish between authors possessing ubiquitous and/or anglicized surnames (as well as the same or similar first names). The assignment of unique author identifiers provides a major step toward resolving these difficulties. We maintain, however, that in and of themselves, author identifiers are not sufficient to fully address the author uncertainty problem. In this study we build on the author identifier approach by considering commonalities in fielded data between authors sharing the same surname and first initial. We illustrate our approach using three case studies.

Design/methodology/approach: The approach we advance in this study is based on commonalities among fielded data in search results. We cast a broad initial net—i.e., a Web of Science (WOS) search for a given author’s last name, followed by a comma, followed by the first initial of his or her first name (e.g., a search for ‘John Doe’ would assume the form: ‘Doe, J’). Results for this search typically contain all of the scholarship legitimately belonging to this author in the given database (i.e., all of his or her true positives), along with a large amount of noise, or scholarship not belonging to this author (i.e., a large number of false positives). From this corpus we proceed to iteratively weed out false positives and retain true positives. Author identifiers provide a good starting point—e.g., if ‘Doe, J’ and ‘Doe, John’ share the same author identifier, this would be sufficient for us to conclude these are one and the same individual. We find email addresses similarly adequate—e.g., if two author names which share the same surname and same first initial have an email address in common, we conclude these authors are the same person. Author identifier and email address data is not always available, however. When this occurs, other fields are used to address the author uncertainty problem.

Commonalities among author data other than unique identifiers and email addresses are less conclusive for name consolidation purposes. For example, if ‘Doe, John’ and ‘Doe, J’ have an affiliation in common, do we conclude that these names belong to the same person? They may or may not; the same affiliation may employ two or more faculty members sharing the same surname and first initial. Similarly, it is conceivable that two individuals with the same last name and first initial publish in the same journal, publish with the same co-authors, and/or cite the same references. Should we then ignore commonalities among these fields and conclude they are too imprecise for name consolidation purposes? It is our position that such commonalities are indeed valuable for addressing the author uncertainty problem, but more so when used in combination.

Our approach makes use of automation as well as manual inspection, relying initially on author identifiers, then on commonalities among fielded data other than author identifiers, and finally on manual verification. To achieve name consolidation independent of author identifier matches, we have developed a procedure for use with the bibliometric software VantagePoint. While the application of our technique does not exclusively depend on VantagePoint, it is the software we found most efficient in this study. The script we developed implements our name disambiguation procedure in a way that significantly reduces manual effort on the user's part. Those who seek to replicate our procedure independent of VantagePoint can do so by manually following the method we outline, but we note that manual application takes a significant amount of time and effort, especially when working with larger datasets.

Our script begins by prompting the user for a surname and a first initial (for any author of interest). It then prompts the user to select a WOS field on which to consolidate author names. After this the user is prompted to point to the name of the authors field, and finally asked to identify a specific author name (referred to by the script as the primary author) within this field whom the user knows to be a true positive (a suggested approach is to point to an author name associated with one of the records that has the author’s ORCID iD or email address attached to it).

The script proceeds to identify and combine all author names sharing the primary author’s surname and first initial of his or her first name who share commonalities in the WOS field on which the user was prompted to consolidate author names. This typically results in significant reduction in the initial dataset size. After the procedure completes the user is usually left with a much smaller (and more manageable) dataset to manually inspect (and/or apply additional name disambiguation techniques to).
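The consolidation step described above can be sketched in a few lines of Python. The field names and records below are hypothetical; the authors' actual implementation is a VantagePoint script operating on WOS fields.

```python
# Hedged sketch of the multi-match idea: author name variants sharing a
# surname + first initial are tied together when a chosen match field
# (e.g. email or ORCID) has a value in common. Toy data only.

def consolidate(records, match_field):
    """Group name variants by shared values of match_field.

    records: list of dicts with a 'name' key and optional match fields.
    Returns a mapping from each match-field value to the set of name
    variants it ties together (only values tying 2+ variants are kept).
    """
    groups = {}
    for rec in records:
        value = rec.get(match_field)
        if value:
            groups.setdefault(value, set()).add(rec["name"])
    return {v: names for v, names in groups.items() if len(names) > 1}

records = [
    {"name": "Doe, J",    "email": "jdoe@uni.edu"},
    {"name": "Doe, John", "email": "jdoe@uni.edu"},
    {"name": "Doe, Jane", "email": "jane.doe@other.edu"},
]
merged = consolidate(records, "email")
```

Applying the same function repeatedly with different match fields (emails, co-authors, affiliations, ISSNs) gives the multiple rounds of reduction the procedure relies on.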

Research limitations: Match field coverage can be an issue. When field coverage is poor, dataset reduction is less significant, which results in more manual inspection on the user's part. Our procedure does not lend itself to scholars who have had a legal family name change (after marriage, for example). Moreover, the technique we advance is sometimes (but not always) likely to have a difficult time dealing with scholars who have changed careers or fields dramatically, as well as scholars whose work is highly interdisciplinary.

Practical implications: The procedure we advance has the ability to save a significant amount of time and effort for individuals engaged in name disambiguation research, especially when the name under consideration is a more common family name. It is more effective when match field coverage is high and a number of match fields exist.

Originality/value: Once again, the procedure we advance has the ability to save a significant amount of time and effort for individuals engaged in name disambiguation research. It combines preexisting with more recent approaches, harnessing the benefits of both.

Findings: Our study applies the name disambiguation procedure we advance to three case studies. The ideal match fields are not the same for each case study; we find that match field effectiveness is in large part a function of field coverage. The original dataset sizes differ across the case studies, as do the timeframes analyzed and the subject areas in which the authors publish. Our procedure is most effective when applied to our third case study, both in terms of list reduction and 100% retention of true positives. We attribute this to excellent match field coverage, especially in the more specific match fields, as well as to a more modest and manageable number of publications.

While machine learning is considered authoritative by many, we do not see it as practical or replicable here. The procedure advanced herein is practical, replicable, and relatively user-friendly; it might be placed in a space between ORCID and machine learning. Machine learning approaches typically look for commonalities among citation data, which are not always available, structured, or easy to work with. The procedure we advance is intended to be applied across numerous fields in a dataset of interest (e.g. emails, co-authors, affiliations, etc.), resulting in multiple rounds of reduction. Results indicate that effective match fields include author identifiers, emails, source titles, co-authors, and ISSNs. While the script we present is not likely to produce a dataset consisting solely of true positives (at least for more common surnames), it does significantly reduce manual effort on the user's part. Dataset reduction (after our procedure is applied) is in large part a function of (a) field availability and (b) field coverage.

Normalizing Book Citations in Google Scholar: A Hybrid Cited-side Citing-side Method
John Mingers†, Eren Kaymaz
Journal of Data and Information Science    2019, 4 (2): 19-35.   doi:10.2478/jdis-2019-0007
Accepted: 30 May 2019


Purpose: To design and test a method for normalizing book citations in Google Scholar.

Design/methodology/approach: A hybrid citing-side, cited-side normalization method was developed and this was tested on a sample of 285 research monographs. The results were analyzed and conclusions drawn.

Findings: The method was technically feasible but required extensive manual intervention because of the poor quality of the Google Scholar data.

Research limitations: The sample of books was limited, and all were from one discipline, business and management. Also, the method has only been tested on Google Scholar; it would be useful to test it on Web of Science or Scopus.

Practical limitations: Google Scholar is a poor source of data, although it covers a much wider range of citation sources than other databases.

Originality/value: This is the first method developed specifically for normalizing book citations, which until now it has not been possible to normalize.

Evolution of the Socio-cognitive Structure of Knowledge Management (1986-2015): An Author Co-citation Analysis
Carlos Luis González-Valiente, Magda León Santos, Ricardo Arencibia-Jorge
Journal of Data and Information Science    2019, 4 (2): 36-55.   doi:10.2478/jdis-2019-0008
Accepted: 30 May 2019


Purpose: The evolution of the socio-cognitive structure of the field of knowledge management (KM) during the period 1986-2015 is described.

Design/methodology/approach: Records retrieved from Web of Science were submitted to author co-citation analysis (ACA) from a longitudinal perspective, across three time slices: 1986-1996, 1997-2006, and 2007-2015. The top 10% most-cited first authors in each sub-period were mapped in bibliometric networks in order to interpret the communities formed and their relationships.
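The co-citation counting that underlies ACA can be sketched as follows. The reference lists are illustrative toy data, not the study's Web of Science records.

```python
# Minimal sketch of author co-citation counting: two authors are
# co-cited whenever they appear together in the same paper's reference
# list; the pairwise counts feed the bibliometric network.

from itertools import combinations

def cocitation_counts(reference_lists):
    """Count, for each unordered author pair, how many papers cite both."""
    counts = {}
    for cited_authors in reference_lists:
        for pair in combinations(sorted(set(cited_authors)), 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts

# Toy reference lists of three citing papers (author names illustrative)
refs = [
    ["Nonaka", "Grant", "Spender"],
    ["Nonaka", "Grant"],
    ["Nonaka", "Davenport"],
]
counts = cocitation_counts(refs)
```

The resulting pair counts form the co-citation matrix from which the communities and maps in the study are derived.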

Findings: KM is a homogeneous field, as indicated by the network results. Nine classical authors are identified, as they are highly co-cited in each sub-period, with Ikujiro Nonaka standing out as the most influential author in the field. The most significant communities in KM are devoted to strategic management, KM foundations, organisational learning and behaviour, and organisational theories. Major trends in the evolution of the intellectual structure of KM show a technological influence in 1986-1996, a strategic influence in 1997-2006, and finally a sociological influence in 2007-2015.

Research limitations: Describing a field from a single database can introduce biases in terms of output coverage. Likewise, conference proceedings and books were not used, and the analysis was based only on first authors. However, the results obtained can be very useful for understanding the evolution of KM research.

Practical implications: These results might be useful for managers and academics to understand the evolution of the KM field and to (re)define research activities and organisational projects.

Originality/value: The novelty of this paper lies in considering ACA as a bibliometric technique to study KM research. In addition, our investigation has a wider time coverage than earlier articles.

Does a Country/Region’s Economic Status Affect Its Universities’ Presence in International Rankings?
Esteban Fernández Tuesta, Carlos Garcia-Zorita, Rosario Romera Ayllon, Elías Sanz-Casado
Journal of Data and Information Science    2019, 4 (2): 56-78.   doi:10.2478/jdis-2019-0009
Accepted: 30 May 2019


Purpose: To study how economic parameters affect positions in the Academic Ranking of World Universities (ARWU) top 500, published by the Shanghai Jiao Tong University Graduate School of Education, for countries/regions with listed higher education institutions.

Design/methodology/approach: The methodology capitalises on the multivariate character of the data analysed. The multicollinearity problem is addressed by running principal components analysis prior to regression, using both classical (OLS) and robust (Huber and Tukey) methods.
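A minimal sketch of the principal-components-then-regression step, on synthetic data. The paper's actual indicators and its robust Huber/Tukey estimators are not reproduced here; this only shows why projecting nearly collinear predictors onto principal components removes the multicollinearity before OLS.

```python
# Sketch of "principal components prior to regression": the predictors
# are replaced by their leading principal component(s), and OLS is run
# on the component scores. Synthetic, nearly collinear toy data only.

import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 * 0.99 + rng.normal(scale=0.01, size=100)  # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(scale=0.1, size=100)

# PCA via SVD of the centered predictor matrix
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T[:, :1]          # keep the first principal component only

# OLS of y on the retained component (with intercept)
A = np.column_stack([np.ones(len(y)), scores])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
```

With only one component retained, the design matrix is well conditioned and the fitted coefficients are stable, which is the point of the pre-processing step.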

Findings: Our results revealed that countries/regions with long ranking traditions are highly competitive. Findings also showed that some countries/regions such as Germany, United Kingdom, Canada, and Italy, had a larger number of universities in the top positions than predicted by the regression model. In contrast, for Japan, a country where social and economic performance is high, the number of ARWU universities projected by the model was much larger than the actual figure. In much the same vein, countries/regions that invest heavily in education, such as Japan and Denmark, had lower than expected results.

Research limitations: Using data from only one ranking is a limitation of this study, but the methodology used could be useful to other global rankings.

Practical implications: The results provide good insights for policy makers. They indicate a relationship between research output and the number of universities per million inhabitants. Countries/regions that have historically prioritised higher education exhibited the highest values for the indicators that compose the ranking methodology; furthermore, a minimal increase in welfare indicators could produce significant rises in the presence of their universities in the rankings.

Originality/value: This study is well defined, and the results answer important questions about the characteristics of countries/regions and their higher education systems.

Node2vec Representation for Clustering Journals and as A Possible Measure of Diversity
Zhesi Shen, Fuyou Chen, Liying Yang, Jinshan Wu
Journal of Data and Information Science    2019, 4 (2): 79-92.   doi:10.2478/jdis-2019-0010
Accepted: 30 May 2019


Purpose: To investigate the effectiveness of using node2vec on journal citation networks to represent journals as vectors for tasks such as clustering, science mapping, and journal diversity measure.

Design/methodology/approach: Node2vec is used in a journal citation network to generate journal vector representations.

Findings: 1. Journals are clustered based on the node2vec-trained vectors to form a science map. 2. The norm of a journal's vector can be seen as an indicator of its diversity. 3. Using node2vec-trained journal vectors to compute the Rao-Stirling diversity measure leads to a better measure of diversity than that based on direct citation vectors.
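Finding 3 relies on the Rao-Stirling diversity measure. A minimal sketch follows, assuming the pairwise distance is taken as one minus cosine similarity between category vectors; the vectors and proportions below are illustrative, not actual node2vec output.

```python
# Hedged sketch of Rao-Stirling diversity computed from vectors:
# D = sum over ordered pairs i != j of p_i * p_j * d_ij, with
# d_ij = 1 - cosine(v_i, v_j). (Some formulations sum i < j instead,
# which differs by a factor of two.) Toy data only.

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rao_stirling(proportions, vectors):
    """proportions[i]: citation share of category i; vectors[i]: its embedding."""
    n = len(proportions)
    return sum(
        proportions[i] * proportions[j] * (1.0 - cosine(vectors[i], vectors[j]))
        for i in range(n) for j in range(n) if i != j
    )

vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
diversity = rao_stirling([0.5, 0.3, 0.2], vectors)
```

Categories that are both evenly cited and far apart in the embedding space drive the diversity score up, which is why embedding quality matters for the measure.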

Research limitations: All analyses use citation data and only focus on the journal level.

Practical implications: Node2vec-trained journal vectors embed rich information about journals, can be used to form a science map, and may generate better values of journal diversity measures.

Originality/value: The effectiveness of node2vec in scientometric analysis is tested. Possible indicators for journal diversity measure are presented.

A Criteria-based Assessment of the Coverage of Scopus and Web of Science
Dag W. Aksnes, Gunnar Sivertsen
Journal of Data and Information Science    2019, 4 (1): 1-21.   doi:10.2478/jdis-2019-0001
Accepted: 05 August 2011


Purpose: The purpose of this study is to assess the coverage of the scientific literature in Scopus and Web of Science from the perspective of research evaluation.

Design/methodology/approach: The academic communities of Norway have agreed on certain criteria for what should be included as original research publications in research evaluation and funding contexts. These criteria have been applied since 2004 in a comprehensive bibliographic database called the Norwegian Science Index (NSI). The relative coverages of Scopus and Web of Science are compared with regard to publication type, field of research and language.

Findings: Our results show that Scopus covers 72 percent of the total Norwegian scientific and scholarly publication output in 2015 and 2016, while the corresponding figure for Web of Science Core Collection is 69 percent. The coverages are most comprehensive in medicine and health (89 and 87 percent) and in the natural sciences and technology (85 and 84 percent). The social sciences (48 percent in Scopus and 40 percent in Web of Science Core Collection) and particularly the humanities (27 and 23 percent) are much less covered in the two international data sources.

Research limitations: Relying on data from only one country is a limitation of the study, but the criteria used to define a country's scientific output, as well as the identified patterns of field-dependent partial coverage in Scopus and Web of Science, should be recognizable and useful for other countries as well.

Originality/value: The novelty of this study is the criteria-based approach to studying coverage problems in the two data sources.
