Research Paper published in JDIS
A Discrimination Index Based on Jain’s Fairness Index to Differentiate Researchers with Identical H-index Values
Adian Fatchur Rochim, Abdul Muis, Riri Fitri Sari
Journal of Data and Information Science    2020, 5 (4): 5-18.   doi:10.2478/jdis-2020-0026

Purpose: This paper proposes a discrimination index method based on the Jain’s fairness index to distinguish researchers with the same H-index.

Design/methodology/approach: A validity test is used to measure the correlation of D-offset with the parameters, i.e. H-index, the number of cited papers, the total number of citations, the number of indexed papers, and the number of uncited papers. The correlation test is based on the Shapiro-Wilk method and Pearson’s product-moment correlation.

Findings: The result from the discrimination index calculation is a two-digit decimal value called the discrimination-offset (D-offset), which ranges from 0.00 to 0.99. The correlation between the D-offset and the number of uncited papers is 0.35, with the number of indexed papers it is 0.24, and with the number of cited papers it is 0.27. The test indicates that it is very unlikely that there exists no relationship between the parameters.

Practical implications: For this reason, D-offset is proposed as an additional parameter for H-index to differentiate researchers with the same H-index. The H-index for researchers can be written with the format of “H-index: D-offset”.

Originality/value: D-offset is worth considering as a complement to the H-index value. If the D-offset is appended to the H-index value, the H-index will have more discrimination power to differentiate the rank of researchers who have the same H-index.
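
As a rough illustration of the quantity the method builds on, the standard Jain's fairness index of a list of values is (sum of x)^2 / (n * sum of x^2). The sketch below applies it to two hypothetical citation profiles that share an H-index of 3. The abstract does not state how the index is mapped to the D-offset, so that step is not shown, and the function name and sample data are assumptions made only for this example.

```python
def jain_fairness_index(values):
    """Standard Jain's fairness index: (sum x)^2 / (n * sum x^2), in (0, 1]."""
    n = len(values)
    total = sum(values)
    squares = sum(x * x for x in values)
    if n == 0 or squares == 0:
        raise ValueError("need at least one non-zero value")
    return (total * total) / (n * squares)


# Hypothetical per-paper citation counts; both profiles have H-index 3.
even_profile = [3, 3, 3]
skewed_profile = [30, 3, 3]

even_index = jain_fairness_index(even_profile)      # evenly spread citations
skewed_index = jain_fairness_index(skewed_profile)  # one dominant paper

print(round(even_index, 2), round(skewed_index, 2))
```

The two profiles are indistinguishable by H-index alone, while the fairness index separates them: an even spread yields the maximum value and a single dominant paper pulls the value down.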

A Micro Perspective of Research Dynamics Through “Citations of Citations” Topic Analysis
Xiaoli Chen, Tao Han
Journal of Data and Information Science    2020, 5 (4): 19-34.   doi:10.2478/jdis-2020-0034

Purpose: Research dynamics have long been a research interest. It is a macro perspective tool for discovering temporal research trends of a certain discipline or subject. A micro perspective of research dynamics, however, concerning a single researcher or a highly cited paper in terms of their citations and “citations of citations” (forward chaining) remains unexplored.

Design/methodology/approach: In this paper, we use a cross-collection topic model to reveal the research dynamics of topic disappearance, topic inheritance, and topic innovation in each generation of forward chaining.

Findings: For highly cited work, scientific influence exists in indirect citations. Topic modeling can reveal how long this influence exists in forward chaining, as well as its influence.

Research limitations: This paper measures scientific influence and indirect scientific influence only if the relevant words or phrases are borrowed or used in direct or indirect citations. Paraphrasing or semantically similar concepts may be neglected in this research.

Practical implications: This paper demonstrates that a scientific influence exists in indirect citations through its analysis of forward chaining. This can serve as an inspiration on how to adequately evaluate research influence.

Originality: The main contributions of this paper are the following three aspects. First, besides research dynamics of topic inheritance and topic innovation, we model topic disappearance by using a cross-collection topic model. Second, we explore the length and character of the research impact through “citations of citations” content analysis. Finally, we analyze the research dynamics of artificial intelligence researcher Geoffrey Hinton’s publications and the topic dynamics of forward chaining.

Can Crossref Citations Replace Web of Science for Research Evaluation? The Share of Open Citations
Tomáš Chudlarský, Jan Dvořák
Journal of Data and Information Science    2020, 5 (4): 35-42.   doi:10.2478/jdis-2020-0037

Purpose: We study the proportion of Web of Science (WoS) citation links that are represented in the Crossref Open Citation Index (COCI), with the possible aim of using COCI in research evaluation instead of the WoS, if the level of coverage was sufficient.

Design/methodology/approach: We calculate the proportion of citation links where both publications have a WoS accession number and a DOI simultaneously, and where the cited publications have had at least one author from our institution, the Czech Technical University in Prague. We attempt to look up each such citation link in COCI.

Findings: We find that 53.7% of WoS citation links are present in the COCI. The proportion varies largely by discipline. The total figures differ significantly from 40% in the large-scale study by Van Eck, Waltman, Larivière, and Sugimoto (blog 2018,

Research limitations: The sample does not cover all science areas uniformly; it is heavily focused on Engineering and Technology, and only some disciplines of Natural Sciences are present. However, this reflects the real scientific orientation and publication profile of our institution.

Practical implications: The current level of coverage is not sufficient for the WoS to be replaced by COCI for research evaluation.

Originality/value: The present study illustrates a COCI vs WoS comparison on the scale of a larger technical university in Central Europe.
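
The coverage figure described above is, in essence, the share of WoS citation links that can also be found in COCI. A minimal sketch of that calculation follows, assuming each citation link is represented as a (citing DOI, cited DOI) pair. The function name and the DOIs are placeholders invented for illustration, not data from the study.

```python
def coverage_share(wos_links, coci_links):
    """Fraction of distinct WoS citation links that also appear in COCI."""
    wos = set(wos_links)
    if not wos:
        raise ValueError("no WoS citation links to check")
    found = wos & set(coci_links)
    return len(found) / len(wos)


# Placeholder (citing DOI, cited DOI) pairs, hypothetical data only.
wos_links = [
    ("10.1000/a", "10.1000/x"),
    ("10.1000/b", "10.1000/x"),
    ("10.1000/c", "10.1000/y"),
    ("10.1000/d", "10.1000/y"),
]
coci_links = [
    ("10.1000/a", "10.1000/x"),
    ("10.1000/c", "10.1000/y"),
    ("10.1000/z", "10.1000/y"),  # present in COCI but not in the WoS sample
]

share = coverage_share(wos_links, coci_links)
print(f"{share:.1%} of WoS links found in COCI")
```

Links that exist only in COCI do not affect the result, because the denominator is the WoS sample.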

Exploring the Potentialities of Automatic Extraction of University Webometric Information
Gianpiero Bianchi, Renato Bruni, Cinzia Daraio, Antonio Laureti Palma, Giulio Perani, Francesco Scalfati
Journal of Data and Information Science    2020, 5 (4): 43-55.   doi:10.2478/jdis-2020-0040

Purpose: The main objective of this work is to show the potentialities of recently developed approaches for automatic knowledge extraction directly from the universities’ websites. The information automatically extracted can be potentially updated with a frequency higher than once per year, and be safe from manipulations or misinterpretations. Moreover, this approach allows us flexibility in collecting indicators about the efficiency of universities’ websites and their effectiveness in disseminating key contents. These new indicators can complement traditional indicators of scientific research (e.g. number of articles and number of citations) and teaching (e.g. number of students and graduates) by introducing further dimensions to allow new insights for “profiling” the analyzed universities.

Design/methodology/approach: Webometrics relies on web mining methods and techniques to perform quantitative analyses of the web. This study implements an advanced application of the webometric approach, exploiting all the three categories of web mining: web content mining; web structure mining; web usage mining. The information to compute our indicators has been extracted from the universities’ websites by using web scraping and text mining techniques. The scraped information has been stored in a NoSQL DB according to a semi-structured form to allow for retrieving information efficiently by text mining techniques. This provides increased flexibility in the design of new indicators, opening the door to new types of analyses. Some data have also been collected by means of batch interrogations of search engines (Bing) or from a leading provider of Web analytics (SimilarWeb). The information extracted from the Web has been combined with the University structural information taken from the European Tertiary Education Register, a database collecting information on Higher Education Institutions (HEIs) at European level. All the above was used to perform a clusterization of 79 Italian universities based on structural and digital indicators.

Findings: The main findings of this study concern the evaluation of the potential in digitalization of universities, in particular by presenting techniques for the automatic extraction of information from the web to build indicators of quality and impact of universities’ websites. These indicators can complement traditional indicators and can be used to identify groups of universities with common features using clustering techniques working with the above indicators.

Research limitations: The results reported in this study refer to Italian universities only, but the approach could be extended to other university systems abroad.

Practical implications: The approach proposed in this study and its illustration on Italian universities show the usefulness of recently introduced automatic data extraction and web scraping approaches and their practical relevance for characterizing and profiling the activities of universities on the basis of their websites. The approach could be applied to other university systems.

Originality/value: This work applies for the first time to university websites some recently introduced techniques for automatic knowledge extraction based on web scraping, optical character recognition and nontrivial text mining operations (Bruni & Bianchi, 2020).

The Association between Researchers’ Conceptions of Research and Their Strategic Research Agendas
João M. Santos, Hugo Horta
Journal of Data and Information Science    2020, 5 (4): 56-74.   doi:10.2478/jdis-2020-0032

Purpose: In studies of the research process, the association between how researchers conceptualize research and their strategic research agendas has been largely overlooked. This study aims to address this gap.

Design/methodology/approach: This study analyzes this relationship using a dataset of more than 8,500 researchers across all scientific fields and the globe. It studies the associations between the dimensions of two inventories: the Conceptions of Research Inventory (CoRI) and the Multi-Dimensional Research Agenda Inventory—Revised (MDRAI-R).

Findings: The findings show a relatively strong association between researchers’ conceptions of research and their research agendas. While all conceptions of research are positively related to scientific ambition, the findings are mixed regarding how the dimensions of the two inventories relate to one another, which is significant for those seeking to understand the knowledge production process better.

Research limitations: The study relies on self-reported data, which always carries a risk of response bias.

Practical implications: The findings provide a greater understanding of the inner workings of knowledge processes and indicate that the two inventories, whether used individually or in combination, may provide complementary analytical perspectives to research performance indicators. They may thus offer important insights for managers of research environments regarding how to assess the research culture, beliefs, and conceptualizations of individual researchers and research teams when designing strategies to promote specific institutional research focuses and strategies.

Originality/value: To the best of the authors’ knowledge, this is the first study to associate research agendas and conceptions of research. It is based on a large sample of researchers working worldwide and in all fields of knowledge, which ensures that the findings have a reasonable degree of generalizability to the global population of researchers.

Current Status and Enhancement of Collaborative Research in the World: A Case Study of Osaka University
Shino Iwami, Toshihiko Shimizu, Melvin John F. Empizo, Jacque Lynn F. Gabayno, Nobuhiko Sarukura, Shota Fujii, Yoshinari Sumimura
Journal of Data and Information Science    2020, 5 (4): 75-85.   doi:10.2478/jdis-2020-0035

Purpose: The purpose of this research is to provide evidence for decision-makers to realize the potentials of collaborations between countries/regions via the scientometric analysis of co-authoring in academic publications.

Design/methodology/approach: The approach is that Osaka University, which has set a strategy to become a global campus, is positioned to have a leading role to enhance such collaborations. This research measures co-authoring relations between Osaka University and other countries/regions to identify networks for fostering strong research collaborations.

Findings: Five countries are identified as candidates for the future global campuses of Osaka University based on three factors: co-authoring relations, GDP growth, and population growth.

Research limitations: The main limitation of this study is not being able to use the relations by the former positions of authors in Osaka University, because the data retrieved is limited by the query of the organization name at the first step.

Practical implications: The significance of this work is to provide evidence for the university strategy to expand abroad based on the quantity and visualization of trends.

Originality/value: With wider practical implementations, the approach of this research is useful in making a strategic roadmap for scientific organizations that intend to collaborate internationally.

Global Collaboration in Artificial Intelligence: Bibliometrics and Network Analysis from 1985 to 2019
Haotian Hu, Dongbo Wang, Sanhong Deng
Journal of Data and Information Science    2020, 5 (4): 86-115.   doi:10.2478/jdis-2020-0027

Purpose: This study aims to explore the trend and status of international collaboration in the field of artificial intelligence (AI) and to understand the hot topics, core groups, and major collaboration patterns in global AI research.

Design/methodology/approach: We selected 38,224 papers in the field of AI from 1985 to 2019 in the core collection database of Web of Science (WoS) and studied international collaboration from the perspectives of authors, institutions, and countries through bibliometric analysis and social network analysis.

Findings: The bibliometric results show that in the field of AI, the number of published papers is increasing every year, and 84.8% of them are cooperative papers. Collaboration with more than three authors, collaboration between two countries, and collaboration within institutions are the three main levels of collaboration patterns. Through social network analysis, this study found that the US, the UK, France, and Spain led global collaboration research in the field of AI at the country level, while Vietnam, Saudi Arabia, and United Arab Emirates had a high degree of international participation. Collaboration at the institution level reflects obvious regional and economic characteristics. There are the Developing Countries Institution Collaboration Group led by Iran, China, and Vietnam, as well as the Developed Countries Institution Collaboration Group led by the US, Canada, and the UK. Also, the Chinese Academy of Sciences (China) plays an important, pivotal role in connecting these institutional collaboration groups.

Research limitations: First, participant contributions in international collaboration may have varied, but in our research they are viewed equally when building collaboration networks. Second, although the edge weight in the collaboration network is considered, it is only used to help reduce the network and does not reflect the strength of collaboration.

Practical implications: The findings fill the current shortage of research on international collaboration in AI. They will help inform scientists and policy makers about the future of AI research.

Originality/value: This work is the longest to date regarding international collaboration in the field of AI. This research explores the evolution, future trends, and major collaboration patterns of international collaboration in the field of AI over the past 35 years. It also reveals the leading countries, core groups, and characteristics of collaboration in the field of AI.
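
To make the country-level collaboration counts concrete, the sketch below builds weighted co-authorship edges between countries from a list of papers and computes the share of internationally co-authored papers. It is only an illustration: the input format (one list of author countries per paper), the function name, and the sample records are assumptions, and the study's actual network construction is not described in the abstract.

```python
from collections import Counter
from itertools import combinations


def country_pairs(papers):
    """Count how many papers each unordered pair of countries co-authored."""
    pairs = Counter()
    for countries in papers:
        # Deduplicate and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(countries)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical records: the author countries of four papers.
papers = [
    ["US", "UK"],
    ["US", "UK", "France"],
    ["China"],
    ["US", "China"],
]

pairs = country_pairs(papers)
international_share = sum(1 for c in papers if len(set(c)) > 1) / len(papers)

print(pairs.most_common(1), international_share)
```

Every participating country is counted equally per paper here, which mirrors the first limitation stated above.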

Priorities for Social and Humanities Projects Based on Text Analysis
Ülle Must
Journal of Data and Information Science    2020, 5 (4): 116-125.   doi:10.2478/jdis-2020-0036

Purpose: Changes in the world show that the role, importance, and coherence of SSH (social sciences and the humanities) will increase significantly in the coming years. This paper aims to monitor and analyze the evolution (or overlapping) of the SSH thematic pattern through three funding instruments since 2007.

Design/methodology/approach: The goal of the paper is to check to what extent the EU Framework Program (FP) affects research at the national level, and to highlight hot topics from a given period with the help of text analysis. Funded project titles and abstracts derived from the EU FP, Slovenian, and Estonian RIS were used. The final analysis and comparisons between different datasets were made based on the 200 most frequent words. After removing punctuation marks, numeric values, articles, prepositions, conjunctions, and auxiliary verbs, 4,854 unique words in ETIS, 4,421 unique words in the Slovenian Research Information System (SICRIS), and 3,950 unique words in FP were identified.

Findings: Across all funding instruments, about a quarter of the top words constitute half of the word occurrences. The text analysis results show that in the majority of cases words do not overlap between FP and nationally funded projects. In some cases, it may be due to using different vocabulary. There is more overlapping between words in the case of Slovenia (SL) and Estonia (EE) and less in the case of Estonia and EU Framework Programmes (FP). At the same time, overlapping words indicate a wider reach (culture, education, social, history, human, innovation, etc.). In nationally funded projects (bottom-up), it was relatively difficult to observe the change in thematic trends over time. More specific results emerged from the comparison of the different programs throughout FP (top-down).

Research limitations: Only projects with English titles and abstracts were analyzed.

Practical implications: The specifics of SSH have to be taken into account: the one-to-one meaning of terms/words is not as important as, for example, in the exact sciences. Thus, even in co-word analysis, the final content may go unnoticed.

Originality/value: This was the first attempt to monitor the trends of SSH projects using text analysis. The text analysis of the SSH projects of the two new EU Member States used in the study showed that SSH’s thematic coverage is not much affected by the EU Framework Program. Whether this result is field-specific or country-specific should be shown in the following study, which targets SSH projects in the so-called old Member States.
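
The word-overlap comparison described in the methodology can be sketched as follows: tokenize the project texts, drop punctuation, numbers, and function words, keep the most frequent words per dataset, and intersect the resulting lists. The tiny stop-word set, the function names, and the sample titles are assumptions made for illustration only. The study used the 200 most frequent words, while this toy example keeps 5.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list, assumed for this example only.
STOP_WORDS = {"the", "of", "and", "in", "a", "an", "to", "for", "on", "is", "are"}


def top_words(texts, k):
    """Return the k most frequent words, ignoring punctuation, digits, and stop words."""
    counts = Counter()
    for text in texts:
        # [a-z]+ keeps alphabetic tokens only, so punctuation and numbers drop out.
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in STOP_WORDS:
                counts[word] += 1
    return [word for word, _ in counts.most_common(k)]


def overlap(words_a, words_b):
    """Words that appear in both top-word lists, in alphabetical order."""
    return sorted(set(words_a) & set(words_b))


# Hypothetical project titles standing in for two funding instruments.
national = ["History of education in the region", "Social history and culture"]
framework = ["Innovation in education policy", "Culture and social innovation"]

shared = overlap(top_words(national, 5), top_words(framework, 5))
print(shared)
```

Because the comparison is purely lexical, projects that describe the same theme with different vocabulary will not overlap, which is the caveat the findings raise.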

Topic Evolution and Emerging Topic Analysis Based on Open Source Software
Xiang Shen, Li Wang
Journal of Data and Information Science    2020, 5 (4): 126-136.   doi:10.2478/jdis-2020-0033

Purpose: We present an analytical, open source and flexible natural language processing and text mining method for topic evolution, emerging topic detection and research trend forecasting for all kinds of data-tagged text.

Design/methodology/approach: We make full use of the functions provided by the open source VOSviewer and Microsoft Office, including a thesaurus for data clean-up and a LOOKUP function for comparative analysis.

Findings: Through application and verification in the domain of perovskite solar cells research, this method proves to be effective.

Research limitations: A certain amount of manual data processing and a specific research domain background are required for better, more illustrative analysis results. Adequate time for analysis is also necessary.

Practical implications: We try to set up an easy, useful, and flexible interdisciplinary text analyzing procedure for researchers, especially those without solid computer programming skills or who cannot easily access complex software. This procedure can also serve as a wonderful example for teaching information literacy.

Originality/value: This text analysis approach has not been reported before.

Scientometric Analysis of Research Output from Brazil in Response to the Zika Crisis Using e-Lattes
Ricardo Barros Sampaio, Antônio de Abreu Batista-Jr, Bruno Santos Ferreira, Mauricio L. Barreto, Jesús P. Mena-Chalco
Journal of Data and Information Science    2020, 5 (4): 137-146.   doi:10.2478/jdis-2020-0038

Purpose: This paper aims to test the use of e-Lattes to map the Brazilian scientific output in a recent research health subject: Zika Virus.

Design/methodology/approach: From a set of Lattes CVs of Zika researchers registered on the Lattes Platform, we used the e-Lattes to map the Brazilian scientific response to the Zika crisis.

Findings: Brazilian science articulated quickly during the public health emergency of international concern (PHEIC) due to the creation of mechanisms to streamline funding of scientific research.

Research limitations: We did not assess any dimension of research quality, including the scientific impact and societal value.

Practical implications: e-Lattes can provide useful guidelines for different stakeholders in research groups from Lattes CVs of members.

Originality/value: The information included in Lattes CVs permits us to assess science from a broader perspective taking into account not only scientific research production but also the training of human resources and scientific collaboration.

Detection of Malignant and Benign Breast Cancer Using the ANOVA-BOOTSTRAP-SVM
Borislava Petrova Vrigazova
Journal of Data and Information Science    2020, 5 (2): 62-75.   doi:10.2478/jdis-2020-0012

Purpose: The aim of this research is to propose a modification of the ANOVA-SVM method that can increase accuracy when detecting benign and malignant breast cancer.

Methodology: We proposed a new method ANOVA-BOOTSTRAP-SVM. It involves applying the analysis of variance (ANOVA) to support vector machines (SVM) but we use the bootstrap instead of cross validation as a train/test splitting procedure. We have tuned the kernel and the C parameter and tested our algorithm on a set of breast cancer datasets.

Findings: By using the new method proposed, we succeeded in improving accuracy ranging from 4.5 percentage points to 8 percentage points depending on the dataset.

Research limitations: The algorithm is sensitive to the type of kernel and value of the optimization parameter C.

Practical implications: We believe that the ANOVA-BOOTSTRAP-SVM can be used not only to recognize the type of breast cancer but also for broader research in all types of cancer.

Originality/value: Our findings are important as the algorithm can detect various types of cancer with higher accuracy compared to standard versions of the Support Vector Machines.
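
The methodology replaces cross validation with the bootstrap as the train/test splitting procedure. One common way to do this, assumed here because the abstract does not give the details, is to draw the training set with replacement and to test on the samples that were never drawn (the out-of-bag samples). The sketch below shows only that splitting step. The ANOVA feature selection and the SVM itself are not shown, and the function name is hypothetical.

```python
import random


def bootstrap_split(n_samples, rng):
    """Draw n_samples training indices with replacement; test on the out-of-bag rest."""
    train = [rng.randrange(n_samples) for _ in range(n_samples)]
    in_train = set(train)
    test = [i for i in range(n_samples) if i not in in_train]
    return train, test


rng = random.Random(0)  # fixed seed so the split is reproducible
train_idx, test_idx = bootstrap_split(10, rng)

print("train:", sorted(train_idx))
print("test (out-of-bag):", test_idx)
```

Unlike a k-fold split, the training set keeps the full sample size but contains repeated rows, and the size of the test set varies from draw to draw.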

FAIR + FIT: Guiding Principles and Functional Metrics for Linked Open Data (LOD) KOS Products
Marcia Lei Zeng, Julaine Clunis
Journal of Data and Information Science    2020, 5 (1): 93-118.   doi:10.2478/jdis-2020-0008
Accepted: 17 April 2020


Purpose: To develop a set of metrics and identify criteria for assessing the functionality of LOD KOS products while providing common guiding principles that can be used by LOD KOS producers and users to maximize the functions and usages of LOD KOS products.

Design/methodology/approach: Data collection and analysis were conducted at three time periods in 2015-16, 2017 and 2019. The sample data used in the comprehensive data analysis comprises all datasets tagged as types of KOS in the Datahub and extracted through their respective SPARQL endpoints. A comparative study of the LOD KOS collected from terminology services Linked Open Vocabularies (LOV) and BioPortal was also performed.

Findings: The study proposes a set of Functional, Impactful and Transformable (FIT) metrics for LOD KOS as value vocabularies. The FAIR principles, with additional recommendations, are presented for LOD KOS as open data.

Research limitations: The metrics need to be further tested and aligned with the best practices and international standards of both open data and various types of KOS.

Practical implications: Assessments performed with FAIR and FIT metrics support the creation and delivery of user-friendly, discoverable and interoperable LOD KOS datasets which can be used for innovative applications, act as a knowledge base, become a foundation of semantic analysis and entity extractions and enhance research in science and the humanities.

Originality/value: Our research provides best practice guidelines for LOD KOS as value vocabularies.

Improving Archival Records and Service of Traditional Korean Performing Arts in a Semantic Web Environment
Ziyoung Park, Hosin Lee, Seungchon Kim, Sungjae Park
Journal of Data and Information Science    2020, 5 (1): 68-80.   doi:10.2478/jdis-2020-0006
Accepted: 17 April 2020


Purpose: This research project aims to organize the archival information of traditional Korean performing arts in a semantic web environment. Key requirements, which the archival records manager should consider for publishing and distribution of gugak performing archival information in a semantic web environment, are presented from the perspective of linked data.

Design/methodology/approach: This study analyzes the metadata provided by the National Gugak Center’s Gugak Archive, the search and browse menus of Gugak Archive’s website and K-PAAN, the performing arts portal site.

Findings: The importance of consistency, continuity, and systematicity—crucial qualities in traditional record management practices—is undiminished in a semantic web environment. However, a semantic web environment also requires new tools such as web identifiers (URIs), data models (RDF), and link information (interlinking).

Research limitations: The scope of this study does not include practical implementation strategies for the archival records management system and website services. The suggestions also do not discuss issues related to copyright or policy coordination between related organizations.

Practical implications: The findings of this study can assist records managers in converting a traditional performing arts information archive into a semantic web environment-based online archival service and system. This can also be useful for collaboration with record managers who are unfamiliar with relational or triple database systems.

Originality/value: This study analyzed the metadata of the Gugak Archive and its online services to present practical requirements for managing and disseminating gugak performing arts information in a semantic web environment. In applying the principles and methods of semantic web services to the Gugak Archive, this study can contribute to the improvement of information organization and services in the field of Korean traditional music.

The ARQUIGRAFIA project: A Web Collaborative Environment for Architecture and Urban Heritage Image
Vânia Mara Alves Lima, Cibele Araújo Camargo Marques dos Santos, Artur Simões Rozestraten
Journal of Data and Information Science    2020, 5 (1): 51-67.   doi:10.2478/jdis-2020-0005
Accepted: 22 June 2012


Purpose: This paper presents the ARQUIGRAFIA project, an open, public and nonprofit, continuous growth web collaborative environment dedicated to Brazilian architectural photographic images.

Design/methodology/approach: The ARQUIGRAFIA project promotes active and collaborative participation among its institutional users (GLAMs, NGOs, laboratories and research groups) and private users (students, professionals, professors, researchers). Both can create an account and share their digitized iconographic collections in the same Web environment by uploading their files, indexing, georeferencing and assigning a Creative Commons license.

Findings: The project developed user interactions by recording semantic-differential impressions of visible plastic-spatial aspects of the architectures in synthetic infographics, and by retrieving images through an advanced search system based on those impression parameters. Through gamification, the system often invites users to review images in order to improve the accuracy of image data. A pilot project named Open Air Museum allows users to add audio descriptions to images in situ. An interface for users’ digital curatorship will soon be available.

Research limitations: The ARQUIGRAFIA multidisciplinary team, which gathers professors-researchers and graduate and undergraduate students from the Architecture and Urbanism, Design, Information Science, and Computer Science faculties of the University of São Paulo, demands continuous financial resources for grants, for contracting third party services, for participation in scientific events in Brazil and abroad, and for equipment. Since 2016, significant budget cuts in the University of São Paulo’s own research funds and in Brazilian federal scientific agencies can compromise the continuity of this project.

Practical implications: The open source template called +GRAFIA can freely help other areas of knowledge to build their own visual Web collaborative environments.

Originality/value: The collaborative nature of the ARQUIGRAFIA distinguishes it from institutional image databases on the internet, precisely because it involves a heterogeneous network of collaborators.

Automatic Classification of Swedish Metadata Using Dewey Decimal Classification: A Comparison of Approaches
Koraljka Golub, Johan Hagelbäck, Anders Ardö
Journal of Data and Information Science    2020, 5 (1): 18-38.   doi:10.2478/jdis-2020-0003
Accepted: 17 April 2020


Purpose: With more and more digital collections of various information resources becoming available, also increasing is the challenge of assigning subject index terms and classes from quality knowledge organization systems. While the ultimate purpose is to understand the value of automatically produced Dewey Decimal Classification (DDC) classes for Swedish digital collections, the paper aims to evaluate the performance of six machine learning algorithms as well as a string-matching algorithm based on characteristics of DDC.

Design/methodology/approach: State-of-the-art machine learning algorithms require at least 1,000 training examples per class. The complete data set at the time of research involved 143,838 records which had to be reduced to top three hierarchical levels of DDC in order to provide sufficient training data (totaling 802 classes in the training and testing sample, out of 14,413 classes at all levels).

Findings: Evaluation shows that Support Vector Machine with linear kernel outperforms other machine learning algorithms as well as the string-matching algorithm on average; the string-matching algorithm outperforms machine learning for specific classes when characteristics of DDC are most suitable for the task. Word embeddings combined with different types of neural networks (simple linear network, standard neural network, 1D convolutional neural network, and recurrent neural network) produced worse results than Support Vector Machine but came close, with the benefit of a smaller representation size. Analysis of the impact of features in machine learning shows that using keywords or combining titles and keywords gives better results than using only titles as input. Stemming only marginally improves the results. Removing stop-words reduced accuracy in most cases, while removing less frequent words increased it marginally. The greatest impact is produced by the number of training examples: 81.90% accuracy on the training set is achieved when at least 1,000 records per class are available in the training set, and 66.13% when too few records (often fewer than 100 per class) are available on which to train. These figures hold only for the top three hierarchical levels (803 instead of 14,413 classes).

Research limitations: Having to reduce the number of hierarchical levels to the top three levels of DDC because of the lack of training data for all classes skews the results, so that they work in experimental conditions but barely for end users in operational retrieval systems.

Practical implications: In conclusion, for operative information retrieval systems, applying purely automatic DDC classification does not work, either using machine learning (because of the lack of training data for the large number of DDC classes) or using a string-matching algorithm (because DDC characteristics perform well for automatic classification only in a small number of classes). Over time, more training examples may become available, and DDC may be enriched with synonyms in order to enhance the accuracy of automatic classification, which may also benefit information retrieval performance based on DDC. In order for quality information services to reach the objective of the highest possible precision and recall, automatic classification should never be implemented on its own; instead, machine-aided indexing that combines the efficiency of automatic suggestions with the quality of human decisions at the final stage should be the way for the future.

Originality/value: The study explored machine learning on a large classification system of over 14,000 classes which is used in operational information retrieval systems. Due to lack of sufficient training data across the entire set of classes, an approach complementing machine learning, that of string matching, was applied. This combination should be explored further since it provides the potential for real-life applications with large target classification systems.

Knowledge Organization and Representation under the AI Lens
Jian Qin
Journal of Data and Information Science    2020, 5 (1): 3-17.   doi:10.2478/jdis-2020-0002
Accepted: 17 April 2020

Purpose: This paper compares the paradigmatic differences between knowledge organization (KO) in library and information science and knowledge representation (KR) in AI to show the convergence in KO and KR methods and applications.

Methodology: A literature review and a comparative analysis of the KO and KR paradigms are the primary methods used in this paper.

Findings: A key difference between KO and KR lies in their purposes: KO organizes knowledge into a certain structure to standardize and/or normalize the vocabulary of concepts and relations, while KR is problem-solving oriented. Differences between KO and KR are discussed based on the goal, methods, and functions.

Research limitations: This is only preliminary research with a case study as proof of concept.

Practical implications: The paper articulates the opportunities in applying KR and other AI methods and techniques to enhance the functions of KO.

Originality/value: Ontologies and linked data as the evidence of the convergence of KO and KR paradigms provide theoretical and methodological support to innovate KO in the AI era.

The Second Edition of the Integrative Levels Classification: Evolution of a KOS
Ziyoung Park, Claudio Gnoli, Daniele P. Morelli
Journal of Data and Information Science    2020, 5 (1): 39-50.   doi:10.2478/jdis-2020-0004
Accepted: 17 April 2020

Purpose: This paper informs about the publication of the second edition of the Integrative Levels Classification (ILC2), a freely-faceted knowledge organization system (KOS), and reviews the main changes that have been introduced as compared to its first edition (ILC1).

Design/methodology/approach: The most relevant changes are illustrated, with special reference to those of interest to general classification theory, by means of examples of notation for individual classes and combinations of them.

Findings: Changes introduced in ILC2 include: the names and order of some main classes; the development of subclasses for various phenomena, especially quantities and algebraic structures; the order of facet categories and the new category of Disorder; notation for special facets; distinction of the semantical function of facets (attributes) from their syntactic function. The system can be freely accessed online through a PHP browser as well as in SKOS format.

Research limitations: Only a selection of changed classes is discussed for space reasons.

Practical implications: ILC1 has been previously applied to the BARTOC directory of KOSs. Update of BARTOC data to ILC2 and application of ILC2 to further information systems are envisaged. Possible methods for reclassifying BARTOC with ILC2 are discussed.

Originality: ILC is a newly developed classification system, based on phenomena instead of traditional disciplines and featuring various innovative devices. This paper is an original account of its most recent evolution.

“SEMANTIC” in a Digital Curation Model
Hyewon Lee, Soyoung Yoon, Ziyoung Park
Journal of Data and Information Science    2020, 5 (1): 81-92.   doi:10.2478/jdis-2020-0007
Accepted: 17 April 2020

Purpose: This study proposes an abstract model that gathers concepts focused on resource representation and description in digital curation, and suggests a conceptual model that emphasizes semantic enrichment within a digital curation model.

Design/methodology/approach: This study conducts a literature review to analyze the preceding curation models, DCC CLM, DCC&U, UC3, and DCN.

Findings: In this study, the concept of semantic enrichment is expressed in a single word, SEMANTIC. The Semantic Enrichment Model, SEMANTIC, has eight elements: subject, extraction, multi-language, authority, network, thing, identity, and connect.

Research limitations: This study does not reflect the actual information environment because it focuses on the concepts of the representation of digital objects.

Practical implications: This study presents the main considerations for creating and reinforcing the description and representation of digital objects when building and developing digital curation models in specific institutions.

Originality/value: This study summarizes the elements that should be emphasized in the representation of digital objects in terms of information organization.

A Metric Approach to Hot Topics in Biomedicine via Keyword Co-occurrence
Jane H. Qin, Jean J. Wang, Fred Y. Ye
Journal of Data and Information Science    2019, 4 (4): 13-25.   doi:10.2478/jdis-2019-0018
Accepted: 19 December 2019

Purpose: To reveal the research hotspots of, and the relationships among, three hot research topics in biomedicine, namely CRISPR, iPS (induced Pluripotent Stem) cell and Synthetic biology.

Design/methodology/approach: We set up their keyword co-occurrence networks, using three indicators and information visualization for metric analysis.
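
As a rough illustration of how a keyword co-occurrence network can be assembled, the sketch below counts how often two keywords appear on the same paper. It is a generic sketch rather than the authors' pipeline: the function name, the lower-casing normalization, and the sample keyword lists are assumptions, and the three indicators used in the paper are not reproduced here.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(papers):
    """Count how often each unordered keyword pair appears on the same paper.

    `papers` is a list of keyword lists, one list per paper. Keywords are
    lower-cased and de-duplicated per paper (an assumed normalization).
    """
    pair_counts = Counter()
    for keywords in papers:
        unique = sorted({k.strip().lower() for k in keywords})
        # Each pair of distinct keywords on a paper is one edge occurrence.
        pair_counts.update(combinations(unique, 2))
    return pair_counts


# Hypothetical keyword lists for three papers.
papers = [
    ["CRISPR", "Cas9", "genome editing"],
    ["CRISPR", "genome editing"],
    ["iPS cell", "CRISPR"],
]

edges = cooccurrence_counts(papers)
for (a, b), weight in edges.most_common():
    print(f"{a} -- {b}: {weight}")
```

The resulting pair counts can serve as edge weights of the network, with keywords as nodes.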

Findings: The results reveal that the main research hotspots in the three topics are different, but the overlapping keywords in the three topics indicate that they are mutually integrated and interact with each other.

Research limitations: All analyses are based on keywords only, without any other forms of data.

Practical implications: We try to find the information distribution and structure of these three hot topics for revealing their research status and interactions, and for promoting biomedical developments.

Originality/value: We chose the core keywords in three research hot topics in biomedicine by using h-index.

CiteOpinion: Evidence-based Evaluation Tool for Academic Contributions of Research Papers Based on Citing Sentences
Xiaoqiu Le, Jingdan Chu, Siyi Deng, Qihang Jiao, Jingjing Pei, Liya Zhu, Junliang Yao
Journal of Data and Information Science    2019, 4 (4): 26-41.   doi:10.2478/jdis-2019-0019
Accepted: 19 December 2019

Purpose: To uncover the evaluation information on the academic contribution of research papers cited by peers based on the content cited by citing papers, and to provide an evidence-based tool for evaluating the academic value of cited papers.

Design/methodology/approach: CiteOpinion uses a deep learning model to automatically extract citing sentences from representative citing papers; it starts with an analysis of the citing sentences, then identifies the major academic contribution points of the cited paper, positive/negative evaluations from citing authors, and the changes in the subjects of subsequent citing authors by means of Recognizing Categories of Moves (problems, methods, conclusions, etc.), sentiment analysis, and topic clustering.

Findings: Citing sentences in a citing paper contain substantial evidence useful for academic evaluation. They can also be used to objectively and authentically reveal the nature and degree of contribution of the cited paper reflected by citation, beyond simple citation statistics.

Practical implications: The evidence-based evaluation tool CiteOpinion can provide an objective and in-depth academic value evaluation basis for the representative papers of scientific researchers, research teams, and institutions.

Originality/value: No other similar practical tool is found in papers retrieved.

Research limitations: There are difficulties in acquiring full text of citing papers. There is a need to refine the calculation based on the sentiment scores of citing sentences. Currently, the tool is only used for academic contribution evaluation, while its value in policy studies, technical application, and promotion of science is not yet tested.

Masked Sentence Model Based on BERT for Move Recognition in Medical Scientific Abstracts
Gaihong Yu, Zhixiong Zhang, Huan Liu, Liangping Ding
Journal of Data and Information Science    2019, 4 (4): 42-55.   doi:10.2478/jdis-2019-0020
Accepted: 19 December 2019

Purpose: Move recognition in scientific abstracts is an NLP task of classifying sentences of the abstracts into different types of language units. To improve the performance of move recognition in scientific abstracts, a novel model of move recognition is proposed that outperforms the BERT-based method.

Design/methodology/approach: Prevalent models based on BERT for sentence classification often classify sentences without considering the context of the sentences. In this paper, inspired by the BERT masked language model (MLM), we propose a novel model called the masked sentence model that integrates the content and contextual information of the sentences in move recognition. Experiments are conducted on the benchmark dataset PubMed 20K RCT in three steps. Then, we compare our model with HSLN-RNN, BERT-based and SciBERT using the same dataset.

Findings: The F1 score of our model exceeds those of the BERT-based and SciBERT models by 4.96% and 4.34%, respectively, which shows the feasibility and effectiveness of the novel model. Its result comes closest to the current state-of-the-art result of HSLN-RNN.

Research limitations: The sequential features of move labels are not considered, which might be one of the reasons why HSLN-RNN has better performance. Our model is restricted to dealing with biomedical English literature because we use a dataset from PubMed, which is a typical biomedical database, to fine-tune our model.

Practical implications: The proposed model is better and simpler in identifying move structures in scientific abstracts and is worthy of text classification experiments for capturing contextual features of sentences.

Originality/value: The study proposes a masked sentence model based on BERT that considers the contextual features of the sentences in abstracts in a new way. The performance of this classification model is significantly improved by rebuilding the input layer without changing the structure of neural networks.

Identification of Sarcasm in Textual Data: A Comparative Study
Pulkit Mehndiratta, Devpriya Soni
Journal of Data and Information Science    2019, 4 (4): 56-83.   doi:10.2478/jdis-2019-0021
Accepted: 19 December 2019

Purpose: The ever-increasing penetration of the Internet in our lives has led to an enormous amount of multimedia content being generated on the Internet. Textual data contributes a major share of the data generated on the World Wide Web. Understanding people’s sentiment is an important aspect of natural language processing, but the inferred opinion can be biased and incorrect if people use sarcasm while commenting, posting status updates, or reviewing a product or a movie. Thus, it is of utmost importance to detect sarcasm correctly and make a correct prediction about people’s intentions.

Design/methodology/approach: This study evaluates various machine learning models along with standard and hybrid deep learning models across various standardized datasets. We have performed vectorization of text using word embedding techniques, in order to convert the textual data into vectors for analytical purposes. We have used three standardized datasets available in the public domain and three word embeddings, i.e., Word2Vec, GloVe and fastText, to validate the hypothesis.

Findings: The results were analyzed and conclusions were drawn. The key finding is that the hybrid models that include Bidirectional Long Short-Term Memory (Bi-LSTM) and Convolutional Neural Network (CNN) outperform other conventional machine learning as well as deep learning models across all the datasets considered in this study, making our hypothesis valid.

Research limitations: Using data from different sources and customizing the models for each dataset slightly decreases the usability of the technique. Overall, however, this methodology provides effective measures to identify the presence of sarcasm, with a minimum average accuracy of 80% or above for one dataset and better than the current baseline results for the other datasets.

Practical implications: The results provide solid insights for the system developers to integrate this model into real-time analysis of any review or comment posted in the public domain. This study has various other practical implications for businesses that depend on user ratings and public opinions. This study also provides a launching platform for various researchers to work on the problem of sarcasm identification in textual data.

Originality/value: This is a first-of-its-kind study that shows the difference between conventional and hybrid methods for predicting sarcasm in textual data. The study also provides possible indicators that hybrid models are better when applied to textual data for the analysis of sarcasm.

Are Contributions from Chinese Physicists Undercited?
Jinzhong Guo, Xiaoling Liu, Liying Yang, Jinshan Wu
Journal of Data and Information Science    2019, 4 (4): 84-95.   doi:10.2478/jdis-2019-0022
Accepted: 19 December 2019

Purpose: In this work, we examine whether there are scientific fields in which contributions from Chinese scholars have been under- or overcited.

Design/methodology/approach: We do so by comparing the number of received citations and the IOF of publications in each scientific field from each country. The IOF is calculated by applying the modified closed system input-output analysis (MCSIOA) to the citation network. MCSIOA is a PageRank-like algorithm, which here means that citations from more influential subfields carry more weight in the IOF.
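
To make the "PageRank-like" weighting concrete, the sketch below runs a plain PageRank power iteration on a small citation network between subfields. It is explicitly not MCSIOA, whose details are not given here; it only illustrates the general idea that a citation counts for more when it comes from a source that is itself influential. The function name, the damping factor, and the toy citation counts are assumptions.

```python
def pagerank_like(citations, damping=0.85, iterations=100):
    """Generic PageRank power iteration over a weighted citation network.

    `citations[a][b]` is the number of citations from subfield `a` to
    subfield `b`. Returns a dict of influence scores that sum to 1.
    """
    fields = sorted(set(citations) | {b for t in citations.values() for b in t})
    n = len(fields)
    score = {f: 1.0 / n for f in fields}
    for _ in range(iterations):
        new = {f: (1.0 - damping) / n for f in fields}
        for src in fields:
            targets = citations.get(src, {})
            total = sum(targets.values())
            if total == 0:
                # A subfield that cites nothing spreads its score evenly.
                for f in fields:
                    new[f] += damping * score[src] / n
            else:
                # Influence flows along citations, in proportion to their share.
                for dst, count in targets.items():
                    new[dst] += damping * score[src] * count / total
        score = new
    return score


# Hypothetical citation counts between three subfields.
citations = {
    "A": {"B": 10},
    "B": {"C": 10},
    "C": {"A": 5, "B": 5},
}

scores = pagerank_like(citations)
for field, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{field}: {value:.3f}")
```

Comparing a rank based on such influence scores with a rank based on raw citation counts is the kind of contrast the abstract refers to when it calls a subfield under- or overcited.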

Findings: About 40% of subfields in physics in China are undercited, meaning that their net influence ranks are higher (better) than the direct rank, while about 75% of subfields in the USA and Germany are undercited.

Research limitations: Only APS data is analyzed in this work. The expected citation influence is assumed to be represented by the IOF, and this can be wrong.

Practical implications: MCSIOA provides a measure of net influence. According to that measure, Chinese physicists’ publications are, overall, more likely to be overcited than undercited.

Originality/value: The issue of under- or overcitation has been analyzed in this work using MCSIOA.

Measuring Societal Impact Is as Complex as ABC
Ed Noyons
Journal of Data and Information Science    2019, 4 (3): 6-21.   doi:10.2478/jdis-2019-0012
Accepted: 02 September 2019

Purpose: This paper describes an alternative way of assessing journals that considers a broader perspective of their impact. The Area-based connectedness (ABC) to society of journals applied here contributes to the assessment of the dissemination task of journals, but with more data it may also contribute to the assessment of other missions.

Design/methodology/Approach: The ABC approach assesses the performance of research actors, in this case journals, considering the characteristics of the research areas in which they are active. Each paper in a journal inherits the characteristics of its area. These areas are defined by a publication-based classification. The characteristics of areas relate to 5 dimensions of connectedness to society (news, policy, industrial R&D, technology and local interest) and are calculated by bibliometric indicators and social media metrics.

Findings: In the paper, I illustrate the approach by showing the results for a few journals. They illustrate the diverse profiles that journals may have. We are able to provide a profile for each journal in the Web of Science database. The profiles we present show an appropriate view on the journals’ societal connectedness.

Research limitations: The classification I apply to perform the analyses is a CWTS in-house classification based on Web of Science data. As such, the application depends on (updates of) that system. The classification is available at

Practical implications: The dimensions of connectedness discussed in this paper relate to the dissemination task of journals but further development of this method may provide more options to monitor the tasks/mission of journals.

Originality/value: The ABC approach is a unique way to assess the performance or impact of research actors considering the characteristics of the areas in which output is published, and as such it is less prone to manipulation or gaming.

Practice and Challenge of International Peer Review: A Case Study of Research Evaluation of CAS Centers for Excellence
Fang Xu, Xiaoxuan Li
Journal of Data and Information Science    2019, 4 (3): 22-34.   doi:10.2478/jdis-2019-0013
Accepted: 02 September 2019

Purpose: The main goal of this paper is to show that international peer review can work in China’s context with satisfactory outcomes. Moreover, this paper also provides a reference for the practice of science and technology management.

Design/methodology/Approach: This paper starts with a discussion of two critical questions about the significance and design of international peer review. A case study of international peer review of CAS Centers for Excellence is further analyzed.

Findings: International peer review may provide a solution to address the problem of quantitatively oriented research evaluation in China. The case study of research evaluation of CAS Centers for Excellence shows that it is possible and feasible to conduct an international peer review in China’s context. When applying this approach to other scenarios, there are still many issues to consider, including individualized design of international peer review combined with practical demands, and further improvement of theories and methods of international peer review.

Research limitations: 1) Only the case of international peer review of CAS Centers for Excellence is analyzed; 2) A relatively small number of respondents were surveyed in the questionnaire.

Practical implications: The work presented in this study can be used as a reference for future studies.

Originality/value: Currently, there are no similarly detailed studies exploring the significance and methodology of international peer review in China.

Disclosing and Evaluating Artistic Research
Florian Vanlee, Walter Ysebaert
Journal of Data and Information Science    2019, 4 (3): 35-54.   doi:10.2478/jdis-2019-0014
Accepted: 02 September 2019

Purpose: This study expands on the results of a stakeholder-driven research project on quality indicators and output assessment of art and design research in Flanders, the northern, Dutch-speaking region of Belgium. It emphasizes the value of arts & design output registration as a modality to articulate the disciplinary demarcations of art and design research.

Design/methodology/Approach: The particularity of art and design research in Flanders is first analyzed and compared to international examples. Hereafter, the results of the stakeholder-driven project on the creation of indicators for arts & design research output assessment are discussed.

Findings: The findings accentuate the importance of allowing an assessment culture to emerge from practitioners themselves, instead of imposing ill-suited methods borrowed from established scientific evaluation models (Biggs & Karlsson, 2011), notwithstanding the practical difficulties it generates. They point to the potential of stakeholder-driven approaches for artistic research, which benefits from constructing a shared metadiscourse among its practitioners regarding the continuities and discontinuities between “artistic” and “traditional” research, and the communal goals and values that guide its knowledge production (Biggs & Karlsson, 2011; Hellström, 2010; Ysebaert & Martens, 2018).

Research limitations: The central limitation of the study is that it focuses exclusively on the “Architecture & Design” panel of the project, and does not account for intra-disciplinary complexities in output assessment.

Practical implications: The goal of the research project is to create a robust assessment system for arts & design research in Flanders, which may later guide similar international projects.

Originality/value: This study is currently the only one to consider the productive potential of (collaborative) PRFSs for artistic research.

Methods and Practices for Institutional Benchmarking based on Research Impact and Competitiveness: A Case Study of ShanghaiTech University
Jiang Chang, Jianhua Liu
Journal of Data and Information Science    2019, 4 (3): 55-72.   doi:10.2478/jdis-2019-0015
Accepted: 02 September 2019

Purpose: To develop and test a mission-oriented and multi-dimensional benchmarking method for a small-scale university aiming for internationally first-class basic research.

Design/methodology/Approach: An individualized evidence-based assessment scheme was employed to benchmark ShanghaiTech University against selected top research institutions, focusing on research impact and competitiveness at the institutional and disciplinary levels. Topic maps contrasting ShanghaiTech with corresponding top institutions were produced for the main research disciplines of ShanghaiTech. This provides opportunities for further exploration of strengths and weaknesses.

Findings: This study establishes a preliminary framework for assessing the mission of the university. It further provides assessment principles, assessment questions, and indicators. Analytical methods and data sources were tested and proved to be applicable and efficient.

Research limitations: To better fit the selective research focuses of this university, its schema of research disciplines needs to be re-organized and benchmarking targets should include disciplinary top institutions and not necessarily those universities leading overall rankings. Current reliance on research articles and certain databases may neglect important research output types.

Practical implications: This study provides a working framework and practical methods for mission-oriented, individual, and multi-dimensional benchmarking that ShanghaiTech decided to use for periodical assessments. It also offers a working reference for other institutions to adapt. Further needs are identified so that ShanghaiTech can tackle them for future benchmarking.

Originality/value: This is an effort to develop a mission-oriented, individually designed, systematically structured, and multi-dimensional assessment methodology, which differs from often-used composite indices.

An Automatic Method to Identify Citations to Journals in News Stories: A Case Study of UK Newspapers Citing Web of Science Journals
Kayvan Kousha, Mike Thelwall
Journal of Data and Information Science    2019, 4 (3): 73-95.   doi:10.2478/jdis-2019-0016
Accepted: 02 September 2019

Purpose: Communicating scientific results to the public is essential to inspire future researchers and ensure that discoveries are exploited. News stories about research are a key communication pathway for this and have been manually monitored to assess the extent of press coverage of scholarship.

Design/methodology/Approach: To make larger scale studies practical, this paper introduces an automatic method to extract citations from newspaper stories to large sets of academic journals. Curated ProQuest queries were used to search for citations to 9,639 Science and 3,412 Social Science Web of Science (WoS) journals from eight UK daily newspapers during 2006-2015. False matches were automatically filtered out by a new program, with 94% of the remaining stories meaningfully citing research.

Findings: Most Science (95%) and Social Science (94%) journals were never cited by these newspapers. Half of the cited Science journals covered medical or health-related topics, whereas 43% of the Social Sciences journals were related to psychiatry or psychology. Of the citing news stories, 60% described research extensively and 53% used multiple sources, but few commented on research quality.

Research limitations: The method has only been tested in English and from the ProQuest Newspapers database.

Practical implications: Others can use the new method to systematically harvest press coverage of research.

Originality/value: An automatic method was introduced and tested to extract citations from newspaper stories to large sets of academic journals.

A Multi-match Approach to the Author Uncertainty Problem
Stephen F. Carley, Alan L. Porter, Jan L. Youtie
Journal of Data and Information Science    2019, 4 (2): 1-18.   doi:10.2478/jdis-2019-0006
Accepted: 12 October 2011

Purpose: The ability to identify the scholarship of individual authors is essential for performance evaluation. A number of factors hinder this endeavor. Common and similarly spelled surnames make it difficult to isolate the scholarship of individual authors indexed on large databases. Variations in name spelling of individual scholars further complicate matters. Common family names in scientific powerhouses like China make it problematic to distinguish between authors possessing ubiquitous and/or anglicized surnames (as well as the same or similar first names). The assignment of unique author identifiers provides a major step toward resolving these difficulties. We maintain, however, that in and of themselves, author identifiers are not sufficient to fully address the author uncertainty problem. In this study we build on the author identifier approach by considering commonalities in fielded data between authors sharing the same surname and first initial of their first name. We illustrate our approach using three case studies.

Design/methodology/approach: The approach we advance in this study is based on commonalities among fielded data in search results. We cast a broad initial net—i.e., a Web of Science (WOS) search for a given author’s last name, followed by a comma, followed by the first initial of his or her first name (e.g., a search for ‘John Doe’ would assume the form: ‘Doe, J’). Results for this search typically contain all of the scholarship legitimately belonging to this author in the given database (i.e., all of his or her true positives), along with a large amount of noise, or scholarship not belonging to this author (i.e., a large number of false positives). From this corpus we proceed to iteratively weed out false positives and retain true positives. Author identifiers provide a good starting point—e.g., if ‘Doe, J’ and ‘Doe, John’ share the same author identifier, this would be sufficient for us to conclude these are one and the same individual. We find email addresses similarly adequate—e.g., if two author names which share the same surname and same first initial have an email address in common, we conclude these authors are the same person. Author identifier and email address data is not always available, however. When this occurs, other fields are used to address the author uncertainty problem.

Commonalities among author data other than unique identifiers and email addresses are less conclusive for name consolidation purposes. For example, if ‘Doe, John’ and ‘Doe, J’ have an affiliation in common, do we conclude that these names belong to the same person? They may or may not; institutions have employed two or more faculty members sharing the same last name and first initial. Similarly, it’s conceivable that two individuals with the same last name and first initial publish in the same journal, publish with the same co-authors, and/or cite the same references. Should we then ignore commonalities among these fields and conclude they’re too imprecise for name consolidation purposes? It is our position that such commonalities are indeed valuable for addressing the author uncertainty problem, but more so when used in combination.

Our approach makes use of automation as well as manual inspection, relying initially on author identifiers, then commonalities among fielded data other than author identifiers, and finally manual verification. To achieve name consolidation independent of author identifier matches, we have developed a procedure that is used with bibliometric software called VantagePoint (see While the application of our technique does not exclusively depend on VantagePoint, it is the software we find most efficient in this study. The script we developed to implement this procedure is designed to implement our name disambiguation procedure in a way that significantly reduces manual effort on the user’s part. Those who seek to replicate our procedure independent of VantagePoint can do so by manually following the method we outline, but we note that the manual application of our procedure takes a significant amount of time and effort, especially when working with larger datasets.

Our script begins by prompting the user for a surname and a first initial (for any author of interest). It then prompts the user to select a WOS field on which to consolidate author names. After this the user is prompted to point to the name of the authors field, and finally asked to identify a specific author name (referred to by the script as the primary author) within this field whom the user knows to be a true positive (a suggested approach is to point to an author name associated with one of the records that has the author’s ORCID iD or email address attached to it).

The script proceeds to identify and combine all author names that share the primary author’s surname and first initial and that have values in common in the WOS field the user selected for consolidation. This typically results in a significant reduction in the initial dataset size. After the procedure completes, the user is usually left with a much smaller (and more manageable) dataset to manually inspect (and/or apply additional name disambiguation techniques to).
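The consolidation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the VantagePoint script itself: the record layout, the function names `same_surname_and_initial` and `consolidate`, and the choice to repeat the merge until no further names attach are all assumptions made for the example.

```python
def same_surname_and_initial(name, surname, initial):
    """True if a 'Surname, Given' style name matches the surname and first initial."""
    last, _, given = name.partition(",")
    return (last.strip().lower() == surname.lower()
            and given.strip()[:1].lower() == initial.lower())


def consolidate(records, surname, initial, primary_author):
    """Return the author-name variants linked to the primary author.

    records: iterable of (author_name, match_field_values) pairs, where the
    match field is whatever the user chose (emails, co-authors, ISSNs, ...).
    A name is merged when it shares the surname and first initial and has at
    least one match-field value in common with the names merged so far.
    """
    # Collect every match-field value seen under each candidate name.
    values_by_name = {}
    for author, field_values in records:
        if same_surname_and_initial(author, surname, initial):
            values_by_name.setdefault(author, set()).update(field_values)

    merged = {primary_author}
    known = set(values_by_name.get(primary_author, ()))
    changed = True
    while changed:  # assumption: repeat until no further names can be attached
        changed = False
        for name, values in values_by_name.items():
            if name not in merged and values & known:
                merged.add(name)
                known |= values
                changed = True
    return merged


# Hypothetical records using email as the match field.
records = [
    ("Doe, John", ["jdoe@example.edu"]),
    ("Doe, J", ["jdoe@example.edu", "j.doe@lab.example.org"]),
    ("Doe, J.", ["j.doe@lab.example.org"]),
    ("Doe, Jane", ["jane.doe@example.com"]),
    ("Dorn, J", ["jdoe@example.edu"]),
]
merged_names = consolidate(records, "Doe", "J", "Doe, John")
print(sorted(merged_names))
```

Names left unmerged (here ‘Doe, Jane’) form the reduced list that still needs manual inspection or a further pass on another match field.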

Research limitations: Match field coverage can be an issue. When field coverage is poor, dataset reduction is less significant, which results in more manual inspection on the user’s part. Our procedure doesn’t lend itself to scholars who have had a legal family name change (after marriage, for example). Moreover, the technique we advance may have difficulty with scholars who have changed careers or fields dramatically, as well as scholars whose work is highly interdisciplinary.

Practical implications: The procedure we advance has the ability to save a significant amount of time and effort for individuals engaged in name disambiguation research, especially when the name under consideration is a more common family name. It is more effective when match field coverage is high and a number of match fields exist.

Originality/value: Once again, the procedure we advance has the ability to save a significant amount of time and effort for individuals engaged in name disambiguation research. It combines preexisting with more recent approaches, harnessing the benefits of both.

Findings: Our study applies the name disambiguation procedure we advance to three case studies. Ideal match fields are not the same for each of our case studies. We find that match field effectiveness is in large part a function of field coverage. The original dataset size and the timeframe analyzed differ across the case studies, as do the subject areas in which the authors publish. Our procedure is most effective when applied to our third case study, in terms of both list reduction and 100% retention of true positives. We attribute this to excellent match field coverage, especially in more specific match fields, as well as to a more modest and manageable number of publications.

While machine learning is considered authoritative by many, we do not see it as practical or replicable. The procedure advanced herein is practical, replicable, and relatively user friendly. It might be placed in a space between ORCID and machine learning. Machine learning approaches typically look for commonalities among citation data, which is not always available, structured, or easy to work with. The procedure we advance is intended to be applied across numerous fields in a dataset of interest (e.g. emails, coauthors, affiliations, etc.), resulting in multiple rounds of reduction. Results indicate that effective match fields include author identifiers, emails, source titles, co-authors and ISSNs. While the script we present is not likely to result in a dataset consisting solely of true positives (at least for more common surnames), it does significantly reduce manual effort on the user’s part. Dataset reduction (after our procedure is applied) is in large part a function of (a) field availability and (b) field coverage.

Normalizing Book Citations in Google Scholar: A Hybrid Cited-side Citing-side Method
John Mingers†, Eren Kaymaz
Journal of Data and Information Science    2019, 4 (2): 19-35.   doi:10.2478/jdis-2019-0007
Accepted: 30 May 2019

Purpose: To design and test a method for normalizing book citations in Google Scholar.

Design/methodology/approach: A hybrid citing-side, cited-side normalization method was developed and this was tested on a sample of 285 research monographs. The results were analyzed and conclusions drawn.

Findings: The method was technically feasible but required extensive manual intervention because of the poor quality of the Google Scholar data.

Research limitations: The sample of books was limited, and all were from one discipline (business and management). Also, the method has only been tested on Google Scholar; it would be useful to test it on Web of Science or Scopus.

Practical limitations: Google Scholar is a poor source of data, although it does cover a much wider range of citation sources than other databases.

Originality/value: This is the first method developed specifically for normalizing book citations, which so far could not be normalized.

Evolution of the Socio-cognitive Structure of Knowledge Management (1986-2015): An Author Co-citation Analysis
Carlos Luis González-Valiente, Magda León Santos, Ricardo Arencibia-Jorge
Journal of Data and Information Science    2019, 4 (2): 36-55.   doi:10.2478/jdis-2019-0008
Accepted: 30 May 2019

Purpose: The evolution of the socio-cognitive structure of the field of knowledge management (KM) during the period 1986-2015 is described.

Design/methodology/approach: Records retrieved from Web of Science were submitted to author co-citation analysis (ACA) following a longitudinal perspective as of the following time slices: 1986-1996, 1997-2006, and 2007-2015. The top 10% of most cited first authors by sub-periods were mapped in bibliometric networks in order to interpret the communities formed and their relationships.
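The counting step behind author co-citation analysis can be sketched as follows. This is a minimal illustration that assumes each citing record has already been reduced to a list of cited first authors; the function name `cocitation_counts` and the placeholder author labels are hypothetical, and the thresholding (top 10%) and network mapping steps are not shown.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count, for each pair of cited first authors, how many citing papers cite both."""
    counts = Counter()
    for cited_authors in reference_lists:
        # Each author counts once per citing paper, however often they are cited in it.
        unique = sorted(set(cited_authors))
        for pair in combinations(unique, 2):
            counts[pair] += 1
    return counts


# Hypothetical reference lists from three citing papers.
reference_lists = [
    ["Nonaka", "Author B", "Author C"],
    ["Nonaka", "Author C"],
    ["Author B", "Nonaka", "Nonaka"],
]
pair_counts = cocitation_counts(reference_lists)
for pair, n in pair_counts.most_common():
    print(pair, n)
```

The resulting pair counts are the edge weights of a co-citation network, which can then be split by time slice and clustered into communities.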

Findings: KM is a homogeneous field, as indicated by the network results. Nine classical authors are identified since they are highly co-cited in each sub-period, with Ikujiro Nonaka standing out as the most influential author in the field. The most significant communities in KM are devoted to strategic management, KM foundations, organisational learning and behaviour, and organisational theories. Major trends in the evolution of the intellectual structure of KM show a technological influence in 1986-1996, a strategic influence in 1997-2006, and finally a sociological influence in 2007-2015.

Research limitations: Describing a field from a single database can introduce biases in terms of output coverage. Likewise, conference proceedings and books were not used, and the analysis was based only on first authors. However, the results obtained can be very useful for understanding the evolution of KM research.

Practical implications: These results might be useful for managers and academicians to understand the evolution of the KM field and to (re)define research activities and organisational projects.

Originality/value: The novelty of this paper lies in considering ACA as a bibliometric technique to study KM research. In addition, our investigation has a wider time coverage than earlier articles.

Does a Country/Region’s Economic Status Affect Its Universities’ Presence in International Rankings?
Esteban Fernández Tuesta, Carlos Garcia-Zorita, Rosario Romera Ayllon, Elías Sanz-Casado
Journal of Data and Information Science    2019, 4 (2): 56-78.   doi:10.2478/jdis-2019-0009
Accepted: 30 May 2019

Purpose: Study how economic parameters affect positions in the Academic Ranking of World Universities’ top 500 published by the Shanghai Jiao Tong University Graduate School of Education in countries/regions with listed higher education institutions.

Design/methodology/approach: The methodology used capitalises on the multivariate characteristics of the data analysed. The multicollinearity problem posed is solved by running principal component analysis prior to regression analysis, using both classical (OLS) and robust (Huber and Tukey) methods.

Findings: Our results revealed that countries/regions with long ranking traditions are highly competitive. Findings also showed that some countries/regions such as Germany, United Kingdom, Canada, and Italy, had a larger number of universities in the top positions than predicted by the regression model. In contrast, for Japan, a country where social and economic performance is high, the number of ARWU universities projected by the model was much larger than the actual figure. In much the same vein, countries/regions that invest heavily in education, such as Japan and Denmark, had lower than expected results.

Research limitations: Using data from only one ranking is a limitation of this study, but the methodology used could be useful to other global rankings.

Practical implications: The results provide good insights for policy makers. They indicate the existence of a relationship between research output and the number of universities per million inhabitants. Countries/regions that have historically prioritised higher education exhibited the highest values for the indicators that compose the ranking methodology; furthermore, a minimal increase in welfare indicators could produce significant rises in the presence of their universities in the rankings.

Originality/value: This study is well defined and the result answers important questions about characteristics of countries/regions and their higher education system.

Node2vec Representation for Clustering Journals and as A Possible Measure of Diversity
Zhesi Shen, Fuyou Chen, Liying Yang, Jinshan Wu
Journal of Data and Information Science    2019, 4 (2): 79-92.   doi:10.2478/jdis-2019-0010
Accepted: 30 May 2019

Purpose: To investigate the effectiveness of using node2vec on journal citation networks to represent journals as vectors for tasks such as clustering, science mapping, and journal diversity measure.

Design/methodology/approach: Node2vec is used in a journal citation network to generate journal vector representations.

Findings: 1. Journals are clustered based on the node2vec trained vectors to form a science map. 2. The norm of the vector can be seen as an indicator of the diversity of journals. 3. Using node2vec trained journal vectors to determine the Rao-Stirling diversity measure leads to a better measure of diversity than that of direct citation vectors.
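To make the third finding concrete, the sketch below computes a Rao-Stirling diversity value from embedding vectors. It is a minimal illustration under stated assumptions: the distance between two categories is taken as one minus the cosine similarity of their vectors, the sum runs over ordered pairs with i ≠ j, and the toy two-dimensional vectors stand in for node2vec output. The function names are hypothetical, and none of this is confirmed as the authors' exact formulation.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def rao_stirling(proportions, vectors):
    """Rao-Stirling diversity: sum over i != j of p_i * p_j * d_ij.

    proportions: dict mapping category -> share of citations (shares sum to 1).
    vectors: dict mapping category -> embedding vector (assumed node2vec-style).
    """
    total = 0.0
    for i, p_i in proportions.items():
        for j, p_j in proportions.items():
            if i != j:
                total += p_i * p_j * cosine_distance(vectors[i], vectors[j])
    return total


# Toy vectors standing in for trained journal embeddings.
vectors = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 0.1]}
spread_far = rao_stirling({"A": 0.5, "B": 0.5}, vectors)    # citations split over distant journals
spread_near = rao_stirling({"A": 0.5, "C": 0.5}, vectors)   # citations split over similar journals
print(spread_far, spread_near)
```

Swapping the embedding-based distance for one computed from direct citation vectors changes only the `vectors` argument, which is the comparison the finding describes.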

Research limitations: All analyses use citation data and only focus on the journal level.

Practical implications: Node2vec trained journal vectors embed rich information about journals, can be used to form a science map and may generate better values of journal diversity measures.

Originality/value: The effectiveness of node2vec in scientometric analysis is tested. Possible indicators for journal diversity measure are presented.

A Criteria-based Assessment of the Coverage of Scopus and Web of Science
Dag W. Aksnes, Gunnar Sivertsen
Journal of Data and Information Science    2019, 4 (1): 1-21.   doi:10.2478/jdis-2019-0001
Accepted: 05 August 2011

Purpose: The purpose of this study is to assess the coverage of the scientific literature in Scopus and Web of Science from the perspective of research evaluation.

Design/methodology/approach: The academic communities of Norway have agreed on certain criteria for what should be included as original research publications in research evaluation and funding contexts. These criteria have been applied since 2004 in a comprehensive bibliographic database called the Norwegian Science Index (NSI). The relative coverages of Scopus and Web of Science are compared with regard to publication type, field of research and language.

Findings: Our results show that Scopus covers 72 percent of the total Norwegian scientific and scholarly publication output in 2015 and 2016, while the corresponding figure for Web of Science Core Collection is 69 percent. The coverages are most comprehensive in medicine and health (89 and 87 percent) and in the natural sciences and technology (85 and 84 percent). The social sciences (48 percent in Scopus and 40 percent in Web of Science Core Collection) and particularly the humanities (27 and 23 percent) are much less covered in the two international data sources.

Research limitation: Comparing with data from only one country is a limitation of the study, but the criteria used to define a country’s scientific output as well as the identification of patterns of field-dependent partial representations in Scopus and Web of Science should be recognizable and useful also for other countries.

Originality/value: The novelty of this study is the criteria-based approach to studying coverage problems in the two data sources.

Equalities between h-type Indices and Definitions of Rational h-type Indicators
Leo Egghe, Yves Fassin, Ronald Rousseau
Journal of Data and Information Science    2019, 4 (1): 22-31.   doi:10.2478/jdis-2019-0002
Accepted: 31 January 2019

Purpose: To show for which publication-citation arrays h-type indices are equal and to reconsider rational h-type indices. Results for these research questions fill some gaps in existing basic knowledge about h-type indices.

Design/methodology/approach: The results and introduction of new indicators are based on well-known definitions.

Findings: The research purpose has been reached: answers to the first questions are obtained and new indicators are defined.

Research limitations: h-type indices do not meet the Bouyssou-Marchant independence requirement.

Practical implications: On the one hand, more insight has been obtained for well-known indices such as the h- and the g-index and on the other hand, simple extensions of existing indicators have been added to the bibliometric toolbox. Relative rational h-type indices are more useful for individuals than the existing absolute ones.

Originality/value: Answers to basic questions such as “when are the values of two h-type indices equal” are provided. A new rational h-index is introduced.
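The question of when two h-type indices coincide can be explored numerically with the standard definitions of the h-index and g-index. The sketch below is a minimal illustration, assuming that g is capped at the number of publications (some definitions allow fictitious zero-citation papers); it does not implement the rational indicators introduced in the paper.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, c in enumerate(ranked, start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers together have at least g*g citations.

    Assumption: g cannot exceed the number of papers in the array.
    """
    ranked = sorted(citations, reverse=True)
    g = 0
    running_total = 0
    for rank, c in enumerate(ranked, start=1):
        running_total += c
        if running_total >= rank * rank:
            g = rank
    return g


# Two small publication-citation arrays: one where h and g agree, one where they differ.
for array in ([3, 3, 3], [10, 1]):
    print(array, h_index(array), g_index(array))
```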

Measuring Scientific Productivity in China Using Malmquist Productivity Index
Yaoyao Song, Torben Schubert, Huihui Liu, Guoliang Yang
Journal of Data and Information Science    2019, 4 (1): 32-59.   doi:10.2478/jdis-2019-0003
Accepted: 31 January 2019

Purpose: This paper aims to investigate the scientific productivity of China’s science system.

Design/methodology/approach: This paper employs the Malmquist productivity index (MPI) based on Data Envelopment Analysis (DEA).
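For readers unfamiliar with the index, the sketch below shows how a Malmquist productivity index is commonly assembled from four distance-function values and split into efficiency change and technical change. It assumes the textbook geometric-mean form with output-oriented distance functions; the DEA linear programs that would produce those values are not shown, the input numbers are made up, and the paper's exact model specification may differ.

```python
import math


def malmquist(d_t_xt, d_t_xt1, d_t1_xt, d_t1_xt1):
    """Return (mpi, efficiency_change, technical_change).

    d_t_xt   : distance of the period-t observation to the period-t frontier
    d_t_xt1  : distance of the period-(t+1) observation to the period-t frontier
    d_t1_xt  : distance of the period-t observation to the period-(t+1) frontier
    d_t1_xt1 : distance of the period-(t+1) observation to the period-(t+1) frontier
    All four values are assumed to come from separate DEA runs (not shown).
    """
    # Geometric mean of the productivity ratios measured against both frontiers.
    mpi = math.sqrt((d_t_xt1 / d_t_xt) * (d_t1_xt1 / d_t1_xt))
    # Catching up to the frontier between the two periods.
    efficiency_change = d_t1_xt1 / d_t_xt
    # Shift of the frontier itself (technological progress when above 1).
    technical_change = math.sqrt((d_t_xt1 / d_t1_xt1) * (d_t_xt / d_t1_xt))
    return mpi, efficiency_change, technical_change


# Made-up distance values for one university in two consecutive periods.
mpi, eff_change, tech_change = malmquist(0.5, 0.8, 0.4, 0.6)
print(mpi, eff_change, tech_change)
```

Under this convention the index is the product of the two components, so a value above 1 can be attributed to catching up, to frontier shift, or to both.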

Findings: The results reveal that the overall efficiency of Chinese universities increased significantly from 2009 to 2016, mainly driven by technological progress. From the perspective of the functions of higher education, research and transfer activities perform better than teaching activities.

Research limitations: The indicator selection mechanism, investigation period and MPI model can be further extended in future research.

Practical implications: The results indicate that Chinese education administrative departments should take action to guide and promote teaching activities and formulate reasonable resource allocation regulations to achieve balanced development in Chinese universities.

Originality/value: This paper selects 58 Chinese universities and conducts a quantified measurement during the period 2009-2016. Three main functional activities of universities (i.e. teaching, researching, and application) are innovatively categorized into different schemes, and we calculate their performance, respectively.

Identification and Prediction of Interdisciplinary Research Topics: A Study Based on the Concept Lattice Theory
Haiyun Xu, Chao Wang, Kun Dong, Zenghui Yue
Journal of Data and Information Science    2019, 4 (1): 60-88.   doi:10.2478/jdis-2019-0004
Accepted: 31 January 2019

Purpose: Formal concept analysis (FCA) and concept lattice theory (CLT) are introduced for constructing a network of IDR topics and for evaluating their effectiveness for knowledge structure exploration.

Design/methodology/approach: We introduced the theory and applications of FCA and CLT, and then proposed a method for interdisciplinary knowledge discovery based on CLT. As an example of empirical analysis, interdisciplinary research (IDR) topics in Information & Library Science (LIS) and Medical Informatics, and in LIS and Geography-Physical, were utilized as empirical fields. Subsequently, we carried out a comparative analysis with two other IDR topic recognition methods.

Findings: The CLT approach is suitable for IDR topic identification and predictions.

Research limitations: IDR topic recognition based on the CLT is not sensitive to the interdisciplinarity of topic terms, since the data can only reflect whether there is a relationship between the discipline and the topic terms. Moreover, the CLT cannot clearly represent a large number of concepts.

Practical implications: A deeper understanding of the IDR topics was obtained as the structural and hierarchical relationships between them were identified, which can help achieve more precise identification and prediction of IDR topics.

Originality/value: IDR topic identification based on CLT performed well, and this theory has several advantages for identifying and predicting IDR topics. First, in a concept lattice, there is a partial order relation between interconnected nodes, and consequently, a complete concept lattice can present hierarchical properties. Second, clustering analysis of IDR topics based on concept lattices can yield clusters that highlight the essential knowledge features and help display the semantic relationship between different IDR topics. Furthermore, the Hasse diagram automatically displays all the IDR topics associated with the different disciplines, thus forming clusters of specific concepts and visually retaining and presenting the associations of IDR topics through multiple inheritance relationships between the concepts.

Sentiment Analysis of Japanese Tourism Online Reviews
Chuanming Yu, Xingyu Zhu, Bolin Feng, Lin Cai, Lu An
Journal of Data and Information Science    2019, 4 (1): 89-113.   doi:10.2478/jdis-2019-0005
Accepted: 31 January 2019

Purpose: Online reviews of tourism attractions provide important references for potential tourists choosing tourism spots. The main goal of this study is to conduct sentiment analysis that helps users comprehend the large volume of reviews, based on comments about Chinese attractions from the Japanese tourism website 4Travel.

Design/methodology/approach: Different statistics- and rule-based methods are used to analyze the sentiment of the reviews. Three groups of novel statistics-based methods combining feature selection functions and the traditional term frequency-inverse document frequency (TF-IDF) method are proposed. We also construct seven groups of different rule-based methods. The macro-average and micro-average values for the best classification results of the methods are calculated, and the performance of the methods is shown.
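The difference between macro- and micro-averaging can be illustrated with a short sketch. It assumes the averaged metric is F1 and that each review carries exactly one of three sentiment labels; the label names, the function names and the toy predictions are hypothetical and do not come from the study's data.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(gold, predicted):
    """Return (macro_f1, micro_f1) for two equal-length, non-empty label sequences."""
    labels = sorted(set(gold) | set(predicted))
    per_class = []
    total_tp = total_fp = total_fn = 0
    for label in labels:
        tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
        per_class.append(f1(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    macro = sum(per_class) / len(labels)        # every class weighs the same
    micro = f1(total_tp, total_fp, total_fn)    # every review weighs the same
    return macro, micro


# Toy gold labels and predictions for six reviews.
gold = ["pos", "pos", "neg", "neu", "neg", "pos"]
predicted = ["pos", "neg", "neg", "neu", "pos", "pos"]
macro_f1, micro_f1 = macro_micro_f1(gold, predicted)
print(macro_f1, micro_f1)
```

Macro-averaging rewards methods that do well on rare categories, while micro-averaging tracks overall accuracy when each review has a single label, which is why reporting both gives a fuller picture.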

Findings: We compare the statistics-based and rule-based methods separately and compare the overall performance of the two types of methods. According to the results, the combination of feature selection functions and weightings can strongly improve the overall performance. The emotional vocabulary in the field of tourism (EVT), kaomojis, negative and transitional words can notably improve the performance in all three categories. The rule-based methods outperform the statistics-based ones by a narrow margin.

Research limitation: Two limitations should be noted: 1) the empirical studies to verify the validity of the proposed methods are conducted only on the Japanese language; and 2) deep learning technology has not been incorporated in the methods.

Practical implications: The results help to elucidate the intrinsic characteristics of the Japanese language and their influence on sentiment analysis. These findings also provide practical usage guidelines within the field of sentiment analysis of Japanese online tourism reviews.

Originality/value: Our research is of practical value. Currently, there are no studies that focus on the sentiment analysis of Japanese reviews about Chinese attractions.

The Norwegian Model in Norway
Gunnar Sivertsen
Journal of Data and Information Science    2018, 3 (4): 3-19.   doi:10.2478/jdis-2018-0017
Accepted: 08 January 2019

The “Norwegian Model” attempts to comprehensively cover all the peer-reviewed scholarly literature in all areas of research in one single weighted indicator. Thereby, scientific production is made comparable across departments and faculties within and between research institutions, and the indicator may serve institutional evaluation and funding. This article describes the motivation for creating the model in Norway, how it was designed, organized and implemented, as well as the effects of and experiences with the model. The article ends with an overview of a new type of bibliometric study based on the comprehensive national publication data that the Norwegian Model provides.

Performance-based Research Funding in Denmark: The Adoption and Translation of the Norwegian Model
Kaare Aagaard
Journal of Data and Information Science    2018, 3 (4): 20-30.   doi:10.2478/jdis-2018-0018
Accepted: 08 January 2019

Purpose: The main goal of this study is to outline and analyze the Danish adoption and translation of the Norwegian Publication Indicator.

Design/methodology/approach: The study takes the form of a policy analysis mainly drawing on document analysis of policy papers, previously published studies and grey literature.

Findings: The study highlights a number of crucial factors that relate both to the Danish process and to the final Danish result, underscoring that the Danish BFI model is indeed quite a different system from its Norwegian counterpart. One consequence of these process and design differences is that the broader legitimacy of the Danish BFI today appears to be quite poor.

Reasons for this include: unclear and shifting objectives throughout the process; limited willingness to take ownership of the model among stakeholders; lack of communication throughout the implementation process and an apparent underestimation of the challenges associated with the use of bibliometric indicators.

Research limitation: The conclusions of the study are based on the authors’ interpretation of a long-drawn-out and complex process with many different stakeholders involved. The format of this article does not allow for detailed documentation of all elements, but further details can be provided upon request.

Practical implications: The analysis may feed into current policy discussions on the future of the Danish BFI.

Originality/value: Some elements of the present analysis have previously been published in Danish outlets, but this article represents the first publication on this issue targeting a broader international audience.
