Journal of Data and Information Science  2019, Vol. 4 Issue (2): 79-92    DOI: 10.2478/jdis-2019-0010
Research Paper     
Node2vec Representation for Clustering Journals and as A Possible Measure of Diversity
Zhesi Shen1, Fuyou Chen1, Liying Yang1, Jinshan Wu2†
1National Science Library, Chinese Academy of Sciences, Beijing 100190, P.R. China
2School of Systems Science, Beijing Normal University, Beijing 100875, P.R. China

Abstract  

Purpose: To investigate the effectiveness of using node2vec on journal citation networks to represent journals as vectors for tasks such as clustering, science mapping, and measuring journal diversity.

Design/methodology/approach: Node2vec is applied to a journal citation network to generate vector representations of journals.

Findings: 1. Journals are clustered based on the node2vec-trained vectors to form a science map. 2. The norm of a journal's vector can be seen as an indicator of the journal's diversity. 3. Computing the Rao-Stirling diversity measure from node2vec-trained journal vectors yields a better measure of diversity than computing it from direct citation vectors.

Research limitations: All analyses use citation data and only focus on the journal level.

Practical implications: Node2vec-trained journal vectors embed rich information about journals; they can be used to form a science map and may yield better values of journal diversity measures.

Originality/value: The effectiveness of node2vec in scientometric analysis is tested. Possible indicators for measuring journal diversity are presented.
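The walk-generation step behind the node2vec approach summarized in the abstract can be sketched as follows. This is only a minimal illustration under stated assumptions: the toy graph, the journal labels "A" to "D", the edge weights, and the values of p and q are all hypothetical, and the code is not the authors' implementation. In node2vec, walks of this kind are then passed to a skip-gram model to learn the vectors; that training step is not shown here.

```python
import random


def node2vec_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Generate one second-order biased random walk of up to `length` nodes.

    `graph` maps node -> {neighbor: edge weight}. The return parameter `p`
    and in-out parameter `q` follow Grover & Leskovec (2016).
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(graph[cur])
        if not neighbors:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # First step: no previous node yet, so use the edge weights only.
            weights = [graph[cur][nxt] for nxt in neighbors]
        else:
            prev = walk[-2]
            weights = []
            for nxt in neighbors:
                w = graph[cur][nxt]
                if nxt == prev:
                    weights.append(w / p)  # step back to the previous node
                elif nxt in graph[prev]:
                    weights.append(w)      # stay close to the previous node
                else:
                    weights.append(w / q)  # move further away
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# Hypothetical symmetric citation network; weights stand in for citation counts.
toy_graph = {
    "A": {"B": 3, "C": 1},
    "B": {"A": 3, "C": 2, "D": 1},
    "C": {"A": 1, "B": 2, "D": 4},
    "D": {"B": 1, "C": 4},
}

rng = random.Random(0)
walks = [
    node2vec_walk(toy_graph, node, length=6, p=1.0, q=0.5, rng=rng)
    for node in toy_graph
    for _ in range(3)
]
for walk in walks[:3]:
    print(" -> ".join(walk))
```

Each walk plays the role of a "sentence" of journals, so journals that tend to appear in similar walk contexts end up with similar vectors.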



Key words: Science mapping; Diversity; Graph embedding; Vector norm
Received: 03 April 2019      Published: 30 May 2019
Cite this article:

Zhesi Shen, Fuyou Chen, Liying Yang, Jinshan Wu. Node2vec Representation for Clustering Journals and as A Possible Measure of Diversity. Journal of Data and Information Science, 2019, 4(2): 79-92.

URL:

http://manu47.magtech.com.cn/Jwk3_jdis/10.2478/jdis-2019-0010     OR     http://manu47.magtech.com.cn/Jwk3_jdis/Y2019/V4/I2/79

Figure 1. Map of scientific journals. Dot colors indicate the ESI categories of the journals. Red dots with black borders are journals indexed as multidisciplinary, only a few of which are labeled on the map.
Figure 2. (a) Our vector-based clustering of journals compared with several existing journal classification systems: JCR, VOS, ESI, and LCAS. (b) Comparison of the clusters obtained with different dimensions d of the node2vec vectors. There is little difference among d = 32, d = 64, and d = 128: the similarities of 64-32 and 128-32 are much higher than that of 8-32, while 16-32 lies somewhere in between. Likewise, when the Vec clusters with d = 8, 16, 32, 64, and 128 are compared against VOS, increasing d makes little difference as long as d > 16. Considering both performance and computational cost, we report only the results for d = 32.
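The caption of Figure 2 does not state which similarity measure is used to compare two clusterings. As one possible illustration, the sketch below computes normalized mutual information (NMI), a common choice for this kind of comparison. The choice of NMI, the normalization by the geometric mean of the entropies, and the toy labels are all assumptions made here for illustration.

```python
import math
from collections import Counter


def normalized_mutual_information(labels_a, labels_b):
    """NMI between two clusterings of the same items (geometric-mean normalization)."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    # Mutual information between the two label assignments.
    mi = 0.0
    for (a, b), n_ab in joint.items():
        mi += (n_ab / n) * math.log(n_ab * n / (count_a[a] * count_b[b]))

    # Entropy of each clustering.
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    if h_a == 0.0 or h_b == 0.0:
        # A single-cluster partition carries no information.
        return 1.0 if h_a == h_b else 0.0
    return mi / math.sqrt(h_a * h_b)


# Hypothetical cluster labels for six journals under three settings.
clusters_d32 = [0, 0, 0, 1, 1, 1]
clusters_d64 = [1, 1, 1, 0, 0, 0]   # same grouping, different label names
clusters_d8 = [0, 1, 0, 1, 0, 1]    # a very different grouping

nmi_same = normalized_mutual_information(clusters_d32, clusters_d64)
nmi_diff = normalized_mutual_information(clusters_d32, clusters_d8)
print(nmi_same, nmi_diff)
```

Because NMI depends only on how items are grouped and not on the label names, two clusterings that differ only in labeling receive the maximum score.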
| Example | Test 1 | Test 2 | Test 3 |
|---|---|---|---|
| King | PLoS Comput. Biol. | PLoS Comput. Biol. | PLoS Comput. Biol. |
| Man | Nat. Cell Biol. | Nat. Cell Biol. | Nat. Cell Biol. |
| Woman | Phys. Rev. Lett. | Genome Biol. | J. Neurosci. |
| Queen | J. Stat. Mech. Theory Exp. | Bioinformatics | NeuroImage |
| | Phys. Rev. E | BMC Bioinformatics | Biol. Cybern. |
| | Fluctuation Noise Lett. | J. Comput. Biol. | Front. Comput. Neurosci. |
| | EPL | BioData Min. | Cereb. Cortex |
| | Eur. Phys. J. B | J. Bioinform. Comput. Biol. | J. Comput. Neurosci. |

Table 1. The “King - Man + Woman = Queen” test on the node2vec-trained vectors of journals. The top-5 “Queen”-like journals are presented.
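The analogy test in Table 1 can be illustrated with a small sketch: form the vector King - Man + Woman and rank the remaining journals by cosine similarity to it. The three-dimensional vectors and the placeholder names below are hypothetical and chosen by hand, and the use of cosine similarity for the ranking is an assumption; real node2vec vectors would be learned from the citation network.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, king, man, woman, top_k=5):
    """Return the top_k names closest to king - man + woman, excluding the inputs."""
    target = [
        k - m + w
        for k, m, w in zip(vectors[king], vectors[man], vectors[woman])
    ]
    candidates = [name for name in vectors if name not in (king, man, woman)]
    ranked = sorted(
        candidates,
        key=lambda name: cosine(vectors[name], target),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical vectors with axes (computational, biology, physics).
toy_vectors = {
    "comp_bio": [1.0, 1.0, 0.0],
    "bio": [0.0, 1.0, 0.0],
    "phys": [0.0, 0.0, 1.0],
    "comp_phys": [1.0, 0.0, 1.0],
    "math": [1.0, 0.0, 0.0],
    "other_bio": [0.1, 1.0, 0.0],
}

queen_like = analogy(toy_vectors, king="comp_bio", man="bio", woman="phys", top_k=3)
print(queen_like)
```

In this toy setting, removing the "biology" component from a computational biology journal and adding a "physics" component points toward the computational physics placeholder, which mirrors the pattern in the Test 1 column.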
Figure 3. Scatter plot of vector norms versus node centrality. Node centrality is measured as the frequency with which a node occurs in the random walk series generated for node2vec. Orange dots represent journals indexed in Multidisciplinary Science.
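The two quantities plotted in Figure 3 can be computed as in the sketch below, assuming the walks are available as lists of node labels and the vectors as plain lists of floats. The toy walks and vectors are hypothetical, and the use of the Euclidean (L2) norm is an assumption, since the caption does not specify which norm is meant.

```python
import math
from collections import Counter


def occurrence_frequency(walks):
    """Fraction of all walk positions occupied by each node."""
    counts = Counter(node for walk in walks for node in walk)
    total = sum(counts.values())
    return {node: count / total for node, count in counts.items()}


def l2_norm(vector):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in vector))


# Hypothetical random walks and journal vectors.
toy_walks = [["A", "B", "C"], ["B", "C", "B"], ["C", "B", "A"]]
toy_vectors = {"A": [3.0, 4.0], "B": [0.6, 0.8], "C": [1.0, 1.0]}

frequency = occurrence_frequency(toy_walks)
points = [(frequency[name], l2_norm(vec)) for name, vec in toy_vectors.items()]
for name, (freq, norm) in zip(toy_vectors, points):
    print(f"{name}: frequency={freq:.3f}, norm={norm:.3f}")
```

Each (frequency, norm) pair corresponds to one dot in a scatter plot of this kind.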
Figure 4. Diversity of journals calculated using similarity measured by (a) the vector v_n learned from node2vec and (b) the vector v_c. Journals indexed as Multidisciplinary are colored according to their JIFs, from blue (low JIF) to red (high JIF), as shown in the legend on the right. The citing diversity of journal i is measured based on the journals it references, and its cited diversity is measured based on the journals citing it.
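A diversity value of the kind plotted in Figure 4 can be sketched with the Rao-Stirling form, D = sum over i != j of p_i * p_j * d_ij, where p_i is the share of citations going to (or coming from) journal i and d_ij is a dissimilarity between journals i and j. The sketch below assumes d_ij = 1 - cosine similarity of the journal vectors; that choice, the summation over ordered pairs, and the toy data are assumptions for illustration rather than the authors' exact definition.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rao_stirling(citation_counts, vectors):
    """Rao-Stirling diversity with d_ij = 1 - cosine(v_i, v_j).

    `citation_counts` maps journal -> number of citations to (or from) it.
    """
    total = sum(citation_counts.values())
    shares = {j: c / total for j, c in citation_counts.items() if c > 0}
    diversity = 0.0
    for i in shares:
        for j in shares:
            if i != j:
                distance = 1.0 - cosine(vectors[i], vectors[j])
                diversity += shares[i] * shares[j] * distance
    return diversity


# Hypothetical journal vectors: two similar physics journals, one biology journal.
toy_vectors = {
    "phys_1": [1.0, 0.0],
    "phys_2": [0.9, 0.1],
    "bio_1": [0.0, 1.0],
}

focused = rao_stirling({"phys_1": 50, "phys_2": 50}, toy_vectors)
broad = rao_stirling({"phys_1": 50, "bio_1": 50}, toy_vectors)
single = rao_stirling({"phys_1": 100}, toy_vectors)
print(focused, broad, single)
```

With this definition, a journal whose citations are split between two dissimilar journals scores higher than one whose citations go to two similar journals, and a journal citing a single source scores zero.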
Figure 5. A graphic summary of our work: concepts and connections in red are those implemented in the current work, while those in green are possible topics for future investigation. The remaining concepts and connections were proposed and implemented in earlier studies; see, for example, Mikolov et al. (2013) and Grover & Leskovec (2016).
[1] Boyack K., Glänzel W., Gläser J., Havemann F., Scharnhorst A., Thijs B., van Eck N. J., Velden T., & Waltman L. (2017). Topic identification challenge. Scientometrics, 111, 1223-1224.
[2] Boyack K. W., & Klavans R. (2014). Including cited non-source items in a large-scale map of science: What difference does it make? Journal of Informetrics, 8, 569-580. doi:10.1016/j.joi.2014.04.001.
[3] Colavizza G., Boyack K. W., van Eck N. J., & Waltman L. (2018). The closer the better: Similarity of publication pairs at different cocitation levels. Journal of the Association for Information Science and Technology, 69, 600-609. doi:10.1002/asi.23981.
[4] Glänzel W., & Schubert A. (2003). A new classification scheme of science fields and subfields designed for scientometric evaluation purposes. Scientometrics, 56, 357-367. doi:10.1023/A:1022378804087.
[5] Glänzel W., & Thijs B. (2011). Using core documents for the representation of clusters and topics. Scientometrics, 88, 297-309. doi:10.1007/s11192-011-0347-4.
[6] Grover A., & Leskovec J. (2016). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 855-864). ACM. doi:10.1145/2939672.2939754. pmid:5108654.
[7] Haunschild R., Schier H., Marx W., & Bornmann L. (2018). Algorithmically generated subject categories based on citation relations: An empirical micro study using papers on overall water splitting. Journal of Informetrics, 12, 436-447. doi:10.1016/j.joi.2018.03.004.
[8] Janssens F., Glänzel W., & De Moor B. (2008). A hybrid mapping of information science. Scientometrics, 75, 607-631. doi:10.1007/s11192-007-2002-7.
[9] JCR2017 (2018). 2017 Journal Impact Factor, Journal Citation Reports (Clarivate Analytics, 2018).
[10] Klavans R., & Boyack K. W. (2009). Toward a consensus map of science. Journal of the American Society for Information Science and Technology, 60, 455-476.
[11] Klavans R., & Boyack K. W. (2017). Which type of citation analysis generates the most accurate taxonomy of scientific and technical knowledge? Journal of the Association for Information Science and Technology, 68, 984-998. doi:10.1002/asi.23734.
[12] Leydesdorff L. (2006). Can scientific journals be classified in terms of aggregated journal-journal citation relations using the Journal Citation Reports? Journal of the American Society for Information Science and Technology, 57, 601-613. doi:10.1002/asi.20322.
[13] Leydesdorff L., Bornmann L., & Wagner C. S. (2017). Generating clustered journal maps: An automated system for hierarchical classification. Scientometrics, 110, 1601-1614. doi:10.1007/s11192-016-2226-5. pmid:28255188.
[14] Leydesdorff L., Wagner C. S., & Bornmann L. (2018). Betweenness and diversity in journal citation networks as measures of interdisciplinarity: A tribute to Eugene Garfield. Scientometrics, 114, 567-592.
[15] Maaten L. v. d., & Hinton G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605.
[16] Mikolov T., Sutskever I., Chen K., Corrado G. S., & Dean J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems (pp. 3111-3119).
[17] Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M., & Duchesnay E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825-2830.
[18] Rao C. R. (1982). Diversity: Its measurement, decomposition, apportionment and analysis. Sankhyā: The Indian Journal of Statistics, Series A, 44, 1-22.
[19] Schakel A. M., & Wilson B. J. (2015). Measuring word significance using distributed representations of words. arXiv preprint arXiv:1508.02297.
[20] Shen Z., Yang L., Pei J., Li M., Wu C., Bao J., Wei T., Di Z., Rousseau R., & Wu J. (2016). Interrelations among scientific fields and their relative influences revealed by an input-output analysis. Journal of Informetrics, 10, 82-97. doi:10.1016/j.joi.2015.11.002.
[21] Sjögarde P., & Ahlgren P. (2018). Granularity of algorithmically constructed publication-level classifications of research publications: Identification of topics. Journal of Informetrics, 12, 133-152. doi:10.1016/j.joi.2017.12.006.
[22] Stirling A. (2007). A general framework for analysing diversity in science, technology and society. Journal of the Royal Society Interface, 4, 707-719.
[23] Vinh N. X., Epps J., & Bailey J. (2010). Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11, 2837-2854.
[24] Waltman L. (2016). A review of the literature on citation impact indicators. Journal of Informetrics, 10, 365-391. doi:10.1016/j.joi.2016.02.007.
[25] Waltman L., & van Eck N. J. (2012). A new methodology for constructing a publication-level classification system of science. Journal of the American Society for Information Science and Technology, 63, 2378-2392.