Journal of Data and Information Science, 2021, Vol. 6, Issue 4: 139–163. doi: 10.2478/jdis-2021-0032
• Research Papers •
Received: 2021-06-19
Revised: 2021-07-21
Accepted: 2021-07-23
Online: 2021-11-20
Published: 2021-11-08
Contact: Zheng Xie
Email: xiezheng81@nudt.edu.cn
Figure 1.
The statistics of the PNAS papers in three major disciplines. Of the papers, 90.44% belong to three major disciplines, namely Biological Sciences, Physical Sciences, and Social Sciences. Panel (a) shows the annual number of papers in each discipline. Panel (b) shows the fraction of Biophysics papers within Biological Sciences.
Algorithm 2. The architecture of the Transformer used here.

1: model = get_model(token_num=max(len(source_token_dict), len(target_token_dict)), embed_dim=32, encoder_num=4, decoder_num=4, head_num=4, hidden_dim=32, dropout_rate=0.05, use_same_embed=False)
2: model.compile('adam', 'sparse_categorical_crossentropy')
3: model.fit(x=[np.array(encode_input * 30), np.array(decode_input * 30)], y=np.array(decode_output * 30), epochs=5, batch_size=32)
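The core operation that Algorithm 2's four attention heads repeat is scaled dot-product attention. As a minimal, illustrative sketch (pure Python, not the keras-transformer implementation configured above), `scaled_dot_product_attention` below is a hypothetical helper showing the computation softmax(QKᵀ/√d)V:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: lists of d-dimensional vectors (lists of floats).
    # Each query attends over all keys; the output is a weighted
    # average of the value vectors.
    d = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

With a query aligned to one key, the attention weights concentrate on that key's value vector; the weights always sum to one.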
Algorithm 3. Constructing word branches.

Input: titles of papers; parameter m. Output: word branches.
1: for each preprocessed title a_0, ..., a_n do
2:   let b_01 = a_0;
3:   for i from 1 to n do
4:     predict b_i1, ..., b_im, ranked according to the probability given by the Transformer;
5:     if i > 1 then
6:       generate directed edges from b_(i-1)1 to b_i1, ..., b_im;
7:       use b_01, ..., b_i1 to predict next tokens by the Transformer;
8:     end if
9:   end for
10: end for
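The loop above can be sketched in pure Python. Here `predict_top_m` is a hypothetical stand-in for the trained Transformer's ranked next-token prediction; everything else follows the algorithm's steps:

```python
def build_branches(titles, m, predict_top_m):
    # titles: list of token lists (preprocessed titles a_0, ..., a_n).
    # predict_top_m(prefix, m): returns the m most probable next tokens,
    # ranked; a caller-supplied stand-in for the Transformer.
    edges = set()
    for tokens in titles:
        prefix = [tokens[0]]              # b_01 = a_0
        for i in range(1, len(tokens)):
            preds = predict_top_m(prefix, m)   # b_i1, ..., b_im
            if i > 1:
                # directed edges from b_(i-1)1 to each predicted token
                for b in preds:
                    edges.add((prefix[-1], b))
            prefix.append(preds[0])       # extend along the top-ranked token
    return edges
```

A deterministic stub predictor is enough to exercise the branch construction; in the paper's pipeline the predictions would come from the Transformer of Algorithm 2.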
Algorithm 4. Cropping word branches.

Input: word branches. Output: cropped word branches.
1: calculate the token-paper matrix (f_ij)_{N×M};
2: calculate the tf-idf matrix (w_ij)_{N×M};
3: for i from 1 to M do
4:   rank tokens according to {w_i1, ..., w_iN};
5:   if token b_i1 is not in the top x% then
6:     crop the tokens {b_i1, ..., b_im};
7:     crop the edges connecting those tokens;
8:   else
9:     crop the tokens of {b_i2, ..., b_im} that are not in the top x%;
10:    crop the edges connecting those tokens;
11:  end if
12: end for
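The tf-idf ranking that drives the cropping can be sketched as follows. This assumes the standard weighting w = tf · log(M / df); the paper may use a variant, and `keep_top` is a hypothetical helper implementing the "top x%" test:

```python
import math

def tfidf(docs):
    # docs: list of token lists, one per paper (M documents).
    # Returns, per document, a dict token -> tf-idf weight using the
    # standard w = tf * log(M / df) formulation (an assumption here).
    M = len(docs)
    df = {}
    for doc in docs:
        for t in set(doc):
            df[t] = df.get(t, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for t in doc:
            w[t] = w.get(t, 0) + 1          # term frequency
        for t in w:
            w[t] *= math.log(M / df[t])     # damp tokens common to many papers
        weights.append(w)
    return weights

def keep_top(w, x):
    # Tokens whose weight is in the top x% for this document survive cropping.
    ranked = sorted(w, key=w.get, reverse=True)
    k = max(1, int(len(ranked) * x / 100))
    return set(ranked[:k])
```

A token appearing in every paper gets weight zero and is cropped first, which is exactly the behavior Algorithm 4 relies on to remove uninformative tokens.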
Table 1
Information about the dblp dataset.
Time  a  b  c  d  e  f 

1999  2,475  3,274  95,021  0.11  2.371  0.998 
2000  2,380  3,347  93,910  0.101  2.395  0.998 
2001  2,455  3,477  108,954  0.108  2.355  0.999 
2002  2,812  3,710  117,269  0.094  2.272  1.0 
2003  2,656  3,592  115,019  0.1  2.312  0.999 
2004  2,955  3,919  138,451  0.101  2.299  0.999 
2005  3,131  4,084  154,041  0.099  2.275  0.999 
2006  3,248  4,260  166,614  0.1  2.289  0.999 
2007  3,419  4,368  184,420  0.102  2.279  0.999 
2008  3,408  4,436  184,881  0.104  2.304  1.0 
2009  3,658  4,609  212,771  0.098  2.218  1.0 
2010  3,639  4,668  221,090  0.1  2.204  0.999 
2011  3,462  4,688  220,020  0.111  2.228  0.999 
2012  3,621  4,875  209,517  0.114  2.28  1.0 
2013  3,593  4,846  231,959  0.096  2.189  1.0 
2014  3,334  4,679  210,099  0.096  2.208  1.0 
Figure 3.
The performance of the random walk method for detecting node communities. The word networks are treated as directed. The step parameter of the random walk method is the length of the random walks performed. Panels show the performance of the method with steps of 3 (blue dotted lines), 4 (red dotted lines), and 5 (orange dotted lines).
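What the step parameter controls can be illustrated with the t-step transition probabilities of a directed network: the t-th power of the row-normalized adjacency matrix. This is a minimal sketch of that quantity, not the community-detection implementation used for Figure 3:

```python
def transition_matrix(adj):
    # Row-normalize a directed adjacency matrix into transition probabilities.
    P = []
    for row in adj:
        s = sum(row)
        P.append([v / s if s else 0.0 for v in row])
    return P

def walk_probs(adj, steps):
    # Probability that a random walk of the given length moves between
    # each node pair: the `steps`-th power of the transition matrix.
    P = transition_matrix(adj)
    R = P
    n = len(P)
    for _ in range(steps - 1):
        R = [[sum(R[i][k] * P[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
    return R
```

Longer walks mix probability mass over larger neighborhoods, which is why the step length changes the granularity of the detected communities.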
Figure 4.
The performance of the Louvain method for detecting node communities. The word networks are treated as undirected. The resolution is the parameter of the Louvain method. Panels show the performance of the method with resolutions of 0.5 (orange dotted lines), 0.7 (red dotted lines), and 1.0 (blue dotted lines).
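The resolution parameter γ rescales the null-model term of the modularity that the Louvain method optimizes, Q(γ) = (1/2m) Σ_ij [A_ij − γ k_i k_j / (2m)] δ(c_i, c_j); smaller γ favors fewer, larger communities. A pure-Python sketch under that standard formulation (an assumption about the exact variant used here):

```python
def modularity(adj, communities, gamma=1.0):
    # adj: symmetric adjacency matrix (undirected network).
    # communities: list mapping each node index to a community label.
    n = len(adj)
    two_m = sum(sum(row) for row in adj)      # 2m for an undirected graph
    deg = [sum(row) for row in adj]
    q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:
                # observed edge weight minus the rescaled null-model expectation
                q += adj[i][j] - gamma * deg[i] * deg[j] / two_m
    return q / two_m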
Figure 6.
The comparisons between the LDA and our method. Panels (a–d) show the average values of the assessment indexes, and Panels (e–h) show the standard deviations, for running the LDA (red dotted lines) and the proposed method using the Louvain method to detect node communities (resolution 0.7, blue dotted lines).
Figure 7.
The performance of the LDA with the same number of topics as our method. Panels show the performance with the number of topics set to the number of token communities (blue dotted lines) and to the number of edge communities (red dotted lines). These communities are detected by the Louvain method.
Figure 9.
The evolution of the tokens emphasized by the LDA. Panels show the word clouds of the top five tokens in the top five topics, ranked by the summation of a token's weight over all topics. The number of topics is five. The index n here is the number of tokens emphasized when running the LDA ten times.
[1] Agrawal, R., & Srikant, R. (1994, September). Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases, 1215, 487–499.
[2] Ahn, Y.-Y., Bagrow, J.P., & Lehmann, S. (2010). Link communities reveal multiscale complexity in networks. Nature, 466(7307), 761–764. doi: 10.1038/nature09182
[3] Asuncion, A., Welling, M., Smyth, P., & Teh, Y.W. (2012). On smoothing and inference for topic models. UAI Press. arXiv:1205.2662.
[4] Blei, D.M., Ng, A.Y., & Jordan, M.I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research, 3, 993–1022.
[5] Blondel, V.D., Guillaume, J.-L., Lambiotte, R., & Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), P10008. doi: 10.1088/1742-5468/2008/10/P10008
[6] Boyack, K.W., Newman, D., Duhon, R.J., Klavans, R., Patek, M., Biberstine, J.R., ... & Börner, K. (2011). Clustering more than two million biomedical publications: Comparing the accuracies of nine text-based similarity approaches. PLoS ONE, 6(3), e18029. doi: 10.1371/journal.pone.0018029
[7] Cheng, J.P., Dong, L., & Lapata, M. (2016). Long short-term memory-networks for machine reading. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 551–561.
[8] Doucet, A., & Ahonen-Myka, H. (2010). An efficient any language approach for the integration of phrases in document retrieval. Language Resources and Evaluation, 44(1), 159–180. doi: 10.1007/s10579-009-9102-3
[9] Gehring, J., Auli, M., Grangier, D., Yarats, D., & Dauphin, Y.N. (2017, July). Convolutional sequence to sequence learning. In International Conference on Machine Learning, 1243–1252.
[10] Gildea, D., & Jurafsky, D. (2002). Automatic labeling of semantic roles. Computational Linguistics, 28(3), 245–288. doi: 10.1162/089120102760275983
[11] Girvan, M., & Newman, M.E. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12), 7821–7826.
[12] Griffiths, T.L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(suppl 1), 5228–5235.
[13] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780. pmid: 9377276
[14] Kalchbrenner, N., Espeholt, L., Simonyan, K., Oord, A.v.d., Graves, A., & Kavukcuoglu, K. (2016). Neural machine translation in linear time. arXiv preprint arXiv:1610.10099.
[15] Kim, Y., Denton, C., Hoang, L., & Rush, A.M. (2017). Structured attention networks. In International Conference on Learning Representations. arXiv:1702.00887.
[16] Kingsbury, P., & Palmer, M. (2002). From TreeBank to PropBank. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'02).
[17] Leicht, E.A., & Newman, M.E. (2008). Community structure in directed networks. Physical Review Letters, 100(11), 118703. pmid: 18517839
[18] Li, P.J., Lam, W., Bing, L., & Wang, Z. (2017). Deep recurrent generative decoder for abstractive text summarization. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2081–2090.
[19] McDonald, R., Pereira, F., Kulick, S., Winters, S., Jin, Y., & White, P. (2005, June). Simple algorithms for complex relation extraction with applications to biomedical IE. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), 491–498.
[20] Mintz, M., Bills, S., Snow, R., & Jurafsky, D. (2009, August). Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, 1003–1011.
[21] Pons, P., & Latapy, M. (2005, October). Computing communities in large networks using random walks. In ISCIS 2005: Computer and Information Sciences, 284–293.
[22] Ramage, D., Manning, C.D., & Dumais, S. (2011, August). Partially labeled topic models for interpretable text mining. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 457–465.
[23] Schmidhuber, J. (2001). Gradient flow in recurrent nets: The difficulty of learning long-term dependencies. Wiley-IEEE Press.
[24] Sethy, A., & Ramabhadran, B. (2008). Bag-of-word normalized n-gram models. In Ninth Annual Conference of the International Speech Communication Association, 1594–1597.
[25] Shannon, C.E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423. doi: 10.1002/bltj.1948.27.issue-3
[26] Small, H., Boyack, K.W., & Klavans, R. (2014). Identifying emerging topics in science and technology. Research Policy, 43(8), 1450–1467. doi: 10.1016/j.respol.2014.02.005
[27] Swampillai, K., & Stevenson, M. (2011, September). Extracting relations within and across sentences. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, 25–32.
[28] Talley, E.M., Newman, D., Mimno, D., Herr, B.W., Wallach, H.M., Burns, G.A., ... & McCallum, A. (2011). Database of NIH grants using machine-learned categories and graphical clustering. Nature Methods, 8(6), 443–444. doi: 10.1038/nmeth.1619
[29] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, 5998–6008.
[30] Velden, T., Boyack, K.W., Gläser, J., Koopman, R., Scharnhorst, A., & Wang, S. (2017). Comparison of topic extraction approaches and their results. Scientometrics, 111(2), 1169–1221. doi: 10.1007/s11192-017-2306-1
[31] Wallach, H.M. (2006, June). Topic modeling: Beyond bag-of-words. In Proceedings of the 23rd International Conference on Machine Learning, 977–984.
[32] Yin, W., Schütze, H., Xiang, B., & Zhou, B. (2016). ABCNN: Attention-based convolutional neural network for modeling sentence pairs. Transactions of the Association for Computational Linguistics, 4, 259–272. doi: 10.1162/tacl_a_00097
[33] Zeng, D.J., Liu, K., Lai, S., Zhou, G., & Zhao, J. (2014, August). Relation classification via convolutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2335–2344.
[34] Zhang, Y., Lu, J., Liu, F., Liu, Q., Porter, A., Chen, H., & Zhang, G. (2018). Does deep learning help topic extraction? A kernel k-means clustering method with word embedding. Journal of Informetrics, 12(4), 1099–1117. doi: 10.1016/j.joi.2018.09.004