Journal of Data and Information Science  2018, Vol. 3 Issue (2): 20-37    DOI: 10.2478/jdis-2018-0007
Research Paper     
CitationAS: A Tool of Automatic Survey Generation Based on Citation Content*
Jie Wang (1,3), Chengzhi Zhang (2), Mengying Zhang (1,3), Sanhong Deng (1,3)
(1) School of Information Management, Nanjing University, Nanjing 210023, China
(2) Department of Information Management, Nanjing University of Science and Technology, Nanjing 210094, China
(3) Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing University, Nanjing 210023, China

Abstract  

Purpose: This study aims to build an automatic survey generation tool, named CitationAS, based on citation content, which is represented by the set of citing sentences in the original articles.

Design/methodology/approach: First, we apply LDA to analyse the topic distribution of citation content. Second, in CitationAS, we use bisecting K-means, Lingo and STC to cluster the retrieved citation content. Word2Vec, WordNet and a combination of the two are then applied to generate cluster labels. Next, we use TF-IDF and MMR, together with sentence location information, to extract important sentences, which are used to generate surveys. Finally, we evaluate the generated surveys manually.

Findings: In the experiments, we choose 20 high-frequency phrases as search terms. Results show that Lingo-Word2Vec, STC-WordNet and bisecting K-means-Word2Vec produce better clustering results. On a 5-point evaluation scale, the survey quality scores obtained by the designed methods are close to 3, indicating that the surveys are within acceptable limits. Survey quality improves when sentence location information is considered. Combining Lingo and Word2Vec with TF-IDF or MMR yields higher survey quality.

Research limitations: The manual evaluation method may be somewhat subjective. We use a simple linear function to combine Word2Vec and WordNet, which may not bring out their strengths. The generated surveys may miss newly created knowledge in some articles, which may be concentrated in sentences that contain no citations.

Practical implications: The CitationAS tool can automatically generate a comprehensive, detailed and accurate survey according to the user's search terms. It can also help researchers learn about the research status of a given field.

Originality/value: The CitationAS tool is practical. It merges cluster labels at the semantic level to improve clustering results. The tool also considers sentence location information when calculating sentence scores with TF-IDF and MMR.



Key words: Automatic survey system; Citation content; Clustering algorithms; Label generation approaches; Sentence extraction methods
Published: 14 June 2018
Cite this article:

Jie Wang, Chengzhi Zhang, Mengying Zhang, Sanhong Deng. CitationAS: A Tool of Automatic Survey Generation Based on Citation Content*. Journal of Data and Information Science, 2018, 3(2): 20-37.

URL:

http://manu47.magtech.com.cn/Jwk3_jdis/10.2478/jdis-2018-0007     OR     http://manu47.magtech.com.cn/Jwk3_jdis/Y2018/V3/I2/20

No. Citation sentence
1 Their transcription is dependent on mouse Cebpe and human CEBPE [12].
2 These changes may derive in a higher risk for type 2 diabetes development [8], [9].
3 It interacts with a variety of transcriptional factors and MLL proteins [9]-[12].
4 Most pathogens of humans, animals and plants are multi-host pathogens [1]-[3], [20].
Table 1 Citation Sentences Examples
Figure 1. Framework of CitationAS.
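The abstract describes sentence extraction with TF-IDF and MMR, optionally taking sentence location into account. The following is a minimal sketch of that idea, not the CitationAS implementation: the whitespace tokenizer, the smoothed IDF variant, the trade-off parameter `lam`, and the `location_weights` multipliers are all assumptions introduced for illustration, since the paper's exact scoring formulas are not reproduced on this page.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Build one sparse TF-IDF vector (term -> weight) per token list."""
    n_docs = len(token_lists)
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty sentences
        vectors.append({
            # Smoothed IDF (an assumption; other IDF variants would also work).
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(sentences, query, k, lam=0.5, location_weights=None):
    """Return the indices of up to k sentences chosen by MMR.

    location_weights is a hypothetical per-sentence multiplier standing in
    for "sentence location information"; it defaults to 1.0 everywhere.
    """
    token_lists = [s.lower().split() for s in sentences]
    token_lists.append(query.lower().split())
    *sent_vecs, query_vec = tfidf_vectors(token_lists)
    weights = location_weights or [1.0] * len(sentences)
    relevance = [w * cosine(v, query_vec) for v, w in zip(sent_vecs, weights)]

    selected = []
    candidates = list(range(len(sentences)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Redundancy: highest similarity to any already selected sentence.
            redundancy = max(
                (cosine(sent_vecs[i], sent_vecs[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` the selection reduces to ranking by (location-weighted) TF-IDF relevance alone; lower values of `lam` penalize sentences that repeat what has already been selected, which is the redundancy control MMR is meant to provide.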
Phrase (Frequency) | Phrase (Frequency)
cell line (37507) | reactive oxygen species (5160)
gene expression (37001) | central nervous system (4418)
amino acid (35165) | smooth muscle cell (3439)
transcription factor (25626) | protein protein interaction (3286)
cancer cell (25605) | single nucleotide polymorphism (2535)
stem cell (22567) | tumor necrosis factor (2482)
growth factor (17531) | genome wide association (2386)
signaling pathway (16597) | case control study (2269)
cell proliferation (14203) | false discovery rate (2209)
meta analysis (12647) | innate immune response (2133)
Table 2 Top 20 Phrases According to High Frequency.
Score Evaluation standards
5 Sentences are very fluent. Paragraphs and surveys are very comprehensive, contain very little redundancy and fully reflect the retrieval topics. The logical structure of the survey is reasonable.
4 Sentences are relatively fluent. Paragraphs and surveys are relatively comprehensive, contain relatively little redundancy and reflect the retrieval topics relatively well. The logical structure of the survey is relatively reasonable.
3 Sentences are basically fluent. Paragraphs and surveys are basically comprehensive, contain some redundancy and basically reflect the retrieval topics. The logical structure of the survey is basically reasonable.
2 Sentences are not fluent enough. Paragraphs and surveys are not comprehensive, contain relatively high redundancy and do not adequately reflect the retrieval topics. The logical structure of the survey is confusing.
1 Sentences are very poorly formed. Paragraphs and surveys are far from comprehensive, contain very high redundancy and fail to reflect the retrieval topics. The survey has no logical structure.
Table 3 Evaluation Standards.
Figure 2. User Interface of CitationAS.
Topic No. Topic words
1 protein, domain, binding, structure, membrane, residue, acid, interaction, site, amino
2 disease, patient, increase, risk, study, disorder, chronic, factor, blood, clinical
3 bacteria, gene, strain, plant, resistance, species, report, found, host, pathogen
4 study, health, patient, year, hiv, treatment, country, population, report, clinical
5 gene, sequence, data, analysis, based, identified, study, expression, number, region
6 model, method, data, based, test, analysis, value, number, calculated, approach
7 cell, expression, tissue, mice, differentiation, development, human, stem, bone, mouse
8 acid, increase, level, activity, glucose, concentration, stress, enzyme, insulin, effect
9 study, process, task, response, visual, effect, memory, information, social, related
10 cell, signalling, pathway, activation, receptor, role, factor, protein, expression, apoptosis
Table 4 Topic Distribution in Dataset.
Ranking | Volunteer A | Volunteer B
1 | STC-TF-IDF | Lingo-TF-IDF
2 | Lingo-TF-IDF | Lingo-MMR
3 | STC-MMR | STC-MMR
4 | Lingo-MMR | STC-TF-IDF
5 | bisecting K-means-MMR | bisecting K-means-MMR
6 | bisecting K-means-TF-IDF | bisecting K-means-TF-IDF
Table 5 Six Methods Rankings based on Two Volunteers.
Ranking | Volunteer A | Volunteer B
1 | Lingo-MMR | Lingo-MMR
2 | Lingo-TF-IDF | Lingo-TF-IDF
3 | STC-TF-IDF | STC-MMR
4 | STC-MMR | STC-TF-IDF
5 | bisecting K-means-TF-IDF | bisecting K-means-MMR
6 | bisecting K-means-MMR | bisecting K-means-TF-IDF
Table 6 Six Methods Rankings based on Two Volunteers when Considering Sentence Location.
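One way to quantify how closely the two volunteers agree in Tables 5 and 6 is Spearman's rank correlation. The sketch below is purely illustrative and is not a statistic reported on this page; the function name `spearman_rho` and the choice of Spearman's coefficient are assumptions, while the rankings are copied from the two tables.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation for two tie-free rankings of the same items."""
    if len(set(ranking_a)) != len(ranking_a) or set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must list the same items, each exactly once")
    n = len(ranking_a)
    if n < 2:
        raise ValueError("at least two items are required")
    rank_in_b = {item: rank for rank, item in enumerate(ranking_b, start=1)}
    # Sum of squared rank differences between the two rankings.
    d_squared = sum(
        (rank - rank_in_b[item]) ** 2
        for rank, item in enumerate(ranking_a, start=1)
    )
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Rankings from Table 5 (without sentence location), best method first.
table5_a = ["STC-TF-IDF", "Lingo-TF-IDF", "STC-MMR", "Lingo-MMR",
            "bisecting K-means-MMR", "bisecting K-means-TF-IDF"]
table5_b = ["Lingo-TF-IDF", "Lingo-MMR", "STC-MMR", "STC-TF-IDF",
            "bisecting K-means-MMR", "bisecting K-means-TF-IDF"]

# Rankings from Table 6 (with sentence location), best method first.
table6_a = ["Lingo-MMR", "Lingo-TF-IDF", "STC-TF-IDF", "STC-MMR",
            "bisecting K-means-TF-IDF", "bisecting K-means-MMR"]
table6_b = ["Lingo-MMR", "Lingo-TF-IDF", "STC-MMR", "STC-TF-IDF",
            "bisecting K-means-MMR", "bisecting K-means-TF-IDF"]

rho_table5 = spearman_rho(table5_a, table5_b)
rho_table6 = spearman_rho(table6_a, table6_b)
```

Under these assumptions the coefficient is 0.60 for Table 5 (squared rank differences sum to 14) and about 0.89 for Table 6 (sum of 4), so the two volunteers' rankings agree more closely once sentence location is considered. With only six items and two raters, this should be read as a rough indication rather than a significance test.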
Figure 3. Average Scores of Different Methods.