Journal of Data and Information Science ›› 2019, Vol. 4 ›› Issue (4): 56-83. doi: 10.2478/jdis-2019-0021

• Research Paper

Identification of Sarcasm in Textual Data: A Comparative Study

Pulkit Mehndiratta, Devpriya Soni

  1. Jaypee Institute of Information Technology, Noida, India
  • Received: 2019-09-06 Revised: 2019-11-28 Online: 2019-12-11 Published: 2019-12-19
  • Contact: Pulkit Mehndiratta


Purpose: The ever-increasing penetration of the Internet into our lives has led to the generation of an enormous amount of multimedia content online, and textual data accounts for a major share of it. Understanding people’s sentiment is an important aspect of natural language processing, but the inferred opinion can be biased or incorrect if people use sarcasm while commenting, posting status updates, or reviewing a product or a movie. It is therefore of utmost importance to detect sarcasm correctly in order to predict people’s intentions accurately.

Design/methodology/approach: This study evaluates various machine learning models, along with standard and hybrid deep learning models, across several standardized datasets. The text is vectorized using word embedding techniques, which convert the textual data into vectors suitable for analysis. We use three standardized datasets available in the public domain and three word embeddings, i.e., Word2Vec, GloVe, and fastText, to validate the hypothesis.
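The vectorization step described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the embedding table, its three-dimensional vectors, and the helper names (TOY_EMBEDDINGS, tokenize, embed_sequence, mean_vector) are hypothetical stand-ins for pretrained Word2Vec, GloVe, or fastText vectors and whatever preprocessing the study actually used. It shows two common ways of turning a sentence into numbers: a padded sequence of word vectors, as typically fed to CNN or LSTM layers, and an averaged vector, as typically fed to conventional classifiers.

```python
from typing import Dict, List

# Hypothetical toy embedding table (assumption). A real experiment would load
# pretrained Word2Vec, GloVe, or fastText vectors with hundreds of dimensions.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "great": [0.9, 0.1, 0.0],
    "another": [0.1, 0.5, 0.2],
    "monday": [0.0, 0.3, 0.9],
}
EMBEDDING_DIM = 3


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = [tok.strip(".,!?;:\"'()") for tok in text.lower().split()]
    return [tok for tok in tokens if tok]


def embed_sequence(text: str, embeddings: Dict[str, List[float]],
                   dim: int, max_len: int) -> List[List[float]]:
    """Return a max_len x dim matrix: one vector per token, zero vectors for
    out-of-vocabulary tokens, truncated or zero-padded to max_len rows."""
    zero = [0.0] * dim
    vectors = [list(embeddings.get(tok, zero)) for tok in tokenize(text)]
    vectors = vectors[:max_len]
    vectors += [list(zero) for _ in range(max_len - len(vectors))]
    return vectors


def mean_vector(text: str, embeddings: Dict[str, List[float]],
                dim: int) -> List[float]:
    """Average the vectors of in-vocabulary tokens into one fixed-length
    feature vector; return a zero vector if no token is known."""
    known = [embeddings[tok] for tok in tokenize(text) if tok in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


if __name__ == "__main__":
    sentence = "Great, another Monday!"
    print(embed_sequence(sentence, TOY_EMBEDDINGS, EMBEDDING_DIM, max_len=5))
    print(mean_vector(sentence, TOY_EMBEDDINGS, EMBEDDING_DIM))
```

The padded sequence keeps word order, which sequence models can exploit, whereas the averaged vector discards order but yields a fixed-length input for conventional classifiers.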

Findings: The results were analyzed and conclusions drawn. The key finding is that the hybrid models, which include Bidirectional Long Short-Term Memory (Bi-LSTM) and Convolutional Neural Network (CNN) components, outperform the other conventional machine learning and deep learning models across all the datasets considered in this study, which supports our hypothesis.

Research limitations: Using data from different sources and customizing the models for each dataset slightly decreases the usability of the technique. Overall, however, this methodology provides effective measures to identify the presence of sarcasm, with an average accuracy of at least 80% for one dataset and results better than the current baselines for the other datasets.

Practical implications: The results provide solid insights for system developers who wish to integrate this model into the real-time analysis of reviews or comments posted in the public domain. The study also has practical implications for businesses that depend on user ratings and public opinion, and it provides a starting point for researchers working on the problem of sarcasm identification in textual data.

Originality/value: This is a first-of-its-kind study that compares conventional and hybrid methods for predicting sarcasm in textual data. The study also provides indications that hybrid models perform better when applied to textual data for the analysis of sarcasm.

Key words: Machine learning, Artificial neural networks, Word embedding, Text vectorization, Accuracy