Journal of Data and Information Science ›› 2020, Vol. 5 ›› Issue (2): 111-135. doi: 10.2478/jdis-2020-0014

• Research Papers •

A Two-Level Approach based on Integration of Bagging and Voting for Outlier Detection

Alican Dogan1, Derya Birant2

  1. The Graduate School of Natural and Applied Sciences, Dokuz Eylul University, Izmir, Turkey
  2. Department of Computer Engineering, Dokuz Eylul University, Izmir, Turkey
  • Received: 2019-12-13 Revised: 2020-04-27 Accepted: 2020-04-29 Online: 2020-05-20 Published: 2020-05-24
  • Contact: Derya Birant E-mail: derya@cs.deu.edu.tr

Abstract:

Purpose: This study aims to build a robust approach that accurately detects outliers in datasets. To this end, a novel approach is introduced that determines the likelihood that an object differs extremely from the general behavior of the entire dataset.

Design/methodology/approach: This paper proposes a novel two-level approach based on the integration of bagging and voting techniques for anomaly detection problems. The proposed approach, named Bagged and Voted Local Outlier Detection (BV-LOF), benefits from the Local Outlier Factor (LOF) as the base algorithm and improves its detection rate by using ensemble methods.

Findings: Several experiments were performed on ten benchmark outlier detection datasets to demonstrate the effectiveness of the BV-LOF method. According to the results, on average, the BV-LOF approach significantly outperformed LOF on 9 of the 10 datasets.

Research limitations: In the BV-LOF approach, the base algorithm is applied to each data subset multiple times, with a different neighborhood size (k) in each case and with different ensemble sizes (T). In our study, we chose the value ranges of k and T as [1-100]; however, these ranges can be changed according to the dataset handled and the problem addressed.
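The two-level procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the subsets are drawn or how the votes are combined, so the sketch assumes that bagging draws random feature subsets and that each (subset, k) run casts one vote for every point in its top-scoring fraction. The names `lof_scores`, `bv_lof_sketch`, `n_subsets`, `k_values`, and `top_fraction` are hypothetical, and the LOF routine is a brute-force version that uses exactly k neighbors per point rather than including ties.

```python
import math
import random


def lof_scores(points, k):
    """Brute-force LOF; simplification: exactly k neighbors, ties broken by index."""
    n = len(points)
    neighbors, k_dist = [], []
    for i in range(n):
        # Distances from point i to every other point, nearest first.
        d = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        neighbors.append(d[:k])
        k_dist.append(d[k - 1][0])
    lrd = []
    for i in range(n):
        # Reachability distance of i from neighbor j: max(k-distance(j), d(i, j)).
        reach = [max(k_dist[j], dij) for dij, j in neighbors[i]]
        # The small constant avoids division by zero for duplicated points.
        lrd.append(1.0 / (sum(reach) / k + 1e-12))
    # LOF: mean local reachability density of the neighbors relative to the point's own.
    return [sum(lrd[j] for _, j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


def bv_lof_sketch(points, n_subsets=10, k_values=(3, 5, 7), top_fraction=0.1, seed=0):
    """Return, per point, the share of runs that voted it an outlier (assumed scheme)."""
    rng = random.Random(seed)
    n, n_features = len(points), len(points[0])
    n_flag = max(1, round(top_fraction * n))
    votes, runs = [0] * n, 0
    for _ in range(n_subsets):
        # Level 1 (bagging, assumed here to be over features): random feature subset.
        size = rng.randint(max(1, n_features // 2), n_features)
        feats = rng.sample(range(n_features), size)
        projected = [[p[f] for f in feats] for p in points]
        for k in k_values:
            # Level 2 (voting): each neighborhood size votes for its top-scoring points.
            scores = lof_scores(projected, k)
            ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
            for i in ranked[:n_flag]:
                votes[i] += 1
            runs += 1
    return [v / runs for v in votes]


# Toy data: a 5 x 5 unit grid plus one distant point appended as the last element.
data = [[float(x), float(y)] for x in range(5) for y in range(5)] + [[20.0, 20.0]]
shares = bv_lof_sketch(data, top_fraction=0.05)
print(shares[-1], max(shares[:-1]))
```

A vote share close to 1 means that nearly every feature subset and neighborhood size agreed that the point is an outlier. The pairwise distance computation is quadratic in the number of points, so this form is only suitable for small examples.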

Practical implications: The proposed method can be applied to datasets from different domains (e.g., health, finance, manufacturing) without requiring any prior information. Since the BV-LOF method includes two-level ensemble operations, it may require more computation time than single-level ensemble methods; however, this drawback can be overcome by parallelization and by using a suitable data structure such as an R*-tree or KD-tree.

Originality/value: The proposed approach (BV-LOF) investigates multiple neighborhood sizes (k), which allows instances with different local densities to be found and thereby increases the likelihood of detecting outliers that LOF may miss. It also brings benefits such as easy implementation, improved capability, higher applicability, and interpretability.

Key words: Outlier detection, Local outlier factor, Ensemble learning, Bagging, Voting