Distinguishing the Forest from the TREES: A Comparison of Tree-Based Data Mining Methods

Abstract

Decision trees, also referred to as classification and regression trees (C&RT), are among the most commonly used data mining techniques. Several newer decision tree methods are based on ensembles or networks of trees and carry names such as TreeNet and Random Forest. Viaene et al. compared several data mining procedures, including tree methods and logistic regression, for modeling expert opinion of fraud/no fraud using a small fixed data set of fraud indicators, or “red flags.” They found that simple logistic regression matched expert opinion on fraud/no fraud as well as the more sophisticated procedures did. In this paper we introduce some publicly available regression tree approaches and explain how they are used to model four proxies for fraud in insurance claim data. We find that all of the methods provide some explanatory value, or lift, from the available variables, with significant differences in fit among the methods and across the four targets. All modeling outcomes are compared to logistic regression, as in Viaene et al., and some model/software combinations do significantly better than the logistic model.

Keywords: Fraud, data mining, ROC curve, claim investigation, decision trees

Volume
2, Issue 2, Fall
Page
184-208
Year
2008
Categories
Financial and Statistical Methods
Statistical Models and Methods
Data Mining
Actuarial Applications and Methodologies
Reserving
Fraud Detection
Publications
Variance
Authors
Richard A Derrig
Louise A Francis