Abstract
In recent years a number of "data mining" approaches for modeling data containing nonlinear and other complex dependencies have appeared in the literature. One of the key data mining techniques is decision trees, also referred to as classification and regression trees or CART (Breiman et al., 1993). That method produces decision rules that are relatively easy to apply, that partition the data, and that model many of the complexities in insurance data. More recently, considerable effort has been expended to improve the quality of the fit of regression trees. These newer methods are based on ensembles or networks of trees and carry names like TREENET and Random Forest. Viaene et al. (2002) compared several data mining procedures, including tree methods and logistic regression, for prediction accuracy on a small fixed data set of fraud indicators or "red flags". They found that simple logistic regression did as well at predicting expert opinion as the more sophisticated procedures. In this paper we introduce some available regression tree approaches and explain how they are used to model nonlinear dependencies in insurance claim data. We investigate the relative performance of several software products in predicting the key claim variables for the decision to investigate for excessive and/or fraudulent practices, and the expectation of favorable results from the investigation, in a large claim database. Among the software programs we investigate are CART, S-PLUS, TREENET, Random Forest and Insightful Miner Tree procedures. The data used for this analysis are the approximately 500,000 auto injury claims reported to the Detailed Claim Database (DCD) of the Automobile Insurers Bureau of Massachusetts from accident years 1995 through 1997. The modeling targets are the decision to order an independent medical examination or a special investigation for fraud, and the favorable outcomes of such decisions.
We find that all of the methods provide some predictive value, or lift, from the available DCD variables, with significant differences among the methods and among the four targets. All modeling outcomes are compared to logistic regression, as in Viaene et al. (2002), with some model/software combinations doing significantly better than the logistic model.
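The comparison of each model against logistic regression can be illustrated with a small sketch. It assumes, based only on the "ROC Curve" keyword, that the area under the ROC curve (AUC) is one of the comparison measures; the function name `roc_auc`, the labels, and both score lists are hypothetical illustrations and are not data or results from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney) formulation.

    This equals the probability that a randomly chosen positive case gets a
    higher score than a randomly chosen negative case, with ties counted as 1/2.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy target: 1 = claim was investigated, 0 = it was not.
labels = [1, 0, 1, 1, 0, 0, 1, 0]

# Hypothetical scores from two models (made-up numbers for illustration only).
tree_scores = [0.9, 0.2, 0.8, 0.35, 0.4, 0.1, 0.7, 0.3]
logit_scores = [0.7, 0.4, 0.6, 0.3, 0.5, 0.2, 0.8, 0.1]

auc_tree = roc_auc(labels, tree_scores)
auc_logit = roc_auc(labels, logit_scores)

print(f"tree AUC:     {auc_tree:.4f}")
print(f"logistic AUC: {auc_logit:.4f}")
```

An AUC of 0.5 corresponds to no lift over random ordering and 1.0 to perfect separation, so on this measure a tree-based model beats the logistic benchmark only if its AUC is higher on held-out claims.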
Keywords: Fraud, Data Mining, ROC Curve, Variable Importance, Decision Trees
Volume: Winter
Page: 1-49
Year: 2006
Categories:
Financial and Statistical Methods > Statistical Models and Methods > Data Mining
Financial and Statistical Methods > Statistical Models and Methods > Decision Methods
Actuarial Applications and Methodologies > Reserving > Fraud Detection
Publications: Casualty Actuarial Society E-Forum