Variable Reduction for Predictive Modeling with Clustering

Abstract
Motivation: Insurance data warehouses contain thousands of variables, and external sources of information can be attached to them as well. When actuaries build a predictive model, they are confronted with redundant variables that reduce model efficiency: they lengthen development time, complicate interpretation of the results, and inflate the variance of the estimates. For these reasons, a method is needed to reduce the number of variables input to the predictive model.

Method: We used PROC VARCLUS (SAS/STAT) to find clusters of variables defined at a geographical level and attached to a database of automobile policies. The procedure finds clusters of variables that are correlated among themselves and not correlated with variables in other clusters. Using business knowledge and the 1-R² ratio, cluster representatives can be selected, thus reducing the number of variables. The cluster representatives are then input into the predictive model.
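
Illustration: The selection step above can be sketched in a few lines of Python. This is a minimal sketch, not the SAS procedure itself. It assumes the clusters are already known, and it summarizes each cluster by a centroid component (the average of its standardized variables), a simplification chosen here to keep the example dependency-free. The 1-R² ratio of a variable is taken as (1 - R² with its own cluster component) / (1 - R² with the closest other cluster component), so a low ratio marks a good representative. The variable names, the data values, and the helper names are all hypothetical.

```python
from statistics import mean, stdev


def standardize(col):
    """Center a column and scale it to unit sample standard deviation."""
    m, s = mean(col), stdev(col)
    return [(v - m) / s for v in col]


def r_squared(a, b):
    """Squared Pearson correlation between two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab * sab / (saa * sbb)


def one_minus_r2_ratios(data, clusters):
    """Return {variable: 1-R2 ratio}; needs at least two clusters.

    data:     dict mapping variable name -> list of numeric values
    clusters: dict mapping cluster name  -> list of member variable names
    """
    z = {name: standardize(col) for name, col in data.items()}
    # Assumption: centroid component = row-wise average of the standardized
    # members. This only makes sense if members are positively correlated.
    comp = {
        c: [mean(row) for row in zip(*(z[n] for n in names))]
        for c, names in clusters.items()
    }
    ratios = {}
    for c, names in clusters.items():
        for n in names:
            own = r_squared(z[n], comp[c])
            # Highest R2 with the component of any other cluster.
            nearest = max(r_squared(z[n], comp[o]) for o in clusters if o != c)
            ratios[n] = (1 - own) / (1 - nearest)
    return ratios


# Hypothetical geographic variables for eight territories.
data = {
    "pop_density": [1, 2, 3, 4, 5, 6, 7, 8],
    "traffic_density": [2, 1, 4, 3, 6, 5, 8, 7],
    "median_income": [5, 1, 8, 3, 2, 7, 4, 6],
    "median_home_value": [4, 2, 8, 3, 1, 6, 5, 7],
}
clusters = {
    "density": ["pop_density", "traffic_density"],
    "wealth": ["median_income", "median_home_value"],
}

ratios = one_minus_r2_ratios(data, clusters)
# Candidate representative per cluster: the member with the lowest ratio.
representatives = {c: min(names, key=ratios.get) for c, names in clusters.items()}
```

In practice the lowest ratio is only a starting point: as the method states, business knowledge may favor a different member of the cluster, for example one that is easier to explain or to obtain.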

Conclusions: The variable clustering procedure used in the paper quickly reduces a set of numeric variables to a manageable set of variable clusters.

Availability: PROC VARCLUS from SAS/STAT was used for this study. We found an implementation of variable clustering in R, the function varclus, but we did not experiment with it.

Keywords: variable reduction, clustering, statistical method, data mining, predictive modeling.

Volume: Winter
Pages: 89-100
Year: 2006
Keywords: predictive analytics
Categories:
Financial and Statistical Methods / Statistical Models and Methods / Data Mining
Financial and Statistical Methods / Statistical Models and Methods / Data Visualization
Financial and Statistical Methods / Statistical Models and Methods / Exploratory Data Analysis
Publications: Casualty Actuarial Society E-Forum
Authors: Kevin Lonergan, Robert Sanche