Variable Reduction for Predictive Modeling with Clustering

Link

https://www.casact.org/sites/default/files/database/forum_06wforum_06w93.pdf

Abstract

Motivation: Insurance data warehouses contain thousands of variables, and external sources of information can be attached to them as well. When actuaries build a predictive model, they are confronted with redundant variables that reduce model efficiency: they lengthen model development time, complicate interpretation of the results, and inflate the variance of the estimates. For these reasons, a method is needed to reduce the number of variables input to the predictive model.

Method: We used PROC VARCLUS (SAS/STAT) to find clusters of variables defined at a geographical level and attached to a database of automobile policies. The procedure finds clusters of variables that are correlated among themselves and not correlated with variables in other clusters. Using business knowledge and the 1-R² ratio, cluster representatives can be selected, thus reducing the number of variables. The cluster representatives are then input to the predictive model.
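As an illustration of the selection step, the 1-R² ratio of a variable is commonly defined as (1 - R² with its own cluster) / (1 - R² with the nearest other cluster), so a small ratio marks a variable that represents its own cluster well and is well separated from the others. The Python sketch below is not the paper's code and does not reimplement PROC VARCLUS: it assumes the cluster assignment is already known, summarizes each cluster by its first principal component, and requires at least two clusters. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def cluster_component(Z):
    """Scores of the first principal component of the standardized columns Z."""
    # Rows of vt are the principal directions; vt[0] is the leading one.
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ vt[0]


def one_minus_r2_ratios(X, labels):
    """1-R^2 ratio per variable: (1 - R2 own cluster) / (1 - R2 nearest other cluster)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Standardize each variable so the components are correlation-based.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    ids = np.unique(labels)
    comps = {c: cluster_component(Z[:, labels == c]) for c in ids}
    ratios = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        # Squared correlation of variable j with every cluster component.
        r2 = {c: np.corrcoef(Z[:, j], comps[c])[0, 1] ** 2 for c in ids}
        own = r2[labels[j]]
        nearest = max(v for c, v in r2.items() if c != labels[j])
        ratios[j] = (1.0 - own) / (1.0 - nearest)
    return ratios


def pick_representatives(X, labels):
    """For each cluster, return the column index with the smallest 1-R^2 ratio."""
    labels = np.asarray(labels)
    ratios = one_minus_r2_ratios(X, labels)
    reps = {}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        reps[int(c)] = int(idx[np.argmin(ratios[idx])])
    return reps


# Synthetic example (assumed data): two independent latent factors,
# each observed through three noisy variables.
rng = np.random.default_rng(0)
n = 500
f1, f2 = rng.normal(size=n), rng.normal(size=n)
noise_levels = (0.2, 0.4, 0.6)
X = np.column_stack(
    [f1 + s * rng.normal(size=n) for s in noise_levels]
    + [f2 + s * rng.normal(size=n) for s in noise_levels]
)
labels = np.array([0, 0, 0, 1, 1, 1])  # assumed cluster assignment

ratios = one_minus_r2_ratios(X, labels)
reps = pick_representatives(X, labels)
print(np.round(ratios, 3))
print(reps)
```

In practice the ratio is only one input: as the abstract notes, business knowledge may favor a variable with a slightly higher ratio if it is easier to interpret or to obtain.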

Conclusions: The variable clustering procedure used in the paper quickly reduces a set of numeric variables to a manageable set of variable clusters.

Availability: PROC VARCLUS from SAS/STAT was used for this study. We found an implementation of variable clustering in R, the function varclus, although we did not experiment with it.

Keywords: variable reduction, clustering, statistical method, data mining, predictive modeling.

Volume

Winter

Page

89 - 100

Year

2006

Keywords

predictive analytics
