For modeling claim counts within the GLM framework, the Poisson distribution is a popular choice. In the presence of overdispersion, the negative binomial is also sometimes used. The statistical literature suggests that accounting for excess zeros can improve the fit of count models when overdispersion is present. In insurance, excess zeros may arise when claims near the deductible are not reported to the insurer, inflating the number of zero-claim policies relative to the predictions of a Poisson or negative binomial distribution.
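As an illustration of the excess-zero effect, the following minimal sketch compares the probability of zero claims under a zero-inflated Poisson (ZIP) with that of an ordinary Poisson having the same mean. The parameter names `lam` (Poisson mean) and `pi` (probability of a structural zero), and the values assigned to them, are hypothetical and are not taken from the paper's data.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of k claims under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def zip_pmf(k: int, lam: float, pi: float) -> float:
    """Zero-inflated Poisson: a structural zero with probability pi,
    otherwise a draw from Poisson(lam)."""
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


# Hypothetical parameter values, chosen only for illustration.
lam, pi = 0.5, 0.3

# The mean of the ZIP is (1 - pi) * lam; use it for the comparison Poisson.
zip_mean = (1.0 - pi) * lam

p0_zip = zip_pmf(0, lam, pi)
p0_poisson = poisson_pmf(0, zip_mean)

print(f"P(0 claims), ZIP:     {p0_zip:.4f}")
print(f"P(0 claims), Poisson: {p0_poisson:.4f}")
```

With the means matched, the ZIP places more probability on zero than the Poisson does, which is the pattern an excess of zero-claim policies would produce.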
In predictive modeling practice, data mining techniques such as neural networks and decision trees are often used to handle data complexities such as nonlinearities and interactions. Data mining techniques are sometimes combined with GLMs to improve the performance and/or efficiency of the predictive modeling analysis. One augmentation of GLMs uses decision tree methods in the data preprocessing step. An important preprocessing task is to reduce the number of levels of categorical variables so that sparse cells are eliminated and only significant groupings of the categories remain.
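The level-reduction idea can be sketched as follows. This is a simplified, CHAID-style merge step and not the full CHAID algorithm or the procedure used in the paper: it assumes a binary target (policy with a claim versus without), uses an unadjusted Pearson chi-square test with one degree of freedom, and repeatedly merges the pair of levels that differ least until every remaining pair differs at the chosen significance level. The function names, the `alpha` threshold, and the territory counts are all hypothetical.

```python
import math
from itertools import combinations


def chi2_2x2(a, b):
    """Pearson chi-square statistic and p-value (1 d.f.) for two rows of
    (policies with a claim, policies without a claim).
    Assumes every row and column total is positive."""
    table = [a, b]
    row_tot = [sum(r) for r in table]
    col_tot = [a[0] + b[0], a[1] + b[1]]
    n = sum(row_tot)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_tot[i] * col_tot[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square with 1 d.f.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def merge_levels(counts, alpha=0.05):
    """Merge the least-different pair of levels until all pairs differ at alpha."""
    groups = {(name,): tuple(c) for name, c in counts.items()}
    while len(groups) > 1:
        g1, g2 = max(
            combinations(groups, 2),
            key=lambda pair: chi2_2x2(groups[pair[0]], groups[pair[1]])[1],
        )
        if chi2_2x2(groups[g1], groups[g2])[1] <= alpha:
            break  # every remaining pair is significantly different
        merged = tuple(x + y for x, y in zip(groups[g1], groups[g2]))
        del groups[g1], groups[g2]
        groups[g1 + g2] = merged
    return groups


# Hypothetical counts per territory: (with claim, without claim).
territory_counts = {
    "A": (50, 950),
    "B": (52, 948),
    "C": (100, 900),
    "D": (98, 902),
}

grouped = merge_levels(territory_counts)
for levels, (claims, no_claims) in grouped.items():
    print(levels, claims, no_claims)
```

In this made-up example the four territories collapse into two groups with similar claim rates, so a GLM would need one indicator rather than three. A production implementation would also need to handle sparse cells with zero totals and multiple-comparison adjustments.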
Method: This paper addresses some common problems in fitting count models to data. These are:
- Excess zeros
- Parsimonious reduction of category levels
- Nonlinearity
Results: The research described in this paper applied zero-inflated and hybrid models to claim frequency data. The research suggests that mixtures of GLMs incorporating adjustments for excess zeros improve the fit compared to single-distribution count models on some count data. The analysis also indicates that variable preprocessing using the CHAID tree technique can help reduce the complexity of models by retaining only category groupings that are significant with respect to their impact on the dependent variable.
Conclusions: By incorporating greater flexibility into GLM count models, practitioners may be able to improve the fit of models and increase the efficiency of the modeling effort. Use of the ZIP or ZINB improves the model fit for an illustrative automobile insurance database. The ZIP or ZINB distributions also provided a better overall approximation to the unconditional distribution of the data for a few additional insurance and noninsurance databases. While the categorical variables in our illustrative data contained only a few categories compared to most realistic databases encountered in insurance applications, the CHAID procedure improved the fit of several predictive models. We also illustrate how the procedure can be applied to efficiently preprocess categorical variables with large numbers of categories.
Availability: Excel spreadsheets comparing the Poisson, negative binomial, zero-inflated Poisson, and zero-inflated negative binomial distributions, as well as R code for reproducing many of the models used in this paper, will be available on the CAS Web Site.
Keywords: Predictive modeling, automobile ratemaking, generalized linear models, data mining