Dancing With Dirty Data: Methods for Exploring and Cleaning Data

Abstract
Motivation: Much of the data that actuaries work with is dirty. That is, it contains errors, miscodings, missing values, and other flaws that affect the validity of analyses performed with it.

Methods: This paper will give an overview of methods that can be used to detect errors and remediate data problems. The methods will include outlier detection procedures from the exploratory data analysis and data mining literature as well as methods from research on coping with missing values. The paper will also address the need for accurate and comprehensive metadata.

Conclusions: A number of graphical tools, such as histograms and box-and-whisker plots, are useful in highlighting unusual values in data. A new tool based on data spheres appears to have the potential to screen multiple variables simultaneously for outliers. For remediating missing data problems, imputation is a straightforward and frequently used approach.
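
As a minimal sketch of two of the ideas above, the following Python snippet applies the 1.5 x IQR fence rule that underlies a box-and-whisker plot to flag unusual values, and then fills missing values with the median of the observed ones. The paper itself points to R; Python is used here only for illustration. The function names, the sample claim amounts, the multiplier k = 1.5, and the choice of median imputation are all assumptions made for this example and are not taken from the paper.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) box-plot fences: Q1 - k*IQR and Q3 + k*IQR.

    Missing values (None) are ignored when computing the quartiles.
    """
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the observed values that fall outside the box-plot fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v is not None and (v < lower or v > upper)]


def impute_median(values):
    """Replace each missing value (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical claim amounts; None marks a missing value.
claims = [1200, 1500, None, 1100, 1300, 98000, 1400, None, 1250]

outliers = flag_outliers(claims)   # [98000]
cleaned = impute_median(claims)    # both None entries become 1300
print(outliers)
print(cleaned)
```

The median is used as the fill value here because, unlike the mean, it is not pulled upward by the extreme claim. Whether a flagged value is an error or a legitimate large loss remains a judgment call that the fence rule alone cannot settle.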

Availability: The R statistical language can be used to perform the exploratory and cleaning methods described in this paper. It can be downloaded for free at http://cran.r-project.org/

Keywords: data quality, data mining, ratemaking, exploratory data analysis.

Volume: Winter
Page: 198-254
Year: 2005
Categories:
Actuarial Applications and Methodologies > Data Management and Information > Data Administration, Warehousing and Design
Financial and Statistical Methods > Statistical Models and Methods > Data Diagnostics
Financial and Statistical Methods > Statistical Models and Methods > Data Mining
Actuarial Applications and Methodologies > Data Management and Information > Data Quality
Actuarial Applications and Methodologies > Data Management and Information > Data Reconciliation
Financial and Statistical Methods > Statistical Models and Methods > Exploratory Data Analysis
Publications: Casualty Actuarial Society E-Forum
Prizes: Management Data and Information Prize
Authors: Louise A Francis