Blog

Modeling Group Life Mortality Experience: A How-To Guide – Part 3: Data Validation

By Jonathan Polon on 3/15/2018

Part 3: Data Validation

Overview

The focus of this third part of our series on modeling Group Life Mortality Experience is to review the Basic Life Death and Disability dataset – not yet for analysis, but simply to validate the quality of the data.

We will be working with industry data that was provided to the Society of Actuaries (by several contributing Group Life carriers) for its 2016 Group Term Life Experience Study. If you have not yet acquired this dataset, please refer to Part 2 of this series.

Summary of Previous Posts

Before we move forward, here is a brief reminder of what we covered in previous parts of this series.

Part 1: Introduction

• Listed potential applications for the model of industry Group Life Mortality experience

• Outlined future parts in this blog series

• Explained the rationale for performing the analysis in R

Part 2: Data Acquisition

• Read the data into R

• Determined the size of the dataset (21,848,920 records and 25 data fields)

• Found the names and data types for the 25 data fields

General Approach to Data Validation

Again, our objective at this stage of the project is to validate the data quality. We’re not yet ready to begin analyzing mortality experience. We simply want to review each data field to ensure that (i) we understand the data within and (ii) the distribution of the data is generally consistent with our expectations.

You may recall from Part 2 that the dataset included three types of data classes:

• Factor (categorical data)

• Integer

• Numeric

Our approach to data validation is similar for each of these three data classes, though the R code required differs slightly for each. We will not explore every data field within this blog post. Rather, for illustrative purposes, we will focus on one or two fields from each data class: factor, integer and numeric.

Before diving in, let’s list the 25 data fields (column names) of the dataset and note the data class.
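One quick way to produce such a listing yourself is to apply the class function to every column; this is a minimal base-R sketch, not a required step:

# Report the data class of each of the 25 columns in the data frame

sapply(SOA_Basic_Data, class)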

Data Fields: Basic Life Death and Disability Dataset

Field Name                       Data Class
exposure_coverage_type           factor
face_amount_band                 factor
salary                           factor
group_size                       factor
sex                              factor
central_age                      integer
observation_year                 integer
five_digit_zip_code              factor
four_digit_zip_code              factor
three_digit_zip_code             factor
four_digit_sic_code              factor
three_digit_sic_code             factor
two_digit_sic_code               integer
death_policies_exposed           numeric
death_amount_exposed             numeric
death_count                      integer
death_claim_amount               integer
expected_death_by_policy         numeric
expected_death_by_amount         numeric
disability_amount_exposed        numeric
disability_claim_amount          integer
expected_disability_by_amount    numeric
disability_policies_exposed      numeric
disability_count                 integer
expected_disability_by_policy    numeric

The data field names are largely self-explanatory, so I will not describe each field individually before we commence the data exploration.

Data Exploration in R

As in the last post, we will not require any scripting for this initial data exploration in R; we will continue to work interactively on the command line. By convention, I will show the commands I enter in grey and the output returned in blue. The number sign (#) will denote comments.


Data Fields of “Factor” Class

This class is used for categorical data fields, meaning data fields that are non-quantitative but instead classify each record into one of a finite set of classes. Examples of categorical data include:

• Marital status (single, married, divorced, etc.)

• Occupation (actor, actuary, arborist, etc.)

• Hair color (black, blond, brown, grey, red, white, etc.)
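As a brief illustration of how R stores categorical data, here is a toy factor built from hypothetical hair-color values (not data from our study):

# Build a factor from a character vector of hypothetical hair colors

hair <- factor(c("brown", "black", "brown", "red"))

levels(hair)

"black" "brown" "red"

# By default, R sorts the levels alphabetically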

The first two functions we will use in R will be:

• levels: Will output the class labels for a data field with class type of “factor”.

• summary: Will output the class labels and number of observations for a data field with class type of “factor”.

Let’s begin with a look at the sex field of our dataset.

# Find the levels of the “sex” data field from the SOA_Basic_Data data frame

 

levels(SOA_Basic_Data$sex)

"Female" "Male"

# The “sex” data field has two levels – “Female” or “Male”

 

 

# Find the number of observations for each level of the “sex” data field

 

summary(SOA_Basic_Data$sex)

  Female     Male
 9676233 12172687

# The data frame has 9,676,233 rows where sex = “Female” and 12,172,687 where sex = “Male”

 

We now know that the sex data field takes values of either “Female” or “Male”, and we know how many records there are of each. However, the number of records doesn’t tell us anything about the distribution of the exposures by sex – neither by number of lives nor by amount insured. To get this information, we need to aggregate two other data fields – death_policies_exposed and death_amount_exposed – by sex. We can achieve this using the aggregate function.

• aggregate: Splits the data into subsets and computes summary statistics for each subset.

• cbind: Combines a sequence of elements by column. In this case, we will use cbind to pass multiple data fields into the aggregate function.

# Find the number and amount of exposures by “sex” in the SOA_Basic_Data data frame:

# Aggregate two data fields (“death_policies_exposed” and “death_amount_exposed”)

# To get the “sum” by the “sex” data field from the SOA_Basic_Data data frame

 

aggregate(cbind(death_policies_exposed, death_amount_exposed) ~ sex,

data = SOA_Basic_Data, sum)

    sex    death_policies_exposed    death_amount_exposed
 Female                  19311789            8.781057e+11
   Male                  25150958            1.334024e+12

 

 

# Find the proportion of death_policies_exposed with sex = “Female”

19311789/(19311789+25150958)

0.4343364

# 43% of lives exposed are female

 

 

# Find the average policy amount where sex = “Female”

8.781057e+11/19311789

45469.93

# The average policy amount for females is about $45,470
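As an aside, the same split can be obtained in a single step with base R’s prop.table function; this alternative sketch is not part of the original workflow:

# Alternative: compute each sex's share of death_policies_exposed in one step

exp_by_sex <- aggregate(death_policies_exposed ~ sex, data = SOA_Basic_Data, sum)

round(prop.table(exp_by_sex$death_policies_exposed), 3)

0.434 0.566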

 

Sex is a classic example of a categorical data field. There are distinct classes (“Female” and “Male”) but no quantitative meaning associated with the classes – the classes are different, but we cannot order them from largest to smallest (or vice versa).

Within this dataset there are some categorical (data class = “factor”) data fields where the classes are ordinal (i.e., can be ordered from largest to smallest). For example, the salary data field:

# Find the levels of the “salary” data field from the SOA_Basic_Data data frame

 

levels(SOA_Basic_Data$salary)

"A. < 25" "B. 25-49" "C. 50-74" "D. 75-99" "E. 100-149" "F. 150-249"

"G. 250-499" "H. 500-749" "I. 750-999" "J. 1000-1999" "K. 2000+" "L. Unknown"

# The “salary” data field has twelve levels

 

The salary data has been grouped – so it is categorical. However, there is a natural order to these categories. For example, we know that “B. 25-49” represents higher salary levels than “A. < 25”. This creates options that we will need to explore when we are ready to model the data. For example:

• Leave the data representation as it is, with 12 distinct salary classes

• Maintain the data structure as categorical, but combine some of the classes to reduce dimensionality

• Convert the data to numeric and model as a continuous variable
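As a rough sketch of the first and third options: the natural ordering can be made explicit with as.ordered, and the bands can be mapped to assumed numeric midpoints. The midpoint values below (in $000s) are illustrative assumptions, not values from the study:

# Make the natural ordering of the salary bands explicit

salary_ordered <- as.ordered(SOA_Basic_Data$salary)

# Map each band to an assumed midpoint (in $000s); “L. Unknown” becomes NA

# The 2500 midpoint for the open-ended “K. 2000+” band is an arbitrary assumption

midpoints <- c(12.5, 37, 62, 87, 125, 200, 375, 625, 875, 1500, 2500, NA)

salary_numeric <- midpoints[as.integer(SOA_Basic_Data$salary)]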

Data Fields of “Integer” Class

This class is, of course, used for data fields where the values are all integers.

One example of “integer” class data fields in our dataset is central_age. Let’s look at the distribution of exposures (number of lives) for this data field to test if the results are consistent with expectations.

For central_age, our expectation for basic Group Life is that most exposures will have a central age within the prime working ages of 20-65. There will be some exposures at ages below 20. There should be a decline in exposures as employees begin to reach early retirement around age 55, and very few exposures beyond age 70. And, of course, there should be no exposures beyond the maximum human lifespan of about 115.
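Before tabulating the full distribution, that last expectation can be checked with a one-line sanity test using base R’s range function:

# Confirm the minimum and maximum central ages are plausible

range(SOA_Basic_Data$central_age)

17 97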

For central_age, let’s try a different approach than we applied for sex and salary – we can look at the proportion of exposures by central_age rather than the number of exposures.

# Find the number of exposures by “central_age” in the SOA_Basic_Data data frame

# Aggregate “death_policies_exposed”

# To get the “sum” by “central_age” data field from the SOA_Basic_Data data frame

# Assign this to a new data frame “exp_by_age”

 

exp_by_age <- aggregate(death_policies_exposed ~ central_age, data = SOA_Basic_Data, sum)

 

 

# Next, calculate the proportion of exposures by each central_age as

# exposure for each age divided by the sum of all exposures

# Save this to a data field called “prop”

 

exp_by_age$prop <- exp_by_age$death_policies_exposed / sum(exp_by_age$death_policies_exposed)

 

 

# To be easier on the eyes, recalculate “prop” data field by (i) multiplying it by 100 and

# (ii) rounding to one decimal place

 

exp_by_age$prop <- round(exp_by_age$prop * 100, 1)

 

 

# Look at the exp_by_age data frame

 

exp_by_age

central_age    death_policies_exposed    prop
         17                 98746.450     0.2
         22               1994296.569     4.5
         27               4654949.028    10.5
         32               5173976.054    11.6
         37               5079490.157    11.4
         42               5502073.334    12.4
         47               5763312.303    13.0
         52               5748444.639    12.9
         57               4794869.157    10.8
         62               3207853.515     7.2
         67               1433690.604     3.2
         72                499119.847     1.1
         77                238528.134     0.5
         82                145787.079     0.3
         87                 86315.608     0.2
         92                 33993.632     0.1
         97                  7300.553     0.0

 

Many people prefer to view a distribution in the form of a chart rather than as a table of numbers. In R we can use the barplot function to quickly create such a chart.

# Display the proportionate exposures by age in a barchart

# Plot the data field “prop” from the data frame “exp_by_age”

# Set the data labels to be “central_age” from the same data frame

# Set the color of the bars to be “red”

 

 

barplot(exp_by_age$prop, names.arg = exp_by_age$central_age, col = "red")

[Bar chart: proportion of exposures (%) by central_age]

The distribution of exposures by central_age is very consistent with expectations.

Data Fields of “Numeric” Class

This class is, of course, used for data fields where the values are all numeric – and not just integers.

There are some “numeric” data fields within our dataset. However, all of the “numeric” data fields relate to exposures and claims; none of the “numeric” data fields relate to risk factors. For illustrative purposes, we will continue to work with the central_age data field.

There isn’t much difference in our approach to validating “integer” versus “numeric” data, as both data classes represent quantitative values. The rationale for maintaining a separate integer class is likely to minimize memory use.
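That memory rationale is easy to verify: R stores integers in 4 bytes and doubles in 8, so a quick sketch with object.size shows roughly a two-fold difference:

# Compare the memory footprint of one million integers vs. one million doubles

object.size(integer(1e6))

4000048 bytes

object.size(numeric(1e6))

8000048 bytes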

In the example of validating integer data, we simply looked at the distribution of exposures by central_age. If there were a data field that represented actual age, especially if age was not rounded to a whole year, then a data table of age and exposures would be difficult to interpret. There would potentially be far too many rows to take in visually and the data would be sparsely distributed. Thus, for “numeric” data, it will often make sense to “bin” the data into groups. We can perform binning by using two functions:

• cut: Divides the range of a vector into intervals.

• seq: Generates a sequence of numbers with a specified start point, end point and step.
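To see what cut produces before applying it to our data, here is a tiny standalone example with made-up ages; note the half-open interval labels, which will reappear in the output below:

# cut assigns each value to a half-open interval (lower, upper]

cut(c(20, 37, 90), seq(15, 105, 10))

(15,25] (35,45] (85,95]

Levels: (15,25] (25,35] (35,45] (45,55] (55,65] (65,75] (75,85] (85,95] (95,105]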

 

# Create a new data field for the exp_by_age data frame to indicate age band

# Age bands will range from age 15 to 105 and have length of 10 years: seq(15,105,10)

 

exp_by_age$ageband <- cut(exp_by_age$central_age, seq(15,105,10))

 

 

# Next, apply “aggregate” function to summarize proportion of exposures (“prop”) by age band

 

aggregate(prop ~ ageband, data = exp_by_age, sum)

ageband     prop
(15,25]      4.7
(25,35]     22.1
(35,45]     23.8
(45,55]     25.9
(55,65]     18.0
(65,75]      4.3
(75,85]      0.8
(85,95]      0.3
(95,105]     0.0

 

 

# Save the above table to a new data frame and then create a chart

 

exp_by_ageband <- aggregate(prop ~ ageband, data = exp_by_age, sum)

 

barplot(exp_by_ageband$prop, names.arg = exp_by_ageband$ageband, col = "red")

[Bar chart: proportion of exposures (%) by age band]

The chart shows that, consistent with expectations, the vast majority of exposures relates to individuals in the prime working ages of 26-65.
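For a precise figure behind that statement, we can sum the proportions for the four age bands spanning ages 26 through 65 (a quick sketch):

# Sum the proportions for the (25,35] through (55,65] bands

sum(exp_by_ageband$prop[2:5])

89.8

# About 90% of exposures fall within ages 26-65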

Area and Industry Data

For this specific dataset, a thorough data validation exercise will reveal concerns with the area (zip code) and industry (SIC code) data.

We will review the distribution of exposures for these two risk factors. However, this blog post already covers a lot of ground, so we will save that topic for the next post.