Part 4: Validation of Industry and Area Data
Overview
The focus of this, the fourth part of our series on modeling Group Life Mortality Experience, is to continue the data validation of the Basic Life Death and Disability dataset that we started in the previous post.
Specifically, we will focus on the industry (SIC code) and area (zip code) data. The data validation exercise will uncover some concerns about the integrity of the data in these data fields.
Summary of Previous Posts
Before we move forward, here is a brief reminder of what we covered in the most recent post in this series.
Part 3: Data Validation
• Outlined a general approach to data validation
• Provided examples of validating data fields of different data classes, including:
o Factor
o Integer
o Numeric
• Indicated that, for this specific dataset, the data validation approach would reveal concerns for both Industry and Area data fields... and that these would be reviewed in the following post
General Approach to Data Validation – Quick Reminder
The general approach to data validation was outlined in the previous post.
As a reminder, our objective at this stage of the project is to validate the data quality. We’re not yet ready to begin analyzing mortality experience. We simply want to review each data field to ensure that (i) we understand the data within and (ii) the distribution of the data is generally consistent with our expectations.
Similar to previous posts, we will not require any scripting for this data exploration in R. We will continue to work interactively on the command line. By convention, I will show the commands that I enter in grey script and the output returned in blue script. The number sign (#) will be used to denote comments.
The previous post included a list of the data fields in our dataset. Three of these data fields relate to Industry:
• four_digit_sic_code (data class = “Factor”)
• three_digit_sic_code (data class = “Factor”)
• two_digit_sic_code (data class = “Integer”)
Notice that the data class differs across the three Industry-related data fields. For the 2-digit SIC codes the data class, as expected, is Integer. But for both the 3-digit and 4-digit SIC codes the data class is Factor. It is possible that the 3-digit and 4-digit SIC code data fields have integer-type data that has, for some reason, been stored in R as Factor data class. But it is also possible that there are some records for which these fields do not contain integers – or even numbers. Let’s begin our validation of Industry data with an attempt to convert the 4-digit SIC code data to a new data field with data class of Numeric.
The first two functions we will use in R will be:
• as.character: Will coerce the inputted data to a field where data class = “Character”.
• as.numeric: Will coerce the inputted data to a field where data class = “Numeric”.
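As an aside, the reason for the two-step conversion is worth a quick illustration. Applying as.numeric directly to a Factor returns the internal level codes rather than the underlying values – a classic R pitfall. Here is a minimal sketch on a toy factor (illustrative values, not the SOA data):

```r
# Toy factor, for illustration only (not the SOA data)
f <- factor(c("300", "100", "00UN"))

# Direct coercion returns the level codes (levels sort alphabetically: "00UN", "100", "300")
as.numeric(f)                # 3 2 1

# Two-step coercion recovers the values; non-numeric entries become NA (with a warning)
as.numeric(as.character(f))  # 300 100 NA
```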
# Create a new data field where four_digit_sic_code is converted to a "Numeric" data class
# This is achieved by first converting the data field from "Factor" to "Character" and then
# converting the data from "Character" to "Numeric"

SOA_Basic_Data$sic4 <- as.numeric(as.character(SOA_Basic_Data$four_digit_sic_code))

Warning message:
NAs introduced by coercion

# Some of the data entries could not be converted to numeric
# This likely indicates that some of the entries include non-numeric characters
We now know that the four_digit_sic_code data field includes values that could not be converted to Numeric. However, we do not know the contents of these data entries nor do we know the number of exposures and deaths related to these records. In order to get this information, we can use the is.na, summary and sum functions.
• is.na: The generic function is.na indicates which elements have a missing value indicator.
• summary: A generic function used to produce result summaries of the results of various model fitting functions.
• sum: Returns the sum of all the values present in its arguments.
# Find the non-numeric entries in the four_digit_sic_code data field
# Apply "is.na" function to the "sic4" data field
# Then summarize on the 4-digit SIC code
# The "maxsum" parameter indicates to display 4 values – the 3 most common and "Other"

summary(SOA_Basic_Data[is.na(SOA_Basic_Data$sic4),]$four_digit_sic_code, maxsum = 4)

   00UN    0000    0013 (Other)
    871       0       0       0

# All of the non-numeric entries in the four_digit_sic_code field have value of "00UN"
# And there are 871 such entries

# Find the number of death_policies_exposed and death_count where sic4 is NA

sum(SOA_Basic_Data[is.na(SOA_Basic_Data$sic4),]$death_policies_exposed)
845.5364

sum(SOA_Basic_Data[is.na(SOA_Basic_Data$sic4),]$death_count)
0
I was a little concerned about the presence of non-numeric data in the Industry data fields. However, we have now determined that these records make up an immaterial amount of our experience – only 845 exposures (out of 25 million total exposures) and 0 deaths.
Of course, looking at the non-numeric values is just the first step in validating the integrity of the Industry data. As a next step, let’s apply the summary function to the sic4 data field that we created (which contains 4-digit SIC codes as a numeric data class).
# Summarize the "sic4" data field from the SOA_Basic_Data data frame

summary(SOA_Basic_Data$sic4)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
      0    4911    6722    6220    8211    9999     871
The quartiles and averages are not meaningful because the SIC codes are just that – codes, not values. But there are two interesting observations:
• The minimum value is 0, meaning an SIC code of “0000”.
o There is no industry with SIC code of “0000” – this is most likely used for unknowns.
• The number of NA records is 871 – consistent with our earlier findings.
Let’s also summarize the 2-digit SIC code data field to find out if the records with 4-digit code of “00UN” have value of “0” or “NA” for the 2-digit SIC code.
# Summarize the "two_digit_sic_code" data field from the SOA_Basic_Data data frame

summary(SOA_Basic_Data$two_digit_sic_code)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   0.00   49.00   67.00   61.77   82.00   99.00
All entries in the two_digit_sic_code data field are, in fact, integers. There are no “NA” values.
Convert SIC Codes to Industry Class
The 2-digit SIC code field will contain, at most, 100 unique values. This is clear because the data class is “Integer” and the values range from 0 to 99.
We would like to look at the distribution of exposures by Industry. We could look at the distribution for the 100 or so unique values of 2-digit SIC codes. But it is probably more meaningful to consolidate these values to a smaller number of industry groupings.
There are a couple of different approaches that we could apply to map SIC codes to industry groupings. One option is to import a translation table – but we will save this approach for zip code analysis. The other option, which we will apply here, is to use “if” statements. In R, we do this using the ifelse function.
• ifelse: Returns a value with the same shape as the object being tested which is filled with elements selected from either yes or no depending on whether the element of the test is TRUE or FALSE.
• as.factor: Will coerce the inputted data to a field where data class = “Factor”.
Before we start working in R, here is the translation we will use:
2-Digit SIC Code       Industry Group
00                     Unknown
01-09                  Agriculture, Forestry and Fishing
10-17                  Mining and Construction
18-19                  Not in Use
20-39                  Manufacturing
40-49                  Transportation, Communication and Utilities
50-59                  Trade
60-67                  Finance, Insurance and Real Estate
68-69                  Not in Use
70-89, ex 80 and 82    Services
80                     Health Services
82                     Educational Services
90-99                  Public Administration
The use of “If” statements will be a little bit cumbersome, given that we are mapping to 12 unique Industry groupings. But part of the objective of this blog series is to provide some readers with exposure to the R environment – and so we will take this opportunity to gain experience with “If” statements.
# Map 2-digit SIC codes to a new data field for Industry Groupings
# Call this new data field "IndGrp"

SOA_Basic_Data$IndGrp <- as.factor(ifelse(SOA_Basic_Data$two_digit_sic_code == 0, "Unk",
                         ifelse(SOA_Basic_Data$two_digit_sic_code < 10, "Agri",
                         ifelse(SOA_Basic_Data$two_digit_sic_code < 18, "Mining",
                         ifelse(SOA_Basic_Data$two_digit_sic_code < 20, "NotUsed",
                         ifelse(SOA_Basic_Data$two_digit_sic_code < 40, "Manu",
                         ifelse(SOA_Basic_Data$two_digit_sic_code < 50, "Trans",
                         ifelse(SOA_Basic_Data$two_digit_sic_code < 60, "Trade",
                         ifelse(SOA_Basic_Data$two_digit_sic_code < 68, "Fin",
                         ifelse(SOA_Basic_Data$two_digit_sic_code < 70, "NotUsed",
                         ifelse(SOA_Basic_Data$two_digit_sic_code < 80, "Srvcs",
                         ifelse(SOA_Basic_Data$two_digit_sic_code == 80, "Health",
                         ifelse(SOA_Basic_Data$two_digit_sic_code == 81, "Srvcs",
                         ifelse(SOA_Basic_Data$two_digit_sic_code == 82, "Edu",
                         ifelse(SOA_Basic_Data$two_digit_sic_code < 90, "Srvcs",
                         "PubAd")))))))))))))))
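As an aside, readers who prefer a more compact idiom could achieve the same mapping with R's cut function, which bins a numeric vector by breakpoints. This is just a hedged alternative sketch (shown on a toy vector, not the dataset) – the nested ifelse version above is the approach we will actually use:

```r
# Toy vector of 2-digit SIC codes, for illustration only
sic2 <- c(0, 5, 15, 19, 30, 45, 55, 65, 69, 75, 80, 81, 82, 85, 95)

# cut() assigns each code to the interval it falls in; duplicated labels
# (e.g. "Srvcs") are merged into a single factor level
ind_grp <- cut(sic2,
               breaks = c(-1, 0, 9, 17, 19, 39, 49, 59, 67, 69, 79, 80, 81, 82, 89, 99),
               labels = c("Unk", "Agri", "Mining", "NotUsed", "Manu", "Trans", "Trade",
                          "Fin", "NotUsed", "Srvcs", "Health", "Srvcs", "Edu",
                          "Srvcs", "PubAd"))
```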
Next, we will view the distribution of exposures by industry as a table. A bar chart would have been a nice way to summarize the data visually; however, it would be difficult to squeeze in the data labels for all 12 categories.
# Display the proportionate exposures by IndGrp in a table

# First, get the exposure by industry
exp_by_ind <- aggregate(death_policies_exposed ~ IndGrp, data = SOA_Basic_Data, sum)

# Next, calculate the exposures as a proportion
exp_by_ind$prop <- round(exp_by_ind$death_policies_exposed /
                         sum(exp_by_ind$death_policies_exposed) * 100, 1)

# Finally, view the distribution of exposures by industry
exp_by_ind

    IndGrp death_policies_exposed prop
      Agri           2.738011e+05  0.6
       Edu           2.690616e+06  6.1
       Fin           3.609390e+06  8.1
    Health           4.238318e+06  9.5
      Manu           6.619285e+06 14.9
    Mining           1.952551e+06  4.4
   NotUsed           7.093636e+02  0.0
     PubAd           2.229721e+06  5.0
     Srvcs           1.154338e+07 26.0
     Trade           6.130366e+06 13.8
     Trans           1.874369e+06  4.2
       Unk           3.300237e+06  7.4
I didn’t have any strong a priori assumptions about the distribution of exposures by industry. On the whole, it seems quite plausible to me. My only concern is that 7.4% of exposures relate to lives where the industry is Unknown. That is higher than I would have liked. More on this later…
Area Data Fields
The previous post included a list of the data fields in our dataset. Three of these data fields relate to Area:
• five_digit_zip_code (data class = “Factor”)
• four_digit_zip_code (data class = “Factor”)
• three_digit_zip_code (data class = “Factor”)
Two quick observations about the Area-related data fields. First, it is only zip code data – no state or region indicator. Second, all three data fields are of data class “Factor” – likely indicating that there are some non-numeric data entries.
For data validation, we will first want to learn more about the non-numeric entries in the zip code data fields. Then, we will want to consolidate the zip code data to something higher-level to facilitate a look at the distribution of exposures. A carrier could use its own area groupings. In a generic example like this one, consolidating zip code to state would make good sense. However, to simplify the visuals, I will instead consolidate from zip code to region (as defined in Appendix II of the SOA’s Group Life Mortality Report).
We will begin working with the Area data fields in a similar manner to how we worked with the Industry data fields. We will convert the 3-digit zip codes to numeric and then identify those values that could not be converted.
# Create a new data field where three_digit_zip_code is converted to a "Numeric" data class
# This is achieved by first converting the data field from "Factor" to "Character" and then
# converting the data from "Character" to "Numeric"

SOA_Basic_Data$zip3 <- as.numeric(as.character(SOA_Basic_Data$three_digit_zip_code))

Warning message:
NAs introduced by coercion

# Some of the data entries could not be converted to numeric
# This likely indicates that some of the entries include non-numeric characters

# Summarize the "zip3" data field from the SOA_Basic_Data data frame

summary(SOA_Basic_Data$zip3)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    0.0   235.0   480.0   491.4   752.0     999    5588

# Find number of exposures for entries where 3-digit zip is non-numeric
sum(SOA_Basic_Data[is.na(SOA_Basic_Data$zip3),]$death_policies_exposed)
17607.67

# Find the non-numeric values in the 3-digit zip code data field

summary(SOA_Basic_Data[is.na(SOA_Basic_Data$zip3),]$three_digit_zip_code, maxsum = 40)

 L4B  R3B  M1S  L4M  M9C  V6X  T2P  M2N  T4B  M2K  K2K  V6P  H3B
 979  528  404  351  330  324  320  228  209  201  157  138  119
 V6B  R3Y  M1H  R3C  K6H  L9T  V7V  L8L  M4S  T6E  T2H  S4P  H4P
 113  108  101   99   98   96   94   79   59   55   54   48   41
 V8V  J2G  M1K  R2J  H3Z  G0R  N1G  N2H  H4S  T2C  L7N  M5J  000 (Other)
  40   34   30   30   25   24   19   16   12   12    9    4    0       0
From the above analysis, we learn the following about the non-numeric data in the 3-digit zip code field:
• These cases have exposures of about 17,600. This is quite small relative to our total exposures of 25 million.
• All of these cases appear to relate to exposures in Canada. Canadian postal codes are of the form “A1A 1A1” – i.e., 6 characters, alternating between letters and digits. Thus, the first three characters of a Canadian postal code have the form “A1A”.
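One way to confirm this pattern programmatically would be a regular-expression check for the Canadian letter-digit-letter prefix. A minimal sketch on a few illustrative values (not the full data field):

```r
# Illustrative 3-character prefixes: three Canadian-style, two US-style
prefixes <- c("L4B", "R3B", "000", "M1S", "902")

# TRUE where the prefix matches the Canadian letter-digit-letter pattern
grepl("^[A-Za-z][0-9][A-Za-z]$", prefixes)  # TRUE TRUE FALSE TRUE FALSE
```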
Our next step is to translate the zip code data to a higher-level area grouping, such as state or region. As previously mentioned, for this exercise, I will translate to the regions as defined in the SOA report.
My first step is to create a translation table from 5-digit zip codes to region. I will create this as a csv file using Excel. The table will contain nearly 100,000 entries so I will not publish it within this blog post. Feel free to contact me if you would like a copy of the file.
We will read the translation table into R. Then we will use the merge function to append the appropriate region category to each record in the SOA_Basic_Data data frame.
• merge: Merge two data frames by common columns or row names, or do other versions of database join operations.
# First, read in the zip code to region translation table
zip_region <- read.csv("C:\\SOA 2016 GL Exp\\zip_region.csv")

# Next, take a quick look at column names and data classes
sapply(zip_region, class)

five_digit_zip_code               State              Region
           "factor"            "factor"            "factor"

# Append the "Region" to the SOA_Basic_Data data frame
## Use "merge" function to merge two data frames:
## data frame "x" is SOA_Basic_Data, data frame "y" is zip_region
## Merge using the "five_digit_zip_code" field from both data frames
## Note the last argument in the function: "all.x = TRUE" – this ensures that all rows from
## "SOA_Basic_Data" will remain in the new data frame – even if there is no match of 5-digit zip
## codes to the "zip_region" data frame

SOA_Basic_Data <- merge(x = SOA_Basic_Data, y = zip_region, by = "five_digit_zip_code", all.x = TRUE)
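To see exactly what all.x = TRUE does, here is a minimal sketch on two toy data frames (illustrative values only, not the SOA data):

```r
# Toy data frames, for illustration only
left  <- data.frame(five_digit_zip_code = c("10001", "99999"), exposure = c(10, 20))
right <- data.frame(five_digit_zip_code = "10001", Region = "MidAtl")

# all.x = TRUE keeps every row of "left", filling Region with NA where there is no match
m <- merge(x = left, y = right, by = "five_digit_zip_code", all.x = TRUE)
```

Any record whose 5-digit zip is absent from the translation table survives the merge with Region = NA, rather than being silently dropped.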
Now let’s get the distribution of exposures by region:
# Calculate the distribution of exposures by region

# First, aggregate exposures by region
exp_by_reg <- aggregate(death_policies_exposed ~ Region, data = SOA_Basic_Data, sum)

# Calculate the exposures by region as a proportion
exp_by_reg$prop <- round(exp_by_reg$death_policies_exposed /
                         sum(exp_by_reg$death_policies_exposed) * 100, 1)

# View the distribution of exposures by region
exp_by_reg

     Region death_policies_exposed prop
        ENC             8224359.97 18.5
        ESC             2428963.83  5.5
     MidAtl             5181649.64 11.7
   Mountain             2191707.68  4.9
     NewEng             2557672.01  5.8
      Other               28155.48  0.1
    Pacific             5388243.70 12.1
     SthAtl             6857990.72 15.4
    Unknown             2815669.90  6.3
        WNC             3935781.53  8.9
        WSC             4852552.20 10.9

# Or view the distribution of exposures in chart form
barplot(exp_by_reg$prop, names.arg = exp_by_reg$Region, col = "red")

[Bar chart: proportionate exposures by region]
The distribution of exposures is reasonably consistent with general population numbers. East North Central (“ENC” includes IL, IN, MI, OH, WI) stands out as being a few percentage points higher than expected, but within the realm of reasonability. An insurer working with its own data would have the ability to review and validate these distributions against internal reports. One challenge of working with aggregated industry data is that our ability to validate against other information sources is limited.
Similar to the distribution of exposures by Industry, there are a lot of exposures where the Region is unknown (6.3%). At this point, it makes sense to look in greater detail at the records missing Industry and Region information.
Industry and Area Data
Thus far we have uncovered that Industry is unknown for 7.4% of exposures and Area is unknown for 6.3% of exposures. The numbers are quite similar and so I wonder if there is a lot of overlap – are the exposures that are missing Industry information the same as those that are missing Area information?
Let’s create a data field that indicates which of Industry and Area data are known / unknown. Then we can aggregate by this data field to determine the proportion of exposures where both Industry and Area are unknown.
# Create a data field to indicate known/unknown status of Industry and Area

SOA_Basic_Data$IndArea <- as.factor(ifelse(SOA_Basic_Data$IndGrp == "Unk" &
                                           SOA_Basic_Data$Region == "Unknown", "Both Unk",
                                    ifelse(SOA_Basic_Data$IndGrp == "Unk", "Ind Unk",
                                    ifelse(SOA_Basic_Data$Region == "Unknown",
                                           "Area Unk", "Both Known"))))

# Aggregate exposures by this new data field

exp_by_indarea <- aggregate(death_policies_exposed ~ IndArea, data = SOA_Basic_Data, sum)

exp_by_indarea$prop <- round(exp_by_indarea$death_policies_exposed /
                             sum(exp_by_indarea$death_policies_exposed) * 100, 1)

# View the distribution of exposures by Industry/Area known/unknown indicator
exp_by_indarea

      IndArea death_policies_exposed prop
     Area Unk               7342.843  0.0
   Both Known           41155166.460 92.6
     Both Unk            2808327.059  6.3
      Ind Unk             491910.301  1.1
For nearly all the exposures where the Area is unknown, the Industry is also unknown – this is 6.3% of total exposures for which we are missing both Industry and Area information. This is big enough that we should do further investigation into these exposures.
If this data were our own company’s internal data then we would likely begin our investigation at the data source and try to learn why these data fields are not entered for so many exposures. But, as we are studying industry data, we do not have this luxury. Instead, we will focus on the data – and look at these exposures in more detail.
Let’s begin by comparing the age and gender distributions of the exposures where Industry and Area are unknown to the age and gender distributions for the rest of the exposures. First, we will introduce a new R function.
• tapply: Apply a function to each cell of a ragged array, that is to each (non-empty) group of values given by a unique combination of the levels of certain factors.
# Calculate the distribution of exposures by gender
## Separately for exposures where Industry/Area are unknown vs all other exposures

# First, aggregate exposures by Industry/Area indicator, gender and age
exp_by_mult <- aggregate(death_policies_exposed ~ IndArea + sex + central_age,
                         data = SOA_Basic_Data, sum)

# Demonstrate the "tapply" function
## Determine total exposures by gender for all cases
### Use "tapply" function on "death_policies_exposed" by "sex" and get the "sum"
tapply(exp_by_mult$death_policies_exposed, exp_by_mult$sex, sum)

  Female     Male
19311789 25150958

# Let's repeat, but only including cases where Industry/Area are both unknown
## Ie, rows where: exp_by_mult[exp_by_mult$IndArea == "Both Unk",]

tapply(exp_by_mult[exp_by_mult$IndArea == "Both Unk",]$death_policies_exposed,
       exp_by_mult[exp_by_mult$IndArea == "Both Unk",]$sex, sum)

   Female      Male
 900169.6 1908157.5

# Let's repeat, but (i) dividing each value by the sum to get a proportion, (ii) multiplying by 100
## and rounding to one decimal point to simplify viewing and (iii) saving to a variable

both_unk_by_sex <- round(100 *
  tapply(exp_by_mult[exp_by_mult$IndArea == "Both Unk",]$death_policies_exposed,
         exp_by_mult[exp_by_mult$IndArea == "Both Unk",]$sex, sum) /
  sum(exp_by_mult[exp_by_mult$IndArea == "Both Unk",]$death_policies_exposed), 1)

# View Results
both_unk_by_sex

Female   Male
  32.1   67.9
# Let's repeat, but for cases where at least one of Industry or Area is known

other_by_sex <- round(100 *
  tapply(exp_by_mult[exp_by_mult$IndArea != "Both Unk",]$death_policies_exposed,
         exp_by_mult[exp_by_mult$IndArea != "Both Unk",]$sex, sum) /
  sum(exp_by_mult[exp_by_mult$IndArea != "Both Unk",]$death_policies_exposed), 1)

# View Results
other_by_sex

Female   Male
  44.2   55.8

# Now let's combine the output into one table – to view "unknown" and "other" side-by-side

# First, create an empty data frame of 2x2 dimension
df_sex <- data.frame(matrix(ncol = 2, nrow = 2))

# Assign the values of "Unknowns" to the first column and "Others" to the second column

df_sex[,1] <- both_unk_by_sex
df_sex[,2] <- other_by_sex

# Finally, add row and column names

row.names(df_sex) <- c("Female", "Male")
colnames(df_sex) <- c("Both Unk", "Other")

# Let's view the results
df_sex

       Both Unk Other
Female     32.1  44.2
Male       67.9  55.8
When we compare cases where both Industry and Area are unknown to the Other cases, there are some differences in the distribution of exposures by gender. However, these differences aren’t large enough to raise immediate concerns.
Next, we can repeat this exercise to get the distributions of exposures by age (rather than by gender). I won’t provide the R code as it is the same as above – just substituting “central_age” for “sex”. Here are the results:
Central Age   Both Unk   Other
17                 0.1     0.2
22                 1.8     4.7
27                 5.6    10.8
32                 8.2    11.9
37                 9.1    11.6
42                10.5    12.5
47                11.7    13.0
52                12.1    13.0
57                11.4    10.7
62                 9.4     7.1
67                 9.9     2.8
72                 3.3     1.0
77                 2.5     0.4
82                 2.1     0.2
87                 1.6     0.1
92                 0.6     0.0
97                 0.1     0.0
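For readers who want to reproduce the substitution themselves, here is a hedged sketch of the by-age version on a toy stand-in for exp_by_mult (the values below are illustrative, not from the dataset):

```r
# Toy stand-in for exp_by_mult, for illustration only
exp_by_mult <- data.frame(
  IndArea                = c("Both Unk", "Both Unk", "Other", "Other"),
  central_age            = c(42, 67, 42, 67),
  death_policies_exposed = c(300, 100, 900, 100)
)

# Same tapply pattern as before, with "central_age" substituted for "sex"
both_unk_by_age <- round(100 *
  tapply(exp_by_mult[exp_by_mult$IndArea == "Both Unk",]$death_policies_exposed,
         exp_by_mult[exp_by_mult$IndArea == "Both Unk",]$central_age, sum) /
  sum(exp_by_mult[exp_by_mult$IndArea == "Both Unk",]$death_policies_exposed), 1)
```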
There is a significant difference in the distribution of exposure by age for these two subsets of the data. In particular, focus on central age of 67 or greater (i.e., at or beyond normal retirement age). For the cases where both Industry and Area are unknown, 20.1% of exposures are at these ages versus only 4.5% of the cases where at least one of Industry or Area are known.
Take a moment to reflect on these numbers and consider the context. Ask yourself what proportion of Group Life exposures would you expect to be aged 65 or greater. I think a good estimate is somewhere around 5% - and so I am comfortable with the data for the cases where at least one of Industry and Area are known (as the observed proportion is 4.5%). For the cases where both Industry and Area are unknown, the observed proportion is 20.1% - much higher than my expectations and so I have concerns about the quality of this data.
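The 20.1% and 4.5% figures quoted above can be reproduced by summing the table's proportions at central ages 67 and greater:

```r
# Proportions at central ages 67+ taken from the table above
both_unk_67plus <- sum(c(9.9, 3.3, 2.5, 2.1, 1.6, 0.6, 0.1))  # 20.1
other_67plus    <- sum(c(2.8, 1.0, 0.4, 0.2, 0.1, 0.0, 0.0))  # 4.5
```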
Claims Experience for Exposures Missing Industry and Area Data
Thus far, we have been focused on risk characteristics of the exposures and we have not given any consideration to the outcome (mortality rates) we will be studying. This is a deliberate decision which will be explained in more detail in a future post. But, at a high-level, the explanation is that we will want to put aside some of the data – before we begin looking at the outcomes – so that we will have out-of-sample data to use to validate our final model.
But in our current situation – where we have serious concerns about data quality – it really is necessary to peek at the outcomes (mortality rates). There are two fields we will look at:
• death_count: the number of claims
• expected_death_by_policy: the expected number of claims, as determined by the study’s authors, based on the observed mortality rate by age and gender
Specifically, we are interested to see the (i) proportion of total deaths that are related to exposures where both the Industry and Area data are missing and (ii) the Actual/Expected ratios by known status of Industry and Area data.
# Determine # death claims for cases where Industry and Area are unknown

aggregate(death_count ~ IndArea, data = SOA_Basic_Data, sum)

      IndArea death_count
     Area Unk          15
   Both Known       67226
     Both Unk       27894
      Ind Unk        1754

## 27,894 death claims relate to exposures where Industry and Area are both unknown
## What is this as a proportion of total death claims?

sum(SOA_Basic_Data[SOA_Basic_Data$IndArea == "Both Unk",]$death_count) /
  sum(SOA_Basic_Data$death_count)
0.2878965

# What are the Actual and Expected death counts by Area/Industry indicator?

aggregate(cbind(death_count, expected_death_by_policy) ~ IndArea, data = SOA_Basic_Data, sum)

      IndArea death_count expected_death_by_policy
     Area Unk          15                 10.76177
   Both Known       67226              74010.58226
     Both Unk       27894              21021.78983
      Ind Unk        1754               1845.86614

# What are the Actual/Expected ratios by Area/Industry indicator?
## First, where Industry and Area are both unknown

sum(SOA_Basic_Data[SOA_Basic_Data$IndArea == "Both Unk",]$death_count) /
  sum(SOA_Basic_Data[SOA_Basic_Data$IndArea == "Both Unk",]$expected_death_by_policy)
1.326909

## Next, all other cases

sum(SOA_Basic_Data[SOA_Basic_Data$IndArea != "Both Unk",]$death_count) /
  sum(SOA_Basic_Data[SOA_Basic_Data$IndArea != "Both Unk",]$expected_death_by_policy)
0.9094179
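As a sanity check, the three ratios above can be reproduced directly from the aggregate output (values copied from the table above):

```r
# Death counts and expected deaths by IndArea, from the aggregate output above
actual   <- c(AreaUnk = 15, BothKnown = 67226, BothUnk = 27894, IndUnk = 1754)
expected <- c(AreaUnk = 10.76177, BothKnown = 74010.58226,
              BothUnk = 21021.78983, IndUnk = 1845.86614)

share_of_deaths <- actual[["BothUnk"]] / sum(actual)            # ~0.288 of all deaths
ae_both_unk     <- actual[["BothUnk"]] / expected[["BothUnk"]]  # A/E ~1.327
ae_other        <- sum(actual[-3]) / sum(expected[-3])          # A/E ~0.909, all other cases
```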
The cases where neither Industry nor Area is known account for 6.3% of lives exposed but 28.8% of deaths. This discrepancy is partly explained by the very high proportion of these exposures relating to individuals aged 65 or greater (20.1%) and partly explained by poor mortality experience (Actual/Expected ratio of 132.7%).
Again, one challenge of working with data collected from multiple carriers is that our ability to further investigate the data is limited. In this specific instance, I suspect that our data may have been tainted with some Retiree life insurance data – as this would explain both the high proportion of exposures at or beyond age 65 and the high Actual/Expected ratios observed for cases where both Industry and Area data are missing.
I was able to present this data to the SOA and they were able to confirm that nearly all of the death counts for exposures missing both Industry and Area (27,820 of 27,894) were submitted by one carrier. My decision will thus be to exclude these exposures from the analysis.
Final Thoughts on Data Validation
Data validation is critical to any analytics exercise. And so, we spend a lot of time and effort reviewing the data and ensuring its integrity prior to commencement of our modeling. But it is important to remember that data validation is not a one-time exercise. We need to maintain our vigilance in questioning data integrity throughout the modeling process. As we begin to analyze the data, if we see results that are inconsistent with our expectations, we need to make an extra effort to validate the accuracy of the results – and that often begins with another review of the integrity of the underlying data.