Blog

Modeling Group Life Mortality Experience: A How-To Guide - Part 2: Acquiring the Data

By Jonathan Polon on 3/2/2018

Part 2: Acquiring the Data

Overview

The focus of this, the second part of our series on modeling Group Life Mortality Experience, is acquiring the dataset that will be the foundation of our model. In this post we will also take a very quick first peek at the dataset, but we will hold off on a more thorough exploration until the next post. As a reminder, we will be working with industry data that was provided to the Society of Actuaries (by several contributing Group Life carriers) for its 2016 Group Term Life Experience Study.

In fact, the SOA has made two datasets available:
• Basic Life Death and Disability
• Supplemental Life Death and Disability

This blog series will work with the Basic Life dataset. Both datasets, as well as the report and pivot tables, can be found here:

https://www.soa.org/experience-studies/2016/2016-group-life-mortality-study/

Acknowledgements 

Before we move forward, I would like to pause briefly to thank the Society of Actuaries – especially the research staff and the Group Life Insurance Experience Committee – for making the underlying data from the 2016 Group Term Life Experience Study available to the industry, the actuarial profession and the general public.
This dataset can be used by Group Life carriers to gain a better understanding of Group Life mortality experience than would be possible using only the carrier’s own experience. This dataset also provides an excellent opportunity for people studying actuarial science or data science to further their knowledge by working with real-world data.
I hope that, in the future, the Society of Actuaries will continue to make available the underlying data from its experience studies.

First Look at the Data

You will recall that in Part 1 of this series I stated that we would, as much as possible, perform the data analysis in R. Thus, we will begin by reading the data into… Excel?

This is an optional step, and we won’t perform any analysis of the data in Excel. I simply like the interface provided by Excel – the opportunity to scroll through the rows and columns of the data table and see a few of the individual records.

The dataset is quite large – over 4 GB. Although we do not yet know the table dimensions (number of rows and columns), a file of that size is almost certainly too large to be read into Excel in its entirety (an Excel worksheet holds at most 1,048,576 rows). So, instead, we will simply read in the first 10,000 records. This can be accomplished using the VBA script provided below.

I won’t comment on the findings from this first look at the data; we will do all of the exploration in R. But some modelers will find that having a few records in Excel provides a useful reference point over the course of the analysis and modeling.

Excel VBA Script to Read in First 10,000 Records

Sub read_part_file()

    ' Path to the csv file downloaded from the SOA website
    Dim strFileName As String: strFileName = "C:\SOA 2016 GL Exp\Group-Life-2010-13-Basic-Life-Death-and-Dis-AE-Ratios.csv"

    Dim TextLine As String
    Dim Count As Long
    Dim iFile As Integer: iFile = FreeFile

    Open strFileName For Input As #iFile

    Application.ScreenUpdating = False

    ' Read the header row plus the first 10,000 records,
    ' writing each raw csv line into column A of Sheet1
    Count = 1
    Do While Count < 10002
        Line Input #iFile, TextLine
        Sheets("Sheet1").Cells(Count, 1).Value = TextLine
        Count = Count + 1
    Loop

    Close #iFile

    Application.ScreenUpdating = True

End Sub
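
To run the macro, open the VBA editor (Alt+F11), insert a module into a workbook containing a sheet named Sheet1, paste the script, update strFileName to wherever you saved the file, and run read_part_file. Each csv line is written into a single cell; Excel’s Text to Columns feature will split the fields into proper columns if desired.

If you would rather stay in R even for this preview step, the nrows argument of read.csv offers a similar partial read. A minimal sketch, using the same file path as the rest of this post:

# Read only the header row and the first 10,000 records
preview <- read.csv("C:\\SOA 2016 GL Exp\\Group-Life-2010-13-Basic-Life-Death-and-Dis-AE-Ratios.csv", nrows = 10000)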


Read Data into R and Take a Quick First Look

We will not require any scripting for this initial step in R; we will simply work interactively on the command line. By convention, I will show the commands that I enter in grey text and the output returned in blue text. The number sign (#) will be used to denote comments.

Read the data into R

The first functions we will use in R are:

•   memory.limit: To check how much memory is allocated to R and, via its size argument, to increase that allocation so it can hold the very large dataset we will be working with

•   read.csv: To read the data file (which is in csv format) into the R environment

•   dim: To get the dimensions of the dataset

•   colnames: To get the names of the columns

Note the following:

•   R is case-sensitive

•   When entering directory paths, subdirectories are separated by a double backslash (\\) rather than the single backslash (\) used in Excel and many other Windows applications. (A single forward slash (/) also works.)

•   The help operator in R is a question mark (?). For example, to get help on the read.csv function, go to the command line and type: ?read.csv
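
A quick illustration of the last two points (the directory is the one used throughout this post; data_dir is just an illustrative variable name):

# Double backslashes in a Windows directory path
# (a single forward slash also works: "C:/SOA 2016 GL Exp")
data_dir <- "C:\\SOA 2016 GL Exp"

# Pull up the help page for the read.csv function
?read.csv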

Now, let’s load our dataset into R.

# Check the amount of memory allocated to R, in MB

memory.limit()
8071

# Increase the memory allocation to 16000 MB to accommodate the size of the dataset

memory.limit(size = 16000)
16000

# Read in the dataset from the csv file and assign it to a data frame called "SOA_Basic_Data"

SOA_Basic_Data <- read.csv("C:\\SOA 2016 GL Exp\\Group-Life-2010-13-Basic-Life-Death-and-Dis-AE-Ratios.csv")

# Find the dimensions of the SOA_Basic_Data data frame

dim(SOA_Basic_Data)
[1] 21848920       25

# The dataset has 21,848,920 records and 25 data fields (columns)

# Find the column names of the SOA_Basic_Data data frame

colnames(SOA_Basic_Data)

[1] "exposure_coverage_type" "face_amount_band" "salary”

[4] "group_size" "sex" "central_age"

[7] "observation_year" "five_digit_zip_code" "four_digit_zip_code"

[10] "three_digit_zip_code" "four_digit_sic_code" "three_digit_sic_code"

[13] "two_digit_sic_code" "death_policies_exposed" "death_amount_exposed"

[16] "death_count" "death_claim_amount" "expected_death_by_policy"

[19] "expected_death_by_amount" "disability_amount_exposed" "disability_claim_amount"

[22] "expected_disability_by_amount" "disability_policies_exposed" "disability_count"

[25] "expected_disability_by_policy"


Our dataset has now been read into a data frame in R. The dim function tells us that the dataset contains 21,848,920 records and 25 columns, and the colnames function gives us the name of each column in the dataset.
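
Since memory was a concern above, one optional sanity check is to ask R how much space the data frame itself occupies (the exact figure will depend on your session, so no output is shown here):

# Report the memory footprint of the data frame, in GB
format(object.size(SOA_Basic_Data), units = "Gb")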

Two other functions that may be considered at this point in the data exploration are:

•   head(SOA_Basic_Data)

    o   This will output the first few rows of the data frame for viewing.

•   sapply(SOA_Basic_Data, class)

    o   This nested function call applies the class function to each column (data field) and outputs each column’s class for viewing.

    o   In this data frame there are three data classes present:

        §   Factor (categorical data)

        §   Integer

        §   Numeric
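
As a quick sketch of both calls, here is what they look like on a small, made-up data frame containing the same three data classes (the column names echo fields from the SOA dataset, but the values are invented for illustration):

# A toy data frame with one column of each class; values are illustrative only
toy_data <- data.frame(
  sex                = factor(c("F", "M", "F")),  # Factor (categorical data)
  death_count        = c(0L, 1L, 0L),             # Integer
  death_claim_amount = c(0, 50000, 0)             # Numeric
)

head(toy_data)            # view the first few rows
sapply(toy_data, class)   # the class of each column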

Additionally, now may be a good time to look at the 10,000 records of the dataset that we have already read into Excel.

Preview of the Next Post in this Series

In the next post we will explore the dataset in more detail.

For example, we will look at the distribution of records, exposures and claims for a few of the data fields in order to (i) better understand the fields within the dataset and (ii) validate the quality of the data.