|Problem:||Data collectors for a survey about worker benefits receive PDF versions of standardized health insurance benefit pamphlets from employers and have to copy the information to an electronic instrument manually.|
|Solution:||Use machine learning to automate the extraction and coding of data from the PDF documents.|
|Skills Used:||Natural Language Processing; Named Entity Recognition; Machine Learning; Decision Trees; PDF Data Extraction|
|Languages Used:||Python; Visual Basic for Applications|
Data collectors for the National Compensation Survey, known as Field Economists, ask businesses to report not only the wages they offer their employees but also the benefits. One of those benefits, healthcare, involves many fine details, such as the deductible and how much an employee can expect to pay out of pocket. Occasionally, rather than report these details, the person responding for the business (often an HR or accounting professional) gives the Field Economist a pamphlet or booklet known as a Summary Plan Description. The Field Economist then has to read through the booklet and extract the information we need for the survey. This can be a time-consuming and error-prone process.
Then, in 2014, the Patient Protection and Affordable Care Act (ACA) came into effect. Among the many changes that the ACA ushered in was a requirement that every health insurance plan come with a simplified, standardized pamphlet describing what the plan offers in terms of coverage. These pamphlets are known as Summary of Benefits and Coverage (SBC) documents. SBCs range from 5 to 10 pages and have a table format like the example below (see a full PDF example). The questions listed in the left-hand margin are required, though their wording can vary a bit. The contents of the columns also follow general formatting guidelines, but they vary as well.
Over the last several years, Field Economists have started to receive SBCs along with the more complicated Summary Plan Descriptions. They and national office staff began to realize that this semi-standardized format offers us an opportunity.
Given the standardized nature of the form, we wanted to see if we could automate the process of data extraction. That is, our Field Economists could get the SBC document from an employer (ideally as a native PDF rather than a scan) and submit it through the regular data collection process. We could then feed that document through programs that extract the pieces of interest and code them into data tables.
The Work In Progress
This started a multi-phase research effort to test the feasibility of accurately extracting and coding SBC information. I joined the project after some initial work had been done to establish a rule-based system for parsing statements about out-of-pocket maximum payments like the one below. This early phase reached accuracy rates between 80% and 85% for coding values related to in-network individual and family coverage. No attempt was made to code out-of-network values.
- For network providers $7,900 individual / $15,800 family. For out-of-network providers $15,800 individual / $31,600 family.
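A rule-based parser for statements of this shape might look something like the sketch below. The regular expression and the function name `parse_oop_max` are illustrative assumptions, not the rules the project actually used, and real SBC wording varies far more than this single pattern covers.

```python
import re

# Hypothetical pattern: "For <network> providers $X individual / $Y family".
# This is an assumption for illustration, not the project's actual rule set.
DOLLAR = r"\$([\d,]+)"
SEGMENT = re.compile(
    r"\bfor\s+(out-of-network|in-network|network)\s+providers\s+"
    + DOLLAR + r"\s+individual\s*/\s*" + DOLLAR + r"\s+family",
    re.IGNORECASE,
)


def parse_oop_max(statement):
    """Return {(network, coverage): dollars} for each amount the rule matches."""
    result = {}
    for network, individual, family in SEGMENT.findall(statement):
        # Treat "network" and "in-network" as the same thing.
        key = "out-of-network" if network.lower().startswith("out") else "in-network"
        result[(key, "individual")] = int(individual.replace(",", ""))
        result[(key, "family")] = int(family.replace(",", ""))
    return result


example = (
    "For network providers $7,900 individual / $15,800 family. "
    "For out-of-network providers $15,800 individual / $31,600 family."
)
parsed = parse_oop_max(example)
```

Rules like this are easy to read but brittle: a statement phrased differently simply falls through and yields an empty result.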
I started from scratch when I joined the project. I wrote new programs for extracting the statements from the PDFs and rewrote most of the code related to the rule-based system. I managed to increase accuracy to 85-91%, though still only for in-network individual and family coverage. While I was working on this, I discovered a treasure trove of SBC documents in the healthcare.gov Plan Attributes Public Use File. Just as important as access to 12,000 SBCs (before this, we had only 500 SBCs of questionable scan quality) was the fact that healthcare.gov also provides tabular data for many of the elements we were interested in extracting.
With this new set of data, we began to explore machine learning methods for labeling the dollar values we were extracting from these statements as belonging to one of four classes (In-Network Individual, In-Network Family, Out-of-Network Individual, and Out-of-Network Family). We used scikit-learn in Python to train a decision tree model, using the words appearing before and after each dollar value and several other values as features. This approach was complicated by the fact that some statements simply say “$0” to mean that the out-of-pocket maximum or deductible is $0 for all four classes. A single dollar value can therefore have multiple labels.
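The general idea can be sketched roughly as follows. This is a minimal illustration, not the project's code: the window size, the `position` feature, the label names, and the helper functions are all assumptions, and the two training statements are toy data. Multiple labels per dollar value are handled by giving the tree a four-column indicator target, which scikit-learn's `DecisionTreeClassifier` accepts as a multi-output problem.

```python
import re
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical short names for the four classes.
LABELS = ["in_ind", "in_fam", "out_ind", "out_fam"]
DOLLAR = re.compile(r"\$[\d,]*\d")


def context_features(statement, window=3):
    """Build one feature dict per dollar value from the words around it."""
    tokens = [t.strip(".,;:()") for t in statement.lower().replace("/", " ").split()]
    tokens = [t for t in tokens if t]
    rows = []
    for i, tok in enumerate(tokens):
        if not DOLLAR.fullmatch(tok):
            continue
        # Assumed extra feature: how many dollar values came before this one.
        feats = {"position": len(rows)}
        for w in tokens[max(0, i - window):i]:
            feats["before=" + ("<amt>" if DOLLAR.fullmatch(w) else w)] = 1
        for w in tokens[i + 1:i + 1 + window]:
            feats["after=" + ("<amt>" if DOLLAR.fullmatch(w) else w)] = 1
        rows.append(feats)
    return rows


def train(statements, labels_per_statement):
    """Fit a multi-output tree; each dollar value gets a 0/1 flag per class."""
    rows, targets = [], []
    for text, label_sets in zip(statements, labels_per_statement):
        rows.extend(context_features(text))
        targets.extend([[int(l in s) for l in LABELS] for s in label_sets])
    vec = DictVectorizer(sparse=False)
    tree = DecisionTreeClassifier(random_state=0).fit(vec.fit_transform(rows), targets)
    return vec, tree


def label_statement(vec, tree, statement):
    """Return the list of predicted class names for each dollar value."""
    feats = context_features(statement)
    if not feats:
        return []
    predictions = tree.predict(vec.transform(feats))
    return [[l for l, flag in zip(LABELS, row) if flag] for row in predictions]


# Toy training data: one statement with four amounts, one bare "$0".
train_statements = [
    "For network providers $7,900 individual / $15,800 family. "
    "For out-of-network providers $15,800 individual / $31,600 family.",
    "$0",
]
train_labels = [
    [{"in_ind"}, {"in_fam"}, {"out_ind"}, {"out_fam"}],
    [set(LABELS)],
]
vec, tree = train(train_statements, train_labels)
print(label_statement(vec, tree, "$0"))
```

With only two training statements the tree merely memorizes them; the point is the shape of the features and the multi-label target, not the accuracy.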
With all of this figured in, we obtained accuracy rates above 99% on an out-of-sample test set from the healthcare.gov SBCs. Unfortunately, when we applied this same model, trained on the healthcare.gov SBCs, to the SBCs collected through the National Compensation Survey (NCS), accuracy dropped significantly, to below 90%. We attribute this to the model overfitting to the language of the healthcare.gov SBCs. The healthcare.gov SBCs and the NCS ones differ in one significant way: the plans offered through the healthcare.gov exchanges are specifically for people who don’t receive healthcare through their employer, while the NCS plans are strictly employer-provided healthcare plans. As a result, the language differs slightly; for example, the NCS set uses the word “employee” and has a higher frequency of self+one plans.
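One rough way to check how much the wording differs between two sets of SBC statements is to measure what share of the words in one set never appear in the other. The sketch below is only an illustrative diagnostic under that assumption; the function names and the toy strings are hypothetical and were not part of the project.

```python
import re


def tokenize(text):
    """Lowercase word tokens; keeps forms like 'self+one' and 'out-of-network' whole."""
    return re.findall(r"[a-z][a-z+\-]*", text.lower())


def unseen_token_rate(train_docs, target_docs):
    """Share of target tokens that never occur in the training documents."""
    vocab = {t for doc in train_docs for t in tokenize(doc)}
    target = [t for doc in target_docs for t in tokenize(doc)]
    if not target:
        return 0.0
    return sum(t not in vocab for t in target) / len(target)


# Toy strings standing in for exchange and employer-provided statements.
exchange_docs = ["individual family deductible"]
employer_docs = ["employee individual deductible"]
rate = unseen_token_rate(exchange_docs, employer_docs)
```

A high rate on the employer-provided set would be one signal that a model trained on exchange plans is seeing context words it has never learned from.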
Given this information, the NCS has revised its data collection methods to build a larger dataset of SBCs (in addition to the still-prevalent Summary Plan Descriptions). Once we have a large enough set of coded SBCs, we can begin to train machine learning models on employer-provided SBCs only and, we hope, increase the accuracy of coding. In the meantime, we are refining the models and the code used to extract and code data in the healthcare.gov dataset. This includes downloading data from subsequent years and using one year’s SBCs to train models that label the following year’s data, as would happen in production. We are also using Amazon Mechanical Turk to code data for elements in the SBCs that we would like to extract but that do not appear in the healthcare.gov dataset.
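The year-over-year evaluation described above could be organized along these lines. This is a sketch under assumptions: the `(year, item)` record layout and the function name `year_over_year_splits` are hypothetical, and each item stands in for whatever a coded SBC looks like in practice.

```python
from collections import defaultdict


def year_over_year_splits(records):
    """Yield (train_year, test_year, train_items, test_items) for consecutive years.

    `records` is an iterable of (year, item) pairs. Each split trains on one
    year and tests on the next, mimicking how a model would be used in production.
    """
    by_year = defaultdict(list)
    for year, item in records:
        by_year[year].append(item)
    years = sorted(by_year)
    for prev, nxt in zip(years, years[1:]):
        if nxt == prev + 1:  # skip gaps so we never train across a missing year
            yield prev, nxt, by_year[prev], by_year[nxt]


# Placeholder items; the years are arbitrary examples.
records = [(2015, "a"), (2016, "b"), (2015, "c"), (2017, "d")]
splits = list(year_over_year_splits(records))
```

Splitting by year rather than at random keeps later documents out of the training data, so the measured accuracy is closer to what a deployed model would see.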