Automating the insurance claim reassessment process using applied machine learning

The Challenge

Claim assessments produced by insurance agencies in the USA generally follow a standard structure in the breakdown of their line items. A claim consists of a list of sections corresponding to the different rooms of the property. Each section contains a table of line items detailing the individual repair, replacement, or other actions needed to reconstruct the damage caused by a natural disaster.

When these statements are produced, line items are often quoted below industry-standard prices. In such cases, a third-party reassessment is carried out to determine the excess amount that can be claimed by aligning the individual line items with industry-standard prices.

The reassessment process is a labor-intensive task that requires assessors to verify each item manually against a standard market dataset such as the Craftsman database. Little to no automation is used in the task. Our challenge was to introduce automation to this process while ensuring the solution could scale and remain usable by laypeople.

The Goal

Our engineering process starts by identifying the key outcomes of the solution. These goals serve as the foundation for further requirements analysis and as success metrics throughout the project.

Automation for the line item mapping

The key piece of the solution is the system's ability to extract line items and map them to industry-standard items. This requires a combination of process-automation technologies working together seamlessly.

Intuitive and robust UX

The solution is intended for end users who do not possess domain knowledge of insurance or construction cost estimation. The system needed to be simple while remaining informative to the extent required.


Adaptability and control

The system needed to be built with the future in mind, able to adapt to changes in data, policy, and other domain-level variables. The client needed complete control over all aspects of automation while being kept away from code-level complexities.

The Solution

When we first visualized the solution, we broke it into 3 components: the app, the backend, and the data service. Our data engineers started R&D efforts on the data service component while the product engineering team rolled out the web app, mobile app, and backend in phases.

Product Engineering

Creating products that scale is a non-trivial task when asynchronous data-processing elements are involved. Keeping that in mind, our first key decision was selecting the technology stack. As with any MVP, we understood the need for the product to reach the market as soon as possible. But speed should not compromise the foundation of the solution, rendering it unusable at scale.

The alpha build

The alpha build, or the PoC build, is the first release artifact we developed. It is the precursor to the MVP, meant to validate the uncertainties around the process and the solution. The web app was built with ReactJS, using Firebase for authentication, file storage, and data storage. Going serverless allowed us to build out the alpha rapidly while keeping the key focus on the data-processing elements. By using React Native, we were able to build the mobile app alongside the web app; sharing logic between the two was a crucial factor in its success.


The MVP build

With the alpha build released, we received constructive feedback from the client as well as from a selected control group that tested from the end users' perspective. With a revised build plan, the engineering team then started on the MVP build.

The MVP, unlike the alpha, is meant to be built for scale from day one. We decided to move to a monolithic backend: Python/Django was selected along with PostgreSQL as the database. Firebase remained an active component of the solution, powering the real-time data aspects of asynchronous data processing.


Data Engineering

Data engineering for this solution was a challenging piece of engineering. Avantrio, our data partners, led the R&D efforts along with our product engineering team. The approach was simple: build a PoC model that works, validate it over a range of test cases, and implement it in a scalable manner. Automating the mapping process had 2 pieces,

  1. Extracting the line items from user-uploaded PDFs
  2. Mapping the line items to a standard item from the Craftsman database


Extracting line items

Extracting text from a PDF is a standard, well-researched OCR task. Extracting it with structure, however, is harder. The challenges included recreating the tabular structures from the raw OCR output and making sure line items and their corresponding quantity figures did not get distorted. Our initial approach used AWS Textract as the OCR engine along with a bundle of custom logic for post-processing the OCR data. While this worked quite well for the training cases, the model became too complex when incorporating additional claim document types.
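To illustrate the table-reconstruction step, here is a minimal sketch of rebuilding rows from Textract-style output. It assumes the `Blocks` list shape returned by `AnalyzeDocument` with the TABLES feature (TABLE, CELL, and WORD blocks linked by CHILD relationships); the actual post-processing logic used in the project was more involved.

```python
from collections import defaultdict

def rebuild_table(blocks):
    """Reconstruct table rows from Textract-style blocks.

    `blocks` mirrors the "Blocks" list of an AnalyzeDocument response
    (TABLES feature). Returns a list of rows, each a list of cell texts.
    """
    by_id = {b["Id"]: b for b in blocks}

    def child_ids(block):
        # CHILD relationships link a TABLE to its CELLs and a CELL to its WORDs.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                yield from rel["Ids"]

    def cell_text(cell):
        words = [by_id[i]["Text"] for i in child_ids(cell)
                 if by_id[i]["BlockType"] == "WORD"]
        return " ".join(words)

    # Place each cell's text at its (RowIndex, ColumnIndex) position.
    rows = defaultdict(dict)
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        for cid in child_ids(table):
            cell = by_id[cid]
            if cell["BlockType"] == "CELL":
                rows[cell["RowIndex"]][cell["ColumnIndex"]] = cell_text(cell)

    return [[rows[r].get(c, "") for c in sorted(rows[r])]
            for r in sorted(rows)]
```

In practice the quantity and unit-price columns recovered this way still need validation against the line-item description, which is where the custom logic grew complex.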

Our second approach was to use a service that provides a pre-trained, mature model for the use case. By decoupling the business-logic aspects of the extraction, we were able to use the Nanonets API for the extraction process. Nanonets has a well-trained model that can identify a range of table structures and provide the output in a standard format. With these structures, post-processing was simplified, and the extraction task was complete.
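The simplified post-processing can be sketched as grouping labeled cells back into line-item records. The cell structure below (`row`, `label`, `text` fields) is a hypothetical stand-in for the extraction API's table output, not the exact Nanonets schema:

```python
def rows_to_line_items(cells):
    """Group labeled table cells into line-item dicts.

    `cells` is a list of {"row": int, "label": str, "text": str}
    entries, a simplified stand-in for an extraction API's table
    output. Returns one dict per row keyed by label.
    """
    rows = {}
    for cell in cells:
        rows.setdefault(cell["row"], {})[cell["label"]] = cell["text"]

    line_items = []
    for _, fields in sorted(rows.items()):
        qty = fields.get("quantity", "").replace(",", "")
        line_items.append({
            "description": fields.get("description", "").strip(),
            # Quantities arrive as text; normalize to a number when present.
            "quantity": float(qty) if qty else None,
        })
    return line_items
```

Because the service already emits cells with consistent row and column labels, this step stays small regardless of how the source document laid out its tables.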

Mapping line items

Mapping line items was framed as a sentence-pair matching problem: the system had to find, from a pre-defined list of sentences, the best match for a sentence extracted by the OCR system. The challenges included poor extractions and line items worded differently from the standard items. To resolve this, we approached the problem from 3 different perspectives,

  1. A string matching problem
  2. An information retrieval (IR) problem
  3. An NLP problem

After trying out PoC approaches for all 3 cases above, the best-performing model was a combination of IR and string matching. We modeled the Craftsman database as a document corpus that can be digested by an inverted index, creating a high-recall retrieval model. This was used as the first level of filtering in the search process. Then a state-of-the-art sentence-embedding model was used as the converging model: the output of the index is fed into the embedding model to compute semantics-aware similarities between sentence pairs, providing high precision. Together, the combined model achieves both high recall and high precision – the ideal conditions.
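The two-stage design can be sketched as follows. This is a minimal stdlib-only illustration: the inverted index provides the high-recall candidate set, and a cosine similarity over token counts stands in for the sentence-embedding re-ranker used in the real system (class and function names here are illustrative, not from the project's codebase).

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return [t for t in text.lower().split() if t.isalnum()]

class LineItemMapper:
    """Two-stage mapper: an inverted index for recall, then a
    similarity re-rank for precision. Cosine similarity over token
    counts is a stand-in for the embedding model."""

    def __init__(self, standard_items):
        self.items = standard_items
        self.index = defaultdict(set)          # token -> item ids
        for i, item in enumerate(standard_items):
            for tok in tokenize(item):
                self.index[tok].add(i)

    @staticmethod
    def _cosine(a, b):
        dot = sum(a[t] * b[t] for t in a if t in b)
        norm = (math.sqrt(sum(v * v for v in a.values()))
                * math.sqrt(sum(v * v for v in b.values())))
        return dot / norm if norm else 0.0

    def map(self, extracted_line):
        query = Counter(tokenize(extracted_line))
        if not query:
            return None
        # Stage 1 (high recall): any item sharing a token is a candidate.
        candidates = set().union(
            *(self.index[t] for t in query if t in self.index))
        if not candidates:
            return None
        # Stage 2 (high precision): re-rank candidates by similarity.
        best = max(candidates, key=lambda i: self._cosine(
            query, Counter(tokenize(self.items[i]))))
        return self.items[best]
```

Splitting the work this way keeps the expensive similarity computation confined to a small candidate set, which is what makes the approach scale over a corpus the size of the Craftsman database.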
