Document OCR & data extraction

The goal of the project was to extract data from the various documents the company handles, such as driver licenses and insurance cards.

Project Details

Sick.org/Curogram helps customers book appointments with doctors and schedule tests. As an integral part of this process, users supply documents whose details must be entered into Curogram’s system. As the number of users grew, automating this procedure became vital in order to process large volumes of requests.

The solution was to develop an algorithm that replaces manual data entry. In the prototyping stage, we tried several methods. One was to apply OCR to extract the text and then build an NLP model that extracts custom labels from it. Another was to create a computer vision model that detects each field, such as name, address, and gender. We built and trained custom models with YOLO and used AWS services such as Textract, Comprehend, and Rekognition. However, inference time was too high for the client, at around 10 seconds per image. That is why I applied Textract plus regular expressions to extract details from driver licenses and insurance cards in under 5 seconds, and deployed the result as a REST API.
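
The Textract plus regex approach can be illustrated with a minimal Python sketch. It is not the project's actual code: the field names, the label formats (DL, DOB, EXP, SEX), and the regular expressions are assumptions chosen for illustration, and real driver licenses vary by issuer and would need tuned patterns. The OCR step uses the boto3 Textract call detect_document_text, and the parsing step works on plain text lines so it can be tried without AWS access.

```python
import re

# Hypothetical patterns for illustration only; real documents differ by
# issuer and layout, so production patterns would need tuning.
FIELD_PATTERNS = {
    "license_number": re.compile(r"\bDL[:\s]*([A-Z0-9]{5,15})\b", re.IGNORECASE),
    "date_of_birth": re.compile(r"\bDOB[:\s]*(\d{2}/\d{2}/\d{4})", re.IGNORECASE),
    "expiration_date": re.compile(r"\bEXP[:\s]*(\d{2}/\d{2}/\d{4})", re.IGNORECASE),
    "sex": re.compile(r"\bSEX[:\s]*([MF])\b", re.IGNORECASE),
}


def ocr_lines(image_bytes):
    """Return the text lines that Textract detects in an image."""
    import boto3  # imported lazily so the parsing code runs without AWS

    client = boto3.client("textract")
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return [
        block["Text"]
        for block in response["Blocks"]
        if block["BlockType"] == "LINE"
    ]


def extract_fields(lines):
    """Apply each regex to the OCR text and collect the first match."""
    # Joining with newlines lets a label and its value sit on separate
    # lines, because \s in the patterns also matches a line break.
    text = "\n".join(lines)
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


if __name__ == "__main__":
    # Sample lines standing in for Textract output (made-up values).
    sample = ["DRIVER LICENSE", "DL D1234567", "DOB 01/02/1990",
              "EXP 01/02/2030", "SEX F"]
    print(extract_fields(sample))
```

With a real image, the two steps would be chained as extract_fields(ocr_lines(image_bytes)). Keeping the regex step separate from the OCR call makes the patterns easy to test against saved OCR output, and a field whose pattern does not match comes back as None instead of raising an error.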
