Today, I am very excited to share our latest AI Lab project with you: an AI model designed to locate and extract important information across a large number of different document types. This article will introduce our new computer vision model, discuss the problems we set out to solve and how the model was designed and built, and finally explain what this product has to offer over alternative solutions.

But before we dive in, it is important to point out that the results are the culmination of extraordinary efforts by a wonderfully talented team. Over the course of many months, our data science team has been preparing data, researching state-of-the-art technologies, developing prototypes and implementing a slew of functional utilities. To cap off this long development cycle, engineers from the backend and the UI all came together in a 24-hour hackathon to turn a proof of concept machine learning (ML) model into a customer-facing portal that allows us to serve our models in an interactive and visual way. Extending beyond just this model, our new AI Lab allows us to showcase more of our experimental technologies outside our immediate team. Although this article will focus on the technical machine learning aspect, it should not downplay the enormous contributions of the other teams needed to take the prototype and make it widely available.

The Challenge

To better understand how it works and why it’s such impressive technology, let’s first understand the challenges and why we wanted to build something better. At a high level, the task we are solving is entity recognition, or key-value extraction. The purpose of this task is to take unstructured or semi-structured data and find the meaningful information therein. For example, when processing invoices, it may be necessary to extract the invoice number from the document for record-keeping purposes. Entity recognition automatically finds that number along with the label it is associated with (e.g. “Invoice No.”) and extracts the data.
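To make the task concrete, here is a minimal sketch of what key-value extraction output might look like. The field names, values, and structure are purely illustrative assumptions, not the actual output schema of the model described in this article.

```python
# Hypothetical output of key-value extraction on an invoice (illustrative only):
# each entry pairs a detected key (label) with its extracted value.
extracted = [
    {"key": "Invoice Number", "value": "INV-2024-0042", "confidence": 0.97},
    {"key": "Due Date", "value": "2024-07-31", "confidence": 0.93},
    {"key": "Total Due", "value": "$1,250.00", "confidence": 0.95},
]

# Downstream record-keeping can then look up the field it needs:
invoice_number = next(
    e["value"] for e in extracted if e["key"] == "Invoice Number"
)
```

A structured result like this is what lets the extracted data flow directly into accounts payable systems without manual transcription.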

When we initially approached this problem, we had accounts payable (AP) document processing in mind. However, as our research progressed, we decided that a truly intelligent model should be able to process more than just invoices. Although this increased the scope of the challenge, our machine learning team felt up to the task, and we expanded our goal to a more universal approach to entity recognition. The final consideration we took into account during development was that whatever model and system we designed had to be lightweight enough to perform not only in a cloud environment but also in on-premises deployments. To this end, we required our final solution to consistently process more than 100 pages per minute on a simple 4-core CPU at a clock speed no greater than 2 GHz. In the era of GPU and TPU acceleration, this was no simple task.

The Process

With our work cut out for us, we began the process of evaluating possible model architectures that could meet all of our stringent requirements. Based purely on our speed and efficiency conditions for success, a computer vision solution jumped out as a perfect fit. By using state-of-the-art, real-time neural network architectures, we could deliver on speed, but the fact that a purely visual model has no text input from which to learn a document’s important information was a significant hurdle. Further complicating the situation was the fact that our dataset was almost entirely based on invoices, a reflection of our previous work.

To compensate for these problems, we structured our data in a unique way that emphasized learning information patterns rather than language. The idea was that by focusing on certain patterns within a document, our model should be able to quickly learn visual representations of how information is presented in documents. Furthermore, the repeating nature of these information organization patterns allowed us to get away with a relatively small dataset of only several thousand images.

With the data in hand and a neural network architecture ready to learn, we began training prototypes for evaluation. The bulk of our efforts focused on object detection models, a family of computer vision architectures well-suited for speed and for learning visual patterns. We framed the training objective so that the model would learn to find an entity identifier (a key), the entity itself (a value), and the relationship between the two as a single, distinct instance (i.e. whether a key and value form a pair).
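One way to picture this three-part objective is as three detection classes over the same page. The class names, coordinates, and record layout below are assumptions for illustration, not the team’s actual annotation schema.

```python
# Illustrative annotation scheme (assumed, not the actual training format):
# three detection classes over one page, with boxes as (x1, y1, x2, y2) pixels.
CLASSES = {0: "key", 1: "value", 2: "pair"}

sample_targets = [
    {"class": 0, "box": (40, 120, 180, 145)},   # the key, e.g. an "Invoice No.:" label
    {"class": 1, "box": (190, 120, 310, 145)},  # the value, e.g. the number itself
    {"class": 2, "box": (40, 120, 310, 145)},   # the key-value pair as one instance
]
```

Treating the pair itself as a detectable instance is what lets a single forward pass both locate the fields and establish which key belongs to which value.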

Object detection neural networks often rely on a metric known as intersection over union (IoU), which measures how much a predicted bounding box overlaps the ground truth. Although this worked most of the time as an adjunct metric for gauging general model performance, we had to develop other metrics to truly understand whether our predictions were good. In the case of a key and value separated by a large amount of white space in a document, the IoU score might reach 95% while the predicted box still failed to capture the key or value text. The same problem occurred on pairs where either the key or value was very short and occupied only a small area of the bounding box.

Therefore, to more accurately measure performance, we blended our detection results with a proven OCR application. Our proprietary metric and method of handling our network’s loss function showed improvements in our ability to intuitively and quickly assess performance as our customers would experience it. In short — we were able to directly show if we were able to capture the information, rather than just the general locality. Additionally, by focusing on the information content rather than simple IOU, we could quickly disambiguate models and select only those with the best performance based on our objective.
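The proprietary metric itself is not published, but one simple way to measure information capture rather than geometric overlap is to check how many ground-truth OCR words fall inside a predicted box. The sketch below is an assumed approximation of that idea, not the team’s actual metric.

```python
def center_inside(word_box, region):
    """True if the center of an OCR word box (x1, y1, x2, y2) lies inside region."""
    cx = (word_box[0] + word_box[2]) / 2
    cy = (word_box[1] + word_box[3]) / 2
    return region[0] <= cx <= region[2] and region[1] <= cy <= region[3]

def capture_rate(pred_box, gt_words):
    """Fraction of ground-truth OCR words whose centers the prediction captures."""
    if not gt_words:
        return 0.0
    hits = sum(center_inside(w["box"], pred_box) for w in gt_words)
    return hits / len(gt_words)
```

Unlike IoU, this kind of metric directly answers the customer-facing question: did the prediction capture the information, or merely most of the pixels around it?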

The Result

By the end of our research and development efforts, we had identified a performant model that met our requirements for speed and showed a significant level of performance across our testing set. Furthermore, our model design resulted in several highly desirable traits:

  1. Because the model operates at a visual level first, it is naturally agnostic to the OCR engine that powers the actual data extraction. As long as the OCR engine provides coordinates for the recognized text, it can be used with our Semantik Recognition Engine.
  2. As was demonstrated in further testing, the model appears to be largely language agnostic. Although we only trained with English-language documents, we found equivalent or nearly equivalent performance when the model was applied to Spanish, French, German, Danish and Italian. The one caveat with this language agnosticism is that performance is far greater in languages that use the Latin alphabet than in logographic or abjad languages. In addition to unfamiliar organizational conventions, such as the right-to-left orthography used in these languages, obtaining good OCR results in logographic and abjad scripts requires a custom engine.
  3. To push the limits of our testing, we applied the model that was trained only on invoices to other document types, including credit applications, explanations of benefits, insurance claims and certain tax forms. Incredibly, the model performed at very high levels, extracting entity information at levels approaching our pure invoice results. This indicated that our initial experimental design was a success: the model was capable of learning the visual structures most commonly associated with entity pairs and was not limited by the language or general appearance of its training data.

What’s Next

As we continue to develop and mature our AI model, our efforts will focus on several optimizations. First, although we met our basic speed requirements, our goal for lightweight and rapid processing is to more than double our throughput to greater than 200 pages per minute. Second, although our final model showed tremendous results of higher than 95% in many of our AP-related fields, initial follow-up work suggests that we may improve these results even further through targeted optimizations of the model architecture and system. Finally, we are looking at methods to rapidly expand our training set in both size and scope. By varying the document type beyond invoices and including more training data, we expect the voracious appetite of our neural network solution to continue to deliver state-of-the-art results.

We expect this breakthrough technology to take intelligent document processing to the next level by giving customers a competitive advantage with processing power and speed. 

To put our AI model to work with your own data in our AI Lab, go to