Reflection on data extraction accuracy stats and actionable take-aways

There is a lot of debate and hype about the accuracy of intelligent document processing (IDP) solutions. I am sure you’ve seen claims boasting accuracy rates of 80%, 90%, even 100%, and everything in between. The important question to ask is: what does ‘accuracy’ actually mean in this context?

Accuracy Rates Might Be Different Than What You Think

You guessed it: accuracy rates are not created equal. There is an important distinction between ‘machine accuracy’ and ‘machine + human-in-the-loop (HITL) accuracy’, one that many people overlook when assessing an IDP solution. In my experience, most people assume the advertised accuracy rates are achieved by the machine alone, yet this is rarely the case. Let’s take a look at why exception handling by humans is not only inevitable in almost all cases, but actually a critical step in the process.

Consider all of the documents entering your organization. Are they all in legible machine print with no smudges, stains, missing words or other errors? Are they all from the same vendor, using the same format and the same language? If the answer is yes to all, 100% accuracy is technically possible. However, this is not a likely scenario for enterprises and government organizations that process thousands or millions of documents every year. The moment you introduce just one variation, 100% accuracy by the machine alone is no longer realistic for a high volume of documents.

Take invoices, for example. They come in many shapes and forms and from many, often varying, vendors. Can you set up your IDP solution to process known invoices and achieve high accuracy rates? Yes, absolutely, if you have a good, customizable solution. But what if you get an invoice from a new vendor that uses complex tables, or one of your existing vendors changes its invoice format? The accuracy rate will be impacted.

Human-in-the-loop (HITL)

So, what is needed to address this issue? A human-in-the-loop. Leading IDP systems, like Ephesoft, use business tolerance rules or confidence scores when extracting data. For example, if our system can’t confidently read a character or runs into some other problem, the document gets kicked out of the touchless process and an exception alert is triggered for human review. This keeps the user in control and ensures high accuracy rates as an end result. In addition, Ephesoft Transact allows users to improve the automated data extraction results over time, and makes this process intuitive and easy for the user.
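
To make the idea of confidence-based routing concrete, here is a minimal Python sketch. It is an illustration only, not how Ephesoft Transact is implemented: the `Field` type, the `route` function and the 0.90 threshold are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    value: str
    confidence: float  # hypothetical score between 0.0 and 1.0


def route(fields, threshold=0.90):
    """Return the routing decision and the names of low-confidence fields."""
    flagged = [f.name for f in fields if f.confidence < threshold]
    if flagged:
        # At least one field is below tolerance: send to a human reviewer.
        return "human_review", flagged
    # Every field met the threshold: process without human touch.
    return "straight_through", flagged


invoice = [
    Field("invoice_number", "INV-1042", 0.99),
    Field("total", "1,250.00", 0.72),  # e.g. a smudged digit lowers confidence
]
print(route(invoice))
```

In this sketch a single low-confidence field is enough to pull the whole document out of the touchless path. A stricter or looser threshold shifts the balance between review workload and the risk of an unnoticed error.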

Improvements Over Time

With the help of AI, machines are certainly getting better at accommodating variations and errors in documents, and I am sure we’ll see significant progress in the near future. However, as it stands today, a 100% accuracy rate requires a human-in-the-loop (unless you only have perfect documents). The fact is, the accuracy rate will depend on your specific documents and on how well your IDP solution can adapt to get the best results for your particular use case(s).

See It to Believe It

My recommendation is to put your IDP solution to the test with sample documents and see the results rather than relying on advertised accuracy stats. You’ll not only see the results firsthand but also experience what it is like to work with the solution provider, which should also be an important criterion in your selection process.
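
When you run such a test, it helps to score the machine’s raw output and the reviewed output separately, so that ‘machine accuracy’ and ‘machine + HITL accuracy’ do not get mixed up. The Python sketch below shows one simple way to do that, assuming you have hand-verified values for each sample document. The function name, the field names and the data are all made up for illustration, and exact string matching is only one of several possible scoring rules.

```python
def field_accuracy(extracted, truth):
    """Share of ground-truth fields whose extracted value matches exactly."""
    total = sum(len(doc) for doc in truth)
    correct = sum(
        1
        for ext_doc, true_doc in zip(extracted, truth)
        for name, value in true_doc.items()
        if ext_doc.get(name) == value
    )
    return correct / total if total else 0.0


# Hypothetical hand-verified values for two sample invoices.
truth = [
    {"vendor": "Acme", "total": "100.00"},
    {"vendor": "Globex", "total": "250.00"},
]
# Raw machine output: one misread digit in the second total.
machine = [
    {"vendor": "Acme", "total": "100.00"},
    {"vendor": "Globex", "total": "280.00"},
]
# The same output after a reviewer corrects the flagged field.
reviewed = [
    {"vendor": "Acme", "total": "100.00"},
    {"vendor": "Globex", "total": "250.00"},
]

print(f"machine only:   {field_accuracy(machine, truth):.0%}")
print(f"machine + HITL: {field_accuracy(reviewed, truth):.0%}")
```

With this toy data the machine alone scores 75% and the reviewed output scores 100%. That gap is exactly what an advertised figure can hide if it does not say which of the two it reports.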

The Inner Workings of Automation and Accuracy

Another interesting dimension when discussing accuracy specific to IDP is the various stages a document goes through during the data extraction process, each of which can introduce accuracy failures:

1) Image Processing – During this stage, the system normalizes, cleans up, rotates and otherwise adjusts the image, which is critical to how well the image can be read. If there is a problem at this point, accuracy will be impacted down the line. Ephesoft leverages best-in-class image cleanup technologies to achieve optimal results.

2) Document Classification – During this stage, the type of document entering the process is identified; this is also commonly called “indexing.” This is where the software detects whether the document is, for example, a bank statement, tax form, identification card, invoice or any other document type. Be aware that there are significant differences in classification capabilities across IDP vendors. Ephesoft Transact is recognized for providing the most advanced classification methodologies: a user-trainable model for ML search classification and separation, pre-trained ML classification models, key-value classification, and HITL classification review and model updates.

3) Data Extraction – This is the stage where the rubber meets the road. The system pulls the data out of the previously cleaned-up and classified document. This is where you can measure the true machine accuracy. During your assessment, make sure to look at a good variety of real-life documents for your specific use case(s). As mentioned above, human-in-the-loop processes are important here to address exceptions and make system improvements over time. Note that some IDP providers send the data abroad and substitute manual data entry for their automated data extraction behind the scenes. For anyone whose documents contain private information, this might introduce compliance and data privacy issues.

In Ephesoft Transact, the user can look at any of their documents to determine whether the extracted information is correct. The system is color-coded to help the user easily identify any fields whose data was not pulled accurately from the document. The quality of the extracted data is influenced by the previous image processing step. Behind the scenes, the application is configured with the customer’s business rules and tolerance or confidence levels. Based on those levels, the customer can set it up to always have a HITL or to push the data straight through if no potential errors are detected.
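
Because each of the three stages described above feeds the next, small error rates multiply. The short Python sketch below illustrates the arithmetic. The per-stage figures are invented for the example, not measured results from any product, and the calculation makes the simplifying assumption that a document is only correct end to end if every stage succeeds and that stage failures are independent.

```python
import math

# Illustrative per-stage accuracies; these numbers are invented, not measured.
stage_accuracy = {
    "image_processing": 0.98,
    "classification": 0.97,
    "extraction": 0.95,
}

# Simplifying assumption: stages fail independently, so the end-to-end
# accuracy is the product of the per-stage accuracies.
end_to_end = math.prod(stage_accuracy.values())
print(f"end-to-end accuracy: {end_to_end:.1%}")
```

Under these assumptions, three stages that each look strong on their own combine to roughly 90.3% end to end, which is why a weakness in image processing or classification shows up later as an extraction error.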


While accuracy rates are essential for the evaluation of your IDP system, they are not as straightforward as you might think. Make sure to consider the different definitions of accuracy and understand what they translate to for your specific needs both in the short and long term.

If you would like to see your documents in action, send us your sample documents and we’ll process them at no cost to you! Contact us to get started.