Fuzzy DB Extraction Plugin


The Fuzzy DB plugin extracts the document level fields of a document from database records, based on a match against either the HOCR content or the previously extracted value of a document level field. The plugin builds a search-engine-based index over the database records and extracts document level field values by fuzzy-matching the HOCR content against that index. For example, a user can configure a vendor database to capture the vendor name, vendor ID or any other field from incoming invoices. This is done by mapping the document type to the vendor database table and the index fields of the document to the columns of that table. The plugin then finds the matching vendor in the database and updates the fields in the document.


Configurable properties

The following configurable properties are available for the Fuzzy DB plugin:

Configurable property | Type of value | Value options | Description
Minimum Word Length | Integer | N/A | Words shorter than this length are ignored in the HOCR content.
Minimum Term Frequency | Integer | N/A | Terms that occur less often than this in the source document are ignored.
Minimum Doc Frequency | Integer | N/A | Words that do not occur in at least this many documents are ignored.
Maximum Query Terms | Integer | N/A | The maximum number of query terms included in any generated query.
Minimum Confidence Threshold | Integer | N/A | Minimum confidence score required for a Fuzzy DB row to be selected for fuzzy extraction.
Date Format | String | N/A | Date format used for identifying the date field.
No Of Pages | Integer | N/A | Maximum number of pages included when querying for the content.
Option To Include Pages | List of values |  | Determines whether all pages or only the first page of the document are used for fetching the HOCR content.
FuzzyDB Extraction switch | List of values | ON, OFF | Determines whether fuzzy extraction is performed.
Query Delimiters | String | N/A | Delimiters used by the fuzzy text search in the validation phase.
Ignore Words List | Multi select | Name, Title | List of words to be ignored in the HOCR content.
Fuzzy Extraction Search Columns based on Fields | String | N/A | Defines the names of the document level fields whose values are searched. E.g. for the value “$City, $State”, the values of the “City” and “State” document level fields are queried directly against the Lucene content and the appropriate database table row is returned. The document level fields of the document are populated accordingly.
Fuzzy Extraction HOCR Switch | List of values | ON, OFF | Determines what happens when no match is found for the fields specified in “Fuzzy Extraction Search Columns based on Fields”. ON: the search continues over the complete HOCR content. OFF: the HOCR content is not searched and the values extracted by the previous extraction plugins remain.
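
To illustrate how Minimum Word Length, Minimum Term Frequency, Maximum Query Terms and Ignore Words List might interact when query terms are picked from HOCR text, here is a minimal plain-Python sketch. It is an approximation for illustration only, not the plugin's Lucene-based code; the function name, parameter names and default values are hypothetical.

```python
import re
from collections import Counter


def build_query_terms(hocr_text, min_word_length=3, min_term_freq=1,
                      max_query_terms=5, ignore_words=()):
    """Pick the most frequent usable words from plain HOCR text.

    Hypothetical helper: it mimics the documented properties, not the
    plugin's real implementation.
    """
    ignored = {w.lower() for w in ignore_words}
    # Split on anything that is not an ASCII letter or digit.
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9]+", hocr_text)]
    counts = Counter(
        w for w in words
        if len(w) >= min_word_length and w not in ignored
    )
    # Keep terms that occur often enough, most frequent first; ties are
    # broken alphabetically so the result is deterministic.
    frequent = [(t, c) for t, c in counts.items() if c >= min_term_freq]
    frequent.sort(key=lambda tc: (-tc[1], tc[0]))
    return [t for t, _ in frequent[:max_query_terms]]


text = "Name: Acme Supplies Invoice to Acme Supplies, Title: Invoice 42"
print(build_query_terms(text, min_word_length=4, min_term_freq=2,
                        ignore_words=["Name", "Title"]))
# prints ['acme', 'invoice', 'supplies']
```

Short words ("to", "42") and the ignored words ("Name", "Title") never reach the query, which keeps boilerplate labels from dominating the match.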

Steps for configuring the plugin

  • The user can select the batch class module and navigate to the Fuzzy DB plugin configuration page.

  • The user can map the document type to a database table by clicking on “Mapping”:
    • The document type can be mapped to a database table (containing the data records to be indexed) chosen from the list of tables provided.
    • The document level fields can be mapped to table columns for extraction.

  • Once the mapping is defined, the user can click on “Learn DB” to create indexes of all the records present in the database.
    • Lucene indexes are generated for all database records of every document type that has been mapped for the current batch class. Only mapped columns are indexed.
    • Each index entry is built on a string that combines the text of all the fields mapped to columns of the database table.
    • Separate index directories are created to store indexes per document type per batch class. The hierarchy used for storing index files against each document level field is: <Shared-Folder-Path>\<Batch-Class>\fuzzydb-index\<Database-Name>\<Table-Name>.
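
The index layout and the combined index string described above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the sample shared folder, batch class, database, table and column names are all hypothetical, and the real plugin builds Lucene indexes rather than plain strings.

```python
from pathlib import PureWindowsPath


def index_directory(shared_folder, batch_class, database, table):
    """Build <Shared-Folder-Path>\\<Batch-Class>\\fuzzydb-index\\<Database-Name>\\<Table-Name>."""
    return PureWindowsPath(shared_folder, batch_class, "fuzzydb-index",
                           database, table)


def indexed_text(row, mapped_columns):
    """Join the values of the mapped columns only, in mapping order."""
    return " ".join(str(row[c]) for c in mapped_columns
                    if row.get(c) is not None)


# Hypothetical sample values, used only to show the shape of the result.
print(index_directory(r"D:\SharedFolders", "BC1", "vendors_db", "vendor"))
# prints D:\SharedFolders\BC1\fuzzydb-index\vendors_db\vendor

row = {"name": "Acme Supplies", "city": "Austin", "phone": "555-0100"}
print(indexed_text(row, ["name", "city"]))
# prints Acme Supplies Austin
```

The unmapped "phone" column is left out of the combined string, mirroring the rule that only mapped columns are indexed.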

Steps of execution

  • The plugin uses the HOCR content of a document to generate a query comprising keywords selected by their occurrence in the document. It then compares this HOCR-based query against the indexes built on the DB table rows.
  • Lucene returns the matching records, among which the record with the highest confidence score is selected. If the score is greater than the threshold, the corresponding values are stored as the document level field values in the batch XML file.
  • The following cases can occur during execution of the plugin:
“FuzzyDB Extraction switch” value | “Fuzzy Extraction Search Columns based on Fields” value | “Fuzzy Extraction HOCR Switch” value | Result
OFF | N/A | N/A | No fuzzy extraction.
ON | <Empty> | N/A | Usual fuzzy extraction using HOCR content.
ON | “$City,$State” | OFF | The values of the “City” and “State” document level fields extracted by previous extraction plugins are searched in the learned Lucene content. If a match is found, it is used; otherwise the data from the previous extraction remains.
ON | “$City,$State” | ON | The values of the “City” and “State” document level fields extracted by previous extraction plugins are searched in the learned Lucene content. If a match is found, it is used; otherwise the usual fuzzy extraction using HOCR content is performed.
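
The four cases above can be expressed as a small decision function. This is a sketch of the documented behaviour only; the function names and the "fields"/"hocr" step labels are hypothetical and do not come from the plugin.

```python
def parse_search_fields(setting):
    """Turn a value such as "$City,$State" into ["City", "State"]."""
    return [p.strip().lstrip("$") for p in setting.split(",") if p.strip()]


def extraction_plan(extraction_switch, search_columns, hocr_switch):
    """Return the ordered search steps for one document.

    "fields" = search using previously extracted document level fields,
    "hocr"   = usual fuzzy extraction using HOCR content.
    Each later step runs only if the earlier one found nothing.
    """
    if extraction_switch != "ON":
        return []                      # no fuzzy extraction at all
    fields = parse_search_fields(search_columns or "")
    if not fields:
        return ["hocr"]                # nothing configured: plain HOCR search
    plan = ["fields"]
    if hocr_switch == "ON":
        plan.append("hocr")            # fall back to HOCR if fields find nothing
    return plan


print(extraction_plan("ON", "$City,$State", "OFF"))  # prints ['fields']
print(extraction_plan("ON", "$City,$State", "ON"))   # prints ['fields', 'hocr']
```

With the HOCR switch OFF the plan ends after the field search, so values from previous extraction plugins stay in place when nothing matches.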


  • The Lucene engine is used instead of SQL queries to look up every word in the HTML file because it is faster and more efficient. SQL queries would be too slow, and Lucene also returns results even when the OCR has not recognized every character of a word correctly.
  • It is possible that the query returns no results. In such cases, no document level field is updated.
  • It is possible that the query returns multiple results. In such cases, the entry with the highest confidence score is used to populate the document level fields.
  • The plugin does not involve manual intervention; it is an automated step.
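
The selection rules above (highest score wins, the score must exceed the threshold, no result means no update) can be sketched as follows. Python's difflib.SequenceMatcher is used here purely as a stand-in for Lucene scoring, and the 0 to 100 score scale, the function name and the sample rows are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def best_match(query, rows, min_confidence):
    """Return (row, score) for the best-scoring row, or None.

    Scores are SequenceMatcher ratios scaled to 0-100. This is a stand-in
    for Lucene scoring, not the plugin's actual formula.
    """
    best = None
    for row in rows:
        text = " ".join(str(v) for v in row.values()).lower()
        score = 100 * SequenceMatcher(None, query.lower(), text).ratio()
        if best is None or score > best[1]:
            best = (row, score)
    if best is None or best[1] <= min_confidence:
        return None   # nothing confident enough: leave the fields untouched
    return best


# Hypothetical vendor rows; the query contains an OCR error ("rn" for "m").
rows = [
    {"name": "Acme Supplies", "city": "Austin"},
    {"name": "Apex Systems", "city": "Boston"},
]
match = best_match("Acrne Supplies Austin", rows, min_confidence=60)
print(match[0] if match else "no update")
```

Even with the misread character, the first row still scores well above the second, which is the kind of tolerance to imperfect OCR that an exact SQL comparison would not give.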


The following are a few common error messages seen when the plugin malfunctions:

S. No. | Error message | Possible root cause
1 | CorruptIndexException while reading Index | The Lucene indexes are either locked or corrupted.
2 | The base fuzzy db index folder does not exist. So cannot extract database fields. | Fuzzy database has not been learned yet.