The Fuzzy DB plugin extracts the document level fields of a document from database records, based on a fuzzy match against either the HOCR content or the previously extracted value of a document level field. The plugin builds a search engine index over the database records and extracts document level field values by fuzzy matching the HOCR content against that index. For example, the user can configure a vendor database to capture the vendor name, vendor ID or any other field from incoming invoices. This is done by mapping the document type to the vendor database table and the index fields of the document to the columns of that table. The plugin then finds the matching vendor in the database and updates the fields in the document.


Configurable properties

The following configurable properties are available for the Fuzzy DB plugin:


[table caption=”” width=”800″ colwidth=”100|100|200|200″ colalign=”center|center|center|left”]
Configurable property,Type of value,Value options,Description
Minimum Word Length,Integer,N-A,The minimum word length below which words will be ignored from the HOCR content.
Minimum Term Frequency,Integer,N-A,The frequency below which terms will be ignored in the source document.
Minimum Doc Frequency,Integer,N-A,Words that do not occur in at least this many documents will be ignored.
Maximum Query Terms,Integer,N-A,The maximum number of query terms that will be included in any generated query.
Database Password,String,A valid password value to connect to database,The password for connecting to the user SQL account.
Database User Name,String,A valid username value to connect to database,The username for connecting to the user SQL account.
Database Driver,List of~~values,net.sourceforge.jtds.jdbc.Driver~~~~com.mysql.jdbc.Driver,The database driver to be used; this is DBMS specific.
Database Connection URL,String,A valid database connection URL.,The database connection URL required for the connection; this is DBMS specific.
Minimum Confidence Threshold,Integer,N-A,Minimum threshold value required for a Fuzzy Db row to be selected for Fuzzy Extraction.
Date Format,String,N-A,Date format to be used for identifying the date field
No Of Pages,Integer,N-A,Maximum Number of pages to be included while querying for the content
Option To Include Pages,List of values,ALLPAGES~~~~FIRSTPAGE,Determines whether all the pages or the first page of the document will be chosen for fetching the HOCR content.
FuzzyDB Extraction switch,List of values,ON~~~~OFF,Determines whether fuzzy extraction is performed.
Query Delimiters,String,N-A,Delimiters to be used while using the fuzzy text search in the validation phase.
Ignore Words List,Multi select,Name~~~~Title,List of words to be ignored from HOCR content
Fuzzy Extraction~~Search Columns~~based on Fields,String,N-A,”This property defines the names of the document level fields whose values the user wants to search for. E.g. for the value ‘$City,$State’ the values of the ‘City’ and ‘State’ document level fields are queried directly against the Lucene content and the matching row of the database table is returned. Document level fields for the concerned document are populated accordingly.”
Fuzzy Extraction~~~~HOCR Switch,List of values,ON~~~~OFF,”This property defines whether to continue searching with the complete HOCR content when no match is found for the values specified in ‘Fuzzy Extraction Search Columns based on Fields’. ON: the search continues with the HOCR content. OFF: no HOCR search is done and the values extracted by the previous extraction plugins remain.”
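
The interaction of Minimum Word Length, Minimum Term Frequency, Maximum Query Terms and Ignore Words List can be illustrated with a small sketch. This is not the plugin's code: the function name `build_query_terms`, the tokenization rule and the tie-breaking order are assumptions made for illustration, and Minimum Doc Frequency is left out because it depends on the index contents.

```python
import re
from collections import Counter


def build_query_terms(hocr_text, min_word_length=3, min_term_frequency=1,
                      max_query_terms=25, ignore_words=()):
    """Pick the words of a page's text that would go into a fuzzy query.

    Hypothetical helper: parameter names mirror the plugin properties,
    but the logic is only an illustration of how such filters combine.
    """
    ignored = {w.lower() for w in ignore_words}
    # Tokenize on runs of ASCII letters and digits, case-insensitively.
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9]+", hocr_text)]
    # Drop words shorter than the minimum length and words on the ignore list.
    kept = [w for w in words if len(w) >= min_word_length and w not in ignored]
    counts = Counter(kept)
    # Drop terms that occur fewer than min_term_frequency times.
    frequent = [(t, c) for t, c in counts.items() if c >= min_term_frequency]
    # Most frequent first; ties broken alphabetically for a stable result.
    frequent.sort(key=lambda tc: (-tc[1], tc[0]))
    # Cap the number of terms that end up in the query.
    return [t for t, _ in frequent[:max_query_terms]]


text = "Invoice Name: Acme Supplies Ltd. Title: Acme invoice to Acme"
terms = build_query_terms(text, min_word_length=4, min_term_frequency=2,
                          max_query_terms=2, ignore_words=["Name", "Title"])
print(terms)
```

Raising Minimum Word Length or Minimum Term Frequency shrinks the query to the most distinctive words, while Maximum Query Terms caps its size regardless of how long the page is.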



Steps for configuring the plugin

  • The user selects the batch class module and navigates to the Fuzzy DB plugin configuration page as shown below:







The user can edit the above settings by clicking “Edit” in order to connect to the vendor database.

  • The user can map the document type to a database table by clicking “Mapping” as shown below:
    • The document type can be mapped to a database table (holding the data records to be indexed) chosen from the list of tables provided.




  • The document level fields can be mapped to table columns for extraction.




  • Once the mapping is defined, the user can click “Learn DB” to create indexes of all the records present in the database.
    • Lucene indexes are generated for all database records belonging to all document types that have been mapped for the current batch class. Only mapped columns are indexed.
    • Indexes are built on a string which is the combined text of all the fields mapped to the various columns of the database table.
    • Separate index directories are created to store indexes per document type per batch class. The hierarchy used for storing index files against each document level field is: <Shared-Folder-Path>\<Batch-Class>\fuzzydb-index\<Database-Name>\<Table-Name>.
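
As a rough illustration of the two points above, the sketch below assembles the index directory from the documented hierarchy and builds the combined text of the mapped columns for one record. The function names, the sample folder, database, table and column names, and the choice of a single space as the separator are all assumptions for illustration, not the plugin's actual behaviour.

```python
from pathlib import PureWindowsPath


def index_directory(shared_folder, batch_class, database, table):
    """Build <Shared-Folder-Path>\\<Batch-Class>\\fuzzydb-index\\<Database-Name>\\<Table-Name>."""
    return PureWindowsPath(shared_folder, batch_class, "fuzzydb-index",
                           database, table)


def combined_text(row, mapped_columns):
    """Join the values of the mapped columns only; unmapped columns are skipped.

    The space separator is an assumption made for this sketch.
    """
    return " ".join(str(row[c]) for c in mapped_columns
                    if row.get(c) is not None)


# Hypothetical vendor record; "notes" is not mapped, so it is not indexed.
row = {"vendor_id": 42, "vendor_name": "Acme Supplies",
       "city": "Austin", "notes": "internal"}

print(index_directory(r"D:\SharedFolders", "BC1", "vendors_db", "vendor"))
print(combined_text(row, ["vendor_name", "vendor_id", "city"]))
```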

Steps of execution

  • The plugin uses the HOCR content of a document to generate a query consisting of keywords chosen by their occurrence in the document. It then runs this HOCR based query against the indexes built on the DB table rows.
  • Lucene returns the matching records, among which the record with the highest confidence score is selected. If the score is greater than the threshold, the corresponding values are stored as the document level fields’ values in the batch xml file.
  • The following cases can occur during execution of the plugin:


[table caption=”” width=”800″ colwidth=”100|100|200|200″ colalign=”center|center|center|left”]
FuzzyDB Extraction switch Value,Fuzzy Extraction Search Column Value,Fuzzy Extraction HOCR Switch Value,Result
OFF,N.A.,N.A.,No Fuzzy Extraction.
ON,(Empty),N.A.,Usual Fuzzy Extraction using HOCR content.
ON,”$City,$State”,OFF,”Takes the values of the ‘City’ and ‘State’ document level fields extracted by previous extraction plugins and searches for them in the learned Lucene content. If a match is found it is used; otherwise the data from the previous extraction remains.”
ON,”$City,$State”,ON,”Takes the values of the ‘City’ and ‘State’ document level fields extracted by previous extraction plugins and searches for them in the learned Lucene content. If a match is found it is used; otherwise the usual Fuzzy Extraction using HOCR content is done.”
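
The cases in the table, together with the threshold rule from the execution steps, can be summarized in a short sketch. It is an illustration under stated assumptions, not the plugin's implementation: `pick_best` and `fuzzy_extract` are hypothetical names, matches are modelled as `(score, row)` pairs, and the two search callables stand in for the Lucene queries.

```python
def pick_best(matches, min_confidence):
    """Return the row of the highest-scoring (score, row) match,
    or None if there is no match or its score is not above the threshold."""
    if not matches:
        return None
    score, row = max(matches, key=lambda m: m[0])
    return row if score > min_confidence else None


def fuzzy_extract(extraction_switch, search_columns, hocr_switch,
                  previous_fields, search_by_fields, search_by_hocr):
    """Apply the switch combinations from the table above.

    search_by_fields(columns) and search_by_hocr() are hypothetical
    callables that return a matching row (dict) or None.
    """
    def merged(row):
        # A found row overrides earlier values; otherwise they remain.
        if row is None:
            return dict(previous_fields)
        return {**previous_fields, **row}

    if extraction_switch != "ON":
        return dict(previous_fields)          # no fuzzy extraction
    if not search_columns:
        return merged(search_by_hocr())       # usual HOCR based extraction
    row = search_by_fields(search_columns)    # search on earlier field values
    if row is None and hocr_switch == "ON":
        row = search_by_hocr()                # fall back to HOCR content
    return merged(row)


previous = {"City": "Austin", "State": "TX", "VendorName": ""}
matches = [(40, {"VendorName": "Acme Ltd"}),
           (85, {"VendorName": "Acme Supplies"})]
by_fields = lambda columns: None              # field based search finds nothing
by_hocr = lambda: pick_best(matches, min_confidence=70)

result = fuzzy_extract("ON", ["City", "State"], "ON",
                       previous, by_fields, by_hocr)
print(result["VendorName"])
```

With the HOCR switch set to OFF in the same call, the field based search still finds nothing and the previously extracted values are returned unchanged.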