A well-formed set of HOCR XML files, placed in a hierarchical folder structure (Batch Class > Document Type > Page Type), is used to register a few standard HOCR XML documents with the Lucene search engine. This process is called learning because it feeds the XML files into Lucene’s memory by creating Lucene indexes. During classification, the HOCR files in a batch instance are compared with these stored indexes to find the best match and classify the pages. Learning is a one-time process. It makes classification fast because no index has to be generated at runtime to classify the documents.
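The idea of learning once and classifying by lookup can be illustrated with a toy sketch. This is not Lucene and not Ephesoft's implementation: the `learn` and `classify` functions, the term-overlap score, and the sample text are all assumptions made for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def learn(samples):
    """Build the index once: map each page-type label to its term counts."""
    return {label: Counter(tokenize(text)) for label, text in samples.items()}


def classify(index, page_text):
    """Return the label whose learned terms overlap most with the page."""
    page_terms = Counter(tokenize(page_text))

    def overlap(label):
        # Counter & Counter keeps the minimum count of each shared term.
        return sum((index[label] & page_terms).values())

    return max(index, key=overlap)


# Hypothetical text standing in for the OCR content of learned sample pages.
samples = {
    "HUD-1_First_Page": "settlement statement borrower seller lender loan number",
    "HUD-1_Last_Page": "total settlement charges signature date",
}

index = learn(samples)  # done once, ahead of time
result = classify(index, "Settlement Statement: name of borrower and lender")
print(result)
```

The point of the sketch is the split between the two phases: `learn` runs once and its result is reused, so `classify` only has to compare an incoming page against what is already stored.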

Steps of learning

  1. First, create the document type that Ephesoft has to recognize. Suppose the user has created the HUD-1 document type in batch class BC1.
  2. Edit BC1 and click the ‘Generate Folder’ button.




  3. Generate Folder creates the folders in which the TIFF files to be learned can be placed. In this case the folders are created under the “Ephesoft-install-dir\SharedFolders\BC1\lucene-search-classification-sample” folder. The following three subfolders are created:
     1. HUD-1_First_Page
     2. HUD-1_Last_Page
     3. HUD-1_Middle_Page

The first and last pages of the document must be copied into HUD-1_First_Page and HUD-1_Last_Page respectively, and all other pages of the document must be copied into HUD-1_Middle_Page.

In the provided sample for HUD-1, image 000001 is the first page and image 000002 is the last page of the document type HUD-1. All other pages belong to different document types. The sample does not have middle pages for the HUD-1 document type.

  4. Click the Learn Files button.



Ephesoft has now learned the HUD-1 document type and is ready to classify it.
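When many sample documents have to be learned, copying pages into the First, Middle and Last page folders by hand can be tedious. The sketch below shows one way to script that copying before Learn Files is clicked. The function `place_pages`, the page file names, and the temporary folder used in the demo are hypothetical; the script only copies files and does not call any Ephesoft API.

```python
import shutil
import tempfile
from pathlib import Path


def place_pages(pages, learn_root, doc_type):
    """Copy an ordered list of page images into the learning subfolders.

    The first page goes to <doc_type>_First_Page, the last page to
    <doc_type>_Last_Page, and everything in between to
    <doc_type>_Middle_Page. A single-page document is treated as a
    first page only.
    """
    learn_root = Path(learn_root)
    placed = {}
    for i, page in enumerate(pages):
        if i == 0:
            kind = "First_Page"
        elif i == len(pages) - 1:
            kind = "Last_Page"
        else:
            kind = "Middle_Page"
        target_dir = learn_root / f"{doc_type}_{kind}"
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(page, target_dir / Path(page).name)
        placed[Path(page).name] = target_dir.name
    return placed


# Demo in a temporary directory with empty placeholder files
# (hypothetical names; real input would be the scanned TIFF pages).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    source = tmp / "scans"
    source.mkdir()
    pages = []
    for name in ("page1.tif", "page2.tif", "page3.tif"):
        page = source / name
        page.write_bytes(b"")
        pages.append(page)
    placed = place_pages(pages, tmp / "learning-folder", "HUD-1")
    print(placed)
```

In a real installation, `learn_root` would point at the sample folder that Generate Folder created for the batch class, and `pages` would list one document's pages in reading order.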


The following are a few common error messages that may appear when learning fails:


[table caption="" width="800" colwidth="20|100|50" colalign="left|left|center|left|right"]
S. No.,Error message,Possible root cause
1.,Problem occurred while learning/Problem learning files.,Network connection failure.~~Multiple networks connected to the system (e.g. LAN and WLAN connected at the same time).~~License is not installed or is invalid.~~Tomcat server is not running.