A Document-Centric Strategy for GDPR Compliance

With the effective date of the General Data Protection Regulation (GDPR) fast approaching, now is the time to put a solid strategy in place for documents and images. Organizations not only need to implement processes and procedures for handling private information, but also need a firm evaluation of their current state to understand the high-risk areas of their business and their exposure. Below are the four key steps, as outlined by Microsoft's GDPR strategy, and how you can incorporate a document-centric view within your plan:


Discover

Discovery will probably be the most challenging step when it comes to documents and GDPR. Most enterprises have many document repositories. Just think of the modern workplace and all the locations where documents reside:

  • Network folders
  • Local folders
  • Sync technologies like Box, OneDrive, Dropbox, Google Drive
  • Corporate Enterprise Content Management (ECM) and Document Management (DM) systems
  • Line of Business systems that house documents
  • Emails and attachments

The ability to crawl and identify high risk entities within these locations is critical for compliance. Here is a checklist of required functionality when it comes to a technical solution:

  • Two-phase Identification - most of the technologies on the market use only pattern matching to identify personal information within documents. This can be problematic: it burdens staff with false positives and requires immense time to validate. With two-phase identification systems (like Ephesoft), documents are first classified as a specific type: agreement, application, correspondence, etc. This classification can be configured for an organization's specific document requirements, and can immediately identify a document as high risk. The second phase of risk identification is pattern matching, fuzzy database correlation and key value searching. This two-phase approach is required for accuracy and high confidence.
  • Optical Character Recognition (OCR) - images can be a very high risk type of document. To properly evaluate an image for risk, there needs to be a text conversion process. But it goes much further than that: the application also needs a voting and confidence engine. Images vary in quality, and a fax or "copy of a copy" can be problematic. With a confidence flag on both the overall document and the identified private information, images can be graded on overall quality and quality of data.
  • Open Architecture - proprietary systems cannot meet all the requirements that will be necessary for GDPR Discovery, and most organizations will need ultimate flexibility to modify and customize software for their unique needs and requirements. Using modular and open platforms will guarantee the best solution and fit for your needs.
  • Machine Learning - using a system that gets smarter with each day of use is required in today's modern world. A data capture solution with machine learning capabilities can learn new high risk documents, and evolve as an organization changes. Using ML tools can help organizations detect and assess data that is at risk under GDPR.
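The two-phase approach from the checklist above can be sketched in a few lines. This is a minimal illustration, not Ephesoft's implementation: the keyword lists and pattern set are assumptions standing in for trained classifiers and locale-specific PII libraries.

```python
import re

# Hypothetical phase-1 signatures -- a real system would use trained
# classifiers, not keyword lists.
DOC_TYPE_KEYWORDS = {
    "agreement": ["agreement", "hereinafter", "party"],
    "application": ["applicant", "application", "date of birth"],
    "correspondence": ["dear", "sincerely"],
}

# Phase-2 PII patterns (illustrative; real deployments use richer sets).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def classify(text: str) -> str:
    """Phase 1: assign a document type from keyword hits."""
    lowered = text.lower()
    scores = {t: sum(kw in lowered for kw in kws)
              for t, kws in DOC_TYPE_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def find_pii(text: str) -> dict:
    """Phase 2: pattern-match only once the type is known."""
    return {name: pat.findall(text)
            for name, pat in PII_PATTERNS.items() if pat.findall(text)}

def assess(text: str) -> dict:
    """Combine both phases into a single risk verdict for one document."""
    doc_type = classify(text)
    pii = find_pii(text)
    return {"type": doc_type, "pii": pii,
            "high_risk": doc_type != "unknown" and bool(pii)}
```

Because the document is typed first, a pattern hit inside a known high-risk type carries far more weight than a bare regex match, which is what cuts down the false positives mentioned above.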


Manage

Once a GDPR document inventory is complete and an organization understands its areas of document risk and exposure, a plan can be put in place to manage and govern the assets of its data subjects. This step in your GDPR document strategy can include the following:

  • Migrating high risk documents to a managed repository - if high risk documents exist outside of a governed and managed repository, the same tool that helps with discovery can also help with migration. As documents are classified, metadata can be extracted and the document moved into a new or existing system of record. You can see an example of contract migration to SharePoint Online here: Migrating Contracts and Data to SharePoint.
  • Implementing an intelligent document transport layer - creating a repeatable, standardized process for document ingestion and processing can flag new documents as they enter an organization's digital realm. This ensures proper governance and placement of high risk assets.
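The migration idea above -- classify, extract metadata, then file the document into a managed location -- can be sketched as follows. This is an illustrative local-filesystem stand-in; a real migration would call the target repository's API (e.g. SharePoint) and its metadata schema, and the function name and sidecar format here are assumptions.

```python
import shutil
from pathlib import Path

def migrate_document(src: Path, repo_root: Path,
                     doc_type: str, metadata: dict) -> Path:
    """Move a classified document into a managed repository, filed by
    document type, with its extracted metadata written alongside as a
    sidecar file. Illustrative sketch only."""
    dest_dir = repo_root / doc_type
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.move(str(src), str(dest))
    # Persist extracted metadata next to the document as key=value lines.
    sidecar = dest.parent / (dest.name + ".meta")
    sidecar.write_text("\n".join(f"{k}={v}" for k, v in metadata.items()))
    return dest
```

The key point is that classification and extraction happen once, at migration time, so the system of record receives both the document and its index data together.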


Protect

In the protection step, organizations need to put security controls on all documents deemed high risk. But the protection step also requires thought about future documents, and protecting new private assets. As outlined in the "Manage" section above, an effective document transport technology will identify and route newly ingested documents to a protected location. Organizations also need to implement real-time controls for high risk identification and classification. Here are some examples:

  • Continuously discover - you can protect those documents that are in your managed repository, but what about newly generated personal data? As new policies and procedures are implemented, organizations need to use their discovery technology to constantly monitor and find new high risk entities.
  • Embed classification technology - enabling detection in your everyday applications can reduce risk and ensure compliance. Modern classification platforms have web services enabled in cloud and on-premises solutions to help. You can see an example here: Real-time GDPR Scanning and Detection in SharePoint.
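One sweep of the continuous-discovery loop described above might look like the sketch below: walk a location, skip files already reviewed, and flag anything new that matches a risk pattern. The single SSN pattern and `.txt` filter are stand-ins for a full classification and OCR pipeline.

```python
import re
from pathlib import Path

# Illustrative pattern; real deployments use locale-specific PII sets.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def scan_for_new_risk(root: Path, already_seen: set) -> list:
    """One pass of a continuous-discovery loop: flag files not yet
    reviewed that contain an SSN-like pattern. Run on a schedule so
    newly generated personal data is caught as it appears."""
    flagged = []
    for path in sorted(root.rglob("*.txt")):
        if path in already_seen:
            continue
        already_seen.add(path)
        if SSN_PATTERN.search(path.read_text(errors="ignore")):
            flagged.append(path)
    return flagged
```

Persisting the `already_seen` set between runs is what turns a one-off audit into the ongoing monitoring the protection step calls for.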


Report

The new GDPR standard is all about accurate record keeping, which provides transparency and overall accountability. Knowing all the document types that can contain personal information, and the processes around them, is critical to ensure compliance. An audit of policies and procedures is sure to require records of document creation or ingestion, how a document was handled, and where it was ultimately placed under management. All of the technologies mentioned in this article have broad reporting and analytics capabilities.
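The record-keeping requirement above amounts to an audit trail: for each document, log what happened, who did it, and where the document ended up. A minimal entry might look like the sketch below; the field names are illustrative assumptions, not a prescribed GDPR schema.

```python
import datetime

def audit_record(doc_id: str, event: str, actor: str, location: str) -> dict:
    """One entry in a document audit trail: what happened, to which
    document, by whom, and where it now lives."""
    return {
        "doc_id": doc_id,
        "event": event,  # e.g. "ingested", "classified", "migrated"
        "actor": actor,
        "location": location,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
```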

GDPR Dashboard in Ephesoft Insight

With the complexities of GDPR, standard reporting will not suffice in most cases, and the ability to perform deep analytics to track and identify key data and documents will be a requirement. For more information on GDPR and how Ephesoft can help you in your strategy, please contact info@ephesoft.com.

OCR 2.0 - Using Adaptive OCR Technology & Analytics to Drive Efficiency

Optical Character Recognition technology, or OCR, has been around for quite some time.

It became mainstream back in the '70s when Ray Kurzweil developed a technology to help the visually impaired. He quickly realized the broad commercial implications of the technology, and so did Xerox, which purchased his company. From there, the technology experienced broad adoption across all types of use cases.

At its simplest, OCR is a means to take an image and convert recognized characters to text. In the Enterprise Content Management (ECM) world, it is this technology that provides a broad range of metadata and content collection methods as documents are scanned and processed. Here are the basic legacy forms of OCR that can be leveraged:

  • Full-Text OCR - converts the entire document image to text, allowing full-text search capabilities. Using this OCR type, documents are typically converted to an Image+Text PDF, which can be crawled, and the content made fully searchable.
  • Zone OCR - zoning provides the ability to extract text from a specific location on the page. In this form of "templated" processing, specific OCR metadata can be extracted and mapped to an ECM system index field or column. This method is appropriate for structured documents that always have the data in the same location.
  • Pattern Matching OCR - pattern matching is purely a method to filter, or match patterns within, OCR text. This technique can provide some capabilities when it comes to extracting data from unstructured, or non-homogeneous, documents. For example, you could extract a Social Security Number pattern (XXX-XX-XXXX) from the OCR text.
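The zone and pattern-matching forms above can be sketched concretely. The `(text, x, y)` word tuples and rectangle shape are assumptions for illustration; real OCR engines emit richer position data.

```python
import re

def zone_ocr(words, zone):
    """Zone OCR sketch: given OCR output as (text, x, y) word positions
    and a template-defined rectangle (x0, y0, x1, y1), return the text
    that falls inside the zone."""
    x0, y0, x1, y1 = zone
    return " ".join(w for w, x, y in words if x0 <= x <= x1 and y0 <= y <= y1)

def pattern_match_ocr(ocr_text):
    """Pattern-matching OCR: filter raw text for the XXX-XX-XXXX Social
    Security Number pattern from the example above."""
    return re.findall(r"\b\d{3}-\d{2}-\d{4}\b", ocr_text)
```

Note the contrast: zone OCR needs the data in a fixed location, while pattern matching works anywhere on the page but only for values with a recognizable shape.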

These forms of OCR are deemed legacy methods of extraction, and although they can provide some value when utilized within any document process, they are purely data driven at the text level.

In steps OCR 2.0. Today at Ephesoft we leverage OCR as the very bottom of our document analytics and intelligence stack. The OCR text is now pushed through algorithms that create meaning out of all types of dimensions: location, size, font, patterns, values, zones, numbers, and more. (You can read about this patented technology here: Document Analytics and Why It Matters in Capture and OCR.) So rather than being completely data-centric, functioning only at the text layer, we now create a high-functioning intelligence layer that can be used beyond just text searching and metadata. And the best part? This technology has been extended to non-scanned files like Office documents.

Examples?  See below:

  • Multi-dimensional Classification - using that analysis capability (with OCR as algorithm input), and all the collected dimensions of the document, the document type or content type can now be accurately identified. As documents are fed into any system, they can be intelligently classified, and that information is now actionable with workflows, retention policies, security restrictions and more.
  • Machine Learning - legacy OCR technology provided no means or method to "get smarter" as documents were processed. Looking at pure text alone, it either recognized it or not. With a machine learning layer, you now have a system that gets more efficient the more you use it. The key here is that learned intelligence must span documents; it cannot be tied to any one item. It's this added efficiency that can drive usage and adoption through ease of use.
  • Document Analytics, Accuracy, and Extraction - with legacy OCR, extracting the information you need can be problematic at best. How do you raise confidence that the information you have is accurate? With an analysis engine, we look not just at the text, but at where it sits, what surrounds it, and known patterns or libraries. This added layer provides the ability to express higher confidence in data extraction, and to make sure you are putting the right data into your backend systems.
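The last bullet's idea -- confidence built from more than the text itself -- can be illustrated with a toy scoring function. The weights and inputs here are assumptions for the sketch, not Ephesoft's actual scoring model.

```python
def extraction_confidence(candidate, context_words, expected_labels, pattern_ok):
    """Score an extracted value using more than the raw text: does it
    match the expected pattern, and does an expected label (e.g. "SSN")
    appear among the surrounding words? Weights are illustrative."""
    score = 0.0
    if pattern_ok:
        score += 0.6  # the value itself looks right
    if any(lbl.lower() in w.lower()
           for w in context_words for lbl in expected_labels):
        score += 0.4  # the surrounding context supports it
    return score
```

A value that matches the pattern *and* sits next to its expected label scores far higher than a bare regex hit, which is exactly the extra confidence the analysis layer provides over legacy OCR.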