Regular Expressions

Introduction

This document provides information about the regular expression patterns used by the extraction module of Ephesoft.

Overview

Regular expressions are a way to describe a set of strings based on common characteristics shared by each string in the set. They can be used to search, edit or manipulate text and data. The table below lists the basic character class constructs. The left-hand column specifies the regular expression construct, while the right-hand column describes the conditions under which each construct will match.

Construct	Description
[abc]	a, b, or c (simple class)
[^abc]	Any character except a, b, or c (negation)
[a-zA-Z]	a through z, or A through Z, inclusive (range)
[a-d[m-p]]	a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]	d, e, or f (intersection)
[a-z&&[^bc]]	a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]	a through z, and not m through p: [a-lq-z] (subtraction)
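The union, intersection and subtraction forms in the table follow the syntax of Java's java.util.regex package. The sketch below checks a few of them, assuming the patterns are evaluated by that package; the class and method names are illustrative only and are not part of Ephesoft.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name is illustrative and not part of Ephesoft.
class CharacterClassDemo {

    // True when the entire input is a single character belonging to the class.
    static boolean inClass(String charClass, String input) {
        return Pattern.matches(charClass, input);
    }

    public static void main(String[] args) {
        System.out.println(inClass("[a-d[m-p]]", "n"));    // union: n lies in m-p, prints true
        System.out.println(inClass("[a-z&&[def]]", "e"));  // intersection: e is in both sets, prints true
        System.out.println(inClass("[a-z&&[^bc]]", "b"));  // subtraction: b is excluded, prints false
    }
}
```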

Predefined character classes of regular expressions

Predefined character classes offer convenient shorthands for commonly used character classes:

Construct	Description
.	Any character (may or may not match line terminators)
\d	A digit: [0-9]
\D	A non-digit: [^0-9]
\s	A whitespace character: [ \t\n\x0B\f\r]
\S	A non-whitespace character: [^\s]
\w	A word character: [a-zA-Z_0-9]
\W	A non-word character: [^\w]
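As a rough illustration of the shorthands above, the following sketch collects every match of a predefined class in a made-up sample line. It assumes Java's java.util.regex; the helper name findAll and the sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Ephesoft.
class PredefinedClassDemo {

    // Collects every non-overlapping match of the regex, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String line = "Invoice 4521 net_30";        // made-up sample text
        System.out.println(findAll("\\d+", line));  // runs of digits: [4521, 30]
        System.out.println(findAll("\\w+", line));  // runs of word characters: [Invoice, 4521, net_30]
        System.out.println(findAll("\\s", line).size()); // whitespace characters: 2
    }
}
```

Note that the underscore counts as a word character, so net_30 is returned as one run by \w+.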

Quantifiers

Quantifiers allow you to specify the number of occurrences to match against. List of quantifiers:

Pattern	Meaning
X?	X, once or not at all
X*	X, zero or more times
X+	X, one or more times
X{n}	X, exactly n times
X{n,}	X, at least n times
X{n,m}	X, at least n but not more than m times
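The quantifiers in the table can be tried out with a small helper such as the one below. This is a minimal sketch assuming Java's java.util.regex; firstMatch is a hypothetical helper name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Ephesoft.
class QuantifierDemo {

    // Returns the text of the first match, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("ab?", "abb"));          // b at most once: ab
        System.out.println(firstMatch("ab*", "abb"));          // b any number of times: abb
        System.out.println(firstMatch("ab+", "a"));            // needs at least one b: null
        System.out.println(firstMatch("\\d{2,4}", "123456"));  // greedy, at most four digits: 1234
    }
}
```

The last call shows that these quantifiers are greedy: X{n,m} takes as many repetitions as it can, up to m, before the rest of the pattern is tried.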

Capturing groups

Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses. For example, the regular expression (dog) creates a single group containing the letters “d”, “o”, and “g”. The portion of the input string that matches the capturing group is saved in memory for later recall via backreferences.

Capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:

1	((A)(B(C)))
2	(A)
3	(B(C))
4	(C)
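The numbering rule can be observed directly by asking a matcher for each group in turn. The sketch below assumes Java's java.util.regex, where group 0 always denotes the entire match; the class and method names are illustrative.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Ephesoft.
class GroupNumberingDemo {

    // Returns group 0 (the whole match) followed by groups 1..n,
    // or an empty array when the input does not match the regex as a whole.
    static String[] groups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] out = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            out[i] = m.group(i);
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [ABC, ABC, A, BC, C]: the whole match, then groups 1 to 4.
        System.out.println(Arrays.toString(groups("((A)(B(C)))", "ABC")));
    }
}
```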

Backreferences

The section of the input string matching the capturing group(s) is saved in memory for later recall via backreference. A backreference is specified in the regular expression as a backslash (\) followed by a digit indicating the number of the group to be recalled. For example, the expression (\d\d) defines one capturing group matching two digits in a row, which can be recalled later in the expression via the backreference \1.

Example: To match any 2 digits, followed by the exact same two digits, use (\d\d)\1 as the regular expression:

Regular expression: (\d\d)\1

Input string: 1212

Result: found the text “1212” starting at index 0 and ending at index 4.
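The backreference example can be reproduced with a few lines of Java. This is a sketch assuming java.util.regex, in which the end index of a match is exclusive; firstSpan is a hypothetical helper name.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Ephesoft.
class BackreferenceDemo {

    // Returns {start, end} of the first match (end is exclusive), or null if there is none.
    static int[] firstSpan(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new int[] {m.start(), m.end()} : null;
    }

    public static void main(String[] args) {
        // \1 must repeat exactly what group 1 captured.
        System.out.println(Arrays.toString(firstSpan("(\\d\\d)\\1", "1212"))); // [0, 4]
        System.out.println(firstSpan("(\\d\\d)\\1", "1234") == null);          // true: "34" is not "12"
    }
}
```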

Capturing Groups and Character Classes with Quantifiers

Quantifiers can be applied to capturing groups and to character classes. Examples: (abc)+ matches the group “abc” one or more times; [abc]+ matches a, b, or c one or more times.
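The difference between quantifying a group and quantifying a character class shows up as soon as the letters appear in a different order. A minimal sketch, assuming Java's java.util.regex and using illustrative names:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Ephesoft.
class GroupVsClassDemo {

    // True when the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(fullMatch("(abc)+", "abcabc")); // true: the sequence abc, twice
        System.out.println(fullMatch("(abc)+", "cabba"));  // false: not a repetition of abc
        System.out.println(fullMatch("[abc]+", "cabba"));  // true: any mix of a, b and c
    }
}
```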

Boundary Matchers

Boundary matchers make a pattern more precise by specifying where in the input string a match must occur, e.g. at the beginning or end of a line, on a word boundary, or at the end of the previous match. The following table lists and explains all the boundary matchers.

Boundary Construct	Description
^	The beginning of a line
$	End of a line
\b	A word boundary
\B	Non-word boundary
\A	Beginning of the input
\G	The end of a previous match
\Z	The end of the input but for the final terminator, if any
\z	The end of the input

Example:

Regular expression: ^dcma\w*

Input string: dcma ephesoft

Match found: true
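The sketch below reruns the example and also contrasts \Z with \z on an input that ends with a line terminator. It assumes Java's java.util.regex with default flags; the class and method names are illustrative.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Ephesoft.
class BoundaryDemo {

    // True when the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("^dcma\\w*", "dcma ephesoft")); // true: dcma starts the input
        System.out.println(found("^dcma\\w*", "ephesoft dcma")); // false: dcma is not at the start
        System.out.println(found("abc\\Z", "abc\n"));            // true: \Z ignores the final terminator
        System.out.println(found("abc\\z", "abc\n"));            // false: \z requires the very end
    }
}
```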

Grouping Constructs

Grouping constructs allow you to capture groups of sub-expressions and to increase the efficiency of regular expressions with non-capturing lookahead and lookbehind modifiers. The following table describes the regular expression grouping constructs.

Grouping Construct	Description
(?i)	Turn on case insensitivity for the remainder of the regular expression. (Older regex flavors may turn it on for the entire regex.) For example, te(?i)st matches teST but not TEST.
(?: )	Non-capturing group.
(?= )	Zero-width positive lookahead assertion. Continues match only if the sub-expression matches at this position on the right. For example, \w+(?=\d) matches a word followed by a digit, without matching the digit. This construct does not backtrack.
(?! )	Zero-width negative lookahead assertion. Continues match only if the sub-expression does not match at this position on the right. For example, \b(?!un)\w+\b matches words that do not begin with un.
(?<= )	Zero-width positive lookbehind assertion. Continues match only if the sub-expression matches at this position on the left. For example, (?<=19)99 matches instances of 99 that follow 19. This construct does not backtrack.
(?<! )	Zero-width negative lookbehind assertion. Continues match only if the sub-expression does not match at the position on the left.
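The examples from the table can be exercised as follows. This is a sketch under the assumption that the patterns run on Java's java.util.regex; the helper firstMatch and the sample inputs are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Ephesoft.
class GroupingDemo {

    // Returns the text of the first match, or null when nothing matches.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("te(?i)st", "teST"));              // teST
        System.out.println(firstMatch("te(?i)st", "TEST"));              // null: "te" is still case sensitive
        System.out.println(firstMatch("\\w+(?=\\d)", "abc1"));           // abc: the digit is not consumed
        System.out.println(firstMatch("\\b(?!un)\\w+\\b", "undo redo")); // redo
        System.out.println(firstMatch("(?<=19)99", "1999"));             // 99
        System.out.println(firstMatch("(?<=19)99", "2099"));             // null: 99 follows 20, not 19
        // A non-capturing group does not receive a group number.
        System.out.println(Pattern.compile("(?:ab)+(c)").matcher("").groupCount()); // 1
    }
}
```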

Sample regular expressions:

Regular expression for email address: [_A-Za-z0-9-]+(\.[_A-Za-z0-9-]+)*@[A-Za-z0-9]+(\.[A-Za-z0-9]+)*(\.[A-Za-z]{2,})
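To see what the email pattern accepts, it can be applied to a few made-up addresses. The sketch assumes Java's java.util.regex and whole-string matching; isEmail is a hypothetical helper name.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Ephesoft.
class EmailDemo {

    // The email pattern from the text, written as a Java string literal.
    static final Pattern EMAIL = Pattern.compile(
            "[_A-Za-z0-9-]+(\\.[_A-Za-z0-9-]+)*@[A-Za-z0-9]+(\\.[A-Za-z0-9]+)*(\\.[A-Za-z]{2,})");

    // True when the whole input matches the pattern.
    static boolean isEmail(String input) {
        return EMAIL.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isEmail("john.doe@example.com")); // true
        System.out.println(isEmail("john.doe@example"));     // false: no top-level domain
        System.out.println(isEmail("john@ex-ample.com"));    // false: see note below
    }
}
```

As written, the domain part of the pattern accepts only letters and digits between the dots, so a hyphenated domain is rejected.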

Regular expression for date: (0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.]\d\d([0-9]{2})?

This will match the date in the following formats: mm/dd/yyyy or mm.dd.yyyy or mm-dd-yyyy or mm/dd/yy or mm.dd.yy or mm-dd-yy

Regular expression for time in 12-hour format: (1[012]|[1-9]):[0-5][0-9](\s)?(?i)(am|pm)

This will match the time in the following formats: 12:45am or 1:34pm or 7:56AM or 2:57PM or 1:45 PM or 2:34 AM
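The date and time patterns can be checked against sample values as sketched below. The code assumes Java's java.util.regex and whole-string matching, and uses the minutes class written without an embedded space, [0-5][0-9]. The helper names and sample values are illustrative.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Ephesoft.
class DateTimeDemo {

    // Month-first date pattern from the text.
    static final Pattern DATE = Pattern.compile(
            "(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.]\\d\\d([0-9]{2})?");

    // 12-hour time pattern from the text; (?i) makes am/pm case insensitive.
    static final Pattern TIME = Pattern.compile(
            "(1[012]|[1-9]):[0-5][0-9](\\s)?(?i)(am|pm)");

    static boolean isDate(String input) {
        return DATE.matcher(input).matches();
    }

    static boolean isTime(String input) {
        return TIME.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDate("11/22/2012")); // true
        System.out.println(isDate("02-28-12"));   // true: two-digit year
        System.out.println(isDate("22/09/2011")); // false: 22 is not a valid month
        System.out.println(isTime("1:45 PM"));    // true
        System.out.println(isTime("13:00pm"));    // false: hour out of range
    }
}
```

Because the month comes first, a day-first value such as 22/09/2011 is rejected by this pattern.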

Regular expression for the price: \d+[,]{0,1}\d+[\.]?\d{1,2}

This will match prices in the following formats: 123.89 or 12,889.90. It will not match single-digit or two-digit prices, because the pattern requires at least three digits in total.
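The behavior of the price pattern, including its lower limit on the number of digits, can be confirmed with a short check. This sketch assumes Java's java.util.regex and whole-string matching; isPrice is a hypothetical helper name.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Ephesoft.
class PriceDemo {

    // Price pattern from the text: digits, optional comma, digits, optional dot, 1-2 digits.
    static final Pattern PRICE = Pattern.compile("\\d+[,]{0,1}\\d+[\\.]?\\d{1,2}");

    // True when the whole input matches the pattern.
    static boolean isPrice(String input) {
        return PRICE.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPrice("123.89"));    // true
        System.out.println(isPrice("12,889.90")); // true
        System.out.println(isPrice("12"));        // false: fewer than three digits
        System.out.println(isPrice("9.99"));      // false: only one digit before the dot
    }
}
```

The last case is worth noting: each of the two \d+ parts needs at least one digit before the decimal point, so a price below 10 is not matched.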

Word boundary match example: To match whole words only, use ‘\b’ in the regular expression. For example, to match the word “dcma” only where it appears as a whole word in the input data “dcma dcmaEphesoftData”, use the regular expression below.

Regular expression: \bdcma\b

Lookahead and lookbehind example: To match something not followed or preceded by something else, use lookahead and lookbehind assertions.

Matching “date” not preceded by “due”

Input data: due date is 22/11/2012 and the end date is 11/11/1999

Regular expression: (?<!due\s)date

Matching “date” not followed by “due”

Input data: Payment date due is 22/11/2012 and actual date is 11/11/1999

Regular expression: date(?!\sdue)

In both cases there will be only one match of the “date” string.
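The match counts stated for the whole-word and lookaround examples can be verified by counting matches. The sketch assumes Java's java.util.regex and writes the assertions without embedded spaces, as (?<!due\s)date and date(?!\sdue); countMatches is a hypothetical helper name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Ephesoft.
class WholeWordDemo {

    // Counts the non-overlapping matches of the regex in the input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String words = "dcma dcmaEphesoftData";
        System.out.println(countMatches("dcma", words));         // 2: also hits the prefix of the second word
        System.out.println(countMatches("\\bdcma\\b", words));   // 1: whole word only

        String first = "due date is 22/11/2012 and the end date is 11/11/1999";
        System.out.println(countMatches("(?<!due\\s)date", first));  // 1: only the "end date"

        String second = "Payment date due is 22/11/2012 and actual date is 11/11/1999";
        System.out.println(countMatches("date(?!\\sdue)", second));  // 1: only the "actual date"
    }
}
```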

Pattern matching two words near each other

This pattern consists of three parts: the first word, a certain number of unspecified words, and the second word. An unspecified word can be matched with the shorthand character class ‘\w+’. The spaces and other characters between the words can be matched with ‘\W+’ (uppercase W this time).

For example, to find the pair of words ‘payment’ and ‘bank’ in the data:

Regex pattern: payment\W+(?:\w+\W+){1,6}?bank

The above regex pattern will match the pair of words (payment, bank) when at least one and at most six words separate them.
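The limits of one to six intervening words can be checked with made-up sentences, as in the sketch below. It assumes Java's java.util.regex; the class name, method name and sample sentences are illustrative.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Ephesoft.
class NearWordsDemo {

    // "payment", then one to six intervening words, then "bank".
    static final Pattern NEAR = Pattern.compile("payment\\W+(?:\\w+\\W+){1,6}?bank");

    // True when the two words occur within the allowed distance somewhere in the input.
    static boolean near(String input) {
        return NEAR.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(near("payment made to the bank"));   // true: three words in between
        System.out.println(near("payment bank"));               // false: no word in between
        System.out.println(near("payment a b c d e f g bank")); // false: seven words in between
    }
}
```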

Regex Patterns in Ephesoft

Regular expression behavior in table extraction: It supports multiword capture.

Regular expression behavior in KV extraction: Uses word-based extraction.

Usage of ‘pattern’ field in ‘document Index Field Details’

In the document index field details, the admin can enter semicolon-separated values in the ‘pattern’ field. The last value in the pattern field is a regular expression used to match the data, and the preceding values are used as key values. That is, there may be multiple matches for the regex pattern, but only those matches preceded by the specified key values are wanted.

For example:

Regex pattern: Invoice; Date; [0-9]{2}/[0-9]{2}/[0-9]{2,4}

This will only match those dates which are preceded by the strings ‘Invoice’ and ‘Date’.

Usage of Multi word in Key Pattern for K-V Extraction

Multiword capturing in K-V extraction is available only for the key pattern, not for the value pattern. For example, to capture the value ‘22/09/2011’ from the input data below, the following key and value patterns can be used.

Input data: Invoice date 22/09/2011

Key pattern: Invoice date

Value pattern (day first, dd/mm/yyyy): (0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.]\d\d([0-9]{2})?

How not to capture certain values for Key Pattern and Value pattern

To match something not followed or preceded by something else, use lookahead and lookbehind assertions, e.g.:

Matching “date” not preceded by “due”

Input data: due date is 22/11/2012 and end date is 11/11/1999

Regular expression: (?<!due\s)date

Matching “date” not followed by “due”

Input data: Payment date due is 22/11/2012 and the actual date is 11/11/1999

Regular expression: date(?!\sdue)

In both cases there will be only one match of the “date” string.

Usage of multi word capture in Table Extraction, which is different than Value Pattern in K-V Extraction

For example, consider the following image data:

START

Date	Product	Quantity	Price
11/22/2012	iPod touch	5	25000.00
22/05/2012	Laptop	2	30000.50

END

In the above table, the multiword data “iPod touch” can be captured using the regular expression [A-Za-z\s]+ (the character class includes \s, so a match can span the space between words).
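The sketch below shows why whitespace in the character class, together with a + quantifier, is needed to cover a multiword cell. It assumes Java's java.util.regex and applies the pattern to the cell text alone; how Ephesoft isolates a table cell before matching is not covered here, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Ephesoft.
class TableCellDemo {

    // True when the whole cell text matches the column pattern.
    static boolean cellMatches(String regex, String cell) {
        return Pattern.matches(regex, cell);
    }

    public static void main(String[] args) {
        String cell = "iPod touch"; // Product cell from the sample table
        System.out.println(cellMatches("[A-Za-z\\s]+", cell)); // true: letters and spaces, repeated
        System.out.println(cellMatches("[A-Za-z]+", cell));    // false: the space is not allowed
        System.out.println(cellMatches("[A-Za-z\\s]", cell));  // false: without +, only one character
    }
}
```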

In K-V extraction, however, multiword capture is not supported for the ‘value’ pattern.
