Regular Expressions

A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern.

Regex examples

The simplest regular expression is a literal string. For example, the regex Hello World matches the string "Hello World". The . (dot) is another example of a regular expression: it matches any single character, such as "a" or "1".

The following table lists several regular expressions and describes what each one matches.

Regex             Matches
this is text      Matches exactly "this is text".
this\s+is\s+text  Matches the word "this", followed by one or more whitespace characters, the word "is", one or more whitespace characters, and the word "text".
^\d+(\.\d+)?      ^ requires the pattern to start at the beginning of a line. \d+ matches one or more digits. \. matches a literal ".", the parentheses group \.\d+, and the ? makes that group optional. Matches, for example, "5", "1.5" and "2.21".
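To make the table concrete, here is a minimal sketch of the three patterns using Python's re module. The choice of Python and the sample strings are assumptions for illustration; the text does not prescribe a language, and other regex engines may differ in detail.

```python
import re

# Literal pattern: matches only the exact text.
literal = re.compile(r"this is text")
# \s+ allows one or more whitespace characters between the words.
flexible = re.compile(r"this\s+is\s+text")
# Digits with an optional fractional part, anchored at the start of the input.
number = re.compile(r"^\d+(\.\d+)?")

literal_hit = literal.search("this is text") is not None         # True
literal_miss = literal.search("this  is text") is not None       # False: two spaces
flexible_hit = flexible.search("this \t is\n text") is not None  # True

# group(0) is the text matched by the whole pattern.
numbers = [number.match(s).group(0) for s in ("5", "1.5", "2.21 apples")]

print(literal_hit, literal_miss, flexible_hit)
print(numbers)  # ['5', '1.5', '2.21']
```

The literal pattern fails as soon as the spacing differs, which is the case the \s+ variant is meant to tolerate.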

 

 

Common matching symbols

Regular Expression  Description
.                   Matches any single character (line breaks only in single line mode).
^regex              Finds regex only at the beginning of the line.
regex$              Finds regex only at the end of the line.
[abc]               Set definition: matches the letter a, b or c.
[abc][vz]           Set definition: matches a, b or c, followed by either v or z.
[^abc]              A caret as the first character inside square brackets negates the set. This pattern matches any character except a, b or c.
[a-d1-7]            Ranges: matches a single letter from a to d or a single digit from 1 to 7, but not the sequence d1.
X|Z                 Finds X or Z.
XZ                  Finds X directly followed by Z.
$                   Checks whether a line end follows.
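The sketch below tries the sets, ranges, alternation and anchors from this table in Python's re module. Python and the sample inputs are assumptions made for illustration, not something the text specifies.

```python
import re

# [abc][vz]: one character from the first set, then one from the second.
pairs = re.findall(r"[abc][vz]", "av bz cx dz")   # ['av', 'bz']

# [^abc]: any single character except a, b or c.
others = re.findall(r"[^abc]", "abcde")           # ['d', 'e']

# [a-d1-7]: each match is ONE character, so "d1" yields two separate matches.
ranged = re.findall(r"[a-d1-7]", "d1 e8 a7")      # ['d', '1', 'a', '7']

# X|Z: alternation.
either = re.findall(r"X|Z", "XYZ")                # ['X', 'Z']

# ^ and $ tie the pattern to the start and end of the input.
anchored = re.search(r"^abc$", "abc") is not None       # True
not_anchored = re.search(r"^abc$", "xabc") is not None  # False

print(pairs, others, ranged, either, anchored, not_anchored)
```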

 

 

Meta characters

The following meta characters have a pre-defined meaning and make certain common patterns easier to use, e.g., \d instead of [0-9].

Regular Expression  Description
\d                  Any digit; short for [0-9].
\D                  A non-digit; short for [^0-9].
\s                  A whitespace character; short for [ \t\n\x0b\r\f].
\S                  A non-whitespace character; short for [^\s].
\w                  A word character; short for [a-zA-Z_0-9].
\W                  A non-word character; short for [^\w].
\S+                 One or more non-whitespace characters.
\b                  Matches a word boundary, where a word character is [a-zA-Z0-9_].

Each meta character is named after the first letter of what it represents: digit, space, word, and boundary. The uppercase variant matches the opposite.
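As a rough illustration, the following sketch applies the meta characters in Python's re module (an assumption; the text names no language). In Python, \d, \w and \s also match non-ASCII characters by default, so the re.ASCII flag is used here to get the ASCII sets listed in the table. The sample text is made up.

```python
import re

text = "Order 66 shipped to Bay_7 on 2024-05-01"

# re.ASCII limits \d and \w to [0-9] and [a-zA-Z0-9_].
digits = re.findall(r"\d+", text, flags=re.ASCII)
words = re.findall(r"\w+", text, flags=re.ASCII)

# \S+ collects runs of non-whitespace characters, so the date stays in one piece.
tokens = re.findall(r"\S+", text)

# \b keeps "on" from matching inside longer words such as "upon" or "pond".
whole_word = re.findall(r"\bon\b", "on and on, upon a pond", flags=re.ASCII)

print(digits)      # ['66', '7', '2024', '05', '01']
print(tokens)      # ['Order', '66', 'shipped', 'to', 'Bay_7', 'on', '2024-05-01']
print(whole_word)  # ['on', 'on']
```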

 

Quantifier

A quantifier defines how often an element can occur. The symbols ?, *, + and {} specify how many times the preceding element may repeat.

Regular Expression  Description and examples
*                   Occurs zero or more times; short for {0,}. Example: X* finds zero or more letters X, and .* finds any character sequence.
+                   Occurs one or more times; short for {1,}. Example: X+ finds one or more letters X.
?                   Occurs zero or one time; short for {0,1}. Example: X? finds zero or exactly one letter X.
{X}                 Occurs exactly X times; {} gives the number of repetitions of the preceding element. Example: \d{3} finds three digits, and .{10} finds any character sequence of length 10.
{X,Y}               Occurs between X and Y times. Example: \d{1,4} means \d must occur at least once and at most four times.
*?                  A ? after a quantifier makes it a reluctant quantifier. It tries to find the smallest match, so the regular expression stops at the first point where the rest of the pattern matches.
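The difference between a greedy and a reluctant quantifier is easiest to see side by side. The sketch below uses Python's re module and invented sample strings, both of which are assumptions for illustration.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* takes as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.*>", html)
# Reluctant: .*? takes as little as possible, so each tag is matched on its own.
reluctant = re.findall(r"<.*?>", html)

# Counted repetition.
three_digits = re.findall(r"\d{3}", "12 345 6789")  # ['345', '678']
one_to_four = re.findall(r"\d{1,4}", "7 12345")     # ['7', '1234', '5']

# + needs at least one X; ? makes the preceding element optional.
runs = re.findall(r"X+", "aXbXXX")                  # ['X', 'XXX']
optional = re.findall(r"colou?r", "color colour")   # ['color', 'colour']

print(greedy)     # ['<b>bold</b> and <i>italic</i>']
print(reluctant)  # ['<b>', '</b>', '<i>', '</i>']
```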

 
 

Negative look ahead

Negative look ahead lets you exclude a pattern: you can require that a string is not followed by another string. A negative look ahead is defined via (?!pattern). For example, the following matches "a" only if it is not followed by "b": a(?!b)
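A small sketch of a(?!b) in Python's re module (an assumed language; the input string is invented) shows that the look ahead only inspects the following text and does not become part of the match.

```python
import re

pattern = re.compile(r"a(?!b)")

# Positions of every "a" that is not directly followed by "b".
# Index 0 is skipped ("ab"); index 6 matches because nothing follows it.
starts = [m.start() for m in pattern.finditer("ab ac a")]

# The look ahead consumes nothing: the match is just "a", not "ac".
matched_text = pattern.search("ac").group(0)

print(starts)        # [3, 6]
print(matched_text)  # a
```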

Specifying modes inside the regular expression

You can add the mode modifiers to the start of the regex. To specify multiple modes, simply put them together as in (?ismx).

  • (?i) makes the regex case insensitive.
  • (?s) for "single line mode" makes the dot match all characters, including line breaks.
  • (?m) for "multi-line mode" makes the caret and dollar match at the start and end of each line in the subject string.
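The three modes can be tried out as follows. This is a sketch in Python's re module, which happens to accept the same inline modifiers at the start of a pattern; the language and the sample text are assumptions.

```python
import re

text = "First line\nsecond LINE"

# (?i): case-insensitive matching.
case_insensitive = re.findall(r"(?i)line", text)             # ['line', 'LINE']

# Without (?s) the dot does not cross the line break, so there is no match.
dot_default = re.search(r"First.*LINE", text) is not None    # False
# With (?s) the dot also matches "\n".
dot_all = re.search(r"(?s)First.*LINE", text) is not None    # True

# (?m): ^ and $ match at the start and end of every line.
line_starts = re.findall(r"(?m)^\w+", text)                  # ['First', 'second']

# Modes can be combined in one group.
combined = re.findall(r"(?im)^second line$", text)           # ['second LINE']

print(case_insensitive, dot_default, dot_all, line_starts, combined)
```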


Backslashes

The backslash \ is an escape character, which means it has a predefined meaning inside a pattern. To match a single literal backslash, you have to write a double backslash \\.
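How the double backslash looks in practice depends on the host language, because string literals often apply their own escaping before the regex engine sees the pattern. The sketch below assumes Python, where a raw string (r"...") passes backslashes through unchanged; the file path is a made-up example.

```python
import re

# An ordinary string literal: each \\ stores ONE backslash, giving C:\temp\notes.txt
path = "C:\\temp\\notes.txt"

# Raw string: the regex engine receives \\ and matches one literal backslash.
parts = re.split(r"\\", path)

# Ordinary string: the backslashes must be doubled once more for the literal itself.
same_parts = re.split("\\\\", path)

# Escaping also removes the special meaning of other characters, such as the dot.
literal_dots = re.findall(r"\.", "a.b.c")

print(parts)         # ['C:', 'temp', 'notes.txt']
print(literal_dots)  # ['.', '.']
```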