Regular Expression Examples

"Regular expressions" refers to a standardized pattern-matching syntax. A pattern can look something like "/[a-z_.\-]{3,8}/", which is difficult to grasp upon first exposure. However, regular expressions are a powerful tool for processing text.

Regular expressions are supported by the Perl programming language, the MySQL database, and the Vim text editor. All three are open source and available for Linux. This page has examples of regular expressions for each of these software tools.

Perl regular expressions

In Perl, the pattern-matching operator is an equal sign (=) followed by a tilde (~), which looks like this:

$string =~ /regular-expression/

The string to be tested goes on the left, and the pattern to test for goes on the right. The pattern is enclosed by a delimiting character, such as quotation marks, although the standard convention is to use a forward slash ("/"), as above. The example below will exit if "404" is present in the string $text:

if ($text =~ /404/) { exit }
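
As a minimal sketch of the match operator in use, the script below wraps the same test in a small function and reports the result instead of exiting. The function name mentions_404 and the sample log lines are hypothetical, made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 when the text contains "404", else 0.
sub mentions_404 {
    my ($text) = @_;
    return $text =~ /404/ ? 1 : 0;
}

# Hypothetical sample lines, made up for this sketch.
my @lines = (
    'GET /index.html 200',
    'GET /missing.html 404',
);

for my $line (@lines) {
    my $verdict = mentions_404($line) ? 'match' : 'no match';
    print "$verdict: $line\n";
}
```

The companion operator "!~" is true when the pattern does not match.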

The pattern to be tested can contain special meta-characters that represent character classes. Some of the basic meta-characters in Perl's implementation of regular expressions are:

\s	one whitespace character (space, return, tab, etc.)
\S	one non-whitespace character (a-z, 0-9, etc.)
.	one character, whitespace or not (except a newline, unless the /s modifier is used)
\d	a digit (0-9)
\n	a newline
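
One way to get a feel for these classes is to count how many characters of a string each one matches. The sketch below does that for a made-up sample string; the variable names are only illustrative.

```perl
use strict;
use warnings;

# Hypothetical sample string: 16 characters, including a tab and a newline.
my $sample = "Room 42\tis open\n";

my @digits     = ($sample =~ /\d/g);   # each single digit
my @whitespace = ($sample =~ /\s/g);   # space, tab, space, newline
my @non_space  = ($sample =~ /\S/g);   # letters and digits
my @any        = ($sample =~ /./g);    # every character except the newline

printf "digits: %d, whitespace: %d, non-whitespace: %d, dot: %d\n",
    scalar @digits, scalar @whitespace, scalar @non_space, scalar @any;
```

With the /g modifier in list context, a match returns every occurrence, which is what makes the counting possible.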

Elements in a pattern can be given a quantity with the meta-characters below:

+	matches the preceding character one or more times. So while \d matches one digit, \d+ will match a series of consecutive digits.
*	matches the preceding character zero or more times, making its presence optional.
{2}	matches the preceding character exactly 2 times.
{12,}	matches the preceding character 12 or more times.

A pattern can also be anchored to a position in the string:

^	anchors the pattern to the beginning of the string. /^prefix/ will only match if "prefix" is at the beginning of a string (usually this is the beginning of a line of text).
$	anchors the pattern to the end of the string. /html$/ will only match if "html" is at the end of a string.
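
The quantifiers and anchors above can be tried out with a short script. This is only a sketch: the helper name matches and the test strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string matches the pattern, else 0.
sub matches {
    my ($string, $pattern) = @_;
    return $string =~ $pattern ? 1 : 0;
}

# Each entry pairs a made-up string with a pattern built using qr//.
my @checks = (
    [ 'order 2008', qr/\d+/      ],   # a run of one or more digits
    [ 'color',      qr/colou*r/  ],   # zero "u" characters is acceptable
    [ '55',         qr/^\d{2}$/  ],   # exactly two digits and nothing else
    [ '555',        qr/^\d{2}$/  ],   # three digits, so this one fails
    [ 'prefix-one', qr/^prefix/  ],   # anchored to the beginning
    [ 'index.html', qr/html$/    ],   # anchored to the end
);

for my $check (@checks) {
    my ($string, $pattern) = @$check;
    printf "%-12s %s\n", $string,
        matches($string, $pattern) ? 'match' : 'no match';
}
```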
 
A backslash is used to force a meta-character to be interpreted as a normal character. So to match ".html" you would use "\.html" to make the "." a literal period instead of a meta-character.
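
The difference the escape makes can be seen by testing both forms against a name that has an underscore where the period should be. The file names below are hypothetical; quotemeta() is Perl's built-in function for adding the backslashes automatically.

```perl
use strict;
use warnings;

my $unescaped = qr/.html$/;    # "." matches any one character
my $escaped   = qr/\.html$/;   # "\." matches only a literal period

# Hypothetical file names, made up for this sketch.
for my $name ('index.html', 'index_html') {
    printf "%-12s unescaped: %s, escaped: %s\n",
        $name,
        ($name =~ $unescaped ? 'match' : 'no match'),
        ($name =~ $escaped   ? 'match' : 'no match');
}

# quotemeta() puts a backslash in front of the period for you.
print quotemeta('.html'), "\n";
```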

Perl regexp examples

Below are some regular expressions I have found useful in Perl scripts that generate webpages.

URL-ify filenames

Terms like "Linden Hills" and "Lake Calhoun" are categories in the Phototour of Minneapolis website. However, those terms won't work in URLs, because they contain a space. I also standardize all my webpage filenames to lower-case (for consistency). The following regular expression and function convert phrases like the two above into filenames like "linden_hills.html" and "lake_calhoun.html".

$file =~ s/ /_/g;     # substitution: replace every space with an underscore
$file = lc($file);    # lc() returns the string in all lower-case
$file .= ".html";     # '.=' appends to a string
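
The three statements can be collected into a reusable function. The sketch below is one possible arrangement; the name urlify is hypothetical and not taken from the original scripts.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a category name into a lower-case file name.
sub urlify {
    my ($name) = @_;
    my $file = lc $name;   # lower-case copy of the name
    $file =~ s/ /_/g;      # replace every space with an underscore
    return "$file.html";   # append the extension
}

print urlify('Linden Hills'), "\n";
print urlify('Lake Calhoun'), "\n";
```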

Modify navigation links

It is a standard convention to have a row or column of internal links on webpages, to make navigating among pages as easy as possible. Assume you have a list of 5 internal hyperlinks, similar to the following:

[ home ] [ all ] [ all2 ] [ all3 ] [ all4 ]

Assume the HTML code for these links is stored in the string variable $links, and is re-used among all web pages.

However, the page a visitor is currently reading should be listed but not hyperlinked, to indicate their position within the website. So the links on the page "all3.html" should look like this, with "all3" shown as plain text rather than as a link:

[ home ] [ all ] [ all2 ] [ all3 ] [ all4 ]

The regular expression below is one way to achieve this. As above, the HTML code for the hyperlinks is in the variable "$links", and "$file" is the name of the current webpage (such as "index.html" or "all.html"). The pattern assumes each link is written as <a href="filename">label</a>.

$links =~ s/<a href="$file">(.*?)<\/a>/$1/i;

The "(.*?)" part of the regular expression will match everything until the first "</a>" that's encountered. The question mark makes this pattern "non-greedy"; that is, matching the fewest characters rather than the most. If the question mark were absent, the pattern would match everything until the last "</a>", instead of the first one.
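
A self-contained sketch of this technique follows. It assumes each link is written exactly as <a href="page.html">label</a>; the function name unlink_current and the generated link markup are hypothetical. The \Q...\E pair is an addition not described above: it quotes the "." in the file name so that it is matched literally.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the hyperlink that points at the current page,
# keeping only its label.
sub unlink_current {
    my ($links, $file) = @_;
    # \Q...\E quotes the "." in the file name so it is matched literally.
    $links =~ s/<a href="\Q$file\E">(.*?)<\/a>/$1/i;
    return $links;
}

# Build made-up navigation markup in the assumed <a href="..."> form.
my $links = join ' ',
    map { qq{[ <a href="$_.html">$_</a> ]} } qw(home all all2 all3 all4);

print unlink_current($links, 'all3.html'), "\n";
```

Only the "all3" entry loses its anchor tags; the other four links are left as they were.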

Break long strings of text

If a single line of text is several screens wide (so that it runs off the page), it can make HTML source code awkward to read. There are several ways to deal with this in Perl (such as using the split function or the Text::Wrap module); I use the regexp below. It wraps a single line of text by replacing the first blank space after every 70 characters with a newline.

$text =~ s/(.{70,}?) /$1\n/sg;
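
The sketch below shows one way to write this wrapping as a function with an adjustable width. It uses a non-greedy ".{N,}?" quantifier to find the first space at or after N characters; the function name wrap_after and the sample text are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: replace the first space found after at least $width
# characters with a newline, and repeat for the rest of the text.
sub wrap_after {
    my ($text, $width) = @_;
    $text =~ s/(.{$width,}?) /$1\n/sg;
    return $text;
}

# Made-up sample: 40 four-letter words on one line (199 characters).
my $long_line = join ' ', ('word') x 40;

print wrap_after($long_line, 70), "\n";
```

Because the break happens at the first space after the minimum width, lines come out slightly longer than the width rather than shorter. Text::Wrap is the better choice when a hard maximum is needed.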

MySQL regular expressions

MySQL allows you to use regular expressions in the "where" clause of a "select" statement. For example, the statement below would fetch all names where the last name begins with "G":
mysql>select first_name,last_name from staff where last_name regexp "^G";
To make a regexp case sensitive, use "regexp binary":
mysql>select first_name,last_name from staff where last_name regexp binary "^G";

I have found regexp select statements useful for databases which have fields of comma-separated numbers. If I need to select all rows with a given number in such a comma-separated list, a regular expression using word boundaries will work:

[[:<:]]	left word boundary
[[:>:]]	right word boundary

The statement below will locate all rows with the number "23" in their comma-separated list:
mysql>select id,photos from categories where photos regexp "[[:<:]]23[[:>:]]";
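
The effect of the word boundaries can be tried outside the database. In Perl, the escape \b marks a word boundary on either side, which plays a comparable role here. The sketch below uses made-up rows (the hash %photos is hypothetical) to compare a plain match with a bounded one.

```perl
use strict;
use warnings;

# Hypothetical rows (id => comma-separated photo numbers), made up for this sketch.
my %photos = (
    1 => '12,23,34',
    2 => '123,230',
    3 => '23',
    4 => '5,223',
);

# Without boundaries, "23" also matches inside 123, 230 and 223.
my @loose = grep { $photos{$_} =~ /23/ } sort keys %photos;

# With \b on both sides, only the whole number 23 matches.
my @exact = grep { $photos{$_} =~ /\b23\b/ } sort keys %photos;

print "without boundaries: @loose\n";
print "with boundaries:    @exact\n";
```

The comma counts as a non-word character, so a boundary falls on each side of every number in the list.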

MySQL supports the following regexp meta-characters "normally":
^ and $ -- anchors for the beginning and end of a string.
* ? + -- quantifiers for zero or more, zero or one, and one or more.
. -- any one character, including a newline.
[a-zA-Z] -- a character class, such as a-zA-Z.
[^0-9] -- a negated character class, which excludes the listed characters, such as 0-9.

See Appendix H of the MySQL manual for details on MySQL's regexp support.

Vim regular expressions

Vim's regexp support requires escaping many regexp meta-characters with a backslash. For example, the quantifier "+" must be written "\+", and capturing a match in a substitution uses escaped parentheses "\(...\)".

Some other Vim regexp tips:

s/pattern/&/ -- in the replacement, "&" stands for the entire matched text
s/\(pattern\)/\1/ -- in the replacement, "\1" stands for the first captured group
\{-} -- quantifier that matches the shortest, instead of the longest (greedy), version of the pattern
:%s/<.\{-}>//g -- matches (and removes) all HTML tags
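
For comparison with the Perl section, the Vim quantifier \{-} corresponds to Perl's non-greedy "*?". The sketch below applies both the non-greedy and the greedy form of the tag-stripping pattern to a made-up HTML fragment, to show why the shortest match is the one wanted here.

```perl
use strict;
use warnings;

# Hypothetical HTML fragment, made up for this sketch.
my $html = '<p>Hello <b>world</b></p>';

# Non-greedy: each match stops at the nearest ">", so only the tags go.
(my $non_greedy = $html) =~ s/<.*?>//g;

# Greedy: one match runs from the first "<" to the last ">".
(my $greedy = $html) =~ s/<.*>//g;

print "non-greedy: '$non_greedy'\n";
print "greedy:     '$greedy'\n";
```

The greedy version removes the text between the tags as well, leaving an empty string for this fragment.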

This page last modified on 2008-04-10

  • tacua calgary -- 2008-04-10

    The below will locate all rows with number "23" in their comma-seperated list:
    mysql>select id,photos from categories where photos regexp "[[:<:]]34[[>:]]";

    It works but I don't know why it said that it will be looking for 23 and they end up looking for 34 :-)
    Anyways the problem is missing a colon on the second delimiter. It should read [[:>:]]

  • Ap -- 2008-03-08

    "The below will locate all rows with number "23" in their comma-seperated list:
    mysql>select id,photos from categories where photos regexp "[[:<:]]23[[:>:]]";"
    Hm, didn't work for me. Too bad, its elegant simplicity was appealing ;)

phone cgstock.com at 612-245-4306   email us:chris@cgstock.com
Chris Gregerson, 150 Green Ave. N., New Richmond, WI 54017 USA
http://www.cgstock.com/