MARC Review and PCRE
PCRE stands for Perl Compatible Regular Expressions, an open source library supported by MARC Review and MARC Global.
A copy of the official documentation for using PCRE patterns can be found here.
(The license for the PCRE documentation is the same as for the PCRE code, which can be found here.)
The goal of this page is to provide some examples of the new regular expression usage in MARC Review and MARC Global.
If you need help with regular expressions, please email us; that's one of the things we are here for.
There are many tutorials for regular expressions available on the web. Keep in mind that these tutorials are often written for programmers, so you may want to check for one that matches your own level of understanding.
http://www.regular-expressions.info/tutorial.html
A good R/E site for beginners, although it does contain advertisements.
http://en.wikipedia.org/wiki/Regular_expression_examples
A page of basic R/E examples from Wikipedia.
http://www.php.net/manual/en/reference.pcre.pattern.syntax.php
A perhaps more readable version of the PCRE documentation.
MARC Review introduction to regular expressions
This very brief introduction to R/E is from the current MARC Review help page:
REGULAR EXPRESSION
If the Regular Expression box is selected, the program will treat the pattern entered in the DATA box as a regular expression. The most common metacharacters used in pattern matching are:
.    matches any single character
^    anchors match to the beginning of the data
$    anchors match to the end of the data
*    matches 0 or more of the preceding expression
[    begin character class definition
]    end character class definition
-    within a character class, indicates a range of characters
     E.g., [a-d] matches a, b, c, or d
     E.g., [^a-d] matches any character except a, b, c, or d
     E.g., [A-Z] matches any uppercase character
\    removes special meaning from above metacharacters
As an example, if the Regular Expression box is checked, and your data pattern contains:
200[0-2]
the program will match any data that contains '2000', '2001', or '2002'. If the Regular Expression box were not checked, the program would try to match the literal string '200[0-2]'.
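The difference between the two modes can be sketched outside the program. The following uses Python's re module, which is not PCRE but treats this simple pattern the same way; the sample strings are made up for illustration, and the "in" test merely stands in for a literal (non-regex) comparison.

```python
import re

pattern = "200[0-2]"
samples = ["c2001", "1999", "2003", "see 200[0-2]"]  # made-up test data

# Regular Expression box checked: the brackets define a character class.
regex_hits = [s for s in samples if re.search(pattern, s)]

# Box not checked: the pattern is compared as a literal string.
literal_hits = [s for s in samples if pattern in s]

print(regex_hits)    # ['c2001']
print(literal_hits)  # ['see 200[0-2]']
```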
NOTE: within square brackets, '^' negates a match. Therefore, to find all instances of invalid subfield coding, we could use the following expression:
±[^0-9a-z]
This would match any subfield delimiter ± that is followed by a character not in the character class 0-9a-z.
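As a rough illustration of the negated class, here is the same expression tried in Python's re module (not PCRE, but equivalent for this pattern). The field strings are invented samples, and the delimiter is written as ± exactly as on this page.

```python
import re

invalid_code = re.compile(r"±[^0-9a-z]")

fields = [
    "±aTitle ±bSubtitle",   # valid: codes a and b
    "±aTitle ±BSubtitle",   # invalid: uppercase code
    "±aTitle ± subtitle",   # invalid: space after the delimiter
    "±3Volume 1 ±aTitle",   # valid: numeric code
]

# Keep only the fields where a delimiter is followed by a bad code.
flagged = [f for f in fields if invalid_code.search(f)]
for f in flagged:
    print(f)
```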
Do not use commas to separate individual values in a character class. For example, this is the correct way to pattern match the ten numeric digits and the uppercase letters 'A', 'B', and 'C':
[0-9ABC]
But the following regular expression will also match any string containing a comma in it:
[0-9,A,B,C]
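The effect of the stray commas is easy to check. This is a small sketch in Python's re module (an assumption for demonstration purposes; character classes behave the same way in PCRE), using a made-up string that contains a comma but no digit and no 'A', 'B', or 'C'.

```python
import re

correct = re.compile(r"[0-9ABC]")
with_commas = re.compile(r"[0-9,A,B,C]")

text = "x, y"  # contains a comma, but nothing we actually want to match

correct_hit = correct.search(text) is not None
comma_hit = with_commas.search(text) is not None

print(correct_hit)  # False
print(comma_hit)    # True: the comma itself is part of the class
```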
MARC Review and PCRE
With PCRE, the program's metacharacter support is increased a great deal:
.    matches any single character
^    anchors match to the beginning of the data
$    anchors match to the end of the data
*    matches 0 or more of the preceding expression
+    matches 1 or more of the preceding expression
?    matches 0 or 1 of the preceding expression
[    begin character class definition
]    end character class definition
-    within a character class, indicates a range of characters
|    alternative pattern separator
     E.g., red|black matches "red" or "black"
(    begin subpattern
)    end subpattern
{    begin repetition quantifier
}    end repetition quantifier
\    if followed by one of the above, treat the character literally
     E.g., \* in a pattern will match the asterisk character

A second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner:

\cx    "control-x", where x is any character
\n     linefeed (hex 0A)
\r     carriage return (hex 0D)
\t     tab (hex 09)
\ddd   character with octal code ddd
\xhh   character with hex code hh

Another use of backslash is for specifying generic character types. The following are always recognized:

\d    any decimal digit
\D    any character that is not a decimal digit
\h    any horizontal whitespace character
\H    any character that is not a horizontal whitespace character
\s    any whitespace character
\S    any character that is not a whitespace character
\v    any vertical whitespace character
\V    any character that is not a vertical whitespace character
\w    any "word" character
\W    any "non-word" character

These character type sequences can appear both inside and outside character classes.

In addition, the following use of backslash is available in MARC Global:

\L .. \E    lowercase all characters between '\L' and '\E'
\U .. \E    uppercase all characters between '\U' and '\E'

REPETITION

The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. For example:

z{2,4}

matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special character.

If the second number is omitted, but the comma is present, there is no upper limit; if the second number and the comma are both omitted, the quantifier specifies an exact number of required matches. Thus

[aeiou]{3,}

matches at least 3 successive vowels, but may match many more, while

\d{8}

matches exactly 8 digits.

An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example, {,6} is not a quantifier, but a literal string of four characters.

For convenience, the three most common quantifiers have single-character abbreviations:

*    is equivalent to {0,}
+    is equivalent to {1,}
?    is equivalent to {0,1}
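The quantifier rules can be checked with a few whole-string tests. This sketch uses Python's re module as a stand-in; it is not PCRE and differs in some corner cases (Python, for instance, reads {,6} as "0 to 6 times"), so only the constructs that behave identically are exercised here. The helper name whole is hypothetical.

```python
import re

def whole(pattern, text):
    """True when the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None

# z{2,4}: between two and four z's, tested on "z" through "zzzzz"
z_results = [whole(r"z{2,4}", "z" * n) for n in range(1, 6)]
print(z_results)  # [False, True, True, True, False]

# [aeiou]{3,}: at least three vowels, no upper limit
vowel_results = [whole(r"[aeiou]{3,}", s) for s in ("ae", "aei", "aeiouaeiou")]
print(vowel_results)  # [False, True, True]

# \d{8}: exactly eight digits
digit_results = [whole(r"\d{8}", s) for s in ("1234567", "12345678", "123456789")]
print(digit_results)  # [False, True, False]

# The single-character abbreviations agree with their long forms.
pairs = [("ab*c", "ab{0,}c"), ("ab+c", "ab{1,}c"), ("ab?c", "ab{0,1}c")]
abbreviations_agree = all(
    whole(short, t) == whole(long, t)
    for short, long in pairs
    for t in ("ac", "abc", "abbc")
)
print(abbreviations_agree)  # True
```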
MARC Review PCRE examples
1. How to find titles that contain an acronym in the subfield $c
TAG=245 SUBF=c DATA=[A-Z]{2,}\W REGULAR EXPRESSION=Yes
Explanation: an uppercase letter [A-Z] occurring two or more times {2,} followed by a non-word character \W
2. How to find titles that contain two (or more) consecutive acronyms in the subfield $a. This example demonstrates the use of a subpattern, that is, a regular expression enclosed in parentheses:
TAG=245 SUBF=a DATA=([A-Z]{2,}\W){2,} REGULAR EXPRESSION=Yes
Explanation: the subpattern ([A-Z]{2,}\W), consisting of an uppercase letter occurring two or more times followed by a non-word character, must itself occur two or more times.
Some results of #2:
$aProfessional ASP.NET 1.1
$aIBM PC update.
$aRF/IF signal processing handbook.
$aCyberlaw @ SA II
$aScholarly book reviews on CD-ROM
$aKI-ES-KI, directory of key contacts in Canadian education
Note: '.', '/', '-' are non-word characters.
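Both acronym patterns can be tried against a few titles. This is a sketch in Python's re module rather than PCRE (the constructs used here behave alike in both). Three of the titles come from the results above; the last two are made-up controls.

```python
import re

one_acronym = re.compile(r"[A-Z]{2,}\W")
two_acronyms = re.compile(r"([A-Z]{2,}\W){2,}")

titles = [
    "IBM PC update.",
    "RF/IF signal processing handbook.",
    "KI-ES-KI, directory of key contacts in Canadian education",
    "NASA and the space age.",  # made-up: a single acronym
    "Gone with the wind.",      # made-up: no acronym
]

single = [t for t in titles if one_acronym.search(t)]
double = [t for t in titles if two_acronyms.search(t)]

# Show the text that the subpattern version actually matched.
matched_text = [two_acronyms.search(t).group(0) for t in double]
print(matched_text)  # ['IBM PC ', 'RF/IF ', 'KI-ES-KI,']
```

Note that the space after "IBM" and "PC" counts as a non-word character too, which is why the second title in the results list matches.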
3. How to find subject headings that contain word(s) followed by the word 'fiction'. Easy one!
TAG=650 SUBF=a DATA= fiction
4. How to find subject headings that contain two words followed by the word 'fiction'.
TAG=650 SUBF=a DATA=\w+\s\w+\sfiction REGULAR EXPRESSION=Yes
Explanation: One or more word characters \w+ followed by a single space character \s followed by one or more word characters \w+ followed by a single space character \s followed by 'fiction'.
Using a subpattern we can restate #4 as:
DATA=(\w+\s){2}fiction
This doesn't make the pattern any shorter, but we can now simply change the '2' if we want to change the number of words that must precede 'fiction'.
Some results of #4:
$aLatin American fiction
$aSpanish American fiction
$aStar Trek fiction.
$aStar Wars fiction
$aYoung adult fiction
$aYoung adult fiction, English
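To see that the spelled-out pattern and the subpattern form select the same headings, and what changing the count does, here is a small sketch in Python's re module (not PCRE, but equivalent for these constructs). The function words_before_fiction is a hypothetical helper, and the last two headings are made-up controls.

```python
import re

spelled_out = re.compile(r"\w+\s\w+\sfiction")

def words_before_fiction(n):
    """Build the subpattern form for n words preceding 'fiction'."""
    return re.compile(r"(\w+\s){" + str(n) + r"}fiction")

headings = [
    "Latin American fiction",
    "Star Trek fiction.",
    "Young adult fiction, English",
    "Science fiction",  # made-up: only one word before 'fiction'
    "Fiction",          # made-up: no preceding word
]

by_spelled_out = [h for h in headings if spelled_out.search(h)]
by_two = [h for h in headings if words_before_fiction(2).search(h)]
by_one = [h for h in headings if words_before_fiction(1).search(h)]
by_three = [h for h in headings if words_before_fiction(3).search(h)]

print(by_spelled_out == by_two)  # True
print(len(by_one), len(by_three))  # 4 0
```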
5. How to find summaries (520) containing at least a certain number of words:
TAG=520 SUBF=a DATA=(\w+\s){50,} REGULAR EXPRESSION=Yes
Explanation: the subpattern (\w+\s), i.e., a word, occurring at least 50 times {50,}
6. How to find summaries (520) containing at least 50 words but fewer than 100:
TAG=520 SUBF=a DATA=(\w+\s){50,99} REGULAR EXPRESSION=Yes
Explanation: the subpattern (\w+\s), i.e., a word, occurring at least 50 times but no more than 99 times {50,99}
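The two word-count patterns can be tried on synthetic text. This is a sketch in Python's re module (not PCRE, though these constructs behave alike); make_summary is a hypothetical helper that builds a text of n unpunctuated words. It also illustrates something worth keeping in mind: a search looks for the pattern anywhere in the data, so in Python a 120-word text still contains a 99-word run that satisfies {50,99}. Whether the program behaves the same way is an assumption, not something this page documents; the whole-text check at the end is shown only for comparison.

```python
import re

at_least_50 = re.compile(r"(\w+\s){50,}")
from_50_to_99 = re.compile(r"(\w+\s){50,99}")

def make_summary(n):
    """Hypothetical helper: n words, each followed by one space."""
    return "word " * n

short, medium, long_text = make_summary(49), make_summary(75), make_summary(120)

# Fewer than 50 words: neither pattern matches.
short_hits = [p.search(short) is not None for p in (at_least_50, from_50_to_99)]
print(short_hits)  # [False, False]

# 75 words: both patterns match.
medium_hits = [p.search(medium) is not None for p in (at_least_50, from_50_to_99)]
print(medium_hits)  # [True, True]

# 120 words: the bounded pattern still finds a run of 99 words inside the text.
run = from_50_to_99.search(long_text)
words_in_run = len(run.group(0).split())
print(words_in_run)  # 99

# Requiring the pattern to cover the whole text enforces the upper limit.
whole_text_ok = from_50_to_99.fullmatch(long_text) is not None
print(whole_text_ok)  # False
```

Because \w+\s expects each word to be followed directly by whitespace, punctuation such as a comma after a word ends the run, so real summaries may need more words than the count suggests before they match.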
Compatibility note
Customers who interface directly with MARC Report's validation module (valplus.dll) from their own software, and who change the location of that file, will find that valplus.dll fails to load if the PCRE library (pcrelib.dll), now distributed with the program, cannot be found.
The solution is simply to copy pcrelib.dll to wherever valplus.dll is being copied (if applicable), because valplus expects the PCRE library to be in the same folder.
We are not going to change valplus to look elsewhere for the file.
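For customers who script their own deployment, a pre-flight check along these lines may help catch the problem before valplus.dll is loaded. This is only a sketch in Python: the function name and the folder argument are hypothetical, and the only facts taken from this page are the two file names and the requirement that they sit in the same folder.

```python
from pathlib import Path

def pcre_library_missing(folder):
    """Return True if valplus.dll is in `folder` but pcrelib.dll is not.

    `folder` is whatever directory your own software copies valplus.dll to
    (a hypothetical parameter, not a documented setting).
    """
    folder = Path(folder)
    has_valplus = (folder / "valplus.dll").is_file()
    has_pcre = (folder / "pcrelib.dll").is_file()
    return has_valplus and not has_pcre
```

If the function returns True, copy pcrelib.dll into that folder before loading valplus.dll.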