MARC Global and PCRE
Getting to know subpatterns
Many of the MARC Review PCRE examples demonstrate the use of subpatterns in MARC Review. These same subpatterns may also be used in the top form of the Change Data task in MARC Global.
In brief, when patterns enclosed in parentheses in the top pattern match, the matching data is saved into a memory space; these memory spaces can then be accessed in the bottom pattern using a 'back reference' technique, where:
'\1' refers to memory space #1, i.e. whatever matched the first subpattern
'\2' refers to memory space #2, i.e. whatever matched the second subpattern
The subpattern numberings–i.e. the determination of 'first', 'second', etc.–follow simple sequential order.
Here is a quick example to illustrate. Suppose you have sentences in your abstracts that are missing the blank space between the end of one sentence and the start of another. And, for some unknown reason, you have decided that you want to add the missing blank spaces.
Without PCRE, you would have to change every period (full-stop) to period + blank space–and take your chances. But now, the following will do the job quite nicely, with little need for a subsequent clean-up review:
In the example we create two subpatterns, entered back to back (with no space between them) in the top pattern:
([a-z]\.)([A-Z])
And in the 'New Data' form we reference them as:
\1 \2
If the 520 field looks like this:
They include excellent overviews of history for each country.Long available as a print series, the web versions are easy to navigate and read.
then the first subpattern ([a-z]\.) matches a lowercase letter followed by a period:
y. (each countr'y.')
and the second subpattern ([A-Z]) matches an uppercase letter following the first subpattern:
L ('L'ong available)
Therefore, when we refer to the subpattern matches in the 'New data' section,
\1 is equivalent to 'y.'
\2 is equivalent to the 'L' (after 'y.')
and thus, our replacement pattern
\1 \2
basically says
replace 'y.L' with 'y. L'
Note: the blank space between \1 and \2 is not a separator–it is data to be added between the two subpatterns!
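The same find-and-replace can be tried outside MARC Global. The following is a minimal sketch using Python's `re` module, which is not PCRE but handles this pattern and the `\1 \2` back references in the same way; the variable names are illustrative only.

```python
import re

# The 520 text from the example, with the space missing after "country."
abstract = (
    "They include excellent overviews of history for each country."
    "Long available as a print series, the web versions are easy "
    "to navigate and read."
)

# Subpattern 1: a lowercase letter plus a period.
# Subpattern 2: the uppercase letter that immediately follows.
pattern = re.compile(r"([a-z]\.)([A-Z])")

# "\1 \2" re-inserts both captures with a literal blank between them.
fixed = pattern.sub(r"\1 \2", abstract)

print(pattern.search(abstract).groups())
print(fixed)
```

Only the 'y.L' junction is touched; the period that ends the field is not followed by an uppercase letter, so it is left alone.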
Much of the power of PCRE lies in subpatterns, so it's important to know a little bit about them; a few more examples from real life (or should that be 'real data'?) follow.
Example: Changing variant values to a common value
With the changes made to form subdivisions over the years, it's often difficult to predict in what subfield a subdivision might be found.
Let's say you want to find all fields with the subdivisions “Maps, Physical” and “Bathymetric maps”, whether they appear in $x or $v, and change them all to $v “Maps”.
The best way to do this in the past would be to use two separate reviews. Beginning with version 236, this type of task can now be performed in a single review:
Data to Change:
TAG=651 DATA=±[xv](Maps, Physical|Bathymetric maps) REGULAR EXPRESSION=Yes
New data:
REGULAR EXPRESSION=No DATA=±vMaps
Explanation: In tag 651, match a subfield delimiter followed by either 'x' or 'v' (the character class '[xv]'), followed by the subpattern '(Maps, Physical|Bathymetric maps)', where '|' indicates an 'or' condition.
Replace the entire match (the delimiter, the subfield code, and whichever phrase matched the subpattern) with '±vMaps'.
In fact, if you had more variants that you wanted to change to 'Maps', you could simply add them to the subpattern, separated by pipes:
(Maps, Physical|Bathymetric maps|Maps, Tourist|Maps, Topographic)
etc.
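As a rough illustration of the alternation, here is a sketch in Python's `re` module (not PCRE, though this pattern behaves the same in both). It assumes that '±' on this page stands for the MARC subfield delimiter, byte 0x1F; the sample field contents and the function name are made up for the demonstration.

```python
import re

# Assumption: the "±" shown on this page is the MARC subfield delimiter (0x1F).
SUBFIELD_DELIM = "\x1f"

# Delimiter, then 'x' or 'v', then one of the variant phrases.
variants = re.compile(
    SUBFIELD_DELIM + r"[xv](Maps, Physical|Bathymetric maps)"
)

def normalize_maps(field: str) -> str:
    """Replace any matching variant subdivision with $v Maps."""
    return variants.sub(SUBFIELD_DELIM + "vMaps", field)

# Hypothetical 651 field contents, for illustration only.
fields = [
    "\x1faArctic Ocean\x1fxBathymetric maps.",
    "\x1faAlps\x1fvMaps, Physical.",
    "\x1faAlps\x1fvMaps.",
]
results = [normalize_maps(f) for f in fields]
for r in results:
    print(r.replace(SUBFIELD_DELIM, "$"))
```

The third field already reads $v Maps and does not match either variant, so it passes through unchanged.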
Example: Changing a 'c' into a ©
MARC Global recognizes the PCRE syntax to specify diacritics, both in the 'Data to Change', and the 'New Data' boxes.
For example, to change the letter 'c' into the copyright symbol, you would set up a Change Data form either like this
or like this
Of course, when dealing with diacritics, you will need to know the character encoding of your records, so each of the above examples should be combined with an appropriate pre-processing pattern.
For example, for Unicode records, set up a pattern like the following before entering the 'Change data' parameters:
IMPORTANT
Do not use the older MARC Review curly brace syntax, e.g.
New data={xC2}{xA9}
as a replacement pattern in MARC Global, as it will, in this case, replace the 'c' with the string '{xC2}{xA9}'.
The reason for this is twofold.
First, the curly brace syntax is now deprecated. Replace the old curly brace syntax with the standard backslash notation in your reviews.
Second, the old curly brace syntax requires the regular expression checkbox be selected, and that checkbox isn't usually enabled for the 'New data' option.
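To see what is at stake, the sketch below works on raw bytes in Python. It assumes UTF-8 encoded data, where the copyright symbol is the two bytes C2 A9 (the same values that appear in the curly brace example above). The sample data 'c2004.' and the simple pattern `^c` are assumptions for illustration, not the settings from the Change Data form.

```python
import re

# In UTF-8, the copyright symbol (U+00A9) is the byte pair C2 A9.
COPYRIGHT_UTF8 = b"\xC2\xA9"

# Hypothetical subfield data: a copyright date entered with a plain 'c'.
date = b"c2004."

# Replacing with the actual bytes gives the intended symbol.
good = re.sub(rb"^c", COPYRIGHT_UTF8, date)

# If the curly brace text reaches the replacement without being
# interpreted, it is inserted as a literal string.
bad = re.sub(rb"^c", b"{xC2}{xA9}", date)

print(good.decode("utf-8"))
print(bad.decode("utf-8"))
```

The second result contains the characters '{xC2}{xA9}' verbatim, which is the failure described above.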
Example: Fixing uppercase titles
In MARC Global, it is now possible to use a regular expression in the bottom part of the 'Change Data' form:
In the screenshot above, we are matching uppercase words in the title (literally, a subpattern of one or more uppercase characters followed by a non-word character).
In the 'New Data' section, we now have the opportunity to use regular expressions. The example in the screenshot shows another use of the backslash: to refer to a subpattern occurrence.
When a subpattern is referenced this way in a replacement pattern, it's like saying 'insert that subpattern here'. Thus, the replacement pattern can be read as:
\L Begin lowercase
\1 insert subpattern 1 (followed by a blank space)
\E End lowercase
Whatever matches subpattern 1 (an uppercase word) will be converted to lowercase. Since we have set the 'Data Occ' option to 'All' in the top section, the change will be applied to all uppercase words in the title.
This is, in essence, the first step in creating an autoreview to convert upper-case titles to 'library' title case:
'BEST OF THE BOLSHOI.' Changed to: 'best of the bolshoi.'
'MUSIC FOR AMERICA.' Changed to: 'music for america.'
'THE CINNAMON BEAR.' Changed to: 'the cinnamon bear.'
'ORCHESTRA WORKS' Changed to: 'orchestra WORKS'
'BUFFALO DANCE /' Changed to: 'buffalo dance /'
'WRC RADIO AIRCHECK :' Changed to: 'wrc radio aircheck :'
There are two steps left:
1. Take care of titles with ending words that did not match (because they were not followed by a non-word character):
TAG=245 SUBF=a DATA=(\W[A-Z]+)$ DATA OCC=First REGULAR EXPRESSION=Yes
NEW DATA=\L\1\E REGULAR EXPRESSION=Yes
This will change 'orchestra WORKS' to 'orchestra works', etc.
2. And, uppercase the first word of the title:
TAG=245 SUBF=a DATA=^([a-z]) DATA OCC=First REGULAR EXPRESSION=Yes
NEW DATA=\U\1\E REGULAR EXPRESSION=Yes
This will give us the final result:
'best of the bolshoi ' Changed to: 'Best of the bolshoi '
'music for america ' Changed to: 'Music for america '
'the cinnamon bear ' Changed to: 'The cinnamon bear '
'orchestra works' Changed to: 'Orchestra works'
'buffalo dance /' Changed to: 'Buffalo dance /'
'wrc radio aircheck :' Changed to: 'Wrc radio aircheck :'
These few examples also illustrate the complexity of dealing with all-uppercase data, since we have now lowercased proper names like 'bolshoi', lost acronyms ('WRC'), etc.
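The three passes can be approximated outside MARC Global as follows. This is a sketch only: Python's `re` module has no `\L`, `\U` or `\E` in replacement strings, so a replacement function stands in for them, and the first-pass pattern `([A-Z]+)\W` is inferred from the description of the screenshot rather than copied from it.

```python
import re

def library_title_case(title: str) -> str:
    # Pass 1: lowercase every uppercase word that is followed by a
    # non-word character; the replacement supplies a blank in its place.
    t = re.sub(r"([A-Z]+)\W", lambda m: m.group(1).lower() + " ", title)
    # Pass 2: lowercase a final uppercase word that had nothing after it.
    t = re.sub(r"(\W[A-Z]+)$", lambda m: m.group(1).lower(), t, count=1)
    # Pass 3: uppercase the first letter of the title.
    t = re.sub(r"^([a-z])", lambda m: m.group(1).upper(), t, count=1)
    return t

for title in ["ORCHESTRA WORKS", "BUFFALO DANCE /", "WRC RADIO AIRCHECK :"]:
    print(repr(library_title_case(title)))
```

As in the results above, 'WRC' comes out as 'Wrc': the patterns have no way of telling an acronym or proper name from an ordinary word.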
Example: Putting names in direct order
Using sub-patterns and PCRE, it is now possible to change the positions of pieces of data in a MARC record quite simply. For example, a common task in some data manipulations is to put names into direct order:
Satie, Erik --> Erik Satie
To do this in MARC Global, go to the Change Data form and enter, at the top:
TAG=100 SUBF=a DATA=(\w+), (.*) REGULAR EXPRESSION=Yes
And in the 'New data' section:
NEW DATA=\2 \1 REGULAR EXPRESSION=Yes
The results will be as follows:
'Geptner, V. G.' Changed to: 'V. G. Geptner'
'Rachmaninoff, Sergei,' Changed to: 'Sergei Rachmaninoff,'
'Beethoven, Ludwig van,' Changed to: 'Ludwig van Beethoven,'
'Dingfelder, Ingrid.' Changed to: 'Ingrid. Dingfelder'
'Hantaèi, Pierre,' Changed to: 'HantaèPierre, i'
'Gilmore, Horace W.,' Changed to: 'Horace W. Gilmore,'
At first glance, this looks OK, but look closer at the name containing the diacritic. It got jumbled, because of the following quirk in PCRE:
In UTF-8 mode, characters with values greater than 128 never match \d, \s, or \w, and always match \D, \S, and \W. This is true even when Unicode character property support is available. These sequences retain their original meanings from before UTF-8 support was available, mainly for efficiency reasons.
So, if we are working with diacritics, and that's often going to be the case with name fields, we must adjust our top pattern as follows:
TAG=100 SUBF=a DATA=([\w\x80-\xFF]+), ([\w\x80-\xFF]+) REGULAR EXPRESSION=Yes
We create a character class […] containing \w (any ASCII 'word' character), and to it we add the upper ASCII bytes (\x80-\xFF) used to carry diacritics (whether in MARC-8 or UTF-8).
Re-running this job, the jumbled entry now becomes:
'Hantaèi, Pierre,' Changed to: 'Pierre Hantaèi,'
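The jumbling and its fix can be reproduced with Python byte-string patterns, where `\w` is likewise limited to ASCII word characters. This is a sketch under the assumption that the name is stored as UTF-8 bytes; Python's `re` is not PCRE, but the two patterns below behave the same way on this data.

```python
import re

# The problem name from the results above, as UTF-8 bytes.
name = "Hantaèi, Pierre,".encode("utf-8")

# Original pattern: \w stops at the bytes of the diacritic, so only
# the trailing 'i' of the surname is captured as subpattern 1.
naive = re.compile(rb"(\w+), (.*)")
jumbled = naive.sub(rb"\2 \1", name)

# Adjusted pattern: the class also accepts bytes 0x80-0xFF, so the
# whole surname is captured.
adjusted = re.compile(rb"([\w\x80-\xFF]+), ([\w\x80-\xFF]+)")
fixed = adjusted.sub(rb"\2 \1", name)

print(jumbled.decode("utf-8"))
print(fixed.decode("utf-8"))
```

In the first run the match only starts at the 'i', because nothing earlier in the surname is a run of word characters directly followed by a comma.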
Of course, there are still several clean-up jobs to follow, some of which may depend on how dates and name qualifiers will be handled. We will leave these tasks as a (hopefully) enjoyable puzzle for the user to solve.
MARC Review help for diacritics
This excerpt is based on the current MARC Review help page and is followed by some suggestions on how diacritics should be searched in the PCRE environment. It's worth repeating the following excerpt from the PCRE documentation:
In UTF-8 mode, characters with values greater than 128 never match \d, \s, or \w, and always match \D, \S, and \W. This is true even when Unicode character property support is available. These sequences retain their original meanings from before UTF-8 support was available, mainly for efficiency reasons.
DIACRITICS
To search for a character not on your keyboard, enter the numeric value of the character enclosed in curly braces. You may use either decimal or hexadecimal notation for this number; decimal numbers must be zero-filled to three digits and fall within the range 000-255; hex numbers must begin with an 'x' and fall within the range 00-FF.
For example, decimal {031} or hex {x1F} will match the MARC subfield delimiter.
Note that entering a character in this manner, as of version 236, requires that the regular expression option be selected.
A more interesting example; entering:
[{x7F}-{xFF}]
as a pattern and selecting the regular expression option will find all diacritics in a field. This works because MARC Review performs a character substitution for 'curly brace' expressions before it processes a regular expression (thus, the regular expression engine never sees the curly braces).
However, if you are planning to make full use of the PCRE support in the program, then it might be a good idea not to use MARC Review's curly brace technique for matching diacritics inside a regular expression (especially as this usage could be interpreted as a 'repetition quantifier' by PCRE).
Instead, use PCRE's '\x' + the hex code of the character. For the example above, use:
[\x7F-\xFF]
and select the regular expression checkbox.
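If you have many older patterns to update, the rewrite from curly brace notation to backslash notation is mechanical. The helper below is a hypothetical sketch in Python, not part of MARC Review or MARC Global; it follows the rules given above (three-digit decimal 000-255, or 'x' plus two hex digits).

```python
import re

# {xNN} (hex) or {NNN} (three-digit decimal), as described above.
CURLY = re.compile(r"\{(x[0-9A-Fa-f]{2}|[0-9]{3})\}")

def curly_to_backslash(pattern: str) -> str:
    """Rewrite curly brace character codes as PCRE-style \\xNN escapes."""
    def repl(m):
        token = m.group(1)
        value = int(token[1:], 16) if token.startswith("x") else int(token)
        if value > 255:
            return m.group(0)  # outside 000-255: leave untouched
        return "\\x{:02X}".format(value)
    return CURLY.sub(repl, pattern)

print(curly_to_backslash("[{x7F}-{xFF}]"))
print(curly_to_backslash("{031}"))
print(curly_to_backslash("a{2,3}"))
```

A quantifier such as `a{2,3}` is left alone, but a three-digit quantifier such as `a{123}` would be rewritten as if it were a character code. That is the same ambiguity noted above, so review converted patterns by eye.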
MARC Report customizations and PCRE
In MARC Review, it has always been possible to embed multiple patterns inside one pattern using double pipes (||). For example–
TAG=100 SUBF=a DATA=Doe, John||Doe, Jane
–will find all authors named 'Doe', whether 'John' or 'Jane'. This is not a regular expression. Watch out in PCRE, however, where single pipes (which function in a similar way) do imply a regular expression.
As mentioned above in the section on diacritics, enclosing a character in curly braces–to search diacritics–is not a regular expression in MARC Report; whereas, using curly braces to indicate the minimum/maximum number of occurrences to match in a subpattern is indeed a PCRE.
In MARC Global, entering '^' or '$' alone, as a pattern, has always matched the beginning and end of a field or subfield, respectively. This makes '^' a synonym for inserting data, and '$' a synonym for appending data. This is an extension of true regular expression usage, including PCRE, where by themselves, '^' and '$' do not match anything.
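One way to picture the insert and append behaviour is to treat '^' and '$' as positions rather than characters. The sketch below uses Python's `re` module purely as an analogy; it says nothing about how MARC Global implements the feature, and the sample title is invented.

```python
import re

title = "cinnamon bear"

# '^' matches the position before the first character and consumes
# nothing, so the replacement text is inserted at the start.
inserted = re.sub(r"^", "The ", title)

# '$' matches the position after the last character, so the
# replacement text is appended.
appended = re.sub(r"\$".replace("\\", ""), ".", title)

print(inserted)
print(appended)
```

In both cases no existing data is replaced, which is why the two anchors work as shorthand for 'insert' and 'append'.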
Newlines, returns, tabs, etc. are not (or should not be) found in MARC, so ignore discussion of these when reading PCRE docs. As a rule, PCRE's metacharacter for whitespace, \s, should find only blank spaces in MARC data.
'perl-compatible …' is not quite the same as 'perl …' regular expressions; for this reason we advise not to use 'perldocs' documentation as a regular expression reference.